Blog

Projects: Commute Time & Linear Regression

See below for instructions about the two projects. Both are due this Friday (Dec 20), but the linear regression project is optional and will be counted as extra credit:

  • Commute time project: this project is required. Either print out your spreadsheet to hand in Wednesday with the final, or email me a link to your spreadsheet before Friday (if you’re using Google, it’s best to use the “Share” button to make sure I can view it with the link; if you’re using Excel, you can email it as an attachment). Look at this post for what your commute time spreadsheet should include; the post also links to a sample project spreadsheet I created.

 

  • Linear regression project: this project will count as extra credit.  Again, you can either print out a hardcopy to hand in, or email me your spreadsheet.  Your project should include the following:
    • a scatterplot of your paired data set, including the “trendline” (i.e., the linear regression line)
    • the correlation coefficient and the linear regression parameters (slope and y-intercept)
    • a brief written description (1-2 paragraphs) about the scatterplot and statistics (e.g., how strongly are the variables correlated? is the linear regression line a good model? are there any outliers?)
    • you can refer to the solutions to the exam questions about paired data sets and linear regression for examples of how to write about these topics (Exam #1, Question #4 & Exam #2, Question #5)
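If you'd like to double-check the numbers your spreadsheet reports, here is a minimal Python sketch of the standard least-squares formulas for the correlation coefficient, slope, and y-intercept. The data values and the function name `linear_regression` are made up for illustration; they are not from any class data set.

```python
from statistics import mean

# Hypothetical paired data, invented purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def linear_regression(x, y):
    """Return (r, slope, intercept) for the paired lists x and y."""
    x_bar, y_bar = mean(x), mean(y)
    # Sums of products of deviations from the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    r = s_xy / (s_xx * s_yy) ** 0.5    # correlation coefficient
    slope = s_xy / s_xx                # slope of the regression line
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return r, slope, intercept


r, slope, intercept = linear_regression(x, y)
print(f"r = {r:.4f}, slope = {slope:.4f}, intercept = {intercept:.4f}")
```

In Google Sheets or Excel, the CORREL, SLOPE, and INTERCEPT functions give the same three numbers, so this is only a cross-check and not a required part of the project.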

Final Exam – Wednesday Dec 18

A reminder that our final exam is this Wednesday (Dec 18), at the usual class time.  Below is a list of topics/exercises to review for the final exam:

  • frequency distributions/relative frequencies:
    • Quiz #1
    • Exam #1, Question #1
    • Exam #3, Question #2
  • paired data sets: scatterplots, correlation coefficient, linear regression
    • Exam #1, Question #4
    • Exam #2, Question #5
  • calculating probabilities
    • Exam #2, Questions #2 & #3
    • Quiz #2
  • random variables & probability distributions; expected value
    • Exam #3, Question #4
  • binomial experiments/random variables
    • Exam #3, Question #3
  • writing about statistics & probability:
    • Exam #1, Question #2
    • Exam #2, Question #1(a)
    • Exam #3, Question #3(b)

Exam #3 – Wednesday, Dec 11

We will take our third midterm exam this Wednesday (Dec 11).  The exam will mostly be on the material on random variables and probability distributions that we’ve covered since the 2nd exam (including a question on binomial random variables).  There will also be a question on combinations and permutations.

To prepare for the exam:

  • review the class outline pdfs (and your class notes) on:
    • “Random variables and probability distributions”
    • “Expected value and variance of a discrete random variable”
    • “Binomial random variables & binomial distributions”
    • see the Schedule page for the class outlines
  • review Exam #2, Exercises #2, #3 & #4 (Exam #2 solutions have been uploaded to Files; think about Exercises #2 & #3 in the context of a random variable and its probability distribution)
  • review the following WebWork exercises:
    • HW9-RandomVariables: all
    • HW10-ExpectedValue: #5-7, 10, 11, 15
    • HW11-BinomialDistribution: #1, #2(a)(b), #3, #4

Google Spreadsheet: Binomial Random Variable/Distribution

Here is the spreadsheet we worked on in class together yesterday:

https://docs.google.com/spreadsheets/d/1hpuqDeJ7vjYFOup8qjfMJ5KIJ5OE-KsfHJpNOI0ufqI/edit?usp=sharing

We used it to answer HW9, Problem 4 (here’s my version of the question). But this is also an example of a binomial random variable.

OpenLab Assignment: Post your linear regression project topic (part 1)

As I discussed in class and posted on here last week, you should choose a topic for your linear regression project today.

To encourage you to do this, I’m making this an OpenLab assignment; completing this simple assignment will earn you one point towards the participation component of your course grade:

  • decide whether you want to work on this project individually or together with a partner
  • decide on a topic (broadly speaking) that you’re interested in studying statistically
    • some examples: economics, sports, public health, law/crime, business, finance, entertainment (movies, music, etc), demographics (population, race, gender, etc), politics/elections, transit/transportation, weather, environment, energy, …
  • post your topic in the comments below (if you are working with a partner, only one of you has to post, but then mention in the comment who you’re working with)
  • this should just be one or two sentences. e.g., “I would like to work on a dataset related to the environment and energy consumption.”

This assignment is due this Friday (November 29).  Late submissions will receive partial credit. (But it should only take 10 minutes to complete, so just get it done today!)

There will be a “part 2” to this assignment next week, when I will ask you to decide on a specific topic, e.g., “I will analyze a paired dataset regarding CO2 emissions and wealth (GDP per capita), at the country-level.”  You can start thinking about that over the long weekend.

Here are some websites you can browse for ideas for specific topics:

Linear Regression Project

Yesterday in class, I introduced the 2nd project for the semester (please remember to continue collecting your commute time data for that project!)

This project will involve:

  • finding a paired data set on a topic you’re interested in
  • creating a scatterplot with the linear regression trendline
  • computing the correlation coefficient and linear regression parameters
  • writing up a short (1-2pp) discussion of the data and your findings.

Here is a timeline for the first steps for this project:

  • by Mon Nov 25: decide whether you want to work on this project individually or together with a partner
    • if the latter, find a research partner in the class!
  • by Wed Nov 27: decide on a topic (broadly speaking) that you’re interested in studying statistically
  • by Wed Dec 4: decide on a specific topic & find an appropriate paired data set (we will spend some class time on this during which I will help you individually!)

Job Opportunity: CUNY Census Corps

Here is another job opportunity via CUNY, which is actually related to statistics. The US is conducting a census in 2020, and CUNY is putting together a Census Corps:

CUNY Census Corps students will educate, engage, and mobilize their neighbors, friends, family, and other students to complete the 2020 census.

Here are the types of things Census Corps students will do:

  • Give presentations
  • Table at events
  • Collaborate with community or student groups
  • Track and use data

CUNY Census Corps starts in mid-January and lasts for 5 to 7 months. Students will earn $15/hour and work approximately 12 hours per week.

The deadline to apply is this Sunday (Nov 17). Click here to apply–the application should take only 30 minutes to complete.


Watch this short video to learn more about the 2020 Census:

CST Colloquium: “Sports Data Science” – Thurs, Nov 14

There is a CST Colloquium talk on Thursday which is related to statistics. I will be there, and I strongly encourage you to attend if you can.

(To incentivize you to attend, you will earn 1pt towards your participation grade if you do attend! Also, you get free pizza.)

Here are the details:

CST Colloquium Series

Title: Sports Data Science

Presented by: Claudio T. Silva,  NYU

Thursday November 14 from 12:00 to 1pm

Room N918

Refreshment (pizza & soda) will be served.


“HW8-Counting” – Questions/Hints

I received a couple of follow-up questions about Problem 7 in the “HW8-Counting” WebWork set. We discussed this exercise in class on Wednesday, but I thought I would post my reply in case it helps explain the reasoning:

This version of the question has the following numbers:

A bag contains 7 red marbles, 5 white marbles, and 9 blue marbles. You draw 3 marbles out at random, without replacement.

You are asked to compute 3 different probabilities:

What is the probability that all the marbles are red?

As usual, you have to compute how many outcomes are in the given event (in this case, the event that all the chosen 3 marbles are red), and divide by how many outcomes are in the sample space.

For this 1st part, you can compute in terms of permutations: there are P(7,3)=7*6*5 different permutations of choosing 3 red marbles (from the 7 in the bag), and P(21,3)=21*20*19 different permutations of choosing 3 marbles at random (from the 7+5+9=21 total marbles in the bag).

Hence, the probability that all 3 chosen marbles are red is

P(7,3)/P(21,3) = (7*6*5)/(21*20*19) = 210/7980 = 1/38
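If you want to check this arithmetic on a computer, here is a small Python sketch (assuming Python 3.8 or later, which provides `math.perm` and `math.comb`). The variable names are my own choices for illustration. It also shows that counting with combinations instead of permutations gives the same probability.

```python
from fractions import Fraction
from math import comb, perm

red, total, draws = 7, 21, 3  # 7 red marbles, 21 marbles in all, 3 drawn

# Ordered count: P(7,3) permutations of red marbles out of P(21,3) in total
p_all_red = Fraction(perm(red, draws), perm(total, draws))

# Unordered count: C(7,3) combinations of red marbles out of C(21,3) in total
p_all_red_comb = Fraction(comb(red, draws), comb(total, draws))

print(p_all_red, float(p_all_red))
print(p_all_red == p_all_red_comb)
```

Using `Fraction` keeps the answer exact, which is handy when WebWork expects a fraction or several decimal places.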

The next part asks:

What is the probability that exactly two of the marbles are red?

This part is trickier. I think it’s easier to do the computations in terms of combinations instead of permutations:

If we don’t care about the order in which they are chosen, there are C(21,3)=(21*20*19)/(3*2*1) different combinations of 3 marbles chosen from the 21 (“21 choose 3”). This is the size of the sample space if we think in terms of combinations, so that will be the denominator for calculating the probability.

For the numerator, we need to figure out how many combinations there are of 2 red marbles and 1 non-red marble (to get exactly 2 red marbles). There are “7 choose 2” combinations of 2 red marbles (chosen from the 7 red marbles in the bag) and “14 choose 1” choices for the 1 non-red marble (chosen from the 5+9=14 white and blue marbles). You need to multiply these two numbers to get the total number of ways to get 2 red marbles and 1 non-red marble, i.e., C(7,2)*C(14,1)=[(7*6)/(2*1)]*[14/1].

(Note that it should be obvious that C(14,1)=14, or indeed C(n,1)=n for any positive integer n: there are n different ways of choosing 1 object from a set of n objects!)

So the probability that exactly 2 of the marbles are red (and hence 1 is non-red) is

[C(7,2)*C(14,1)]/C(21,3)
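Here is a short Python sketch of this count, again assuming Python 3.8 or later for `math.comb`. The helper name `p_exactly_k_red` is hypothetical, chosen just for this example.

```python
from fractions import Fraction
from math import comb

red, non_red, draws = 7, 14, 3  # 14 = 5 white + 9 blue marbles
total = red + non_red


def p_exactly_k_red(k):
    """Probability that exactly k of the drawn marbles are red."""
    # Choose k red marbles and (draws - k) non-red marbles,
    # out of all C(total, draws) equally likely combinations.
    favorable = comb(red, k) * comb(non_red, draws - k)
    return Fraction(favorable, comb(total, draws))


p_two_red = p_exactly_k_red(2)
print(p_two_red, float(p_two_red))
```

A useful sanity check is that the probabilities for k = 0, 1, 2, 3 red marbles add up to 1, since exactly one of those four events must happen.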

You can figure out the 3rd part using the same techniques:

What is the probability that none of the marbles are red?

 

PS: It looks like there are some later exercises that most of you are still working through. I’ll try to post some hints about these in the comments below tomorrow (and we can also discuss them in class on Monday).

PPS: Here’s an outline of how to approach Problem 9:

A box contains 55 balls numbered from 1 to 55. If 6 balls are drawn with replacement, what is the probability that at least two of them have the same number?

The key here, as in some other probability computations, is to think in terms of the complement of the event you’re being asked about: we are trying to calculate P(E), where the event E = “at least two of the 6 balls drawn have the same number”; consider instead the complement of E:

E^C = “no two of the 6 balls drawn have the same number”, i.e., “all 6 numbers drawn are different from each other.”

It’s easier to calculate the probability of E^C: there are 55*54*53*52*51*50 different outcomes in E^C (we have 55 choices for the first number; since the 2nd number chosen must be different from the 1st, there are 54 remaining choices; the 3rd number must be different from the 1st two, so there are 53 remaining choices; and so on).

Note that since the balls are chosen with replacement, there are 55 choices on each of the 6 draws, meaning there are 55^6 different outcomes in the sample space.  Hence,

P(E^C) = 55*54*53*52*51*50/(55^6)

Therefore, P(E) = 1 – 55*54*53*52*51*50/(55^6)
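To get a decimal answer, the complement computation can be sketched in Python as follows (assuming Python 3.8 or later for `math.perm`; the variable names are mine, chosen for illustration):

```python
from fractions import Fraction
from math import perm

n_balls, n_draws = 55, 6

# Outcomes in E^C: ordered draws with all 6 numbers different, 55*54*53*52*51*50
all_different = perm(n_balls, n_draws)

# Outcomes in the sample space: 55 choices on each of the 6 draws
sample_space = n_balls ** n_draws

p_all_different = Fraction(all_different, sample_space)  # P(E^C)
p_repeat = 1 - p_all_different                           # P(E)

print(float(p_all_different), float(p_repeat))
```

The result is roughly 0.246, so there is about a one-in-four chance of seeing a repeated number.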

Exam #2 – Wednesday, Nov 13

As I announced in class yesterday, we will take our second midterm exam this Wednesday (Nov 13).  The exam will mostly be on the material on probability that we’ve covered since the 1st exam, but will also include an exercise on linear regression.

To prepare for the exam:

  • review the class outline pdfs (and your class notes) going back to “Linear Regression” through the most recent one that we discussed last week (“Counting Principles: Permutation and combinations”) (see the Schedule page)
  • review Quizzes #2 & #3 (solutions have been uploaded to Files)
  • review Exam #1, Exercise #4 (the last exercise on the exam, especially the parts about linear regression; Exam #1 solutions have been uploaded to Files)
  • review the following WebWork exercises and their solutions:
    • HW4-PairedData: #3, 20, 21, 22
    • HW5-Probability: #1-5, 10
    • HW6-EqualProbabilities: #1, 2, 4, 5, 7, 9
    • HW7-ConditionalProbability: #3, 4, 6, 7
    • HW8-Counting: #4, 7, 8, 9 (solutions to HW8 will be available after it closes; the solutions to the other HW sets are available now)