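Happy Qixi Festival! Have you done your "cuckoo" (布谷) check-in today? What is a sampling distribution? What is sampling variance? Why is sampling variance so important? Click into week 4-5 to find out.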

Start with Me | Coursera - Understanding and Visualizing Data with Python week 4-5 - Sampling Distribution and Sampling Variance

Lecture Overview

  • What is a sampling distribution ?
  • What is sampling variance ?
  • Why is sampling variance so important for making population inferences based on probability samples ?

What is a Sampling Distribution ?

  • Recall : Distribution of values on a variable of interest

    Example : Normal distribution (bell curve)

  • Assume values on variable of interest would follow certain distribution if we could measure entire population

  • Recall : When we select probability samples to make inferential statements about larger populations

    -> we refer to a sampling distribution

  • Sampling distribution = distribution of survey estimates we would see if we selected many random samples using same sampling design, and computed an estimate from each

  • Sampling distribution ≠ Distribution of values on a variable of interest

  • Key properties of sampling distributions:

    • Hypothetical !

      What would happen if we had luxury of drawing thousands of probability samples and measuring each of them?

    • Generally very different in appearance from distribution of values on a single variable of interest ...

    • With large enough probability sample size, sampling distribution of estimates will look like a normal distribution, regardless of what estimates are being computed

      Central Limit Theorem : CLT
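
The properties above can be checked with a small simulation. This is only a sketch under stated assumptions: the population is a made-up, right-skewed (exponential) set of 100,000 values, the estimate is the sample mean, and the sample size (200), number of repeated samples (2000), and the helper names `skewness` and `sampling_distribution` are illustrative choices, not something from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# ASSUMPTION: hypothetical population of 100,000 right-skewed values (mean about 10)
population = rng.exponential(scale=10.0, size=100_000)

def skewness(x):
    """Skewness of x: close to 0 for a symmetric distribution such as the normal."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

def sampling_distribution(pop, n, n_samples, rng):
    """Means of n_samples simple random samples of size n, drawn without replacement."""
    return np.array([rng.choice(pop, size=n, replace=False).mean()
                     for _ in range(n_samples)])

# The "luxury" of repeated sampling: 2000 samples, same design, one estimate each
estimates = sampling_distribution(population, n=200, n_samples=2000, rng=rng)

print("skewness of values on the variable:", round(float(skewness(population)), 2))
print("skewness of the sample means:      ", round(float(skewness(estimates)), 2))
print("population mean:", round(float(population.mean()), 2),
      "| mean of estimates:", round(float(estimates.mean()), 2))
```

The values on the variable are strongly skewed, while the sample means should be close to symmetric and centered near the population mean, which is the sense in which the sampling distribution differs from the distribution of the variable itself.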

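What is a sampling distribution ? (notes)

Distribution of values on a variable of interest

Figure: Normal distribution
  • Take the normal distribution as an example
  • If we plot a histogram (or density function) of all possible values of a variable of interest, as in the figure above, we get a picture of how values on that variable are distributed
  • Many of the variables we are interested in may follow a normal distribution
  • The figure above shows two population subgroups, a blue group and a red group, with each group's distribution of values on some variable
  • We generally assume that values on the variable of interest would follow a certain distribution if we could measure the entire population
  • In other words, we make an assumption about the distribution of values on the variable of interest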

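Sampling distribution

Figure: Sampling distribution
  • When we select probability samples to make inferences about larger populations, our inferential statements are based on a sampling distribution
  • It is important to distinguish a sampling distribution from the distribution of values on a variable of interest
    • A sampling distribution is the distribution of survey estimates we would see
    • It is not a distribution of values on a variable of interest; it is a distribution of survey estimates
    • If we repeatedly selected many random samples using the same probability sampling design and computed an estimate from each one, we would see the distribution of those estimates
    • So when we talk about a sampling distribution, we are talking about a distribution of estimates
    • Suppose we selected thousands of hypothetical repeated random samples, all with exactly the same probability sampling design, and computed the estimate of interest (perhaps a mean or a proportion) from each one; the distribution of those estimates is what would emerge
  • Any single sample gives us one estimate. If we repeated the sampling process over and over, the distribution of all those estimates is the sampling distribution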

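Key properties of sampling distributions

  • Hypothetical

    • A sampling distribution describes what we would see if we had the chance to draw thousands of probability samples
    • Each sample uses the same design; we measure the units in each sample and compute an estimate from all of those measurements
    • We repeat this over and over, then plot the distribution of all the estimates
    • This is a thought experiment: we never actually do it in practice, but sampling distribution theory describes what the distribution of estimates would look like if we could
    • In practice, we estimate the features of the sampling distribution from just one sample
  • A sampling distribution generally looks very different from the distribution of values on a variable of interest

  • With a large enough sample size, the sampling distribution of estimates across hypothetical repeated samples will look like a normal distribution

    Given a large enough probability sample size, the sampling distribution of the estimates will look approximately normal, regardless of what estimate is being computed

  • Central Limit Theorem (CLT)

    As the sample size grows, the distribution of estimates across repeated samples gets closer and closer to a normal distribution

What is Sampling Variance ?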

  • Sampling variance = variability in the estimates described by the sampling distribution

  • Because we select a sample (do not measure everyone in a population), a survey estimate based on a single sample will not be exactly equal to population quantity of interest (cases are randomly selected !)

  • Sampling error = difference between a sample estimate and the population quantity of interest

  • Across hypothetical repeated samples, these sampling errors will randomly vary (some positive, some negative ...)

  • Variability of these sampling errors describes the variance of the sampling distribution

  • If every sample estimate were equal to the population quantity of interest (e.g., in the case of a census), there would be no sampling error, and no sampling variance !

  • With a larger probability sample size, sampling more from a given population -> in theory there will be less sampling error, and sampling errors will be less variable

  • Larger samples -> Less sampling variance

    More precise estimates, more confidence in inferential statements (but more costly)

  • Spread of sampling distribution becomes smaller as sample size becomes larger

  • Simulated Sampling Distributions

    • As sample size increases, sampling distributions shrink (less variance)
    • With cluster sampling, distributions spread out (more variance)

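What is sampling variance ? (notes)

Figure: Sampling variance

  • Sampling variance is the variability in the estimates described by the sampling distribution
  • If we drew thousands of probability samples using the same design and plotted the estimate computed from each one, we could describe the variability of that sampling distribution; that variability is the sampling variance
  • Because we select a sample rather than measuring everyone in the population, a survey estimate based on a single sample will not be exactly equal to the population quantity of interest
  • In probability sampling, selection is random, so we select a subsample of all the individuals in the population, and the estimate computed from any one sample (or any one hypothetical repeated sample) will not be exactly equal to the population quantity of interest
  • Even though we hope the subsample is representative under our probability sampling design, the computed estimate will not exactly equal the population value; this difference is the sampling error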

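Sampling error

  • Sampling error: even though we hope the subsample is representative, the computed estimate will not be exactly equal to the population value

  • We do not measure everyone in the population; we measure only a sample of individuals, so an estimate based on that sample will not be exactly equal to the population value

  • What we hope is that, on average across the hypothetical repeated samples, the estimates equal the population mean

  • No single estimate will be exactly equal to the population value of interest, but if we averaged the estimates from all the hypothetical repeated samples, that average would equal the population value

    -> Unbiased estimate

  • The gap between any one sample estimate and the true population mean is the sampling error

  • Across hypothetical repeated samples, sampling errors vary randomly

    The mean of one subsample may be a little lower than the population mean;

    in another hypothetical sample, the mean may be a little higher;

    in yet another hypothetical sample, the mean may be very close to the population quantity

    The sampling errors vary across these hypothetical repeated samples: some are negative, some are positive, and some estimates land almost exactly on the population value

  • The variability of the sampling errors describes the variability of the sampling distribution

  • If every sample estimate were equal to the population quantity of interest, as in a census where we try to measure everyone in the population, then hypothetically repeating the process would give an estimate exactly equal to the true population quantity every time. There would be no sampling error and no sampling variance, because every hypothetical repeated sample is really an attempt to measure the entire population

  • With a larger probability sample size, we sample more of the given population. In theory, the sampling errors will be smaller and less variable

  • The larger the sample, and the closer it gets to the population size, the smaller the sampling error, so the sampling distribution shrinks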
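
The points about unbiasedness and shrinking sampling errors can be illustrated with a rough simulation. Everything specific here is an assumption made for illustration: a hypothetical normal population of 100,000 values, simple random sampling, 1000 repeated samples per sample size, and the helper name `sampling_errors`. The sample sizes 500, 1000, and 5000 are borrowed from the simulated figure discussed later in these notes.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# ASSUMPTION: hypothetical population of 100,000 values (mean about 50, sd about 12)
population = rng.normal(loc=50.0, scale=12.0, size=100_000)

def sampling_errors(pop, n, n_samples, rng):
    """Sampling error (estimate minus population mean) for repeated SRS of size n."""
    means = np.array([rng.choice(pop, size=n, replace=False).mean()
                      for _ in range(n_samples)])
    return means - pop.mean()

results = {}
for n in (500, 1000, 5000):
    errors = sampling_errors(population, n, n_samples=1000, rng=rng)
    # Average error near 0 reflects unbiasedness; the variance of the errors
    # is the sampling variance of the estimate for this sample size.
    results[n] = (errors.mean(), errors.var())
    print(f"n={n:5d}  mean error={errors.mean():+.3f}  "
          f"sampling variance={errors.var():.4f}  "
          f"share of negative errors={np.mean(errors < 0):.2f}")
```

Individual errors are sometimes negative and sometimes positive, their average stays near zero, and their variance should drop as the sample size grows.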

Test1

Choose the response that correctly fills in the four blanks in the statement about sampling distributions below:

A sampling distribution is the distribution of all possible ____ that would arise from ____ , and larger sample sizes (closer to the size of the population) will result in a sampling distribution with ____ variance, meaning that estimates are ____ precise.

A. Values of a variable, a single sample, less, more

B. Values of a variable, hypothetical repeated sampling, more, less

C. Estimates, a single sample, less, less

D. Estimates, hypothetical repeated sampling, less, more

Reference answer

Test1: D

A sampling distribution is the distribution of all possible estimates that would arise from hypothetical repeated sampling, and larger sample sizes will result in a sampling distribution with less variance, meaning that estimates are more precise.

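Simulated sampling distributions

Figure: Simulated sampling distributions
  • Reading from top to bottom, the rows differ in sample size

    • The first row has the smallest sample size, 500
    • The next row has a sample size of 1000
    • The last row has the largest sample size, 5000
  • Reading from left to right, the columns differ in sampling design

    • The first column is simple random sampling
    • The second column is cluster sampling with 10 sampling units per cluster
    • The third column is cluster sampling with 100 sampling units per cluster
  • As the sample size increases from the first row to the third, the distributions shrink; they do not get wider as the sample size grows

    As the sample size increases, the estimates vary less

    The larger the sample, the smaller the sampling error and the smaller the sampling variance

  • With cluster sampling, the sampling distributions tend to spread out. They still shrink as the sample size grows, but as the clusters get larger (from 10 sampling units to 50 sampling units), the estimates become more variable from left to right

    When we select fewer clusters, each with more individuals, the sampling distribution spreads out considerably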
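
A grid like the one described above can be approximated numerically. This is a sketch, not the simulation behind the lecture figure: the clustered population (2,000 clusters of 100 units with a shared cluster effect), the cluster sizes of 10 and 50 units, the 500 repeated samples per cell, and the helper names `srs_mean`, `cluster_mean`, and `sampling_variance` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# ASSUMPTION: hypothetical clustered population, 2,000 clusters of 100 units each.
# Units in the same cluster share a cluster effect, so they resemble each other.
n_clusters, cluster_size = 2000, 100
cluster_effect = rng.normal(0.0, 4.0, size=(n_clusters, 1))
population = 50.0 + cluster_effect + rng.normal(0.0, 10.0, size=(n_clusters, cluster_size))

def srs_mean(pop, n, rng):
    """Mean of a simple random sample of n units, ignoring the clusters."""
    return rng.choice(pop.ravel(), size=n, replace=False).mean()

def cluster_mean(pop, n, units_per_cluster, rng):
    """Mean of a two-stage sample: random clusters, then random units in each."""
    k = n // units_per_cluster  # number of clusters needed to reach n units
    rows = rng.choice(pop.shape[0], size=k, replace=False)
    # Random unit order within each selected cluster; keep the first units_per_cluster
    cols = rng.random((k, pop.shape[1])).argsort(axis=1)[:, :units_per_cluster]
    return pop[rows[:, None], cols].mean()

def sampling_variance(draw, n_samples=500):
    """Variance of the estimates from n_samples hypothetical repeated samples."""
    return float(np.var([draw() for _ in range(n_samples)]))

variances = {}
for n in (500, 1000, 5000):
    variances[n] = {
        "SRS": sampling_variance(lambda: srs_mean(population, n, rng)),
        "cluster, 10 per cluster": sampling_variance(lambda: cluster_mean(population, n, 10, rng)),
        "cluster, 50 per cluster": sampling_variance(lambda: cluster_mean(population, n, 50, rng)),
    }
    print(n, {design: round(v, 4) for design, v in variances[n].items()})
```

Reading the printed rows top to bottom, the variances should shrink with sample size; reading left to right, they should grow as the same number of units is concentrated in fewer, larger clusters.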

Why is Sampling Variance Important ?

  • In practice, we only have the resources to select one sample

  • Important sampling theory (developed in early 1900s) allows us to estimate features of sampling distribution (including variance) based on one sample

  • "Magic" of probability sampling

    can select one probability sample and features of that design tell us what we need to know about the expected sampling distribution

  • Because we can estimate the variance of the sampling distribution based on only one sample, we can make inferential statements about where most estimates based on a particular sample design will fall

    -> Can make statements about likely values of population parameters that account for variability in sampling errors that arises from random probability sampling

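Why is sampling variance important ? (notes)

  • In practice, we only have the resources to select one sample

  • Important sampling theory allows us to estimate features of the sampling distribution (including its variance) based on one sample

  • The "magic" of probability sampling:

    • We can select one probability sample, and the features of that design tell us what we need to know about the expected sampling distribution
  • Because we can estimate the variance of the sampling distribution based on only one sample, we can make inferential statements about where most estimates based on a particular sample design will fall

    --> We can make statements about likely values of population parameters that account for the variability in sampling errors arising from random probability sampling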
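
As a rough illustration of estimating the sampling variance from one sample, the sketch below uses the standard simple-random-sampling formula for a mean, s²/n (sample variance divided by sample size, ignoring the finite population correction), and compares it with the variance seen across repeated samples. The population (gamma-distributed, 100,000 values), the sample size of 400, and the "estimate ± 2 standard errors" range are assumptions for illustration, not details from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is reproducible

# ASSUMPTION: hypothetical skewed population of 100,000 values (mean about 30)
population = rng.gamma(shape=2.0, scale=15.0, size=100_000)

n = 400
one_sample = rng.choice(population, size=n, replace=False)

# What we can do in practice: use only the one sample we actually selected
estimate = one_sample.mean()
est_sampling_var = one_sample.var(ddof=1) / n  # s^2 / n for the mean under SRS
est_se = np.sqrt(est_sampling_var)

# What we normally cannot do: repeat the sampling 2000 times, for comparison only
repeated = np.array([rng.choice(population, size=n, replace=False).mean()
                     for _ in range(2000)])

print(f"estimate from one sample:            {estimate:.2f}")
print(f"sampling variance, from one sample:  {est_sampling_var:.3f}")
print(f"sampling variance, repeated samples: {repeated.var():.3f}")
print(f"range where most estimates should fall: "
      f"{estimate - 2 * est_se:.2f} to {estimate + 2 * est_se:.2f}")
```

The single-sample figure will not match the repeated-sampling variance exactly, but it should be of the same order, which is what makes inference from one sample workable.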

Test2

You speculate that the mean test performance in an undergraduate psychology class is 80 / 100. Based on a single sample of 30 students, you estimate the mean test score to be 90. The estimated sampling variance based on that one sample suggests that most estimates from repeated hypothetical samples of size 30 will lie between 85 and 95. How confident are you about your speculation?

A. We have fairly strong evidence against our speculation: a mean of 80 seems unrealistic.

B. We have evidence in support of our speculation: 80 seems like a plausible mean value.

C. We need to draw several more samples of size 30 before we can make a conclusion.

Reference answer

Test2: A

The estimated sampling variance suggests that most estimates of the mean test performance will lie between 85 and 95, which makes 80 seem like an unrealistic value for the mean. We do not need to draw several samples of size 30, because sampling theory allows us to estimate the variance of sampling distribution based on one sample only!
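
The reasoning in Test2 can be spelled out as simple arithmetic. One assumption is added here that the question does not state: "most estimates" is read as roughly the estimate plus or minus 2 standard errors, which is the usual convention for an approximately normal sampling distribution.

```python
estimate = 90.0             # mean test score from the single sample of 30 students
lower, upper = 85.0, 95.0   # range where most repeated-sample estimates would lie
speculated = 80.0           # the speculated population mean

# ASSUMPTION: the range is "estimate +/- 2 standard errors"
standard_error = (upper - lower) / 4

# Distance of the speculated mean from the estimate, in standard errors
distance_in_se = (speculated - estimate) / standard_error

inside_range = lower <= speculated <= upper

print("implied standard error:", standard_error)
print("speculated mean is", distance_in_se, "standard errors from the estimate")
print("speculated mean inside the likely range:", inside_range)
```

Under that assumption the speculated mean of 80 sits well outside the range of likely estimates, which is consistent with answer A.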

What's Next ?

  • Work with a Web App to visualize sampling distributions when selecting random samples from a population with certain features
  • See how random sampling generally produces sampling distributions with means close to true population quantity of interest, and how larger samples produce sampling distributions with less variance
  • See how biased, non-representative samples can produce sampling distributions that paint misleading pictures of the larger population