Sunday, July 1, 2012

Simulating Central Limit Theorem

In this post, the Central Limit Theorem (CLT) will be simulated using Python, SciPy and matplotlib. The CLT gives the following two theorems:

Theorem 1: If the sampled population is normally distributed with population mean = $\mu$ and standard deviation = $\sigma$, then for any sample size n, sampling distribution of the mean for simple random samples is normally distributed, with mean ($\mu_\overline{x}$) = $\mu$ and standard deviation ($\sigma_\overline{x}$) = $\frac{\sigma}{\sqrt {n}}$.

Theorem 2: For large sample sizes $(n\geq 30)$, even if the sampled population is not normally distributed, the sampling distribution of the mean for simple random samples is approximately normally distributed, with mean ($\mu_\overline{x}$) = $\mu$ and standard deviation ($\sigma_\overline{x}$) = $\frac{\sigma}{\sqrt {n}}$.

The standard deviation of sampling mean ($\sigma_\overline{x}$) is also known as the standard error of mean, standard error of estimate or simply as standard error as the sampling standard deviation gives the average deviation of the sample means from the actual population mean.

The following Python script simulates Theorem 1.
#----------------------------------------------------------------------------
# By Ram Limbu @ ramlimbu.com
# Copyright 2012 Ram Limbu
# License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
#----------------------------------------------------------------------------

# import required packages
import random
import matplotlib.pylab as pylb

def plotDist(t, val='Values'):
'''plot histogram of distribution'''
pylb.hist(t, bins=50, color='R')
pylb.title('Central Limit Theorem Simulation')
pylb.ylabel('frequency')
pylb.xlabel(val)
pylb.show()

def simulateSampDist(t_pop):
'''simulate sampling distributions'''
samp_sizes = (5,15,25)
t_samp_mean = []
for i in range(0,len(samp_sizes)):
for j in range(0,1000):
t_samp_mean.append(pylb.mean(random.sample(t_pop, samp_sizes[i])))

# plot the population distribution
samp_mean = round(pylb.mean(t_samp_mean), 2)
samp_stddev = round(pylb.std(t_samp_mean), 2)
val = 'mean = ' + str(samp_mean) + ' stddev = ' + str(samp_stddev) \
+ ' n=' + str(samp_sizes[i])
plotDist(t_samp_mean, val)

def main():
'''simulate central limit theorem'''

# generate a population of 10,000 normally distributed random numbers
# with mean = 50 and standard deviation = 10
t_pop = []
mu = 50
sigma = 10
pop_size = 10000

for i in range(0,pop_size):
t_pop.append(random.gauss(mu, sigma))

# plot a histogram of the population
plotDist(t_pop)

# simulate sampling distributions by drawing and replacing
# samples of various sizes from this population
simulateSampDist(t_pop)

if __name__ == '__main__':
main()

First, it creates a normally distributed population of 10,000 pseudo-random numbers with $\mu$ = 50 and $\sigma$ = 10. Then, it takes a sample of size 5, calculates its mean and appends it to a list, repeating this process 1,000 times. Finally, it plots the histogram of the sample means. Then, it repeats the whole sampling process with samples of size 15 and 25.

Histogram of normally distributed population of 10,000 random numbers.


The following histogram shows the distribution of sampling means of size 5. It has mean of 49.92, which is close to the population mean of 50, and the standard error of 4.51. The latter figure decreases as the sample size increases.


The next two figures show the distribution of means of samples of size 15 and 25. Note that in each case, the distribution has a mean close to 50, with the standard error decreasing as the sample size increases.

histogram of sample means of size 15


 


 The following Python script simulates Theorem 2, generating means, standard errors and histograms of samples of size 30, 50 and 100 from a population of exponentially distributed pseudo-random numbers with $\mu$=50.
#----------------------------------------------------------------------------
# By Ram Limbu @ ramlimbu.com
# Copyright 2012 Ram Limbu
# License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
#----------------------------------------------------------------------------

# import required packages
import random
import matplotlib.pylab as pylb

def plotDist(t, val='Values'):
'''plot histogram of distribution'''
pylb.hist(t, bins=30, color='R')
pylb.title('Central Limit Theorem Simulation')
pylb.ylabel('frequency')
pylb.xlabel(val)
pylb.show()

def simulateSampDist(t_pop):
'''simulate sampling distributions'''
samp_sizes = (30,50,100)
t_samp_mean = []
for i in range(0,len(samp_sizes)):
for j in range(0,1000):
t_samp_mean.append(pylb.mean(random.sample(t_pop, samp_sizes[i])))

# plot the population distribution
samp_mean = round(pylb.mean(t_samp_mean), 2)
samp_stddev = round(pylb.std(t_samp_mean), 2)
val = 'mean = ' + str(samp_mean) + ' stddev = ' + str(samp_stddev) \
+ ' n=' + str(samp_sizes[i])
plotDist(t_samp_mean, val)

def main():
'''simulate central limit theorem'''

# generate a population of 10,000 exponentially distributed random numbers
# with mean = 10
t_pop = []
mu = 50.00
pop_size = 10000

for i in range(0,pop_size):
t_pop.append(random.expovariate(1/mu))

# plot a histogram of the population
plotDist(t_pop)

# simulate sampling distributions by drawing and replacing
# samples of various sizes from this population
simulateSampDist(t_pop)

if __name__ == '__main__':
main()

The following figure shows the histogram of a population of exponentially distributed 10, 000 pseudo-random numbers. The distribution is centred on 50, and is positively skewed.



The next three figures show the distributions of means of samples of size 30, 50 and 100. Even though the samples were drawn from a non-normal distribution, the sample distributions approximate normal distribution as the sample size increases.







The importance of the CLT lies in the fact that given normally distributed populations or sufficiently large sample sizes ($n\geq 30$), it shows that (a) the sample statistic ($\mu_\overline{x}$) approximates population parameter ($\mu$) and (b) sampling distributions approximate normal distribution. Once a distribution approximates normality, the properties of normal distribution can be used to make inferences about the sampled population.

 

No comments:

Post a Comment