Showing posts with label Python. Show all posts
Showing posts with label Python. Show all posts

Sunday, July 1, 2012

Simulating Central Limit Theorem

In this post, the Central Limit Theorem (CLT) will be simulated using Python, SciPy and matplotlib. The CLT gives the following two theorems:

Theorem 1: If the sampled population is normally distributed with population mean = $\mu$ and standard deviation = $\sigma$, then for any sample size n, sampling distribution of the mean for simple random samples is normally distributed, with mean ($\mu_\overline{x}$) = $\mu$ and standard deviation ($\sigma_\overline{x}$) = $\frac{\sigma}{\sqrt {n}}$.

Theorem 2: For large sample sizes $(n\geq 30)$, even if the sampled population is not normally distributed, the sampling distribution of the mean for simple random samples is approximately normally distributed, with mean ($\mu_\overline{x}$) = $\mu$ and standard deviation ($\sigma_\overline{x}$) = $\frac{\sigma}{\sqrt {n}}$.

The standard deviation of sampling mean ($\sigma_\overline{x}$) is also known as the standard error of mean, standard error of estimate or simply as standard error as the sampling standard deviation gives the average deviation of the sample means from the actual population mean.

The following Python script simulates Theorem 1.
#----------------------------------------------------------------------------
# By Ram Limbu @ ramlimbu.com
# Copyright 2012 Ram Limbu
# License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
#----------------------------------------------------------------------------

# import required packages
import random
import matplotlib.pylab as pylb

def plotDist(t, val='Values'):
'''plot histogram of distribution'''
pylb.hist(t, bins=50, color='R')
pylb.title('Central Limit Theorem Simulation')
pylb.ylabel('frequency')
pylb.xlabel(val)
pylb.show()

def simulateSampDist(t_pop):
'''simulate sampling distributions'''
samp_sizes = (5,15,25)
t_samp_mean = []
for i in range(0,len(samp_sizes)):
for j in range(0,1000):
t_samp_mean.append(pylb.mean(random.sample(t_pop, samp_sizes[i])))

# plot the population distribution
samp_mean = round(pylb.mean(t_samp_mean), 2)
samp_stddev = round(pylb.std(t_samp_mean), 2)
val = 'mean = ' + str(samp_mean) + ' stddev = ' + str(samp_stddev) \
+ ' n=' + str(samp_sizes[i])
plotDist(t_samp_mean, val)

def main():
'''simulate central limit theorem'''

# generate a population of 10,000 normally distributed random numbers
# with mean = 50 and standard deviation = 10
t_pop = []
mu = 50
sigma = 10
pop_size = 10000

for i in range(0,pop_size):
t_pop.append(random.gauss(mu, sigma))

# plot a histogram of the population
plotDist(t_pop)

# simulate sampling distributions by drawing and replacing
# samples of various sizes from this population
simulateSampDist(t_pop)

if __name__ == '__main__':
main()

First, it creates a normally distributed population of 10,000 pseudo-random numbers with $\mu$ = 50 and $\sigma$ = 10. Then, it takes a sample of size 5, calculates its mean and appends it to a list, repeating this process 1,000 times. Finally, it plots the histogram of the sample means. Then, it repeats the whole sampling process with samples of size 15 and 25.

Histogram of normally distributed population of 10,000 random numbers.


The following histogram shows the distribution of sampling means of size 5. It has mean of 49.92, which is close to the population mean of 50, and the standard error of 4.51. The latter figure decreases as the sample size increases.


The next two figures show the distribution of means of samples of size 15 and 25. Note that in each case, the distribution has a mean close to 50, with the standard error decreasing as the sample size increases.

histogram of sample means of size 15


 


 The following Python script simulates Theorem 2, generating means, standard errors and histograms of samples of size 30, 50 and 100 from a population of exponentially distributed pseudo-random numbers with $\mu$=50.
#----------------------------------------------------------------------------
# By Ram Limbu @ ramlimbu.com
# Copyright 2012 Ram Limbu
# License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
#----------------------------------------------------------------------------

# import required packages
import random
import matplotlib.pylab as pylb

def plotDist(t, val='Values'):
'''plot histogram of distribution'''
pylb.hist(t, bins=30, color='R')
pylb.title('Central Limit Theorem Simulation')
pylb.ylabel('frequency')
pylb.xlabel(val)
pylb.show()

def simulateSampDist(t_pop):
'''simulate sampling distributions'''
samp_sizes = (30,50,100)
t_samp_mean = []
for i in range(0,len(samp_sizes)):
for j in range(0,1000):
t_samp_mean.append(pylb.mean(random.sample(t_pop, samp_sizes[i])))

# plot the population distribution
samp_mean = round(pylb.mean(t_samp_mean), 2)
samp_stddev = round(pylb.std(t_samp_mean), 2)
val = 'mean = ' + str(samp_mean) + ' stddev = ' + str(samp_stddev) \
+ ' n=' + str(samp_sizes[i])
plotDist(t_samp_mean, val)

def main():
'''simulate central limit theorem'''

# generate a population of 10,000 exponentially distributed random numbers
# with mean = 10
t_pop = []
mu = 50.00
pop_size = 10000

for i in range(0,pop_size):
t_pop.append(random.expovariate(1/mu))

# plot a histogram of the population
plotDist(t_pop)

# simulate sampling distributions by drawing and replacing
# samples of various sizes from this population
simulateSampDist(t_pop)

if __name__ == '__main__':
main()

The following figure shows the histogram of a population of exponentially distributed 10, 000 pseudo-random numbers. The distribution is centred on 50, and is positively skewed.



The next three figures show the distributions of means of samples of size 30, 50 and 100. Even though the samples were drawn from a non-normal distribution, the sample distributions approximate normal distribution as the sample size increases.







The importance of the CLT lies in the fact that given normally distributed populations or sufficiently large sample sizes ($n\geq 30$), it shows that (a) the sample statistic ($\mu_\overline{x}$) approximates population parameter ($\mu$) and (b) sampling distributions approximate normal distribution. Once a distribution approximates normality, the properties of normal distribution can be used to make inferences about the sampled population.

 

Sunday, June 28, 2009

Interesting Python Functions

This post will contain a collection of Python library functions that I found interesting. I will keep adding to the collection as I encounter more interesting functions.

#----------------------------------------------------------------------------
# range() - this function generates a list of numbers
#----------------------------------------------------------------------------

upperBound = 5
range(upperBound) # this generates [0, 1, 2, 3, 4]

lowerBound = 2
range(lowerBound, upperBound) # this generates [2, 3, 4]
range(1, 11, 3) # generates [1, 4, 7, 10].
# range(lowerbound, uppperbound, step)

# an example of range() in action
monthList = ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for i in range(12):
print monthList[i], # prints all the months of the year

#-----------------------------------------------------------------------------------------
# randrange() - generates a random number from a range of numbers
#-----------------------------------------------------------------------------------------

import random # this module is required to use randrange()
random.randrange(6) # generates a random number between 0 to 5 inclusive
random.range(6) + 1 # generates a random number between 1 to 6 inclusive

#--------------------------------------------------------------------------------
# choice() - returns a random element from a sequence
#--------------------------------------------------------------------------------

vowels = ('a','e','i','o','u')
osList = ["linux", "BSD", "UNIX", "NT", "Solaris"]
random.choice(vowels) # randomly selects a vowel
random.choice(osList) # randomly selects an os


Monday, June 22, 2009

Tuple, List and Dictionary

This series will feature short notes on Python topics. The idea is to reinforce my learning through "note taking".

Tuple



  • Tuple is just like an array. It is immutable, and holds a sequence of values. However, unlike a C/C++ array, a tuple can store values of mixed types. An example:
      myTuple = ("Python", "PHP", "Ruby", 3.1428, 2009, "Django")


  • Just like a string, a tuple can be indexed, sliced and concatenated with another tuple. When called on a tuple, the len() function returns the total number of elements in the tuple.
    totElements = len(myTuple)
    print totElements # It prints 6



List



  • A list is like a C/C++ dynamic array. Elements can be added, deleted and sorted

  • Just like a tuple, it can store data of mixed types.

  • A list is enclosed in square brackets. For example:
    myList = ["Python", "PHP", "Ruby", "Perl", ('a', 'e', 'i', 'o', 'u')]
    emptyList = [] # this creates an empty list


  • All the tuple operations are applicable to lists


Dictionary



  • A dictionary is like a hash table in other programming languages. It stores data as key-value pairs.

  • A dictionary is enclosed in curly braces.

  • The key must be an immutable data type, i.e. string or tuple. For example,
     
    myDict = {"name" : "Charles Martel", "occupation" : "Palace Mayor, coup leader", "country" : "France"}
    emptyDict = {} # This creates an empty dictionary