Tuesday, July 17, 2012

Luhn Algorithm in Teradata SQL

The Luhn algorithm is used, among other things, to calculate the check digit of credit card numbers and mobile handset IMEIs. The following is my attempt to implement the algorithm in Teradata SQL. It flags each IMEI as valid or not. Needless to say, IMEIs would typically be read from a table rather than hard-coded as in this example.
SELECT dt3.IMEI
    ,CASE
         WHEN (dt3.dig1 + dt3.dig2 + dt3.dig3 + dt3.dig4
             + dt3.dig5 + dt3.dig6 + dt3.dig7
             + dt3.dig8 + dt3.dig9 + dt3.dig10 + dt3.dig11
             + dt3.dig12 + dt3.dig13
             + dt3.dig14 + dt3.dig15) MOD 10 = 0 THEN 'Y'
         ELSE 'N'
     END AS VALID_IMEI
FROM
(
    SELECT dt2.IMEI
        ,dt2.dig1
        ,CASE
             WHEN dt2.dig2 = 0 THEN 0
             WHEN dt2.dig2 MOD 9 = 0 THEN 9
             ELSE dt2.dig2 MOD 9
         END AS dig2
        ,dt2.dig3
        ,CASE
             WHEN dt2.dig4 = 0 THEN 0
             WHEN dt2.dig4 MOD 9 = 0 THEN 9
             ELSE dt2.dig4 MOD 9
         END AS dig4
        ,dt2.dig5
        ,CASE
             WHEN dt2.dig6 = 0 THEN 0
             WHEN dt2.dig6 MOD 9 = 0 THEN 9
             ELSE dt2.dig6 MOD 9
         END AS dig6
        ,dt2.dig7
        ,CASE
             WHEN dt2.dig8 = 0 THEN 0
             WHEN dt2.dig8 MOD 9 = 0 THEN 9
             ELSE dt2.dig8 MOD 9
         END AS dig8
        ,dt2.dig9
        ,CASE
             WHEN dt2.dig10 = 0 THEN 0
             WHEN dt2.dig10 MOD 9 = 0 THEN 9
             ELSE dt2.dig10 MOD 9
         END AS dig10
        ,dt2.dig11
        ,CASE
             WHEN dt2.dig12 = 0 THEN 0
             WHEN dt2.dig12 MOD 9 = 0 THEN 9
             ELSE dt2.dig12 MOD 9
         END AS dig12
        ,dt2.dig13
        ,CASE
             WHEN dt2.dig14 = 0 THEN 0
             WHEN dt2.dig14 MOD 9 = 0 THEN 9
             ELSE dt2.dig14 MOD 9
         END AS dig14
        ,dt2.dig15
    FROM
    (
        SELECT dt1.IMEI
            ,SUBSTR(dt1.IMEI, 1, 1) AS dig1
            ,SUBSTR(dt1.IMEI, 2, 1) * 2 AS dig2
            ,SUBSTR(dt1.IMEI, 3, 1) AS dig3
            ,SUBSTR(dt1.IMEI, 4, 1) * 2 AS dig4
            ,SUBSTR(dt1.IMEI, 5, 1) AS dig5
            ,SUBSTR(dt1.IMEI, 6, 1) * 2 AS dig6
            ,SUBSTR(dt1.IMEI, 7, 1) AS dig7
            ,SUBSTR(dt1.IMEI, 8, 1) * 2 AS dig8
            ,SUBSTR(dt1.IMEI, 9, 1) AS dig9
            ,SUBSTR(dt1.IMEI, 10, 1) * 2 AS dig10
            ,SUBSTR(dt1.IMEI, 11, 1) AS dig11
            ,SUBSTR(dt1.IMEI, 12, 1) * 2 AS dig12
            ,SUBSTR(dt1.IMEI, 13, 1) AS dig13
            ,SUBSTR(dt1.IMEI, 14, 1) * 2 AS dig14
            ,SUBSTR(dt1.IMEI, 15, 1) AS dig15
        FROM
        (
            SELECT '999999999999999' AS IMEI
            FROM SYS_CALENDAR.CALENDAR
            WHERE calendar_date = CURRENT_DATE

            UNION

            SELECT '352651010278244' AS IMEI
            FROM SYS_CALENDAR.CALENDAR
            WHERE calendar_date = CURRENT_DATE
        ) AS dt1
    ) AS dt2
) AS dt3
;

It returns the following answerset:

IMEI             VALID_IMEI
999999999999999  N
352651010278244  Y
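For comparison, the same check can be sketched outside the database. The short Python version below (a sketch, not part of the original query) doubles every second digit from the right and subtracts 9 when the product exceeds 9, which is equivalent to the MOD 9 trick in the SQL above.

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # every second digit from the right is doubled
            d *= 2
            if d > 9:       # summing the two digits of the product
                d -= 9      # is the same as subtracting 9
        total += d
    return total % 10 == 0

for imei in ('999999999999999', '352651010278244'):
    print(imei, 'Y' if luhn_valid(imei) else 'N')
```

Because the check digit is the rightmost digit, iterating over the reversed string keeps the code correct for both 15-digit IMEIs and 16-digit card numbers.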

Saturday, July 7, 2012

Some Parameters And Their Estimators And Standard Errors

[table class = "table-bordered"] PARAMETER ($\Theta$), ESTIMATOR ($\hat{\Theta}$), STD ERR ($\sigma_{\hat{\Theta}}$), ESTIMATE OF STD ERR ($s_{\hat{\Theta}}$)

$\mu$, $\bar{y}$, $\frac{\sigma}{\sqrt{n}}$, $\frac{s}{\sqrt{n}}$

$\mu_1 - \mu_2$, $\bar{y}_1 - \bar{y}_2$, $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$,"$\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ n_1 \ge 30,\ n_2 \ge 30$"

$\mu_1 - \mu_2$, $\bar{y}_1 - \bar{y}_2$, $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$,"$\sqrt{s_p^{2*}(\frac{1}{n_1} + \frac{1}{n_2})},\ n_1 < 30\ \text{or}\ n_2 < 30$"

$\frac{\sigma_1^2}{\sigma_2^2}$,$\frac{s_1^2}{s_2^2}$," " ," " , [/table]

*$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Confidence Intervals for a Population Parameter $\Theta$ and Test Statistics for $H_0: \Theta = \Theta_0$, where $\Theta = \mu$ or $\Theta = (\mu_1 - \mu_2)$:

[table class = "table-bordered"] SAMPLE SIZE, CONFIDENCE INTERVAL, TEST STATISTIC

Large, $\hat{\Theta} \pm z_{\alpha/2}s_{\hat{\Theta}}$, $z = \frac{\hat{\Theta} - \Theta_0}{s_{\hat{\Theta}}}$

Small,$\hat{\Theta} \pm t_{\alpha/2}s_{\hat{\Theta}}$,$t = \frac{\hat{\Theta} - \Theta_0}{s_{\hat{\Theta}}}$ [/table]

The test statistic for testing the null hypothesis $(H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1)$ is $F = \frac{s_1^2}{s_2^2}$
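As a quick numeric illustration of the large-sample row of the table (the sample figures below are made up), the interval $\hat{\Theta} \pm z_{\alpha/2}s_{\hat{\Theta}}$ and the statistic $z = \frac{\hat{\Theta} - \Theta_0}{s_{\hat{\Theta}}}$ can be computed directly:

```python
import math

# Hypothetical large sample: n = 100 observations with sample mean 52.3
# and sample standard deviation 9.8; test H0: mu = 50 at the 95% level.
n = 100
y_bar = 52.3
s = 9.8
theta0 = 50.0
z_crit = 1.96                # z_{alpha/2} for alpha = 0.05

std_err = s / math.sqrt(n)   # s / sqrt(n), the estimate of the standard error
lower = y_bar - z_crit * std_err
upper = y_bar + z_crit * std_err
z = (y_bar - theta0) / std_err

print('95%% CI: (%.2f, %.2f), z = %.2f' % (lower, upper, z))
```

Since the interval here excludes 50, the corresponding $z$ exceeds 1.96 and $H_0$ would be rejected at the 5% level.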

Tuesday, July 3, 2012

SQL Query Tuning

SQL query tuning is a dark art, but here are some simple tips that anyone can use.

Sunday, July 1, 2012

Summation Rules

Here is a list of summation rules, with 'k' denoting a constant:

  1. \[\sum_{i=1}^{n}{(x_i + y_i)} = \sum_{i=1}^{n}{x_i} + \sum_{i=1}^{n}{y_i}\]

  2. \[\sum_{i=1}^{n}{(x_i - y_i)} = \sum_{i=1}^{n}{x_i}  - \sum_{i=1}^{n}{y_i}\]

  3. \[\sum_{i=1}^{n}{x_iy_i} \neq \sum_{i=1}^{n}{x_i} \times \sum_{i=1}^{n}{y_i}\]

  4. \[\sum_{i=1}^{n}{x_i}^2 \neq (\sum_{i=1}^{n}{x_i})^2\]

  5. \[\sum_{i=1}^{n}{k} = nk\]

  6. \[\sum_{i=1}^n{(x_i + k)} = \sum_{i=1}^{n}{x_i} + \sum_{i=1}^{n}{k} = \sum_{i=1}^{n}{x_i}+nk\]

  7. \[\sum_{i=1}^{n}{(x_i-k)} = \sum_{i=1}^{n}{x_i} - nk\]
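The rules above can be checked numerically with a short Python snippet (the vectors are arbitrary):

```python
x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
n = len(x)
k = 10

# Rules 1 and 2: the sum distributes over addition and subtraction
assert sum(a + b for a, b in zip(x, y)) == sum(x) + sum(y)
assert sum(a - b for a, b in zip(x, y)) == sum(x) - sum(y)

# Rules 3 and 4: it does NOT distribute over products or squares
assert sum(a * b for a, b in zip(x, y)) != sum(x) * sum(y)
assert sum(a ** 2 for a in x) != sum(x) ** 2

# Rules 5 to 7: summing a constant n times gives n * k
assert sum(k for _ in range(n)) == n * k
assert sum(a + k for a in x) == sum(x) + n * k
assert sum(a - k for a in x) == sum(x) - n * k
```

Note that rules 3 and 4 are inequalities in general; particular vectors can satisfy them with equality (e.g. all zeros), which is why the $\neq$ is read as "not equal in general".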

Simulating Central Limit Theorem

In this post, the Central Limit Theorem (CLT) will be simulated using Python and matplotlib. The CLT gives the following two theorems:

Theorem 1: If the sampled population is normally distributed with population mean = $\mu$ and standard deviation = $\sigma$, then for any sample size n, the sampling distribution of the mean for simple random samples is normally distributed, with mean ($\mu_\overline{x}$) = $\mu$ and standard deviation ($\sigma_\overline{x}$) = $\frac{\sigma}{\sqrt {n}}$.

Theorem 2: For large sample sizes $(n\geq 30)$, even if the sampled population is not normally distributed, the sampling distribution of the mean for simple random samples is approximately normally distributed, with mean ($\mu_\overline{x}$) = $\mu$ and standard deviation ($\sigma_\overline{x}$) = $\frac{\sigma}{\sqrt {n}}$.

The standard deviation of the sampling distribution of the mean ($\sigma_\overline{x}$) is also known as the standard error of the mean, or simply the standard error, as it gives the average deviation of the sample means from the actual population mean.
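For the population used in the simulation below ($\sigma$ = 10), the theoretical standard errors can be computed directly from $\frac{\sigma}{\sqrt{n}}$:

```python
import math

sigma = 10.0                      # population standard deviation
for n in (5, 15, 25):
    # roughly 4.47, 2.58 and 2.0
    print(n, sigma / math.sqrt(n))
```

These are the values the simulated standard errors should approach.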

The following Python script simulates Theorem 1.
#----------------------------------------------------------------------------
# By Ram Limbu @ ramlimbu.com
# Copyright 2012 Ram Limbu
# License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
#----------------------------------------------------------------------------

# import required packages
import random
import matplotlib.pylab as pylb

def plotDist(t, val='Values'):
    '''plot histogram of distribution'''
    pylb.hist(t, bins=50, color='r')
    pylb.title('Central Limit Theorem Simulation')
    pylb.ylabel('frequency')
    pylb.xlabel(val)
    pylb.show()

def simulateSampDist(t_pop):
    '''simulate sampling distributions'''
    samp_sizes = (5, 15, 25)
    for i in range(0, len(samp_sizes)):
        # draw 1,000 samples of the current size and record their means
        t_samp_mean = []
        for j in range(0, 1000):
            t_samp_mean.append(pylb.mean(random.sample(t_pop, samp_sizes[i])))

        # plot the distribution of the sample means
        samp_mean = round(pylb.mean(t_samp_mean), 2)
        samp_stddev = round(pylb.std(t_samp_mean), 2)
        val = 'mean = ' + str(samp_mean) + ' stddev = ' + str(samp_stddev) \
            + ' n=' + str(samp_sizes[i])
        plotDist(t_samp_mean, val)

def main():
    '''simulate central limit theorem'''

    # generate a population of 10,000 normally distributed random numbers
    # with mean = 50 and standard deviation = 10
    t_pop = []
    mu = 50
    sigma = 10
    pop_size = 10000

    for i in range(0, pop_size):
        t_pop.append(random.gauss(mu, sigma))

    # plot a histogram of the population
    plotDist(t_pop)

    # simulate sampling distributions by drawing samples of
    # various sizes from this population
    simulateSampDist(t_pop)

if __name__ == '__main__':
    main()

First, the script creates a normally distributed population of 10,000 pseudo-random numbers with $\mu$ = 50 and $\sigma$ = 10. It then draws a sample of size 5, calculates its mean and appends it to a list, repeating this process 1,000 times, and plots a histogram of the sample means. The whole sampling process is then repeated with samples of size 15 and 25.

Histogram of normally distributed population of 10,000 random numbers.


The following histogram shows the distribution of means of samples of size 5. It has a mean of 49.92, which is close to the population mean of 50, and a standard error of 4.51. The standard error decreases as the sample size increases.


The next two figures show the distribution of means of samples of size 15 and 25. Note that in each case, the distribution has a mean close to 50, with the standard error decreasing as the sample size increases.

Histogram of sample means of size 15.

Histogram of sample means of size 25.
The following Python script simulates Theorem 2, generating means, standard errors and histograms for samples of size 30, 50 and 100 drawn from a population of exponentially distributed pseudo-random numbers with $\mu = 50$.
#----------------------------------------------------------------------------
# By Ram Limbu @ ramlimbu.com
# Copyright 2012 Ram Limbu
# License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
#----------------------------------------------------------------------------

# import required packages
import random
import matplotlib.pylab as pylb

def plotDist(t, val='Values'):
    '''plot histogram of distribution'''
    pylb.hist(t, bins=30, color='r')
    pylb.title('Central Limit Theorem Simulation')
    pylb.ylabel('frequency')
    pylb.xlabel(val)
    pylb.show()

def simulateSampDist(t_pop):
    '''simulate sampling distributions'''
    samp_sizes = (30, 50, 100)
    for i in range(0, len(samp_sizes)):
        # draw 1,000 samples of the current size and record their means
        t_samp_mean = []
        for j in range(0, 1000):
            t_samp_mean.append(pylb.mean(random.sample(t_pop, samp_sizes[i])))

        # plot the distribution of the sample means
        samp_mean = round(pylb.mean(t_samp_mean), 2)
        samp_stddev = round(pylb.std(t_samp_mean), 2)
        val = 'mean = ' + str(samp_mean) + ' stddev = ' + str(samp_stddev) \
            + ' n=' + str(samp_sizes[i])
        plotDist(t_samp_mean, val)

def main():
    '''simulate central limit theorem'''

    # generate a population of 10,000 exponentially distributed random numbers
    # with mean = 50
    t_pop = []
    mu = 50.00
    pop_size = 10000

    for i in range(0, pop_size):
        t_pop.append(random.expovariate(1 / mu))

    # plot a histogram of the population
    plotDist(t_pop)

    # simulate sampling distributions by drawing samples of
    # various sizes from this population
    simulateSampDist(t_pop)

if __name__ == '__main__':
    main()

The following figure shows the histogram of a population of 10,000 exponentially distributed pseudo-random numbers. The distribution has a mean of 50 and is positively skewed.



The next three figures show the distributions of means of samples of size 30, 50 and 100. Even though the samples were drawn from a non-normal population, the sampling distributions approximate the normal distribution, and the approximation improves as the sample size increases.







The importance of the CLT lies in the fact that, given a normally distributed population or a sufficiently large sample size ($n\geq 30$), (a) the mean of the sampling distribution ($\mu_\overline{x}$) equals the population parameter ($\mu$) and (b) the sampling distribution of the mean is approximately normal. Once a distribution approximates normality, the properties of the normal distribution can be used to make inferences about the sampled population.