The Central Limit Theorem and Inference

This project consists of two parts:

1. A simulation exercise demonstrating the CLT.
2. Basic inferential data analysis.

Part 1: The Central Limit Theorem

In this project we investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and its standard deviation is also 1/lambda. We set lambda = 0.2 for all of the simulations and investigate the distribution of means of samples of size n = 40 drawn from the exponential distribution. We simulate drawing 40 samples from the exponential distribution 1000 times and examine the mean and standard deviation of these 1000 sets of 40 samples. We find that they obey the Central Limit Theorem.
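With lambda = 0.2, the theoretical mean is 1/lambda = 5 and, by the CLT, the mean of a sample of size n = 40 should be approximately normal with standard deviation (1/lambda)/sqrt(n) = 5/sqrt(40) ≈ 0.79.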

The code below simulates drawing samples of size n = 1 and n = 40 to illustrate the difference between the sample mean and the population mean. The mean of a sample of size n = 1 is just the single draw itself, so those means should follow the exponential distribution, while the means of samples of size n = 40 should be Gaussian shaped.

lambda <- 0.2   # rate parameter of the exponential distribution
n <- 40         # sample size

# 1000 samples of size n = 1: reconstructs the underlying exponential distribution
pop_distribution <- NULL
for (i in 1:1000) pop_distribution <- c(pop_distribution, mean(rexp(1, lambda)))

# 1000 means of samples of size n = 40
mns <- NULL
for (i in 1:1000) mns <- c(mns, mean(rexp(n, lambda)))
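
Equivalently, the same simulation could be written more compactly with replicate. This is only a sketch for comparison: the variable names pop_alt and mns_alt are hypothetical, and the results reported below come from the loops above.

set.seed(1234)  # hypothetical seed, only to make this sketch reproducible
pop_alt <- replicate(1000, mean(rexp(1, lambda)))  # same role as pop_distribution
mns_alt <- replicate(1000, mean(rexp(n, lambda)))  # same role as mns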

We now compute the mean of the simulated sample means and compare it to the theoretical mean of the distribution; both are marked with vertical lines on the density plot below.

mean_simulation <- mean(mns)   # mean of the 1000 simulated sample means
mean_theory <- 1/lambda        # theoretical mean of the exponential distribution

library(ggplot2)

n_40_samples <- data.frame(mns)
names(n_40_samples) <- c("values")
n_40_samples$dis <- "Gaussian"

ggplot(n_40_samples, aes(values, fill = dis)) +
  geom_density(alpha = 0.2) +
  geom_vline(xintercept = c(mean_theory, mean_simulation))

The means (vertical lines) can barely be distinguished because they are so close!

difference <- mean_theory - mean_simulation
difference
## [1] 0.02461718
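
Since the theoretical mean is 1/lambda = 5, the simulated mean is about 5 - 0.025 ≈ 4.975, within roughly half a percent of the theoretical value.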

Here we compute the variance of the sample means and compare it to the theoretical variance of the mean of a sample of size n = 40, which is (1/lambda)^2/n.

var_theory <- (1/lambda)^2 / n   # theoretical variance of the sample mean
var_simulation <- var(mns)       # variance of the 1000 simulated sample means

# one standard deviation on either side of the respective means
error_bars_sim <- mean_simulation + c(-1, 1) * sqrt(var_simulation)
error_bars_theory <- mean_theory + c(-1, 1) * sqrt(var_theory)

ggplot(n_40_samples, aes(values, fill = dis)) +
  geom_density(alpha = 0.2) +
  geom_vline(xintercept = c(error_bars_sim, error_bars_theory))

difference_var <- var_theory - var_simulation

difference_var
## [1] 0.001984882
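
With n = 40, the theoretical variance of the sample mean is (1/0.2)^2 / 40 = 25/40 = 0.625, so the simulated variance is about 0.625 - 0.002 ≈ 0.623.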

Again, the differences between the simulated and theoretical properties barely register.

Finally, we show that the distribution of sample means is approximately normal by plotting it alongside the distribution reconstructed from samples of size n = 1.

n_1_samples <- data.frame(pop_distribution)
names(n_1_samples) <- c("values")
n_1_samples$dis <- "Exponential"

# combine the n = 40 sample means and the n = 1 draws into one data frame
distributions <- rbind(n_40_samples, n_1_samples)

ggplot(distributions, aes(values, fill = dis)) +
  geom_density(alpha = 0.2) +
  geom_vline(xintercept = mean_theory)

The distribution of sample means is sharply peaked, because the sample size n = 40 is reasonably large, and it looks far more Gaussian than the original exponential distribution! The vertical line marks the theoretical population mean. The distribution constructed from samples of size n = 1 (the smallest possible sample size; as the number of such samples grows, it simply reproduces the underlying population distribution) looks exponential, just as we would expect.
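
As an additional check, going slightly beyond the original comparison, we can overlay the normal density predicted by the CLT, with mean 1/lambda and standard deviation (1/lambda)/sqrt(n), on the density of the simulated sample means:

ggplot(n_40_samples, aes(values)) +
  geom_density(alpha = 0.2, fill = "blue") +
  stat_function(fun = dnorm,
                args = list(mean = 1/lambda, sd = (1/lambda)/sqrt(n)),
                colour = "red")

The two curves should lie nearly on top of each other, which is exactly what the Central Limit Theorem predicts.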

Part 2: Basic Inferential Data Analysis

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

suppressMessages(library(dplyr))
suppressMessages(library(tidyr))

We load the ToothGrowth data and perform some basic exploratory data analysis:

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

The data consist of three columns: len, the length of tooth growth; supp, the supplement, which takes an as-yet-unknown number of values; and dose, which takes numeric values. We look at the contents of supp first:

table(ToothGrowth$supp)
## 
## OJ VC 
## 30 30

We see there are two kinds of supplements, one named OJ and the other named VC. Note that we could also have used str to discover this.

Now we look at what is in dose:

hist(ToothGrowth$dose)

So we see there are three dose levels: a half dose (0.5), a standard dose (1), and a double dose (2). There appear to be six groups in total, defined by the interaction of the variables dose and supp.
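
A quick cross-tabulation (not part of the original write-up) makes the six supp-by-dose groups explicit, with the number of observations in each cell:

table(ToothGrowth$supp, ToothGrowth$dose)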

We provide a basic summary of the data:

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

We now use confidence intervals and hypothesis tests to compare tooth growth by supp and dose.

We do a pairwise comparison between the different supplements and doses independently. That is, we do not look for interactions between supplements and doses even though they might be there.

Isolating the tooth lengths for supplement types OJ and VC:

OJ <- ToothGrowth %>% filter(supp=="OJ") %>% select(len)
VC <- ToothGrowth %>% filter(supp=="VC") %>% select(len)

Isolating for dosage:

doses <- split(ToothGrowth, ToothGrowth$dose)
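
The resulting list is named by dose level, so doses[[1]], doses[[2]], and doses[[3]] correspond to the 0.5, 1, and 2 doses respectively; this can be verified with:

names(doses)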

We now apply a t-test based on supplement. For all t-tests we use a two-sided alternative, because the null hypothesis is that the treatment outcomes are equal while the alternative hypothesis is that they are not. The variances are not assumed to be equal: Welch's test remains valid when the variances happen to be equal, so we lose little by not assuming equality. We adopt a 5% significance level, so the null hypothesis is rejected only when the p-value is less than 0.05.

t.test(VC,OJ,alternative="two.sided",var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  VC and OJ
## t = -1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.5710156  0.1710156
## sample estimates:
## mean of x mean of y 
##  16.96333  20.66333

The p-value is 0.061, which is greater than 0.05, so we do not reject the null hypothesis at the 5% significance level.

We now apply t-tests pairwise to the dose groups.

t.test(doses[[1]]$len,doses[[2]]$len,alternative="two.sided",var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  doses[[1]]$len and doses[[2]]$len
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

The p-value is essentially zero, so we reject the null hypothesis.

t.test(doses[[1]]$len,doses[[3]]$len,alternative="two.sided",var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  doses[[1]]$len and doses[[3]]$len
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean of x mean of y 
##    10.605    26.100

The p-value is essentially zero, so we reject the null hypothesis.

t.test(doses[[2]]$len,doses[[3]]$len,alternative="two.sided",var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  doses[[2]]$len and doses[[3]]$len
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

The p-value is essentially zero, so we reject the null hypothesis.

It appears that supplement type may not influence tooth growth, but there is evidence that dosage does.