<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Posts | Marko Budinich Abarca</title><link>https://marko.budinich.cl/post/</link><atom:link href="https://marko.budinich.cl/post/index.xml" rel="self" type="application/rss+xml"/><description>Posts</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>2025</copyright><image><url>https://marko.budinich.cl/images/icon_hu2e99c64734a113c17030ca4f379673e0_25276_512x512_fill_lanczos_center_2.png</url><title>Posts</title><link>https://marko.budinich.cl/post/</link></image><item><title>Testing the Central Limit Theorem</title><link>https://marko.budinich.cl/post/clt/</link><pubDate>Tue, 12 Jan 2021 11:04:13 +0100</pubDate><guid>https://marko.budinich.cl/post/clt/</guid><description>&lt;p>In my first post ever, I would like to illustrate the central limit theorem, which pinpoints the importance of Normal distribution. I took the main idea from &lt;strong>&lt;a href="https://learning.oreilly.com/library/view/statistics-in-a/9781449361129/" target="_blank" rel="noopener">Statistics in a Nutshell, 2nd edition&lt;/a>&lt;/strong>, chapter 3.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>The Central Limit Theorem (CLT) establishes that in many situations, &amp;ldquo;when independent random variables are added, their properly normalized sum tends toward a normal distribution&amp;rdquo; &lt;strong>&lt;a href="https://en.wikipedia.org/wiki/Central_limit_theorem" target="_blank" rel="noopener">(Wikipedia)&lt;/a>&lt;/strong>. This means that if $X_1,\ldots,X_n$ is a random sample drawn from a distribution with mean $\mu$ and variance $\sigma^2$, then&lt;/p>
&lt;center>$\bar{X} \stackrel{\cdot}{\sim} N(\mu,\frac{\sigma^2}{n}), n \rightarrow \infty $&lt;/center>
&lt;p>where $\bar{X}$ is the sample mean. Interestingly, CLT doesn&amp;rsquo;t specify an underlying distribution!&lt;/p>
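&lt;p>Equivalently, the standardized sample mean converges in distribution to a standard normal:&lt;/p>
&lt;center>$\sqrt{n}\,\frac{\bar{X}-\mu}{\sigma} \stackrel{d}{\rightarrow} N(0,1), n \rightarrow \infty$&lt;/center>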
&lt;h2 id="simulations">Simulations&lt;/h2>
&lt;p>The approach we&amp;rsquo;ll be taking here is to draw random values from uniform, lognormal, and inverse gamma distributions, calculate their means, and then compare them with the theoretical value given by the CLT. For this, we are going to use the &lt;code>NumPy&lt;/code>, &lt;code>seaborn&lt;/code>, and &lt;code>scipy.stats&lt;/code> modules.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, uniform, lognorm, invgamma
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">seed = 12345 # Fix the random seed to guarantee reproducibility
rng = np.random.default_rng(seed)
&lt;/code>&lt;/pre>
&lt;h3 id="normal-distribution">Normal distribution&lt;/h3>
&lt;p>&lt;code>scipy.stats&lt;/code> gives access to a variety of probability distributions. They are parametrized by a location parameter $L$ and a scale parameter $S$, and occasionally by a shape parameter $s$ (see &lt;a href="https://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html" target="_blank" rel="noopener">the scipy documentation&lt;/a>). For instance, the normal distribution has the probability density function&lt;/p>
&lt;center>$f(x) = \frac{1}{S\sqrt{2\pi}}\exp\left(-\frac{(x-L)^2}{2S^2}\right)$&lt;/center>
&lt;p>So, in this case, $L=\mu$ and $S=\sigma$ in the usual parametrization of the normal distribution. We can instantiate different normal distributions by varying the $L$ and $S$ parameters. For instance, to reproduce &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution#/media/File:Normal_Distribution_PDF.svg" target="_blank" rel="noopener">this figure&lt;/a>, we can instantiate 4 different distributions:&lt;/p>
&lt;pre>&lt;code class="language-python">norm_L0S02 = norm(loc=0,scale=np.sqrt(0.2)) # $\mu=0,\sigma^2=0.2$
norm_L0S10 = norm(loc=0,scale=np.sqrt(1.0)) # $\mu=0,\sigma^2=1.0$
norm_L0S50 = norm(loc=0,scale=np.sqrt(5.0)) # $\mu=0,\sigma^2=5.0$
norm_Ln2S05 = norm(loc=-2,scale=np.sqrt(0.5)) # $\mu=-2,\sigma^2=0.5$
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">x = np.linspace(-5,5,100)
y_1 = norm_L0S02.pdf(x)
y_2 = norm_L0S10.pdf(x)
y_3 = norm_L0S50.pdf(x)
y_4 = norm_Ln2S05.pdf(x)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.lineplot(x=x,y=y_1,color='blue');
sns.lineplot(x=x,y=y_2,color='red');
sns.lineplot(x=x,y=y_3,color='orange');
sns.lineplot(x=x,y=y_4,color='green');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_17_0.png" alt="png">&lt;/p>
&lt;h3 id="unifrom-distribution">Unifrom Distribution&lt;/h3>
&lt;p>First, we create a &lt;strong>uniform&lt;/strong> distribution $U(2,3)$:&lt;/p>
&lt;pre>&lt;code class="language-python">uniform_23 = uniform(loc=2,scale=1)
&lt;/code>&lt;/pre>
&lt;p>A uniform distribution looks like this:&lt;/p>
&lt;pre>&lt;code class="language-python">sns.histplot(uniform_23.rvs(1000),stat='density');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_22_0.png" alt="png">&lt;/p>
&lt;p>Again, &lt;code>scipy.stats&lt;/code> provides a convenient way to extract the mean and variance of a distribution:&lt;/p>
&lt;pre>&lt;code class="language-python">u_23_mean, u_23_var = uniform_23.stats(moments='mv')
print(&amp;quot;Mean: {:0.2f}, Variance: {:0.3f}&amp;quot;.format(u_23_mean,u_23_var))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Mean: 2.50, Variance: 0.083
&lt;/code>&lt;/pre>
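&lt;p>As a quick sanity check (an addition, not in the original post): a uniform $U(a,b)$ distribution has mean $(a+b)/2$ and variance $(b-a)^2/12$, so for $U(2,3)$ we expect $2.5$ and $1/12 \approx 0.083$:&lt;/p>
&lt;pre>&lt;code class="language-python"># Closed-form moments of U(a, b): mean (a + b) / 2, variance (b - a) ** 2 / 12
a, b = 2, 3
print((a + b) / 2)        # 2.5
print((b - a) ** 2 / 12)  # 0.08333333333333333
&lt;/code>&lt;/pre>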
&lt;p>Ok, so far, so good. Next, let&amp;rsquo;s take 1000 samples of size $n = 10$ and calculate their means. The &lt;code>rvs&lt;/code> method draws random values from a given distribution:&lt;/p>
&lt;pre>&lt;code class="language-python">uni_means = [uniform_23.rvs(size=10).mean() for x in range(1000)]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.histplot(uni_means,stat='density');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_27_0.png" alt="png">&lt;/p>
&lt;p>Well, that &lt;em>really looks like&lt;/em> a normal distribution with $\mu = 2.5$ and $\frac{\sigma^2}{n} = 0.0083$, right? Let&amp;rsquo;s do a visual inspection using the probability density function (pdf) of a normal distribution (better safe than sorry):&lt;/p>
&lt;pre>&lt;code class="language-python">norm_uni = norm(loc=u_23_mean,scale=np.sqrt(u_23_var/10))
x = np.linspace(2,3,100)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.histplot(uni_means,stat='density');
sns.lineplot(x=x,y=norm_uni.pdf(x),color='red');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_30_0.png" alt="png">&lt;/p>
&lt;h3 id="lognormal-distribution">Lognormal Distribution&lt;/h3>
&lt;p>Ok, but the uniform distribution is kind of a &amp;ldquo;classic,&amp;rdquo; right? What about a lesser-known distribution? We are going to try the &lt;strong>lognormal&lt;/strong> distribution.&lt;/p>
&lt;pre>&lt;code class="language-python">lnr_05_2 = lognorm(s=1/2,scale=2)
&lt;/code>&lt;/pre>
&lt;p>This distribution looks like this:&lt;/p>
&lt;pre>&lt;code class="language-python">sns.histplot(lnr_05_2.rvs(1000),stat='density');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_35_0.png" alt="png">&lt;/p>
&lt;p>Again, we retrieve the mean and variance of the distribution:&lt;/p>
&lt;pre>&lt;code class="language-python">lnr_05_2_mean, lnr_05_2_var = lnr_05_2.stats(moments='mv')
print(&amp;quot;Mean: {:0.2f}, Variance: {:0.3f}&amp;quot;.format(lnr_05_2_mean,lnr_05_2_var))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Mean: 2.27, Variance: 1.459
&lt;/code>&lt;/pre>
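&lt;p>These values can be cross-checked against the closed-form lognormal moments (a sanity check added here, not in the original post): with shape $\sigma = 1/2$ and &lt;code>scale&lt;/code> $= e^{\mu} = 2$, the mean is $e^{\mu + \sigma^2/2}$ and the variance is $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

sigma = 0.5       # scipy's shape parameter s
mu = np.log(2.0)  # scipy's scale is exp(mu), here 2

mean = np.exp(mu + sigma ** 2 / 2)
var = (np.exp(sigma ** 2) - 1) * np.exp(2 * mu + sigma ** 2)
print(f'Mean: {mean:0.2f}, Variance: {var:0.3f}')  # Mean: 2.27, Variance: 1.459
&lt;/code>&lt;/pre>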
&lt;p>As in the previous example, let&amp;rsquo;s take 1000 samples of size 10 and calculate their means&lt;/p>
&lt;pre>&lt;code class="language-python">lnr_means_10 = [lnr_05_2.rvs(size=10).mean() for x in range(1000)]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.histplot(lnr_means_10,stat='density');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_40_0.png" alt="png">&lt;/p>
&lt;p>Hmmm&amp;hellip; Now it doesn&amp;rsquo;t look &lt;em>very&lt;/em> normal, does it? Let&amp;rsquo;s see:&lt;/p>
&lt;pre>&lt;code class="language-python">norm_lnr = norm(loc=lnr_05_2_mean,scale=np.sqrt(lnr_05_2_var/10))
x = np.linspace(0.5,4.5,100)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.histplot(lnr_means_10,stat='density');
sns.lineplot(x=x,y=norm_lnr.pdf(x),color='red');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_43_0.png" alt="png">&lt;/p>
&lt;p>It &lt;em>kind of&lt;/em> fits, but it&amp;rsquo;s not very convincing. The tails seem a little bit off. The CLT holds as $n \rightarrow \infty$, so there is no fixed $n$ that works for every distribution. Let&amp;rsquo;s be bold and check with $n=100$:&lt;/p>
&lt;pre>&lt;code class="language-python">lnr_means_100 = [lnr_05_2.rvs(size=100).mean().mean() for x in range(1000)]
sns.histplot(lnr_means_100,stat='density');
norm_lnr = norm(loc=lnr_05_2_mean,scale=np.sqrt(lnr_05_2_var/100))
x = np.linspace(1.5,3,100) #norm.ppf: Percent point function (inverse of `cdf`)
sns.lineplot(x=x,y=norm_lnr.pdf(x),color='red');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_45_0.png" alt="png">&lt;/p>
&lt;p>Much more convincing, isn&amp;rsquo;t it?&lt;/p>
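&lt;p>One way to quantify this convergence (an addition to the original post) is to track the skewness of the sample means: for a normal distribution the skewness is 0, and for means of lognormal draws it should shrink roughly like $1/\sqrt{n}$:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from scipy.stats import lognorm, skew

rng = np.random.default_rng(12345)
lnr_05_2 = lognorm(s=1/2, scale=2)

# Skewness of the distribution of 1000 sample means, for growing n
skews = {}
for n in (10, 100, 1000):
    means = [lnr_05_2.rvs(size=n, random_state=rng).mean() for _ in range(1000)]
    skews[n] = skew(means)
    print(f'n = {n:4d}: skewness = {skews[n]:+.3f}')
&lt;/code>&lt;/pre>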
&lt;h3 id="inverse-gamma">Inverse Gamma&lt;/h3>
&lt;p>Let&amp;rsquo;s try one last distribution, an inverse gamma. It takes a shape parameter $a$:&lt;/p>
&lt;pre>&lt;code class="language-python">invgamma_407 = invgamma(a=4.07)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.histplot(invgamma_407.rvs(1000),stat='density');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_50_0.png" alt="png">&lt;/p>
&lt;p>As before, let&amp;rsquo;s retrieve the mean and variance:&lt;/p>
&lt;pre>&lt;code class="language-python">invgamma_407_mean, invgamma_407_var = invgamma_407.stats(moments='mv')
print(&amp;quot;Mean: {:0.2f}, Variance: {:0.3f}&amp;quot;.format(invgamma_407_mean,invgamma_407_var))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Mean: 0.33, Variance: 0.051
&lt;/code>&lt;/pre>
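&lt;p>Again, these match the closed-form moments (a cross-check added here, not in the original post): an inverse gamma with shape $a$ has mean $\frac{1}{a-1}$ for $a > 1$ and variance $\frac{1}{(a-1)^2(a-2)}$ for $a > 2$:&lt;/p>
&lt;pre>&lt;code class="language-python">a = 4.07
mean = 1 / (a - 1)
var = 1 / ((a - 1) ** 2 * (a - 2))
print(f'Mean: {mean:0.2f}, Variance: {var:0.3f}')  # Mean: 0.33, Variance: 0.051
&lt;/code>&lt;/pre>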
&lt;p>And generate the histograms of the means of samples of size 100 and 1000, with their respective normal approximations:&lt;/p>
&lt;pre>&lt;code class="language-python">invgamma_means_0100 = [invgamma_407.rvs(size= 100).mean() for x in range(1000)]
invgamma_means_1000 = [invgamma_407.rvs(size=1000).mean() for x in range(1000)]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">sns.histplot(invgamma_means_0100,stat='density');
norm_invg_1 = norm(loc=invgamma_407_mean,scale=np.sqrt(invgamma_407_var/100))
x = np.linspace(0.2,0.45,100)
sns.lineplot(x=x,y=norm_invg_1.pdf(x),color='red');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_55_0.png" alt="png">&lt;/p>
&lt;pre>&lt;code class="language-python">sns.histplot(invgamma_means_1000,stat='density');
norm_invg_2 = norm(loc=invgamma_407_mean,scale=np.sqrt(invgamma_407_var/1000))
x = np.linspace(0.25,0.40,100)
sns.lineplot(x=x,y=norm_invg_2.pdf(x),color='red');
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="CLT_56_0.png" alt="png">&lt;/p>
&lt;p>Again, as with the lognormal distribution, for $n=100$ the normal approximation seems a little bit off in the tails, probably because of the skewed shape of the inverse gamma distribution.&lt;/p>
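&lt;p>As a practical payoff, the CLT justifies the usual large-sample confidence interval $\bar{x} \pm z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}$ for the mean of any distribution with finite variance. A minimal sketch (my addition, not from the original post), reusing the inverse gamma above:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(12345)
sample = invgamma(a=4.07).rvs(size=1000, random_state=rng)

# 95% CLT-based confidence interval for the true mean, 1/(a - 1) ~ 0.326
xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
z = norm.ppf(0.975)  # ~1.96
print(f'95% CI: [{xbar - z * se:.3f}, {xbar + z * se:.3f}]')
&lt;/code>&lt;/pre>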
&lt;p>And that&amp;rsquo;s it! The CLT is a useful tool from both theoretical and practical perspectives because it allows us to model the mean of &lt;strong>any distribution&lt;/strong> with finite variance for (sufficiently) large samples.&lt;/p></description></item></channel></rss>