6  Law of large numbers and the central limit theorem

\[ \renewcommand{\P}{\mathbb{P}} \renewcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}} \newcommand{\var}{\mathrm{Var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\dx}{\,\mathrm{d}x} \newcommand{\dy}{\,\mathrm{d}y} \newcommand{\eps}{\varepsilon} \]

6.1 Joint behaviour of random variables

We have discussed discrete and continuous random variables. For a random variable \(X\), you now know how to calculate some of its characteristics: the expected value \(\E(X)\) and the variance \(\var(X)\). Now we consider how to characterise a pair of random variables.

ImportantMemorize

Let \(X,Y:\Omega\to\R\) be two random variables. Their joint cumulative distribution function (joint CDF) is the function \(F_{X,Y}:\R^2\to\R\) defined by \[ F_{X,Y}(x,y)=\P(X\leq x, Y\leq y). \]

Remark. In some cases, the joint CDF can be calculated manually from the description of the problem. However, in general, to calculate the joint CDF, we need additional information: the joint probability mass function in the discrete case and the joint probability density function in the continuous case.

ImportantMemorize

Let \(X:\Omega\to\{x_1,x_2,\ldots\}\) and \(Y:\Omega\to\{y_1,y_2,\ldots\}\) be two discrete random variables with the joint CDF \(F_{X,Y}\). Their joint probability mass function (joint PMF) is the function \[ p_{X,Y}(x_i,y_j)=\P(X=x_i,Y=y_j) \] (we can also say that \(p_{X,Y}(x,y)=0\) for all other \(x\) and \(y\)). Then \[ F_{X,Y}(x,y) =\sum_{x_i\leq x}\sum_{y_j\leq y}p_{X,Y}(x_i,y_j). \]

ImportantMemorize

Let \(X,Y:\Omega\to\R\) be two continuous random variables with the joint CDF \(F_{X,Y}\). Their joint probability density function (joint PDF) is the function \(f_{X,Y}:\R^2\to\R\) such that \[ F_{X,Y}(x,y) = \int_{-\infty}^x\biggl(\int_{-\infty}^y f_{X,Y}(u,v)\,\mathrm{d}v \biggr) \mathrm{d}u. \] Note that \[ \int_{-\infty}^\infty\biggl(\int_{-\infty}^\infty f_{X,Y}(u,v)\,\mathrm{d}v \biggr) \mathrm{d}u = 1. \]

Example 6.1 If the joint PMF (for the discrete case) or the joint PDF (for the continuous case) is not given explicitly, the joint CDF can usually be calculated only in very special cases, e.g. when one of the variables is defined in terms of the other. For example, consider \(X\sim U(0,1)\) and \(Y=X^2\); then \(F_{X,Y}(x,y)=0\) if \(x<0\) or \(y<0\), and for \(x\geq0, y\geq0\), we have \[ \begin{aligned} F_{X,Y}(x,y)& = \P(X\leq x, X^2\leq y)=\P(0\leq X\leq x, X^2\leq y) \\ &= \P(0\leq X\leq x, -\sqrt{y}\leq X\leq \sqrt{y})\\ &=\P(0\leq X\leq \min\{x,\sqrt{y}\})\\ & = \begin{cases} 1, & \text{if } \min\{x,y\}\geq 1,\\ \min\{x,\sqrt{y}\}, & \text{if } \min\{x,y\}< 1. \end{cases} \end{aligned} \] However, if we just have two random variables, e.g. \(X\sim U(0,1)\) and \(Y\sim U(0,1)\), then we can’t calculate \(F_{X,Y}\) unless we explicitly define the function \(f_{X,Y}\).
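
As a sanity check (not part of the original derivation), this joint CDF can be estimated by simulation. The minimal sketch below, with an arbitrary seed, sample size and test points, compares a Monte Carlo estimate of \(\P(X\leq x, X^2\leq y)\) with the closed-form answer, which for \(x,y\geq0\) can be written compactly as \(\min\{x,\sqrt{y},1\}\).

import numpy as np

rng = np.random.default_rng(1)
x_samples = rng.uniform(0, 1, size=1_000_000)   # X ~ U(0,1)
y_samples = x_samples**2                        # Y = X^2

def joint_cdf_mc(x, y):
    # Monte Carlo estimate of P(X <= x, Y <= y)
    return np.mean((x_samples <= x) & (y_samples <= y))

for x, y in [(0.5, 0.09), (0.8, 0.5), (2.0, 3.0)]:
    print(joint_cdf_mc(x, y), min(x, np.sqrt(y), 1.0))   # the two values should be close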

ImportantMemorize
  • In the discrete case: for any \(g:\R^2\to\R\), \[ \E(g(X,Y)) = \sum_{i}\sum_{j}g(x_i,y_j)p_{X,Y}(x_i,y_j), \] in particular, \[ \E(XY) = \sum_{i}\sum_{j} x_i y_j p_{X,Y}(x_i,y_j). \]

  • In the continuous case: for any \(g:\R^2\to\R\), \[ \E(g(X,Y)) = \int_{-\infty}^\infty \biggl(\int_{-\infty}^\infty g(x,y)\cdot f_{X,Y}(x,y)\dy\biggr)\dx, \] in particular, \[ \E(X\,Y) = \int_{-\infty}^\infty\biggl(\int_{-\infty}^\infty x\cdot y\cdot f_{X,Y}(x,y)\dy\biggr)\dx. \]
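
For a concrete discrete illustration of the double sum above, the joint PMF values below are made up purely for this sketch; \(\E(XY)\) and the marginal expectations are then direct sums.

# Hypothetical joint PMF stored as {(x_i, y_j): p_ij}; the probabilities sum to 1
p_XY = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

E_XY = sum(x * y * p for (x, y), p in p_XY.items())   # E(XY) via the double sum
E_X = sum(x * p for (x, _), p in p_XY.items())        # E(X)
E_Y = sum(y * p for (_, y), p in p_XY.items())        # E(Y)
print(E_XY, E_X, E_Y)   # 0.4, 0.7, 0.6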

TipRemember

For a given joint PDF \(f_{X,Y}\) (in the continuous case), we can calculate the PDFs of \(X\) and \(Y\) (the so-called marginal PDFs): \[ \begin{aligned} f_X(x) &= \int_{-\infty}^\infty f_{X,Y}(x,y)\dy,\\ f_Y(y) &= \int_{-\infty}^\infty f_{X,Y}(x,y)\dx. \end{aligned} \] Note, however, that for given \(f_X\) and \(f_Y\) one can’t uniquely recover \(f_{X,Y}\).

Similarly, for the discrete case, we can define the marginal PMFs, e.g. \[ \begin{aligned} p_X(x_i) &= \sum_{j}p_{X,Y}(x_i,y_j),\\ p_Y(y_j) &= \sum_{i}p_{X,Y}(x_i,y_j). \end{aligned} \] Again, one can’t uniquely recover \(p_{X,Y}\) from the pair \(p_X\) and \(p_Y\).
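
As a small sketch of the continuous marginal formula above, take an assumed joint density (not one from these notes), \(f_{X,Y}(x,y)=x+y\) on \([0,1]^2\); integrating out \(y\) numerically should reproduce the exact marginal \(f_X(x)=x+\frac12\).

from scipy.integrate import quad

# Assumed joint PDF on the unit square: f(x, y) = x + y for 0 <= x, y <= 1, and 0 otherwise
def f_XY(x, y):
    return x + y if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

def f_X(x):
    # marginal PDF of X: integrate the joint PDF over y
    return quad(lambda y: f_XY(x, y), 0, 1)[0]

print(f_X(0.3))   # ~0.8, matching the exact marginal 0.3 + 1/2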

ImportantMemorize

Recall that two random variables \(X\) and \(Y\) are independent if, for any \(a,b\in\R\), the events \(\{X\leq a\}\) and \(\{Y\leq b\}\) are independent, i.e. if \[ \P(X\leq a, Y\leq b) = \P(X\leq a) \P(Y\leq b), \] i.e. for all \(x\) and \(y\), \[ F_{X,Y}(x,y) = F_X(x) F_Y(y). \] It then also follows that, in the discrete case, \[ p_{X,Y}(x_i,y_j)=p_X(x_i)p_Y(y_j), \] and, in the continuous case, \[ f_{X,Y}(x,y) = f_X(x) f_Y(y). \]

Therefore, in both cases, we have that, for independent random variables, \[ \E(XY) = \E(X)\E(Y). \]
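
A quick numerical check of this identity, using two arbitrarily chosen independent distributions (a normal and an exponential), might look as follows.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(1.0, 2.0, size=1_000_000)    # E(X) = 1
Y = rng.exponential(3.0, size=1_000_000)    # E(Y) = 3, generated independently of X

print(np.mean(X * Y))             # ~3, the sample version of E(XY)
print(np.mean(X) * np.mean(Y))    # ~3, the product E(X)E(Y)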

ImportantMemorize

Let \(X:\Omega\to\R\) and \(Y:\Omega\to\R\) be two random variables (discrete or continuous). Covariance \(\cov(X,Y)\) describes the joint variability of these random variables, and it is defined by \[ \begin{aligned} \cov(X,Y) :&= \E\Bigl(\bigl(X-\E(X)\bigr) \cdot \bigl(Y-\E(Y)\bigr)\Bigr) \\& = \E(XY) - \E(X)\E(Y). \end{aligned} \]
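
Both expressions for the covariance can be checked on simulated data; in the sketch below the pair \((X,Y)\) is constructed artificially so that \(\cov(X,Y)=2\var(X)=2\).

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=100_000)
Y = 2 * X + rng.normal(size=100_000)      # dependent on X by construction

cov_def = np.mean((X - X.mean()) * (Y - Y.mean()))   # E((X - EX)(Y - EY))
cov_alt = np.mean(X * Y) - X.mean() * Y.mean()       # E(XY) - E(X)E(Y)
print(cov_def, cov_alt, np.cov(X, Y, bias=True)[0, 1])   # all three are ~2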

TipRemember

For any \(X,Y,V,W:\Omega\to\R\), \(a,b,c,d\in\R\),

  • \(\cov(X,Y)=\cov(Y,X)\)

  • \(\cov(X,X)=\var(X)=\sigma^2(X)\)

  • \(\cov(X,a)=0\)

  • \(\cov(aX, bY)=ab\cov(X,Y)\)

  • \(\cov(X+a,Y+b)=\cov(X,Y)\)

  • \(\cov(aX+bY,cV+dW)=ac\cov(X,V)+ad\cov(X,W)+bc\cov(Y,V)+bd\cov(Y,W)\)

  • \(\var(aX+bY)=a^2\var(X)+b^2\var(Y) +2ab\cov(X,Y)\)
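
The last property can be verified numerically. The sketch below uses an arbitrary dependent pair and arbitrary coefficients \(a,b\); the two sides agree up to floating-point error, since the identity also holds for empirical moments.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=200_000)
Y = 0.5 * X + rng.uniform(-1, 1, size=200_000)   # some dependent pair
a, b = 2.0, -3.0

lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y) + 2 * a * b * np.cov(X, Y, bias=True)[0, 1]
print(lhs, rhs)   # the two values coincide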

ImportantMemorize

For any random variables \(X,Y:\Omega\to\R\) with \(\sigma(X),\sigma(Y)>0\), we define their correlation as follows: \[ \corr(X,Y)=\dfrac{\cov(X,Y)}{\sigma(X)\cdot\sigma(Y)}. \] It can be proved that \[ \bigl\lvert \corr(X,Y)\bigr\rvert \leq 1, \] i.e.  \[ -1\leq \corr(X,Y)\leq 1. \]

ImportantMemorize

Two random variables, \(X\) and \(Y\), are called uncorrelated if their covariance is zero: \(\cov(X,Y)=0\) (and, hence, their correlation is also zero: \(\corr(X,Y)=0\)).

TipRemember

For uncorrelated random variables \(X\) and \(Y\) and for any \(a,b\in\R\), \[ \var(a X + b Y) = a^2\var(X)+b^2\var(Y). \]

TipReminder

Recall that for independent random variables \(X\) and \(Y\), \(\E(XY)=\E(X)\E(Y)\), and hence \(\cov(X,Y)=0\). Therefore, independent random variables are uncorrelated. The converse is false, as the following example shows.

Example 6.2 Let \(X\sim U(-1,1)\) and \(Y=X^2\). Then \(XY=X^3\), and hence \[ \cov(X,Y)=\E(XY)-\E(X)\E(Y)=\E(X^3)-\E(X)\E(X^2). \] We know that \[ \E(X)= \frac{(-1)+1}{2}=0. \] Next, since \(f_X(x)=\frac12\) for \(x\in(-1,1)\) and \(f_X(x)=0\) otherwise, we have \[ \E(X^3)=\int_{-\infty}^\infty x^3 f_X(x)\dx = \frac12\int_{-1}^1 x^3\dx=\frac12\biggl[ \frac{x^4}{4}\biggr]_{-1}^1 = 0. \] Therefore, \(\cov(X,Y)=0\), i.e. \(X\) and \(Y\) are uncorrelated. However, clearly, \(X\) and \(Y=X^2\) are not independent.
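
A simulation of this example (with an arbitrary sample size) shows both effects at once: the sample covariance is close to zero, while the probability of a joint event does not factorise, confirming the dependence.

import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=1_000_000)
Y = X**2                                   # fully determined by X, hence not independent

print(np.cov(X, Y, bias=True)[0, 1])       # ~0: uncorrelated
print(np.mean((X > 0.5) & (Y > 0.25)))     # P(X > 1/2, Y > 1/4) = 1/4
print(np.mean(X > 0.5) * np.mean(Y > 0.25))   # P(X > 1/2) P(Y > 1/4) = 1/8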

6.2 Law of large numbers (LLN)

TipRemember

Let \(X_1, \ldots, X_n:\Omega\to\R\) be random variables. They are called independent if, for any \(a_1,\ldots,a_n\in\R\), the events \(\{X_1\leq a_1\},\ldots,\{X_n\leq a_n\}\) are independent. Or, equivalently, if their joint CDF \[ F_{X_1,\ldots,X_n}(x_1,\ldots,x_n):=\P(X_1\leq x_1, \ldots, X_n\leq x_n) \] is the product of the CDFs for each \(X_i\): \[ F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = F_{X_1}(x_1)\ldots F_{X_n}(x_n). \]

ImportantMemorize

Random variables \(X_1, X_2, \ldots, X_n, \ldots\) are called independent and identically distributed random variables (in brief, i.i.d. r.v.) if the variables in any finite group \(X_1,\ldots,X_n\) are independent and they all have the same distribution: \(F_{X_1}=F_{X_2}=\ldots=F_{X_n}=\ldots =F_X\), where \(X\) is a random variable with this common distribution; i.e. \(X_1\sim X\), \(X_2\sim X\), \(\ldots\)

TipRemember

Let \(X_1, X_2,\ldots,X_n,\ldots\) be i.i.d. r.v. with \(\E(X)=\mu\) and \(\var(X)=\sigma^2<\infty\). Consider the sample average \[ \bar{X}_n = \frac{X_1+\ldots+X_n}{n}. \] Then \[ \E(\bar{X}_n) = \mu,\qquad\var(\bar{X}_n) = \frac{\sigma^2}{n}. \]
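
These two formulas can be checked by simulation; the exponential distribution and the sample size below are arbitrary choices (mean \(\mu=2\), variance \(\sigma^2=4\), \(n=50\)).

import numpy as np

rng = np.random.default_rng(6)
n, reps = 50, 100_000
samples = rng.exponential(2.0, size=(reps, n))   # mu = 2, sigma^2 = 4
xbar = samples.mean(axis=1)                      # one sample average per repetition

print(xbar.mean())   # ~mu = 2
print(xbar.var())    # ~sigma^2 / n = 4 / 50 = 0.08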

ImportantLaw of Large Numbers (LLN)

Let \(X_1, X_2,\ldots,X_n,\ldots\) be i.i.d. r.v. with \(\E(X)=\mu\) and \(\var(X)=\sigma^2<\infty\). Then \(\bar{X}_n\to \mu\) stochastically (this is also called convergence in probability): namely, for each \(\varepsilon>0\), \[ \lim_{n\to\infty} \P(|\bar{X}_n-\mu|>\varepsilon) = 0. \]

Remark. In other words, the bigger \(n\) you take, the smaller the chances are for the event \(\{|\bar{X}_n-\mu|>\varepsilon\}\). Equivalently, one can state that \[ \lim_{n\to\infty} \P(|\bar{X}_n-\mu|\leq \varepsilon) = 1, \] i.e. the bigger \(n\) you take, the larger the chances are for the event \(\{|\bar{X}_n-\mu|\leq \varepsilon\}\), which is equivalent to \(\mu-\varepsilon\leq\bar{X}_n\leq\mu+\varepsilon\). Thus, informally speaking, as \(n\) grows, there are good chances of finding \(\bar{X}_n\) around \(\mu\). We can choose \(\varepsilon\) arbitrarily small, i.e. we can require that \(\bar{X}_n\) be very close to \(\mu\), and the law of large numbers states that there is a high probability (close to \(1\)) of achieving this if we take \(n\) large enough.

TipRemember

Let \(A\) be a random event as a result of an experiment; let \(\P(A) =p\). Consider the Bernoulli random variable \(X\) with \(X=1\) if \(A\) holds and \(X=0\) otherwise. Let \(X_1,\ldots,X_n,\ldots\) be i.i.d. r.v. with \(X_n\sim X\). Then \[ \mu=\E(X) = 1\cdot p +0\cdot (1-p)=p. \] Next, the sample average \(\bar{X}_n=\frac1n(X_1+\ldots+X_n)\) is the proportion of times that \(A\) took place when we repeated the experiment \(n\) times. (Note that \(X_1+\ldots+X_n\sim Bin(n,p)\).) In other words, \(\bar{X}_n\) is the frequency with which the event \(A\) took place among the \(n\) trials. Then the LLN states that \[ \frac{\text{number of trials when $A$ happened}}{\text{number $n$ of all trials}}\to \P(A) \] in a proper sense (as \(n\to\infty\)). This corresponds to our “intuitive” understanding of probability.

Example 6.3 Consider many rolls of a fair six-sided die. Let \(X_j\) be the score of the \(j\)-th roll, and \(S_n=X_1+\ldots+X_n\) be the sum of the scores in the first \(n\) rolls. All \(X_j\) are i.i.d. r.v. with \[ \E(X)=1\cdot \frac16 + 2\cdot \frac16 + \ldots + 6\cdot \frac16=\frac{7\cdot 6}{2}\cdot \frac16 = \frac72. \] Therefore, by the LLN, for any small \(\eps>0\), \[ \lim_{n\to\infty}\P\Biggl( \biggl\lvert \frac{S_n}{n} - \frac72\biggr\rvert \leq \eps \Biggr) = 1, \] or equivalently, \[ \lim_{n\to\infty}\P\biggl( \Bigl(\frac72-\eps\Bigr)n \leq S_n\leq \Bigl(\frac72+ \eps \Bigr)n \biggr) = 1. \]
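
A short simulation of this dice example (the number of rolls is an arbitrary choice) shows the running averages \(S_n/n\) settling near \(7/2\).

import numpy as np

rng = np.random.default_rng(7)
rolls = rng.integers(1, 7, size=100_000)                         # fair die: scores 1..6
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)   # S_n / n for n = 1, 2, ...

print(running_mean[[9, 99, 9_999, 99_999]])   # the values approach 3.5 as n grows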

6.3 Central limit theorem (CLT)

As we have seen, the LLN states that, for i.i.d. r.v. \(X_n\sim X\), \(n\geq1\), \[ \overline{X}_n=\frac{X_1+\ldots+X_n}{n}\to \E(X) \] stochastically (in probability) as \(n\to\infty\). We have also shown that \(\E(\overline{X}_n)=\E(X)\) for each \(n\), i.e. we can reformulate the LLN as follows: \[ \overline{X}_n - \E(\overline{X}_n)\to 0, \qquad n\to\infty. \]

TipPreparation

The Central Limit Theorem (CLT) shows how fast \(\overline{X}_n\) converges to \(\E(X)\). To formulate it, we recall that \(\var(\overline{X}_n )=\frac{\sigma^2(X)}{n}\). Hence, \[ \sigma(\overline{X}_n ) = \frac{\sigma(X)}{\sqrt{n}}. \] We define, for \(\mu:=\E(X)\), \(\sigma:=\sigma(X)\), \[ Z_n:= \frac{\overline{X}_n - \E(\overline{X}_n)}{\sigma(\overline{X}_n )} = \frac{\overline{X}_n - \mu}{\frac{\sigma}{\sqrt{n}}}=\frac{\sqrt{n}}{\sigma}(\overline{X}_n-\mu). \] Note that \[ \E(Z_n) =0, \qquad \var(Z_n) = 1. \]

ImportantCentral Limit Theorem (CLT)

Let \(X_1,\ldots,X_n,\ldots\) be i.i.d. r.v. with \(X_n\sim X\), \(\mu:=\E(X)\), \(\sigma^2:=\var(X)<\infty\). Let \(Z_n\) be defined as above. Then \[ Z_n\to Z\sim \mathcal{N}(0,1), \quad n\to\infty, \] where the convergence is in distribution; the latter means that \[ \lim_{n\to\infty}\P(Z_n\leq z)= \Phi(z), \quad z\in\R, \] where \(\Phi(z)=F_Z(z)=\P(Z\leq z)\). As a corollary, \[ \lim_{n\to\infty}\P(a\leq Z_n\leq b)= \Phi(b)-\Phi(a), \quad a,b\in\R, \] and \[ \lim_{n\to\infty}\P(Z_n\geq c)= 1-\Phi(c), \quad c\in\R. \]
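
The theorem can be illustrated by simulation. In the sketch below, the summands are \(U(0,1)\) (so \(\mu=\frac12\), \(\sigma=\frac{1}{\sqrt{12}}\)) and \(n=30\); both are arbitrary choices, and the empirical CDF of \(Z_n\) is compared with \(\Phi\) at a few points.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n, reps = 30, 200_000
mu, sigma = 0.5, 1 / np.sqrt(12)                       # mean and st. dev. of U(0,1)
xbar = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
Z_n = np.sqrt(n) * (xbar - mu) / sigma                 # standardised sample averages

for z in (-1.0, 0.0, 1.5):
    print(np.mean(Z_n <= z), norm.cdf(z))   # empirical CDF of Z_n vs Phi(z)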

Remark. The central limit theorem shows, in particular, that \(\overline{X}_n\) fluctuates around its expected value \(\E(\overline{X}_n)=\mu\) with the standard deviation \(\sigma(\overline{X}_n)=\frac{\sigma}{\sqrt{n}}\), which is significantly smaller than the standard deviation \(\sigma\) of each \(X_n\) around its expected value \(\E(X_n)=\mu\). And this is true regardless of the distribution of the \(X_n\). Let us consider this in an example.

Example 6.4 The average teacher’s salary in New Jersey in 2023 is $63178. Suppose that the salaries are distributed normally with standard deviation $7500. Hence, we have that \(X\sim \mathcal{N}(63178,7500^2)\).

Let us first find the probability that a randomly selected teacher makes less than $60000 per year. We have

\[ \begin{aligned} \P(X<60000)&=\P\biggl(\frac{X-63178}{7500}<\frac{60000-63178}{7500}\biggr)\\ &=\P(Z<-0.42)=\Phi(-0.42), \end{aligned} \] where \(Z=\frac{X-63178}{7500}\sim \mathcal{N}(0,1)\).

Using statistical tables (and the equality \(\Phi(-0.42)=1-\Phi(0.42)\)) or Python commands

from scipy.stats import norm
norm.cdf(-0.42)
# 0.3372427268482495

we conclude that \[ \P(X<60000)\approx 0.337, \] i.e. roughly one out of three randomly picked teachers may have a salary of less than $60000.

Consider now a sample of \(100\) teacher salaries. The sample mean (the average salary) is then \[ \overline{X}_{100}=\frac{X_1+\ldots+X_{100}}{100} \] where all \(X_j\sim X\) are i.i.d. r.v. We know that \[ \E(\overline{X}_{100}) = \E(X)=63178 \] and \[ \sigma(\overline{X}_{100})=\frac{\sigma(X)}{\sqrt{100}}=750. \] Therefore, the probability that the average salary of a sample of \(100\) teachers is less than $60000 per year is \[ \begin{aligned} \P(\overline{X}_{100}<60000)&= \P\biggl( \frac{\overline{X}_{100}-63178}{750} < \frac{60000-63178}{750} \biggr) \\ &= \P(Z_{100}<-4.2)\approx\Phi(-4.2) \end{aligned} \] where the latter approximate equality is according to the CLT. Since

norm.cdf(-4.2)
# 1.3345749015906314e-05

we have that \[ \P(\overline{X}_{100}<60000)\approx 0.0000133, \] i.e., informally speaking, this is very unlikely.