8  Maximum likelihood estimation

\[ \renewcommand{\P}{\mathbb{P}} \renewcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}} \newcommand{\var}{\mathrm{Var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\dx}{\,\mathrm{d}x} \newcommand{\dy}{\,\mathrm{d}y} \newcommand{\eps}{\varepsilon} \]

Important: Memorize

Let \(X\) be a discrete random variable whose distribution depends on a parameter \(\theta\in\R\). Suppose that we observe the data \(x_1,\ldots,x_n\), the output of this random variable \(X\) in the course of \(n\) independent trials. In other words, we observe that the i.i.d. random variables \(X_1,\ldots, X_n\) with \(X_i\sim X\), \(1\leq i\leq n\), take certain values: \(X_1=x_1,\ldots, X_n=x_n\). The likelihood, or likelihood function, is the function \(\mathcal{L}(\theta)=\mathcal{L}(\theta\mid x_1,\ldots,x_n)\) of the unknown parameter \(\theta\) (given the observed data \(x_1,\ldots,x_n\)) which is equal to the probability of observing this data (given the value of the parameter \(\theta\)): \[ \begin{aligned} \mathcal{L}(\theta)&=\mathcal{L}(\theta\mid x_1,\ldots,x_n)\\ :&=\P(X_1=x_1,\ldots,X_n=x_n\mid \theta) \\&= \P(X_1=x_1\mid \theta)\ldots \P(X_n=x_n\mid \theta). \end{aligned} \]
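To make the definition concrete, here is a minimal Python sketch (not part of the notes; the Bernoulli-type pmf and the sample below are illustrative stand-ins for any discrete distribution) that evaluates \(\mathcal{L}(\theta)\) as the product of the per-observation probabilities:

```python
# Minimal sketch: the likelihood of observed data x_1, ..., x_n for a given theta.
import math

def pmf(x, theta):
    """Hypothetical pmf of a {0,1}-valued X: P(X=1)=theta, P(X=0)=1-theta."""
    return theta if x == 1 else 1 - theta

def likelihood(data, theta):
    """L(theta | x_1,...,x_n) = prod_i P(X_i = x_i | theta) for independent trials."""
    return math.prod(pmf(x, theta) for x in data)

data = [0, 1, 0, 0, 1, 1, 1, 0, 1]                 # an illustrative sample
print(likelihood(data, 0.5), likelihood(data, 0.6))
```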

Important: Memorize

The maximum likelihood estimator \(\theta_*\) of the parameter \(\theta\) is the argument of the maximum of the likelihood function: \[ \theta_*=\mathop{\mathrm{argmax}}_\theta\mathcal{L}(\theta), \] which means that \[ \mathcal{L}(\theta_*) = \max_{\theta}\mathcal{L}(\theta). \]

Tip: Remember

The standard approach to find \(\theta_*\) is to consider the **log-likelihood** function \[ \begin{aligned} L(\theta):&=L(\theta\mid x_1,\ldots,x_n)=\ln \mathcal{L}(\theta\mid x_1,\ldots,x_n) \\& = \ln \P(X_1=x_1\mid \theta)+ \ldots + \ln \P(X_n=x_n\mid \theta). \end{aligned} \] Then \(\theta_*\) is the point of maximum for both \(\mathcal{L}\) and \(L\): \[ \theta_*=\mathop{\mathrm{argmax}}_\theta L(\theta)=\mathop{\mathrm{argmax}}_\theta\mathcal{L}(\theta). \]

Remark. The reason for this is the fact that the logarithm \(y=\ln x\) is an increasing function, and hence \[ \mathcal{L}(\theta)\leq \mathcal{L}(\theta_*) \Longleftrightarrow L(\theta)\leq L(\theta_*). \]
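A small numerical sketch of this point (illustrative only; the sample and the grid of \(\theta\) values are invented): on a grid of candidate values, the likelihood and the log-likelihood of a Bernoulli-type sample are maximised at the same point.

```python
# The likelihood and the log-likelihood have the same argmax, since ln is increasing.
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]          # hypothetical sample: 7 ones out of 10
thetas = [i / 1000 for i in range(1, 1000)]    # grid, avoiding the endpoints 0 and 1

def lik(theta):
    return math.prod(theta if x == 1 else 1 - theta for x in data)

def loglik(theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

print(max(thetas, key=lik))      # argmax of L(theta)      -> 0.7
print(max(thetas, key=loglik))   # argmax of ln L(theta)   -> the same value
```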

Important: Reminder

To check that \(\theta_*\) is the point of maximum of \(L(\theta)\), it is enough to check that \[ L'(\theta_*)=0 \quad \text{and} \quad L''(\theta_*)<0. \]
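If one wants to verify such a computation symbolically, a short sympy sketch works; the toy log-likelihood \(L(\theta)=3\ln\theta-5\theta\) below is purely illustrative and not taken from the notes.

```python
# Symbolic check of the stationarity and second-derivative conditions
# for the toy function L(theta) = 3*ln(theta) - 5*theta.
import sympy as sp

theta = sp.symbols('theta', positive=True)
L = 3 * sp.log(theta) - 5 * theta

theta_star = sp.solve(sp.diff(L, theta), theta)[0]   # solves L'(theta) = 0
print(theta_star)                                    # 3/5
print(sp.diff(L, theta, 2).subs(theta, theta_star))  # -25/3 < 0, so a maximum
```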

Example 8.1 Let \(X:\Omega\to\{0,1\}\) be a Bernoulli random variable with \(\P(X=1)=\theta\) and \(\P(X=0)=1-\theta\), where \(\theta\in[0,1]\) is a parameter. Suppose that we are given a sample of length \(n\) of values of \(X\) which contains \(k\) ones and \(n-k\) zeros (the sample has a particular order, e.g. \(010010111001\ldots\)). Then the probability of getting this particular sample, for any \(\theta\in[0,1]\), is \(\theta^k(1-\theta)^{n-k}\), i.e. the likelihood function for the given data is \[ \mathcal{L}(\theta) = \theta^k(1-\theta)^{n-k}. \] Hence, the log-likelihood function for the given data is \[ \begin{aligned} L(\theta)&=\ln \mathcal{L}(\theta) =\ln\bigl(\theta^k(1-\theta)^{n-k}\bigr)\\ &= \ln \theta^k + \ln (1-\theta)^{n-k}\\ & =k\ln\theta +(n-k)\ln(1-\theta). \end{aligned} \] Then \[ \begin{aligned} L'(\theta)& =\bigl( k\ln\theta +(n-k)\ln(1-\theta)\bigr)'\\ & = \frac{k}{\theta}-\frac{n-k}{1-\theta}\\ & = \frac{k(1-\theta)-(n-k)\theta}{\theta(1-\theta)}\\ &= \frac{k-n\theta}{\theta(1-\theta)}. \end{aligned} \] Therefore, \(L'(\theta)=0\) iff \(k-n\theta=0\), i.e.  \[ \theta=\frac{k}{n}. \] Moreover, \[ \begin{aligned} L''(\theta)&=(L'(\theta))' = \biggl( \frac{k}{\theta}-\frac{n-k}{1-\theta}\biggr)' \\& = -\frac{k}{\theta^2}-\frac{n-k}{(1-\theta)^2}<0 \end{aligned} \] for all \(\theta\in[0,1]\), in particular, for \(\theta_*=\dfrac{k}{n}\in[0,1]\) (as \(0\leq k\leq n\)). Therefore, \(\theta_*=\dfrac{k}{n}\) is the point of maximum of \(L(\theta)\), and hence, it is the maximum likelihood estimator for the parameter \(\theta\).
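A numerical companion to this example (a sketch; the counts \(n=20\), \(k=13\) are invented): minimising the negative log-likelihood with scipy recovers the closed-form answer \(k/n\).

```python
# Numerically maximise k*ln(theta) + (n-k)*ln(1-theta) and compare with k/n.
import math
from scipy.optimize import minimize_scalar

n, k = 20, 13                                  # 13 ones among 20 trials (illustrative)

def neg_loglik(theta):
    return -(k * math.log(theta) + (n - k) * math.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x)        # numerical maximiser, approximately 0.65
print(k / n)        # closed-form MLE theta_* = k/n = 0.65
```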

Remark. Note that \(S_n:=X_1+\ldots+X_n\sim Bin(n,\theta)\) is a binomial random variable, and \(k\) ones in \(n\) Bernoulli trials means \(S_n=k\). Then the sample mean \(\overline{X}_n=\frac1{n}(X_1+\ldots+X_n)=S_n/n\) takes the value \(\frac{k}{n}\). We have that \[ \E(X) = 1\cdot \theta+ 0\cdot(1-\theta)=\theta, \] and the law of large numbers says that (in a certain sense) \[ \overline{X}_n\to \theta, \quad n\to\infty. \] In other words, the maximum likelihood estimator converges to the theoretical value as the size of the sample tends to infinity.
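A quick simulation of this remark (illustrative only; the value \(\theta=0.3\) is arbitrary): the sample mean of Bernoulli trials, i.e. the estimator \(k/n\), settles near \(\theta\) as \(n\) grows.

```python
# Law of large numbers for Bernoulli(theta): the sample mean approaches theta.
import random

random.seed(0)
theta = 0.3
for n in (10, 100, 10_000, 1_000_000):
    xbar = sum(1 for _ in range(n) if random.random() < theta) / n
    print(n, xbar)   # the printed values settle near 0.3
```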

Example 8.2 Let \(X\sim Po(\lambda)\), i.e.  \[ \P(X=k)=\frac{\lambda^k}{k!}e^{-\lambda}, \quad k\geq0. \] Suppose we have a sample of \(n\) values of \(X\): \(k_1,\ldots,k_n\). Then \[ \begin{aligned} \mathcal{L}(\lambda)&=\P(X=k_1\mid\lambda)\ldots \P(X=k_n\mid\lambda)\\ & = \frac{\lambda^{k_1}}{k_1!}e^{-\lambda}\cdot\ldots\cdot \frac{\lambda^{k_n}}{k_n!}e^{-\lambda}\\ & = \underbrace{\frac{1}{k_1!\ldots k_n!}}_{=: c>0}\lambda^{k_1+\ldots+k_n}e^{-\lambda n}, \end{aligned} \] and therefore, \[ \begin{aligned} L(\lambda)&=\ln\mathcal{L}(\lambda) \\& = \ln c + (k_1+\ldots+k_n)\ln\lambda-\lambda n. \end{aligned} \] Then \[ L'(\lambda) = \frac{k_1+\ldots+k_n}{\lambda}-n, \] and hence, \(L'(\lambda)=0\) iff \[ \lambda = \frac{k_1+\ldots+k_n}{n}. \] Since \[ L''(\lambda) = (L'(\lambda))'=-\frac{k_1+\ldots+k_n}{\lambda^2}<0, \] the value \(\lambda_* = \frac{k_1+\ldots+k_n}{n}\) is the point of maximum of \(L\); hence, it is the maximum likelihood estimator for the parameter \(\lambda\).
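A brief check of this example (a sketch with an invented sample): the maximiser of the Poisson log-likelihood is the sample mean of the observed counts, and the log-likelihood is indeed no larger at nearby values of \(\lambda\).

```python
# Poisson MLE: lambda_* is the sample mean of the observed counts.
import math

counts = [2, 0, 3, 1, 4, 2, 2, 1]          # hypothetical observations k_1, ..., k_n
lam_star = sum(counts) / len(counts)       # lambda_* = (k_1 + ... + k_n) / n

def loglik(lam):
    # ln P(X=k | lam) = k*ln(lam) - lam - ln(k!), summed over the sample
    return sum(k * math.log(lam) - lam - math.log(math.factorial(k)) for k in counts)

print(lam_star)
print(loglik(lam_star), loglik(lam_star - 0.1), loglik(lam_star + 0.1))
```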