The Method of Maximum Likelihood Estimation

Conditional Probability: For any given events \(A\) and \(B\) in a probability space \( (\Omega, \mathcal{F}, \mathbf{P}) \) with \(\mathbf{P}(B)>0\), the conditional probability of event \(A\), given event \(B\), is 
$$ \mathbf{P}(A|B)=\frac{\mathbf{P}(A \cap B)}{\mathbf{P}(B)}. $$
Bayes' Theorem.
For any events \(A\) and \(B\) with \(\mathbf{P}(A)>0\) and \(\mathbf{P}(B)>0\),

$$ \mathbf{P}(A|B)=\frac{\mathbf{P}(B|A) \mathbf{P}(A)}{\mathbf{P}(B)}, $$
where

  • \(\mathbf{P}(A)\) and \(\mathbf{P}(B)\) are prior probabilities.
  • \(\mathbf{P}(A|B)\) is a posterior probability. Bayes' theorem is used to update our beliefs about the parameters of a distribution after observing data.


Example:


  • A box contains 10 balls and 6 cubes.
  • 5 of the balls are blue and the other 5 are red.
  • 2 of the cubes are blue and the other 4 are red.
  • An item is chosen from the box at random.
  • What is the probability that the chosen item is a cube, given that you have spotted a red item but were not able to identify its shape?

Solution.
One way to solve the problem is to use the fact that the chosen item is red and exclude all the information related to the blue items, as can be seen in the table below: of the 9 red items, 4 are cubes, so the required probability is \( \frac{4}{9} \).


A general solution is to use Bayes' theorem, with the counts summarised in the following table:

            Red    Blue    Sub-total
Ball          5       5           10
Cube          4       2            6
Sub-total     9       7           16


Let event \(A\): the chosen item is a cube. Let event \(B\): the chosen item is red. Using the frequency interpretation $$ \mathbf{P}(A)= \frac{\text{number of times } A \text{ occurs}}{\text{number of replications}}, $$ the table gives

\( \mathbf{P}(B)=\frac{9}{16} \) and \( \mathbf{P}(A \cap B)=\frac{4}{16} =\frac{1}{4}. \)


This gives \( \mathbf{P}(A|B)=\frac{1/4}{9/16}=\frac{4}{9}\), while \( \mathbf{P}(A) =\frac{6}{16}=\frac{3}{8}\).
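
The counting argument can also be checked with a few lines of code. A minimal Python sketch (the item list is just the box contents written out):

```python
from fractions import Fraction

# The 16 items in the box, written as (shape, colour) pairs.
items = ([("ball", "blue")] * 5 + [("ball", "red")] * 5 +
         [("cube", "blue")] * 2 + [("cube", "red")] * 4)

n = len(items)
p_red = Fraction(sum(colour == "red" for _, colour in items), n)        # P(B) = 9/16
p_cube_and_red = Fraction(sum(shape == "cube" and colour == "red"
                              for shape, colour in items), n)           # P(A ∩ B) = 1/4
print(p_red, p_cube_and_red, p_cube_and_red / p_red)                    # 9/16 1/4 4/9
```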

Independence: Two subsets \( A,B \in \mathcal{F} \), where \( \mathcal{F} \) is a σ-algebra on a sample space \(\Omega\), are called independent if and only if $$ \mathbf{P}(A\cap B) = \mathbf{P}(A) \times \mathbf{P}(B). $$

Example: Let \(x\) and \(y\) be two independent and exponentially distributed random variables with parameters \(\lambda_1\) and \(\lambda_2\), respectively. The probability that both random variables are greater than \(M\) is

$$\mathbf{P}(x>M \cap y>M)=\mathbf{P}(x>M)\, \mathbf{P}(y>M)=e^{-\lambda_1 M}\, e^{-\lambda_2 M}=e^{-(\lambda_1+\lambda_2)M}.$$
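
A quick Monte Carlo check of this identity (a sketch assuming NumPy, with illustrative values for \(\lambda_1\), \(\lambda_2\) and \(M\)):

```python
import numpy as np

rng = np.random.default_rng(0)
lam1, lam2, M = 0.5, 1.2, 1.0            # illustrative parameter values
n = 1_000_000

# NumPy parameterises the exponential by scale = 1 / lambda.
x = rng.exponential(scale=1 / lam1, size=n)
y = rng.exponential(scale=1 / lam2, size=n)

empirical = np.mean((x > M) & (y > M))   # estimate of P(x > M and y > M)
exact = np.exp(-(lam1 + lam2) * M)       # e^{-(lambda1 + lambda2) M}
print(empirical, exact)                  # the two values should agree closely
```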

Example: Suppose that we are about to take an observation, which we know has come from a Poisson distribution with mean \(\lambda =3, 4\) or \(5\).
  • Suppose that we believe the mean \(\lambda\) is 3, 4 or 5 with corresponding prior probabilities 0.1, 0.5 and 0.4.
  • We now perform the experiment and observe \(x=7\).
  • What are the posterior probabilities for \(\lambda\), given that \(x=7\)?

Solution.
The observation is most likely to have come from \(\lambda=5\), since, using the Poisson probability \(\mathbf{P}(x|\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}\), $$ \mathbf{P}(x=7| \lambda =3) =\frac{3^7 e^{-3}}{7!}=0.0216, $$ $$ \mathbf{P}(x=7| \lambda =4)=0.0595, $$ $$ \mathbf{P}(x=7| \lambda =5)=0.1044. $$ We then have $$\mathbf{P}(\lambda=3|x=7) =\frac{\mathbf{P}(x=7|\lambda=3)\, \mathbf{P}(\lambda=3)}{\mathbf{P}(x=7)} =0.0293, $$ where \(\mathbf{P}(x=7)=\sum_{i=3}^5 \mathbf{P}(x=7|\lambda =i)\, \mathbf{P}( \lambda =i) \).

Similarly, $$ \mathbf{P}(\lambda=4|x=7) =0.4038, $$ $$ \mathbf{P}(\lambda=5|x=7) =0.5669. $$
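
These posterior probabilities are easy to reproduce; a minimal sketch using only the Python standard library:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam ** x * exp(-lam) / factorial(x)

x = 7
prior = {3: 0.1, 4: 0.5, 5: 0.4}                                   # P(lambda = i)
likelihood = {lam: poisson_pmf(x, lam) for lam in prior}           # P(x = 7 | lambda = i)
evidence = sum(likelihood[lam] * prior[lam] for lam in prior)      # P(x = 7)

posterior = {lam: likelihood[lam] * prior[lam] / evidence for lam in prior}
print(posterior)   # approximately {3: 0.0293, 4: 0.4038, 5: 0.5669}
```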
 
Maximum Likelihood Estimation

Introduction

  • R. A. Fisher proposed the method of maximum likelihood estimation early in the twentieth century.
  • The method finds the parameter values of a given family of distributions under which the observed data are most likely.
  • It assumes that the family of distributions to which the data belong is known, but that the parameters of the distribution are unknown.
  • Let \(\Theta=\{\theta_1, \theta_2, \dots, \theta_n \}\) be the set of parameters of a statistical distribution. Then, for any event \(x\) with \(\mathbf{P}(x)>0\) and \(\mathbf{P}(\Theta=\hat{\Theta})>0\), $$\mathbf{P}(x|\Theta=\hat{\Theta})=\frac{ \mathbf{P}(\Theta=\hat{\Theta}|x) \mathbf{P}(x)}{\mathbf{P}(\Theta=\hat{\Theta})}.$$ If event \( \displaystyle{x= \bigcap_{i=1}^m x_i}\), then $$ \mathbf{P}(x_1,x_2, \dots, x_m|\Theta=\hat{\Theta})=\frac{\mathbf{P}(x_1,x_2, \dots, x_m; \Theta=\hat{\Theta})}{\mathbf{P}(\Theta=\hat{\Theta})}. $$
  • If we have no prior information about the value of \(\Theta\), the term \(\mathbf{P}(\Theta=\hat{\Theta})\) can be treated as constant, so maximising $$\mathbf{P}(x_1,x_2, \dots, x_m|\Theta=\hat{\Theta}) $$ also maximises the joint probability $$\mathbf{P}(x_1,x_2, \dots, x_m; \Theta=\hat{\Theta}). $$





  • If we know that the observations have come from a family of distributions with density function \(f(x|\Theta)\), we need to find the value of \(\hat{\Theta}\) such that \(f(x_1,x_2, \dots, x_m|\Theta=\hat{\Theta})\) is maximised.

  • \(L(\Theta)=f(x_1,x_2, \dots, x_m|\Theta=\hat{\Theta})\) is called the likelihood function.

  • The maximum likelihood estimator (MLE) is the value \(\hat{\Theta}\) that maximises the likelihood function.

  • If the observations \(x_1,x_2, \dots, x_m\) are independent and identically distributed (iid), we can write the joint density of the observed data as the product of the marginal densities: $$L(\Theta)=\prod_{i=1}^m{f(x_i|\Theta=\hat{\Theta})}, $$ where \(f(x|\Theta)\) is the probability density function (probability mass function) of the continuous (discrete) random variable \(x\). Since \(\mathrm{log}\) is a monotonically increasing function, we can find the MLE \(\hat{\Theta}\) by maximising the log-likelihood function $$l(\Theta)=\mathrm{log} \left( L(\Theta) \right)=\sum_{i=1}^m{ \mathrm{log} \left(f(x_i|\Theta=\hat{\Theta})\right)}. $$ 

    Maximising the likelihood

    We are looking for \(\Theta=\hat{\Theta}\) that maximises \(\mathrm{log} \left( L(\Theta) \right)\). The optimal \(\hat{\Theta}\) is found by solving the set of equations $$\frac{\partial\, l(\Theta)}{\partial \theta_i} = 0, \quad i=1,\dots, n,$$ given that the square symmetric matrix $$M=\Big[ -\frac{\partial^2 l(\Theta)}{\partial \theta_i \partial \theta_j} \Big]_{i,j=1}^n$$ is positive definite at \(\Theta=\hat{\Theta}\). Note that $$\Big[\frac{\partial^2 l(\Theta)}{\partial \theta_i \partial \theta_j} \Big]_{i,j=1}^n$$ is the Hessian matrix of the log-likelihood. 
    A square real matrix \(M\) is positive definite if for all non-zero real vectors \(y\) we have $$y' M y >0,$$ where \(y'\) is the transpose of \(y\). Equivalently, for a symmetric matrix it suffices to verify that all the eigenvalues of \(M\) are positive.
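
    In practice this check is often done numerically through the eigenvalues of \(M\). A minimal sketch, assuming NumPy is available and using an illustrative matrix in place of a real negative Hessian:

```python
import numpy as np

def is_positive_definite(M):
    """Check that a symmetric matrix M has strictly positive eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(M)   # eigvalsh assumes a symmetric matrix
    return bool(np.all(eigenvalues > 0))

# Illustrative negative Hessian of a log-likelihood with two parameters.
M = np.array([[4.0, 1.0],
              [1.0, 3.0]])
print(is_positive_definite(M))   # True, so the stationary point is a maximum
```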

    Example 1: Suppose that \(x_1, x_2, \dots, x_n\) are observed from an Exponential distribution with parameter \(\lambda\). Find the MLE of \(\lambda\).
    Solution.
    The Likelihood function is $$ L(\lambda)=\prod_{i=1}^n{f(x_i|\lambda=\hat{\lambda})}=\prod_{i=1}^n{\lambda e^{-\lambda x_i}}. $$ The log-likelihood function is $$ l(\lambda)=\mathrm{log} \left( L(\lambda) \right)=\sum_{i=1}^n{ \left( \mathrm{log}(\lambda) -\lambda x_i \right) }= n \mathrm{log}(\lambda) - \lambda \sum_{i=1}^n{ x_i}. $$ Differentiate with respect to \(\lambda\), $$ \frac{\partial l}{ \partial \lambda}=\frac{n}{\lambda} -\sum_{i=1}^n{ x_i}. $$ Set the derivative to zero and solve for \(\lambda\) to get $$ \hat{\lambda}=\displaystyle{\frac{n}{\sum_{i=1}^n{ x_i}}= \frac{1}{\bar{x}}}, $$ which indeed gives a maximum, since $$ \frac{\partial^2 l}{\partial \lambda^2}=\frac{-n}{\lambda^2}< 0.$$
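
    The closed form \(\hat{\lambda}=1/\bar{x}\) can be checked against a direct numerical maximisation of the log-likelihood. A minimal sketch, assuming NumPy and SciPy are available and using simulated data with an illustrative true rate:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
true_lambda = 2.5
x = rng.exponential(scale=1 / true_lambda, size=1000)   # simulated observations

def neg_log_likelihood(lam):
    # l(lambda) = n log(lambda) - lambda * sum(x_i); minimise its negative
    return -(len(x) * np.log(lam) - lam * x.sum())

lambda_analytic = 1 / x.mean()                           # lambda-hat = 1 / x-bar
lambda_numeric = minimize_scalar(neg_log_likelihood,
                                 bounds=(1e-6, 100), method="bounded").x
print(lambda_analytic, lambda_numeric)   # both close to each other (and to 2.5)
```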

    Example 2: Suppose that \(x_1, x_2, \dots, x_n\) are observed from a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Find the MLEs of \(\mu\) and \(\sigma\).
    Solution.
    The Likelihood function is $$ L(\mu, \sigma)=\prod_{i=1}^n{\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x_i-\mu)^2}{2 \sigma^2}}}=\frac{1}{\left(\sigma \sqrt{2 \pi}\right)^n} ~\displaystyle{\exp \left(-\frac{1}{2 \sigma^2} \sum_{i=1}^n{(x_i-\mu)^2}\right)}. $$ The log-likelihood function is $$ l(\mu, \sigma)=-n ~\mathrm{log} (\sigma \sqrt{2 \pi}) -\frac{1}{2 \sigma^2} \sum_{i=1}^n{(x_i-\mu)^2}. $$ Differentiate with respect to \(\mu\) and with respect to \(\sigma\) to get $$ \frac{\partial l}{ \partial \mu}= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i-\mu); $$ $$ \frac{\partial l}{ \partial \sigma}= \frac{-n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i-\mu)^2. $$ Equating these two derivatives to zero gives

    \( \hat{\mu}=\frac{1}{n}\sum_{i=1}^n{x_i}=\bar{x},\) \(\hat{\sigma}=\sqrt{\frac{1}{n}\sum_{i=1}^n{(x_i-\hat{\mu})^2}}.\) 

    The likelihood function achieves its maximum at the estimated parameters. 
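
    A minimal numerical illustration of these estimators, assuming NumPy and using a simulated sample with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=0.8, size=500)      # simulated sample

mu_hat = x.mean()                                 # MLE of mu: the sample mean
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # MLE of sigma: divides by n, not n - 1

print(mu_hat, sigma_hat)        # should be close to 1.5 and 0.8
print(x.std(ddof=0))            # identical to sigma_hat
```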

    Example 3: Suppose that \(x_1< K, x_2< K, \dots, x_n < K,\) \(x_{n+1}=x_{n+2}=\dots=x_{n+m}=K\) are observed as final marks of a re-sit exam, for which all the pass marks are capped at \(K\). Find the MLE of the parameters of the distribution of the original scores \(y_1, y_2, \dots, y_{n+m}\) before the marks are capped, given that $$ x_i= \left\{ \begin{array}{cl} y_i & : y_i< K\\ K & : y_i\geq K. \end{array} \right. $$ A mark of \(K\) indicates a score of at least \(K\).
    Solution.
    The Likelihood function is $$ L(\Theta)=\left(1-F(K) \right)^m \prod_{i=1}^n{f(x_i|\Theta=\hat{\Theta})}, $$ where \(F(x)=\mathbf{P}(\mathbf{X} \leq x)\) is the cumulative distribution function of the fitted distribution; each capped mark contributes a factor \(\mathbf{P}(y \geq K)=1-F(K)\), since it only tells us that the original score is at least \(K\).


    Numerical approximations. Sometimes it is not easy to evaluate the MLE analytically, and a numerical method such as the Newton-Raphson method can be used to approximate it. If \(F(K)\) is not known in closed form, it can be evaluated numerically by approximating the integral $$ F(K)=\int_{-\infty}^{K}f(x) \mathrm{d} x. $$
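
    As an illustration of Example 3, the censored likelihood can be maximised numerically. The sketch below assumes NumPy and SciPy, fits a Normal distribution to capped marks, and uses simulated data with an illustrative cap \(K\); `scipy.stats.norm.logsf` supplies \(\mathrm{log}(1-F(K))\).

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(3)
K = 40.0                                          # cap applied to pass marks
y = rng.normal(loc=45.0, scale=10.0, size=300)    # latent (uncapped) scores
x = np.minimum(y, K)                              # observed, capped marks

uncensored = x[x < K]                             # the x_i < K
m = int(np.sum(x >= K))                           # number of capped observations

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    # log L = m * log(1 - F(K)) + sum of log f(x_i) over uncensored marks
    log_l = (m * stats.norm.logsf(K, loc=mu, scale=sigma)
             + stats.norm.logpdf(uncensored, loc=mu, scale=sigma).sum())
    return -log_l

start = [uncensored.mean(), uncensored.std()]
result = minimize(neg_log_likelihood, x0=start, method="Nelder-Mead")
print(result.x)   # estimated (mu, sigma), expected to be close to (45, 10)
```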


    Graphs



    Figure 1: A fitted Normal distribution with parameters \(\mu=1.65\) and \(\sigma=1.17\) and a fitted Weibull distribution with parameters \(\lambda=0.411\) and \(\gamma=1.47\).


    Figure 2: A fitted four-parameter model using design and experiment based on only six points that are shown in blue.

    Figure 3: A fitted four-parameter model using design and experiment.




