MLE with Linear Regression

Darshan Solanki
Mar 3, 2021

Introduction

In this article, we will walk through what MLE is, why it is useful, and how one can derive the model parameters for Linear Regression with it. MLE stands for Maximum Likelihood Estimation; it is a generative approach that helps us figure out the model parameters which maximize the chance of observing the data we have already observed.

In basic terms, say we have data D = {(x₁,y₁), (x₂,y₂), (x₃,y₃), …, (xₙ,yₙ)} drawn from some distribution that is unknown to us. For example, an email writer has a magic die which he rolls m times; whichever word pops up on the face, he writes it in the email and sends it to us. Another example: you toss a coin n times and observe, say, H, H, T, T, T, T, T, T. You do not know what the distribution of heads and tails looks like given this sample.

Now, if we could somehow figure out the magic die's word distribution, or how this coin is producing the outcomes we observe, that would be great. Why? Because we could then feed an input X into the approximated distribution and get a likely y as the expected label, one that looks as if it had been drawn from the original P(X, y).

More formally, we need to find 𝛳, the model parameter that maximizes the likelihood of the observed data, P(D; 𝛳). Note: do not confuse this with P(D|𝛳). Whatever comes after the semicolon is a fixed model parameter, whereas P(D|𝛳) reads: given that we already have the parameters 𝛳, what is the probability of the data under them?

L(𝛳;D) and P(D;𝛳) are the same quantity. P(D;𝛳) can be read as the probability of observing the data from some distribution when we set the model parameters to 𝛳. L(𝛳;D) is the likelihood of 𝛳 taking certain values, given that we have already observed the data, i.e. how well those values describe the data. Hope you can convince yourself of this!
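
As a concrete example, take the coin tosses from above: 2 heads and 6 tails, with 𝛳 = P(Head). Treating the tosses as independent, L(𝛳;D) = P(D;𝛳) = 𝛳²(1 − 𝛳)⁶, which is just a function of 𝛳 that we can evaluate for any candidate value.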

Assumptions:

  • MLE does not work well when we have only a limited number of samples in hand. Looking at the coin-toss example, with just eight tosses one would not conclude that P(Head) equals P(Tail); but as we draw more and more samples, the estimate will eventually converge to P(Head) = P(Tail) if it is a fair coin.
  • Data samples are assumed to be independent and identically distributed (i.i.d.), meaning no data point depends on another when drawn. In reality that may not be true; in the email example, words are not written at random. But if we go with this assumption, the math becomes easy to work with, and it turns out to work well in real-world scenarios too.
  • We also make an initial assumption about the data distribution, i.e. where the data is sampled from. In the coin case, we say it is a binomial (binary) distribution; this is our belief, which may come from domain expertise or from doing EDA on the data.
  • Once we assume the distribution, we figure out the parameters of that distribution which maximize P(D;𝛳).
MLE formula: 𝛳̂ = argmax over 𝛳 of P(D;𝛳). With the i.i.d. assumption, P(D;𝛳) factorizes into the product of the per-sample probabilities ∏ᵢ P(xᵢ, yᵢ; 𝛳).

Basically, it means that if we plug in different values for 𝛳 we get different probabilities of observing the data D, so we pick the value of 𝛳 that gives the maximum probability.
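
To make this concrete, here is a minimal sketch in Python for the coin example above (2 heads, 6 tails): sweep candidate values of 𝛳 = P(Head), evaluate the likelihood of the observed tosses under each, and keep the maximizer. The grid and variable names are illustrative choices of mine.

```python
import numpy as np

# Observed tosses from the example above: H, H, T, T, T, T, T, T (1 = head, 0 = tail)
tosses = np.array([1, 1, 0, 0, 0, 0, 0, 0])

# Candidate values for theta = P(Head)
thetas = np.linspace(0.01, 0.99, 99)

# i.i.d. Bernoulli likelihood: theta^(#heads) * (1 - theta)^(#tails)
heads = tosses.sum()
tails = len(tosses) - heads
likelihoods = thetas**heads * (1 - thetas)**tails

theta_mle = thetas[np.argmax(likelihoods)]
print(theta_mle)  # ~0.25, the empirical fraction of heads
```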

Cool! Now let's work out how MLE helps us estimate the values of w (the model parameter) in Linear Regression. I have skipped the intercept term: we can either absorb the intercept by adding one extra dimension, w → (w, b) and x → (x, 1), so that wᵗx becomes wᵗx + b, or we can make our data zero-mean so that the line always passes through the origin.
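
As a small aside, here is one way to absorb the intercept in NumPy by appending a constant-1 feature. This is a minimal sketch, and the helper name is my own, not a standard API.

```python
import numpy as np

def add_bias_column(X):
    """Append a constant-1 feature so that w @ x_aug = w[:-1] @ x + w[-1] (the intercept b)."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(add_bias_column(X))
# [[1. 2. 1.]
#  [3. 4. 1.]]
```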

Linear Regression:

As you all know, Linear Regression is a simple model for predicting a continuous-valued label given certain features of a data point, after learning from a bunch of observations. It implicitly assumes that X and y have a linear relationship with each other. In other words, the label y is produced from X through an (approximately) linear relationship, with y itself coming from its own distribution around that line. The task at hand: find the parameter w for X such that we get an accurate prediction (y).

Now, we assume the data is generated around the ideal line wᵗx passing through the origin. For each x with d features, the label y is drawn from a Gaussian distribution. Confused?

Let’s walk through a visual representation:

In the ideal case, we would fit a line wᵗx such that every x value lies perfectly on that line and gives the true value of y. But in reality the data points are noisy: yᵢ = f(xᵢ) + 𝜖. We need to model what values the error 𝜖 takes so that the expected y stays close to the ideal y. With our assumption, we say 𝜖 is drawn from a Gaussian distribution with mean 0 and some variance 𝞼², as shown in the figure above. Or, put differently: y is drawn from a Gaussian distribution with mean wᵗx and variance 𝞼².
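
To see what this noise model looks like in code, here is a minimal simulation sketch; the true weights, noise level, and sample sizes are illustrative values I picked, not anything from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 3
w_true = np.array([2.0, -1.0, 0.5])  # illustrative "ideal line" parameters
sigma = 0.3                          # noise standard deviation

X = rng.normal(size=(n, d))
eps = rng.normal(loc=0.0, scale=sigma, size=n)  # eps ~ N(0, sigma^2)
y = X @ w_true + eps                            # equivalently, y ~ N(w_true^T x, sigma^2)
```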

Finally, we have reached the derivation, whose goal is to find: argmax over w of P(x, y | w). [Note: we have made the assumption of a linear model, which is why we write P(x, y | w).]

Credit: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html

In the derivation (from the credited lecture note), the product of probabilities is turned into a sum of log probabilities, because log is a monotonically increasing function: maximizing P is equivalent to maximizing log(P). The final expression is the squared-error loss that Linear Regression uses as its measure of goodness! We are not done yet; to estimate the value of w, take the derivative of this expression, set it to 0, and solve for the closed form of w. That's it!
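
To sketch that last step: with the Gaussian assumption, maximizing the log-likelihood reduces to minimizing ∑ᵢ (wᵗxᵢ − yᵢ)², and setting the gradient to zero gives the well-known closed form w = (XᵗX)⁻¹Xᵗy. Below is a minimal NumPy check of this; the data-generating numbers are illustrative values of my own.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative noisy data: y = X w_true + eps, eps ~ N(0, sigma^2)
n, d = 500, 3
w_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=0.2, size=n)

# MLE / least-squares closed form: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)  # close to w_true, roughly [1.5, -2.0, 0.7]
```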

Conclusion:

  • MLE finds the data distribution by estimating the best parameter choice after model selection.
  • It works well when the data sample is large enough. MLE is a generative approach: we do not directly predict P(y|X) but instead learn the underlying distribution.
  • We got the Least Squares Error loss for Linear Regression with MLE because we assumed Gaussian noise, and from it we obtained the closed form of w.
