Suppose you are the ML expert among a team of healthcare professionals. The project they are working on is to determine the life expectancy of patients. This would help them in downstream tasks. One of these healthcare professionals suggests that you can use Linear Regression. Is this approach appropriate here?

For this answer, it is clear that Linear Regression would be appropriate since we are predicting a continuous value (i.e. the age a person is expected to live to).

Strong Candidate:

To confirm that linear regression really is appropriate, the data must satisfy these four assumptions (a quick diagnostic sketch in Python follows the list):

1. Linearity: the relationship between the independent variables and the dependent variable must be linear. For example, the left diagram below shows no linear relationship between \(x\) and \(y\), whereas the plot on the right shows a clear linear relationship. Note that an equation like \(y = \beta_0 + \beta_1 \log(x) + \beta_2 \cos(x)\) also satisfies this assumption, since we can rewrite it as \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\) where \(x_1 = \log(x)\) and \(x_2 = \cos(x)\), which has the form of a standard linear equation. In other words, as long as the model is linear in the coefficients, linearity is satisfied.

2. Homoscedasticity: the residuals (errors) of the model must have constant variance across the range of predictor values. In the diagram below, the image on the left shows residuals with constant variance (homoscedasticity), while the image on the right shows heteroscedasticity: as the fitted values increase, the residuals spread out more, which is a classic sign of heteroscedasticity.

3. Independence: the observations (and therefore the residuals) must be independent of one another. In practice this also means the independent variables should not be highly correlated with each other, i.e. there should be no severe multicollinearity.

4. Normality: for any fixed value of the independent variables, the residuals should follow a normal distribution, i.e. they cluster symmetrically around the mean in the shape of a bell curve (a Gaussian).
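As a quick way to check these assumptions in practice, the sketch below fits an OLS model and runs one standard diagnostic per assumption: a residuals-vs-fitted check for linearity and homoscedasticity, the Breusch-Pagan test for heteroscedasticity, variance inflation factors for multicollinearity, and the Shapiro-Wilk test for normality of the residuals. It is a minimal sketch on synthetic data (not the hospital's dataset), assuming statsmodels and scipy are available.

```python
# Minimal sketch on synthetic data: one quick diagnostic per assumption.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # hypothetical predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=1.0, size=200)

exog = sm.add_constant(X)                          # design matrix with intercept
model = sm.OLS(y, exog).fit()
residuals, fitted = model.resid, model.fittedvalues

# 1. Linearity / 2. Homoscedasticity: plot residuals vs. fitted values and look
#    for no pattern and a roughly constant spread (e.g. plt.scatter(fitted, residuals)).

# 2. Homoscedasticity: Breusch-Pagan test; a small p-value suggests heteroscedasticity.
_, bp_pvalue, _, _ = het_breuschpagan(residuals, exog)

# 3. Independence (multicollinearity side): variance inflation factors; VIF >> 10 is a red flag.
vifs = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]

# 4. Normality of residuals: Shapiro-Wilk test (a Q-Q plot works too).
_, shapiro_pvalue = stats.shapiro(residuals)

print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")
print(f"VIFs: {np.round(vifs, 2)}")
print(f"Shapiro-Wilk p-value: {shapiro_pvalue:.3f}")
```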

Follow Up 1: Suppose the features you get access to are: year a person was born, BMI, country at birth, units of alcohol consumed per week, and nationality at birth. However, there is a problem with these features. What is it?

Follow Up 2: What is the difference between correlation and covariance?

Follow Up 3: Which metric would you choose for this problem? Are there alternatives?

Follow Up 4: What are the two methods to solve Linear Regression?

Follow Up 5: You mentioned that we can solve Linear Regression with optimization techniques. Should we use Gradient Descent or Stochastic Gradient Descent?

Your colleague is building a polynomial regression model with thousands of features. She gets a training RMSE of 0.2 and a test RMSE of 100.5. Why is there this discrepancy?

Most overfitting questions you see online use accuracy as the metric, so the training accuracy is very high (usually close to 100%) while the test accuracy is poor (~60%). This question uses RMSE as the metric, so a smaller value indicates better performance.

Since the training RMSE is very low (i.e. very accurate on the training data) while the test RMSE is very high, the model is overfitting. In overfitting, the model essentially “memorizes” the training data, so it is almost perfect on the training set, but this comes at the price of generalization: the model is only good at predicting the training data and performs poorly on any data that differs from it. Here, the polynomial model is likely too complex for this dataset (high-degree polynomial terms across thousands of features).
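To make the symptom concrete, here is a minimal sketch on synthetic data (the degrees, noise level, and sample size are made-up for illustration and are not your colleague's actual setup) showing how raising the polynomial degree drives the training RMSE toward zero while the test RMSE blows up:

```python
# Overfitting demo: compare train vs. test RMSE for a modest and a high polynomial degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)     # noisy non-linear signal

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_rmse = mean_squared_error(y_train, model.predict(x_train)) ** 0.5
    test_rmse = mean_squared_error(y_test, model.predict(x_test)) ** 0.5
    print(f"degree={degree:2d}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```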

Below are two tables to help you remember when a model is overfitting, depending on the metric the interviewer gives you.

Error metrics, where lower is better (e.g. RMSE, MSE, MAE):

| Training error | Test error | Diagnosis |
| --- | --- | --- |
| Low | Low | Good fit |
| Low | High | Overfitting |
| High | High | Underfitting |

Score metrics, where higher is better (e.g. accuracy, R²):

| Training score | Test score | Diagnosis |
| --- | --- | --- |
| High | High | Good fit |
| High | Low | Overfitting |
| Low | Low | Underfitting |

Strong Candidate:

To prevent this from happening, we can take the following actions:

  • Regularization: reduces overfitting by penalizing large coefficients, which encourages the model to generalize (see the Ridge + cross-validation sketch after this list).
  • Train with more data: with more data the model can better detect the true signal, and it reduces the chance of the model memorizing the training data.
  • Cross-Validation: split the data into k groups and let one group be the validation set while the others are used to train the model (rotating which group serves as the validation set).
  • Reduce the number of features: we can use feature selection methods – filter-based (e.g. chi-square), wrapper-based (e.g. Recursive Feature Elimination), or embedded (e.g. Lasso regularization).
  • Change the model, e.g. to ensemble learning techniques: by combining multiple weak models instead of relying on a single model, we hope to capture the signal in the data better and therefore generalize to unseen data.
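As a concrete illustration of the first and third remedies, here is a minimal sketch on synthetic data (the data, polynomial degree, and alpha values are assumptions for illustration, not your colleague's pipeline) that combines Ridge regularization with 5-fold cross-validation; stronger regularization (larger alpha) should bring the cross-validated RMSE of the over-complex polynomial model back down:

```python
# Regularization + cross-validation sketch: Ridge on high-degree polynomial features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

for alpha in (1e-3, 1.0, 100.0):                 # tiny alpha ~ almost no regularization
    model = make_pipeline(PolynomialFeatures(degree=15),
                          StandardScaler(),
                          Ridge(alpha=alpha))
    # 5-fold cross-validation; sklearn reports error metrics as negative scores.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"alpha={alpha:7.3f}  mean CV RMSE={-scores.mean():.3f}")
```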

Follow Up 1: One way to prevent overfitting is regularization as you have mentioned. What are good regularization methods that can be applied to your colleague’s model?

Follow Up 2: What if we wanted to use L1 and L2 regularization together in your colleague’s model?

Follow Up 3: Now that we have a model that is performing well, how can we interpret the coefficients?

Follow Up 4: After speaking with your colleague, she mentions that she needs to think about the bias-variance tradeoff. Why?