
How to interpret logistic regression coefficients?


Logistic Regression is a fairly simple yet powerful Machine Learning model that can be applied to various use cases. It’s been widely explained and applied, and yet, I haven’t seen many correct and simple interpretations of the model itself. Let’s crack that now.

I won’t dive into the details of what Logistic Regression is, where it can be applied, how to measure the model error, etc. There’s already been lots of good writing about it. This post will specifically tackle the interpretation of its coefficients, in a simple, intuitive manner, without introducing unnecessary terminology.

Interpreting Linear Regression Coefficients

Let’s first start with a Linear Regression model, to ensure we fully understand its coefficients. This will be a building block for interpreting Logistic Regression later.

Here’s a Linear Regression model, with 2 predictor variables and outcome Y:

Y = a + bX₁ + cX₂ (Equation *)

Let’s pick a coefficient at random, say b, and assume that b > 0. Interpreting b is simple: a 1-unit increase in X₁ will increase Y by b units, provided all other variables remain fixed (this condition is important to keep in mind). Note that if b < 0, a 1-unit increase in X₁ will decrease Y by |b| units.

As an example, let’s consider the following model that predicts the house price based on 2 input variables: square footage and age. Please note that the model is “fake”, i.e. I made up the numbers just to illustrate the example.

house_price = a + 50,000 * square_footage − 20,000 * age

If we increase the square footage by 1 square foot, the house price will increase by $50,000. If we increase the age of the house by 1 year, the house price will decrease by $20,000, and each additional year of age decreases the price by a further $20,000 (again, holding the other variable fixed).
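As a quick sanity check, here is a minimal Python sketch of this made-up model. The intercept value of 100,000 is an arbitrary assumption (the example above never specifies a); only the two slope coefficients come from the example.

```python
# Minimal sketch of the made-up linear model; the intercept (100,000) is an
# arbitrary assumption, the slope coefficients come from the example above.
def house_price(square_footage, age, a=100_000):
    return a + 50_000 * square_footage - 20_000 * age

base = house_price(square_footage=20, age=10)
one_more_foot = house_price(square_footage=21, age=10)   # +1 square foot, age fixed
one_more_year = house_price(square_footage=20, age=11)   # +1 year, footage fixed

print(one_more_foot - base)   # 50000  -> price rises by the coefficient of square_footage
print(one_more_year - base)   # -20000 -> price falls by the coefficient of age
```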

Interpreting Logistic Regression Coefficients

Here’s what a Logistic Regression model looks like:

logit(p) = a + bX₁ + cX₂ (Equation **)

You’ll notice it’s slightly different from a linear model. Let’s clarify each bit of it. logit(p) is just shorthand for log(p / (1 − p)), where p = P{Y = 1}, i.e. the probability of “success”, or the presence of the outcome. X₁ and X₂ are the predictor variables, and b and c are their corresponding coefficients, each of which determines how much weight X₁ and X₂ have on the final outcome Y (or p). Last, a is simply the intercept.

We can still use the old logic and say that a 1-unit increase in, say, X₁ will result in an increase of b in logit(p).

But now we have to dive deeper into the statement “a 1-unit increase in X₁ will result in an increase of b in logit(p)”. The first portion is clear, but we can’t really make sense of “an increase of b in logit(p)”. What does that even mean?

To understand this, let’s first unwrap logit(p). As mentioned before, logit(p) = log(p / (1 − p)), where p is the probability that Y = 1. Y can take two values, either 0 or 1, and P{Y=1} is called the probability of success. Hence logit(p) = log(P{Y=1}/P{Y=0}). This is called the log-odds ratio.

Demystifying the log-odds ratio

We arrived at this interesting term log(P{Y=1}/P{Y=0}), a.k.a. the log-odds ratio. So now back to the coefficient interpretation: a 1-unit increase in X₁ will result in an increase of b in the log-odds ratio of success to failure.

OK, that makes more sense. But let’s fully clarify this new terminology. Let’s start from odds ratios, and then we’ll expand to log-odds ratios.

You’ve probably heard of odds before — e.g. the odds of winning a casino game. People often mistakenly believe that odds and probabilities are the same thing. They’re not. Here’s an example:

The probability of getting a 4 when throwing a fair 6-sided die is 1/6, or ~16.7%. On the other hand, the odds of getting a 4 are 1:5, or 20%. This is equal to p/(1 − p) = (1/6)/(5/6) = 20%. So, the odds represent the ratio of the probability of success to the probability of failure. Switching from odds to probabilities and vice versa is fairly simple.
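A tiny sketch of the dice example makes the distinction concrete:

```python
from math import log

p = 1 / 6                # probability of rolling a 4 with a fair die
odds = p / (1 - p)       # 0.2, i.e. odds of 1:5
log_odds = log(odds)     # the log-odds, which is what logit(p) returns

print(round(p, 3), round(odds, 3), round(log_odds, 3))   # 0.167 0.2 -1.609
```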

Now, the log-odds ratio is simply the logarithm of the odds ratio. The logarithm is introduced because it maps the odds ratio, which can only take values in (0, ∞), onto the whole real line, shrinking extremely large values of P{Y=1}/P{Y=0} and treating values above and below 1 symmetrically. Also, the logarithmic function is monotonically increasing, so it preserves the ordering of the original numbers.

That being said, a 1-unit increase in X₁ increases the log-odds ratio log(P{Y=1}/P{Y=0}) by b > 0, which increases the odds ratio itself (since the log is a monotonically increasing function), and this means that P{Y=1} gets a bigger share of the 100% probability pie. In other words, if we increase X₁, the odds of Y=1 against Y=0 will increase, making Y=1 more likely than it was before the increase.

The interpretation is similar when b < 0. I’ll leave it up to you to interpret this, to make sure you fully understand this game of numbers.

Phew, that was a lot!

But bear with me — let’s look at another “fake” example to ensure you grasped these concepts.

logit(p) = 0.5 + 0.13 * study_hours + 0.97 * female

In the model above, b = 0.13, c = 0.97, and p = P{Y=1} is the probability of passing a math exam. Let’s pick study_hours and see how it impacts the chances of passing the exam. Increasing the study hours by 1 unit (1 hour) results in a 0.13 increase in logit(p), i.e. in log(p / (1 − p)). Now, if log(p / (1 − p)) increases by 0.13, then p / (1 − p) is multiplied by exp(0.13) ≈ 1.14. This is a 14% increase in the odds of passing the exam (assuming that the variable female remains fixed).

Let’s also interpret the impact of being a female on passing the exam. We know that exp(0.97) ≈ 2.64, so the odds of passing the exam are 2.64 times as high, i.e. 164% higher, for women than for men (holding study_hours fixed).
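To make this concrete, a quick sketch that exponentiates the made-up coefficients:

```python
from math import exp

b, c = 0.13, 0.97   # made-up coefficients of study_hours and female

print(round(exp(b), 2))   # 1.14: one extra study hour multiplies the odds of passing by ~1.14
print(round(exp(c), 2))   # 2.64: being female multiplies the odds of passing by ~2.64
```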

The logistic regression coefficient β associated with a predictor X is the expected change in the log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from one level to the next) multiplies the odds of having the outcome by e^β.

Here’s an example:

Suppose we want to study the effect of Smoking on the 10-year risk of Heart disease. The table below shows the summary of a logistic regression that models the presence of heart disease using smoking as a predictor:

             Coefficient   Standard Error   p-value
Intercept    -1.93         0.13             < 0.001
Smoking       0.38         0.17             0.03
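For readers who want to see where such a table comes from, here is a minimal sketch using statsmodels. The data are simulated and the column names heart_disease and smoking are hypothetical; a real analysis would use the actual study data instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated, hypothetical data just to make the sketch runnable;
# the real study DataFrame would replace this.
rng = np.random.default_rng(0)
smoking = rng.integers(0, 2, size=500)                    # 0: non-smoker, 1: smoker
p = 1 / (1 + np.exp(-(-1.93 + 0.38 * smoking)))           # probabilities implied by the table
heart_disease = rng.binomial(1, p)
df = pd.DataFrame({"heart_disease": heart_disease, "smoking": smoking})

# Fit the logistic regression and inspect the output
model = smf.logit("heart_disease ~ smoking", data=df).fit()
print(model.summary())            # coefficients, standard errors, p-values
print(np.exp(model.params))       # exponentiated coefficients = odds ratios
print(np.exp(model.conf_int()))   # 95% confidence intervals on the odds-ratio scale
```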

The question is: How to interpret the coefficient of smoking: β = 0.38?

First notice that this coefficient is statistically significant (associated with a p-value < 0.05), so our model suggests that smoking does in fact influence the 10-year risk of heart disease. And because it is a positive number, we can say that smoking increases the risk of having heart disease.

But by how much?

1. If smoking is a binary variable (0: non-smoker, 1: smoker):

Then e^β = e^0.38 = 1.46 is the odds ratio that associates smoking with the risk of heart disease.

This means that:

The smoking group has 1.46 times the odds of having heart disease compared to the non-smoking group.

Alternatively, we can say that:

The smoking group has 46% (1.46 – 1 = 0.46) more odds of having heart disease than the non-smoking group.

And if heart disease is a rare outcome, then the odds ratio becomes a good approximation of the relative risk. In this case, we can say that:

Smoking multiplies the probability of having heart disease by 1.46 compared to not smoking.

Alternatively, we can say that:

There is a 46% greater relative risk of having heart disease in the smoking group compared to the non-smoking group.

Note for negative coefficients:
If β = −0.38, then e^β = e^−0.38 ≈ 0.68 and the interpretation becomes: smoking is associated with a 32% (1 − 0.68 = 0.32) reduction in the relative risk of heart disease.

How to interpret the standard error?

The standard error is a measure of uncertainty of the logistic regression coefficient. It is useful for calculating the p-value and the confidence interval for the corresponding coefficient.

From the table above, we have SE = 0.17.

We can calculate the 95% confidence interval using the following formula:

95% confidence interval for the odds ratio = exp(β ± 2 × SE) = exp(0.38 ± 2 × 0.17) = [1.04, 2.05]

So we can say that:

We are 95% confident that smokers have between 4% and 105% higher odds of having heart disease than non-smokers (1.04 − 1 = 0.04 and 2.05 − 1 = 1.05).

Or, more loosely we say that:

Based on our data, we can expect an increase of between 4% and 105% in the odds of heart disease for smokers compared to non-smokers.
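As a quick numeric check of that interval (a one-line sketch):

```python
import numpy as np

beta, se = 0.38, 0.17
ci = np.exp([beta - 2 * se, beta + 2 * se])   # confidence interval on the odds-ratio scale
print(ci.round(2))                            # [1.04 2.05]
```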

Interpret the Logistic Regression Intercept

Here’s the equation of a logistic regression model with 1 predictor X:

log(P / (1 − P)) = β0 + β1·X

Where P is the probability of having the outcome and P / (1-P) is the odds of the outcome.

The easiest way to interpret the intercept is when X = 0:

When X = 0, the intercept β0 is the log of the odds of having the outcome.

From log odds to probability

Because the concept of odds and log odds is difficult to understand, we can solve for P to find the relationship between the probability of having the outcome and the intercept β0.

To solve for the probability P, we exponentiate both sides of the equation above and rearrange to get:

P = e^(β0 + β1·X) / (1 + e^(β0 + β1·X))

With this equation, we can calculate the probability P for any given value of X, but when X = 0 the interpretation becomes simpler:

When X = 0, the probability of having the outcome is P = e^β0 / (1 + e^β0).

Without even calculating this probability, if we only look at the sign of the intercept, we can say the following (the short sketch after this list checks each case numerically):

  • If the intercept has a negative sign: then the probability of having the outcome will be < 0.5.
  • If the intercept has a positive sign: then the probability of having the outcome will be > 0.5.
  • If the intercept is equal to zero: then the probability of having the outcome will be exactly 0.5.
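Here is that sketch, a minimal inverse-logit helper evaluated at a negative, zero and positive intercept (the values −1, 0 and 1 are arbitrary illustrations):

```python
from math import exp

def inverse_logit(x):
    """Convert a log odds value into a probability."""
    return exp(x) / (1 + exp(x))

print(inverse_logit(-1.0))  # ~0.27 -> negative intercept: probability < 0.5
print(inverse_logit(0.0))   # 0.5   -> zero intercept: probability exactly 0.5
print(inverse_logit(1.0))   # ~0.73 -> positive intercept: probability > 0.5
```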

Let’s illustrate this with an example

Let’s return to our smoking and heart disease example. The table below shows the summary of the logistic regression that models the presence of heart disease using smoking as a predictor:

             Coefficient   Standard Error   p-value
Intercept    -1.93         0.13             < 0.001
Smoking       0.38         0.17             0.03

So our objective is to interpret the intercept β0 = -1.93.

Using the equation above and assuming a value of 0 for smoking:

P = e^β0 / (1 + e^β0) = e^−1.93 / (1 + e^−1.93) ≈ 0.13

But what does it mean to set the variable smoking = 0?

1. If smoking is a continuous variable (annual tobacco consumption in Kilograms)

In this context, smoking = 0 means that we are talking about a group that has an annual usage of tobacco of 0 Kg, i.e. non-smokers.

So the interpretation becomes:

The probability that a non-smoker will have heart disease in the next 10 years is 0.13.

1.1. What if smoking was a standardized variable?

A standardized variable is a variable rescaled to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean and dividing by the standard deviation for each value of the variable. The goal is to force predictors to be on the same scale so that their effects on the outcome can be compared just by looking at their coefficients.

In this case, smoking = 0 corresponds to the mean annual consumption of tobacco in Kg, and the interpretation becomes:

For an average consumer of tobacco, the probability of having heart disease in the next 10 years is 0.13.

1.2. What if all subjects in our study were smokers?

Then setting the Smoking variable equal to 0 does not make sense anymore. Since the non-smoking group is not represented in the data, we cannot expect our results to generalize to this specific group.

In this case, it makes sense to evaluate the intercept at a value of smoking different from 0. For instance, we can take the minimum, maximum or mean of the variable Smoking as a reference point.

Let’s pick the maximum as a reference point and calculate an upper bound on how much smoking can affect the risk of heart disease in our sample.

Suppose that in our sample the largest amount of tobacco smoked in a year was 3 Kg, then:

P = e^(β0 + β1·X) / (1 + e^(β0 + β1·X)), where X = 3 Kg

Replacing the numbers, we get P = 0.31.
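As a quick check of that number in Python:

```python
from math import exp

b0, b1, x = -1.93, 0.38, 3.0        # intercept, smoking coefficient, max consumption in Kg
log_odds = b0 + b1 * x
p = exp(log_odds) / (1 + exp(log_odds))
print(round(p, 2))                  # 0.31
```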

The interpretation becomes:

The maximum annual tobacco consumption of 3 kg is associated with a 31% risk of having heart disease in the next 10 years.

2. If smoking is a binary variable (0: non-smoker, 1: smoker)

Then, setting smoking = 0 (non-smoker), the calculation is the same as above:

P = e^β0 / (1 + e^β0) = e^−1.93 / (1 + e^−1.93) ≈ 0.13

And the interpretation also stays the same:

The probability that a non-smoker will have heart disease in the next 10 years is 0.13.

Note: if smoking were measured on a scale from 1 to 10 (with no zero), we could evaluate the probability at one of these values using the equation above, since X = 0 would not be a meaningful reference point.

In short, the intercept β0 = −1.93 is interpreted assuming a value of 0 for all the predictors in the model, and applying the inverse logit, e^β0 / (1 + e^β0) = e^−1.93 / (1 + e^−1.93) ≈ 0.13, turns it directly into a probability: the probability that a non-smoker will have heart disease in the next 10 years is 0.13.

2. If smoking is a numerical variable (lifetime usage of tobacco in kilograms)

Then, returning to the coefficient β = 0.38, e^β (= e^0.38 = 1.46) tells us how much the odds of the outcome (heart disease) will change for each 1-unit change in the predictor (smoking).

Therefore:

An increase of 1 Kg in lifetime tobacco usage multiplies the odds of heart disease by 1.46.

Or equally:

An increase of 1 Kg in lifetime tobacco usage is associated with an increase of 46% in the odds of heart disease.
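One way to convince yourself that the effect is multiplicative on the odds scale is this tiny sketch: whatever the starting consumption, adding 1 Kg multiplies the odds by e^β (the starting values 2, 3, 7 and 8 Kg below are arbitrary):

```python
from math import exp

b0, b1 = -1.93, 0.38   # intercept and smoking coefficient from the table above

def odds(x):
    """Odds of heart disease at lifetime tobacco usage x (Kg)."""
    return exp(b0 + b1 * x)

print(odds(3) / odds(2))   # ~1.46
print(odds(8) / odds(7))   # ~1.46, same ratio regardless of the starting point
print(exp(b1))             # ~1.46 = e^beta
```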

Interpreting the coefficient of a standardized variable

A standardized variable is a variable rescaled to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean and dividing by the standard deviation for each value of the variable.
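Concretely, standardizing a variable looks like this minimal sketch (the usage values are made up):

```python
import numpy as np

smoking = np.array([0.0, 0.5, 1.2, 3.0, 7.5])   # hypothetical lifetime usage in Kg

# Subtract the mean and divide by the standard deviation
smoking_std = (smoking - smoking.mean()) / smoking.std()

print(round(smoking_std.mean(), 2))   # 0.0
print(round(smoking_std.std(), 2))    # 1.0
```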

Standardization makes regression coefficients comparable, although comparability can still suffer when the variables in the model follow very different distributions (for more information, I recommend these articles: standardized versus unstandardized regression coefficients and how to assess variable importance in linear and logistic regression).

Anyway, standardization is useful when you have more than 1 predictor in your model, each measured on a different scale, and your goal is to compare the effect of each on the outcome.

After standardization, the predictor Xi that has the largest coefficient (in absolute value) is the one that has the most important effect on the outcome Y.

However, the standardized coefficient does not have an intuitive interpretation on its own. So in our example above, if smoking was a standardized variable, the interpretation becomes:

An increase of 1 standard deviation in smoking is associated with a 46% (e^β = 1.46) increase in the odds of heart disease.

3. If smoking is an ordinal variable (0: non-smoker, 1: light smoker, 2: moderate smoker, 3: heavy smoker)

Sometimes it makes sense to divide smoking into several ordered categories. This categorization allows the 10-year risk of heart disease to change from one category to the next, and forces it to stay constant within each category instead of fluctuating with every small change in the smoking habit.

In this case, the coefficient β = 0.38 is again used to calculate e^β (= e^0.38 = 1.46), which can be interpreted as follows:

Going up from 1 level of smoking to the next multiplies the odds of heart disease by 1.46.

Alternatively, we can say that:

Going up from 1 level of smoking to the next is associated with an increase of 46% in the odds of heart disease.

Important Notes:

About statistical significance and p-values:

If you include 20 predictors in the model, 1 on average will have a statistically significant p-value (p < 0.05) just by chance.

So be wary of:

  • including or excluding variables from your logistic regression model based solely on p-values.
  • labeling effects as “real” just because their p-values were less than 0.05.

What if you get a very large logistic regression coefficient?

In our example above, a very large coefficient and standard error can occur if, for instance, the large majority of participants in our sample were non-smokers. This is because highly skewed predictors are more likely to produce a logistic model with perfect separation.

Therefore, some variability in the independent variable X is required in order to study its effect on the outcome Y. So make sure you understand your data well enough before modeling them.

Model interpretation has increasingly become an important aspect of Machine Learning & Data Science. Understanding what the model does and how it makes predictions is crucial in the model building & evaluation process. Now that you have a better understanding of how Logistic Regression works, you’ll be able to better understand the models that you build!

References:

https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf
