In regression, an interaction effect exists when the effect of an independent variable on a dependent variable changes, depending on the value(s) of one or more other independent variables.
In a regression equation, an interaction effect is represented as the product of two or more independent variables. For example, here is a typical regression equation without an interaction:
ŷ = b0 + b1X1 + b2X2
where ŷ is the predicted value of a dependent variable, X1 and X2 are independent variables, and b0, b1, and b2 are regression coefficients.
And here is the same regression equation with an interaction:
ŷ = b0 + b1X1 + b2X2 + b3X1X2
Here, b3 is a regression coefficient, and X1X2 is the interaction. The interaction between X1 and X2 is called a two-way interaction because it is the interaction between two independent variables. Higher-order interactions are possible, as illustrated by the three-way interaction in the following equation:
ŷ = b0 + b1X1 + b2X2 + b3X3 + b4X1X2 + b5X1X3 + b6X2X3 + b7X1X2X3
Analysts usually steer clear of higher-order interactions, like X1X2X3, since they can be hard to interpret.
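As a rough illustration of the three-way equation above, the sketch below builds each interaction term as a plain product of predictor columns and fits the model by least squares. The data and the coefficient values are invented for this example, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical, noise-free data: three predictors drawn at random.
rng = np.random.default_rng(0)
X1, X2, X3 = rng.uniform(0, 5, size=(3, 50))

# Made-up "true" coefficients b0..b7, in the order used by the equation above.
true_b = np.array([1.0, 2.0, -1.0, 0.5, 0.3, -0.2, 0.1, 0.05])

# Design matrix: intercept, main effects, two-way products, three-way product.
design = np.column_stack([
    np.ones_like(X1),
    X1, X2, X3,
    X1 * X2, X1 * X3, X2 * X3,
    X1 * X2 * X3,
])
y = design @ true_b

# With no noise in y, least squares recovers the coefficients.
b_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(b_hat, 3))
```

The point of the sketch is only that an interaction term is an ordinary column in the design matrix; with real, noisy data the estimates would differ from the true values.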
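An interaction plot is a line graph that reveals the presence or absence of interactions among independent variables. To create an interaction plot, put the dependent variable on the Y-axis and one independent variable on the X-axis, compute the mean of the dependent variable at each level of that independent variable separately for each level of the second independent variable, and connect the means for each level of the second variable with its own line.
To understand potential interaction effects, compare the lines from the interaction plot: parallel lines suggest no interaction, while lines that are not parallel suggest a possible interaction.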
For example, suppose researchers develop a drug to treat anxiety. The dependent variable is anxiety (plotted on the Y-axis). The independent variable is the drug dose (plotted on the X-axis). Researchers might hypothesize an interaction effect, based on gender. To visualize the potential interaction, they would plot the mean anxiety score by gender for each dose and connect the means with lines, as shown below:
In the plot above, the lines are parallel. This suggests no interaction effect, based on gender. The drug has the same effect on men as on women. For both men and women, 1 mg of the drug lowers anxiety level by 0.2 units.
Suppose, however, the interaction plot looked like this:
Here, the lines are not parallel. The line for women is steeper. This suggests a possible interaction effect, based on gender. The plot tells us that the drug reduces anxiety more effectively for women than for men. But is the reduction significant? To answer that question, we need to conduct a statistical test.
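Comparing the lines of an interaction plot amounts to comparing their slopes. The sketch below does that numerically for two groups. The mean anxiety scores are hypothetical values chosen for illustration, not the data behind the plots.

```python
# Hypothetical mean anxiety scores at each dose (mg); not the article's data.
doses = [0, 1, 2, 3]
mean_anxiety = {
    "men":   [5.0, 4.8, 4.6, 4.4],
    "women": [5.0, 4.5, 4.0, 3.5],
}

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# One slope per group; equal slopes correspond to parallel lines.
slopes = {group: slope(doses, ys) for group, ys in mean_anxiety.items()}
slope_gap = slopes["women"] - slopes["men"]
print(slopes, slope_gap)
```

A slope gap near zero corresponds to parallel lines. A non-zero gap only suggests an interaction; whether it is statistically significant still requires a test.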
In this section, we work through two problems to compare regression analysis with and without interaction terms. With each problem, the goal is to examine the effects of drug dosage and gender on anxiety levels.
To conduct the analyses, we will use the following data from eight subjects:
In the table, notice that we’ve expressed gender as a dummy variable, where 1 represents females and 0 represents non-females (in this case, males). Notice also that the variable in the fourth column (DG) is an interaction term, with a value equal to the product of dose and gender.
First, let’s ignore the interaction term. When we regress anxiety on dose and gender, we get the following regression table.
We see that both dose and gender are statistically significant at the 0.05 level. And, with further analysis, we find that the coefficient of multiple determination is a respectable 0.80.
Now, let’s include the interaction term in our analysis. When we regress anxiety on dose, gender, and the dose-gender interaction, we get the following regression table.
We see that the interaction between dose and gender is statistically significant at the 0.001 level. When we examine the main effects, we see that the dose is statistically significant, but gender isn’t. And finally, the coefficient of multiple determination is 0.99.
Typically, when a regression equation includes an interaction term, the first question you ask is: Does the interaction term contribute in a meaningful way to the explanatory power of the equation? You can answer that question by checking whether the interaction term is statistically significant, and by comparing the coefficient of determination with and without the interaction term.
If the interaction term is statistically significant, the interaction term is probably important. And if the coefficient of determination is also much bigger with the interaction term, the case for keeping it is even stronger. If neither of these outcomes is observed, the interaction term can be removed from the regression equation.
Results from our sample problem are summarized in the table below:
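| Analytical output | Without interaction | With interaction |
|---|---|---|
| R2 | 0.80 | 0.99 |
| p-value of the interaction term | Not applicable | 0.000 |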
The interaction term is statistically significant (reported as p = 0.000, that is, p < 0.001), and R2 is much bigger with the interaction term than without it (0.99 versus 0.80). Therefore, we conclude for this problem that the interaction term contributes in a meaningful way to the predictive ability of the regression equation.
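The comparison of R2 with and without the interaction term can be sketched in a few lines. The eight observations below are hypothetical stand-ins, since the data table is not reproduced here, so the printed values will not match the 0.80 and 0.99 reported above. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data for eight subjects (NOT the article's table).
dose    = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
gender  = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # 1 = female, 0 = male
anxiety = np.array([9.0, 8.8, 8.5, 8.4, 8.9, 7.1, 5.0, 3.2])

def r_squared(columns, y):
    """Fit OLS with an intercept and return the coefficient of determination."""
    X = np.column_stack([np.ones_like(y), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

# Main effects only, then main effects plus the dose-gender product (DG).
r2_main = r_squared([dose, gender], anxiety)
r2_inter = r_squared([dose, gender, dose * gender], anxiety)
print(round(r2_main, 3), round(r2_inter, 3))
```

Because the second model contains the first, its R2 can never be lower; the question is whether the increase is large enough to matter.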
When the interaction term is statistically significant, there’s good news and bad news.
If your goal is to understand the relative importance of individual predictors, that goal will be harder to achieve when interaction effects are significant. When an interaction effect exists, the effect of one independent variable depends on the value(s) of one or more other independent variables.
For example, consider the interaction plot for our sample problem.
For males, drug dosage has a minimal effect on anxiety; but for females, the effect is dramatic. The effect of drug dose cannot be understood without accounting for the gender of the person receiving the medication.
Bottom line: When an interaction effect is significant, do not try to interpret the importance of the main effects in isolation.
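In a regression model, consider including the interaction between 2 variables when:
1. They have large main effects.
2. You suspect that the effect of one changes with the value of the other.
3. The interaction has been reported in previous studies.
4. You want to explore interactions suggested by the data.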
Below we will explore each of these points in detail, but first, let’s start with why we need to study interactions in the first place.
A model without interactions assumes that the effect of each predictor on the outcome is independent of other predictors in the model.
We say that 2 variables interact when one influences the effect of the other. In this case, their main effects (the separate effect of each of them on the outcome) should no longer be considered in isolation as it doesn’t make sense anymore to interpret the effect of one while holding the other constant.
So a linear regression equation should be changed from:
Y = β0 + β1X1 + β2X2 + ε
to:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
And if the interaction term is statistically significant (associated with a p-value < 0.05), then:
β3 can be interpreted as the change in the effect of X1 on Y for each 1-unit increase in X2 (and vice versa).
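One way to see this interpretation, using the same symbols as the equation with the interaction term, is to group the terms that multiply X1. This is a short derivation sketch rather than anything specific to a particular data set:

```latex
\begin{align*}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon \\
  &= \beta_0 + (\beta_1 + \beta_3 X_2)\, X_1 + \beta_2 X_2 + \varepsilon \\[4pt]
\frac{\partial\, \mathrm{E}[Y]}{\partial X_1} &= \beta_1 + \beta_3 X_2,
\qquad
\frac{\partial\, \mathrm{E}[Y]}{\partial X_2} = \beta_2 + \beta_3 X_1
\end{align*}
```

The slope of X1 is therefore β1 only when X2 = 0, and it shifts by β3 for every 1-unit increase in X2; the same holds with the roles of X1 and X2 swapped.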
When you include an interaction between 2 independent variables X1 and X2, DO NOT remove the main effects of the variables X1 and X2 from the model, even if their p-values are larger than 0.05 (i.e. their effects are not statistically significant).
An interaction can also occur between 3 or more variables making the situation much more complex, but for practical purposes and in most real-world situations, you won’t have to deal with such complexities.
So we’ve established that when an interaction exists between 2 variables, you wouldn’t want to miss out on it. On the other hand, including all possible interactions for all predictors in your model will make it both uninterpretable and statistically flawed (see below).
Next, we discuss how to choose which interaction terms to include in your regression model.
Consider including an interaction term between 2 variables:
Variables that have a large influence on the outcome are more likely to have a statistically significant interaction with other factors that influence this outcome.
Because alcohol is known to be an important factor that increases the risk of liver cirrhosis, when studying the effect of other factors on cirrhosis, many studies consider the interaction between them and alcohol. [Chevillotte et al., Corrao et al., Stroffolini et al.]
This change can be:
In general, you should study the interaction between 2 variables whenever you suspect that a change in one variable will increase (or decrease) the effectiveness of another one in the model.
Here are a few signs that a variable has an influence on the effect of another one:
Gastric bypass surgery is beneficial for extremely obese individuals (BMI > 40) and not for those who are just overweight (25 < BMI < 30). So when studying the effect of this type of surgery on the risk of mortality, including its interaction with BMI is a reasonable decision:
Mortality = Surgery + BMI + Surgery × BMI
A literature review will help you spot important, and sometimes less intuitive interactions.
A meta-analysis showed that cigarette smoking interacts with hepatitis B and C infections on the risk of liver cancer. So whenever we want to study the effect of chronic hepatitis on liver cancer, we should include smoking as a main effect and as an interaction with hepatitis infections:
Liver Cancer = Smoking + Hepatitis + Smoking × Hepatitis
So far we discussed how to choose which interactions to include in your model BEFORE even looking at the data. And this is a good thing because, in general, it is always better to develop your hypotheses before looking at your data in order to avoid multiple testing — which will inflate the risk of having false-positive results.
Here’s why multiple testing is bad:
For a statistical significance threshold of 5% (i.e. in cases where we consider results with p-values < 0.05 statistically significant), if you test 20 random interactions, none of which has a real effect, then on average 1 of them will have a statistically significant coefficient JUST BY CHANCE. Therefore it would be wrong, from a statistical point of view, to test all possible interactions for all predictors in the model.
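The arithmetic behind this warning is short enough to check directly. The sketch below assumes that all 20 tested interactions have no true effect, and, for the second quantity only, that the tests are independent.

```python
alpha = 0.05   # significance threshold
n_tests = 20   # interactions tested, all assumed to have no true effect

# Expected number of false positives (holds whether or not the tests are independent).
expected_false_positives = alpha * n_tests

# Probability that at least one test is falsely significant (assumes independent tests).
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(expected_false_positives, round(p_at_least_one, 3))
```

Under these assumptions, the chance of seeing at least one spurious "significant" interaction is well above one half, which is why testing every possible interaction one by one is discouraged.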
However, choosing which interactions to include based only on theory is limited by our intuition and our understanding of the problem which can be very narrow in some cases.
So here are 3 options to select interactions based on data while avoiding multiple testing:
With this option, you test all the interactions at once with a single global test (based on the Wald statistic). [see Regression Modeling Strategies by Frank Harrell]
Here, no matter how many predictors you have, the number of tests to run is 1.
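A single global test can be sketched as a comparison of two nested models: one with main effects only and one with every pairwise interaction added. Note that this sketch swaps in the nested-model F statistic, which for ordinary least squares is closely related to the Wald-based test mentioned above; it is not taken from the cited book. The data are synthetic and NumPy is assumed to be available.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 200, 3

# Synthetic predictors and an outcome with one true interaction (X0 * X1).
X = rng.normal(size=(n, p))
y = 1 + X @ np.array([1.0, -0.5, 0.25]) + 0.8 * X[:, 0] * X[:, 1] + rng.normal(size=n)

def rss(design, y):
    """Residual sum of squares from an OLS fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

# Reduced model: intercept + main effects. Full model: plus all pairwise products.
main = np.column_stack([np.ones(n), X])
pairs = [X[:, i] * X[:, j] for i, j in combinations(range(p), 2)]
full = np.column_stack([main, *pairs])

q = len(pairs)                  # number of interaction terms tested jointly
df_resid = n - full.shape[1]    # residual degrees of freedom of the full model
f_stat = ((rss(main, y) - rss(full, y)) / q) / (rss(full, y) / df_resid)
print(q, df_resid, round(f_stat, 2))
# The p-value is the upper tail of an F(q, df_resid) distribution at f_stat,
# for example scipy.stats.f.sf(f_stat, q, df_resid) if SciPy is installed.
```

However many interaction columns are added, this procedure yields one statistic and one p-value, which is what keeps the number of tests at 1.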
For each predictor worth consideration, test all its interactions with a single test.
For p predictors, the number of tests to run is p.
The last option is to run a statistical test for each possible interaction alone for all variables in the model. But, in order to avoid the multiple testing problem, you can:
In this case, for p predictors, the number of tests to run is p(p-1)/2.
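The count p(p-1)/2 can be checked by enumerating the pairs. The sketch below also shows a Bonferroni-adjusted threshold as one common way to correct for that many tests; the choice of Bonferroni and the predictor names are assumptions made for this illustration.

```python
from itertools import combinations

def n_pairwise_interactions(p):
    """Number of two-way interactions among p predictors: p(p-1)/2."""
    return p * (p - 1) // 2

# Hypothetical predictor names.
predictors = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
pairs = list(combinations(predictors, 2))
n_tests = n_pairwise_interactions(len(predictors))

# Bonferroni: divide the overall threshold by the number of tests.
alpha = 0.05
bonferroni_alpha = alpha / n_tests
print(len(pairs), n_tests, bonferroni_alpha)
```

With 7 predictors there are already 21 pairwise interactions, so each individual test would need a much smaller p-value than 0.05 to count as significant under this correction.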
Don’t include an interaction between 2 variables just because they are correlated:
Variables that are correlated with each other don’t have a higher chance of interacting with each other in a model. Interaction means that the effect of one on the outcome will depend on the other, while correlation only means that the 2 variables tend to vary together in a linear fashion. The latter says nothing about the effect of each on the outcome Y. [Source: Clinical Prediction Models – by Ewout Steyerberg]
If you included interactions based on theory (according to points 1, 2, or 3 above), i.e. if you can explain why these terms were included in your model, then only report the results of the model with interactions.
However, if you included interactions based on statistical testing, i.e. you tried a few and kept those that were statistically significant, then you have to report 2 models. The first includes the main effects only, and the second includes the predictors with interactions. Then compare the R2 of the model with interactions with the R2 of the model without interactions to see if the interactions helped explain more variability of Y.
The interaction effect is present in statistics as well as in marketing. In marketing, this same concept is referred to as the synergy effect. An interaction effect means that two or more features/variables combined have a significantly larger effect on a feature than the sum of the effects of the individual variables alone. This effect is important to understand in regression as we try to study the effect of several variables on a single response variable.
A linear regression equation can be expressed as follows:
Y = β₀ + β₁X₁ + β₂X₂ + ε
Here, we try to find the linear relation between the independent variables (X₁ and X₂) and the response variable Y, with ε being the irreducible error. To check whether there is any statistically significant relation between the predictor and response variables, we conduct hypothesis testing. If we conduct this test for the predictor variable X₁, we will have two hypotheses:
Null hypothesis (H₀): There is no relationship between X₁ and Y (β₁ = 0)
Alternative hypothesis (H₁): There is a relationship between X₁ and Y (β₁ ≠ 0)
We then decide whether or not to reject the null hypothesis based on the p-value. The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true.
For example, if we get a non-zero value of β₁ in our test results, this suggests that there is a relationship between X₁ and Y. But if the p-value is large, a non-zero estimate of this size could easily arise even when the null hypothesis is actually true. In such a case, we fail to reject the null hypothesis: the data do not provide enough evidence of a relationship between the predictor and response variable (which is not the same as showing that no relationship exists). But if the p-value is low (the cutoff is generally taken to be 0.05), then even a small non-zero value of β₁ indicates a statistically significant relationship between the predictor and response variable.
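To make the meaning of a p-value concrete, the sketch below approximates one by simulation. It swaps in a permutation test for the usual t-test on the coefficient: shuffling Y breaks any link with X₁, so the shuffled slopes show what the null hypothesis would produce. The data are synthetic, the true slope of 2 is invented, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with a real linear relationship (true slope 2, made up for this sketch).
x = np.arange(30, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=x.size)

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)

observed = ols_slope(x, y)

# Slopes obtained when y is shuffled, i.e. when H0 (no relationship) holds by construction.
n_perm = 2000
perm_slopes = np.array([ols_slope(x, rng.permutation(y)) for _ in range(n_perm)])

# Two-sided p-value: share of shuffled slopes at least as extreme as the observed one.
# The +1 terms keep the estimate from being exactly 0.
p_value = (np.sum(np.abs(perm_slopes) >= abs(observed)) + 1) / (n_perm + 1)
print(round(observed, 2), p_value)
```

A small p-value here says that slopes as large as the observed one are rare when there is no relationship, which is the sense in which the null hypothesis is rejected.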
If we conclude that there is a relationship between X₁ and Y, we consider that for each unit increase of X₁, Y increases/decreases by β₁ units. In the linear equation above, we assume that the effect of X₁ on Y is independent of X₂. This is also called the additive assumption in linear regression.
But what if the effect of X₁ on Y also depends on X₂? We can see such relations in many business problems. Consider, for example, that we want to find out the return on investment (ROI) for two different investment types. The linear regression equation for this example will be:
ROI = β₀ + β₁(investment1) + β₂(investment2) + ε
In this example, there is a possibility that there would be greater profit if we invest in both types of investments partially rather than investing in one completely. For example, if we have 1000 units of money to invest, investing 500 units in each of the two investments can lead to greater profit than investing all 1000 units in either one of the investment types. In such a case, investment1’s relation with ROI will be dependent on investment2. This relation can be included in our equation as follows:
ROI = β₀ + β₁(investment1) + β₂(investment2) + β₃(investment1 × investment2) + ε
In the equation above, we have included the “interaction” between investment1 and investment2 for the prediction of total return on investment. We can include such interactions in any linear regression equation.
In general form, the above equation can be written as:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ε
Here, β₃ is the coefficient of the interaction term. Again, to verify the presence of an interaction effect in regression, we conduct a hypothesis test and check the p-value for our coefficient (in this case β₃).
Now let us see how we can verify the presence of an interaction effect in a data set. We will be using the Auto data set as our example. Let us have a look at the data set:
import pandas as pd

data = pd.read_csv('data/auto-mpg.csv')

Converting the data set to numeric and filling in the missing values:

# removing irrelevant 'car name' column
data.drop('car name', axis=1, inplace=True)

# converting all columns to numeric
for col in data.columns:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# replacing missing values in horsepower with its median
horse_med = data['horsepower'].median()
data['horsepower'] = data['horsepower'].fillna(horse_med)
Let us fit an OLS (Ordinary Least Squares) model on this data set. This model is present in the statsmodels library.

from statsmodels.regression import linear_model

X = data.drop('mpg', axis=1)
y = data['mpg']
# OLS does not add an intercept automatically; this fit has none
model = linear_model.OLS(y, X).fit()
From this model, we can get the coefficient values and also if they are statistically significant to be included in the model.
Below is the snapshot of the model summary.
In the above model summary, we can see that, except for acceleration, all features have a p-value less than 0.05 and are statistically significant. Even if acceleration on its own is not helpful in the prediction of mpg, we are interested in finding out whether acceleration has an effect on mpg through its interaction with other variables. We are also interested in identifying all significant interaction terms.
We first need to create all possible interaction terms. This is possible in Python by using PolynomialFeatures from the scikit-learn library:
from sklearn.preprocessing import PolynomialFeatures

# generating interaction terms
x_interaction = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(X)

# creating a new dataframe with the interaction terms included
interaction_df = pd.DataFrame(x_interaction, columns=[
    'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin',
    'cylinders:displacement', 'cylinders:horsepower', 'cylinders:weight', 'cylinders:acceleration',
    'cylinders:year', 'cylinders:origin', 'displacement:horsepower', 'displacement:weight',
    'displacement:acceleration', 'displacement:year', 'displacement:origin', 'horsepower:weight',
    'horsepower:acceleration', 'horsepower:year', 'horsepower:origin', 'weight:acceleration',
    'weight:year', 'weight:origin', 'acceleration:year', 'acceleration:origin', 'year:origin',
])
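The hand-written list of 28 column names can also be generated rather than typed. The sketch below assumes that, for degree 2 with interaction_only=True and include_bias=False, the transformed columns are the original features followed by the pairwise products in itertools.combinations order. This matches the list used in the example, but it is worth confirming against the transformer's own get_feature_names_out() output before relying on it.

```python
from itertools import combinations

# Column names as used in the Auto example.
features = ["cylinders", "displacement", "horsepower", "weight",
            "acceleration", "year", "origin"]

# Main effects first, then every pair of features joined with ":".
interaction_names = features + [f"{a}:{b}" for a, b in combinations(features, 2)]
print(len(interaction_names))
print(interaction_names[7:10])
```

Generating the names this way avoids typos and keeps the list in step with the predictors if a column is added or dropped.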
As the new dataframe is created which includes the interaction terms, we can fit a new model to it and see which interaction terms are significant.
interaction_model = linear_model.OLS(y, interaction_df).fit()
Now we need only those interaction terms which are statistically significant (having a p-value less than 0.05):
interaction_model.pvalues[interaction_model.pvalues < 0.05]
The output shows that several interaction terms are statistically significant. Also, acceleration alone is not significant, but its interactions with horsepower and year prove to be very important for the prediction of mpg.
It is important to note that in the example above, the p-value of acceleration is high, but acceleration appears in significant interaction terms. In such a case, the hierarchy principle requires us to keep the main effect of acceleration in the model, i.e. the coefficient of acceleration, even though it is not statistically significant. The hierarchy principle states that if there are two features X₁ and X₂ in an interaction term, we have to include both of their coefficients (β₁ and β₂) in the model even when the p-values associated with them are very high.
Adding interaction terms to a regression model has real benefits. It greatly expands your understanding of the relationships among the variables in the model. And you can test more specific hypotheses. But interpreting interactions in regression takes an understanding of what each coefficient is telling you.
The example from Interpreting Regression Coefficients was a model of the height of a shrub (Height) based on the amount of bacteria in the soil (Bacteria) and whether the shrub is located in partial or full sun (Sun). Height is measured in cm, Bacteria is measured in thousand per ml of soil, and Sun = 0 if the plant is in partial sun, and Sun = 1 if the plant is in full sun.
The regression equation was estimated as follows:
Height = 42 + 2.3*Bacteria + 11*Sun
It would be useful to add an interaction term to the model if we wanted to test the hypothesis that the relationship between the number of bacteria in the soil and the height of the shrub is different in full sun than in partial sun.
One possibility is that in full sun, plants with more bacteria in the soil tend to be taller, but in partial sun, plants with more bacteria in the soil are shorter. Another is that plants with more bacteria in the soil tend to be taller in both full and partial sun, but the relationship is much more dramatic in full sun than in partial sun. The presence of an interaction indicates that the effect of one predictor variable on the response variable is different at different values of the other predictor variable. Adding a term to the model in which the two predictor variables are multiplied tests this. The regression equation will look like this:
Height = B0 + B1*Bacteria + B2*Sun + B3*Bacteria*Sun
Adding an interaction term to a model drastically changes the interpretation of all the coefficients. Without an interaction term, we interpret B1 as the unique effect of Bacteria on Height.
But the interaction means that the effect of Bacteria on Height is different for different values of Sun. So the unique effect of Bacteria on Height is not limited to B1. It also depends on the values of B3 and Sun. The unique effect of Bacteria is represented by everything that is multiplied by Bacteria in the model: B1 + B3*Sun. B1 is now interpreted as the unique effect of Bacteria on Height only when Sun = 0.
In our example, once we add the interaction term, our model looks like:
Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun
Adding the interaction term changed the values of B1 and B2. The effect of Bacteria on Height is now 4.2 + 3.2*Sun. For plants in partial sun, Sun = 0, so the effect of Bacteria is 4.2 + 3.2*0 = 4.2. So for two plants in partial sun, we expect a plant with 1000 more bacteria/ml in the soil to be 4.2 cm taller than a plant with fewer bacteria.
For plants in full sun, however, the effect of Bacteria is 4.2 + 3.2*1 = 7.4. So for two plants in full sun, a plant with 1000 more bacteria/ml in the soil would be expected to be 7.4 cm taller than a plant with fewer bacteria.
Because of the interaction, the effect of having more bacteria in the soil is different if a plant is in full or partial sun. Another way of saying this is that the slopes of the regression lines between height and bacteria count are different for the different categories of sun. B3 indicates how different those slopes are.
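The two slopes can be reproduced directly from the fitted equation. The sketch below uses the coefficients quoted in the text; the function names and the Bacteria values of 2 and 3 are introduced only for this illustration.

```python
# Coefficients from the fitted model in the text.
B0, B1, B2, B3 = 35.0, 4.2, 9.0, 3.2

def bacteria_effect(sun):
    """Change in predicted Height (cm) per extra 1000 bacteria/ml at a given Sun value."""
    return B1 + B3 * sun

def predicted_height(bacteria, sun):
    """Predicted Height (cm) from the model with the interaction term."""
    return B0 + B1 * bacteria + B2 * sun + B3 * bacteria * sun

for sun in (0, 1):
    # Difference between two plants that differ by one unit of Bacteria.
    diff = predicted_height(3, sun) - predicted_height(2, sun)
    print(sun, bacteria_effect(sun), round(diff, 6))
```

For each value of Sun, the slope formula and the difference between two predictions agree, which is just another way of saying that B3 is the gap between the two regression slopes.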
Interpreting B2 is more difficult. B2 is the effect of Sun when Bacteria = 0. Since Bacteria is a continuous variable, it is unlikely that it equals 0 often, if ever. So B2 can be virtually meaningless by itself.
Instead, it is more useful to understand the effect of Sun, but again, this can be difficult. The effect of Sun is B2 + B3*Bacteria, which is different at every one of the infinite values of Bacteria. For that reason, often the only way to get an intuitive understanding of the effect of Sun is to plug a few values of Bacteria into the equation to see how Height, the response variable, changes.
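Plugging in a few values is easy to do in code. The sketch below evaluates the effect of Sun, B2 + B3*Bacteria, using the coefficients from the fitted model; the Bacteria levels are arbitrary choices for illustration.

```python
B2, B3 = 9.0, 3.2  # Sun and Bacteria*Sun coefficients from the fitted model in the text

def sun_effect(bacteria):
    """Predicted Height difference (cm), full sun minus partial sun, at a Bacteria level."""
    return B2 + B3 * bacteria

# Bacteria levels (thousand per ml) chosen arbitrarily for illustration.
for bacteria in (0, 1, 5, 10):
    print(bacteria, round(sun_effect(bacteria), 1))
```

The difference between full and partial sun grows with the bacteria count, so no single number summarizes the effect of Sun in this model.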