## A trap when selecting features

## Feature Selection Checklist

## The Problem The Feature Selection Solves

## Is this Dimensionality Reduction?

## Dataset In Action

## Feature Selection Methods

### Filter Methods

### Wrapper Methods

### Embedded Methods

## I. Filter Methods

## Statistics for Filter-Based Feature Selection Methods

### Numerical Input, Numerical Output

### Numerical Input, Categorical Output

### Categorical Input, Numerical Output

### Categorical Input, Categorical Output

## Tips and Tricks for Feature Selection

### Correlation Statistics

### Selection Method

### Transform Variables

### What Is the Best Method?

### Worked Examples of Feature Selection

### Regression Feature Selection: *(Numerical Input, Numerical Output)*

### Classification Feature Selection: *(Numerical Input, Categorical Output)*

### Classification Feature Selection: *(Categorical Input, Categorical Output)*

### 1) Missing Values Ratio

### 2) Variance Threshold

### 3) Correlation coefficient

### 4) Chi-Square Test of Independence *(for categorical data)*

### 5) Mutual Information *(for both regression & classification)*

### 6) Analysis of Variance (ANOVA)

## II. Wrapper Methods

### 1) Sequential Feature Selection

### 2) Recursive Feature Elimination (RFE)

## III. Embedded Methods

### 1) L1 ( LASSO) Regularization

### 2) Tree Model (for Regression and Classification)

## More real-world feature selection examples on Pima Indians dataset

### 1. Univariate Selection

### 2. Recursive Feature Elimination

### 3. Principal Component Analysis

### 4. Feature Importance

## End Notes

Consider working with high-dimensional data, such as data coming from IoT sensors or healthcare, with hundreds to thousands of features: it is tough to figure out which subset of features will yield a good, lasting model. **Feature selection** is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables both to reduce the computational cost of modeling and, in some cases, to improve the performance of the model. Statistical feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics, and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measure depends on the data types of both the input and output variables. As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget. It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.

… should do feature selection on a different dataset than you train [your predictive model] on … the effect of not doing this is you will overfit your training data.

Ben Allison in answer to “Is using the same data for feature selection and cross-validation biased or not?”

For example, you must include feature selection within the inner loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.

If we adopt the proper procedure, and perform feature selection in each fold, there is no longer any information about the held out cases in the choice of features used in that fold.

Dikran Marsupial in answer to “Feature selection for final model when performing cross-validation in machine learning”

The reason is that the decisions made to select the features were made on the entire training set and are, in turn, passed on to the model. This may cause a model that is enhanced by the selected features to get seemingly better results than the other models being tested, when in fact the results are biased.

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.

Dikran Marsupial in answer to “Feature selection and cross-validation”
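To make this concrete, here is a minimal sketch (on synthetic data, not the automobile dataset used later) of keeping feature selection inside the cross-validation loop with a scikit-learn Pipeline, so each fold selects its features from that fold's training data only:

```python
# Feature selection inside CV: the selector is re-fit on each training
# fold by the Pipeline, so held-out folds never influence the choice.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=1)

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('model', LogisticRegression()),
])

# cross_val_score fits the whole pipeline per fold, so selection happens
# on the prepared fold right before the model is trained.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The mistake described above would be calling `fit_transform` on all of `X` first and cross-validating only the model; the pipeline version avoids that leak.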

Isabelle Guyon and Andre Elisseeff, the authors of “An Introduction to Variable and Feature Selection” (PDF), provide an excellent checklist that you can use the next time you need to select data features for your predictive modeling problem.

I have reproduced the salient parts of the checklist here:

1. **Do you have domain knowledge?** If yes, construct a better set of ad hoc features.
2. **Are your features commensurate?** If not, consider normalizing them.
3. **Do you suspect interdependence of features?** If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow.
4. **Do you need to prune the input variables (e.g. for cost, speed, or data understanding reasons)?** If no, construct disjunctive features or weighted sums of features.
5. **Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)?** If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. **Do you need a predictor?** If no, stop.
7. **Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)?** If yes, detect the outlier examples using the top-ranking variables obtained in step 5 as representation; check and/or discard them.
8. **Do you know what to try first?** If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method for comparison. Following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. **Do you have new ideas, time, computational resources, and enough examples?** If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection, and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
10. **Do you want a stable solution (to improve performance and/or understanding)?** If yes, subsample your data and redo your analysis for several “bootstraps”.

This article is all about feature selection and implementation of its techniques using scikit-learn on the automobile dataset. You can find the jupyter notebook for this tutorial on Github.

Feature selection methods aid you in your mission to create an accurate predictive model. They help you by choosing features that will give you as good or better accuracy while requiring less data. Feature selection methods can be used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of a predictive model, or may in fact decrease it. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

Guyon and Elisseeff in “An Introduction to Variable and Feature Selection” (PDF)

A raw dataset often comes with many irrelevant features that do not contribute much to the accuracy of your predictive model. Think of it through a music analogy: music engineers employ various techniques to tune their tracks so that there is no unwanted noise and the voice is crisp and clear. Similarly, datasets contain noise, and it’s crucial to remove it for better model optimization. That’s where feature selection comes into the picture!

Now, keeping model accuracy aside, theoretically, **feature selection**:

- *reduces overfitting* — ‘The Curse of Dimensionality’: if your dataset **(X)** has more features/columns than samples, the model will be prone to overfitting. By removing irrelevant data/noise, the model gets to focus on essential features, leading to better generalization.
- *simplifies models* — dimensionality adds many layers to a model, making it needlessly complicated. Overengineering is fun, but it may not be better than its simpler counterparts. Simpler models are easier to interpret and debug.
- *reduces training time* — fewer features/dimensions reduce computation time, speeding up model training.

Keep in mind that the size of these benefits depends heavily on the problem, but at a minimum you will end up with a simpler, more focused model.

Feature selection and dimensionality reduction are often used interchangeably, owing to their similar goal of reducing the number of features in a dataset. However, there is an important difference between them. Feature selection yields a subset of the original features that best represents the data, keeping those features intact. Dimensionality reduction, on the other hand, introduces a new feature space in which the original features are represented: it transforms the feature space to a lower dimension by combining or excluding features. To sum up, you can consider feature selection a special case of dimensionality reduction.
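The difference is easy to see in code. In this minimal sketch on synthetic data (not the article's automobile dataset), `SelectKBest` returns a subset of the original columns, while PCA returns new axes built from combinations of all columns:

```python
# Feature selection keeps original columns; PCA builds new ones.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=100, n_features=20,
                       n_informative=5, random_state=0)

X_sel = SelectKBest(f_regression, k=5).fit_transform(X, y)  # 5 original columns
X_pca = PCA(n_components=5).fit_transform(X)                # 5 combined components

print(X_sel.shape, X_pca.shape)  # both (100, 5), but different feature spaces
```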

We will be using the automobile dataset from the UCI Machine Learning repository. The dataset contains information on car specifications, its insurance risk rating, and its normalized losses in use as compared to other cars. The goal of the model would be to predict the ‘price’. As a regression problem, it comprises a good mix of continuous and categorical variables, as shown below:

After considerable preprocessing of around 200 samples with 26 attributes each, I managed to get the value of R squared as 0.85. Since our focus is on assessing feature selection techniques, we won’t go deep into the modeling process. **Now, let’s try to improve the model by feature selection!**

There are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods.

Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable. Some examples of filter methods include the Chi-squared test, information gain, and correlation coefficient scores.

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. The search process may be methodical, such as a best-first search; it may be stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward and backward passes to add and remove features. An example of a wrapper method is the recursive feature elimination algorithm.

Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection is regularization-based methods. Regularization methods are also called penalization methods that introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients). Examples of regularization algorithms are the LASSO, Elastic Net, and Ridge Regression.
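As a small illustration of the embedded idea, here is a hedged sketch on synthetic data: a LASSO model is fit, its L1 penalty drives uninformative coefficients to zero, and scikit-learn's `SelectFromModel` keeps the surviving columns.

```python
# Embedded selection: L1 regularization zeroes out weak coefficients,
# and SelectFromModel keeps only the features with non-zero weight.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=30,
                       n_informative=5, noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)  # only columns with non-zero coefficients remain
```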

We can summarize feature selection as follows.

- **Feature Selection**: Select a subset of input features from the dataset.
  - **Unsupervised**: Do not use the target variable (e.g. remove redundant variables).
    - Correlation
  - **Supervised**: Use the target variable (e.g. remove irrelevant variables).
    - **Wrapper**: Search for well-performing subsets of features.
      - RFE
    - **Filter**: Select subsets of features based on their relationship with the target.
      - Statistical Methods
      - Feature Importance Methods
    - **Intrinsic**: Algorithms that perform automatic feature selection during training.
      - Decision Trees
- **Dimensionality Reduction**: Project input data into a lower-dimensional feature space.

The image below provides a summary of this hierarchy of feature selection techniques.

With filter methods, we primarily apply a statistical measure that suits our data to assign **each feature column** a calculated score. Based on that score, it is decided whether the feature will be kept in or removed from our predictive model. These methods are computationally inexpensive and are best for eliminating redundant, irrelevant features. However, one downside is that they don’t take feature correlations into consideration, since they work independently on each feature.

It is common to use correlation-type statistical measures between input and output variables as the basis for filter feature selection. As such, the choice of statistical measures is highly dependent upon the variable data types. Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating-point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:

- **Numerical Variables**
  - Integer Variables.
  - Floating Point Variables.
- **Categorical Variables**
  - Boolean Variables (dichotomous).
  - Ordinal Variables.
  - Nominal Variables.

The more that is known about the data type of a variable, the easier it is to choose an appropriate statistical measure for a filter-based feature selection method. In this section, we will consider two broad categories of variable types: numerical and categorical; also, the two main groups of variables to consider: input and output.

Input variables are those that are provided as input to a model. In feature selection, it is this group of variables that we wish to reduce in size. Output variables are those that a model is intended to predict, often called the response variable.

The type of response variable typically indicates the type of predictive modeling problem being performed. For example, a numerical output variable indicates a regression predictive modeling problem, and a categorical output variable indicates a classification predictive modeling problem.

- **Numerical Output**: Regression predictive modeling problem.
- **Categorical Output**: Classification predictive modeling problem.

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.

Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

Page 499, Applied Predictive Modeling, 2013.

With this framework, let’s review some univariate statistical measures that can be used for filter-based feature selection.

This is a regression predictive modeling problem with numerical input variables. The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

- Pearson’s correlation coefficient (linear).
- Spearman’s rank coefficient (nonlinear)

This is a classification predictive modeling problem with numerical input variables.

This might be the most common example of a classification problem.

Again, the most common techniques are correlation-based, although in this case, they must take the categorical target into account.

- ANOVA correlation coefficient (linear).
- Kendall’s rank coefficient (nonlinear).

Kendall does assume that the categorical variable is ordinal.

This is a regression predictive modeling problem with categorical input variables.

This is a strange example of a regression problem (i.e. you would not encounter it often).

Nevertheless, you can use the same “*Numerical Input, Categorical Output*” methods (described above), but in reverse.

This is a classification predictive modeling problem with categorical input variables.

The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.

- Chi-Squared test (contingency tables).
- Mutual Information.

In fact, mutual information is a powerful method that may prove useful for both categorical and numerical data, i.e. it is agnostic to the data types.

This section provides some additional considerations when using filter-based feature selection.

The scikit-learn library provides an implementation of most of the useful statistical measures.

For example:

- Pearson’s Correlation Coefficient: f_regression()
- ANOVA: f_classif()
- Chi-Squared: chi2()
- Mutual Information: mutual_info_classif() and mutual_info_regression()

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).

The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.

Two of the more popular methods include:

- Select the top k variables: SelectKBest
- Select the top percentile variables: SelectPercentile

I often use *SelectKBest* myself.

Consider transforming the variables in order to access different statistical methods. For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out. You can also make a numerical variable discrete (e.g. by binning) and try categorical-based measures. Some statistical measures assume properties of the variables, such as Pearson’s, which assumes a Gaussian probability distribution of the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of the expectations and compare the results.
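As one hedged sketch of the discretization idea on synthetic data: bin a numerical input with `KBinsDiscretizer`, then score the binned version with a categorical measure such as chi-squared.

```python
# Discretize numerical inputs into ordinal bins, then apply a
# categorical-style measure (chi-squared) to the binned columns.
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(100, 3))  # numerical inputs
y = (X[:, 0] > 5.0).astype(int)                    # target driven by column 0

binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)                 # non-negative bin indices

scores, pvalues = chi2(X_binned, y)
print(scores)  # the informative first column should score highest
```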

There is no best feature selection method. Just like there is no best set of input variables or best machine learning algorithm. At least not universally. Instead, you must discover what works best for your specific problem using careful systematic experimentation. Try a range of different models fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.
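In that spirit, a minimal sketch of the experiment loop on synthetic data: cross-validate the same model on subsets chosen by two different statistical measures and let the scores decide.

```python
# Compare feature selection measures by cross-validating the same model
# on the subsets each measure picks.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=25,
                           n_informative=5, random_state=2)

results = {}
for name, score_func in [('anova', f_classif), ('mutual_info', mutual_info_classif)]:
    pipe = Pipeline([
        ('select', SelectKBest(score_func=score_func, k=5)),
        ('model', LogisticRegression()),
    ])
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()

print(results)  # pick whichever measure cross-validates best
```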

This section provides worked examples of feature selection cases that you can use as a starting point.


This section demonstrates feature selection for a regression problem with numerical inputs and numerical outputs. A test regression problem is prepared using the make_regression() function. Feature selection is performed using Pearson’s Correlation Coefficient via the f_regression() function.

```
# pearson's correlation feature selection for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
# generate dataset
X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# define feature selection
fs = SelectKBest(score_func=f_regression, k=10)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Running the example first creates the regression dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

```
(100, 10)
```


This section demonstrates feature selection for a classification problem with numerical inputs and categorical outputs. A test classification problem is prepared using the make_classification() function. Feature selection is performed using the ANOVA F measure via the f_classif() function.

```
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
# define feature selection
fs = SelectKBest(score_func=f_classif, k=2)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Running the example first creates the classification dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

```
(100, 2)
```


For examples of feature selection with categorical inputs and categorical outputs, see this tutorial.

We start with the **univariate filter methods**.

Data columns with too many missing values won’t be of much value. Theoretically, 25–30% is the acceptable threshold of missing values, beyond which we should drop those features from the analysis. If you have domain knowledge, it’s always better to make an educated guess about whether the feature is crucial to the model; in such a case, try imputing the missing values using various techniques listed here. To get missing-value percentages per feature, try this one-liner! Adding a jupyter notebook for each technique was cumbersome, so I’ve added the output alongside the code as GitHub gists, using the same automobile dataset.

```
print((df_train.isnull().sum()/len(df_train)*100).nlargest())
#output => Returns the 5 largest values from the series. No missing values in automobile dataset, so all shows 0%.
"""symboling 0.0
doornumber 0.0
wheelbase 0.0
carlength 0.0
carwidth 0.0
dtype: float64"""
```

Features in which a single identical value occupies all of the samples are said to have zero variance. Such features carry little information, will not affect the target variable, and can be dropped. You can adjust the threshold value; the default is 0, i.e. remove the features that have the same value in all samples. For quasi-constant features, which have the same value for a very large subset of samples, use a threshold such as 0.01. In other words, drop a column where 99% of the values are similar.

```
from sklearn.feature_selection import VarianceThreshold
print(df_train.shape) #output (143, 59)
var_filter = VarianceThreshold(threshold = 0.0)
train = var_filter.fit_transform(df_train)
#to get the count of features that are not constant
print(train.shape) # output (143, 56)
#or
print(len(df_train.columns[var_filter.get_support()])) #output 56
```

Two independent features (X) are highly correlated if they have a strong relationship with each other and move in a similar direction. In that case, you don’t need two similar features in the model if one suffices. Correlation essentially considers the fitted line, its slope, and the quality of the fit. There are various approaches to calculating correlation coefficients, and if a pair of columns crosses a certain threshold, the one that shows the higher correlation with the target variable (y) is kept and the other is dropped.

**Pearson correlation** *(for continuous data)* is a parametric statistical test that measures the similarity between two variables. Confused by the term parametric? It means the test assumes that the observed data follows some distribution pattern (e.g. normal/Gaussian). Its coefficient value **r** ranges between **-1** (negative correlation) and **1** (positive correlation), indicating how well the data fits the model. It also returns a **p-value** to determine whether the correlation between variables is significant, by comparing it to a significance level alpha (α). If the p-value is less than α, the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation coefficient does not equal zero.

**Spearman rank correlation coefficient** *(for continuous + ordinal data)* is a non-parametric statistical test that works similarly to Pearson’s; however, it does not make any assumptions about the distribution of the data. Denoted by the symbol rho (-1 < **ρ** < 1), this test can be applied to both ordinal and continuous data that has failed the assumptions for conducting Pearson’s correlation. For newbies, ordinal data is categorical data with a slight nuance of ranking/ordering (e.g. low, medium, and high). An important assumption to note here is that there should be a monotonic relationship between the variables, i.e. the variables increase in value together, or as one increases the other decreases.

**Kendall correlation coefficient** *(for discrete/ordinal data)* — similar to Spearman’s correlation, this coefficient compares the number of concordant and discordant pairs of data.

Let’s say we have a pair of observations (xᵢ, yᵢ), (xⱼ, yⱼ), with i < j. They are:

- *concordant* if either (xᵢ > xⱼ and yᵢ > yⱼ) or (xᵢ < xⱼ and yᵢ < yⱼ)
- *discordant* if either (xᵢ < xⱼ and yᵢ > yⱼ) or (xᵢ > xⱼ and yᵢ < yⱼ)
- *neither* if there’s a tie in **x** (xᵢ = xⱼ) or a tie in **y** (yᵢ = yⱼ)

Denoted with the Greek letter tau (**τ**), this coefficient varies between -1 to 1 and is based on the difference in the counts of concordant and discordant pairs relative to the number of x-y pairs.

```
#USING SCIPY
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from scipy.stats import kendalltau
coef, p = pearsonr(x, y) #Pearson's r
coef, p = spearmanr(x, y) # Spearman's rho
coef, p = kendalltau(x, y) # Kendall's tau
#USING PANDAS
x.corr(y) #Pearson's r
x.corr(y, method='spearman') # Spearman's rho
x.corr(y, method='kendall') # Kendall's tau
```

In the regression jupyter notebook above, I’ve used **Pearson’s correlation** since Spearman and Kendall work best only with ordinal variables and we have 60% continuous variables.

Before diving into chi-square, let’s understand an important concept: hypothesis testing! Imagine someone makes a claim, a commonly accepted fact; you call it the *null hypothesis*. Now you come up with an alternate hypothesis, one that you think explains the phenomenon better, and then work towards rejecting the null hypothesis.

In our case:

- *Null Hypothesis*: The two variables are independent.
- *Alternative Hypothesis*: The two variables are dependent.

So, the Chi-Square test comes in two variations: one that evaluates the **goodness-of-fit**, and the one we will focus on, the **test of independence**. Primarily, it compares the observed data to a model that distributes the data according to the **expectation** that the variables are independent. Then you check where the observed data doesn’t fit the model. If there are too many data points/outliers, there is a strong possibility that the variables are dependent, proving that the null hypothesis is incorrect!

The test returns a chi-square statistic along with a **p-value** to help us decide! On a high level, if the p-value is less than some critical value, the ‘**level of significance**’ (usually 0.05), we reject the null hypothesis and conclude that the variables are dependent!

Chi-square would not work with the automobile dataset since it needs categorical variables and non-negative values! For that reason, we can use Mutual Information & ANOVA.

```
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X_train = X_train.astype(int)
chi2_features = SelectKBest(chi2 , k=12)
X_kbest_features = chi2_features.fit_transform(X_train, y_train)
```

Mutual information measures the contribution of one variable towards another. In other words, how much would the target variable be impacted if we removed or added the feature? MI is 0 if the two variables are independent, and higher values indicate stronger dependency (unlike a correlation coefficient, it is not capped at 1). MI is closely related to entropy: it quantifies the amount of information obtained about one random variable through the other random variable. The best thing about MI is that it can detect non-linear relationships and works for both regression and classification.

```
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(mutual_info_regression, k=10)
X_train_new = selector.fit_transform(X_train, y_train) #Applying transformation to the training set
#to get names of the selected features
mask = selector.get_support()
# Output array([False, False, True, True, True, False ....])
print(selector.scores_)
#Output array([0.16978127, 0.01829886, 0.45461366, 0.55126343, 0.66081217, 0.27715287 ....])
new_features = X_train.columns[mask]
print(new_features)
#Output Index(['wheelbase', 'carlength', 'carwidth', 'curbweight', 'enginesize','boreratio', 'horsepower', 'citympg', 'highwaympg', 'fuelsystem_2bbl'],dtype='object')
print(X_train_new.shape)
#Output (143, 10)
```

Okay, honestly, this is a bit tricky, but let’s understand it step by step. Firstly, here instead of features we deal with groups/levels. Groups are different groups within the same independent (categorical) variable. ANOVA is primarily an **extension of a t-test**: with a t-test you can study only two groups, but ANOVA lets you compare three or more groups to see if there’s a difference in means and determine whether they came from the same population.

It assumes the hypotheses:

- H0: Means of all groups are equal.
- H1: At least one group mean is different.

Let’s say from our automobile dataset, we use a feature ‘fuel-type’ that has 2 groups/levels — ‘diesel’ and ‘gas’. So, our goal would be to determine if these two groups are statistically different by calculating whether the means of the groups are different from the overall mean of the independent variable i.e ‘fuel-type’. ANOVA uses F-Test for statistical significance, which is the ratio of the **variance between groups** to the **variance within groups** and the larger this number is, the more likely it is that the means of the groups really *are* different, and that you should reject the null hypothesis.

```
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
fvalue_selector = SelectKBest(f_regression, k=20) #select features with 20 best ANOVA F-Values
X_train_new = fvalue_selector.fit_transform(X_train, y_train)
print(X_train.shape, X_train_new.shape)
# output (143, 59) (143, 20)
```

In wrapper methods, we primarily choose a subset of features and train a machine learning algorithm on it. Based on the inferences from this model, we employ a search strategy to look through the space of possible feature subsets and decide which feature to add or remove for the next model. This loop continues until the model performance no longer changes or we reach the desired count of features *(k_features)*.

The downside is that it becomes computationally expensive as the features increase, but on the good side, it takes care of the interactions between the features, ultimately finding the optimal subset of features for your model with the lowest possible error.

This greedy search comes in two variants — **Sequential Forward Selection** (SFS) and **Sequential Backward Selection** (SBS). SFS starts with an empty set of features and looks for the single feature that **minimizes the cost function**; once found, that feature is added to the subset, and in the same way, one by one, it builds up the set of features for an optimal model. SBS takes the opposite route: it starts with all the features and iteratively removes them one by one based on performance. Both algorithms have the same goal of attaining the lowest-cost model.

*The main limitation of SFS is that it is unable to remove features that become non-useful after the addition of other features. The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded.*

```
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
sfs = SequentialFeatureSelector(LinearRegression(),
                                k_features=10,
                                forward=True,    # True = SFS, False = SBS
                                floating=False,  # floating is another extension of SFS, not used here
                                scoring='r2',    # regression metric (not 'accuracy')
                                cv=2)            # k-fold cross-validation
sfs = sfs.fit(X_train, y_train)
# print the selected features.
selected_features = X_train.columns[list(sfs.k_feature_idx_)]
print(selected_features)
# final cross-validation score.
print(sfs.k_score_)
# transform to the newly selected features.
X_train_new = sfs.transform(X_train)
```
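scikit-learn (since version 0.24) ships its own `SequentialFeatureSelector` with a very similar interface. A minimal sketch, using `make_regression` as a synthetic stand-in for the automobile dataset:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic stand-in data: 100 samples, 20 features, 5 informative
X, y = make_regression(n_samples=100, n_features=20,
                       n_informative=5, random_state=0)
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=10,
                                direction='forward',  # 'backward' for SBS
                                cv=2)
X_new = sfs.fit_transform(X, y)
print(X_new.shape)  # (100, 10)
```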

Considering that you have an initial set of features, what this greedy algorithm does is repeatedly build models on smaller and smaller subsets of features. How does it do that? After an estimator is trained on the features, it exposes the importance of each feature through its *coef_* or *feature_importances_* attribute. At each step, the least important features are pruned from the current set. This process is recursively repeated until the specified number of features is reached.

```
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lm = LinearRegression()
rfe1 = RFE(lm, n_features_to_select=20) # RFE keeping 20 features
# Fit on train data, then transform train and test to the 20 features
X_train_new = rfe1.fit_transform(X_train, y_train)
X_test_new = rfe1.transform(X_test)
# Print the boolean mask of selected features
print(rfe1.support_)
# Output [False False False False True False False False True False False...]
print(rfe1.ranking_)
# Output [36 34 23 26 1 21 12 27 1 13 28 1 18 19 32 25 1 11 9 7 8 10 30 35...]
lm.fit(X_train_new, y_train)
predictions_rfe = lm.predict(X_test_new)
RMSE = np.sqrt(mean_squared_error(y_test, predictions_rfe))
R2 = r2_score(y_test, predictions_rfe)
print('R2:', R2, 'RMSE:', RMSE)
# Output R2: 0.88 RMSE: 0.33
```

These methods combine the strengths of both filter and wrapper methods. The upside is that they perform feature selection during the process of training — which is why they are called embedded! Their computational speed is close to that of filter methods, with accuracy closer to wrapper methods, making them a win-win.

Before diving into L1, let’s understand a bit about regularization. Primarily, it is a technique used to reduce overfitting in highly complex models: we add a penalty term to the cost function so that as model complexity increases, the cost increases sharply. Coming back to LASSO (Least Absolute Shrinkage and Selection Operator) regularization, what you need to understand here is that it comes with a parameter, **‘alpha’**, and the higher the alpha, the more the coefficients of the least important features are shrunk to exactly zero. Eventually, we get a much simpler model with the same or better accuracy!

However, in cases where every feature carries some importance, you can try Ridge regularization (L2) or Elastic Net (a combination of L1 and L2), which reduce a feature’s weight instead of dropping it completely.

```
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso = Lasso()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 3, 5]}
lasso_model = GridSearchCV(lasso, parameters, scoring='r2', cv=5)
lasso_model.fit(X_train, y_train)
pred = lasso_model.predict(X_test)
print(lasso_model.best_params_) # output {'alpha': 0.001}
print(lasso_model.best_score_)  # output 0.8630550401365724
```
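To see the selection effect directly, here is a small sketch on synthetic data (`make_regression` stands in for the dataset above) showing that a larger alpha drives more coefficients to exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic stand-in: 30 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)
zero_counts = []
for alpha in (0.01, 1.0, 10.0):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))  # features dropped by L1
    zero_counts.append(n_zero)
    print(f"alpha={alpha}: {n_zero} of 30 coefficients shrunk to exactly zero")
```

As alpha grows, the uninformative features are eliminated first while the strongly predictive ones keep non-zero weights.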

One of the most popular and accurate machine learning algorithms, random forests are an ensemble of randomized **decision trees**; an individual tree won’t contain all the features and samples. The reason we use them for feature selection lies in the way decision trees are constructed! During tree building, starting from the root, the algorithm tries all possible splits by making conditional comparisons at each step and chooses the one that divides the data into the most homogeneous (purest) groups. The importance of each feature is derived from how “pure” each of the resulting sets is.

Using **Gini impurity** for classification and variance for regression, we can identify the features that would lead to an optimal model. The same concept applies to CART (Classification and Regression Trees) and boosted tree algorithms as well.

```
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
# fit the model
model.fit(X_train, y_train)
# get importance
importance = model.feature_importances_
# summarize feature importance, sorted from most to least important
impList = zip(X_train.columns, importance)
for feature in sorted(impList, key=lambda t: t[1], reverse=True):
    print(feature)
#Output - Important features
""" ('enginesize', 0.6348884035234398)
('curbweight', 0.2389770360203148)
('horsepower', 0.03458620700119025)
('carwidth', 0.027170640676336785)
('stroke', 0.012516866412495744)
('peakrpm', 0.011750282673996262)
('carCompany_bmw', 0.009801675326218959)
('carlength', 0.008737911775028553) ..... """
```

This is a binary classification problem where all of the attributes are numeric.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

Many different statistical tests can be used with this selection method. For example, the ANOVA F-value method is appropriate for numerical inputs with a categorical output, as we have in the Pima dataset. It is available via the f_classif() function. We will select the 4 best features using this method in the example below.

```
# Feature Selection with Univariate Statistical Tests
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
```

You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): specifically, the features with indexes 0 (*preg*), 1 (*plas*), 5 (*mass*), and 7 (*age*).

```
[ 39.67 213.162 3.257 4.304 13.281 71.772 23.871 46.141]
[[ 6. 148. 33.6 50. ]
[ 1. 85. 26.6 31. ]
[ 8. 183. 23.3 32. ]
[ 1. 89. 28.1 21. ]
[ 0. 137. 43.1 33. ]]
```

Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute most to predicting the target. You can learn more about the RFE class in the scikit-learn documentation. The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

```
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression(solver='lbfgs')
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
```

You can see that RFE chose the top 3 features as *preg*, *mass*, and *pedi*. These are marked True in the *support_* array and marked with a “1” in the *ranking_* array.

```
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
```

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally, this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the example below, we use PCA and select 3 principal components. Learn more about the PCA class in scikit-learn by reviewing the PCA API.

```
# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
```

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.

```
Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02
9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
[ 2.26488861e-02 9.72210040e-01 1.41909330e-01 -5.78614699e-02
-9.46266913e-02 4.69729766e-02 8.16804621e-04 1.40168181e-01]
[ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01
2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]
```
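To actually obtain the compressed dataset, call transform (or fit_transform) on the fitted PCA. A minimal sketch on random stand-in data with the same shape as the Pima feature matrix (768 rows, 8 features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))  # stand-in for the 768x8 Pima feature matrix
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # project onto the top 3 principal components
print(X_reduced.shape)  # (768, 3)
```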

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. In the example below we construct an ExtraTreesClassifier classifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

```
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
print(model.feature_importances_)
```

You can see that we are given an importance score for each attribute: the larger the score, the more important the attribute. The scores suggest the importance of *plas*, *age*, and *mass*.

```
[ 0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]
```

That’s all! Hopefully you now have a good intuition of how these statistical tests work as feature selection techniques. An important thing to keep in mind is that applying a feature selection algorithm doesn’t always guarantee better accuracy, but it will surely lead to a simpler model than before!
