How to return pandas dataframes from Scikit-Learn transformations: New API simplifies data preprocessing

3 mins read

Scikit-learn is a widely used Python library for machine learning, and it is often one of the first libraries we learn in data science. It covers the entire machine learning workflow: besides implementing machine learning algorithms, it also handles tasks such as feature preprocessing and model evaluation.

In this article, we will talk about a new API related to data preprocessing. In machine learning, we rarely use the features exactly as they appear in the raw data. They usually require a fair amount of preprocessing for good results. For instance, some algorithms, such as distance-based or gradient-based ones, do not perform well if feature value ranges are very different. They tend to give more weight to the features with larger values, so the results become biased.

Consider a house price prediction problem. The area of a house is around 200 square meters, whereas its age is usually less than 20 years. The number of bedrooms is 1, 2, or 3 in most cases. All of these features are important in determining the price of a house. However, if we use them without any scaling, machine learning models might give more weight to the features with larger values. Models tend to perform better and converge faster when the features are on a relatively similar scale.
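To make the scale difference concrete, here is a minimal sketch. The house values and the column names (area_sqm, age_years, bedrooms) are invented for illustration and are not from any real dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical house data, made up for this illustration only
houses = pd.DataFrame(
    {
        "area_sqm": [120.0, 200.0, 250.0, 180.0],
        "age_years": [5.0, 18.0, 2.0, 12.0],
        "bedrooms": [1.0, 3.0, 3.0, 2.0],
    }
)

# The raw ranges differ widely from one column to the next
raw_ranges = houses.max() - houses.min()
print(raw_ranges)

# Standardization rescales every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(houses)
scaled_means = scaled.mean(axis=0)
scaled_stds = scaled.std(axis=0)
print(scaled_means, scaled_stds)
```

After scaling, no single feature dominates simply because it is measured in larger units.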

The preprocessing module of Scikit-learn provides transformers that scale features to similar ranges. The issue with these transformers is that, by default, they return a NumPy array instead of a DataFrame, which makes it hard to keep track of the feature names. In most cases, we need extra lines of code in the script to carry the feature names along.

Let’s work through an example using the famous iris dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

We can train a standard scaler on the train set and use it to transform the feature values in the test set.

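from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # learn the mean and standard deviation from the train set only
X_test_scaled = scaler.transform(X_test)
print(type(X_test_scaled))  # <class 'numpy.ndarray'>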

The variable X_test is a Pandas DataFrame whereas X_test_scaled is a NumPy array. It would be much more practical to also have X_test_scaled as a DataFrame.
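Without any help from the library, the usual workaround is to rebuild the DataFrame by hand. The sketch below shows one common way to do it; it is self-contained, so it reloads the iris data, and the variable names are my own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(as_frame=True, return_X_y=True)

scaler = StandardScaler().fit(X)
X_scaled_array = scaler.transform(X)  # NumPy array: the column names are lost

# Extra bookkeeping: put the column names and the index back manually
X_scaled_df = pd.DataFrame(X_scaled_array, columns=X.columns, index=X.index)
print(X_scaled_df.head())
```

This works, but the extra line has to be repeated after every transformation, and it becomes more awkward when a transformer changes the number or order of the columns.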

It is time to share the good news now! 😊

The set_output API

The new set_output API lets us configure transformers to output pandas DataFrames. It first shipped in scikit-learn 1.2, so it requires that version or later. The example here follows the official documentation: it is the same example as before, with the small addition of a set_output call.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")
scaler.fit(X_train)  # fit on the train set before transforming the test set
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()

X_test_scaled is now a pandas DataFrame whose columns carry the original feature names.

This new feature will simplify the code for data preprocessing tasks and also be very useful when creating pipelines with Scikit-learn.


Amir Masoud Sefidian
Machine Learning Engineer