Understanding Self-Attention in Transformers with example
2022-06-23
Understanding Moving Average Model in Time Series with Python
2022-06-25
Show all

A tutorial on Pandas apply, applymap, map, and transform

16 mins read

In Data Processing, it is often necessary to perform operations (such as statistical calculations, splitting, or substituting values) on a certain row or column to obtain new data. Writing a for-loop to iterate through Pandas DataFrame and Series will do the job, but that doesn’t seem like a good idea. The for-loop tends to have more lines of code, less code readability, and slower performance.

Fortunately, there are already great methods that are built into Pandas to help you accomplish the goals! In this post, we will see how to perform operations using apply() and applymap(), and how to substitute value using map().

First of all, you should be aware that DataFrame and Series will have some or all of these three methods, as follows:

And the Pandas official API reference suggests that:

  • apply() is used to apply a function along an axis of the DataFrame or on values of Series.
  • applymap() is used to apply a function to a DataFrame elementwise.
  • map() is used to substitute each value in a Series with another value.

Before we dive into the details, let’s first create a DataFrame for demonstration.

import pandas as pd
df = pd.DataFrame({ 'A': [1,2,3,4], 
                   'B': [10,20,30,40],
                   'C': [20,40,60,80]
                  }, 
                  index=['Row 1', 'Row 2', 'Row 3', 'Row 4'])

How to use apply()?

The Pandas apply() is used to apply a function along an axis of the DataFrame or on values of Series. Let’s begin with a simple example, to sum each row and save the result to a new column “D”

# Let's call this "custom_sum" as "sum" is a built-in function
def custom_sum(row):
    return row.sum()
df['D'] = df.apply(custom_sum, axis=1)

And here is the output

Let’s take a look df.apply(custom_sumaxis=1)

  • The first parameter custom_sum is a function.
  • The second parameter axis is to specify which axis the function is applied to. 0 for applying the function to each column and 1 for applying the function to each row.

Let me explain this process in a more intuitive way. The second parameter axis = 1 tells Pandas to use the row. So, the custom_sum is applied to each row and returns a new Series with the output of each row as value.

With the understanding of the sum of each row, the sum of each column is just to use axis = 0 instead

df.loc['Row 5'] = df.apply(custom_sum, axis=0)

So far, we have been talking about apply() on a DataFrame. Similarly, apply() can be used on the values of Series. For example, multiply the column “C” by 2 and save the result to a new column “D”

def multiply_by_2(val):
    return val * 2
df['D'] = df['C'].apply(multiply_by_2)

Notice that df[‘C’] is used to select the column “C” and then call apply() with the only parameter multiply_by_2. We don’t need to specify axis anymore because Series is a one-dimensional array. The return value is a Series and gets assigned to the new column D by df[‘D’].

Use lambda with apply

You can also use lambda expression with Pandas apply() function.

The lambda equivalent for the sum of each row of a DataFrame:

df['D'] = df.apply(lambda x:x.sum(), axis=1)

The lambda equivalent for the sum of each column of a DataFrame:

df.loc['Row 5'] = df.apply(lambda x:x.sum(), axis=0)

And the lambda equivalent for multiplying by 2 on a Series:

df['D'] = df['C'].apply(lambda x:x*2)

With result_type parameter

result_type is a parameter in apply() set to 'expand''reduce', or 'broadcast' to get the desired type of result.

In the above scenario if result_type is set to 'broadcast' then the output will be a DataFrame substituted by the custom_sum value.

df.apply(custom_sum, axis=1, result_type='broadcast')

The result is broadcasted to the original shape of the frame, and the original index and columns are retained.

To understand result_type as 'expand' and 'reduce', we will first create a function that returns a list.

def cal_multi_col(row):
    return [row['A'] * 2, row['B'] * 3]

Now apply this function across the DataFrame column with result_type as 'expand'

df.apply(cal_multi_col, axis=1, result_type='expand')

The output is a new DataFrame with column names 0 and 1.

In order to append this to the existing DataFrame, the result has to be kept in a variable so the column names can be accessed by res.columns.

res = df.apply(cal_multi_col, axis=1, result_type='expand')
df[res.columns] = res

And the output is:

The output of result_type=’expand’

Next, apply the function across the DataFrame column with result_type as 'reduce' . result_type='reduce' is just the opposite of 'expand' and returns a Series if possible rather than expanding list-like results.

df['New'] = df.apply(cal_multi_col, axis=1, result_type='reduce')
the output of result_type=’reduce’

How to use applymap()?

applymap() is only available in DataFrame and used for element-wise operation across the whole DataFrame. It has been optimized and some cases work much faster than apply() , but it’s good to compare it with apply() before going for any heavier operation.

For example, to output a DataFrame with squared numbers we can use the following:

df.applymap(np.square)

How to use map()?

map() is only available in Series and used for substituting each value in a Series with another value. To understand how the map() works, we first create a Series.

s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

map() accepts a dict or a Series. Values that are not found in the dict are converted to NaN, unless the dict has a default value (e.g. defaultdict):

s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as NaNna_action='ignore' can be used:

s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object

Summary of apply(), map(), and applymap()

For DataFrame:

  • apply(): It is used when you want to apply a function along the rows or columns. axis = 0 for columns and axis = 1 for rows.
  • applymap(): It is used for element-wise operation across the whole DataFrame.

For Series:

  • apply(): It is used when you want to apply a function on the values of Series.
  • map(): It is used to substitute each value with another value.

Pandas transform()

In this section, we will cover the following most frequently used Pandas transform() features:

  1. Transforming values
  2. Combining groupby() results
  3. Filtering data
  4. Handling missing values at the group level

Transforming values

Let’s take a look at pd.transform(funcaxis=0)

  • The first argument func is to specify the function to be used for manipulating data. It can be a function, a string function name, a list of functions, or a dictionary of axis labels -> functions
  • The 2nd argument axis is to specify which axis the func is applied to. 0 for applying the func to each column and 1 for applying the func to each row.

Let’s see how transform() works with the help of some examples.

A function

We can pass a function to func. For example

df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
def plus_10(x):
    return x+10
df.transform(plus_10)

You can also use a lambda expression. Below is the lambda equivalent for the plus_10() :

df.transform(lambda x: x+10)

A string function

We can pass any valid Pandas string function to func, for example 'sqrt' :

df.transform('sqrt')

A list of functions

func can be a list of functions. for example sqrt and exp from NumPy:

df.transform([np.sqrt, np.exp])

A dictionary of axis labels -> function

func can be a dictionary of axis labels -> function. For example

df.transform({
    'A': np.sqrt,
    'B': np.exp,
})

Combining groupby() results

One of the most compelling usages of Pandas transform() is combining grouby() results. Let’s see how this works with the help of an example. Suppose we have a dataset about a restaurant chain:

df = pd.DataFrame({
  'restaurant_id': [101,102,103,104,105,106,107],
  'address': ['A','B','C','D', 'E', 'F', 'G'],
  'city': ['London','London','London','Oxford','Oxford', 'Durham', 'Durham'],
  'sales': [10,500,48,12,21,22,14]
})

We can see that each city has multiple restaurants with sales. We would like to know “What is the percentage of sales each restaurant represents in the city”. The expected output is:

The tricky part in this calculation is that we need to get a city_total_sales and combine it back into the data in order to get the percentage.

There are 2 solutions:

  1. groupby()apply(), and merge()
  2. groupby() and transform()

Solution 1: groupby()apply(), and merge()

The first solution is to split the data with groupby() and using apply() to aggregate each group, then merge the results back into the original DataFrame using merge()

Step 1: Use groupby() and apply() to calculate the city_total_sales

city_sales = df.groupby('city')['sales']
             .apply(sum).rename('city_total_sales').reset_index()

groupby('city') split the data by grouping on the city column. For each of these groups, the function sum gets applied to the sales column to calculate the sum for each group. Finally, the new column gets renamed to city_total_sales and the index gets reset (Note: reset_inde() is required to clear the index generated by groupby('city')).

In addition, Pandas has a built-in sum() function and the following is the Pandas sum() equivalent:


city_sales = df.groupby('city')['sales']
             .sum().rename('city_total_sales').reset_index()

Step 2: Use merge() function to combine the results

df_new = pd.merge(df, city_sales, how='left')

The group results get merged back into the original DataFrame using merge() with how='left' for left outer join.

Step 3: Calculate the percentage

Finally, the percentage can be calculated and formatted.

df_new['pct'] = df_new['sales'] / df_new['city_total_sales']
df_new['pct'] = df_new['pct'].apply(lambda x: format(x, '.2%'))

This certainly does our work. But it is a multistep process and requires extra code to get the data in the form we require.

We can solve this effectively using the transform() function

Solution 2: groupby() and transform()

This solution is a game-changer. A single line of code can solve the apply and merge.

Step 1: Use groupby() and transform() to calculate the city_total_sales

The transform function retains the same number of items as the original dataset after performing the transformation. Therefore, a one-line step using groupby followed by a transform(sum) returns the same output.

df['city_total_sales'] = df.groupby('city')['sales']
                           .transform('sum')

Step 2: Calculate the percentage

Finally, this is the same as the solution one to get the percentage.

df['pct'] = df['sales'] / df['city_total_sales']
df['pct'] = df['pct'].apply(lambda x: format(x, '.2%'))

Filtering data

transform() can also be used to filter data. Here we are trying to get records where the city’s total sales are greater than 40

df[df.groupby('city')['sales'].transform('sum') > 40]

Handling missing values at the group level

Another usage of Pandas transform() is to handle missing values at the group level. Let’s see how this works with an example.

Here is a DataFrame for demonstration

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3]
})

In the example above, the data can be split into three groups by name, and each group has missing values. A common solution to replace missing values is to replace NaN with mean.

Let’s take a look at the average value in each group.

df.groupby('name')['value'].mean()name
A    1.0
B    5.0
C    2.5
Name: value, dtype: float64

Here we can use transform() to replace missing values with the group average value.

df['value'] = df.groupby('name')
                .transform(lambda x: x.fillna(x.mean()))

Comparing panadas apply() and transform()

When looking for applying a custom function, you might be confused with the following two choices:

  • apply(func, axis=0): call a function func along an axis of the DataFrame. It returns the result of applying func along the given axis.
  • transform(func, axis=0): call a function func on self producing a DataFrame with transformed values. It returns a DataFrame that has the same length as self.

They take the same arguments func and axis. Both call the func along the axis of the given DataFrame. So what is the difference? How do we choose one over the other?

In this section, we will cover the following usages and go through their difference:

  1. Manipulating values
  2. In conjunction with groupby() results

Manipulating values

Both apply() and transform() can be used to manipulate values. Let’s see how they work with the help of some examples.

df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
def plus_10(x):
    return x+10

For the entire DataFrame

Both apply() and transform() can be used to manipulate the entire DataFrame.

df.apply(plus_10)
df.transform(plus_10)

Both apply() and transform() support lambda expression and below is the lambda equivalent:

df.apply(lambda x: x+10)
df.transform(lambda x: x+10)

For a single column

Both apply() and transform() can be used for manipulating a single column

df['B_ap'] = df['B'].apply(plus_10)
# The lambda equivalent
df['B_ap'] = df['B'].apply(lambda x: x+10)
df['B_tr'] = df['B'].transform(plus_10)
# The lambda equivalent
df['B_tr'] = df['B'].transform(lambda x: x+10)

What are the differences?

Here are the three main differences:

  • (1) transform() works with function, a string function, a list of functions, or a dictionary. However, apply() is only allowed with function.
  • (2) transform() cannot produce aggregated results.
  • (3) apply() works with multiple Series at a time. But, transform() is only allowed to work with a single Series at a time.

Let’s take a look at them with the help of some examples.

(1) transform() works with function, a string function, a list of functions, or a dictionary. However, apply() is only allowed a function.

For transform(), we can pass any valid Pandas string function to func

df.transform('sqrt')

func can be a list of functions, for example, sqrt and exp from NumPy:

df.transform([np.sqrt, np.exp])

func can be a dict of axis labels -> function. For example

df.transform({
    'A': np.sqrt,
    'B': np.exp,
})

(2) transform() cannot produce aggregated results.

We can use apply() to produce aggregated results, for example, the sum:

df.apply(lambda x:x.sum())
A     6
B    60
dtype: int64

However, we will get a ValueError when trying to do the same with transform() . We are getting this problem because the output of transform() has to be a DataFrame that has the same length as self.

df.transform(lambda x:x.sum())

(3) apply() works with multiple Series at a time. But, transform() is only allowed to work with a single Series at a time.

To demonstrate this, let’s create a function to work on 2 Series at a time.

def subtract_two(x):
    return x['B'] - x['A']

apply() works perfect with subtract_two and axis=1

df.apply(subtract_two, axis=1)
0     9
1    18
2    27
dtype: int64

However, we are getting a ValueError when trying to do the same with transform(). This is because transform() is only allowed to work with a single Series at a time.

# Getting error when trying the same with transform
df.transform(subtract_two, axis=1)

We will get the same result when using a lambda expression

# It is working
df.apply(lambda x: x['B'] - x['A'], axis=1)
# Getting same error
df.transform(lambda x: x['B'] - x['A'], axis=1)

In conjunction with groupby()

Both apply() and transform() can be used in conjunction with groupby(). And in fact, it is one of the most compelling usages of transform().

Here are the 2 differences when using them in conjunction with groupby()

  • (1) transform() returns a DataFrame that has the same length as the input
  • (2) apply() works with multiple Series at a time. But, transform() is only allowed to work with a single Series at a time.

Let’s create a DataFrame and show the difference with some examples

df = pd.DataFrame({
    'key': ['a','b','c'] * 4,
    'A': np.arange(12),
    'B': [1,2,3] * 4,
})

In the example above, the data can be split into three groups by key.

(1) transform() returns a Series that has the same length as the input

To demonstrate this, let’s make a function to produce an aggregated result.

# Aggregating the sum of the given Series
def group_sum(x):
    return x.sum()

For apply() , it returns one value for each group and the output shape is (num_of_groups, 1) .

gr_data_ap = df.groupby('key')['A'].apply(group_sum)
gr_data_ap

key
a     9
b    12
c    15
Name: A, dtype: int64

For transform(), it returns a Series that has the same length as the given DataFrame and the output shape is (len(df), 1)

gr_data_tr = df.groupby('key')['A'].transform(group_sum)
gr_data_tr

0     9
1    12
2    15
3     9
4    12
5    15
6     9
7    12
8    15
Name: A, dtype: int64

(2) apply() works with multiple Series at a time. But transform() is only allowed to work with a single Series at a time.

This is the same difference as we mentioned in “Manipulating values”, and we just don’t need to specify the argument axis on a groupby() result.

To demonstrate this, let’s create a function to work on 2 Series at a time.

def subtract_two(x):
    return x['B'] - x['A']

apply() works with multiple Series at a time.

df.groupby('key').apply(subtract_two)
key   
a    0    1
     3   -2
     6   -5
b    1    1
     4   -2
     7   -5
c    2    1
     5   -2
     8   -5
dtype: int64

However, we are getting a KeyError when trying the same with transform():

df.groupby('key').transform(subtract_two)

Summary

For manipulating values, both apply() and transform() can be used to manipulate an entire DataFrame or any specific column. But there are 3 differences:

  1. transform() can take a function, a string function, a list of functions, and a dictionary. However, apply() is only allowed a function.
  2. transform() cannot produce aggregated results.
  3. apply() works with multiple Series at a time. But, transform() is only allowed to work with a single Series at a time.

For working in conjunction with groupby()

  1. transform() returns a Series that has the same length as the input.
  2. apply() works with multiple Series at a time. But, transform() is only allowed to work with a single Series at a time.

Resources:

https://towardsdatascience.com/when-to-use-pandas-transform-function-df8861aa0dcf

https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff

https://towardsdatascience.com/difference-between-apply-and-transform-in-pandas-242e5cf32705

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.