2022-06-23
###### Understanding Moving Average Model in Time Series with Python
2022-06-25 In Data Processing, it is often necessary to perform operations (such as statistical calculations, splitting, or substituting values) on a certain row or column to obtain new data. Writing a for-loop to iterate through Pandas DataFrame and Series will do the job, but that doesn’t seem like a good idea. The for-loop tends to have more lines of code, less code readability, and slower performance.

Fortunately, there are already great methods that are built into Pandas to help you accomplish the goals! In this post, we will see how to perform operations using `apply()` and `applymap()`, and how to substitute value using `map()`.

First of all, you should be aware that DataFrame and Series will have some or all of these three methods, as follows:

And the Pandas official API reference suggests that:

• `apply()` is used to apply a function along an axis of the DataFrame or on values of Series.
• `applymap()` is used to apply a function to a DataFrame elementwise.
• `map()` is used to substitute each value in a Series with another value.

Before we dive into the details, let’s first create a DataFrame for demonstration.

```import pandas as pd
df = pd.DataFrame({ 'A': [1,2,3,4],
'B': [10,20,30,40],
'C': [20,40,60,80]
},
index=['Row 1', 'Row 2', 'Row 3', 'Row 4'])
```

## How to use apply()?

The Pandas `apply()` is used to apply a function along an axis of the DataFrame or on values of Series. Let’s begin with a simple example, to sum each row and save the result to a new column “D”

```# Let's call this "custom_sum" as "sum" is a built-in function
def custom_sum(row):
return row.sum()
df['D'] = df.apply(custom_sum, axis=1)
```

And here is the output

Let’s take a look `df.apply(custom_sum, axis=1)`

• The first parameter `custom_sum` is a function.
• The second parameter `axis` is to specify which axis the function is applied to. `0` for applying the function to each column and `1` for applying the function to each row.

Let me explain this process in a more intuitive way. The second parameter `axis = 1` tells Pandas to use the row. So, the `custom_sum` is applied to each row and returns a new Series with the output of each row as value.

With the understanding of the sum of each row, the sum of each column is just to use `axis = 0` instead

```df.loc['Row 5'] = df.apply(custom_sum, axis=0)
```

So far, we have been talking about `apply()` on a DataFrame. Similarly, `apply()` can be used on the values of Series. For example, multiply the column “C” by 2 and save the result to a new column “D”

```def multiply_by_2(val):
return val * 2
df['D'] = df['C'].apply(multiply_by_2)
```

Notice that `df[‘C’]` is used to select the column “C” and then call `apply()` with the only parameter `multiply_by_2`. We don’t need to specify axis anymore because Series is a one-dimensional array. The return value is a Series and gets assigned to the new column D by `df[‘D’]`.

## Use lambda with apply

You can also use lambda expression with Pandas `apply()` function.

The lambda equivalent for the sum of each row of a DataFrame:

```df['D'] = df.apply(lambda x:x.sum(), axis=1)
```

The lambda equivalent for the sum of each column of a DataFrame:

```df.loc['Row 5'] = df.apply(lambda x:x.sum(), axis=0)
```

And the lambda equivalent for multiplying by 2 on a Series:

```df['D'] = df['C'].apply(lambda x:x*2)
```

## With result_type parameter

`result_type` is a parameter in `apply()` set to `'expand'``'reduce'`, or `'broadcast'` to get the desired type of result.

In the above scenario if `result_type` is set to `'broadcast'` then the output will be a DataFrame substituted by the `custom_sum` value.

```df.apply(custom_sum, axis=1, result_type='broadcast')
```

The result is broadcasted to the original shape of the frame, and the original index and columns are retained.

To understand `result_type` as `'expand'` and `'reduce'`, we will first create a function that returns a list.

```def cal_multi_col(row):
return [row['A'] * 2, row['B'] * 3]
```

Now apply this function across the DataFrame column with `result_type` as `'expand'`

```df.apply(cal_multi_col, axis=1, result_type='expand')
```

The output is a new DataFrame with column names 0 and 1.

In order to append this to the existing DataFrame, the result has to be kept in a variable so the column names can be accessed by `res.columns`.

```res = df.apply(cal_multi_col, axis=1, result_type='expand')
df[res.columns] = res
```

And the output is:

Next, apply the function across the DataFrame column with `result_type` as `'reduce'` . `result_type='reduce'` is just the opposite of `'expand'` and returns a Series if possible rather than expanding list-like results.

```df['New'] = df.apply(cal_multi_col, axis=1, result_type='reduce')
```

## How to use applymap()?

`applymap()` is only available in DataFrame and used for element-wise operation across the whole DataFrame. It has been optimized and some cases work much faster than `apply()` , but it’s good to compare it with `apply()` before going for any heavier operation.

For example, to output a DataFrame with squared numbers we can use the following:

```df.applymap(np.square)
```

## How to use map()?

`map()` is only available in Series and used for substituting each value in a Series with another value. To understand how the `map()` works, we first create a Series.

```s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object
```

`map()` accepts a `dict` or a `Series`. Values that are not found in the `dict` are converted to `NaN`, unless the `dict` has a default value (e.g. `defaultdict`):

```s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object
```

It also accepts a function:

```s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object
```

To avoid applying the function to missing values (and keep them as `NaN``na_action='ignore'` can be used:

```s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object
```

## Summary of `apply`(), map(), and applymap()

For DataFrame:

• `apply()`: It is used when you want to apply a function along the rows or columns. `axis = 0` for columns and `axis = 1` for rows.
• `applymap()`: It is used for element-wise operation across the whole DataFrame.

For Series:

• `apply()`: It is used when you want to apply a function on the values of Series.
• `map()`: It is used to substitute each value with another value.

## Pandas transform()

In this section, we will cover the following most frequently used Pandas `transform()` features:

1. Transforming values
2. Combining `groupby()` results
3. Filtering data
4. Handling missing values at the group level

## Transforming values

Let’s take a look at `pd.transform(func, axis=0)`

• The first argument `func` is to specify the function to be used for manipulating data. It can be a function, a string function name, a list of functions, or a dictionary of axis labels -> functions
• The 2nd argument axis is to specify which axis the `func` is applied to. `0` for applying the `func` to each column and `1` for applying the `func` to each row.

Let’s see how `transform()` works with the help of some examples.

### A function

We can pass a function to `func`. For example

```df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
def plus_10(x):
return x+10
df.transform(plus_10)
```

You can also use a lambda expression. Below is the lambda equivalent for the `plus_10()` :

```df.transform(lambda x: x+10)
```

### A string function

We can pass any valid Pandas string function to `func`, for example `'sqrt'` :

```df.transform('sqrt')
```

### A list of functions

`func` can be a list of functions. for example `sqrt` and `exp` from NumPy:

```df.transform([np.sqrt, np.exp])
```

### A dictionary of axis labels -> function

`func` can be a dictionary of axis labels -> function. For example

```df.transform({
'A': np.sqrt,
'B': np.exp,
})
```

## Combining `groupby()` results

One of the most compelling usages of Pandas `transform()` is combining `grouby()` results. Let’s see how this works with the help of an example. Suppose we have a dataset about a restaurant chain:

```df = pd.DataFrame({
'restaurant_id': [101,102,103,104,105,106,107],
'city': ['London','London','London','Oxford','Oxford', 'Durham', 'Durham'],
'sales': [10,500,48,12,21,22,14]
})
```

We can see that each city has multiple restaurants with sales. We would like to know “What is the percentage of sales each restaurant represents in the city”. The expected output is:

The tricky part in this calculation is that we need to get a city_total_sales and combine it back into the data in order to get the percentage.

There are 2 solutions:

1. `groupby()``apply()`, and `merge()`
2. `groupby()` and `transform()`

### Solution 1: `groupby()`, `apply()`, and `merge()`

The first solution is to split the data with `groupby()` and using `apply()` to aggregate each group, then merge the results back into the original DataFrame using `merge()`

Step 1: Use `groupby()` and `apply()` to calculate the city_total_sales

```city_sales = df.groupby('city')['sales']
.apply(sum).rename('city_total_sales').reset_index()
```

`groupby('city')` split the data by grouping on the city column. For each of these groups, the function `sum` gets applied to the sales column to calculate the sum for each group. Finally, the new column gets renamed to city_total_sales and the index gets reset (Note: `reset_inde()` is required to clear the index generated by `groupby('city')`).

In addition, Pandas has a built-in `sum()` function and the following is the Pandas `sum()` equivalent:

```
city_sales = df.groupby('city')['sales']
.sum().rename('city_total_sales').reset_index()
```

Step 2: Use `merge()` function to combine the results

```df_new = pd.merge(df, city_sales, how='left')
```

The group results get merged back into the original DataFrame using `merge()` with `how='left'` for left outer join.

Step 3: Calculate the percentage

Finally, the percentage can be calculated and formatted.

```df_new['pct'] = df_new['sales'] / df_new['city_total_sales']
df_new['pct'] = df_new['pct'].apply(lambda x: format(x, '.2%'))
```

This certainly does our work. But it is a multistep process and requires extra code to get the data in the form we require.

We can solve this effectively using the `transform()` function

### Solution 2: `groupby()` and `transform()`

This solution is a game-changer. A single line of code can solve the apply and merge.

Step 1: Use `groupby()` and `transform()` to calculate the city_total_sales

The transform function retains the same number of items as the original dataset after performing the transformation. Therefore, a one-line step using `groupby` followed by a `transform(sum)` returns the same output.

```df['city_total_sales'] = df.groupby('city')['sales']
.transform('sum')
```

Step 2: Calculate the percentage

Finally, this is the same as the solution one to get the percentage.

```df['pct'] = df['sales'] / df['city_total_sales']
df['pct'] = df['pct'].apply(lambda x: format(x, '.2%'))
```

## Filtering data

`transform()` can also be used to filter data. Here we are trying to get records where the city’s total sales are greater than 40

```df[df.groupby('city')['sales'].transform('sum') > 40]
```

## Handling missing values at the group level

Another usage of Pandas `transform()` is to handle missing values at the group level. Let’s see how this works with an example.

Here is a DataFrame for demonstration

```df = pd.DataFrame({
'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3]
})
```

In the example above, the data can be split into three groups by name, and each group has missing values. A common solution to replace missing values is to replace NaN with mean.

Let’s take a look at the average value in each group.

```df.groupby('name')['value'].mean()name
A    1.0
B    5.0
C    2.5
Name: value, dtype: float64
```

Here we can use `transform()` to replace missing values with the group average value.

```df['value'] = df.groupby('name')
.transform(lambda x: x.fillna(x.mean()))
```

## Comparing panadas apply() and transform()

When looking for applying a custom function, you might be confused with the following two choices:

• `apply(func, axis=0)`: call a function `func` along an axis of the DataFrame. It returns the result of applying `func` along the given axis.
• `transform(func, axis=0)`: call a function `func` on self producing a DataFrame with transformed values. It returns a DataFrame that has the same length as self.

They take the same arguments `func` and `axis`. Both call the `func` along the axis of the given DataFrame. So what is the difference? How do we choose one over the other?

In this section, we will cover the following usages and go through their difference:

1. Manipulating values
2. In conjunction with `groupby()` results

### Manipulating values

Both `apply()` and `transform()` can be used to manipulate values. Let’s see how they work with the help of some examples.

```df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
def plus_10(x):
return x+10
```

### For the entire DataFrame

Both `apply()` and `transform()` can be used to manipulate the entire DataFrame.

```df.apply(plus_10)
```
```df.transform(plus_10)
```

Both `apply()` and `transform()` support lambda expression and below is the lambda equivalent:

```df.apply(lambda x: x+10)
df.transform(lambda x: x+10)
```

### For a single column

Both `apply()` and `transform()` can be used for manipulating a single column

```df['B_ap'] = df['B'].apply(plus_10)
# The lambda equivalent
df['B_ap'] = df['B'].apply(lambda x: x+10)
df['B_tr'] = df['B'].transform(plus_10)
# The lambda equivalent
df['B_tr'] = df['B'].transform(lambda x: x+10)
```

### What are the differences?

Here are the three main differences:

• (1) `transform()` works with function, a string function, a list of functions, or a dictionary. However, `apply()` is only allowed with function.
• (2) `transform()` cannot produce aggregated results.
• (3) `apply()` works with multiple Series at a time. But, `transform()` is only allowed to work with a single Series at a time.

Let’s take a look at them with the help of some examples.

(1) `transform()` works with function, a string function, a list of functions, or a dictionary. However, `apply()` is only allowed a function.

For `transform()`, we can pass any valid Pandas string function to `func`

```df.transform('sqrt')
```

`func` can be a list of functions, for example, `sqrt` and `exp` from NumPy:

```df.transform([np.sqrt, np.exp])
```

`func` can be a dict of axis labels -> function. For example

```df.transform({
'A': np.sqrt,
'B': np.exp,
})
```

(2) `transform()` cannot produce aggregated results.

We can use `apply()` to produce aggregated results, for example, the sum:

```df.apply(lambda x:x.sum())
A     6
B    60
dtype: int64
```

However, we will get a ValueError when trying to do the same with `transform()` . We are getting this problem because the output of `transform()` has to be a DataFrame that has the same length as self.

```df.transform(lambda x:x.sum())
```

(3) `apply()` works with multiple Series at a time. But, `transform()` is only allowed to work with a single Series at a time.

To demonstrate this, let’s create a function to work on 2 Series at a time.

```def subtract_two(x):
return x['B'] - x['A']
```

`apply()` works perfect with `subtract_two` and `axis=1`

```df.apply(subtract_two, axis=1)
0     9
1    18
2    27
dtype: int64
```

However, we are getting a ValueError when trying to do the same with `transform()`. This is because `transform()` is only allowed to work with a single Series at a time.

`# Getting error when trying the same with transformdf.transform(subtract_two, axis=1)`

We will get the same result when using a lambda expression

```# It is working
df.apply(lambda x: x['B'] - x['A'], axis=1)
# Getting same error
df.transform(lambda x: x['B'] - x['A'], axis=1)
```

### In conjunction with `groupby()`

Both `apply()` and `transform()` can be used in conjunction with `groupby()`. And in fact, it is one of the most compelling usages of `transform()`.

Here are the 2 differences when using them in conjunction with `groupby()`

• (1) `transform()` returns a DataFrame that has the same length as the input
• (2)` apply()` works with multiple Series at a time. But, `transform()` is only allowed to work with a single Series at a time.

Let’s create a DataFrame and show the difference with some examples

```df = pd.DataFrame({
'key': ['a','b','c'] * 4,
'A': np.arange(12),
'B': [1,2,3] * 4,
})
```

In the example above, the data can be split into three groups by key.

(1) `transform()` returns a Series that has the same length as the input

To demonstrate this, let’s make a function to produce an aggregated result.

```# Aggregating the sum of the given Series
def group_sum(x):
return x.sum()
```

For `apply()` , it returns one value for each group and the output shape is `(num_of_groups, 1)` .

```gr_data_ap = df.groupby('key')['A'].apply(group_sum)
gr_data_ap

key
a     9
b    12
c    15
Name: A, dtype: int64
```

For `transform()`, it returns a Series that has the same length as the given DataFrame and the output shape is `(len(df), 1)`

```gr_data_tr = df.groupby('key')['A'].transform(group_sum)
gr_data_tr

0     9
1    12
2    15
3     9
4    12
5    15
6     9
7    12
8    15
Name: A, dtype: int64
```

(2) `apply()` works with multiple Series at a time. But `transform()` is only allowed to work with a single Series at a time.

This is the same difference as we mentioned in “Manipulating values”, and we just don’t need to specify the argument `axis` on a `groupby()` result.

To demonstrate this, let’s create a function to work on 2 Series at a time.

```def subtract_two(x):
return x['B'] - x['A']
```

`apply()` works with multiple Series at a time.

```df.groupby('key').apply(subtract_two)
key
a    0    1
3   -2
6   -5
b    1    1
4   -2
7   -5
c    2    1
5   -2
8   -5
dtype: int64
```

However, we are getting a KeyError when trying the same with `transform()`:

```df.groupby('key').transform(subtract_two)
```

## Summary

For manipulating values, both `apply()` and `transform()` can be used to manipulate an entire DataFrame or any specific column. But there are 3 differences:

1. `transform()` can take a function, a string function, a list of functions, and a dictionary. However, `apply()` is only allowed a function.
2. `transform()` cannot produce aggregated results.
3. `apply()` works with multiple Series at a time. But, `transform()` is only allowed to work with a single Series at a time.

For working in conjunction with `groupby()`

1. `transform()` returns a Series that has the same length as the input.
2. `apply()` works with multiple Series at a time. But, `transform()` is only allowed to work with a single Series at a time.

Resources:

https://towardsdatascience.com/when-to-use-pandas-transform-function-df8861aa0dcf

https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff

https://towardsdatascience.com/difference-between-apply-and-transform-in-pandas-242e5cf32705 ##### Amir Masoud Sefidian
Machine Learning Engineer