31 mins read
## 1. Ways to Create Multi-Level / Hierarchical Index

#### MultiIndex

#### Hierarchical Indexing

## 2. Creating Sample DataFrames with MultiIndex

#### DataFrame with MultiIndex for Rows (axis=0)

#### DataFrame with Hierarchical Index for Rows (axis=0)

#### DataFrame with Hierarchical Index for Rows (axis=0) and Columns (axis=1)

#### DataFrame with Hierarchical Index for Rows (axis=0) and Columns (axis=1)

## 3. Indexing DataFrame

#### Using slice() Function

#### Range Indexing

#### Using xs() for Indexing

## A Multiply Indexed Series

### The bad way

### The Better Way: Pandas MultiIndex

### MultiIndex as an extra dimension

## Methods of MultiIndex Creation

### Explicit MultiIndex constructors

### MultiIndex level names

### MultiIndex for columns

## Indexing and Slicing a MultiIndex

### Multiply indexed Series

### Multiply indexed DataFrames

## Rearranging Multi-Indices

### Sorted and unsorted indices

### Stacking and unstacking indices

### Index setting and resetting

## Data Aggregations on Multi-Indices

**Pandas** is the go-to library when for data analysis when working with tabular datasets. It is the best solution available for working with tabular datasets which fit in the main memory of the computer. Pandas provide two data structures for working with data (1. Series 2. DataFrame). Pandas let us give labels to each row by providing **index** values and name columns of data as well. These labels can then be used to filter pandas’ dataframe/series in a particular dimension. We can use **index** values to filter rows of a data frame as well as perform operations on them. If we don’t provide **index** values then pandas will internally assign integers from 0 till the end of the data. It creates **RangeIndex** objects by default for both row labels and column labels. Internally pandas use various kinds of **Index** objects based on the label’s data type to store labels for axes. We can access **Index** object for rows (axis=0) by calling **‘.index’** property of dataframe and for columns (axis=1) by calling **‘.columns’** property of dataframe. Pandas internally represent labels of both rows and columns using **Index** objects of various types based on the data type of labels.

The majority of uses case of using **Pandas** dataframe/series requires a single value per label. We generally have a single label for entry in a particular axis (single label for a particular row / single label for a column). But there can be situations where we want more than one value to be labels of row or column of data. This lets us represent high-dimensional data in our **2D** data structure dataframe or **1D** data structure series. This kind of indexing is generally referred to as **Multi-Level Indexing** where we have more than one label value per row/column. We have more than one level in indexing. The values inside a single label can be unrelated or they can be related to each other where the value at a lower level is a sub-value of a higher-level value (E.g – In a multi-level index, first-level values can be a year, second-level values can be months and third-level values can be the day. Here, lower-level values are sub-values of higher level ). When values in the index at lower levels are sub-values of higher-level values then the index is generally referred to as **Hierarchical Index**.

One-dimensional and two-dimensional data can be stored in Pandas `Series`

and `DataFrame`

objects, respectively. Often it is useful to go beyond this and store higher-dimensional data–that is, data indexed by more than one or two keys. While Pandas does provide `Panel`

and `Panel4D`

objects that natively handle three-dimensional and four-dimensional data, a far more common pattern in practice is to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional `Series`

and two-dimensional `DataFrame`

objects.

In this post, we’ll explore the direct creation of `MultiIndex`

objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.

Below we have imported pandas and printed the version of them that we have used in this tutorial. We have also imported numpy as we’ll be using it to create data for our pandas dataframe.

```
import pandas as pd
import numpy as np
```

In this section, we’ll explain how we can create **MultiIndex** object which is used by pandas to represent an index that has more than one value per label of data. We can use **MultiIndex** object to represent row labels as well as columns labels. Pandas provide 4 different methods which are available as factory methods from **MultiIndex** class to create **MultiIndex** objects.

**from_tuples()**– This method takes a list of tuples as input and creates a**MultiIndex**object from it. Each tuple represents one label that will uniquely identify one row/column.**from_arrays()**– This method takes as input a list of arrays and creates a**MultiIndex**object from it. Each array should have the same length as a total number of labels. The first values from each input array will create the first label, the second values will create the second label, and so on.**from_product()**– This method takes as input more than one iterator and performs cartesian product on input iterators to create labels.**from_frame()**– This method takes as input another pandas dataframe and uses its values to create**MultiIndex**object.

Below we have created two lists with a list of labels. We have then combined both lists using **zip()** python method. We’ll be creating **MultiIndex** object using this list of tuples.

```
index1 = ["A", "B", "C", "D", "E"]
index2 = [1,2,3,4,5]
combined_idx = list(zip(index1, index2))
combined_idx
[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('E', 5)]
```

Below we have created **MultiIndex** object using **from_tuples()** method by providing a list of tuples that we created in the previous cell as input. We can also provide names to individual levels of **MultiIndex** object using **names** parameter.

```
combined_idx_obj = pd.MultiIndex.from_tuples(combined_idx, names=["Index1", "Index2"])
combined_idx_obj
MultiIndex([('A', 1),
('B', 2),
('C', 3),
('D', 4),
('E', 5)],
names=['Index1', 'Index2'])
```

Below, we have created **MultiIndex** object using **from_arrays()** method by providing two lists we had declared earlier as input. We can notice that the results are exactly the same as **from_tuples()** method.

```
combined_idx_obj = pd.MultiIndex.from_arrays([index1, index2], names=["Index1", "Index2"])
combined_idx_obj
MultiIndex([('A', 1),
('B', 2),
('C', 3),
('D', 4),
('E', 5)],
names=['Index1', 'Index2'])
```

Below, we have created a **MultiIndex** object using **from_product()** method by providing two lists we had defined earlier. We can notice that **from_product()** method multiplied all values of the first list with all values of the second list (cartesian product).

```
combined_idx_obj = pd.MultiIndex.from_product([index1, index2], names=["Index1", "Index2"])
combined_idx_obj
MultiIndex([('A', 1),
('A', 2),
('A', 3),
('A', 4),
('A', 5),
('B', 1),
('B', 2),
('B', 3),
('B', 4),
('B', 5),
('C', 1),
('C', 2),
('C', 3),
('C', 4),
('C', 5),
('D', 1),
('D', 2),
('D', 3),
('D', 4),
('D', 5),
('E', 1),
('E', 2),
('E', 3),
('E', 4),
('E', 5)],
names=['Index1', 'Index2'])
```

Below, we have created a dataframe using two lists that we had defined earlier. We’ll be using this dataframe to create **MultiIndex** object next.

```
df = pd.DataFrame(zip(index1,index2), columns=["Index1", "index2"])
df
```

Index1 | index2 | |
---|---|---|

0 | A | 1 |

1 | B | 2 |

2 | C | 3 |

3 | D | 4 |

4 | E | 5 |

Below, we have created a **MultiIndex** object using **from_frame()** method by providing the dataframe we created in the previous cell.

```
combined_idx_obj = pd.MultiIndex.from_frame(df)
combined_idx_obj
MultiIndex([('A', 1),
('B', 2),
('C', 3),
('D', 4),
('E', 5)],
names=['Index1', 'index2'])
```

In this section, we have created a **MultiIndex** object which represents a hierarchical indexing example. We have created **MultiIndex** object with 3 levels. We have given three lists as input to **from_product()** method. The first list is a single value list of the year, the second list is a 2-value list of months and the third list is integers in the range 1-31 representing days.

```
year = [2021,]
months = [1,2]
days = list(range(1,32))
combined_idx_obj = pd.MultiIndex.from_product([year,months, days], names=["Year", "Month","Day"])
combined_idx_obj[:5]
MultiIndex([(2021, 1, 1),
(2021, 1, 2),
(2021, 1, 3),
(2021, 1, 4),
(2021, 1, 5)],
names=['Year', 'Month', 'Day'])
```

We can retrieve unique values per level of **MultiIndex** object using **levels** property of **MultiIndex** object.

```
combined_idx_obj.levels
FrozenList([[2021], [1, 2], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]])
```

We can retrieve a list of values per level using **get_level_values()** method by providing integer or level name as input.

```
combined_idx_obj.get_level_values(0)
Int64Index([2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021],
dtype='int64', name='Year')
combined_idx_obj.get_level_values("Month")
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
dtype='int64', name='Month')
```

In this section, we’ll create pandas **DataFrames** which will have either index or column labels or both represented with **MultiIndex** object. We’ll be using various dataframes created in this section, later on, to explain indexing dataframe with multi-level indexing.

```
np.random.seed(123)
```

Below we have created a**MultiIndex** object using **from_arrays()** method by providing two lists to it. Then in the next cell, we have created a pandas dataframe with random numbers whose **index** is set as **MultiIndex** object we created.

```
index1 = ["A", "B", "C", "D", "E"]
index2 = [1,2,3,4,5]
combined_idx_obj = pd.MultiIndex.from_arrays([index1, index2], names=["Index1", "Index2"])
combined_idx_obj
MultiIndex([('A', 1),
('B', 2),
('C', 3),
('D', 4),
('E', 5)],
names=['Index1', 'Index2'])
```

```
data= np.random.rand(len(combined_idx_obj), 5)
df1 = pd.DataFrame(data=data, columns=["A1", "A2", "A3", "A4", "A5"], index=combined_idx_obj)
df1
```

In this section, we have created a hierarchical index using **from_product()** method by providing three lists to it. We have then created a pandas dataframe whose **index** is set as **MultiIndex** object representing hierarchical index.

```
year = [2021,]
months = [1,2]
days = list(range(1,32))
combined_idx_obj = pd.MultiIndex.from_product([year,months, days], names=["Year", "Month","Day"])
combined_idx_obj[:5]
MultiIndex([(2021, 1, 1),
(2021, 1, 2),
(2021, 1, 3),
(2021, 1, 4),
(2021, 1, 5)],
names=['Year', 'Month', 'Day'])
```

```
data = np.random.rand(len(combined_idx_obj), 5)
df2 = pd.DataFrame(data=data, index=combined_idx_obj, columns=["C1", "C2", "C3", "C4", "C5"])
df2
```

In this section, we have created examples where both row and column headings will be represented using **MultiIndex** object.

First, we have created hierarchical **MultiIndex** just like our previous example to represent rows of dataframe. Then, we have created another hierarchical **MultiIndex** object using **from_product()** method which will be used to represent column labels.

```
year = [2021,]
months = [1,2]
days = list(range(1,32))
combined_idx_obj = pd.MultiIndex.from_product([year,months, days], names=["Year", "Month","Day"])
columns_idx = pd.MultiIndex.from_product([["Location1","Location2"],
["Item1", "Item2", "Item3", "Item4", "Item5"]],
names=["Location", "Item"]
)
columns_idx
MultiIndex([('Location1', 'Item1'),
('Location1', 'Item2'),
('Location1', 'Item3'),
('Location1', 'Item4'),
('Location1', 'Item5'),
('Location2', 'Item1'),
('Location2', 'Item2'),
('Location2', 'Item3'),
('Location2', 'Item4'),
('Location2', 'Item5')],
names=['Location', 'Item'])
```

Below, we have created a pandas dataframe of random numbers whose **index** and **columns** are represented using **MultiIndex** objects.

```
data = np.random.randint(1, 1000, size=(len(combined_idx_obj), len(columns_idx)))
df3 = pd.DataFrame(data=data, index=combined_idx_obj, columns=columns_idx)
df3
```

In this section, we have created another example where we are using **MultiIndex** object to represent labels of rows and columns. We have this time used a list of dates as a second-level index for rows.

```
locations = ["Location1", "Location2"]
rng = pd.date_range(start="2021-1-1", end="2021-2-28", freq="D")
rows_idx = pd.MultiIndex.from_product([locations,rng], names=["Location", "Date"])
columns_idx = pd.MultiIndex.from_product([["Category1","Category2"],
["Item1", "Item2", "Item3", "Item4", "Item5"]],
names=["Category", "Item"])
columns_idx
MultiIndex([('Category1', 'Item1'),
('Category1', 'Item2'),
('Category1', 'Item3'),
('Category1', 'Item4'),
('Category1', 'Item5'),
('Category2', 'Item1'),
('Category2', 'Item2'),
('Category2', 'Item3'),
('Category2', 'Item4'),
('Category2', 'Item5')],
names=['Category', 'Item'])
```

```
data = np.random.randint(1, 1000, size=(len(rows_idx), len(columns_idx)))
df4 = pd.DataFrame(data=data, index=rows_idx, columns=columns_idx)
df4
```

Below, we have retrieved single level from **MultiIndex** object using **get_level_values()** method and we can notice that it’s **DatetimeIndex** object. Internally, **MultiIndex** object is composed of more than one **Index** object.

```
df4.index.get_level_values(1)
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10',
...
'2021-02-19', '2021-02-20', '2021-02-21', '2021-02-22',
'2021-02-23', '2021-02-24', '2021-02-25', '2021-02-26',
'2021-02-27', '2021-02-28'],
dtype='datetime64[ns]', name='Date', length=118, freq=None)
```

In this section, we’ll explain how we can perform indexing on a dataframe whose **index** or **column** labels are represented using **MultiIndex** objects. We’ll explain various ways of indexing multi-level index in this section. We’ll be using **‘.loc’** property of pandas dataframe to perform indexing as it accepts actual index values which can be of any type to perform the indexing. We won’t be covering indexing using **‘.iloc’** property which accepts integers for indexing. We’ll also explain indexing using **xs()** function.

We can provide labels of 1st levels of **MultiIndex** object directly to **‘.loc’** property and it’ll retrieve all rows/columns that satisfy that 1st level.

Below we have retrieved rows of pandas dataframe whose 1st level value for **MultiIndex** object is **‘A’**.

```
```

df1.loc["A"]

A1 | A2 | A3 | A4 | A5 | |
---|---|---|---|---|---|

Index2 | |||||

1 | 0.696469 | 0.286139 | 0.226851 | 0.551315 | 0.719469 |

Below, we have provided a list of 1st level values to **‘.loc’** property to retrieve rows where these three values are 1st level values of the index.

```
```

df1.loc[["A","B","D"]]

A1 | A2 | A3 | A4 | A5 | ||
---|---|---|---|---|---|---|

Index1 | Index2 | |||||

A | 1 | 0.696469 | 0.286139 | 0.226851 | 0.551315 | 0.719469 |

B | 2 | 0.423106 | 0.980764 | 0.684830 | 0.480932 | 0.392118 |

D | 4 | 0.737995 | 0.182492 | 0.175452 | 0.531551 | 0.531828 |

Below, we have provided indexing values for both rows and columns to **‘.loc’** property. The values provided for **index** is the same as our previous example.

```
```

df1.loc[["A","B","D"], ["A1","A3","A5"]]

A1 | A3 | A5 | ||
---|---|---|---|---|

Index1 | Index2 | |||

A | 1 | 0.696469 | 0.226851 | 0.719469 |

B | 2 | 0.423106 | 0.684830 | 0.392118 |

D | 4 | 0.737995 | 0.175452 | 0.531828 |

Below, we have created an example to explain how we can give a tuple of values to an index dataframe. Each tuple has two values representing values for both levels of **MultiIndex** object. This kind of indexing will retrieve rows/columns which exactly match the combinations of values provided.

```
```

df1.loc[[("A",1),("B",2),("D",4)]]

A1 | A2 | A3 | A4 | A5 | ||
---|---|---|---|---|---|---|

Index1 | Index2 | |||||

A | 1 | 0.696469 | 0.286139 | 0.226851 | 0.551315 | 0.719469 |

B | 2 | 0.423106 | 0.980764 | 0.684830 | 0.480932 | 0.392118 |

D | 4 | 0.737995 | 0.182492 | 0.175452 | 0.531551 | 0.531828 |

Below we have again used only the 1st level value of **MultiIndex** object to the index dataframe.

```
df2.loc[2021]
```

Below, we have provided a two-value tuple to index dataframe which has **3-level index**.

```
df2.loc[(2021,1),]
```

Below we have provided indexing values for all three levels to index the dataframe.

```
df2.loc[(2021,1, 1),]
C1 0.322959
C2 0.361789
C3 0.228263
C4 0.293714
C5 0.630976
Name: (2021, 1, 1), dtype: float64
```

In the below cell, we have explained an example where the second level of the index has date-time values. We can provide date-time values as either strings or as date-time objects.

```
df4.loc[("Location1","2021-1-1"),]
```

The **slice(start,end[,step])** function is normally used to create range of values. It has the same format as integer indexing on numpy arrays but **slice()** can also accept any data type as input and is used to represent a range of values for that data type based on values present in an index of dataframe/series.

We can call **slice()** function with just **None** value and it will represent all values in a particular level of **MultiIndex** object.

Below we have filtered our dataframe with a 2-level index. We have provided a tuple of two values where the first value is **slice(None)** and the second value is a list of three integers. This will inform the dataframe to take all entries where second-level values belong to one of the provided three integers. It’ll consider all first-level values which have specified second-level values.

```
df1.loc[(slice(None), [1,2,3]),]
```

A1 | A2 | A3 | A4 | A5 | ||
---|---|---|---|---|---|---|

Index1 | Index2 | |||||

A | 1 | 0.696469 | 0.286139 | 0.226851 | 0.551315 | 0.719469 |

B | 2 | 0.423106 | 0.980764 | 0.684830 | 0.480932 | 0.392118 |

C | 3 | 0.343178 | 0.729050 | 0.438572 | 0.059678 | 0.398044 |

In the below cell, we have filtered our data set at row level using the same logic as our previous example. This time, we have provided filters for columns as well.

```
row_filter = (slice(None), [1,2,3])
column_filter = ["A1","A3","A5"]
df1.loc[row_filter, column_filter]
```

A1 | A3 | A5 | ||
---|---|---|---|---|

Index1 | Index2 | |||

A | 1 | 0.696469 | 0.226851 | 0.719469 |

B | 2 | 0.423106 | 0.684830 | 0.392118 |

C | 3 | 0.343178 | 0.438572 | 0.398044 |

In the below cell, we have filtered rows of our dataframe which has a three-level index. For the first level, we have provided a single value. For the second level, we have provided **slice(None)** which will consider all values based on the values of other levels. For the third level, we have provided five values. This will consider all second-level values where the first-level value is 2021 and the third-level value is one of the five input integers.

```
df3.loc[(2021, slice(None), [1,2,3,4,5]),]
```

Below, we have created another example demonstrating the usage of **slice()** function.

```
df3.loc[(2021, [2,], slice(None)),]
```

Below, we have created an example, where we are indexing a dataframe whose both row and column labels are **MultiIndex** objects. For filtering rows, we have used the same tuple of three values that we had used in one of our earlier examples. For filtering columns, we have used a tuple of two values where the first value is a single string and the second value is a list of three strings.

```
df3.loc[(2021, slice(None), [1,2,3,4,5]), (["Location1"], ["Item1","Item2", "Item3"])]
```

In the below cell, we have tried to reproduce our example from the previous cell but this time instead of providing 5 integers for the last level of filtering rows, we have provided **slice(1, 5)** function. It’ll have the same effect as our previous example.

```
df3.loc[(2021,slice(None), slice(1,5)), (["Location1"], ["Item1","Item2", "Item3"])]
```

In the below cell, we have filtered rows of our dataframe where we had datetime index as a part of **MultiIndex** object.

```
df4.loc[(slice(None), ["2021-1-1", "2021-1-2", "2021-1-3"]),]
```

In this section, we’ll explain how we can specify a range of values using **‘:’** operator for filtering rows/columns of our dataframe.

Below we have filtered our first dataframe by providing value **‘”A”:”C”‘** for indexing. This informs the **‘.loc’** property that keep rows of dataframe where first level index values ranges from **‘A’** to **‘C’** which is three values **[‘A’,’B’,’C’]**.

```
df1.loc["A":"C",]
```

A1 | A2 | A3 | A4 | A5 | ||
---|---|---|---|---|---|---|

Index1 | Index2 | |||||

A | 1 | 0.696469 | 0.286139 | 0.226851 | 0.551315 | 0.719469 |

B | 2 | 0.423106 | 0.980764 | 0.684830 | 0.480932 | 0.392118 |

C | 3 | 0.343178 | 0.729050 | 0.438572 | 0.059678 | 0.398044 |

Below we have created another example where we have provided tuples to filter rows of dataframe. This will keep all entries in the dataframe starting from the first tuple index values till the last tuple index values.

```
df1.loc[("A",1): ("C",3),]
```

A1 | A2 | A3 | A4 | A5 | ||
---|---|---|---|---|---|---|

Index1 | Index2 | |||||

A | 1 | 0.696469 | 0.286139 | 0.226851 | 0.551315 | 0.719469 |

B | 2 | 0.423106 | 0.980764 | 0.684830 | 0.480932 | 0.392118 |

C | 3 | 0.343178 | 0.729050 | 0.438572 | 0.059678 | 0.398044 |

In the below cell, we have created another example demonstrating the usage of range filtering where we have filtered rows of dataframe as well as columns of dataframe using range operator **(‘:’)**.

```
df2.loc[(2021,1,1):(2021,1,15), "C3":"C5"]
```

In the below cell, we have created another example demonstrating the usage of the range operator for filtering rows and columns of the dataframe. This time we have used tuples for columns as well.

```
df3.loc[(2021,1,1):(2021,1,15), ("Location1", "Item4"): ("Location2","Item1")]
```

In the below cell, we have used range indexing on our dataframe where values of one level are of datetime data type.

```
df4.loc["Location1", "2021-1-1":"2021-1-5", :]
```

In the below cell, we have created one more example demonstrating the usage of range operator to filter rows of dataframe where one level values are of datetime data type. We have provided datetime values as strings.

```
df4.loc["Location1":"Location2", "2021-1-1":"2021-1-5", :]
```

In this section, we’ll explain how we can use **xs()** function available from the pandas dataframe to filter rows/columns of the dataframe.

**xs(key,axis=0,level=None)**– This method takes as input labels of rows/columns along with axis value (0-rows,1-columns) and level names of**MultiIndex**object and returns dataframe with specified labels present in it.- The
**key**parameter takes as input single label or tuple of labels from our multi-level indexes. The rows/columns which have these label values will be kept and all others will be dropped. - The axis method accepts either 0 (rows) or 1 (columns) as value. The value of 0 will consider label values provided to key parameter across the index of dataframe and value of ‘1’ will consider values provided to key parameter across columns of the dataframe.
- The level accepts a single value or tuple of values representing level names for a multi-level index.

- The

The benefit of using this method is that we can provide information about individual levels and all other level labels will be handled internally without we need to explicitly handle them.

Below we have retrieved rows from our dataframe where values for level **‘Day’** is 1. There are two rows in the dataframe where the value of level **‘Day’** is 1.

```
df2.xs(1, level="Day")
```

We can also provide **slice()** function call to **xs()** method.

Below we have retrieved rows from our dataframe where values for level **‘Day’** are in the range **[1-3]**. There are six values that satisfy this condition because we have **‘Day’** values **[1,2,3]** present for month values **1 and 2**.

```
df2.xs(slice(1,3), level="Day")
```

In the below cell, we have explained how we can use **xs()** method with columns of our dataframe.

We have filtered columns of our dataframe where level named **‘Location’** has values **‘Location1’**. There are 5 columns that satisfy this condition.

```
df3.xs("Location1", level="Location", axis=1)
```

In the below cell, we have filtered columns of our dataframe using **xs()** function where label values are provided by calling **slice()** function. We have filtered columns of the dataframe where labels of level **‘Item’** are one of **[‘Item1′,’Item2’,’Item3]**.

```
df3.xs(slice("Item1", "Item3"), level="Item", axis=1)
```

In the below cell, we have created another example demonstrating usage of **xs()** function. This time we have explained how we can give labels for more than one level. We have provided values for three levels of our dataframe and only one row satisfies those labels.

```
df3.xs((2021,1,1), level=("Year", "Month","Day"))
```

In the below cell, we have created another example demonstrating how we can provide labels for multiple levels to **xs()** method.

```
df4.xs(("Location1","2021-1-1"), level=("Location", "Date"))
```

In the below cell, we have again created an example demonstrating how to give labels for multiple levels to **xs()** method but this time we have used **slice()** call as one of the level’s labels.

```
df3.xs((2021,slice(None),1), level=("Year", "Month","Day"))
```

In the next cell, we have created another example demonstrating usage of **xs()** method. This time we have used two **slice()** function call. One of the calls refers to all labels at that level and another call refers to the range of values.

```
df3.xs((2021,slice(None),slice(1,5)), level=("Year", "Month","Day"))
```

In our last example, we have explained how we can use **xs()** function when one of the level values is of datetime data type.

```
df4.xs(("Location1",slice("2021-1-1", "2021-1-5")), level=("Location", "Date"))
```

Let’s start by considering how we might represent two-dimensional data within a one-dimensional `Series`

. For concreteness, we will consider a series of data where each point has a character and numerical key.

Suppose you would like to track data about states from two different years. Using the Pandas tools we’ve already covered, you might be tempted to simply use Python tuples as keys:

```
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000) 33871648
(California, 2010) 37253956
(New York, 2000) 18976457
(New York, 2010) 19378102
(Texas, 2000) 20851820
(Texas, 2010) 25145561
dtype: int64
```

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

```
pop[('California', 2010):('Texas', 2000)]
(California, 2010) 37253956
(New York, 2000) 18976457
(New York, 2010) 19378102
(Texas, 2000) 20851820
dtype: int64
```

But the convenience ends there. For example, if you need to select all values from 2010, you’ll need to do some messy (and potentially slow) munging to make it happen:

```
pop[[i for i in pop.index if i[1] == 2010]]
(California, 2010) 37253956
(New York, 2010) 19378102
(Texas, 2010) 25145561
dtype: int64
```

This produces the desired result but is not as clean (or as efficient for large datasets) as the slicing syntax we’ve grown to love in Pandas.

Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas `MultiIndex`

type gives us the type of operations we wish to have. We can create a multi-index from the tuples as follows:

```
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
```

Notice that the `MultiIndex`

contains multiple *levels* of indexing–in this case, the state names and the years, as well as multiple *labels* for each data point that encode these levels.

If we re-index our series with this `MultiIndex`

, we see the hierarchical representation of the data:

```
pop = pop.reindex(index)
pop
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
```

Here are the first two columns of the `Series`

representation show the multiple index values, while the third column shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

```
pop[:, 2010]
California 37253956
New York 19378102
Texas 25145561
dtype: int64
```

The result is a singly indexed array with just the keys we’re interested in. This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with. We’ll now further discuss this sort of indexing operation on hierarchically indexed data.

You might notice something else here: we could easily have stored the same data using a simple `DataFrame`

with index and column labels. In fact, Pandas is built with this equivalence in mind. The `unstack()`

method will quickly convert a multiply indexed `Series`

into a conventionally indexed `DataFrame`

:

```
pop_df = pop.unstack()
pop_df
```

2000 | 2010 | |
---|---|---|

California | 33871648 | 37253956 |

New York | 18976457 | 19378102 |

Texas | 20851820 | 25145561 |

Naturally, the `stack()`

method provides the opposite operation:

```
pop_df.stack()
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
```

Seeing this, you might wonder why would we would bother with hierarchical indexing at all. The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional `Series`

, we can also use it to represent data of three or more dimensions in a `Series`

or `DataFrame`

. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18) ; with a `MultiIndex`

this is as easy as adding another column to the `DataFrame`

:

```
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
```

In addition, all the ufuncs and other functionality discussed in Operating on Data in Pandas work with hierarchical indices as well. Here we compute the fraction of people under 18 by year, given the above data:

```
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
```

2000 | 2010 | |
---|---|---|

California | 0.273594 | 0.249211 |

New York | 0.247010 | 0.222831 |

Texas | 0.283251 | 0.273568 |

This allows us to easily and quickly manipulate and explore even high-dimensional data.

The most straightforward way to construct a multiply indexed `Series`

or `DataFrame`

is to simply pass a list of two or more index arrays to the constructor. For example:

```
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
```

The work of creating the `MultiIndex`

is done in the background.

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a `MultiIndex`

by default:

```
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
```

Nevertheless, it is sometimes useful to explicitly create a `MultiIndex`

; we’ll see a couple of these methods here.

For more flexibility in how the index is constructed, you can instead use the class method constructors available in the `pd.MultiIndex`

. For example, as we did before, you can construct the `MultiIndex`

from a simple list of arrays giving the index values within each level:

```
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

You can construct it from a list of tuples giving the multiple index values of each point:

```
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

You can even construct it from a Cartesian product of single indices:

```
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

Similarly, you can construct the `MultiIndex`

directly using its internal encoding by passing `levels`

(a list of lists containing available index values for each level) and `labels`

(a list of lists that reference these labels):

```
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

Any of these objects can be passed as the `index`

argument when creating a `Series`

or `Dataframe`

, or be passed to the `reindex`

method of an existing `Series`

or `DataFrame`

.

Sometimes it is convenient to name the levels of the `MultiIndex`

. This can be accomplished by passing the `names`

argument to any of the above `MultiIndex`

constructors, or by setting the `names`

attribute of the index after the fact:

```
pop.index.names = ['state', 'year']
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
```

With more involved datasets, this can be a useful way to keep track of the meaning of various index values.

In a `DataFrame`

, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:

```
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
```

Here we see where the multi-indexing for both rows and columns can come in *very* handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number. With this in place, we can, for example, index the top-level column by the person’s name and get a full `DataFrame`

containing just that person’s information:

```
health_data['Guido']
```

For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.) use of hierarchical rows and columns can be extremely convenient!

Indexing and slicing on a `MultiIndex`

is designed to be intuitive, and it helps if you think about the indices as added dimensions. We’ll first look at indexing multiply indexed `Series`

, and then multiply-indexed `DataFrame`

s.

Consider the multiply indexed `Series`

of state populations we saw earlier:

```
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
```

We can access single elements by indexing with multiple terms:

```
pop['California', 2000]
33871648
```

The `MultiIndex`

also supports *partial indexing*, or indexing just one of the levels in the index. The result is another `Series`

, with the lower-level indices maintained:

```
pop['California']
year
2000 33871648
2010 37253956
dtype: int64
```

Partial slicing is available as well, as long as the `MultiIndex`

is sorted:

```
pop.loc['California':'New York']
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
```

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

```
pop[:, 2000]
state
California 33871648
New York 18976457
Texas 20851820
dtype: int64
```

Other types of indexing and selection (work as well; for example, selection based on Boolean masks:

```
pop[pop > 22000000]
state year
California 2000 33871648
2010 37253956
Texas 2010 25145561
dtype: int64
```

Selection based on fancy indexing also works:

```
pop[['California', 'Texas']]
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype: int64
```

A multiply indexed `DataFrame`

behaves in a similar manner. Consider our toy medical `DataFrame`

from before:

```
health_data
```

Remember that columns are primary in a `DataFrame`

, and the syntax used for multiply indexed `Series`

applies to the columns. For example, we can recover Guido’s heart rate data with a simple operation:

```
health_data['Guido', 'HR']
year visit
2013 1 32.0
2 50.0
2014 1 39.0
2 48.0
Name: (Guido, HR), dtype: float64
```

Also, as with the single-index case, we can use the `loc`

, `iloc`

, and `ix`

indexers. For example:

```
health_data.iloc[:2, :2]
```

These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in `loc`

or `iloc`

can be passed a tuple of multiple indices. For example:

```
health_data.loc[:, ('Bob', 'HR')]
year visit
2013 1 31.0
2 44.0
2014 1 30.0
2 47.0
Name: (Bob, HR), dtype: float64
```

Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:

```
health_data.loc[(:, 1), (:, 'HR')]
File "<ipython-input-32-8e3cc151e316>", line 1
health_data.loc[(:, 1), (:, 'HR')]
^
SyntaxError: invalid syntax
```

You could get around this by building the desired slice explicitly using Python’s built-in `slice()`

function, but a better way in this context is to use an `IndexSlice`

object, which Pandas provides for precisely this situation. For example:

```
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
```

There are so many ways to interact with data in multiply indexed `Series`

and `DataFrame`

s, and as with many tools in this book the best way to become familiar with them is to try them out!

One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. We saw a brief example of this in the `stack()`

and `unstack()`

methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we’ll explore them here.

Earlier, we briefly mentioned a caveat, but we should emphasize it more here. *Many of the MultiIndex slicing operations will fail if the index is not sorted.* Let’s take a look at this here.

We’ll start by creating some simple multiply indexed data where the indices are *not lexicographically sorted*:

```
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
char int
a 1 0.003001
2 0.164974
c 1 0.741650
2 0.569264
b 1 0.001693
2 0.526226
dtype: float64
```

If we try to take a partial slice of this index, it will result in an error:

```
try:
data['a':'b']
except KeyError as e:
print(type(e))
print(e)
<class 'KeyError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'
```

Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted. For various reasons, partial slices and other similar operations require the levels in the `MultiIndex`

to be in sorted (i.e., lexicographical) order. Pandas provides a number of convenience routines to perform this type of sorting; examples are the `sort_index()`

and `sortlevel()`

methods of the `DataFrame`

. We’ll use the simplest, `sort_index()`

, here:

```
data = data.sort_index()
data
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
c 1 0.741650
2 0.569264
dtype: float64
```

With the index sorted in this way, partial slicing will work as expected:

```
data['a':'b']
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
dtype: float64
```

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

```
pop.unstack(level=0)
```

state | California | New York | Texas |
---|---|---|---|

year | |||

2000 | 33871648 | 18976457 | 20851820 |

2010 | 37253956 | 19378102 | 25145561 |

```
pop.unstack(level=1)
```

year | 2000 | 2010 |
---|---|---|

state | ||

California | 33871648 | 37253956 |

New York | 18976457 | 19378102 |

Texas | 20851820 | 25145561 |

The opposite of `unstack()`

is `stack()`

, which here can be used to recover the original series:

```
pop.unstack().stack()
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
```

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the `reset_index`

method. Calling this on the population dictionary will result in a `DataFrame`

with a *state* and *year* column holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:

```
pop_flat = pop.reset_index(name='population')
pop_flat
```

state | year | population | |
---|---|---|---|

0 | California | 2000 | 33871648 |

1 | California | 2010 | 37253956 |

2 | New York | 2000 | 18976457 |

3 | New York | 2010 | 19378102 |

4 | Texas | 2000 | 20851820 |

5 | Texas | 2010 | 25145561 |

Often when working with data in the real world, the raw input data looks like this and it’s useful to build a `MultiIndex`

from the column values. This can be done with the `set_index`

method of the `DataFrame`

, which returns a multiply indexed `DataFrame`

:

```
pop_flat.set_index(['state', 'year'])
```

In practice, I find this type of reindexing to be one of the more useful patterns when encountering real-world datasets.

We’ve previously seen that Pandas has built-in data aggregation methods, such as `mean()`

, `sum()`

, and `max()`

. For hierarchically indexed data, these can be passed a `level`

parameter that controls which subset of the data the aggregate is computed on.

For example, let’s return to our health data:

```
health_data
```

Perhaps we’d like to average out the measurements in the two visits each year. We can do this by naming the index level we’d like to explore, in this case, the year:

```
data_mean = health_data.mean(level='year')
data_mean
```

By further making use of the `axis`

keyword, we can take the mean among levels on the columns as well:

```
data_mean.mean(axis=1, level='type')
```

type | HR | Temp |
---|---|---|

year | ||

2013 | 36.833333 | 37.000000 |

2014 | 46.000000 | 37.283333 |

Thus in two lines, we’ve been able to find the average heart rate and temperature measured among all subjects in all visits each year. This syntax is actually a shortcut to the GroupBy functionality. While this is a toy example, many real-world datasets have a similar hierarchical structure.

Resources:

https://coderzcolumn.com/tutorials/python/guide-to-pandas-multi-level-hierarchical-index

https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html