To group by multiple columns and then find the variance of a variable within each group in a pandas DataFrame, you can use the groupby() and var() functions.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
"gender":["F","F","F","F","M","M","M","F","M"],
"age":[1,2,3,4,5,6,7,8,9],
"weight":[10,20,15,20,25,10,15,30,40]})
print(df)
print(df.groupby(["animal_type","gender"])["age"].var().rename('age_variance').reset_index())
#Output:
animal_type gender age weight
0 dog F 1 10
1 cat F 2 20
2 dog F 3 15
3 cat F 4 20
4 dog M 5 25
5 dog M 6 10
6 cat M 7 15
7 cat F 8 30
8 dog M 9 40
animal_type gender age_variance
0 cat F 9.333333
1 cat M NaN
2 dog F 2.000000
3 dog M 4.333333
When working with data, it is often useful to group and aggregate by multiple columns to understand the different segments of the data.
One such case is grouping your data and computing the variance of a variable within each group.
To get the variance of a variable by groups of columns in a pandas DataFrame, you can use the groupby() and var() functions.
Below is a simple example showing how to group by multiple columns and then get the variance of a variable for each group in a pandas DataFrame in Python.
In the example below, I’ve renamed the resulting variance column to ‘age_variance’ and reset the index so that the resulting DataFrame is easier to work with.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
"gender":["F","F","F","F","M","M","M","F","M"],
"age":[1,2,3,4,5,6,7,8,9],
"weight":[10,20,15,20,25,10,15,30,40]})
print(df)
print(df.groupby(["animal_type","gender"])["age"].var().rename('age_variance').reset_index())
#Output:
animal_type gender age weight
0 dog F 1 10
1 cat F 2 20
2 dog F 3 15
3 cat F 4 20
4 dog M 5 25
5 dog M 6 10
6 cat M 7 15
7 cat F 8 30
8 dog M 9 40
animal_type gender age_variance
0 cat F 9.333333
1 cat M NaN
2 dog F 2.000000
3 dog M 4.333333
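The NaN for the (cat, M) group appears because that group contains a single row, and var() computes the sample variance by default (ddof=1), which divides by n - 1. The sketch below assumes the population variance is acceptable for your analysis and passes ddof=0 instead; the variable names sample_var and population_var are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
                   "gender":["F","F","F","F","M","M","M","F","M"],
                   "age":[1,2,3,4,5,6,7,8,9],
                   "weight":[10,20,15,20,25,10,15,30,40]})

grouped = df.groupby(["animal_type","gender"])["age"]

# Default: sample variance (ddof=1); a group with one row gives NaN
sample_var = grouped.var()

# Population variance (ddof=0); a group with one row gives 0.0
population_var = grouped.var(ddof=0)

print(sample_var)
print(population_var)
```

Which divisor is appropriate depends on whether the rows are a sample or the whole population, so treat ddof=0 as an option rather than a fix.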
Using groupby() and var() on a Single Column in a pandas DataFrame
You can use groupby() to group a pandas DataFrame by one column or multiple columns.
If you want to group a pandas DataFrame by one column and then get the variance of a single variable in each group with var(), you can do the following.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
"gender":["F","F","F","F","M","M","M","F","M"],
"age":[1,2,3,4,5,6,7,8,9],
"weight":[10,20,15,20,25,10,15,30,40]})
print(df)
print(df.groupby(["animal_type"])["age"].var().rename('age_variance').reset_index())
#Output:
animal_type gender age weight
0 dog F 1 10
1 cat F 2 20
2 dog F 3 15
3 cat F 4 20
4 dog M 5 25
5 dog M 6 10
6 cat M 7 15
7 cat F 8 30
8 dog M 9 40
animal_type age_variance
0 cat 7.583333
1 dog 9.200000
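To see what var() computes for each group, the per-group result can be cross-checked against Python's standard library. This is only a sanity-check sketch: statistics.variance also returns the sample variance, and the names pandas_var and manual_var are illustrative.

```python
import statistics
import pandas as pd

df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
                   "gender":["F","F","F","F","M","M","M","F","M"],
                   "age":[1,2,3,4,5,6,7,8,9],
                   "weight":[10,20,15,20,25,10,15,30,40]})

# Variance of age per animal_type, computed by pandas
pandas_var = df.groupby("animal_type")["age"].var()

# Recompute each group's sample variance with the standard library
manual_var = {animal: statistics.variance(ages.tolist())
              for animal, ages in df.groupby("animal_type")["age"]}

print(pandas_var)
print(manual_var)
```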
If you want to group by a single column and find the variance of multiple variables, you can do the following. In this case, the column names will be the names of the original columns.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
"gender":["F","F","F","F","M","M","M","F","M"],
"age":[1,2,3,4,5,6,7,8,9],
"weight":[10,20,15,20,25,10,15,30,40]})
print(df)
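# Select multiple columns with a list (double brackets); recent pandas versions reject a bare tuple of column names here
print(df.groupby(["gender"])[["age","weight"]].var().reset_index())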
#Output:
animal_type gender age weight
0 dog F 1 10
1 cat F 2 20
2 dog F 3 15
3 cat F 4 20
4 dog M 5 25
5 dog M 6 10
6 cat M 7 15
7 cat F 8 30
8 dog M 9 40
gender age weight
0 F 7.300000 55.0
1 M 2.916667 175.0
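Because the result keeps the original column names, it can be hard to tell at a glance that the values are variances. One possible approach, sketched here with illustrative output names (age_variance and weight_variance are my own choices), is pandas named aggregation, which sets the output column names in the same step.

```python
import pandas as pd

df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
                   "gender":["F","F","F","F","M","M","M","F","M"],
                   "age":[1,2,3,4,5,6,7,8,9],
                   "weight":[10,20,15,20,25,10,15,30,40]})

# Named aggregation: new_column_name=(source_column, aggregation)
# as_index=False keeps "gender" as a regular column, so reset_index() is not needed
result = df.groupby("gender", as_index=False).agg(
    age_variance=("age", "var"),
    weight_variance=("weight", "var"),
)

print(result)
```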
Using groupby() with Multiple Columns and var() in a pandas DataFrame
If you want to group a pandas DataFrame by multiple columns and then get the variance of a variable for each group with var(), you can do the following.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"], "gender":["F","F","F","F","M","M","M","F","M"], "age":[1,2,3,4,5,6,7,8,9], "weight":[10,20,15,20,25,10,15,30,40]})
print(df)
print(df.groupby(["animal_type","gender"])["age"].var().rename('age_variance').reset_index())
#Output:
animal_type gender age weight
0 dog F 1 10
1 cat F 2 20
2 dog F 3 15
3 cat F 4 20
4 dog M 5 25
5 dog M 6 10
6 cat M 7 15
7 cat F 8 30
8 dog M 9 40
animal_type gender age_variance
0 cat F 9.333333
1 cat M NaN
2 dog F 2.000000
3 dog M 4.333333
If you want to group by multiple columns and find the variance of multiple variables, you can do the following. In this case, the column names will be the names of the original columns.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"], "gender":["F","F","F","F","M","M","M","F","M"], "age":[1,2,3,4,5,6,7,8,9], "weight":[10,20,15,20,25,10,15,30,40]})
print(df)
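# Select multiple columns with a list (double brackets); recent pandas versions reject a bare tuple of column names here
print(df.groupby(["animal_type","gender"])[["age","weight"]].var().reset_index())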
#Output:
animal_type gender age weight
0 dog F 1 10
1 cat F 2 20
2 dog F 3 15
3 cat F 4 20
4 dog M 5 25
5 dog M 6 10
6 cat M 7 15
7 cat F 8 30
8 dog M 9 40
animal_type gender age weight
0 cat F 9.333333 33.333333
1 cat M NaN NaN
2 dog F 2.000000 12.500000
3 dog M 4.333333 225.000000
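When some groups come back as NaN, it can help to see how many rows each group contains. The sketch below reports the group size next to the variance, using named aggregation; the column name n is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
                   "gender":["F","F","F","F","M","M","M","F","M"],
                   "age":[1,2,3,4,5,6,7,8,9],
                   "weight":[10,20,15,20,25,10,15,30,40]})

# Report the number of rows per group alongside the variance of age
result = (df.groupby(["animal_type","gender"])
            .agg(n=("age", "count"), age_variance=("age", "var"))
            .reset_index())

print(result)
```

Groups with n equal to 1 are the ones where the sample variance is undefined and shows up as NaN.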
Hopefully this article has helped you learn how to group data and compute the variance of each group in pandas with groupby() and var().