Grouping and Aggregation¶

Sorting, Grouping, and Aggregating in Pandas¶

In the previous lesson, we learned how to create Pandas DataFrames, how to access information stored within a DataFrame, and how to perform filtering operations on the rows of a DataFrame. In this lesson we will learn how to sort rows of a DataFrame according to specific columns, as well as how to performing grouping and aggregation operations.

We begin by importing a few packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-fa18575525b6> in <module>
      1 import numpy as np
----> 2 import pandas as pd
      3 import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'pandas'

Gapminder Dataset¶

To illustrate the concepts covered in this lesson, we will make use of the Gapminder Dataset. This dataset contains socioeconomic information for 184 countries for each year since 1800. We have 219 years worth of information for each of the 184 countries, and so the dataset contains 40,296 rows. The dataset contains the following seven columns:

country - This column contains the names of the countries.
year - Entries in this column provide the the year during which that particular record was collected.
continent - The provides the name of one of four continental regions to which the country has been assigned. The regions are africa, americas, asia, and europe.
population - Each entry in this column provides the population of the associated country in the relevant year.
life_exp - This column provides the average life expectancy in a given country in a given year.
gdp_per_cap - This column provides the per capita GDP of a given country in a given year, measured in US Dollars. GDP is a measure of the economic output of a country. The per capita GDP is the total GDP of the country divided by the population.
gini - Each entry in this column provides the Gini index for a particular country in a particular year. The Gini index is a measure of economic inequality. The scale ranges from 0 to 100, which larger values indicating a higher level of inequality.

In the cell below, we load the dataset into a DataFrame, and view 10 randomly selected rows.

gm = pd.read_csv('data/gapminder_data.txt', sep='\t')
gm.sample(n=10, random_state=1)

We can use the describe() method to get obtain statistical summaries of each of the numerical columns.

gm.describe()

Add new columns¶

We will start by adding two new columns, total_gdp, and total_life_exp. Entries in the column total_gdp will record the total GDP of a specific country in a given year, and will be calculated by multiplying population and gdp_per_cap. Similarly, entries in total_life_exp will be calculated by multiplying population by life_exp. This column can be viewed as an estimate of the total number of years that people living in the country at that time would be collectively expected to live.

gm.loc[:,'total_gdp'] = gm.loc[:,'population'] * gm.loc[:,'gdp_per_cap']
gm.loc[:,'total_life_exp'] = gm.loc[:,'population'] * gm.loc[:,'life_exp']
gm.head()

Filtering 2018 Data¶

For now, we will focus only on 2018 data. We will apply a filtering operation to extra only the 184 rows associated with 2018.

gm_2018 = gm.loc[gm.year == 2018, :]
print(len(gm_2018))

Sorting¶

We can use the DataFrame method sort_values to sort the records in a DataFrame according to a particular column. This method has one required parameter named by, which is used to specify the column by which the sorting is to be performed.

We will sort by Gini index to find the 10 countries with the least amount of income inequality in 2018.

gm_2018.sort_values(by='gini').head(10)

We can set the optional parameter ascending to be equal to False to sort a DataFrame in descending order according to a column. By default, this parameter is set to True. In the cell below, we will sort by Gini index in descending order to identify the 10 countries with the greatest levels of income inequality in 2018.

gm_2018.sort_values(by='gini', ascending=False).head(10)

We will now use a sorting operation to find the 10 countries with the lowest life expectancy in 2018.

gm_2018.sort_values(by='life_exp').head(10)

In the cell below, we use sorting to find the 10 countries with the highest life expectancy in 2018.

gm_2018.sort_values(by='life_exp', ascending=False).head(10)

Relationship Between Life Expectancy and Income Inequality¶

In the cell below, we create a scatter plot to explore the relationship between life expectancy and income inequality, as measured by the Gini index.

continent_list = ['africa', 'americas', 'asia', 'europe']
colors = ['steelblue', 'orange', 'limegreen', 'crimson']
plt.figure(figsize = [8,6])
for i, c in enumerate(continent_list):
    plt.scatter(gm_2018.loc[gm_2018.continent == c,'gini'], 
                gm_2018.loc[gm_2018.continent == c,'life_exp'],
                s=50, edgecolor='k', alpha=0.8, label=c.title(),
                color=colors[i])

plt.xlabel('Gini Index')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy and Income Inequality (2018)')   
plt.legend()
plt.show()
    

It does appear as though there is some relationship between life expectancy and Gini index. Countries with lower levels of income inequality tend to also have a higher life expectancy.

In the cell below, we separate the plot above into four subplots. Each subplot will provide information for countries within a single continental region.

plt.figure(figsize = [10,8])
for i, c in enumerate(continent_list):  
    plt.subplot(2,2,i+1)
    plt.scatter(gm_2018.loc[gm_2018.continent == c,'gini'], 
                gm_2018.loc[gm_2018.continent == c,'life_exp'],
                s=50, edgecolor='k', alpha=0.8, label=c.title(),
                color=colors[i])
    plt.xlim([20,70])
    plt.ylim([45,90])
    plt.xlabel('Gini Index')
    plt.ylabel('Life Expectancy')
    plt.legend()
plt.suptitle('Life Expectancy and Income Inequality by Continent (2018)', fontsize=16, y=0.94)
plt.show()

Global Per Capita GDP in 2018¶

We will now calculate the global per capita GDP in 2018. We wil do this by first using the sum() method to calculate the global population and global GDP in 2018. We will then divide the GDP by the population to obtain the per capita GDP.

total_pop_2018 = gm_2018.loc[:, 'population'].sum()
print(total_pop_2018)

total_gdp_2018 = gm_2018.loc[:, 'total_gdp'].sum()
print(total_gdp_2018)

pcgdp_2018 = total_gdp_2018 / total_pop_2018
print(pcgdp_2018)

As we see, the global per capita GDP in 2018 is around $15,566 per person.

Grouping: Per Capita GDP and Life Expectancy By Continent¶

We can combine the steps above with a filter opperation to calculate the per capita GDP of Africa in 2018.

africa_gdp_2018 = gm_2018.loc[gm.loc[:, 'continent'] == 'africa',
                              'total_gdp'].sum()
africa_pop_2018 = gm_2018.loc[gm.loc[:, 'continent'] == 'africa',
                              'population'].sum()
africa_pcgdp_2018 = africa_gdp_2018 / africa_pop_2018
print(africa_pcgdp_2018)

We could repeat this process for each of the four continental regions. However, Pandas DataFrames come equipped with a groupby() method that can be used to calculate aggregate results for groups determined by the values in a particular column. Each call to groupby() must specific one or more columns by which to group, and must be immediately followed by an aggregation that is to be performed on the individual groups.

In the cells that follow, we will now use grouping operations to calcuate the per capita GDP and life expectancy for each continental region in 2018. We will start by finding the total population, total GDP, and population weighted life expectancy of each continent.

grouped = gm_2018.loc[:,['continent','population',
                         'total_gdp', 'total_life_exp']
                     ].groupby('continent').sum()
grouped

We will now caluclate per capita GDP and life expectancy of each continent, adding these columns to the grouped DataFrame.

grouped.loc[:,'pc_gdp'] = grouped.loc[:,'total_gdp'] / grouped.loc[:,'population']
grouped.loc[:,'life_exp'] = grouped.loc[:,'total_life_exp'] / grouped.loc[:,'population']
grouped.loc[:,['pc_gdp', 'life_exp']]

We could perform the calculate above by starting with the original DataFrame, and chaining together operations without storing any intermediate DataFrames to variables.

gm.loc[gm.year == 2018, 
       ['continent', 'population', 'total_gdp', 'total_life_exp']].\
    groupby('continent').sum().\
    assign(pcgdp = lambda df : df.total_gdp / df.population).\
    assign(life_exp = lambda df : df.total_life_exp / df.population).\
    loc[:,['pcgdp','life_exp']]

Grouping by Multiple Columns¶

We can group DataFrames by more than one column. In the cell below, we group by both continent and year to create a DataFrame that reports the total population and total gdp for each continental region in each year. We display 10 randomly selected rows from this DataFrame.

gm.loc[:, ['continent', 'year', 'population', 'total_gdp']].\
    groupby(['continent', 'year']).sum().sample(n=10, random_state=1)

We get build off the the code in the previous cell to calculate the per capita gdp and life expectancy of each contintal region during each year. This code uses the assign() DataFrame method to add the columns pcgdp and life_exp to the DataFrame. You can read more about that method here: pandas.DataFrame.assign()

gm_yc =\
    gm.loc[:, ['continent', 'year', 'population', 'total_gdp', 'total_life_exp']].\
    groupby(['continent', 'year']).sum().\
    assign(pcgdp = lambda df : df.total_gdp / df.population).\
    assign(life_exp = lambda df : df.total_life_exp / df.population).\
    filter(['pcgdp', 'life_exp'])

gm_yc.sample(n=10, random_state=1)

We will now create plots to show how the per capita gdp and life expectancy have changed over time in each continental region.

years = range(1800, 2019)

plt.figure(figsize=[12,4])
plt.subplot(1,2,1)
plt.plot(years, gm_yc.loc['africa','pcgdp'], label='Africa')
plt.plot(years, gm_yc.loc['americas','pcgdp'], label='Americas')
plt.plot(years, gm_yc.loc['asia','pcgdp'], label='Asia')
plt.plot(years, gm_yc.loc['europe','pcgdp'], label='Europe')
plt.xlabel("Years")
plt.ylabel("Per Capita GDP")
plt.title("Per Capita GDP by Continent and Year")
plt.legend()
plt.subplot(1,2,2)
plt.plot(years, gm_yc.loc['africa','life_exp'], label='Africa')
plt.plot(years, gm_yc.loc['americas','life_exp'], label='Americas')
plt.plot(years, gm_yc.loc['asia','life_exp'], label='Asia')
plt.plot(years, gm_yc.loc['europe','life_exp'], label='Europe')
plt.xlabel("Years")
plt.ylabel("Life Expectancy")
plt.title("Life Expectancy by Continent and Year")
plt.legend()
plt.show()

Python for Data Science