In the previous lesson, we learned how to create Pandas DataFrames, how to access information stored within a DataFrame, and how to perform filtering operations on the rows of a DataFrame. In this lesson we will learn how to sort rows of a DataFrame according to specific columns, as well as how to performing grouping and aggregation operations.
We begin by importing a few packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
To illustrate the concepts covered in this lesson, we will make use of the Gapminder Dataset. This dataset contains socioeconomic information for 184 countries for each year since 1800. We have 219 years worth of information for each of the 184 countries, and so the dataset contains 40,296 rows. The dataset contains the following seven columns:
africa
, americas
, asia
, and europe
. In the cell below, we load the dataset into a DataFrame, and view 10 randomly selected rows.
gm = pd.read_csv('data/gapminder_data.txt', sep='\t')
gm.sample(n=10, random_state=1)
We can use the describe()
method to get obtain statistical summaries of each of the numerical columns.
gm.describe()
We will start by adding two new columns, total_gdp
, and total_life_exp
. Entries in the column total_gdp
will record the total GDP of a specific country in a given year, and will be calculated by multiplying population
and gdp_per_cap
. Similarly, entries in total_life_exp
will be calculated by multiplying population
by life_exp
. This column can be viewed as an estimate of the total number of years that people living in the country at that time would be collectively expected to live.
gm.loc[:,'total_gdp'] = gm.loc[:,'population'] * gm.loc[:,'gdp_per_cap']
gm.loc[:,'total_life_exp'] = gm.loc[:,'population'] * gm.loc[:,'life_exp']
gm.head()
For now, we will focus only on 2018 data. We will apply a filtering operation to extra only the 184 rows associated with 2018.
gm_2018 = gm.loc[gm.year == 2018, :]
print(len(gm_2018))
We can use the DataFrame method sort_values
to sort the records in a DataFrame according to a particular column. This method has one required parameter named by
, which is used to specify the column by which the sorting is to be performed.
We will sort by Gini index to find the 10 countries with the least amount of income inequality in 2018.
gm_2018.sort_values(by='gini').head(10)
We can set the optional parameter ascending
to be equal to False
to sort a DataFrame in descending order according to a column. By default, this parameter is set to True
. In the cell below, we will sort by Gini index in descending order to identify the 10 countries with the greatest levels of income inequality in 2018.
gm_2018.sort_values(by='gini', ascending=False).head(10)
We will now use a sorting operation to find the 10 countries with the lowest life expectancy in 2018.
gm_2018.sort_values(by='life_exp').head(10)
In the cell below, we use sorting to find the 10 countries with the highest life expectancy in 2018.
gm_2018.sort_values(by='life_exp', ascending=False).head(10)
In the cell below, we create a scatter plot to explore the relationship between life expectancy and income inequality, as measured by the Gini index.
continent_list = ['africa', 'americas', 'asia', 'europe']
colors = ['steelblue', 'orange', 'limegreen', 'crimson']
plt.figure(figsize = [8,6])
for i, c in enumerate(continent_list):
plt.scatter(gm_2018.loc[gm_2018.continent == c,'gini'],
gm_2018.loc[gm_2018.continent == c,'life_exp'],
s=50, edgecolor='k', alpha=0.8, label=c.title(),
color=colors[i])
plt.xlabel('Gini Index')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy and Income Inequality (2018)')
plt.legend()
plt.show()
It does appear as though there is some relationship between life expectancy and Gini index. Countries with lower levels of income inequality tend to also have a higher life expectancy.
In the cell below, we separate the plot above into four subplots. Each subplot will provide information for countries within a single continental region.
plt.figure(figsize = [10,8])
for i, c in enumerate(continent_list):
plt.subplot(2,2,i+1)
plt.scatter(gm_2018.loc[gm_2018.continent == c,'gini'],
gm_2018.loc[gm_2018.continent == c,'life_exp'],
s=50, edgecolor='k', alpha=0.8, label=c.title(),
color=colors[i])
plt.xlim([20,70])
plt.ylim([45,90])
plt.xlabel('Gini Index')
plt.ylabel('Life Expectancy')
plt.legend()
plt.suptitle('Life Expectancy and Income Inequality by Continent (2018)', fontsize=16, y=0.94)
plt.show()
We will now calculate the global per capita GDP in 2018. We wil do this by first using the sum()
method to calculate the global population and global GDP in 2018. We will then divide the GDP by the population to obtain the per capita GDP.
total_pop_2018 = gm_2018.loc[:, 'population'].sum()
print(total_pop_2018)
total_gdp_2018 = gm_2018.loc[:, 'total_gdp'].sum()
print(total_gdp_2018)
pcgdp_2018 = total_gdp_2018 / total_pop_2018
print(pcgdp_2018)
As we see, the global per capita GDP in 2018 is around $15,566 per person.
We can combine the steps above with a filter opperation to calculate the per capita GDP of Africa in 2018.
africa_gdp_2018 = gm_2018.loc[gm.loc[:, 'continent'] == 'africa',
'total_gdp'].sum()
africa_pop_2018 = gm_2018.loc[gm.loc[:, 'continent'] == 'africa',
'population'].sum()
africa_pcgdp_2018 = africa_gdp_2018 / africa_pop_2018
print(africa_pcgdp_2018)
We could repeat this process for each of the four continental regions. However, Pandas DataFrames come equipped with a groupby()
method that can be used to calculate aggregate results for groups determined by the values in a particular column. Each call to groupby()
must specific one or more columns by which to group, and must be immediately followed by an aggregation that is to be performed on the individual groups.
In the cells that follow, we will now use grouping operations to calcuate the per capita GDP and life expectancy for each continental region in 2018. We will start by finding the total population, total GDP, and population weighted life expectancy of each continent.
grouped = gm_2018.loc[:,['continent','population',
'total_gdp', 'total_life_exp']
].groupby('continent').sum()
grouped
We will now caluclate per capita GDP and life expectancy of each continent, adding these columns to the grouped
DataFrame.
grouped.loc[:,'pc_gdp'] = grouped.loc[:,'total_gdp'] / grouped.loc[:,'population']
grouped.loc[:,'life_exp'] = grouped.loc[:,'total_life_exp'] / grouped.loc[:,'population']
grouped.loc[:,['pc_gdp', 'life_exp']]
We could perform the calculate above by starting with the original DataFrame, and chaining together operations without storing any intermediate DataFrames to variables.
gm.loc[gm.year == 2018,
['continent', 'population', 'total_gdp', 'total_life_exp']].\
groupby('continent').sum().\
assign(pcgdp = lambda df : df.total_gdp / df.population).\
assign(life_exp = lambda df : df.total_life_exp / df.population).\
loc[:,['pcgdp','life_exp']]
We can group DataFrames by more than one column. In the cell below, we group by both continent
and year
to create a DataFrame that reports the total population and total gdp for each continental region in each year. We display 10 randomly selected rows from this DataFrame.
gm.loc[:, ['continent', 'year', 'population', 'total_gdp']].\
groupby(['continent', 'year']).sum().sample(n=10, random_state=1)
We get build off the the code in the previous cell to calculate the per capita gdp and life expectancy of each contintal region during each year. This code uses the assign()
DataFrame method to add the columns pcgdp
and life_exp
to the DataFrame. You can read more about that method here: pandas.DataFrame.assign()
gm_yc =\
gm.loc[:, ['continent', 'year', 'population', 'total_gdp', 'total_life_exp']].\
groupby(['continent', 'year']).sum().\
assign(pcgdp = lambda df : df.total_gdp / df.population).\
assign(life_exp = lambda df : df.total_life_exp / df.population).\
filter(['pcgdp', 'life_exp'])
gm_yc.sample(n=10, random_state=1)
We will now create plots to show how the per capita gdp and life expectancy have changed over time in each continental region.
years = range(1800, 2019)
plt.figure(figsize=[12,4])
plt.subplot(1,2,1)
plt.plot(years, gm_yc.loc['africa','pcgdp'], label='Africa')
plt.plot(years, gm_yc.loc['americas','pcgdp'], label='Americas')
plt.plot(years, gm_yc.loc['asia','pcgdp'], label='Asia')
plt.plot(years, gm_yc.loc['europe','pcgdp'], label='Europe')
plt.xlabel("Years")
plt.ylabel("Per Capita GDP")
plt.title("Per Capita GDP by Continent and Year")
plt.legend()
plt.subplot(1,2,2)
plt.plot(years, gm_yc.loc['africa','life_exp'], label='Africa')
plt.plot(years, gm_yc.loc['americas','life_exp'], label='Americas')
plt.plot(years, gm_yc.loc['asia','life_exp'], label='Asia')
plt.plot(years, gm_yc.loc['europe','life_exp'], label='Europe')
plt.xlabel("Years")
plt.ylabel("Life Expectancy")
plt.title("Life Expectancy by Continent and Year")
plt.legend()
plt.show()