12 - Example: Gapminder Dataset (Python)

Lesson 12 - Example: Gapminder Dataset

Introduction

In this lesson, we will work with the Gapminder dataset, which contains socioeconomic data for 184 countries. The data were collected (or estimated) for every year from 1800 to 2018. Each record contains the following pieces of information: country, continent, year, population, life expectancy, per capita GDP, and Gini score.

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

Load and Explore Data

We will load the data from a tab-delimited text file and examine its first few lines.

gm_raw = sc.textFile('FileStore/tables/gapminder_data.txt')
for row in gm_raw.take(5):
    print(row)
country year continent population life_exp gdp_per_cap gini
Afghanistan 1800 asia 3280000 28.2 603 30.5
Albania 1800 europe 410000 35.4 667 38.9
Algeria 1800 africa 2500000 28.8 715 56.2
Angola 1800 africa 1570000 27 618 57.2

We will start by processing the dataset. We will filter out the header row and process every remaining line by splitting it on tabs and coercing each value to the appropriate datatype. Each record is returned as a list of values.

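# Column names, taken from the header line.
header = gm_raw.take(1)[0].split('\t')

def process_row(row):
    # Split the tab-delimited line and cast each field to its datatype.
    tokens = row.split('\t')
    return [tokens[0], int(tokens[1]), tokens[2], int(tokens[3]),
            float(tokens[4]), int(tokens[5]), float(tokens[6])]

# Drop the header line (identified by the word 'country'), then parse the rest.
gm = (gm_raw
      .filter(lambda x : 'country' not in x)
      .map(process_row))

for row in gm.take(5):
    print(row)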
['Afghanistan', 1800, 'asia', 3280000, 28.2, 603, 30.5]
['Albania', 1800, 'europe', 410000, 35.4, 667, 38.9]
['Algeria', 1800, 'africa', 2500000, 28.8, 715, 56.2]
['Angola', 1800, 'africa', 1570000, 27.0, 618, 57.2]
['Antigua and Barbuda', 1800, 'americas', 37000, 33.5, 757, 40.0]
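
The parsing step can also be checked on a single line without a Spark session. The sketch below repeats process_row() and applies it to one sample line; sample_line is rebuilt here from the first data row printed above and is assumed to be tab-separated, as in the source file.

```python
# Standalone check of the parsing logic; no Spark session is needed.
def process_row(row):
    # Split the tab-delimited line and cast each field to its datatype.
    tokens = row.split('\t')
    return [tokens[0], int(tokens[1]), tokens[2], int(tokens[3]),
            float(tokens[4]), int(tokens[5]), float(tokens[6])]

# Assumption: sample line rebuilt from the first data row shown in the lesson.
sample_line = '\t'.join(
    ['Afghanistan', '1800', 'asia', '3280000', '28.2', '603', '30.5'])

parsed = process_row(sample_line)
print(parsed)
# Show the Python type that each field was coerced to.
print([type(value).__name__ for value in parsed])
```

The second print makes the coercion visible: strings for country and continent, integers for year, population, and per capita GDP, and floats for life expectancy and Gini score.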

Let's see how many records are present in the dataset.

gm.count()
Out[14]: 40296
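
As a quick sanity check, this count is what we would expect if every one of the 184 countries has exactly one record for each year from 1800 to 2018. That completeness is an assumption made for this check; the variable names below are illustrative only.

```python
# Figures taken from the introduction of the lesson.
n_countries = 184
first_year, last_year = 1800, 2018

# Both endpoints are included, hence the + 1.
n_years = last_year - first_year + 1

# Assumption: one record per country per year.
expected_records = n_countries * n_years
print(n_years, expected_records)
```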

For this example, we will be working with only the most recent data represented in the dataset. Let's find the latest year for which we have data.

# Find the latest year in the data. 
print(gm.map(lambda x : x[1]).max())
2018

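We will now use filter() to keep only the records from 2018. Because this subset is reused in every query below, we also call persist() so that Spark caches it after it is first computed.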

gm_18 = gm.filter(lambda x : x[1] == 2018)
gm_18.persist()
gm_18.count()
Out[16]: 184

Largest and Smallest Populations

We will use sortBy() and take() to find the countries with the largest and smallest populations in 2018.

print('Largest Populations in 2018')
print('-' * 40)
for row in gm_18.sortBy(lambda x : x[3], ascending=False).take(10):
    print(f'{row[0]:<30}{row[3]:>10}')
 
Largest Populations in 2018
----------------------------------------
China                         1420000000
India                         1350000000
United States                  327000000
Indonesia                      267000000
Brazil                         211000000
Pakistan                       201000000
Nigeria                        196000000
Bangladesh                     166000000
Russia                         144000000
Mexico                         131000000

print('Smallest Populations in 2018')
print('-' * 40)
for row in gm_18.sortBy(lambda x : x[3]).take(10):
    print(f'{row[0]:<30}{row[3]:>10}')
Smallest Populations in 2018
----------------------------------------
Seychelles                         95200
Antigua and Barbuda               103000
Micronesia, Fed. Sts.             106000
Grenada                           108000
Tonga                             109000
St. Vincent and the Grenadines    110000
Kiribati                          118000
St. Lucia                         180000
Samoa                             198000
Sao Tome and Principe             209000

Highest and Lowest Life Expectancy

We will use sortBy() and take() again to find the countries with the highest and lowest life expectancies in 2018.

print('Highest Life Expectancy in 2018')
print('-' * 40)
for row in gm_18.sortBy(lambda x : x[4], ascending=False).take(10):
    print(f'{row[0]:<30}{row[4]:>10}')
Highest Life Expectancy in 2018
----------------------------------------
Japan                               84.2
Singapore                           84.0
Switzerland                         83.5
Spain                               83.2
Australia                           82.9
France                              82.6
Iceland                             82.6
Italy                               82.6
Israel                              82.4
Luxembourg                          82.4

print('Lowest Life Expectancy in 2018')
print('-' * 40)
for row in gm_18.sortBy(lambda x : x[4]).take(10):
    print(f'{row[0]:<30}{row[4]:>10}')
Lowest Life Expectancy in 2018
----------------------------------------
Lesotho                             51.1
Central African Republic            51.6
Somalia                             58.0
Swaziland                           58.6
Afghanistan                         58.7
Zambia                              59.5
Guinea-Bissau                       59.7
Sierra Leone                        60.0
Zimbabwe                            60.2
Chad                                60.5
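
The four listings above repeat the same formatting loop, so one possible refactoring is to move it into a small function. In the sketch below, format_ranking is a hypothetical helper that is not part of the lesson. The sample rows are the 1800 records printed earlier, sorted with plain Python rather than Spark so that the snippet runs on its own.

```python
def format_ranking(title, rows, value_index):
    """Return the lines of a two-column ranking table (hypothetical helper)."""
    lines = [title, '-' * 40]
    for row in rows:
        # Country name left-aligned in 30 columns, value right-aligned in 10.
        lines.append(f'{row[0]:<30}{row[value_index]:>10}')
    return lines

# Stand-in data: the first four parsed records shown earlier in the lesson.
sample_rows = [
    ['Afghanistan', 1800, 'asia', 3280000, 28.2, 603, 30.5],
    ['Albania', 1800, 'europe', 410000, 35.4, 667, 38.9],
    ['Algeria', 1800, 'africa', 2500000, 28.8, 715, 56.2],
    ['Angola', 1800, 'africa', 1570000, 27.0, 618, 57.2],
]

# Plain-Python equivalent of sortBy(lambda x : x[3], ascending=False).
ranked = sorted(sample_rows, key=lambda x: x[3], reverse=True)

lines = format_ranking('Largest Populations (sample rows)', ranked, 3)
for line in lines:
    print(line)
```

With the RDD from the lesson, the same helper could be given the result of gm_18.sortBy(lambda x : x[3], ascending=False).take(10) in place of ranked, since take() returns an ordinary Python list.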