Lesson 23 - Pandas DataFrames

The following topics are discussed in this notebook:

  • The dataframe data type from the pandas package.

DataFrames

The pandas package provides us with the DataFrame data type for working with structured data, or in other words, data that is organized in a tabular format with defined rows and columns.

We have previously seen two tools that can be used for working with structured data: 2D Numpy arrays and dictionaries. DataFrames provide the following advantages over these other data structures:

  1. The elements of a Numpy array must all be of the same data type. While individual columns of a pandas DataFrame must have a consistent data type, different columns can contain different types.

  2. Unlike a Numpy array, we can assign names to the columns and (less importantly) to the rows of a DataFrame.

  3. We have seen that we can use dictionaries to represent tabular data. For example, we can set the values in a dictionary to be lists representing columns, in which case the keys will represent column names. However, there would be no explicit concept of a row in such a setup. DataFrames explicitly define both rows and columns.

It is a common convention to import pandas under the alias pd.

In [1]:
import pandas as pd

Creating a DataFrame from a Dictionary

We can create a pandas DataFrame from a dictionary in which the key-value pairs represent columns.

In [2]:
employee_dict = {
    'eid':[214, 174, 126, 227, 151, 175, 193, 146],
    'name':['Drew', 'Faye', 'Beth', 'Chad', 'Hana', 'Gary', 'Alex', 'Emma'],
    'age':[25, 19, 42, 31, 25, 28, 31, 25],
    'rate':[11.50, 12.30, 14.60, 12.50, None, 15.60, 13.50, 14.30],
    'hours':[38, 32, 40, 29, 40, 36, 24, 20]
}
In [3]:
employee_df = pd.DataFrame(employee_dict)
employee_df
Out[3]:
eid name age rate hours
0 214 Drew 25 11.5 38
1 174 Faye 19 12.3 32
2 126 Beth 42 14.6 40
3 227 Chad 31 12.5 29
4 151 Hana 25 NaN 40
5 175 Gary 28 15.6 36
6 193 Alex 31 13.5 24
7 146 Emma 25 14.3 20

Every DataFrame comes with a head() method that can be used to view the first few rows of the DataFrame.

In [4]:
employee_df.head(3)
Out[4]:
eid name age rate hours
0 214 Drew 25 11.5 38
1 174 Faye 19 12.3 32
2 126 Beth 42 14.6 40
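There is also a corresponding tail() method for viewing the last few rows. As a quick sketch (output not shown):

employee_df.tail(3)  # displays the last three rows of the DataFrame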

Accessing Elements

Every row and column in a pandas DataFrame has both a name and a numerical index. We can use the iloc[] attribute to access DataFrame elements using their row and column indices, and we can use loc[] to access elements using their row and column names.

Note that for the DataFrame we have created above, the numerical indices for the rows are the same as their names. This is common, but not required. We will see an example later where the row names are different from the numerical indices.

In [5]:
employee_df.iloc[2, 3]
Out[5]:
14.6
In [6]:
employee_df.iloc[4, 1]
Out[6]:
'Hana'
In [7]:
employee_df.loc[4, 'name']
Out[7]:
'Hana'

Accessing Rows

We can extract entire rows from a DataFrame using slicing.

In [8]:
employee_df.loc[4,:]
#employee_df.loc[4]
Out[8]:
eid       151
name     Hana
age        25
rate      NaN
hours      40
Name: 4, dtype: object
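Since the row names in this DataFrame currently match their numerical indices, we could have selected the same row positionally with iloc. A small sketch:

employee_df.iloc[4, :]  # same row, selected by position rather than by name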

Accessing Columns

We can also use slicing to extract entire columns from a DataFrame.

In [9]:
employee_df.loc[:,'name']
#employee_df['name']
#employee_df.name
Out[9]:
0    Drew
1    Faye
2    Beth
3    Chad
4    Hana
5    Gary
6    Alex
7    Emma
Name: name, dtype: object

The loc and iloc indexing attributes support fancy indexing.

In [10]:
employee_df.loc[:,['name','rate']]
Out[10]:
name rate
0 Drew 11.5
1 Faye 12.3
2 Beth 14.6
3 Chad 12.5
4 Hana NaN
5 Gary 15.6
6 Alex 13.5
7 Emma 14.3
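Fancy indexing works with iloc as well, using lists of positions instead of names. As a sketch, the following selects the name and rate columns (positions 1 and 3) for the first, third, and fifth rows:

employee_df.iloc[[0, 2, 4], [1, 3]]  # rows 0, 2, 4; columns 'name' and 'rate'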

Since columns and rows have a well-defined order in a DataFrame, we can also use the loc indexer to slice a range of columns using their column names.

Notice that when performing this type of slicing, the column appearing after the colon IS included in the slice.

In [11]:
employee_df.loc[:,'eid':'age']
Out[11]:
eid name age
0 214 Drew 25
1 174 Faye 19
2 126 Beth 42
3 227 Chad 31
4 151 Hana 25
5 175 Gary 28
6 193 Alex 31
7 146 Emma 25
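This inclusive behavior is specific to label-based slicing with loc. When slicing by position with iloc, the endpoint is excluded, just as with ordinary Python sequences. As a sketch, the following selects the same three columns by position:

employee_df.iloc[:, 0:3]  # columns 0, 1, 2; the endpoint 3 is NOT included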

Adding New Columns

We can use loc to add new columns to a DataFrame.

In [12]:
employee_df.loc[:,'pay'] = employee_df.loc[:,'rate'] * employee_df.loc[:,'hours']
employee_df
Out[12]:
eid name age rate hours pay
0 214 Drew 25 11.5 38 437.0
1 174 Faye 19 12.3 32 393.6
2 126 Beth 42 14.6 40 584.0
3 227 Chad 31 12.5 29 362.5
4 151 Hana 25 NaN 40 NaN
5 175 Gary 28 15.6 36 561.6
6 193 Alex 31 13.5 24 324.0
7 146 Emma 25 14.3 20 286.0
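The same column could also have been created with bracket syntax, a common alternative to the loc-based assignment above:

# equivalent to the loc-based assignment shown in the previous cell
employee_df['pay'] = employee_df['rate'] * employee_df['hours']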

Column Operations

DataFrames come equipped with several methods such as sum() and mean() for performing operations on columns.

In [13]:
employee_df.loc[:,['hours', 'pay']].sum()
Out[13]:
hours     259.0
pay      2948.7
dtype: float64
In [14]:
employee_df.iloc[:,2:].mean()
Out[14]:
age       28.250000
rate      13.471429
hours     32.375000
pay      421.242857
dtype: float64
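By default, these methods aggregate down each column. They also accept an axis parameter; as a sketch, passing axis=1 aggregates across each row instead:

employee_df.loc[:, ['hours', 'pay']].sum(axis=1)  # one hours + pay total per employee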

Missing Values

A common headache for anyone who works with data is encountering data sets with missing values. Pandas provides us with several tools for identifying and working with missing values. We will discuss some of those here.

The isnull() method returns a DataFrame consisting of Boolean values that indicate the location of missing values within the original DataFrame.

In [15]:
employee_df.isnull()
Out[15]:
eid name age rate hours pay
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False True False True
5 False False False False False False
6 False False False False False False
7 False False False False False False

We can combine the isnull() and sum() methods to count the number of missing values in each column of a DataFrame.

In [16]:
employee_df.isnull().sum()
Out[16]:
eid      0
name     0
age      0
rate     1
hours    0
pay      1
dtype: int64

The dropna() method removes from a DataFrame any rows that contain missing values. By default, this method returns a new DataFrame, leaving the original object untouched. However, if we set the parameter inplace=True, then the operation is performed on the original DataFrame.

In [17]:
employee_df.dropna(inplace=True)
employee_df
Out[17]:
eid name age rate hours pay
0 214 Drew 25 11.5 38 437.0
1 174 Faye 19 12.3 32 393.6
2 126 Beth 42 14.6 40 584.0
3 227 Chad 31 12.5 29 362.5
5 175 Gary 28 15.6 36 561.6
6 193 Alex 31 13.5 24 324.0
7 146 Emma 25 14.3 20 286.0
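Dropping rows is not the only way to handle missing values. As an illustrative sketch of an alternative, the fillna() method replaces missing values with values we supply; here we imagine filling any missing rates with the mean rate (note that employee_df no longer contains missing values at this point, so this particular call would change nothing):

# illustrative only: fill missing rates with the column mean instead of dropping rows
mean_rate = employee_df.loc[:, 'rate'].mean()
employee_df.fillna({'rate': mean_rate})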

Filtering

We can use Boolean masking to filter DataFrames, just as with Numpy arrays.

The code in the cell below filters employee_df, keeping only records for employees who worked more than 30 hours.

In [18]:
sel = employee_df.loc[:,'hours'] > 30
employee_df.loc[sel, :]
Out[18]:
eid name age rate hours pay
0 214 Drew 25 11.5 38 437.0
1 174 Faye 19 12.3 32 393.6
2 126 Beth 42 14.6 40 584.0
5 175 Gary 28 15.6 36 561.6

We saw above that the average age for employees in the data set was 28.25. In the cell below, we determine the average age of employees older than 25.

In [19]:
sel = employee_df.loc[:,'age'] > 25
employee_df.loc[sel, 'age'].mean()
Out[19]:
33.0

Just like with Numpy arrays, we can use & and | to filter on multiple criteria.

In [20]:
sel = (employee_df.loc[:,'rate'] < 13) & (employee_df.loc[:,'hours'] > 30)
employee_df.loc[sel, :]
Out[20]:
eid name age rate hours pay
0 214 Drew 25 11.5 38 437.0
1 174 Faye 19 12.3 32 393.6
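As a sketch, | works the same way for "or" conditions; the following would keep records for employees who are either younger than 25 or worked exactly 40 hours:

sel = (employee_df.loc[:, 'age'] < 25) | (employee_df.loc[:, 'hours'] == 40)
employee_df.loc[sel, :]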

Setting a New Index

Pandas refers to the collection of row names (somewhat confusingly) as the index of the DataFrame. In most cases, these row names will be equal to the numerical indices of the rows. However, it is possible to set the index of a DataFrame to be one of its columns.

In the cell below, we set eid to be the index of employee_df.

In [21]:
employee_df.set_index('eid', inplace=True)
employee_df
Out[21]:
name age rate hours pay
eid
214 Drew 25 11.5 38 437.0
174 Faye 19 12.3 32 393.6
126 Beth 42 14.6 40 584.0
227 Chad 31 12.5 29 362.5
175 Gary 28 15.6 36 561.6
193 Alex 31 13.5 24 324.0
146 Emma 25 14.3 20 286.0
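If we later want to turn the index back into an ordinary column, the reset_index() method reverses this operation. A quick sketch:

employee_df.reset_index()  # returns a new DataFrame with 'eid' restored as a column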

We can still use iloc to index rows and columns according to their numerical indices.

In [22]:
employee_df.iloc[:4, :3]
Out[22]:
name age rate
eid
214 Drew 25 11.5
174 Faye 19 12.3
126 Beth 42 14.6
227 Chad 31 12.5

However, we must use the correct row names when using loc to subset the DataFrame. Notice that the rows are kept in their original order, and are not reordered according to the new row names that we have set.

In [23]:
employee_df.loc[:227, :'rate']
Out[23]:
name age rate
eid
214 Drew 25 11.5
174 Faye 19 12.3
126 Beth 42 14.6
227 Chad 31 12.5

Adding Records to a DataFrame

We will now see how to add new records to a DataFrame. First, let's recall what the employee_df DataFrame currently contains.

In [24]:
employee_df
Out[24]:
name age rate hours pay
eid
214 Drew 25 11.5 38 437.0
174 Faye 19 12.3 32 393.6
126 Beth 42 14.6 40 584.0
227 Chad 31 12.5 29 362.5
175 Gary 28 15.6 36 561.6
193 Alex 31 13.5 24 324.0
146 Emma 25 14.3 20 286.0

We can use loc to add a new record with a specific row name.

In [25]:
employee_df.loc[232] = ['Iris', 34, 11.2, 30, 11.2 * 30]
employee_df
Out[25]:
name age rate hours pay
eid
214 Drew 25 11.5 38 437.0
174 Faye 19 12.3 32 393.6
126 Beth 42 14.6 40 584.0
227 Chad 31 12.5 29 362.5
175 Gary 28 15.6 36 561.6
193 Alex 31 13.5 24 324.0
146 Emma 25 14.3 20 286.0
232 Iris 34 11.2 30 336.0

We can also create a DataFrame of new records with the same structure as our original DataFrame, and then combine the two using the append() method.

In [26]:
new_records = pd.DataFrame({
    'name':['Jake', 'Kate'],
    'age':[36, 29],
    'rate':[11.7, 12.4],
    'hours':[34, 32],
}, index=[251, 368])

new_records.loc[:,'pay'] = new_records.loc[:,'rate'] * new_records.loc[:,'hours']

new_records
Out[26]:
name age rate hours pay
251 Jake 36 11.7 34 397.8
368 Kate 29 12.4 32 396.8
In [27]:
employee_df.append(new_records)
Out[27]:
name age rate hours pay
214 Drew 25 11.5 38 437.0
174 Faye 19 12.3 32 393.6
126 Beth 42 14.6 40 584.0
227 Chad 31 12.5 29 362.5
175 Gary 28 15.6 36 561.6
193 Alex 31 13.5 24 324.0
146 Emma 25 14.3 20 286.0
232 Iris 34 11.2 30 336.0
251 Jake 36 11.7 34 397.8
368 Kate 29 12.4 32 396.8

It should be noted that append() returns a new DataFrame; it does not change either of the DataFrames being combined. Note also that append() has since been deprecated in favor of pd.concat(), as sketched below.
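In newer versions of pandas (2.0 and later), DataFrame.append() has been removed, and pd.concat() performs the same task. A sketch:

# pd.concat() combines a list of DataFrames and likewise returns a new DataFrame
combined_df = pd.concat([employee_df, new_records])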

Creating a DataFrame from a List of Lists

There are many ways to create DataFrames. At the beginning of this lesson, we saw how to create DataFrames from dictionaries. DataFrames can also be created from lists of lists, with each inner list supplying one row. In this case, row and column names can be provided; if they are not, they default to the numerical indices.

In [28]:
LoL = [[30, 3.7, 'Ant', True], 
       [24, 1.3, 'Bird', True], 
       [45, 2.6, 'Cat', False], 
       [18, 4.2, 'Dog', True]]

unnamed_df = pd.DataFrame(LoL)
print(unnamed_df)
    0    1     2      3
0  30  3.7   Ant   True
1  24  1.3  Bird   True
2  45  2.6   Cat  False
3  18  4.2   Dog   True
In [29]:
named_df = pd.DataFrame(data = LoL,
                        index = ['Row1', 'Row2', 'Row3', 'Row4'],
                        columns = ['Col1', 'Col2', 'Col3', 'Col4'])
print(named_df)
      Col1  Col2  Col3   Col4
Row1    30   3.7   Ant   True
Row2    24   1.3  Bird   True
Row3    45   2.6   Cat  False
Row4    18   4.2   Dog   True

Loading Data From a Text File

Tabular data is often stored in the form of text files. When a text file is used to store tabular data, each record (row) in the data set is recorded on a separate line of the text file. Elements belonging to different columns are separated by a specific sequence of one or more characters. This sequence is referred to as a separator or delimiter. Common choices for a delimiter are tabs, commas, and spaces. A data file using tabs as delimiters is often referred to as being tab-separated or tab-delimited. Similarly, a data file using commas for delimiters is often referred to as being comma-separated or comma-delimited. The extension used for a text file storing data can be essentially anything, but it is common to see .txt files used. When the file is comma-delimited, the file will often have the extension .csv, which stands for "comma-separated values".

Pandas provides several tools for reading data from text files into DataFrames. Two common functions used for this task are pd.read_table() and pd.read_csv(). These functions have several parameters that control how the data is imported, but for many use cases, it is sufficient to use just two: filepath_or_buffer and sep.

  • The filepath_or_buffer parameter expects a string that states the path to the data file. The path can be an absolute path that states exactly where the file lives on your system, or a relative path that explains where the file is stored in relation to the directory containing the notebook or script you are using. Relative paths tend to be more flexible, as they allow the directory containing the notebook and data file to be moved without having to update the path in the code.

  • The sep parameter specifies the separator or delimiter used in the data file. The default value of sep is '\t' for pd.read_table() and ',' for pd.read_csv().

These two functions can be used interchangeably, as long as you are careful to specify the correct separator for your data file.
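For instance, assuming a hypothetical comma-delimited file data/example.csv, both of the following calls would produce the same DataFrame:

df1 = pd.read_csv('data/example.csv')             # sep=',' is the default
df2 = pd.read_table('data/example.csv', sep=',')  # sep must be given explicitly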

We will now provide two examples of loading data into a Pandas DataFrame.

Pima Diabetes Dataset

We will start by loading the Pima Indian Diabetes dataset. This data set contains information collected in a study on the prevalence of type 2 diabetes in a Native American community in Arizona. You can find more information about this dataset here: Pima Indian Diabetes.

The data is stored in a comma-separated file named diabetes.csv within the data/ directory. In the cell below, we load this dataset into a DataFrame and then view its first few rows.

In [30]:
diabetes = pd.read_table('data/diabetes.csv', sep=',')
diabetes.head()
Out[30]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

Titanic Dataset

In the next example, we will load the Titanic dataset. This dataset contains information about 887 passengers aboard the RMS Titanic on its first and only voyage. You can read more about this dataset here: Titanic Dataset.

The data is stored in the tab-delimited file titanic.txt within the data/ directory. In the cell below, we load this dataset into a DataFrame and then view its first few rows.

In [31]:
titanic = pd.read_table('data/titanic.txt', sep='\t')
titanic.head()
Out[31]:
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 0 3 Mr. Owen Harris Braund male 22.0 1 0 7.2500
1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... female 38.0 1 0 71.2833
2 1 3 Miss. Laina Heikkinen female 26.0 0 0 7.9250
3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle female 35.0 1 0 53.1000
4 0 3 Mr. William Henry Allen male 35.0 0 0 8.0500