The pandas package provides us with the DataFrame data type for working with structured data, or in other words, data that is organized in a tabular format with defined rows and columns.
We have previously seen two tools that can be used for working with structured data: 2D Numpy arrays and dictionaries. DataFrames provide the following advantages over these other data structures:
The elements of a Numpy array must all be of the same data type. While individual columns of a pandas DataFrame must have a consistent data type, different columns can contain different types.
Unlike a Numpy array, we can assign names to the columns and (less importantly) to the rows of a DataFrame.
We have seen that we can use dictionaries to represent tabular data. For example, we can set the values in a dictionary to be lists representing columns, in which case the keys will represent column names. However, there would be no explicit concept of a row in such a setup. DataFrames explicitly define both rows and columns.
It is a common convention to import pandas under the alias pd.
import pandas as pd
We can create a pandas DataFrame from a dictionary in which the key-value pairs represent columns.
employee_dict = {
'eid':[214, 174, 126, 227, 151, 175, 193, 146],
'name':['Drew', 'Faye', 'Beth', 'Chad', 'Hana', 'Gary', 'Alex', 'Emma'],
'age':[25, 19, 42, 31, 25, 28, 31, 25],
'rate':[11.50, 12.30, 14.60, 12.50, None, 15.60, 13.50, 14.30],
'hours':[38, 32, 40, 29, 40, 36, 24, 20]
}
employee_df = pd.DataFrame(employee_dict)
employee_df
Every DataFrame comes with a head() method that can be used to view the first few rows of the DataFrame.
employee_df.head(3)
Every row and column in a pandas DataFrame has both a name and a numerical index. We can use the iloc[] attribute to access DataFrame elements using row and column indices, and we can use loc[] to access elements using row and column names.
Note that for the DataFrame we have created above, the numerical indices for the rows are the same as their names. This is common, but not required. We will see an example later where the row names are different from the numerical indices.
employee_df.iloc[2, 3]
employee_df.iloc[4, 1]
employee_df.loc[4, 'name']
We can extract entire rows from a DataFrame using slicing.
employee_df.loc[4,:]
#employee_df.loc[4]
We can also use slicing to extract entire columns from a DataFrame.
employee_df.loc[:,'name']
#employee_df['name']
#employee_df.name
The loc and iloc indexing attributes support fancy indexing.
employee_df.loc[:,['name','rate']]
Since columns and rows have a well-defined order in a DataFrame, we can also use slicing with the loc indexer to slice a range of columns using their column names. Notice that when performing this type of slicing, the column appearing after the colon IS included in the slice, unlike in standard Python slicing.
employee_df.loc[:,'eid':'age']
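For comparison, slicing with iloc follows the usual Python convention, in which the endpoint is excluded. The cell below is a brief sketch that selects the same three columns by their numerical indices:
employee_df.iloc[:, 0:3]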
We can use loc to add new columns to a DataFrame.
employee_df.loc[:,'pay'] = employee_df.loc[:,'rate'] * employee_df.loc[:,'hours']
employee_df
DataFrames come equipped with several methods, such as sum() and mean(), for performing operations on columns.
employee_df.loc[:,['hours', 'pay']].sum()
employee_df.iloc[:,2:].mean()
A common headache for anyone who works with data is encountering data sets with missing values. Pandas provides us with several tools for identifying and working with missing values. We will discuss some of those here.
The isnull() method returns a DataFrame of Boolean values that indicate the locations of missing values within the original DataFrame.
employee_df.isnull()
We can combine the isnull() and sum() methods to count the number of missing values in each column of a DataFrame.
employee_df.isnull().sum()
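If we instead want to count the missing values in each row, we can pass axis=1 to sum(). The cell below sketches this variation:
employee_df.isnull().sum(axis=1)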
The dropna() method removes from a DataFrame any rows that contain missing values. By default, this method returns a new DataFrame, leaving the original object untouched. However, if we set the parameter inplace=True, then the operation is performed on the original DataFrame.
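For example, the sketch below (using the hypothetical name clean_df) returns a new DataFrame with the incomplete rows removed, leaving employee_df unchanged:
clean_df = employee_df.dropna()
In the cell below, we instead perform the operation in place.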
employee_df.dropna(inplace=True)
employee_df
We can use Boolean masking to filter DataFrames, just as with Numpy arrays.
The code in the cell below filters employee_df, keeping only the records for employees who worked more than 30 hours.
sel = employee_df.loc[:,'hours'] > 30
employee_df.loc[sel, :]
We saw above that the average age for employees in the data set was 28.25. In the cell below, we determine the average age of employees older than 25.
sel = employee_df.loc[:,'age'] > 25
employee_df.loc[sel, 'age'].mean()
Just like with Numpy arrays, we can use & and | to filter on multiple criteria.
sel = (employee_df.loc[:,'rate'] < 13) & (employee_df.loc[:,'hours'] > 30)
employee_df.loc[sel, :]
DataFrames refer to the collection of row names (somewhat confusingly) as the index of the DataFrame. In most cases, these row names will be equal to the numerical indices of the rows. However, it is possible to set the index of a DataFrame to be a column within the DataFrame.
In the cell below, we set eid to be the index of employee_df.
employee_df.set_index('eid', inplace=True)
employee_df
We can still use iloc to index rows and columns according to their numerical indices.
employee_df.iloc[:4, :3]
However, we must use the correct row names when using loc to subset the DataFrame. Notice that the rows are kept in their original order; they are not reordered according to the new row names that we have set.
employee_df.loc[:227, :'rate']
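We can also use loc with a single row name to look up a specific employee by eid. As a brief illustration, the cell below retrieves the name of the employee whose eid is 174:
employee_df.loc[174, 'name']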
We will now see how to add new records to a DataFrame. First, let's recall what the employee_df DataFrame currently contains.
employee_df
We can use loc to add a new record with a specific row name.
employee_df.loc[232] = ['Iris', 34, 11.2, 30, 11.2 * 30]
employee_df
We can also create a DataFrame of new records with the same structure as our original DataFrame, and then combine the two using the append() method.
new_records = pd.DataFrame({
'name':['Jake', 'Kate'],
'age':[36, 29],
'rate':[11.7, 12.4],
'hours':[34, 32],
}, index=[251, 368])
new_records.loc[:,'pay'] = new_records.loc[:,'rate'] * new_records.loc[:,'hours']
new_records
employee_df.append(new_records)
It should be noted that append() returns a new DataFrame. It does not change either of the DataFrames being combined.
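Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0. In recent versions of pandas, pd.concat() performs the same task. The sketch below produces the same combined DataFrame:
pd.concat([employee_df, new_records])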
There are many ways to create DataFrames. At the beginning of this lesson, we saw how to create DataFrames from dictionaries. DataFrames can also be created from lists of lists. In this case, row and column names may be provided; if they are not, they will be set to be the same as the numerical indices.
LoL = [[30, 3.7, 'Ant', True],
[24, 1.3, 'Bird', True],
[45, 2.6, 'Cat', False],
[18, 4.2, 'Dog', True]]
unnamed_df = pd.DataFrame(LoL)
print(unnamed_df)
named_df = pd.DataFrame(data = LoL,
index = ['Row1', 'Row2', 'Row3', 'Row4'],
columns = ['Col1', 'Col2', 'Col3', 'Col4'])
print(named_df)
Tabular data is often stored in the form of text files. When a text file is used to store tabular data, each record (row) in the data set is recorded on a separate line of the text file. Elements belonging to different columns are separated by a specific sequence of one or more characters. This sequence is referred to as a separator or delimiter. Common choices for a delimiter are tabs, commas, and spaces. A data file using tabs as delimiters is often referred to as being tab-separated or tab-delimited. Similarly, a data file using commas as delimiters is often referred to as being comma-separated or comma-delimited. The extension used for a text file storing data can be essentially anything, but it is common to see .txt files used. When the file is comma-delimited, it will often have the extension .csv, which stands for "comma-separated values".
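As a concrete (hypothetical) illustration, the first few lines of a comma-delimited file might look like this:
eid,name,rate
214,Drew,11.50
174,Faye,12.30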
Pandas provides several tools for reading data from text files into DataFrames. Two common functions used for this task are pd.read_table() and pd.read_csv(). These functions have several parameters that control how the data is imported, but for many use cases, it is sufficient to use just two parameters: filepath_or_buffer and sep.
The filepath_or_buffer parameter expects a string that states the path to the data file. The path can be an absolute path that states exactly where the file lives on your system, or a relative path that explains where the file is stored in relation to the directory containing the notebook or script you are using. Relative paths tend to be more flexible, as they allow the directory containing the notebook and data file to be moved without having to update the path in the code.
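As an illustration, the two calls below would load the same (hypothetical) file using an absolute path and a relative path, respectively:
pd.read_csv('/home/user/project/data/employees.csv')   # absolute path
pd.read_csv('data/employees.csv')                      # relative path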
The sep parameter specifies the separator or delimiter used in the data file. The default value of sep is '\t' for pd.read_table(), and ',' for pd.read_csv().
These two functions can be used interchangeably, as long as you are careful to specify the correct separator for your data file.
We will now provide two examples of loading data into a Pandas DataFrame.
We will start by loading the Pima Indian Diabetes dataset. This data set contains information collected in a study on the prevalence of type 2 diabetes in a Native American community in Arizona. You can find more information about this dataset here: Pima Indian Diabetes.
The data is stored in the comma-separated file diabetes.csv within the data/ directory. In the cell below, we load this dataset into a DataFrame and then view its first few rows.
diabetes = pd.read_table('data/diabetes.csv', sep=',')
diabetes.head()
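Since ',' is the default value of sep for pd.read_csv(), we could have loaded the same file without specifying a separator. The cell below sketches the equivalent call:
diabetes = pd.read_csv('data/diabetes.csv')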
In the next example, we will load the Titanic dataset. This dataset contains information about the 887 passengers on the first and only voyage of the RMS Titanic. You can read more about this dataset here: Titanic Dataset.
The data is stored in the tab-delimited file titanic.txt within the data/ directory. In the cell below, we load this dataset into a DataFrame and then view its first few rows.
titanic = pd.read_table('data/titanic.txt', sep='\t')
titanic.head()