Working with Text Files¶
In this lesson, we will see how to use the open()
function to open an existing text file, or to create a new text file. We will see how to read text from a file and how to write text to a file.
In several of the examples we see in this lesson, we will be working with the file my_file.txt
in the data/
directory. This file is a text file containing the three lines of text shown below.
This is the first line.
This is the second line.
This is the third line.
Opening and Closing Files¶
We will start by discussing how to open and close files. These tasks can be accomplished using the open()
function and the close()
method of the object that it returns. The open()
function requires a parameter named file
that is expected to be a string representing the path to a file. This function also accepts a number of optional parameters. The most important of these is mode
, which we will consider later in the lesson.
In the cell below, we open the file my_file.txt
, storing the value returned into a variable named fin
(which stands for file input). We then print the type of fin
, and see that it has type _io.TextIOWrapper
. This object does not contain the actual text from the file, but instead provides a link through which we can access the contents of the file.
fin = open('data/my_file.txt')
print(type(fin))
<class '_io.TextIOWrapper'>
After running the cell above, the file will be open in Python. You won’t see the contents of the file as a window in your operating system, but the file is nonetheless open. If you were to try to delete the file at this point on Windows, you would likely see a message similar to the one below:
"The action can't be completed because the file is open in Python."
We can confirm that the file is open by printing the closed
attribute of the TextIOWrapper
object.
print(fin.closed)
False
It is good practice to always close files when you are done working with them. This can be accomplished using the close()
method of the TextIOWrapper
object.
Python will automatically close any open files when the Python session ends, but closing files manually frees up resources sooner, and is particularly important in programs that work with multiple files or very large files.
fin.close()
We will again check the value of the closed
attribute to confirm that the file has been closed.
print(fin.closed)
True
The Mode Parameter¶
We can use the mode
parameter to specify the type of file operations that should be allowed on the file we have opened. In particular, we can use mode
to specify if we would like for the text file to be read-only, or if writing to the file should be allowed.
A list of possible values for the mode
parameter is provided below, along with explanations of the purpose of these values.
r means “read”. A file opened in this mode will be read-only.
w means “write”. If the file does not exist, it is created. If the file does exist, it is overwritten.
x means “exclusive creation”. The file is created and opened for writing, but this mode will only work if the file does not already exist. If the file already exists, an error will occur.
a means “append”. This mode allows for new lines to be added to the end of a file.
The default value for mode
is r
, so if we only wish to read the contents of a file, we do not need to specify the mode
parameter.
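The differences between these modes are easiest to see side by side. The sketch below is only an illustration: it works in a temporary directory, and the file name mode_demo.txt is a placeholder chosen for this example rather than a file used elsewhere in the lesson.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
# The file name mode_demo.txt is a placeholder chosen for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'mode_demo.txt')

# 'w' creates the file, since it does not exist yet.
fout = open(path, 'w')
fout.write('original text')
fout.close()

# 'x' refuses to open a file that already exists.
try:
    fout = open(path, 'x')
    fout.close()
    x_result = 'file created'
except FileExistsError:
    x_result = 'FileExistsError'
print(x_result)

# 'a' keeps the existing content and adds to the end.
fout = open(path, 'a')
fout.write(' plus appended text')
fout.close()

# 'r' (the default) allows reading only.
fin = open(path)
after_append = fin.read()
fin.close()
print(after_append)

# 'w' on an existing file discards the old content.
fout = open(path, 'w')
fout.write('replacement text')
fout.close()

fin = open(path)
after_overwrite = fin.read()
fin.close()
print(after_overwrite)
```

Because w silently replaces existing content, x can be a safer choice when a program should never overwrite a file by accident.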
Reading File Contents¶
There are several tools available for reading the contents of an open file. The three most common such tools are the methods read()
, readlines()
, and readline()
.
The read() method will return a string that contains the entire content of the file.
The readlines() method will return a list of strings, with each string representing a single line of the file.
The readline() method will return a single string containing the next line of the file. Each call advances to the following line, and an empty string is returned once the end of the file has been reached.
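Since readline() is not demonstrated elsewhere in this lesson, a short sketch of its behavior may help. The example builds its own three-line file in a temporary directory (the path is a placeholder, not the lesson's data/my_file.txt) so that it can be run on its own.

```python
import os
import tempfile

# Build a three-line sample file in a temporary directory so the sketch
# is self-contained (the path is a placeholder, not the lesson's file).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')
fout = open(path, 'w')
fout.write('This is the first line.\nThis is the second line.\nThis is the third line.')
fout.close()

fin = open(path)
first = fin.readline()    # reads up to and including the first newline
second = fin.readline()   # continues where the previous call stopped
third = fin.readline()    # the last line has no trailing newline
at_end = fin.readline()   # nothing left, so an empty string is returned
fin.close()

print(repr(first))
print(repr(second))
print(repr(third))
print(repr(at_end))
```

The empty string returned at the end of the file is what lets a loop based on readline() know when to stop.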
read()¶
We will now take a look at an example of using the read()
method.
fin = open('data/my_file.txt')
contents = fin.read()
fin.close()
We will print the data type of the contents
variable to confirm that it is a string.
print(type(contents))
<class 'str'>
If we print contents
, we will see that it contains all three lines of my_file.txt
.
print(contents)
This is the first line.
This is the second line.
This is the third line.
If we display the contents
variable without using print()
, we can see that the string contains newline characters used to separate the lines.
contents
'This is the first line.\nThis is the second line.\nThis is the third line.'
If we wanted to separate each line of the file into its own string, we could use the split()
method, splitting the string on newline characters.
contents_list = contents.split('\n')
print(contents_list)
['This is the first line.', 'This is the second line.', 'This is the third line.']
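One detail worth knowing: if the text ends with a newline character, splitting on '\n' leaves an empty string at the end of the list. The string method splitlines() avoids this. The sketch below uses a hard-coded string (an assumption for illustration, not the contents of my_file.txt) to show the difference.

```python
text = 'This is the first line.\nThis is the second line.\n'

# Splitting on '\n' leaves an empty string after the final newline.
with_split = text.split('\n')
print(with_split)

# splitlines() does not produce that trailing empty string.
with_splitlines = text.splitlines()
print(with_splitlines)
```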
readlines()¶
We will now explore the readlines()
method. In the cell below, we open the file my_file.txt
, read its contents using readlines()
, and then close the file. We also display the results returned by readlines()
to confirm that this is a list of strings.
fin = open('data/my_file.txt')
contents_list = fin.readlines()
fin.close()
print(contents_list)
['This is the first line.\n', 'This is the second line.\n', 'This is the third line.']
Notice that every string above except the last ends with a newline character. If we wish to remove these, we can use the strip()
method, which removes whitespace characters (including newlines) from both ends of a string.
for i in range(len(contents_list)):
    contents_list[i] = contents_list[i].strip()
print(contents_list)
['This is the first line.', 'This is the second line.', 'This is the third line.']
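The same cleanup can be written more compactly with a list comprehension. The sketch below also uses rstrip('\n') instead of strip(), which may be preferable when leading whitespace is meaningful, since strip() removes whitespace from both ends of a string. The list raw_lines is a hard-coded stand-in for the result of readlines(), with an indented second line added for illustration.

```python
# Hard-coded stand-in for the list returned by readlines().
raw_lines = ['This is the first line.\n', '  indented second line.\n', 'This is the third line.']

# rstrip('\n') removes only trailing newline characters, so leading
# spaces (which strip() would also remove) are preserved.
clean_lines = [line.rstrip('\n') for line in raw_lines]
print(clean_lines)
```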
Using With¶
We can use the with
keyword to reduce the number of steps involved in working with a file. When we open a file using with
, the file will be automatically closed when we leave the with
block. The usage of this keyword is illustrated in the example below.
with open('data/my_file.txt') as fin:
    contents = fin.read()
contents_list = contents.split('\n')
print(contents_list)
['This is the first line.', 'This is the second line.', 'This is the third line.']
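To see that the file really is closed on leaving the with block, we can inspect the closed attribute afterwards. The sketch below also checks the case in which an error interrupts the block. It writes its own small file to a temporary directory (the name with_demo.txt is a placeholder for this example).

```python
import os
import tempfile

# Create a small file to read back (placeholder path for this sketch).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'with_demo.txt')
with open(path, 'w') as fout:
    fout.write('This is the first line.')

# Normal exit from the block closes the file.
with open(path) as fin:
    contents = fin.read()
closed_after_block = fin.closed
print(closed_after_block)

# The file is also closed when an error interrupts the block.
try:
    with open(path) as fin:
        number = int(fin.read())   # raises ValueError: the text is not a number
except ValueError:
    closed_after_error = fin.closed
print(closed_after_error)
```

This guarantee is the main reason with is generally preferred over calling close() by hand.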
Writing to a File¶
We will see how to write to a file by setting the mode
parameter of open()
to w
. When using mode='w'
, a new file will be created if one does not already exist with the specified name. If the file does already exist, then it will be overwritten.
In the cell below, we will create a file named new_file.txt
within the data/
folder, and will then write three lines to it.
line1 = 'This is the first line.\n'
line2 = 'This is the second line.\n'
line3 = 'This is the third line.'
with open('data/new_file.txt', 'w') as fout:
    fout.write(line1)
    fout.write(line2)
    fout.write(line3)
We will confirm that the file was written correctly by opening the file in read-only mode and printing its contents.
with open('data/new_file.txt') as fin:
    print(fin.read())
This is the first line.
This is the second line.
This is the third line.
Appending¶
If we open a file using mode='a'
, then we can write to the end of the file. This will not delete the current content of the file, but will instead append new lines to the end of the file.
line4 = '\nThis is the fourth line.'
line5 = '\nThis is the fifth line.'
with open('data/new_file.txt', 'a') as fout:
    fout.write(line4)
    fout.write(line5)
We will confirm that the new content was written to the file by opening the file in read-only mode and printing its contents.
with open('data/new_file.txt') as fin:
    print(fin.read())
This is the first line.
This is the second line.
This is the third line.
This is the fourth line.
This is the fifth line.
Processing Strings of Text¶
Occasionally, we will need to break each line of a text file up into smaller pieces called tokens. This is particularly necessary when reading tabular data that has been stored as a text file.
In the example below, we will open the file titanic.txt
and read its contents using readlines()
. We will then split each line into tokens, and print the contents of each line in a tabular format.
with open('data/titanic.txt') as fin:
    line_list = fin.readlines()
for line in line_list[:20]:
    tokens = line.split('\t')
    print(f'{tokens[0]:<10}{tokens[1]:<8}{tokens[3]:<10}{tokens[4]:<8}{tokens[2]:<60}')
Survived  Pclass  Sex       Age     Name
0         3       male      22      Mr. Owen Harris Braund
1         1       female    38      Mrs. John Bradley (Florence Briggs Thayer) Cumings
1         3       female    26      Miss. Laina Heikkinen
1         1       female    35      Mrs. Jacques Heath (Lily May Peel) Futrelle
0         3       male      35      Mr. William Henry Allen
0         3       male      27      Mr. James Moran
0         1       male      54      Mr. Timothy J McCarthy
0         3       male      2       Master. Gosta Leonard Palsson
1         3       female    27      Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson
1         2       female    14      Mrs. Nicholas (Adele Achem) Nasser
1         3       female    4       Miss. Marguerite Rut Sandstrom
1         1       female    58      Miss. Elizabeth Bonnell
0         3       male      20      Mr. William Henry Saundercock
0         3       male      39      Mr. Anders Johan Andersson
0         3       female    14      Miss. Hulda Amanda Adolfina Vestrom
1         2       female    55      Mrs. (Mary D Kingcome) Hewlett
0         3       male      2       Master. Eugene Rice
1         2       male      23      Mr. Charles Eugene Williams
0         3       female    31      Mrs. Julius (Emelia Maria Vandemoortele) Vander Planke
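Tab-separated data like this can also be read with the csv module from the Python standard library, which performs the splitting for us. One practical difference is that split('\t') leaves the newline character attached to the last token of each line, whereas csv.reader drops the line ending. The sketch below is a minimal illustration under stated assumptions: it writes a tiny sample file of its own (titanic_sample.txt, a placeholder name) containing only the five columns printed above, in the order implied by the token indices used in the example. The real titanic.txt may contain more columns.

```python
import csv
import os
import tempfile

# Write a tiny tab-separated sample file. The column order
# (Survived, Pclass, Name, Sex, Age) is assumed from the token
# indices used in the lesson's example.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'titanic_sample.txt')
with open(path, 'w', newline='') as fout:
    fout.write('Survived\tPclass\tName\tSex\tAge\n')
    fout.write('0\t3\tMr. Owen Harris Braund\tmale\t22\n')
    fout.write('1\t3\tMiss. Laina Heikkinen\tfemale\t26\n')

# csv.reader splits each line on the delimiter and drops the line ending.
with open(path, newline='') as fin:
    rows = list(csv.reader(fin, delimiter='\t'))

for row in rows:
    print(f'{row[0]:<10}{row[1]:<8}{row[3]:<10}{row[4]:<8}{row[2]}')
```

Passing newline='' to open() follows the recommendation in the csv module documentation, which lets the module handle line endings itself.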