Python Tips: Who Put That Unicode In My CSV File?
From time to time, it becomes necessary to read in the contents of a CSV file in Python. You would think this is straightforward given that there is a csv module available. However, I recently ran into an issue reading a CSV file I created in Excel and I thought I'd share with you how I fixed it.
The Problem
I wanted to write a simple script that reads in the contents of a CSV file and prints out the columns. Here is the original script:
1 #!/usr/bin/env python3
2
3 import csv
4
5
6 file_handle = open('users.csv', mode='r')
7 contents = csv.DictReader(file_handle)
8
9 users = []
10 email_addresses = []
11
12 for row in contents:
13     print(row)
14     users.append(row['User'])
15     email_addresses.append(row['Email'])
16
17 print('Users:', users)
18 print('Email addresses:', email_addresses)
19
20 file_handle.close()
At first glance, this seems pretty simple. Let's break it down:
- Lines 1 - 7: open the CSV file and wrap it in a csv.DictReader, an iterator that returns each row as a dictionary keyed by the column headings. This makes looping through the contents of the file easier. I won't explain what an iterator is in this post but, trust me, this is a better approach than reading the file line by line and splitting it yourself.
- Lines 8 - 11: create some empty lists to hold the column information. These will be populated in the for loop.
- Lines 12 - 16: loop through the contents of the file and, for every row:
  1. print the row
  2. add the name in the User column to the users list
  3. add the address in the Email column to the email_addresses list
- Lines 17 - 20: print the contents of the lists and close the file
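To make the DictReader part a little more concrete, here is a minimal sketch that runs against an in-memory string instead of a real file. The sample data is made up for illustration and has no BOM in it.

```python
import csv
import io

# Hypothetical in-memory stand-in for a small CSV file (no BOM here).
sample = io.StringIO("User,Email\nBubba,bubba@bubbagump.com\n")

reader = csv.DictReader(sample)
rows = list(reader)  # consume the iterator; each item is one data row

print(reader.fieldnames)   # ['User', 'Email'] (taken from the first line)
print(rows[0]['User'])     # Bubba
```

The keys of every row come straight from the first line of the input, so anything unexpected in that first line ends up in the key names.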
The script seems like it should work but when I run it I get this:
{'\ufeffUser': 'Bubba', 'Email': 'bubba@bubbagump.com'}
Traceback (most recent call last):
  File "./read_csv.py", line 14, in <module>
    users.append(row['User'])
KeyError: 'User'
What is going on? Notice that the very first item in the Dictionary entry is wrong. Instead of just 'User', the column heading is being read in as '\ufeffUser'.
That '\ufeff' is the BOM:
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data.
That BOM code is confusing my program and it can't accurately capture the column heading.
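The description above talks about UTF-16, but the same character can also sit at the start of a UTF-8 file, where it acts as a signature rather than a byte-order hint. My assumption is that this is what Excel wrote into my file. Here is a small sketch of what that looks like at the byte level; the header bytes are made up to match my file's columns.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte sequence EF BB BF.
bom_bytes = '\ufeff'.encode('utf-8')
print(bom_bytes)                      # b'\xef\xbb\xbf'
print(bom_bytes == codecs.BOM_UTF8)   # True

# Hypothetical first line of a BOM-prefixed CSV file.
raw_header = codecs.BOM_UTF8 + b'User,Email\n'

# Decoding as plain UTF-8 keeps the BOM as the first character of the text.
decoded = raw_header.decode('utf-8')
print(repr(decoded))                  # '\ufeffUser,Email\n'
```

That leading '\ufeff' is exactly what ends up glued to the front of the first column heading.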
The Solution
In order to prevent Python from reading in the BOM as part of the data, an extra parameter needs to be passed to the open function to tell it what kind of encoding to use. Specifically, the open function should be told to read the file using the 'utf-8-sig' encoding.
Using utf-8-sig to read a file treats a leading BOM as information about the file and skips it, instead of returning it as part of the text.
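To see the difference between the two encodings without touching a file, here is a rough sketch that decodes the same made-up bytes both ways. It also checks that utf-8-sig is harmless when the data has no BOM at all.

```python
import codecs

raw = codecs.BOM_UTF8 + b'User,Email\n'   # hypothetical BOM-prefixed data
plain = b'User,Email\n'                    # the same data without a BOM

with_bom_utf8 = raw.decode('utf-8')        # BOM survives as '\ufeff'
with_bom_sig = raw.decode('utf-8-sig')     # BOM is stripped
no_bom_sig = plain.decode('utf-8-sig')     # nothing to strip, text unchanged

print(repr(with_bom_utf8))   # '\ufeffUser,Email\n'
print(repr(with_bom_sig))    # 'User,Email\n'
print(repr(no_bom_sig))      # 'User,Email\n'
```

Since utf-8-sig only skips the BOM when one is present, it should be a safe choice for UTF-8 CSV files whether or not they came from Excel.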
So line 6 in the script should look like this:
file_handle = open('users.csv', mode='r', encoding='utf-8-sig')
Now when I run the script I get the expected output:
{'User': 'Bubba', 'Email': 'bubba@bubbagump.com'}
{'User': 'Forest', 'Email': 'forest@bubbagump.com'}
{'User': 'Jenny', 'Email': 'jenny@bubbagump.com'}
Users: ['Bubba', 'Forest', 'Jenny']
Email addresses: ['bubba@bubbagump.com', 'forest@bubbagump.com', 'jenny@bubbagump.com']
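As an aside, the same fix can be wrapped in a function that uses a with block, so the file is closed even if something goes wrong partway through. This is only a sketch, not the script I actually ran: the read_users name is made up, the newline='' argument is what the csv module documentation recommends when opening files for it, and the demo at the bottom writes a throwaway BOM-prefixed file so that it runs on its own.

```python
import csv
import os
import tempfile


def read_users(path):
    """Return (users, email_addresses) from a CSV with User and Email columns."""
    users = []
    email_addresses = []
    # utf-8-sig skips a leading BOM if there is one; newline='' lets the
    # csv module handle line endings itself.
    with open(path, mode='r', encoding='utf-8-sig', newline='') as file_handle:
        for row in csv.DictReader(file_handle):
            users.append(row['User'])
            email_addresses.append(row['Email'])
    return users, email_addresses


if __name__ == '__main__':
    # Writing with utf-8-sig puts a BOM at the start of the file,
    # which stands in for a file saved by a spreadsheet program.
    with tempfile.NamedTemporaryFile('w', encoding='utf-8-sig', newline='',
                                     suffix='.csv', delete=False) as tmp:
        tmp.write('User,Email\nBubba,bubba@bubbagump.com\n')
    try:
        print(read_users(tmp.name))
    finally:
        os.remove(tmp.name)
```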
Conclusion
Google is your friend. I searched for that Unicode string and found several resources that explained what it was, why it was happening, and how to fix it. The other lesson to take from this is that you shouldn't give up just because things don't work out right away. Someone just starting to learn how to script could have decided that scripting/Python is too hard and given up. Don't. This is what makes engineering engineering. This is the part they don't teach you in school. This is what sets you apart from others. So if things get difficult, use that as an opportunity to learn something new. Then pass that knowledge along. Happy scripting.