How To Read CSV Files Using Pandas: Step-By-Step

Do you have data in CSV format and would you like to know how to read CSV files in your Python application using Pandas? We will go through that in this guide.

The Python Pandas module provides the read_csv() function to read data from CSV files. This function stores the data from the CSV file into a data type called DataFrame. You can use Python code to read columns and filter rows from the Pandas DataFrame.

I decided to write this article because I found that other articles available online about reading CSV files with Pandas are not very easy to understand, especially if you are getting started with Python.

I will do my best to make every step of this tutorial as clear as possible.

Let’s get started!

Using Pandas to Read The Content of a CSV File

I have created a CSV file called test.csv with the following content, in the same directory as the Python program that reads it. In this file I have used the comma as the delimiter:

username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome

To read a CSV file using Pandas, first import the Pandas module and then call the read_csv() function, passing it the name of the CSV file.

Note: if you see an error when importing the Pandas module, Pandas is probably not installed. Install it first (for example with pip install pandas) and run the program again.

import pandas as pd

df = pd.read_csv('test.csv')

Using the keyword “as” in the import statement allows you to refer to the Pandas module using the shorter alias “pd”.

But, what is df?

Let’s use the Python print() and type() functions to learn more about the variable df:

print(type(df))

[output]
<class 'pandas.core.frame.DataFrame'>

You can see from the output that the data type of the variable df is a DataFrame.

The DataFrame is a data type that belongs to the Pandas library and allows you to store and read tabular data. Tabular data is data that is organised in a table using rows and columns.

Let’s see what’s in the dataframe variable returned by the read_csv() function…

print(df)

[output]
  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome
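
Printing the whole dataframe works for a small file like this one. As a minimal sketch of how you could inspect a dataframe without printing every row, the example below uses the shape, columns and dtypes attributes and the head() method. To keep it runnable on its own, it reads the same content as test.csv from an in-memory string (the csv_text variable and the use of io.StringIO are assumptions made for this example, not part of the original program).

```python
import io

import pandas as pd

# Same content as test.csv, held in memory so the example runs on its own.
csv_text = """username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns) -> (8, 3)
print(list(df.columns))  # column names taken from the header line
print(df.dtypes)         # data type Pandas inferred for each column
print(df.head(3))        # only the first three rows
```

With a real file you would pass the file name to read_csv() exactly as before; only the inspection calls are new.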

How To Read A Column From a CSV File Using Pandas

Now that we have stored the data from our CSV file in a dataframe, the next step is to read this data.

How can we get the value from a column in the dataframe?

As you can see, there are three columns in our dataframe: username, age and city.

  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome

We can retrieve the values in a column by passing the column name (a Python string) within square brackets immediately after the dataframe variable:

print(df['username'])

[output]
0    user1
1    user2
2    user3
3    user4
4    user5
5    user6
6    user7
7    user8
Name: username, dtype: object

I wonder what data type contains the values in a column…

…let’s find out!

print(type(df['username']))

[output]
<class 'pandas.core.series.Series'>

It’s a Pandas Series.

I will cover Pandas Series in a different article to avoid making things too complex in this tutorial.
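
Even without going into the details of Series, two operations are handy at this point. The sketch below, which again reads the data from an in-memory string instead of test.csv (an assumption made to keep it self-contained), converts a column to a plain Python list with tolist() and selects two columns at once by passing a list of column names inside the brackets.

```python
import io

import pandas as pd

# Same content as test.csv, held in memory so the example runs on its own.
csv_text = """username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome
"""

df = pd.read_csv(io.StringIO(csv_text))

# A single column is a Series; tolist() turns it into a regular Python list.
username_list = df['username'].tolist()
print(username_list)

# A list of column names inside the brackets returns a smaller DataFrame.
subset = df[['username', 'city']]
print(type(subset))
print(subset)
```

Note the double brackets in the second case: the outer pair selects from the dataframe, the inner pair is the Python list of column names.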

You can also get the values in a dataframe column by using the dot notation.

Specify the dataframe name followed by a dot, followed by the name of the column.

print(df.username)

[output]
0    user1
1    user2
2    user3
3    user4
4    user5
5    user6
6    user7
7    user8
Name: username, dtype: object

In a real Python application you would assign a column to a variable, for example:

usernames = df.username

Then you would use an index to access individual values from that variable:

print(usernames[0])
print(usernames[2])

[output]
user1
user3

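You can see that we use an index to access elements in the Series that contains the column values, similarly to how you would access elements in a Python list. One difference: the number in brackets is matched against the index labels shown on the left of the output (0 to 7 here), not against the position in the Series.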
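
With the default index, labels and positions are the same numbers, so the distinction is easy to miss. The following sketch shows where they differ: it takes the last three usernames with tail() and then accesses them by label with loc and by position with iloc. The in-memory csv_text string stands in for test.csv and is an assumption made for this example.

```python
import io

import pandas as pd

# Same content as test.csv, held in memory so the example runs on its own.
csv_text = """username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome
"""

df = pd.read_csv(io.StringIO(csv_text))

# The last three rows keep their original index labels: 5, 6 and 7.
last_three = df['username'].tail(3)
print(last_three.index.tolist())

print(last_three.loc[5])   # access by index label
print(last_three.iloc[0])  # access by position, like a Python list

# Plain brackets look up the label, and no row is labelled 0 here.
try:
    last_three[0]
except KeyError:
    print('There is no row with the label 0')
```

If you want list-like behaviour regardless of the index labels, iloc is the safer choice.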

Note: if the column name contains spaces, you can only use the bracket notation, not the dot notation.

Try to use a column name with spaces with the dot notation and see what happens!
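
Here is a small sketch of the two options you have for a column whose name contains a space. The CSV content with a "user name" header is hypothetical and made up for this example; it is not the test.csv file used in the rest of the tutorial.

```python
import io

import pandas as pd

# Hypothetical CSV content whose first column name contains a space.
csv_text = """user name,age
user1,23
user2,45
"""

df = pd.read_csv(io.StringIO(csv_text))

# Option 1: the bracket notation works with any column name.
print(df['user name'])

# Writing df.user name instead is not valid Python, so the program
# would stop with a SyntaxError before it even starts running.

# Option 2: rename the column, after which the dot notation works too.
renamed = df.rename(columns={'user name': 'user_name'})
print(renamed.user_name)
```

The rename() call returns a new dataframe and leaves df unchanged.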

How To Read Specific Rows From a CSV File Using Pandas

To read specific rows of a dataframe that contains data coming from a CSV file, you can use logical conditions.

For example…

Let’s say you want to get the rows in the CSV file that contain users whose age is greater than 30.

To do that you can use the following expression:

print(df[df.age > 30])

[output]
  username  age           city
1    user2   45          Paris
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome

We have seen that…

Writing the DataFrame name followed by square brackets that contain a logical condition returns the rows of the DataFrame that match that condition.
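
To make this mechanism more visible, the sketch below first stores the condition in its own variable (a Series of True and False values, one per row) and then combines two conditions with the & (and) and | (or) operators. The data is read from an in-memory string that mirrors test.csv, which is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# Same content as test.csv, held in memory so the example runs on its own.
csv_text = """username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome
"""

df = pd.read_csv(io.StringIO(csv_text))

# The condition on its own is a Series of booleans, one value per row.
mask = df['age'] > 30
print(mask)

# Passing the mask inside the brackets keeps only the rows marked True.
print(df[mask])

# Each condition needs its own parentheses when combined with & or |.
between = df[(df['age'] > 30) & (df['age'] < 50)]
print(between)

london_or_paris = df[(df['city'] == 'London') | (df['city'] == 'Paris')]
print(london_or_paris)
```

The parentheses matter because & and | bind more tightly than comparison operators such as > and ==.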

How To Read a CSV File With a Different Separator Than the Comma Using Pandas

Let’s see what happens with the code we have used so far if the CSV file uses a different separator than the comma (e.g. the semicolon).

username;age;city
user1;23;London
user2;45;Paris
user3;30;New York
user4;60;San Francisco
user5;53;Hong Kong
user6;34;Dublin
user7;46;Barcelona
user8;32;Rome

Execute the following code:

import pandas as pd

df = pd.read_csv('test.csv')
print(df)

Here is the output of our Python program…

It’s a bit messy! Why?!?

        username;age;city
0         user1;23;London
1          user2;45;Paris
2       user3;30;New York
3  user4;60;San Francisco
4      user5;53;Hong Kong
5         user6;34;Dublin
6      user7;46;Barcelona
7           user8;32;Rome

That’s because the read_csv() function uses the comma as its default separator, so it cannot split the fields in this file and reads every line as a single column.

To read the CSV file correctly we have to pass the additional sep argument to the read_csv() function and set its value to the separator used in the CSV file.

df = pd.read_csv('test.csv', sep=';')
print(df)

Execute this code and confirm that the read_csv() Pandas function is able to split the fields in the CSV file correctly.

You should see the following output:

  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome
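
If you are not sure whether a file was split correctly, checking the number of columns is a quick test. The sketch below reads the same semicolon-separated content twice, once with the default separator and once with sep=';', and compares the shape of the two dataframes. The content is read from an in-memory string rather than from test.csv, an assumption made so the example runs on its own.

```python
import io

import pandas as pd

# Same semicolon-separated content as above, held in memory.
csv_text = """username;age;city
user1;23;London
user2;45;Paris
user3;30;New York
user4;60;San Francisco
user5;53;Hong Kong
user6;34;Dublin
user7;46;Barcelona
user8;32;Rome
"""

# Default separator (comma): every line ends up in one single column.
wrong = pd.read_csv(io.StringIO(csv_text))
print(wrong.shape)          # (8, 1)
print(list(wrong.columns))  # ['username;age;city']

# Correct separator: the three fields become three columns.
right = pd.read_csv(io.StringIO(csv_text), sep=';')
print(right.shape)          # (8, 3)
print(list(right.columns))  # ['username', 'age', 'city']
```

A dataframe with a single column whose name contains the separator character is a strong hint that the sep argument is missing or wrong.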

Conclusion

Now you know the basics of how to read a CSV file using the Pandas Python module.

You can read columns, filter rows based on a logical condition and read CSV files that use a separator other than the comma.

What’s the next step?

It’s important to fit the concepts you have learned in this tutorial into a bigger context that gives you a clear understanding of how to use them.

Go through Introduction to Data Science in Python, a great way to:

  • learn how to use Pandas in your Python programs.
  • practice using Pandas in a real Python environment.
  • receive continuous feedback to understand if you are improving as a developer.
  • get introduced to the Data Science learning path.