How To Read CSV Files Using Pandas: Step-By-Step

Do you have data in CSV format and would you like to know how to read CSV files in your Python application using Pandas? We will go through that in this guide.

The Python Pandas library provides the read_csv() function to read data from CSV files. This function stores data from a CSV file into a Pandas data type called DataFrame. You can use Python code to read columns and filter rows from the Pandas DataFrame.

Importing data is the first step of many Python applications, this is an important concept to understand. This Pandas tutorial will show you how to read CSV files using Pandas step-by-step.

Let’s get started!

Using Pandas to Read The Content of a CSV File with Header

I have created a CSV file called test.csv with the following content in the same directory of the Python program that reads the CSV file. In this file I have used the comma as a delimiter:

username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome

On the first line of the file, you can see the header that contains a list of values to use as column names for the data in the CSV file.

The value of the first column (username) in the first record is user1, in the second record user2, and so on.

To read a CSV file using Pandas first import the Pandas module and then use the read_csv() function passing to it the name of the CSV file.

Note: if you see an error when importing the Pandas module check how to install the Pandas module.

import pandas as pd

df = pd.read_csv('test.csv')

Using the keyword “as” in the import statement allows you to refer to the Pandas module using the shorter variable “pd”.

But, what is df?

Let’s use the Python print statement and the type function to know more about the variable df:

print(type(df))

[output]
<class 'pandas.core.frame.DataFrame'>

You can see from the output that the data type of the variable df is a DataFrame.

The DataFrame is a data type that belongs to the Pandas library and allows to store and read tabular data. Tabular data is data that is organized in a table using rows and columns.

With this code we have used pandas.read_csv() to read a CSV file into dataframe df.

Let’s see what’s in the dataframe variable returned by the pandas.read_csv() function:

print(df)

[output]
  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome

Related course: Do you want to get started with Data Science in Python? Have a look at Introduction to Data Science in Python and get ready to work on Data Science projects.

How Do You Get the Columns From a CSV File with Pandas?

We have seen how to load the data from a CSV file into a Pandas DataFrame.

To print the name of the columns in a CSV file you can use the columns attribute of the DataFrame after loading CSV data into it.

import pandas as pd

df = pd.read_csv('test.csv')
print(df.columns)

[output]
Index(['username', 'age', 'city'], dtype='object')

How To Read Specific Columns From a CSV File Using Pandas

Now that we have stored the data from our CSV file into a dataframe the next step is to read this data.

How can we get the value from a column in the dataframe?

As you can see there are three columns in our dataframe: username, age, and city.

  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome

We can retrieve the values in a column by passing the column name (a Python string) within square brackets immediately after the dataframe variable:

print(df['username'])

[output]
0    user1
1    user2
2    user3
3    user4
4    user5
5    user6
6    user7
7    user8
Name: username, dtype: object

I wonder what data type contains the values in a column…

…let’s find out!

print(type(df['username']))

[output]
<class 'pandas.core.series.Series'>

It’s a Pandas Series.

I will cover Pandas Series in a different article to avoid making things too complex in this tutorial.

You can also get the values in a dataframe column by using the dot notation.

Specify the dataframe name followed by a dot, followed by the name of the column.

print(df.username)

[output]
0    user1
1    user2
2    user3
3    user4
4    user5
5    user6
6    user7
7    user8
Name: username, dtype: object

In a real Python application, you would assign a column to a variable, for example:

usernames = df.username

Then using indexes you would access values from the username Series:

print(usernames[0])
print(usernames[2])

[output]
user1
user3

You can see that we are using an index to access elements in the Series that contains column values in a similar way to what you would do to access elements from a Python list.

Note: if the column name contains spaces you can only use the notation with brackets and not the dot notation.

Try to use a column name with spaces with the dot notation and see what happens!

How To Read Specific Rows From a CSV File Using Pandas

To read specific rows of a dataframe that contains data from a CSV file you can use logical statements.

For example…

Let’s say you want to get the rows in the CSV file that contain users whose age is greater than 30.

To do that you can use the following expression:

print(df[df.age > 30])

[output]
  username  age           city
1    user2   45          Paris
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome

We have seen that…

The Pandas DataFrame name, followed by brackets, followed by a logical statement returns the rows in the DataFrame that match the condition in the logical statement.

How Do You Specify a Separator When Reading a CSV File Using Pandas?

Let’s see what happens with the code we have used so far if the CSV file uses a different separator than the comma (e.g. the semicolon).

username;age;city
user1;23;London
user2;45;Paris
user3;30;New York
user4;60;San Francisco
user5;53;Hong Kong
user6;34;Dublin
user7;46;Barcelona
user8;32;Rome

Execute the following code:

import pandas as pd

df = pd.read_csv('test.csv')
print(df)

Here is the output of our Python program…

        username;age;city
0         user1;23;London
1          user2;45;Paris
2       user3;30;New York
3  user4;60;San Francisco
4      user5;53;Hong Kong
5         user6;34;Dublin
6      user7;46;Barcelona
7           user8;32;Rome

It’s a bit messy! Why?!?

The Pandas read_csv() function is not able to split the fields in the CSV file because it uses the comma as the default separator.

To load data from the CSV file and parse it correctly, we have to pass the additional sep argument to the read_csv() function and set its value to the separator used in the CSV file.

df = pd.read_csv('test.csv', sep=';')
print(df)

Execute this code and confirm that the read_csv() Pandas function is able to split the fields in the CSV file correctly.

You should see the following output:

  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong
5    user6   34         Dublin
6    user7   46      Barcelona
7    user8   32           Rome

The default separator for the Pandas read_csv function is the comma. To use a different separator pass the additional argument “sep” to the read_csv function and set the value of sep to the new separator.

How to Read the First Rows of a CSV File Using Pandas

To get the first rows of a CSV file using the Python Pandas library you can read the CSV file into a DataFrame and then use the DataFrame head() function.

import pandas as pd

df = pd.read_csv('test.csv')
print(df.head())

[output]
  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
4    user5   53      Hong Kong

By default, the head() function returns the first 5 rows of a dataframe.

If you want to get back a different number of rows you can pass the number of rows as an argument to the head() function.

print(df.head(3))

[output]
  username  age      city
0    user1   23    London
1    user2   45     Paris
2    user3   30  New York

You can see that the code above has returned the first 3 rows because we have passed the number 3 as argument to the head() function.

How to Read Only Specific Columns From a CSV File Using Pandas

You might only need some of the columns from the CSV files. In that case, importing only those columns into the DataFrame would help you save memory.

To import only specific columns from a CSV file you can pass the additional argument usecols to the read_csv function. This argument will contain the list of columns to read.

Let’s assume we only want to read the username and age columns from our existing CSV file. We can use the following Python code:

import pandas as pd

columns = ['username', 'age']
df = pd.read_csv('test.csv', usecols=columns)
print(df)

You can see the column city has not been imported into the DataFrame.

  username  age
0    user1   23
1    user2   45
2    user3   30
3    user4   60
4    user5   53
5    user6   34
6    user7   46
7    user8   32

We have seen how selecting columns works when reading data from a CSV file.

How to Get Statistics For a Column in a CSV File Using Pandas

It can be very useful to calculate statistics like minimum, maximum, mean, and standard deviation for a numeric column in a CSV file. We can use the DataFrame describe function for that.

For example, let’s calculate statistics for the data in the “age” column of our CSV file.

import pandas as pd

df = pd.read_csv('test.csv')
print(df['age'].describe())

The output contains:

  • number of rows
  • mean
  • standard deviation
  • minimum of the values
  • lower percentile (25%)
  • median (50% percentile)
  • upper percentile (75%)
  • maximum of the values
count     8.000000
mean     40.375000
std      12.637105
min      23.000000
25%      31.500000
50%      39.500000
75%      47.750000
max      60.000000
Name: age, dtype: float64

Quite handy to have these values calculated with a single function call.

What is Chunksize When Reading a CSV File with Pandas?

If you want to read a big CSV file with Pandas and you have issues with the memory available on your computer, you can read the CSV file in chunks.

To read a CSV file in multiple chunks using Pandas, you can pass the chunksize argument to the read_csv function and loop through the data returned by the function.

Below you can see the code to read our test CSV file using a chunksize of 4.

import pandas as pd

for chunk in pd.read_csv('test.csv', chunksize=4):
    print("######## Chunk ########")
    print(chunk)
    print(type(chunk))

Considering that in our CSV file there are 8 records and the chunksize is 4, we read the CSV file in two chunks.

You can also see from the output that each chunk is a Pandas dataframe.

######## Chunk ########
  username  age           city
0    user1   23         London
1    user2   45          Paris
2    user3   30       New York
3    user4   60  San Francisco
<class 'pandas.core.frame.DataFrame'>
######## Chunk ########
  username  age       city
4    user5   53  Hong Kong
5    user6   34     Dublin
6    user7   46  Barcelona
7    user8   32       Rome
<class 'pandas.core.frame.DataFrame'>

Conclusion

Now you know the basics of how to work with data from a CSV file using the Pandas Python library.

You can read columns, filter rows based on logical statements, and read CSV files that use a separator different than the comma.

What’s the next step?

It’s important to fit the concepts you have learned in this tutorial into a wider context that gives you a clear understanding of how to use them.

To do that, go through Introduction to Data Science in Python, a great way to:

  • learn how to use Pandas in your Python programs.
  • practice using Pandas in a real Python environment.
  • receive continuous feedback to understand if you are improving as a developer.
  • introduce you to the Data Science learning path.

Leave a Comment