Disclosure: Some of the links and banners on this page may be affiliate links, which can provide compensation to CodeFatherTech (https://codefather.tech) at no extra cost to you. CodeFatherTech is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn fees by linking to Amazon.com and affiliated sites. CodeFatherTech also participates in affiliate programs from DataCamp, Coursera, and other sites. Our affiliate disclaimer is available here.
Do you have data in CSV format and would you like to know how to read CSV files in your Python application using Pandas? We will go through that in this guide.
The Python Pandas library provides the read_csv() function to read data from CSV files. This function stores data from a CSV file into a Pandas data type called DataFrame. You can use Python to read columns and filter rows from the Pandas DataFrame.
Importing data is the first step of many Python applications, this is an important concept to understand. This Pandas tutorial will show you how to read CSV files using Pandas step-by-step.
Let’s get started!
Using Pandas to Read The Content of a CSV File with Header
I have created a CSV file called test.csv with the following content in the same directory of the Python program that reads the CSV file. In this file I have used the comma as a delimiter:
username,age,city
user1,23,London
user2,45,Paris
user3,30,New York
user4,60,San Francisco
user5,53,Hong Kong
user6,34,Dublin
user7,46,Barcelona
user8,32,Rome
On the first line of the file, you can see the header that contains a list of values to use as column names for the data in the CSV file.
The value of the first column (username) in the first record is user1, in the second record user2, and so on.
To read a CSV file using Pandas first import the Pandas module and then use the read_csv() function passing to it the name of the CSV file.
Note: if you see an error when importing the Pandas module check how to install the Pandas module.
import pandas as pd
df = pd.read_csv('test.csv')
Using the keyword “as” in the import statement allows you to refer to the Pandas module using the shorter variable “pd”.
But, what is df?
Let’s use the Python print statement and the type function to know more about the variable df:
print(type(df))
[output]
<class 'pandas.core.frame.DataFrame'>
You can see from the output that the data type of the variable df is a DataFrame.
The DataFrame is a data type that belongs to the Pandas library and allows to store and read tabular data. Tabular data is data that is organized in a table using rows and columns.
With this code we have used pandas.read_csv() to read a CSV file into dataframe df.
Let’s see what’s in the dataframe variable returned by the pandas.read_csv() function:
print(df)
[output]
username age city
0 user1 23 London
1 user2 45 Paris
2 user3 30 New York
3 user4 60 San Francisco
4 user5 53 Hong Kong
5 user6 34 Dublin
6 user7 46 Barcelona
7 user8 32 Rome
Related course: Do you want to get started with Data Science in Python? Have a look at Introduction to Data Science in Python and get ready to work on Data Science projects.
How Do You Get the Columns From a CSV File with Pandas?
We have seen how to load the data from a CSV file into a Pandas DataFrame.
To print the name of the columns in a CSV file you can use the columns attribute of the DataFrame after loading CSV data into it.
import pandas as pd
df = pd.read_csv('test.csv')
print(df.columns)
[output]
Index(['username', 'age', 'city'], dtype='object')
How To Read Specific Columns From a CSV File Using Pandas
Now that we have stored the data from our CSV file into a dataframe the next step is to read this data.
How can we get the value from a column in the dataframe?
As you can see there are three columns in our dataframe: username, age, and city.
username age city
0 user1 23 London
1 user2 45 Paris
2 user3 30 New York
3 user4 60 San Francisco
4 user5 53 Hong Kong
5 user6 34 Dublin
6 user7 46 Barcelona
7 user8 32 Rome
We can retrieve the values in a column by passing the column name (a Python string) within square brackets immediately after the dataframe variable:
print(df['username'])
[output]
0 user1
1 user2
2 user3
3 user4
4 user5
5 user6
6 user7
7 user8
Name: username, dtype: object
I wonder what data type contains the values in a column…
…let’s find out!
print(type(df['username']))
[output]
<class 'pandas.core.series.Series'>
It’s a Pandas Series.
I will cover Pandas Series in a different article to avoid making things too complex in this tutorial.
You can also get the values in a dataframe column by using the dot notation.
Specify the dataframe name followed by a dot, followed by the name of the column.
print(df.username)
[output]
0 user1
1 user2
2 user3
3 user4
4 user5
5 user6
6 user7
7 user8
Name: username, dtype: object
In a real Python application, you would assign a column to a variable, for example:
usernames = df.username
Then using indexes you would access values from the username Series:
print(usernames[0])
print(usernames[2])
[output]
user1
user3
You can see that we are using an index to access elements in the Series that contains column values in a similar way to what you would do to access elements from a Python list.
Note: if the column name contains spaces you can only use the notation with brackets and not the dot notation.
Try to use a column name with spaces with the dot notation and see what happens!
How To Read Specific Rows From a CSV File Using Pandas
To read specific rows of a dataframe that contains data from a CSV file you can use logical statements.
For example…
Let’s say you want to get the rows in the CSV file that contain users whose age is greater than 30.
To do that you can use the following expression:
print(df[df.age > 30])
[output]
username age city
1 user2 45 Paris
3 user4 60 San Francisco
4 user5 53 Hong Kong
5 user6 34 Dublin
6 user7 46 Barcelona
7 user8 32 Rome
We have seen that…
The Pandas DataFrame name, followed by brackets, followed by a logical statement returns the rows in the DataFrame that match the condition in the logical statement.
How Do You Specify a Separator When Reading a CSV File Using Pandas?
Let’s see what happens with the code we have used so far if the CSV file uses a different separator than the comma (e.g. the semicolon).
username;age;city
user1;23;London
user2;45;Paris
user3;30;New York
user4;60;San Francisco
user5;53;Hong Kong
user6;34;Dublin
user7;46;Barcelona
user8;32;Rome
Execute the following code:
import pandas as pd
df = pd.read_csv('test.csv')
print(df)
Here is the output of our Python program…
username;age;city
0 user1;23;London
1 user2;45;Paris
2 user3;30;New York
3 user4;60;San Francisco
4 user5;53;Hong Kong
5 user6;34;Dublin
6 user7;46;Barcelona
7 user8;32;Rome
It’s a bit messy! Why?!?
The Pandas read_csv() function is not able to split the fields in the CSV file because it uses the comma as the default separator.
To load data from the CSV file and parse it correctly, we have to pass the additional sep argument to the read_csv() function and set its value to the separator used in the CSV file.
df = pd.read_csv('test.csv', sep=';')
print(df)
Execute this code and confirm that the read_csv() Pandas function is able to split the fields in the CSV file correctly.
You should see the following output:
username age city
0 user1 23 London
1 user2 45 Paris
2 user3 30 New York
3 user4 60 San Francisco
4 user5 53 Hong Kong
5 user6 34 Dublin
6 user7 46 Barcelona
7 user8 32 Rome
The default separator for the Pandas read_csv function is the comma. To use a different separator pass the additional argument “sep” to the read_csv function and set the value of sep to the new separator.
How to Read the First Rows of a CSV File Using Pandas
To get the first rows of a CSV file using the Python Pandas library you can read the CSV file into a DataFrame and then use the DataFrame head() function.
import pandas as pd
df = pd.read_csv('test.csv')
print(df.head())
[output]
username age city
0 user1 23 London
1 user2 45 Paris
2 user3 30 New York
3 user4 60 San Francisco
4 user5 53 Hong Kong
By default, the head() function returns the first 5 rows of a dataframe.
If you want to get back a different number of rows you can pass the number of rows as an argument to the head() function.
print(df.head(3))
[output]
username age city
0 user1 23 London
1 user2 45 Paris
2 user3 30 New York
You can see that the code above has returned the first 3 rows because we have passed the number 3 as argument to the head() function.
How to Read Only Specific Columns From a CSV File Using Pandas
You might only need some of the columns from the CSV files. In that case, importing only those columns into the DataFrame would help you save memory.
To import only specific columns from a CSV file you can pass the additional argument usecols to the read_csv function. This argument will contain the list of columns to read.
Let’s assume we only want to read the username and age columns from our existing CSV file. We can use the following Python code:
import pandas as pd
columns = ['username', 'age']
df = pd.read_csv('test.csv', usecols=columns)
print(df)
You can see the column city has not been imported into the DataFrame.
username age
0 user1 23
1 user2 45
2 user3 30
3 user4 60
4 user5 53
5 user6 34
6 user7 46
7 user8 32
We have seen how selecting columns works when reading data from a CSV file.
How to Get Statistics For a Column in a CSV File Using Pandas
It can be very useful to calculate statistics like minimum, maximum, mean, and standard deviation for a numeric column in a CSV file. We can use the DataFrame describe function for that.
For example, let’s calculate statistics for the data in the “age” column of our CSV file.
import pandas as pd
df = pd.read_csv('test.csv')
print(df['age'].describe())
The output contains:
- number of rows
- mean
- standard deviation
- minimum of the values
- lower percentile (25%)
- median (50% percentile)
- upper percentile (75%)
- maximum of the values
count 8.000000
mean 40.375000
std 12.637105
min 23.000000
25% 31.500000
50% 39.500000
75% 47.750000
max 60.000000
Name: age, dtype: float64
Quite handy to have these values calculated with a single function call.
What is Chunksize When Reading a CSV File with Pandas?
If you want to read a big CSV file with Pandas and you have issues with the memory available on your computer, you can read the CSV file in chunks.
To read a CSV file in multiple chunks using Pandas, you can pass the chunksize argument to the read_csv function and loop through the data returned by the function.
Below you can see the code to read our test CSV file using a chunksize of 4.
import pandas as pd
for chunk in pd.read_csv('test.csv', chunksize=4):
print("######## Chunk ########")
print(chunk)
print(type(chunk))
Considering that in our CSV file there are 8 records and the chunksize is 4, we read the CSV file in two chunks.
You can also see from the output that each chunk is a Pandas dataframe.
######## Chunk ########
username age city
0 user1 23 London
1 user2 45 Paris
2 user3 30 New York
3 user4 60 San Francisco
<class 'pandas.core.frame.DataFrame'>
######## Chunk ########
username age city
4 user5 53 Hong Kong
5 user6 34 Dublin
6 user7 46 Barcelona
7 user8 32 Rome
<class 'pandas.core.frame.DataFrame'>
Conclusion
Now you know the basics of how to work with data from a CSV file using the Pandas Python library.
You can read columns, filter rows based on logical statements, and read CSV files that use a separator different than the comma.
What’s the next step?
It’s important to fit the concepts you have learned in this tutorial into a wider context that gives you a clear understanding of how to use them.
To do that, go through Introduction to Data Science in Python, a great way to:
- learn how to use Pandas in your Python programs.
- practice using Pandas in a real Python environment.
- receive continuous feedback to understand if you are improving as a developer.
- introduce you to the Data Science learning path.
Claudio Sabato is an IT expert with over 15 years of professional experience in Python programming, Linux Systems Administration, Bash programming, and IT Systems Design. He is a professional certified by the Linux Professional Institute.
With a Master’s degree in Computer Science, he has a strong foundation in Software Engineering and a passion for robotics with Raspberry Pi.