search
Search
Join our weekly DS/ML newsletter layers DS/ML Guides
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
brightness_medium
share
arrow_backShare
Twitter
Facebook

Pandas | read_csv method

Pandas
chevron_right
Documentation
chevron_right
General Functions
schedule Jul 1, 2022
Last updated
local_offer PandasPython
Tags

Pandas' read_csv(~) method reads a file, and parses its content into a DataFrame.

This mammoth has more than 40 parameters, but only one is required.

Parameters

1. filepath_or_bufferlink | string or path object or file-like object

The path to the file you wish to read.

2. sep | string | optional

The delimiter that separates the data. If set to None, and you're using the Python parsing engine (see engine parameter below), then the method will attempt to infer the delimiter from the file. By default, sep=",".

3. headerlink | int or list<int> | optional

The line numbers that represent the header. If a list is passed, then the resulting DataFrame will be multi-index. By default, head=0, which means that the first line will be treated as the column labels. Note that header does not take into account comments and blank lines (when skip_blank_lines=True), so head=0 may not necessarily be the first line of the file.

4. nameslink | array-like<string> | optional

The column labels to use. Duplicate labels will cause an error.

NOTE

Suppose your file contains a header row (i.e. column labels), but you wish to use some other column labels instead. You can do this by explicitly setting header=0, and then specifying the names parameter.

5. index_collink | int or string or sequence<int> | optional

The integer index or the label of the columns in the file to use as the row labels of the resulting DataFrame. By default, index_col=None.

6. usecolslink | array-like<int> or array-like<string> or function | optional

The integer index or the label of the columns to include in the resulting DataFrame. If a function is provided, then the function takes in as argument the label of a column, and returns a boolean indicating whether or not to include the column.

By default, all columns are included.

7. squeeze | boolean | optional

Whether or not to return a Series if the resulting DataFrame only contains a single column. By default, squeeze=False.

8. prefix | string | optional

The prefix to append to the column names if header is not provided. For instance, prefix="A" would mean that the column labels may be "A0", "A1" and so on. By default, no prefix will be added.

9. mangle_dupe_colslink | boolean | optional

Whether or not to append a numerical suffix to columns with duplicate labels. By default, mangle_dupe_cols=True, which means that instead of overwrites when column labels clash, a suffix will be appended to the column labels. For instance, if there are 2 column labels "A" and "A", then the second one would be called "A.1".

WARNING

Apparently, mangle_dupe_cols=False is not yet supported, though this isn't officially documented.

10. dtypelink | string or type or dict<string, string||type> | optional

The data-type to use for the columns. If a dict is provided, then the key would be the column label and the value would be its desired type.

11. engine | string | optional

Whether to use the C or Python parsing engine. The allowed values are "c" or "python".

The C parsing engine is faster, but has less features compared to the Python counterpart. For instance, the Python parsing engine will be able to infer the delimiter from the file automatically while the C counterpart cannot.

12. converterslink | dict<string, function> | optional

The functions that will be applied to columns. The key is the label of the column on which to apply the function on. By default, no function is applied.

13. true_valueslink | list | optional

A list of values that will be evaluated as True in the resulting DataFrame.

14. false_valueslink | list | optional

A list of values that will be evaluated as False in the resulting DataFrame.

15. skipinitialspace | boolean | optional

The official document claims that this parameter specifies whether or not to skip empty spaces after the delimiter. However, we've extensively tested this parameter, and found out that all leading and trailing spaces of values including column and row labels are stripped out regardless of what you set here. If you have some insight, please reach out to have a chat with us! By default, skipinitialspace=False.

16. skiprowslink | list-like or int or function | optional

The line numbers (in integer indices) to skip over. For instance, skiprows=1 will skip the second row in the file. If a function is provided, then the function takes in as argument the row label, and returns a boolean indicating whether or not to skip the row.

17. skipfooter | int | optional

The number of lines from the bottom to ignore. This is only implemented for when engine="python". By default, skipfooter=0.

18. nrows | int | optional

The number of rows to include. This does not include the row for the column labels. By default, all rows are included.

19. na_value | scalar, string, list-like or dict | optional

The additional values that will be interpreted as NaN. If a dict is provided, then the key is the label of the columns to consider.

By default, all the following values are interpreted as na:

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
'-NaN', '-nan', '1.#IND', '1.#QNAN', '', 'N/A',
'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'

20. keep_default_na | boolean | optional

Whether or not to keep the default na values. The behaviour depends on whether you've specified the na_value parameter:

na_value is specified

na_value not specified

keep_deafault_na=True

The default na values will be used.

Your specified na_values will be used in addition to the default na values.

keep_default_na=False

No data will be treated as na values.

Your specified na_values will replace the default na values.

By default, keep_default_na=True.

21. na_filterlink | boolean | optional

Whether or not to check for na values. If set to False, then na_value and keep_default_na will be ignored, and so values like "NaN" will be treated as a string - "NaN" and not as a NaN. If you are sure that your file does not contain any na values, then set to False as this will boost performance. By default, na_filter=True.

22. verbose | boolean | optional

Whether or not to print on screen the following metrics on screen:

Tokenization took: ~ ms
Type conversion took: ~ ms
Parser memory cleanup took: ~ ms

By default, verbose=False.

23. skip_blank_lines | boolean | optional

Whether or not to skip over blank lines (e.g. lines with only tabs and whitespaces) instead of treating them as na values. By default, skip_blank_lines=True.

24. parse_dateslink | boolean | optional

The rules to parse dates. The allowed values are as follows:

Value

Description

True

Parse the index as a date. All other columns will be interpreted as a non-date type.

False

Do not parse anything as dates.

list<int or string>

The integer index or label of the columns to parse as a date.

For instance, passing ["A","C"] will separately parse columns "A" and "C" as dates.

list<list<int or string>>

The columns to combine to parse as a single date.

For instance, passing [["A","C"]] means that the combination of columns A and C be parsed as a single date.

dict

The column label (key) and the columns to combine to be parsed as a date. For instance, passing {"A":[1,5]} will parse the combined columns at index 1 and 5 as dates to form column "A" in the resulting DataFrame.

By default, parse_dates=False.

If a value cannot be properly parsed as a date, then that value's column will have type object instead.

25. infer_datetime_format | boolean | optional

If set to True and parse_dates!=False, then the format of the dates will be inferred. The inferred format is heavily optimised, and may potentially lead to 5-10x speedups. By default, infer_datetime_format=False.

26. keep_date_col | boolean | optional

Whether to keep the original dates in the resulting DataFrame. This parameter is only relevant if the specified parse_dates involves combining multiple columns. By default, keep_date_col=False.

27. date_parserlink | function | optional

The function used to convert strings to dates. This function can be called up to 3 times with the following distinct arguments:

(i). one or more arrays of date strings that correspond to the columns you combined as per parse_dates.

(ii). a single array containing the concatenated date strings, depending on your choice of parse_dates.

(iii). one or more date strings that correspond to the columns you combined as per parse_dates. This is called for each row.

Whenever an exception occurs, the subsequent function will be called. For instance, if (i) fails, then (ii) will be called next. By default, dateutil.parser.parser is used.

28. dayfirst | bool | optional

Whether or not to place the day first (e.g. DD/MM). By default, dayfirst=False.

29. cache_dates | boolean | optional

Whether to keep a cache of date conversions performed. When you try to convert duplicates dates, then instead of performing the same conversion multiple times, you can use the cache to save some computation. By default, cache_dates=True.

30. iteratorlink | boolean | optional

Whether or not to return an Iterator that can be used to read the data in chunks. By default, iterator=False.

31. chunksizelink | int | optional

The number of rows to read at a time. If specified, then a TextFileReader object, which is used to iterate over the rows, will be returned. By default, all rows are read as one giant chunk.

32. compression | string | optional

The compressed format of the file. The allowed values are:

"gzip", "bz2", "zip" and "xz"

By default, compression="infer".

33. thousands | string | optional

The character to denote every thousandth digit. For instance, passing "," would make values like 10000 be 10,000. By default, thousands=None.

34. decimal | string | optional

The character denoting a decimal point. By default, decimal=".". This parameter exists since some European countries like France use a , to denote a decimal point instead.

35. lineterminator | char | optional

The character used to indicate a line break. This is only supported for when engine="c".

36. quotechar | string of length one | optional

When a value happens to contain a delimiter, then the value will erroneously be divided up. You can wrap your value using quotechar to prevent such unwanted splits from happening. For instance, if sep=",", and we set quotechar="@", the value @alex,bob@ will be treated as alex,bob as a whole, rather than two separate values of alex and bob. Defaults to ".

37. quoting | int or csv.QUOTE_* | optional

The quoting rule to use - allowed values are follows:

  • QUOTE_MINIMAL,

  • QUOTE_ALL,

  • QUOTE_NONNUMERIC,

  • QUOTE_NONE

By default, quoting=csv.QUOTE_MINIMAL.

38. doublequote | boolean | optional

Whether or not to parse quotechar. By default, doublequote=True.

39. escapecharlink | string | optional

The character to escape the double quotation marks. By default, escapechar=None.

40. commentlink | string | optional

If your input file contains comments, then you can specify what identifies a comment. For instance, setting to "#" will mean that the characters after the # in the same line will be treated as a comment. The passed value must be a single character.

By default, comments=None.

41. encoding | string | optional

The encoding to use when reading the file (e.g. "latin-1", "iso-8859-1"). By default, encoding="utf-8".

42. error_bad_lines | boolean | optional

Whether or not to throw an error when a row cannot be parsed properly. For instance, if your row contains 6 values when you only have 4 columns, an error will be thrown. If set to True, then an error will be raised and no DataFrame will be returned. If set to False, then rows that cannot be parsed will be ignored. By default, error_bad_lines=True.

43. warn_bad_lines | boolean | optional

Whether or not to output warning when a row cannot be parsed properly. By default, warn_bad_lines=True.

44. delim_whitespace | boolean | optional

Whether or not to set whitespaces as the delimiter. If this is set, then the sep parameter should not be specified. By default, delim_whitespace=False.

45. low_memory | boolean | optional

Whether or not to internally process the file in chunks. If set to True, then the resulting type of the columns may become less accurate. To ensure that this does not happen, explicitly specify the data type using the dtype parameter. By default, low_memory=False.

NOTE

The resulting DataFrame would be the same regardless of the value set for low_memory. The file is only parsed in chunks internally so as to save memory during the reading process. As stated, the data type may end up different in some minor cases.

46. memory_map | boolean | optional

If True, then the file to be read will first be copied to memory and read from there. This can result in significant performance boost since there will be no I/O overhead. By default, memory_map=False.

47. float_precision | string | optional

The converter to use for floats. The available options are as follows:

  • None for the standard converter

  • "high" for the high-precision converter

  • "round_trip" for the round-trip converter.

This is only relevant for when engine="C". By default, float_precision=None.

Return Value

If chunksize is set, then a TextFileReader object is returned. Otherwise, a DataFrame with the imported data is returned.

Examples

Basic usage

Suppose we have the following text file called sample.txt, which contains some comma-separated data:

A,B,C
3,4,5
6,7,8

To read this file as a DataFrame:

df = pd.read_csv("sample.txt")
df
A B C
0 3 4 5
1 6 7 8

Make sure that the text file is in the same directory as the Python script.

Specifying header

Suppose our sample.txt file is as follows:


# I am a comment
A,B,C
3,4,5
6,7,8

Here, the first line of the file is empty, and the second line is a comment.

By default, header=0, which means that the first valid line (non-comment and non-empty line) will be regarded as the column labels:

df = pd.read_csv("sample.txt", comment="#") # header=0
df
A B C
0 3 4 5
1 6 7 8

A common mistake is to interpret header=0 as meaning the first line of the file. This is not necessarily true, as demonstrated in the example, since header ignores lines that are:

  • blank when skip_blank_lines=True (default)

  • comments (as indicated by the comment parameter).

Specifying the column labels

Suppose our sample.txt file is as follows:

3,4,5
6,7,8

Notice how the file does not contain any header.

To provide the column labels, set the names parameter like so:

df = pd.read_csv("sample.txt", names=["A","B","C"])
df
A B C
0 3 4 5
1 6 7 8

Overriding the header

Suppose our sample.txt file is as follows:

A,B,C
3,4,5
6,7,8

Notice how the file contains a header line.

To override the header line, set header=0 and names:

df = pd.read_csv("sample.txt", header=0, names=["D","E","F"])
df
D E F
0 3 4 5
1 6 7 8

Specifying index_col

Suppose our sample.txt is as follows:

A,B,C
a,3,4,5
b,6,7,8

Since there are 4 values (instead of 3) from the second row onwards, read_csv(~) will automatically treat the first column as the index:

df = pd.read_csv("sample.txt")
df
A B C
a 3 4 5
b 6 7 8

Now, consider a file that starts with a , like so:

,A,B,C
a,3,4,5
b,6,7,8

Since each line contains the same number of ,, the method will not treat the first column as part of the values. To ensure that the first column is treated as the index, pass index_col=0 like so:

df = pd.read_csv("sample.txt", index_col=0)
df
A B C
a 3 4 5
b 6 7 8

Specifying usecols

By default, all columns are read and parsed as a DataFrame. You can select which columns to include by specifying the usecols parameter.

Consider the following sample.txt file:

A,B,C
3,4,5
6,7,8

To read columns A and C only:

df = pd.read_csv("sample.txt", usecols=["A","C"])
df
A C
0 3 5
1 6 8

You could also pass in a function for usecols, which takes in as argument the label of a column, and returns a boolean indicating whether or not to include the column.

For instance, to only include column with label "B":

df = pd.read_csv("sample.txt", usecols=lambda x: x == "B")
df
B
0 4
1 7

For those who are not familiar with lambdas in Python, this is equivalent to the following:

def foo(col_label): # the name of the function is irrelevant.
return col_label == "B"

df = pd.read_csv("sample.txt", usecols=foo)
df
B
0 4
1 7

Specifying mangle_dupe_cols

When column labels clash, the latter column will not override the former column - instead the default behaviour of mangle_dupe_cols=True is to append a suffix to the latter column.

Consider the following sample.txt file:

A,B,A
3,4,5
6,7,8

Reading this file with the default parameter produces:

df = pd.read_csv("sample.txt")
df
A B A.1
0 3 4 5
1 6 7 8

Although not documented officially, mangle_dupe_cols=False, which is supposed to allow for duplicate labels, is not yet supported.

Specifying dtype

Consider the following sample.txt file:

A,B
3,4
5,6

By default, the column types will be inferred from the data:

df = pd.read_csv("sample.txt")
df.dtypes
A int64
B int64
dtype: object

To set the type of all columns:

df = pd.read_csv("sample.txt", dtype=float)
df.dtypes
A float64
B float64
dtype: object

To set the type separately for the columns:

df = pd.read_csv("sample.txt", dtype={"A":float, "B":str})
df.dtypes
A float64
B object
dtype: object

Specifying converters

The converters parameter is extremely useful as it allows us to apply a mapping on the column values.

Consider the following sample.txt file:

A,B,C
3,4,5
6,7,8

We can transform the values for specific columns like so:

df = pd.read_csv("sample.txt", converters={"A": lambda x : int(x)+10, "C": lambda x : int(x) > 6})
df
A B C
0 13 4 False
1 16 7 True

The argument for the functions is a column value of type string, which means that we must first convert it to numerical type to perform arithmetics.

Specifying true_values and false_values

Consider the following sample.txt file:

A,B,C
a,b,c
d,e,f

To map value "a" to True and "d" to False:

df = pd.read_csv("sample.txt", true_values=["a"], false_values=["d"])
df
A B C
0 True b c
1 False e f

Note that the mapping only works if you map the entire column to booleans like we did above. For instance, the following won't work as intended:

df = pd.read_csv("sample.txt", true_values=["a"], false_values=["f"])
df
A B C
0 a b c
1 d e f

Specifying skiprows

The skiprows parameter is the line numbers (in integer indices) to skip over. It takes in a single integer, a sequence of integers or a function. Since the former two are self-explanatory, we will demonstrate how to pass a function. The function takes in as argument the row label, and outputs a boolean indicating whether or not to skip the row.

Consider the following sample.txt file:

A,B,C
3,4,5
6,7,8

To skip the second row (row with integer index 1):

df = pd.read_csv("sample.txt", skiprows= lambda x : x == 1)
df
A B C
0 6 7 8

Specifying na_filter

Consider the following sample.txt file:

A,B,C
3,4,nan
6,NaN,8

Here, the question is whether or not to treat the nan and NaN as strings or actual missing values (i.e. nan). By default, na_filter=True, which means that they will be treated as nan:

df = pd.read_csv("sample.txt") # na_filter=True
df
A B C
0 3 4.0 NaN
1 6 NaN 8.0

It should be clear that we now have actual nan values, but we can confirm this using the df.isnull() method which checks for nan values:

df.isnull()
A B C
0 False False True
1 False True False

To treat the nan and NaN as strings instead, set na_filter=False like so:

df = pd.read_csv("sample.txt", na_filter=False)
df
A B C
0 3 4 nan
1 6 NaN 8

We then call df.isnull() again to check for actual missing values (nan):

df.isnull()
A B C
0 False False False
1 False False False

We see that the values are all False, meaning they are interpreted as a string instead.

Specifying dates

Parsing index as date

Considering the following sample.txt file:

name,gender
1995,alice,female
2005/12,bob,male

Here, the index (row labels) is to be parsed as date. To do so, set parse_date=True:

df = pd.read_csv("sample.txt", parse_dates=True)
df
name gender
1995-01-01 alice female
2005-12-01 bob male

The default date format is YYYY-MM-DD, and unspecified days and months will be filled with 01 as shown in the output.

Parsing non-index columns as date

Consider the following sample.txt file:

name,admitted,graduated
alice,2014,2018
bob,2016,2020

Setting parse_date=True, as we did before, will mean that only the index will be parsed as date. This is undesirable for this file since the dates to parse are non-index columns.

To parse columns admitted and graduated as dates:

df = pd.read_csv("sample.txt", parse_dates=["admitted","graduated"])
df
name admitted graduated
0 alice 2014-01-01 2018-01-01
1 bob 2016-01-01 2020-01-01

To confirm the data type of our columns:

df.dtypes
name object
admitted datetime64[ns]
graduated datetime64[ns]
dtype: object

Combining multiple date columns

Consider the following sample.txt file:

Year,Month
2020,12
1994,02

Here, we have 2 separate columns that we want to parse as a single date column. We can do this by passing in a nested list, like so:

df = pd.read_csv("sample.txt", parse_dates=[["Year","Month"]])
df
Year_Month
0 2020-12-01
1 1994-02-01

We can also combine dates by passing in a dict like so:

df = pd.read_csv("sample.txt", parse_dates={"A":["Year","Month"]})
df
A
0 2020-12-01
1 1994-02-01

The advantage of using dict instead of a nested list is that you can supply a label for the combined column (A in this case).

Specifying date_parser

Consider the following sample.txt file:

Year,Month
2020,12
1994,02

The date_parser parameter allows you specify how date strings should be parsed.

For instance, to apply an offset of one year:

import dateutil
from dateutil.relativedelta import relativedelta

def my_date_parser(*arg):
date_time = datetime.strptime(arg[0], "%Y %m")
return date_time + relativedelta(years=1)

df = pd.read_csv("sample.txt", parse_dates={"A":["Year","Month"]}, date_parser=my_date_parser)
df
A
0 2021-12-01
1 1995-02-01

Since the date_parser function can be called a maximum of 3 times with varying arguments, we use the syntax for varargs, *args.

Just to break down how the date_parser works, suppose we printed out what arg is:

def my_date_parser(*arg):
print(arg)
date_time = datetime.strptime(arg[0], "%Y %m")
return date_time + relativedelta(years=1)
(array(['2020', '1994'], dtype=object), array(['12', '02'], dtype=object))
('2020 12',)
('1994 02',)
  • The first time my_date_parser is called, 2 Series, each containing the date strings to combine, are returned. Since the datetime.striptime function takes in a string and not a Series, this function actually throws an error.

  • However, instead of terminating the program, the my_date_parser is then called again - this time, the argument passed is a single date string ('2020 12'). This can be parsed properly by striptime so this time the parsing is successful.

Specifying iterator

Consider the following sample.txt file:

A,B
1,2
3,4
5,6
7,8
9,10

By setting iterator=True, the method returns a TextFileReader. This is useful when you want to read the file in chunks like so:

reader = pd.read_csv("sample.txt", iterator=True)
print(reader.get_chunk(2))
print("-----")
print(reader.get_chunk(3))
A B
0 1 2
1 3 4
-----
A B
2 5 6
3 7 8
4 9 10

Here get_chunk(n) returns a DataFrame with n rows. Unless you need to read chunks of different sizes, opt to use chunksize instead.

Specifying chunksize

Consider the following sample.txt file:

A,B
1,2
3,4
5,6
7,8
9,10

To read this file in chunks, set chunksize like so:

for chunk in pd.read_csv("sample.txt", chunksize=2):
print(chunk)
print("-----")
A B
0 1 2
1 3 4
-----
A B
2 5 6
3 7 8
-----
A B
4 9 10
-----

Here, every chunk is of type DataFrame.

NOTE

When dealing with large data that cannot fit in memory, consider using iterator or chunksize.

Specifying thousands

Consider the following sample.txt file:

A B
3,000 6,000,000

Here, our data uses , to indicate every thousand.

By default, these values will be treated as a string:

df = pd.read_csv("sample.txt", sep=" ")
df
A B
0 3,000 6,000,000

To treat the values as a numeric, set thousands like so:

df = pd.read_csv("sample.txt", sep=" ", thousands=",")
df
A B
0 3000 6000000

Specifying quotechar

The quotechar comes in handy when values confusingly contain the delimiter.

Consider the following sample.txt file:

A,B
3,4,5
6,7

Here, notice how the second row contains 3 values, while the others contain only 2 values. Calling read_csv(~) will cause an error in such cases.

If you want to treat 3,4 as a single value as opposed to two separate values, you can wrap them in a quote. By default quotechar=", which means that we can negate delimiters like so:

A,B
"3,4",5
6,7

Now, calling read_csv(~) yields:

df = pd.read_csv("sample.txt")
df
A B
0 3,4 5
1 6 7

Notice how 3,4 is now interpreted as a single entry.

Specifying escapechar

Consider the following sample.txt:

A,B
"3\"4",5
6,7

Here, the inner " is escaped using the escape character "\". This indicates that we want 3"4 to be read as the value for that entry, but calling read_csv(~) erroneously yields the following:

df = pd.read_csv("sample.txt")
df
A B
0 3\4" 5
1 6 7

This happens because Pandas does not recognise "\" as an escape character, and so misinterprets the inner " as a quotechar.

To get around this problem, set escapechar to escape the inner ":

df = pd.read_csv("sample.txt", escapechar="\\")
df
A B
0 3"4 5
1 6 7

Here, we need "\\" because the character \ in Python is an escape character and so just having "\" ends up escaping the closing ", causing a syntax error.

Specifying comment

The comment parameter is used to identify comments - everything after the passed character in the same line will be treated as a comment and thus ignored.

For instance, consider the following sample.txt file:

A,B,C
3,4,5 # I am a comment
6,7,8 # I am a comment

To ignore these comments:

df = pd.read_csv("sample.txt", comment="#")
df
A B C
0 3 4 5
1 6 7 8
mail
Join our newsletter for updates on new DS/ML comprehensive guides (spam-free)
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
Ask a question or leave a feedback...