Pandas | read_csv method
Pandas' read_csv(~)
method reads a file, and parses its content into a DataFrame.
This mammoth has more than 40 parameters, but only one is required.
Parameters
1. filepath_or_buffer
| string
or path object
or file-like object
The path to the file you wish to read.
2. sep
| string
| optional
The delimiter that separates the data. If set to None
, and you're using the Python parsing engine (see engine
parameter below), then the method will attempt to infer the delimiter from the file. By default, sep=","
.
3. header
| int
or list<int>
| optional
The line numbers that represent the header. If a list is passed, then the resulting DataFrame will be multi-index. By default, header=0
, which means that the first line will be treated as the column labels. Note that header
does not take into account comments and blank lines (when skip_blank_lines=True
), so header=0
may not necessarily be the first line of the file.
4. names
| array-like<string>
| optional
The column labels to use. Duplicate labels will cause an error.
Suppose your file contains a header row (i.e. column labels), but you wish to use some other column labels instead. You can do this by explicitly setting header=0
, and then specifying the names
parameter.
5. index_col
| int
or string
or sequence<int>
| optional
The integer index or the label of the columns in the file to use as the row labels of the resulting DataFrame. By default, index_col=None
.
6. usecols
| array-like<int>
or array-like<string>
or function
| optional
The integer index or the label of the columns to include in the resulting DataFrame. If a function is provided, then the function takes in as argument the label of a column, and returns a boolean indicating whether or not to include the column.
By default, all columns are included.
7. squeeze
| boolean
| optional
Whether or not to return a Series
if the resulting DataFrame only contains a single column. By default, squeeze=False
.
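As a minimal sketch (the file name single_col.txt is an assumption, standing in for any file with just one column), passing squeeze=True returns a Series rather than a single-column DataFrame:

ser = pd.read_csv("single_col.txt", squeeze=True)   # single_col.txt is a hypothetical one-column file
type(ser)   # pandas.core.series.Series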
8. prefix
| string
| optional
The prefix to add to the auto-generated column numbers when header
is not provided. For instance, prefix="A"
would mean that the column labels may be "A0"
, "A1"
and so on. By default, no prefix will be added.
9. mangle_dupe_cols
| boolean
| optional
Whether or not to append a numerical suffix to columns with duplicate labels. By default, mangle_dupe_cols=True
, which means that when column labels clash, later columns do not overwrite earlier ones - instead, a suffix is appended to the duplicate labels. For instance, if there are 2 column labels "A"
and "A"
, then the second one would be called "A.1"
.
Apparently, mangle_dupe_cols=False
is not yet supported, though this isn't officially documented.
10. dtype
| string
or type
or dict<string, string||type>
| optional
The data-type to use for the columns. If a dict
is provided, then the key
would be the column label and the value
would be its desired type.
11. engine
| string
| optional
Whether to use the C or Python parsing engine. The allowed values are "c"
or "python"
.
The C parsing engine is faster, but has fewer features than the Python counterpart. For instance, the Python parsing engine is able to infer the delimiter from the file automatically while the C engine cannot.
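For example, a minimal sketch of letting Pandas infer the delimiter (the file name data.txt is an assumption) combines sep=None with the Python engine:

df = pd.read_csv("data.txt", sep=None, engine="python")   # data.txt is hypothetical; delimiter is inferred from the file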
12. converters
| dict<string, function>
| optional
The functions that will be applied to columns. The key is the label of the column to which the function is applied. By default, no function is applied.
13. true_values
| list
| optional
A list of values that will be evaluated as True
in the resulting DataFrame.
14. false_values
| list
| optional
A list of values that will be evaluated as False
in the resulting DataFrame.
15. skipinitialspace
| boolean
| optional
The official documentation states that this parameter specifies whether or not to skip spaces that occur after the delimiter. However, we've tested this parameter extensively, and found that all leading and trailing spaces of values, including column and row labels, are stripped out regardless of what you set here. If you have some insight, please reach out to have a chat with us! By default, skipinitialspace=False
.
16. skiprows
| list-like
or int
or function
| optional
The line numbers (in integer indices) to skip over. For instance, skiprows=[1]
will skip the second line of the file, while passing a single integer skips that many lines from the start of the file. If a function is provided, then the function takes in as argument the row label, and returns a boolean indicating whether or not to skip the row.
17. skipfooter
| int
| optional
The number of lines from the bottom to ignore. This is only implemented for when engine="python"
. By default, skipfooter=0
.
18. nrows
| int
| optional
The number of rows to include. This does not include the row for the column labels. By default, all rows are included.
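As a rough sketch (the file name big.txt is an assumption), reading only the first 100 data rows of a large file would look like this:

df = pd.read_csv("big.txt", nrows=100)   # big.txt is hypothetical; reads the header plus the first 100 data rows
len(df)   # 100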
19. na_values
| scalar
, string
, list-like
or dict
| optional
The additional values that will be interpreted as NaN
. If a dict
is provided, then the key is the label of the columns to consider.
By default, all the following values are interpreted as na
:
'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'
20. keep_default_na
| boolean
| optional
Whether or not to keep the default na
values. The behaviour depends on whether you've specified the na_values
parameter:
keep_default_na | na_values specified | na_values not specified |
---|---|---|
True | Your specified na_values and the default na values are used. | The default na values are used. |
False | Only your specified na_values are used. | No data will be treated as na. |
By default, keep_default_na=True
.
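As an illustrative sketch (the file name and the "?" placeholder are assumptions), combining na_values with keep_default_na=False means that only "?" is treated as a missing value, while strings like "NA" are read as-is:

df = pd.read_csv("sample.txt", na_values=["?"], keep_default_na=False)   # "?" is a hypothetical missing-value marker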
21. na_filter
| boolean
| optional
Whether or not to check for na
values. If set to False
, then na_values
and keep_default_na
will be ignored, and so values like "NaN"
will be treated as a string - "NaN"
and not as a NaN
. If you are sure that your file does not contain any na
values, then set to False
as this will boost performance. By default, na_filter=True
.
22. verbose
| boolean
| optional
Whether or not to print the following parsing metrics to the screen:
Tokenization took: ~ ms
Type conversion took: ~ ms
Parser memory cleanup took: ~ ms
By default, verbose=False
.
23. skip_blank_lines
| boolean
| optional
Whether or not to skip over blank lines (e.g. lines with only tabs and whitespaces) instead of treating them as na
values. By default, skip_blank_lines=True
.
24. parse_dates
| boolean or list or dict
| optional
The rules to parse dates. The allowed values are as follows:
Value | Description |
---|---|
True | Parse the index as a date. All other columns will be interpreted as a non-date type. |
False | Do not parse anything as dates. |
list of integers or strings | The integer index or label of the columns to parse as a date. For instance, passing ["admitted","graduated"] parses each of those columns as its own date column. |
list of lists | The columns to combine to parse as a single date. For instance, passing [["Year","Month"]] combines the Year and Month columns into a single date column. |
dict | The column label (key) and the columns to combine to be parsed as a date. For instance, passing {"A":["Year","Month"]} combines the Year and Month columns into a single date column labelled A. |
By default, parse_dates=False
.
If a value cannot be properly parsed as a date, then that value's column will have type object
instead.
25. infer_datetime_format
| boolean
| optional
If set to True
and parse_dates!=False
, then the format of the dates will be inferred. The inferred format is heavily optimised, and may potentially lead to 5-10x speedups. By default, infer_datetime_format=False
.
26. keep_date_col
| boolean
| optional
Whether to keep the original dates in the resulting DataFrame. This parameter is only relevant if the specified parse_dates
involves combining multiple columns. By default, keep_date_col=False
.
27. date_parser
link | function
| optional
The function used to convert strings to dates. This function can be called up to 3 times with the following distinct arguments:
(i). one or more arrays of date strings that correspond to the columns you combined as per parse_dates
.
(ii). a single array containing the concatenated date strings, depending on your choice of parse_dates
.
(iii). one or more date strings that correspond to the columns you combined as per parse_dates
. This is called for each row.
Whenever an exception occurs, the next form is tried. For instance, if (i) fails, then (ii) will be called next. By default, dateutil.parser.parser
is used.
28. dayfirst
| bool
| optional
Whether or not to place the day first (e.g. DD/MM). By default, dayfirst=False
.
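For instance, a date string like "02/03/2020" is parsed as February 3 by default, but as March 2 when dayfirst=True. A minimal sketch (the file name dates.txt and its "date" column are assumptions):

df = pd.read_csv("dates.txt", parse_dates=["date"], dayfirst=True)   # hypothetical file; "02/03/2020" -> 2020-03-02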
29. cache_dates
| boolean
| optional
Whether to keep a cache of the date conversions performed. When converting duplicate dates, the cached result is reused instead of performing the same conversion multiple times, saving some computation. By default, cache_dates=True
.
30. iterator
| boolean
| optional
Whether or not to return an Iterator that can be used to read the data in chunks. By default, iterator=False
.
31. chunksize
| int
| optional
The number of rows to read at a time. If specified, then a TextFileReader
object, which is used to iterate over the rows, will be returned. By default, all rows are read as one giant chunk.
32. compression
| string
| optional
The compressed format of the file. The allowed values are:
"gzip", "bz2", "zip" and "xz"
By default, compression="infer"
.
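Since the default is "infer", a file ending in .gz is decompressed automatically, but you can also be explicit. A minimal sketch (the file name is an assumption):

df = pd.read_csv("sample.csv.gz", compression="gzip")   # hypothetical file; equivalent to compression="infer" for a .gz file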
33. thousands
| string
| optional
The character used as the thousands separator. For instance, passing ","
would make a value like 10,000
in the file be parsed as the number 10000
. By default, thousands=None
.
34. decimal
| string
| optional
The character denoting a decimal point. By default, decimal="."
. This parameter exists since some European countries like France use a ,
to denote a decimal point instead.
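As a short sketch (assuming a semicolon-delimited european.txt file where 3,14 means 3.14), you would typically combine decimal with sep:

df = pd.read_csv("european.txt", sep=";", decimal=",")   # hypothetical file; "3,14" is parsed as the float 3.14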
35. lineterminator
| char
| optional
The character used to indicate a line break. This is only supported for when engine="c"
.
36. quotechar
| string
of length one | optional
When a value happens to contain a delimiter, then the value will erroneously be divided up. You can wrap your value using quotechar
to prevent such unwanted splits from happening. For instance, if sep=","
, and we set quotechar="@"
, the value @alex,bob@
will be treated as alex,bob
as a whole, rather than two separate values of alex
and bob
. Defaults to "
.
37. quoting
| int
or csv.QUOTE_*
| optional
The quoting rule to use - the allowed values are as follows:
csv.QUOTE_MINIMAL
csv.QUOTE_ALL
csv.QUOTE_NONNUMERIC
csv.QUOTE_NONE
By default, quoting=csv.QUOTE_MINIMAL
.
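For instance, to disable quote handling entirely so that quote characters are read as ordinary characters (a minimal sketch; the file name is an assumption):

import csv

df = pd.read_csv("sample.txt", quoting=csv.QUOTE_NONE)   # '"' is treated like any other character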
38. doublequote
| boolean
| optional
Whether or not to interpret two consecutive quotechar characters inside a quoted value as a single quotechar
. By default, doublequote=True
.
39. escapechar
| string
| optional
The one-character string used to escape other characters, such as the quotechar appearing inside a quoted value. By default, escapechar=None
.
40. comment
| string
| optional
If your input file contains comments, then you can specify what identifies a comment. For instance, setting to "#"
will mean that the characters after the #
in the same line will be treated as a comment. The passed value must be a single character.
By default, comment=None
.
41. encoding
| string
| optional
The encoding to use when reading the file (e.g. "latin-1"
, "iso-8859-1"
). By default, encoding="utf-8"
.
42. error_bad_lines
| boolean
| optional
Whether or not to throw an error when a row cannot be parsed properly. For instance, if your row contains 6 values when you only have 4 columns, an error will be thrown. If set to True
, then an error will be raised and no DataFrame will be returned. If set to False
, then rows that cannot be parsed will be ignored. By default, error_bad_lines=True
.
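A minimal sketch (assuming a messy.txt file containing a few malformed rows) that silently drops the rows that cannot be parsed:

df = pd.read_csv("messy.txt", error_bad_lines=False)   # hypothetical file; bad rows are skipped instead of raising an error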
43. warn_bad_lines
| boolean
| optional
Whether or not to output warning when a row cannot be parsed properly. By default, warn_bad_lines=True
.
44. delim_whitespace
| boolean
| optional
Whether or not to set whitespaces as the delimiter. If this is set, then the sep
parameter should not be specified. By default, delim_whitespace=False
.
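As a quick sketch (the file name is an assumption), reading a whitespace-separated file without specifying sep:

df = pd.read_csv("whitespace.txt", delim_whitespace=True)   # hypothetical file; equivalent to sep="\s+"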
45. low_memory
| boolean
| optional
Whether or not to internally process the file in chunks. If set to True
, then the resulting type of the columns may become less accurate. To ensure that this does not happen, explicitly specify the data type using the dtype
parameter. By default, low_memory=True
.
The resulting DataFrame would be the same regardless of the value set for low_memory
. The file is only parsed in chunks internally so as to save memory during the reading process. As stated, the data type may end up different in some minor cases.
46. memory_map
| boolean
| optional
If True
, then the file will be mapped directly onto memory and accessed from there. This can result in a significant performance boost since there is no further I/O overhead. By default, memory_map=False
.
47. float_precision
| string
| optional
The converter to use for floats. The available options are as follows:
None for the standard converter
"high" for the high-precision converter
"round_trip" for the round-trip converter
This is only relevant for when engine="c"
. By default, float_precision=None
.
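A minimal sketch (the file name is an assumption) that opts into the round-trip converter so that floats survive a write-then-read cycle unchanged:

df = pd.read_csv("floats.txt", float_precision="round_trip")   # floats.txt is a hypothetical file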
Return Value
If chunksize
is set, then a TextFileReader
object is returned. Otherwise, a DataFrame
with the imported data is returned.
Examples
Basic usage
Suppose we have the following text file called sample.txt
, which contains some comma-separated data:
A,B,C
3,4,5
6,7,8
To read this file as a DataFrame:
df = pd.read_csv("sample.txt")df
A B C0 3 4 51 6 7 8
Make sure that the text file is in the same directory as the Python script.
Specifying header
Suppose our sample.txt
file is as follows:

# I am a comment
A,B,C
3,4,5
6,7,8
Here, the first line of the file is empty, and the second line is a comment.
By default, header=0
, which means that the first valid line (non-comment and non-empty line) will be regarded as the column labels:
df = pd.read_csv("sample.txt", comment="#") # header=0df
A B C0 3 4 51 6 7 8
A common mistake is to interpret header=0
as meaning the first line of the file. This is not necessarily true, as demonstrated in the example, since header
ignores lines that are:
blank, when skip_blank_lines=True (the default)
comments, as indicated by the comment parameter.
Specifying the column labels
Suppose our sample.txt
file is as follows:
3,4,5
6,7,8
Notice how the file does not contain any header.
To provide the column labels, set the names
parameter like so:
df = pd.read_csv("sample.txt", names=["A","B","C"])df
A B C0 3 4 51 6 7 8
Overriding the header
Suppose our sample.txt
file is as follows:
A,B,C
3,4,5
6,7,8
Notice how the file contains a header line.
To override the header line, set header=0
and names
:
df = pd.read_csv("sample.txt", header=0, names=["D","E","F"])df
D E F0 3 4 51 6 7 8
Specifying index_col
Suppose our sample.txt
is as follows:
A,B,C
a,3,4,5
b,6,7,8
Since there are 4 values (instead of 3) from the second row onwards, read_csv(~)
will automatically treat the first column as the index:
df = pd.read_csv("sample.txt")df
A B Ca 3 4 5b 6 7 8
Now, consider a file that starts with a ,
like so:
,A,B,C
a,3,4,5
b,6,7,8
Since each line contains the same number of ,
, the method will not automatically treat the first column as the index. To ensure that the first column is treated as the index, pass index_col=0
like so:
df = pd.read_csv("sample.txt", index_col=0)df
A B Ca 3 4 5b 6 7 8
Specifying usecols
By default, all columns are read and parsed as a DataFrame. You can select which columns to include by specifying the usecols
parameter.
Consider the following sample.txt
file:
A,B,C
3,4,5
6,7,8
To read columns A
and C
only:
df = pd.read_csv("sample.txt", usecols=["A","C"])df
A C0 3 51 6 8
You could also pass in a function for usecols
, which takes in as argument the label of a column, and returns a boolean indicating whether or not to include the column.
For instance, to only include column with label "B"
:
df = pd.read_csv("sample.txt", usecols=lambda x: x == "B")df
B0 41 7
For those who are not familiar with lambdas in Python, this is equivalent to the following:
def foo(col_label):   # the name of the function is irrelevant
    return col_label == "B"

df = pd.read_csv("sample.txt", usecols=foo)
df

   B
0  4
1  7
Specifying mangle_dupe_cols
When column labels clash, the latter column will not override the former column - instead the default behaviour of mangle_dupe_cols=True
is to append a suffix to the latter column.
Consider the following sample.txt
file:
A,B,A
3,4,5
6,7,8
Reading this file with the default parameter produces:
df = pd.read_csv("sample.txt")df
A B A.10 3 4 51 6 7 8
Although not documented officially, mangle_dupe_cols=False
, which is supposed to allow for duplicate labels, is not yet supported.
Specifying dtype
Consider the following sample.txt
file:
A,B
3,4
5,6
By default, the column types will be inferred from the data:
df = pd.read_csv("sample.txt")df.dtypes
A int64B int64dtype: object
To set the type of all columns:
df = pd.read_csv("sample.txt", dtype=float)df.dtypes
A float64B float64dtype: object
To set the type separately for the columns:
df = pd.read_csv("sample.txt", dtype={"A":float, "B":str})df.dtypes
A float64B objectdtype: object
Specifying converters
The converters parameter is extremely useful as it allows us to apply a mapping on the column values.
Consider the following sample.txt
file:
A,B,C
3,4,5
6,7,8
We can transform the values for specific columns like so:
df = pd.read_csv("sample.txt", converters={"A": lambda x : int(x)+10, "C": lambda x : int(x) > 6})df
A B C0 13 4 False1 16 7 True
The argument for the functions is a column value of type string
, which means that we must first convert it to a numerical type to perform arithmetic.
Specifying true_values and false_values
Consider the following sample.txt
file:
A,B,C
a,b,c
d,e,f
To map value "a"
to True
and "d"
to False
:
df = pd.read_csv("sample.txt", true_values=["a"], false_values=["d"])df
A B C0 True b c1 False e f
Note that the mapping only works if you map the entire column to booleans like we did above. For instance, the following won't work as intended:
df = pd.read_csv("sample.txt", true_values=["a"], false_values=["f"])df
A B C0 a b c1 d e f
Specifying skiprows
The skiprows
parameter is the line numbers (in integer indices) to skip over. It takes in a single integer, a sequence of integers or a function. Since the former two are self-explanatory, we will demonstrate how to pass a function. The function takes in as argument the row label, and outputs a boolean indicating whether or not to skip the row.
Consider the following sample.txt
file:
A,B,C
3,4,5
6,7,8
To skip the second row (row with integer index 1
):
df = pd.read_csv("sample.txt", skiprows= lambda x : x == 1)df
A B C0 6 7 8
Specifying na_filter
Consider the following sample.txt
file:
A,B,C
3,4,nan
6,NaN,8
Here, the question is whether or not to treat the nan
and NaN
as strings or actual missing values (i.e. nan
). By default, na_filter=True
, which means that they will be treated as nan
:
df = pd.read_csv("sample.txt") # na_filter=Truedf
A B C0 3 4.0 NaN1 6 NaN 8.0
It should be clear that we now have actual nan
values, but we can confirm this using the df.isnull()
method which checks for nan
values:
df.isnull()
       A      B      C
0  False  False   True
1  False   True  False
To treat the nan
and NaN
as strings instead, set na_filter=False
like so:
df = pd.read_csv("sample.txt", na_filter=False)df
A B C0 3 4 nan1 6 NaN 8
We then call df.isnull()
again to check for actual missing values (nan
):
df.isnull()
       A      B      C
0  False  False  False
1  False  False  False
We see that the values are all False
, meaning the values were interpreted as strings instead.
Specifying dates
Parsing index as date
Consider the following sample.txt
file:
name,gender
1995,alice,female
2005/12,bob,male
Here, the index (row labels) is to be parsed as a date. To do so, set parse_dates=True
:
df = pd.read_csv("sample.txt", parse_dates=True)df
name gender1995-01-01 alice female2005-12-01 bob male
The default date format is YYYY-MM-DD
, and unspecified days and months will be filled with 01
as shown in the output.
Parsing non-index columns as date
Consider the following sample.txt
file:
name,admitted,graduated
alice,2014,2018
bob,2016,2020
Setting parse_dates=True
, as we did before, will mean that only the index will be parsed as date. This is undesirable for this file since the dates to parse are non-index columns.
To parse columns admitted
and graduated
as dates:
df = pd.read_csv("sample.txt", parse_dates=["admitted","graduated"])df
name admitted graduated0 alice 2014-01-01 2018-01-011 bob 2016-01-01 2020-01-01
To confirm the data type of our columns:
df.dtypes
name                 object
admitted     datetime64[ns]
graduated    datetime64[ns]
dtype: object
Combining multiple date columns
Consider the following sample.txt
file:
Year,Month
2020,12
1994,02
Here, we have 2 separate columns that we want to parse as a single date column. We can do this by passing in a nested list, like so:
df = pd.read_csv("sample.txt", parse_dates=[["Year","Month"]])df
Year_Month0 2020-12-011 1994-02-01
We can also combine dates by passing in a dict
like so:
df = pd.read_csv("sample.txt", parse_dates={"A":["Year","Month"]})df
A0 2020-12-011 1994-02-01
The advantage of using dict
instead of a nested list is that you can supply a label for the combined column (A
in this case).
Specifying date_parser
Consider the following sample.txt
file:
Year,Month
2020,12
1994,02
The date_parser
parameter allows you to specify how date strings should be parsed.
For instance, to apply an offset of one year:
from datetime import datetime
from dateutil.relativedelta import relativedelta

def my_date_parser(*arg):
    date_time = datetime.strptime(arg[0], "%Y %m")
    return date_time + relativedelta(years=1)

df = pd.read_csv("sample.txt", parse_dates={"A":["Year","Month"]}, date_parser=my_date_parser)
df

           A
0 2021-12-01
1 1995-02-01
Since the date_parser
function can be called up to 3 times with varying arguments, we use the syntax for varargs, *arg
.
Just to break down how the date_parser works, suppose we printed out what arg
is:
def my_date_parser(*arg):
    print(arg)
    date_time = datetime.strptime(arg[0], "%Y %m")
    return date_time + relativedelta(years=1)

(array(['2020', '1994'], dtype=object), array(['12', '02'], dtype=object))
('2020 12',)
('1994 02',)
The first time my_date_parser is called, two arrays, each containing the date strings to combine, are passed in as arguments. Since the datetime.strptime function takes in a string and not an array, this call actually throws an error. However, instead of terminating the program, my_date_parser is then called again - this time, the argument passed is a single concatenated date string (e.g. '2020 12'). This can be parsed properly by strptime, so this time the parsing is successful.
Specifying iterator
Consider the following sample.txt
file:
A,B
1,2
3,4
5,6
7,8
9,10
By setting iterator=True
, the method returns a TextFileReader
. This is useful when you want to read the file in chunks like so:
reader = pd.read_csv("sample.txt", iterator=True)print(reader.get_chunk(2))print("-----")print(reader.get_chunk(3))
A B0 1 21 3 4----- A B2 5 63 7 84 9 10
Here get_chunk(n)
returns a DataFrame with n
rows. Unless you need to read chunks of different sizes, opt to use chunksize
instead.
Specifying chunksize
Consider the following sample.txt
file:
A,B
1,2
3,4
5,6
7,8
9,10
To read this file in chunks, set chunksize
like so:
for chunk in pd.read_csv("sample.txt", chunksize=2):
    print(chunk)
    print("-----")

   A  B
0  1  2
1  3  4
-----
   A  B
2  5  6
3  7  8
-----
   A   B
4  9  10
-----
Here, every chunk
is of type DataFrame
.
When dealing with large data that cannot fit in memory, consider using iterator
or chunksize
.
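As a rough sketch of this pattern (the file name huge.txt and column A are assumptions), you can aggregate a file that is too large to load at once by processing it chunk by chunk:

total = 0
for chunk in pd.read_csv("huge.txt", chunksize=100_000):   # huge.txt is hypothetical
    total += chunk["A"].sum()   # only one chunk is held in memory at a time
print(total)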
Specifying thousands
Consider the following sample.txt
file:
A B
3,000 6,000,000
Here, our data uses ,
to indicate every thousand.
By default, these values will be treated as a string:
df = pd.read_csv("sample.txt", sep=" ")df
A B0 3,000 6,000,000
To treat the values as a numeric, set thousands
like so:
df = pd.read_csv("sample.txt", sep=" ", thousands=",")df
A B0 3000 6000000
Specifying quotechar
The quotechar
comes in handy when values confusingly contain the delimiter.
Consider the following sample.txt
file:
A,B
3,4,5
6,7
Here, notice how the second row contains 3 values, while the others contain only 2 values. Calling read_csv(~)
will cause an error in such cases.
If you want to treat 3,4
as a single value as opposed to two separate values, you can wrap them in a quote. By default quotechar="
, which means that we can prevent the unwanted split by wrapping the value in double quotation marks like so:
A,B"3,4",56,7
Now, calling read_csv(~)
yields:
df = pd.read_csv("sample.txt")df
A B0 3,4 51 6 7
Notice how 3,4
is now interpreted as a single entry.
Specifying escapechar
Consider the following sample.txt
:
A,B"3\"4",56,7
Here, the inner "
is escaped using the escape character "\"
. This indicates that we want 3"4
to be read as the value for that entry, but calling read_csv(~)
erroneously yields the following:
df = pd.read_csv("sample.txt")df
A B0 3\4" 51 6 7
This happens because Pandas does not recognise "\"
as an escape character, and so misinterprets the inner "
as a quotechar
.
To get around this problem, set escapechar
to escape the inner "
:
df = pd.read_csv("sample.txt", escapechar="\\")df
A B0 3"4 51 6 7
Here, we need "\\"
because the character \
in Python is an escape character and so just having "\"
ends up escaping the closing "
, causing a syntax error.
Specifying comment
The comment
parameter is used to identify comments - everything after the passed character in the same line will be treated as a comment and thus ignored.
For instance, consider the following sample.txt
file:
A,B,C
3,4,5 # I am a comment
6,7,8 # I am a comment
To ignore these comments:
df = pd.read_csv("sample.txt", comment="#")df
A B C0 3 4 51 6 7 8