Resolving ParserError: Error tokenizing data in Pandas
Common reasons for ParserError: Error tokenizing data when initializing a Pandas DataFrame include:
Using the wrong delimiter
The number of fields in certain rows does not match the header
To resolve the error we can try the following:
Specifying the delimiter through the sep parameter in read_csv(~)
Fixing the original source file
Skipping bad rows
Examples
Specifying sep
By default, the read_csv(~) method assumes sep=",". Therefore, when reading files that use a different delimiter, make sure to explicitly specify the delimiter to use.
Consider the following slash-delimited file called test.txt:
col1/col2/col3
1/A/4
2/B/5
3/C,D,E/6
To initialize a DataFrame using the default sep=",":
import pandas as pd
df = pd.read_csv('test.txt')
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 3
An error is raised because the first line in the file does not contain any commas, so read_csv(~) expects every line in the file to contain only 1 field. However, line 4 has 3 fields as it contains two commas, which results in the ParserError.
To initialize the DataFrame by correctly specifying slash (/) as the delimiter:
df = pd.read_csv('test.txt', sep='/')
df
   col1   col2  col3
0     1      A     4
1     2      B     5
2     3  C,D,E     6
We can now see that the DataFrame is initialized as expected, with each line containing the 3 fields that were separated by slashes (/) in the original file.
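When the delimiter is not known in advance, one option is to guess it before calling read_csv(~). The following is a minimal sketch, not part of the original example, that uses csv.Sniffer from Python's standard library. The in-memory string raw stands in for test.txt, and the candidate list "/,;\t|" is an assumption that you would adjust for your own files. The guess is a heuristic, so it is worth checking the result by eye.

```python
import csv
import io

import pandas as pd

# Stand-in for the contents of test.txt (assumed so the sketch is self-contained)
raw = "col1/col2/col3\n1/A/4\n2/B/5\n3/C,D,E/6\n"

# Ask the sniffer to choose among a few candidate delimiters (an assumed list)
dialect = csv.Sniffer().sniff(raw, delimiters="/,;\t|")
detected = dialect.delimiter

# Pass the detected delimiter to read_csv explicitly
df = pd.read_csv(io.StringIO(raw), sep=detected)
print(repr(detected))
print(df)
```

Restricting the candidates matters here: without the delimiters argument the sniffer may settle on a character that merely appears often in the sample.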
Fixing original source file
Consider the following comma-separated file called test.csv:
col1,col2,col3
1,A,4
2,B,5,
3,C,6
To initialize a DataFrame using the above file:
import pandas as pd
df = pd.read_csv('test.csv')
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
Here there is an error on the 3rd line as 4 fields are observed instead of 3, caused by the additional comma at the end of the line.
To resolve this error, we can correct the original file by removing the extra comma at the end of line 3:
col1,col2,col3
1,A,4
2,B,5
3,C,6
To initialize the DataFrame again using the corrected file:
df = pd.read_csv('test.csv')
df
   col1 col2  col3
0     1    A     4
1     2    B     5
2     3    C     6
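In a larger file, the offending lines may be hard to spot by eye. The following is a rough sketch of how they could be located before editing the source, using only Python's csv module. The helper find_bad_lines is a hypothetical name introduced for illustration, and raw is an assumed in-memory copy of the uncorrected test.csv.

```python
import csv
import io

# Assumed in-memory copy of the uncorrected test.csv
raw = "col1,col2,col3\n1,A,4\n2,B,5,\n3,C,6\n"

def find_bad_lines(text, delimiter=","):
    """Return (line_number, field_count, fields) for rows whose field
    count differs from the header's. Hypothetical helper for illustration."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row), row))
    return bad

bad_lines = find_bad_lines(raw)
for line_no, n_fields, fields in bad_lines:
    print(f"line {line_no}: expected 3 fields, saw {n_fields}: {fields}")
```

The trailing comma on line 3 shows up as an extra empty field, which matches the field count reported in the ParserError message.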
Skipping bad rows
Consider the following comma-separated file called test.csv:
col1,col2,col3
1,A,4
2,B,5,
3,C,6
To initialize a DataFrame using the above file:
import pandas as pd
df = pd.read_csv('test.csv')
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
Here there is an error on the 3rd line as 4 fields are observed instead of 3, caused by the additional comma at the end of the line.
To skip bad rows, pass on_bad_lines='skip' to read_csv(~):
df = pd.read_csv('test.csv', on_bad_lines='skip')
df
   col1 col2  col3
0     1    A     4
1     3    C     6
Notice how the problematic third line in the original file has been skipped in the resulting DataFrame.
This should be your last resort, as valuable information could be contained within the problematic lines. Skipping these rows means you lose this information. Wherever possible, try to identify the root cause of the error and fix the underlying problem.
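If rows must be skipped, it can help to keep a record of what was dropped so that the information is not lost silently. The following is a minimal sketch, assuming pandas 1.4 or later, where on_bad_lines also accepts a callable when engine='python' is used. The function record_bad_line and the list skipped are hypothetical names introduced for illustration, and raw is an assumed in-memory copy of test.csv.

```python
import io

import pandas as pd

# Assumed in-memory copy of test.csv with the extra comma on line 3
raw = "col1,col2,col3\n1,A,4\n2,B,5,\n3,C,6\n"

skipped = []  # collects the fields of every line that pandas rejects

def record_bad_line(fields):
    """Hypothetical handler: remember the bad line, then skip it."""
    skipped.append(fields)
    return None  # returning None tells pandas to drop the line

# A callable for on_bad_lines is only supported by the Python engine
df = pd.read_csv(io.StringIO(raw), on_bad_lines=record_bad_line, engine="python")

print(df)
print(f"{len(skipped)} line(s) skipped")
```

The handler could instead return a repaired list of fields, for example fields[:3], in which case the row is kept rather than skipped.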