NumPy | genfromtxt method
Numpy's genfromtxt(~) method reads a text file and parses its content into a Numpy array. Unlike Numpy's loadtxt(~) method, genfromtxt(~) can handle missing values.
Parameters
1. fname
| string
The name of the file. If the file is not in the same directory as the script, make sure to include the path to the file as well.
2. dtype
string
or type
or list<string>
or list<type>
| optional
The desired data-type of the constructed array. By default, dtype=float64
. This means that all integers will be converted to floats as well.
If you set dtype=None
, then Numpy will attempt to infer the type from your values. This may be significantly slower than setting the type yourself.
3. comments
string
| optional
If your input file contains comments, then you can specify what identifies a comment. By default, comments="#", that is, characters after the # on the same line will be treated as a comment. You can set this to None if your text file does not include any comments.
4. delimiter
string
| optional
The string used to separate your data. By default, any consecutive whitespace acts as the delimiter.
5. skiprows
| int
| optional
This parameter has been replaced by skip_header
in Numpy version 1.10.
6. skip_header
int
| optional
The number of rows to skip at the beginning of the file. Note that this includes comment lines. By default, skip_header=0.
7. skip_footer
int
| optional
The number of rows to skip at the end of the file. Note that this includes comment lines. By default, skip_footer=0.
8. converters
dict<int,function>
| optional
You can apply a mapping to transform your column values. The key is the integer index of the column, and the value is the desired mapping. Check the examples below for clarification. By default, converters=None.
9. missing
| string
| optional
This parameter has been replaced by missing_values
in Numpy version 1.10.
10. missing_values
string
or sequence<string>
| optional
The sequence of strings that will be treated as missing values. This is only relevant when usemask=True
. Consult examples for clarification.
11. filling_values
value
or dict
or sequence<value>
| optional
If a single value is passed then all missing and invalid values will be replaced by that value. By passing a dict, you can specify different fill values for different columns. The key is the column integer index, and the value is the fill value for that column.
12. usecols
int
or sequence
| optional
The integer indices of the columns you want to read. By default, usecols=None
, that is, all columns are read.
13. names
None
or True
or string
or sequence<string>
| optional
The field names of the resulting array. This parameter is relevant only for those who wish to create a structured array.
Type | Description |
---|---|
None | A standard array instead of a structured array will be returned. |
True | The first row after the specified skip_header lines will be treated as the field names. |
string | A single string containing the field names separated by comma. |
sequence | An array-like structure containing the field names. |
By default, names=None
.
As a side note, structured arrays are not commonly used since Series and DataFrames in the Pandas library are better alternatives.
14. excludelist
sequence
| optional
The passed strings will be appended to the default exclusion list of ["return", "file", "print"]. If a field name matches an entry in this list, an underscore is appended to that field name (e.g. the field name "abc" becomes "abc_"). This is only relevant for those who wish to create a structured array.
15. deletechars
string
of length one or sequence
or dict
| optional
The character(s) to delete from the names.
16. defaultfmt
string
| optional
The format of the resulting field names when no names are given. The syntax follows Python's %-style string formatting; the default is "f%i", which yields field names "f0", "f1", and so on.
17. autostrip
boolean
| optional
Whether or not to remove leading and trailing whitespace from the values. This is only applicable for values that are strings. By default, autostrip=False
.
18. replace_space
string
| optional
The string used to replace spaces in the field names. Note that the leading and trailing spaces will be removed. By default, replace_space="_"
.
19. case_sensitive
string
or boolean
| optional
How to handle the casing of the field names.
Value | Description |
---|---|
True | Leave the casing as is. |
False | Convert field names to uppercase. |
"upper" | Convert field names to uppercase. |
"lower" | Convert field names to lowercase. |
By default, case_sensitive=True
.
20. unpack
boolean
| optional
Instead of having one giant Numpy array, you could retrieve column arrays individually by setting this to True
. For instance, col_one, col_two = np.genfromtxt(~, unpack=True)
. By default, unpack=False
.
21. usemask
| boolean
| optional
Whether or not to return a masked array. By default, usemask=False
.
22. loose
boolean
| optional
If True, invalid values will be converted to nan
and no error will be raised. By default, loose=True
.
23. invalid_raise
boolean
| optional
If the number of values in a row does not match the number of columns, then an error is raised. If set to False, then invalid rows will be omitted from the resulting array. By default, invalid_raise=True
.
24. max_rows
int
| optional
The maximum number of rows to read. By default, all lines are read.
25. encoding
| string
| optional
The encoding to use when reading the file (e.g. "latin-1", "iso-8859-1"). By default, encoding="bytes".
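For instance, a minimal sketch (assuming here, purely for illustration, that my_data.txt was saved with Latin-1 encoding): a = np.genfromtxt("my_data.txt", delimiter=",", encoding="latin-1").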
Return value
A Numpy array with the imported data.
Examples
Basic usage
Suppose we have the following text-file called my_data.txt
:
1 2 3 4
5 6 7 8
To import this file:
a = np.genfromtxt("my_data.txt")a
array([[1., 2., 3., 4.], [5., 6., 7., 8.]])
Note that this Python script resides in the same directory as my_data.txt
.
Also, the default data type is float64
, regardless of whether or not the numbers in the text file are all integers:
print(a.dtype)
float64
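As a side note, genfromtxt(~) also accepts file-like objects, so you can experiment without writing a file to disk. Here is a minimal sketch using io.StringIO as an in-memory stand-in for my_data.txt:
from io import StringIO
import numpy as np

data = StringIO("1 2 3 4\n5 6 7 8")   # same two rows as the file above
a = np.genfromtxt(data)
a
array([[1., 2., 3., 4.], [5., 6., 7., 8.]])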
Specifying the desired data type
Once again, suppose we have the following text-file called my_data.txt
:
1 2 3 4
5 6 7 8
Instead of using the default float64
, we can specify a type using dtype
:
a = np.genfromtxt("my_data.txt", dtype=int)a
array([[1, 2, 3, 4], [5, 6, 7, 8]])
Now, all the values are integers.
You can also pass a list of types to assign different types to different columns:
a = np.genfromtxt("my_data.txt", dtype=[np.int,32 int, np.float,32 float])a
array([(1, 2, 3., 4.), (5, 6, 7., 8.)], dtype=[('f0', '<i4'), ('f1', '<i8'), ('f2', '<f4'), ('f3', '<f8')])
Here, the i4
represents int32
while i8
represents int64
.
Note that this is a special type of Numpy array called a structured array. This type of array is not often used in practice since Series and DataFrames in the Pandas library are alternatives with more features.
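You can also let Numpy infer a type for each column by setting dtype=None. Here is a minimal sketch, assuming a hypothetical file with mixed column types and passing encoding="utf-8" so the text column is read as regular strings:
from io import StringIO
import numpy as np

data = StringIO("1 2.5 abc\n3 4.5 def")   # hypothetical integer, float and string columns
a = np.genfromtxt(data, dtype=None, encoding="utf-8")
a
array([(1, 2.5, 'abc'), (3, 4.5, 'def')], dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3')])
Since the columns end up with different types, the result is again a structured array, and the exact integer width (i8 here) depends on your platform.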
Specifying a custom delimiter
Suppose our my_data.txt
file is as follows:
1,2
3,4
Since our data is comma-separated, set delimiter=","
like so:
a = np.genfromtxt("my_data.txt", delimiter=",")a
1,23,4
Handling comments
Suppose our my_data.txt
file is as follows:
1,2,3,4 / I'm the first row!
5,6,7,8 / I'm the second row!
To strip out comments in the text-file, specify comments
:
a = np.genfromtxt("my_data.txt", delimiter=",", comments="/")a
array([[1., 2., 3., 4.], [5., 6., 7., 8.]])
Specifying skip_header
Suppose our my_data.txt
file is as follows:
1 2 3
4 5 6
7 8 9
To skip the first row:
a = np.genfromtxt("my_data.txt", skip_header=1)a
array([[4., 5., 6.], [7., 8., 9.]])
Specifying skip_footer
Suppose our my_data.txt
file is as follows:
1 2 3
4 5 6
7 8 9
To skip the last row:
a = np.genfromtxt("my_data.txt", skip_footer=1)a
array([[1., 2., 3.], [4., 5., 6.]])
Specifying converters
Suppose our my_data.txt
file is as follows:
1 2
3 4
Just as an arbitrary example, suppose we wanted to add 10 to all values of the 1st column, and make all the values of the 2nd column be 20:
a = np.genfromtxt("my_data.txt", converters={0: lambda x: int(x) + 10, 1: lambda x: 20})a
array([(11, 20), (13, 20)], dtype=[('f0', '<i8'), ('f1', '<i8')])
Here, the "f0"
and "f1"
are the field names, and the "i8"
denote a int64
data type.
Specifying missing_values
Suppose our my_data.txt
file is as follows:
3,??
,6
All missing and invalid values are treated as nan
, so you wouldn't need to specify missing_values="??"
here:
a = np.genfromtxt("my_data.txt", delimiter=",")a
array([[ 3., nan], [nan, 6.]])
Note that it is not possible to set a valid value, such as 6, as a missing value. The missing_values parameter comes into play only when you set usemask=True.
Here's usemask=True
without missing_values
:
a = np.genfromtxt("my_data.txt", delimiter=",", usemask=True)a
masked_array(
  data=[[3.0, nan],
        [--, 6.0]],
  mask=[[False, False],
        [ True, False]],
  fill_value=1e+20)
Notice how missing and invalid values are differentiated here: ?? has been mapped to nan with its mask flag set to False, while the genuinely missing value has been mapped to -- with its mask flag set to True.
Now, here's usemask=True
with missing_values="??"
:
a = np.genfromtxt("my_data.txt", delimiter=",", missing_values="??", usemask=True)a
masked_array(
  data=[[3.0, --],
        [--, 6.0]],
  mask=[[False, True],
        [ True, False]],
  fill_value=1e+20)
The key here is that ??, which is inherently an invalid value, is now treated as a missing value.
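As a side note, once you have a masked array, you can turn it back into an ordinary array with a fill of your choice via the masked array's filled(~) method. A minimal sketch reusing the call above:
a = np.genfromtxt("my_data.txt", delimiter=",", missing_values="??", usemask=True)
a.filled(0)
array([[3., 0.], [0., 6.]])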
Specifying filling_values
By default, all missing and invalid values are replaced by nan. To change this, specify the filling_values
like so:
a = np.genfromtxt("my_data.txt", delimiter=",", filling_values=0)a
array([[3., 0.], [0., 6.]])
You could also pass in a dictionary, with the following key-value pairs:
key: the column integer index
value: the fill value
For instance, to map all missing and invalid values in the first column to -1, and those in the second column to -2:
a = np.genfromtxt("my_data.txt", delimiter=",", filling_values={0:-1, 1:-2})a
array([[ 3., -2.], [-1., 6.]])
Reading only certain columns
Suppose our my_data.txt
file is as follows:
1 2 3
4 5 6
To read only the 1st and 3rd columns (i.e. column index 0 and 2):
a = np.genfromtxt("my_data.txt", usecols=[0,2])a
array([[1., 3.], [4., 6.]])
Specifying names
Suppose our my_data.txt file is as follows:
3 4
5 6
To assign a name to each column:
a = np.genfromtxt("my_data.txt", names=("A","B"))a
array([(3., 4.), (5., 6.)], dtype=[('A', '<f8'), ('B', '<f8')])
Here, we have assigned the name A to the first column and B to the second. Note that f8 just denotes the type float64.
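If the first row of the file itself holds the column labels, you can instead set names=True so that the header row is used for the field names. A minimal sketch, assuming a hypothetical my_data.txt that begins with a header row:
A B
3 4
5 6
Reading the file and accessing a column by name:
a = np.genfromtxt("my_data.txt", names=True)
a["A"]
array([3., 5.])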
Specifying excludelist
Suppose our my_data.txt
file is as follows:
3 4 5
6 7 8
To append a _
to certain names:
a = np.genfromtxt("my_data.txt", names=["A","B","C"], excludelist=["A"])a
array([(3., 4., 5.), (6., 7., 8.)], dtype=[('A_', '<f8'), ('B', '<f8'), ('C', '<f8')])
Notice how we have A_
as the field name for the first column.
Specifying deletechars
Suppose our my_data.txt
file is as follows:
3 4
5 6
To remove the character "c"
from the field names:
a = np.genfromtxt("my_data.txt", names=["Ab","BcD"], deletechars="c")a
array([(3., 4.), (5., 6.)], dtype=[('Ab', '<f8'), ('BD', '<f8')])
To remove multiple characters:
a = np.genfromtxt("my_data.txt", names=["Ab","BcD"], deletechars=["c","A"])a
array([(3., 4.), (5., 6.)], dtype=[('b', '<f8'), ('BD', '<f8')])
Specifying defaultfmt
Suppose our my_data.txt
file is as follows:
3 4
5 6
If the returned result is a structured array, and the names
parameter is not defined, then the field names take on the values "f0"
, "f1"
and so on by default:
a = np.genfromtxt("my_data.txt", dtype=[int, float])a
array([(3, 4.), (5, 6.)], dtype=[('f0', '<i8'), ('f1', '<f8')])
To customise this, pass the defaultfmt
parameter:
a = np.genfromtxt("my_data.txt", dtype=[int, float], defaultfmt="my_var_%i")a
array([(3, 4.), (5, 6.)], dtype=[('my_var_0', '<i8'), ('my_var_1', '<f8')])
Here, the %i
is a placeholder for the column integer index.
Specifying autostrip
Suppose our my_data.txt
file is as follows:
3,a, 4
5 ,b c,6
By default, all whitespaces that appear in the values are kept intact:
a = np.genfromtxt("my_data.txt", delimiter=",", dtype="U")a
array([['3', 'a', ' 4'], ['5 ', 'b c', '6']], dtype='<U5')
If you want to strip away the leading and trailing whitespaces, set autostrip=True
like so:
a = np.genfromtxt("my_data.txt", delimiter=",", autostrip=True, dtype="U")a
array([['3', 'a', '4'], ['5', 'b c', '6']], dtype='<U3')
Notice how the whitespace in "b c"
is still there.
Specifying replace_space
Suppose our my_data.txt
is as follows:
3 4
5 6
By default, the non-leading and non-trailing spaces are replaced by _
:
a = np.genfromtxt("my_data.txt", names=["A B", " C "])a
array([(3., 4.), (5., 6.)], dtype=[('A_B', '<f8'), ('C', '<f8')])
Notice how the leading and trailing spaces have been stripped.
To replace the spaces by a custom string, set the replace_space
parameter like so:
a = np.genfromtxt("my_data.txt", names=["A B", " C "], replace_space="K")a
array([(3., 4.), (5., 6.)], dtype=[('AKB', '<f8'), ('C', '<f8')])
Specifying case_sensitive
Suppose our my_data.txt
is as follows:
3 4
5 6
By default, case_sensitive is set to True, which means that the field names are left as is.
a = np.genfromtxt("my_data.txt", names=["Ab", "dC"])a
array([(3., 4.), (5., 6.)], dtype=[('Ab', '<f8'), ('dC', '<f8')])
To convert the field names to uppercase, set case_sensitive to either "upper" or False:
a = np.genfromtxt("my_data.txt", names=["Ab", "dC"], case_sensitive=False)a
array([(3., 4.), (5., 6.)], dtype=[('AB', '<f8'), ('DC', '<f8')])
To convert the field names to lowercase, set case_sensitive="lower":
a = np.genfromtxt("my_data.txt", names=["Ab", "dC"], case_sensitive="lower")a
array([(3., 4.), (5., 6.)], dtype=[('ab', '<f8'), ('dc', '<f8')])
Specifying unpack
Suppose our my_data.txt
file is as follows:
3 4
5 6
To retrieve the data per column instead of a single Numpy array:
col_one, col_two = np.genfromtxt("my_data.txt", unpack=True)
print("col_one:", col_one)
print("col_two:", col_two)
col_one: [3. 5.]
col_two: [4. 6.]
Specifying loose
Suppose our my_data.txt
file is as follows:
3 4
5 ??
By default, loose=True
, which means that invalid values (e.g. the ??
here) are converted into nan
:
a = np.genfromtxt("my_data.txt")a
array([[ 3., 4.], [ 5., nan]])
To raise an error if our file contains invalid values, set loose=False
, like so:
a = np.genfromtxt("my_data.txt", loose=False)a
ValueError: Cannot convert string '??'
Specifying invalid_raise
Suppose our my_data.txt file is as follows:
3,4
5
7,8
Here, the second row only contains 1 value even though the array seemingly has 2 columns.
By default, invalid_raise=True
, which means that if the file contains invalid rows, then an error is raised:
a = np.genfromtxt("my_data.txt", delimiter=",")a
ValueError: Some errors were detected! Line #2 (got 1 columns instead of 2)
We can choose to omit invalid rows by setting it to False
, like so:
a = np.genfromtxt("my_data.txt", delimiter=",", invalid_raise=False)a
array([[3., 4.], [7., 8.]])
No error is raised, but Numpy is nice enough to give us a warning:
ConversionWarning: Some errors were detected! Line #2 (got 1 columns instead of 2)
Specifying the desired dimension
Suppose our sample.txt
only had one row:
1 2 3 4
By default, genfromtxt(~) will generate a one-dimensional array:
a = np.genfromtxt("sample.txt")
a
array([1., 2., 3., 4.])
We can specify that we want a two-dimensional array by setting ndmin=2 (supported by genfromtxt(~) from NumPy 1.23 onwards):
a = np.genfromtxt("sample.txt", ndmin=2)
a
array([[1., 2., 3., 4.]])
Specifying max_rows
Suppose our my_data.txt
file is as follows:
1 2
3 4
5 6
To read only the first two rows instead of the entire file:
a = np.genfromtxt("myy_data.txt", max_rows=2)a
array([[1., 2.], [3., 4.]])
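These row-selection parameters also compose. For instance, a minimal sketch that skips the first row of the same file and then reads the next two by combining skip_header with max_rows:
a = np.genfromtxt("my_data.txt", skip_header=1, max_rows=2)
a
array([[3., 4.], [5., 6.]])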