PySpark DataFrame | dropna method
PySpark DataFrame's dropna(~) method removes rows with missing values.
Parameters
1. how | string | optional
If 'any', then drop rows that contain at least one null value. If 'all', then drop only rows in which every value is null. By default, how='any'.
2. thresh | int | optional
Drop rows that have fewer non-null values than thresh. Note that this overrides the how parameter.
3. subset | string or tuple or list | optional
The columns to check for null values. By default, all columns will be checked.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
| null|null|
|Cathy|null|
+-----+----+
Dropping rows with at least one missing value in PySpark DataFrame
To drop rows with at least one missing value:
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
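Dropping rows with fewer than n non-missing values in PySpark DataFrame
To keep only rows with at least 2 non-missing values, that is, to drop rows with fewer than 2:
n_non_missing_vals = 2
df.dropna(thresh=n_non_missing_vals).show()  # df is assumed to be the DataFrame shown above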
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows with at least n missing values in PySpark DataFrame
To drop rows with at least 2 missing values:
Dropping rows with all missing values in PySpark DataFrame
To drop rows with all missing values:
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+
Dropping rows where a certain value is missing in PySpark DataFrame
To drop rows where the value for age is missing:
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (either) in PySpark DataFrame
To drop rows where either the name or age column value is missing:
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (all) in PySpark DataFrame
To drop rows where the name and age column values are both missing:
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+