Extracting substrings from PySpark DataFrame column
There are two main methods you can use to extract substrings from column values in a PySpark DataFrame:

- substr(~): extracts a substring using a position and length
- regexp_extract(~): extracts a substring using a regular expression
Extracting substring using position and length (substr)
Consider the following PySpark DataFrame:
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
To extract the substring from the 1st position (inclusive) with a length of 3:
Here, note the following:
- the first argument of substr(1,3) is the 1-based (not 0-based) starting position (inclusive). The second argument (3 in this case) is the maximum number of characters to extract. The emphasis is on maximum here because if we set substr(1,4), 'Bob' would still be returned.
- we are using the alias(~) method to assign a label to the PySpark Column returned by substr(~).
You could also specify a negative starting position for substr(~):
Here, we are starting from the 3rd character (inclusive) from the end.
Extracting substring using regular expression (regexp_extract)
Consider the following PySpark DataFrame:
+----+---+
|  id|age|
+----+---+
|id-6| 20|
|id-8| 30|
+----+---+
To extract the id number from the id column, use the regexp_extract(~) method:
from pyspark.sql import functions as F

df.select(F.regexp_extract('id', r'(\d+)', 1).alias('id_number')).show()
+---------+
|id_number|
+---------+
|        6|
|        8|
+---------+
Here, note the following:
- the first argument of regexp_extract(~) is the label of the target column.
- the regular expression (\d+) captures one or more digits. The parentheses are important here because they are used to capture and extract substrings.
- the third argument value 1 means that we capture the substring obtained from the first group. This argument is useful when we have multiple capturing groups (e.g. (\d+)-(\d+)).

For a more detailed discussion on regexp_extract(~), please consult our documentation here.