PySpark SQL Functions | regexp_extract method
PySpark SQL Functions' regexp_extract(~) method extracts a substring using a regular expression.
Parameters
1. str | string or Column
The column, given as a Column or as a column name, whose substrings will be extracted.
2. pattern | string
The regular expression pattern used for substring extraction.
3. idx | int
The index of the capture group whose value is extracted. Consult the examples below for clarification.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
+--------+---+
|      id|age|
+--------+---+
|id_20_30| 10|
|id_40_50| 30|
+--------+---+
Extracting a specific substring
To extract the first number in each id value, use regexp_extract(~) like so:
Here, the regular expression (\d+) matches one or more digits (20 and 40 in this case). We set the third argument to 1 to indicate that we want the value captured by the first group. This argument matters when the pattern contains multiple capture groups.
Extracting the n-th captured substring
We can use multiple capture groups, each written in parentheses, with regexp_extract(~) like so:
Here, we set the third argument to 2 to indicate that we want the values captured by the second group.