PySpark SQL Functions | split method
PySpark SQL Functions' split(~) method returns a new PySpark column of arrays containing the split tokens, based on the specified delimiter.
Parameters
1. str | string or Column

The column in which to perform the splitting.

2. pattern | string

The regular expression that serves as the delimiter.

3. limit | int | optional
If limit > 0, then the resulting array of split tokens will contain at most limit tokens.

If limit <= 0, then there is no limit on the number of splits performed.

By default, limit=-1.
Return Value
A new PySpark column.
Examples
Consider the following PySpark DataFrame:
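As a sketch, this DataFrame can be constructed like so (assuming a local SparkSession; the names spark and df are our own):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Column x holds "#"-delimited strings, plus one null row
df = spark.createDataFrame([["A#A"], ["B##B"], ["#C#C#C#"], [None]], ["x"])
df.show()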
+-------+
|      x|
+-------+
|    A#A|
|   B##B|
|#C#C#C#|
|   null|
+-------+
Splitting strings by delimiter in PySpark Column
To split the strings in column x by "#", use the split(~) method:
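One way to call it, sketched here with the alias split_x (our own naming, used to keep the output header readable):

df.select(F.split("x", "#").alias("split_x")).show()

+-------------+
|      split_x|
+-------------+
|       [A, A]|
|     [B, , B]|
|[, C, C, C, ]|
|         null|
+-------------+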
Here, note the following:

* the second delimiter parameter is actually parsed as a regular expression - we will see an example of this later.

* splitting null results in null.
We can also specify the maximum number of splits to perform using the optional parameter limit:
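For example, a sketch using the same DataFrame with limit=2:

df.select(F.split("x", "#", 2).alias("split_x")).show()

+----------+
|   split_x|
+----------+
|    [A, A]|
|   [B, #B]|
|[, C#C#C#]|
|      null|
+----------+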
Here, the array containing the split tokens can be at most of length 2. This is why we still see our delimiter substring "#" in the resulting tokens.
Splitting strings using regular expression in PySpark Column
Consider the following PySpark DataFrame:
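Again sketched with our own variable names, reusing the SparkSession from above:

df = spark.createDataFrame([["A#A"], ["B@B"], ["C#@C"]], ["x"])
df.show()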
+----+
|   x|
+----+
| A#A|
| B@B|
|C#@C|
+----+
To split by either the character # or @, we can use a regular expression as the delimiter:
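For instance, keeping the split_x alias from before:

df.select(F.split("x", "[#@]").alias("split_x")).show()

+--------+
| split_x|
+--------+
|  [A, A]|
|  [B, B]|
|[C, , C]|
+--------+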
Here, the regular expression [#@] denotes either # or @.