Extracting substrings from PySpark DataFrame column
There are two main methods you can use to extract substrings from column values in a PySpark DataFrame:

- substr(~): extracts a substring using a position and length
- regexp_extract(~): extracts a substring using a regular expression
Extracting substring using position and length (substr)
Consider the following PySpark DataFrame:
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
To extract the substring from the 1st position (inclusive) with a length of 3:
Here, note the following:
- the first argument of substr(1,3) is the 1-based (not 0-based) starting position (inclusive). The second argument (3 in this case) is the maximum number of characters to extract. The emphasis is on maximum here because if we set substr(1,4), 'Bob' would still be returned.
- we are using the alias(~) method to assign a label to the PySpark Column returned by substr(~).
You could also specify a negative starting position for substr(~):
Here, we are starting from the 3rd character (inclusive) from the end.
Extracting substring using regular expression (regexp_extract)
Consider the following PySpark DataFrame:
+----+---+
|  id|age|
+----+---+
|id-6| 20|
|id-8| 30|
+----+---+
To extract the id number from the id column, use the regexp_extract(~) method:
from pyspark.sql import functions as F

df.select(F.regexp_extract('id', r'(\d+)', 1).alias('id_number')).show()
+---------+
|id_number|
+---------+
|        6|
|        8|
+---------+
Here, note the following:
- the first argument of regexp_extract(~) is the label of the target column.
- the regular expression (\d+) captures one or more digits. The parentheses are important here because they are used to capture and extract substrings.
- the third argument value 1 means that we capture the substring obtained from the first group. This argument is useful when we have multiple capturing groups (e.g. (\d+)-(\d+)).

For a more detailed discussion on regexp_extract(~), please consult our documentation here.