Extracting the n-th value of lists in PySpark DataFrame
Consider the following PySpark DataFrame:
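For reference, here is a minimal sketch of how such a DataFrame could be created; the variable names spark and df are assumptions, and the values are taken from the output below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single array column holding two lists of integers
df = spark.createDataFrame([([10, 20],), ([30, 40],)], ['my_col'])
df.show()
```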
+--------+
|  my_col|
+--------+
|[10, 20]|
|[30, 40]|
+--------+
Here, my_col contains some lists.
Extracting a single value from arrays in PySpark Column
To extract the second value of each list in my_col:
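A minimal sketch using the [~] syntax, assuming the DataFrame is named df:

```python
from pyspark.sql import functions as F

# [1] is index-based, so index 1 refers to the second value
df.select(F.col('my_col')[1].alias('second_value')).show()
```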
Here, we are assigning a label to the Column returned by F.col('my_col')[1] using alias(~).
Equivalently, we can use the element_at(~) method instead of the [~] syntax:
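A sketch of the equivalent call, keeping the same second_value alias:

```python
# element_at(~) uses one-based positions, so 2 refers to the second value
df.select(F.element_at('my_col', 2).alias('second_value')).show()
```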
+------------+
|second_value|
+------------+
|          20|
|          40|
+------------+
Note that element_at(~) does not use zero-based indexing: positions are one-based, so the second value in a list is denoted by position 2.
Extracting values from the back
I recommend using element_at(~) rather than the [~] syntax because element_at(~) allows you to extract elements from the back using negative positioning:
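For instance, position -1 refers to the last value of each list; a sketch using the last_val alias shown in the output below:

```python
# Negative positions count from the end of the list
df.select(F.element_at('my_col', -1).alias('last_val')).show()
```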
+--------+
|last_val|
+--------+
|      20|
|      40|
+--------+
This is not possible using the [~] syntax or the getItem(~) method.
In case of out-of-bound indexes
Specifying out-of-bound indexes will return null values:
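For instance, position 5 does not exist in either of the two-element lists:

```python
# With the default (non-ANSI) SQL mode, out-of-bound positions yield null
df.select(F.element_at('my_col', 5)).show()
```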
+---------------------+
|element_at(my_col, 5)|
+---------------------+
|                 null|
|                 null|
+---------------------+
Extracting multiple values from arrays in PySpark Column
To extract multiple values from arrays in a PySpark Column:
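A sketch using the [~] syntax, selecting two positions in a single select(~):

```python
# Index-based: [0] is the first value, [1] is the second
df.select(F.col('my_col')[0], F.col('my_col')[1]).show()
```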
Here, we are extracting the first and second values of each list.
Equivalently, we could use element_at(~) once again:
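A sketch matching the output below, using positions 1 and -1 (first and last, which coincides with the second value here because each list holds two elements):

```python
df.select(F.element_at('my_col', 1), F.element_at('my_col', -1)).show()
```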
+---------------------+----------------------+
|element_at(my_col, 1)|element_at(my_col, -1)|
+---------------------+----------------------+
|                   10|                    20|
|                   30|                    40|
+---------------------+----------------------+
Again, you can provide an alias for each column by using the alias(~) method:
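A sketch with hypothetical labels first_value and last_value (any labels would do):

```python
df.select(
    F.element_at('my_col', 1).alias('first_value'),
    F.element_at('my_col', -1).alias('last_value'),
).show()
```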
Related
element_at(~): method used to extract values from lists or maps in a PySpark Column.
getItem(~): method that extracts a value from the lists or dictionaries in a PySpark Column.