PySpark SQL Functions | col method
PySpark SQL Functions' col(~) method returns a Column object.
Parameters
1. col | string
The label of the column to return.
Return Value
A Column object.
Examples
Consider the following PySpark DataFrame:
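One way to construct this DataFrame, assuming an active SparkSession bound to the variable spark:

df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()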
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Selecting a column in PySpark
To select the name column:
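Here, F is the usual alias for the pyspark.sql.functions module:

import pyspark.sql.functions as F

df.select(F.col("name")).show()

+----+
|name|
+----+
|Alex|
| Bob|
+----+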
Note that we could also select the name column without the explicit use of F.col(~) like so:
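# passing the column label directly instead of a Column object
df.select("name").show()

+----+
|name|
+----+
|Alex|
| Bob|
+----+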
Creating a new column
To create a new column called status whose values are dependent on the age column:
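Here is one illustrative rule; the age cutoff of 25 and the labels "junior" and "senior" are assumptions for this sketch, not fixed by the method:

df.select(
    "*",
    # anyone under 25 (an assumed cutoff) is labeled "junior", everyone else "senior"
    F.when(F.col("age") < 25, "junior").otherwise("senior").alias("status")
).show()

+----+---+------+
|name|age|status|
+----+---+------+
|Alex| 20|junior|
| Bob| 30|senior|
+----+---+------+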
Note the following:
- the "*" refers to all the columns of df.
- we are using the when(~) and otherwise(~) pattern to fill in the values of our column conditionally.
- we use the alias(~) method to assign a label to the new column.
Note F.col("age") can also be replaced by df["age"]:
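df.select(
    "*",
    # same illustrative rule as above, written with df["age"] instead of F.col("age")
    F.when(df["age"] < 25, "junior").otherwise("senior").alias("status")
).show()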
How does col know which DataFrame's column to refer to?
Notice how the col(~) method only takes in the name of the column as an argument. PySpark executes our code lazily and waits until an action is invoked (e.g. show()) to run all the transformations (e.g. df.select(~)). Therefore, PySpark will have the needed context to decipher which DataFrame's column col(~) is referring to.
For example, suppose we have the following two PySpark DataFrames with the same schema:
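One way to set this up; the names below match the outputs that follow, while the ages are arbitrary filler values:

df1 = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df2 = spark.createDataFrame([["Cathy", 40], ["Doge", 50]], ["name", "age"])

We can define a single Column object to reuse against both DataFrames: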
my_col = F.col("name")
Let's select the name column from df1:
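df1.select(my_col).show()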
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, PySpark knows that we are referring to df1's name column because df1 is invoking the transformation (select(~)).
Let's now select the name column from df2:
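df2.select(my_col).show()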
+-----+
| name|
+-----+
|Cathy|
| Doge|
+-----+
Again, PySpark is aware that this time the name column refers to df2's column.