PySpark DataFrame | select method
The select(~) method of PySpark DataFrame returns a new DataFrame with the specified columns.
Parameters
1. *cols | string, Column or list
The columns to include in the returned DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting a single column of PySpark DataFrame
To select a single column, pass the name of the column as a string:
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Or equivalently, we could pass in a Column object:
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, df["name"] is of type Column. You can think of the role of select(~) as converting a Column object into a PySpark DataFrame.
Or equivalently, the Column object can also be obtained using sql.functions:
Selecting multiple columns of a PySpark DataFrame
To select the columns name and age:
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Or equivalently, we can supply multiple Column objects:
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Or equivalently, we can supply Column objects obtained from sql.functions:
import pyspark.sql.functions as F
df.select(F.col("name"), F.col("age")).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting all columns of a PySpark DataFrame
To select all columns, pass "*":
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting columns given a list of column labels
To select columns given a list of column labels, use the * operator:
cols = ["name", "age"]
df.select(*cols).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the * operator unpacks the list into positional arguments.
Selecting columns that begin with a certain substring
To select columns that begin with a certain substring:
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, we use a Python list comprehension to get a list of the column labels that begin with the substring "na":
cols = [col for col in df.columns if col.startswith("na")]
cols
['name']