Lesson 17 - Column Functions

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from pyspark.sql.types import StringType, IntegerType
 
spark = SparkSession.builder.getOrCreate()

Built-In Functions

Spark provides a large number of built-in functions that can be applied to column objects. These functions live in the pyspark.sql.functions module, which is often imported under the alias F, as shown in the import statement below.

import pyspark.sql.functions as F

This module contains both element-wise functions and aggregation functions that can be applied to columns. We will discuss both types in this lesson and provide a few examples of each. A complete list of available functions can be found in the Apache Spark Documentation.
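Since the module exports a large number of names, it can be handy to browse them interactively. The snippet below is a small, optional sketch that uses standard Python introspection to list a sample of the public names in pyspark.sql.functions; the documentation remains the authoritative reference for signatures and behavior.

import pyspark.sql.functions as F

# List the public names exported by the module and print a small sample.
function_names = [name for name in dir(F) if not name.startswith('_')]
print(len(function_names), function_names[:8])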

We will create a small DataFrame to use in illustrating the concepts introduced in this lesson.

my_schema = 'name STRING, x1 DOUBLE, x2 INTEGER'
 
data = [
  ['Emma White', 5.2, 215],
  ['Art Brown', 4.1, 473],
  ['Carly Black', 3.7, 260],
  ['Beth Green', 4.5, 303],
  ['Dan Gray', 2.9, 185]
]
 
df = spark.createDataFrame(data, schema=my_schema)
 
df.show()
+-----------+---+---+
|       name| x1| x2|
+-----------+---+---+
| Emma White|5.2|215|
|  Art Brown|4.1|473|
|Carly Black|3.7|260|
| Beth Green|4.5|303|
|   Dan Gray|2.9|185|
+-----------+---+---+

Element-Wise Functions

Element-wise functions represent an operation that is to be applied independently to every element of a column in a DataFrame. In the cell below, we provide an example that makes use of the following element-wise functions:

  • F.upper() - This function can be applied to string columns. It returns a column in which all of the strings have been converted to upper case.
  • F.length() - This function can be applied to string columns. It returns a column containing the lengths of the strings in the original column.
  • F.exp() - This function accepts a column of numerical values and applies the natural exponential function to each value in the column.
  • F.log() - This function accepts a column of numerical values and applies the natural logarithm function to each value in the column.

Note that in the example below, we use alias() to name each newly created column. If we did not, the name of each new column would be an unsightly, auto-generated string representing the operation used to create it (a short demonstration of these default names follows the output below).

df.select(
    F.upper(col('name')).alias('name_upper'),
    F.length(col('name')).alias('name_length'),
    F.exp(col('x1')).alias('exp_x1'),
    F.log(col('x2')).alias('log_x2')
).show()
+-----------+-----------+------------------+------------------+
| name_upper|name_length|            exp_x1|            log_x2|
+-----------+-----------+------------------+------------------+
| EMMA WHITE|         10|181.27224187515122|5.3706380281276624|
|  ART BROWN|          9| 60.34028759736195| 6.159095388491933|
|CARLY BLACK|         11|  40.4473043600674| 5.560681631015528|
| BETH GREEN|         10| 90.01713130052181| 5.713732805509369|
|   DAN GRAY|          8| 18.17414536944306| 5.220355825078325|
+-----------+-----------+------------------+------------------+
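For comparison, here is a minimal sketch of the same kind of selection without alias(). The resulting column names are auto-generated strings such as upper(name) and EXP(x1) (the exact casing varies by Spark version), which is why we renamed the columns above.

# Without alias(), Spark auto-generates column names from the expressions,
# producing names like upper(name) and EXP(x1).
df.select(
    F.upper(col('name')),
    F.exp(col('x1'))
).show()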

Rounding

The built-in F.round() function can be used to round the values in a numerical column. It accepts two arguments: the column to which the function is being applied, and the number of decimal digits to which the values should be rounded. We will modify our previous example by rounding the third and fourth columns to 2 and 4 decimal places, respectively.

df.select(
    F.upper(col('name')).alias('name_upper'),
    F.length(col('name')).alias('name_length'),
    F.round(F.exp(col('x1')),2).alias('exp_x1'),
    F.round(F.log(col('x2')),4).alias('log_x2')
).show()
+-----------+-----------+------+------+
| name_upper|name_length|exp_x1|log_x2|
+-----------+-----------+------+------+
| EMMA WHITE|         10|181.27|5.3706|
|  ART BROWN|          9| 60.34|6.1591|
|CARLY BLACK|         11| 40.45|5.5607|
| BETH GREEN|         10| 90.02|5.7137|
|   DAN GRAY|          8| 18.17|5.2204|
+-----------+-----------+------+------+
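As an aside, the module also provides F.bround(), which rounds using the HALF_EVEN (banker's) rule rather than the HALF_UP rule used by F.round(). The sketch below contrasts the two on the x1 column; note that a value such as 4.5 rounds to 5.0 under HALF_UP but to 4.0 under HALF_EVEN.

# Contrast HALF_UP rounding (round) with HALF_EVEN rounding (bround).
df.select(
    col('x1'),
    F.round(col('x1'), 0).alias('round_x1'),
    F.bround(col('x1'), 0).alias('bround_x1')
).show()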

Using Functions in SQL Expression Strings

We can also use expr() and SQL expression strings to apply functions to columns. When doing so, the expression string is sent to Spark, where it is parsed and executed. Since the string gets parsed on the backend by Spark, we do not need to import any modules or functions when using this approach. Most functions found in pyspark.sql.functions have a SQL equivalent of the same name. Note that SQL function names are not case-sensitive, although you will often see them written in all caps.

The cell below illustrates how to use expr() and SQL expression strings to recreate the previous example that we considered.

df.select(
    expr('UPPER(name) AS name_upper'),
    expr('LENGTH(name) AS name_length'),
    expr('ROUND(EXP(x1), 2) AS exp_x1'),
    expr('ROUND(LOG(x2), 4) AS log_x2')
).show()
+-----------+-----------+------+------+
| name_upper|name_length|exp_x1|log_x2|
+-----------+-----------+------+------+
| EMMA WHITE|         10|181.27|5.3706|
|  ART BROWN|          9| 60.34|6.1591|
|CARLY BLACK|         11| 40.45|5.5607|
| BETH GREEN|         10| 90.02|5.7137|
|   DAN GRAY|          8| 18.17|5.2204|
+-----------+-----------+------+------+
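When every argument to select() is a SQL expression string, the DataFrame method selectExpr() provides a more compact equivalent that skips the explicit expr() calls. The following sketch recreates the same result.

# selectExpr() accepts SQL expression strings directly.
df.selectExpr(
    'UPPER(name) AS name_upper',
    'LENGTH(name) AS name_length',
    'ROUND(EXP(x1), 2) AS exp_x1',
    'ROUND(LOG(x2), 4) AS log_x2'
).show()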

Aggregation Functions

Aggregation functions combine all of the entries in a column into a single value, and can be applied using the select() method. In the example below, we will illustrate the use of the following aggregation functions:

  • F.sum() - Returns the sum of the elements in a column.
  • F.mean() - Returns the arithmetic mean of the elements in a column.
  • F.stddev() - Returns the standard deviation of the elements in a column.
  • F.min() - Returns the minimum of the elements in a column.
  • F.max() - Returns the maximum of the elements in a column.

Again, we will use alias() to assign friendly names to the new columns.

df.select(
    F.sum(col('x2')).alias('sum_x2'),
    F.mean(col('x2')).alias('mean_x2'),
    F.stddev(col('x2')).alias('stddev_x2'),
    F.min(col('x2')).alias('min_x2'),
    F.max(col('x2')).alias('max_x2')
).show()
+------+-------+------------------+------+------+
|sum_x2|mean_x2|         stddev_x2|min_x2|max_x2|
+------+-------+------------------+------+------+
|  1436|  287.2|113.10260828115327|   185|   473|
+------+-------+------------------+------+------+

We will now recreate the results above, this time using SQL expression strings.
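One way to write this, using the SQL names of the aggregation functions (AVG would work equally well in place of MEAN), is sketched below.

df.select(
    expr('SUM(x2) AS sum_x2'),
    expr('MEAN(x2) AS mean_x2'),
    expr('STDDEV(x2) AS stddev_x2'),
    expr('MIN(x2) AS min_x2'),
    expr('MAX(x2) AS max_x2')
).show()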