Column and DataFrame Functions

A counterpart of pyspark.sql.functions that provides useful shortcuts:

  • a cleaner alternative to chaining together multiple when/otherwise statements.
  • an easy way to join multiple dataframes at once and disambiguate fields with the same name.

API documentation

sparkly.functions.multijoin(dfs, on=None, how=None, coalesce=None)

Join multiple dataframes.

Parameters:
  • dfs (list[pyspark.sql.DataFrame]) – dataframes to join.
  • on – same as pyspark.sql.DataFrame.join.
  • how – same as pyspark.sql.DataFrame.join.
  • coalesce (list[str]) – column names to disambiguate by coalescing across the input dataframes. A column must be of the same type across all dataframes that define it; if different types appear, coalesce makes a best-effort attempt to merge them. The selected value is the first non-null one, in order of appearance of the dataframes in the input list. Default is None: don’t coalesce any ambiguous columns.
Returns:

pyspark.sql.DataFrame, or None if the provided dataframe list is empty.

Example

Assume we have two DataFrames: first = [{'id': 1, 'value': None}, {'id': 2, 'value': 2}] and second = [{'id': 1, 'value': 1}, {'id': 2, 'value': 22}].

Then collecting the DataFrame produced by

multijoin([first, second], on='id', how='inner', coalesce=['value'])

yields [{'id': 1, 'value': 1}, {'id': 2, 'value': 2}].
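For reference, here is a runnable version of this example as a minimal sketch; the local SparkSession and the explicit schema are assumptions added so that the None value gets a concrete type:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from sparkly.functions import multijoin

spark = SparkSession.builder.master('local[1]').getOrCreate()

schema = StructType([
    StructField('id', IntegerType()),
    StructField('value', IntegerType()),
])
first = spark.createDataFrame([(1, None), (2, 2)], schema)
second = spark.createDataFrame([(1, 1), (2, 22)], schema)

# For each id, the coalesced 'value' is the first non-null value
# in the order the dataframes were passed in.
multijoin([first, second], on='id', how='inner', coalesce=['value']).collect()
# [Row(id=1, value=1), Row(id=2, value=2)]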

sparkly.functions.switch_case(switch, case=None, default=None, **additional_cases)

Switch/case style column generation.

Parameters:
  • switch (str, pyspark.sql.Column) – column to “switch” on; its values are compared against the defined cases.
  • case (dict) – case statements. When a key matches the value of the column in a specific row, the respective value is assigned to the new column for that row. This is useful when your case condition constants are not strings (see the sketch below).
  • default – default value to be used when the value of the switch column doesn’t match any of the keys.
  • additional_cases – additional “case” statements, kwargs style. Same semantics as case above; if both are provided, case takes precedence.
Returns:

pyspark.sql.Column
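
For the non-string case keys mentioned above, pass them via the case dict, since they cannot be spelled as kwargs. A minimal sketch; the weekday column and its ISO integer codes are assumptions used for illustration:

from sparkly.functions import switch_case

# 6 and 7 are integer keys, so they go through `case` rather than kwargs.
is_weekend = switch_case('weekday', case={6: True, 7: True}, default=False)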

Example

switch_case('state', CA='California', NY='New York', default='Other')

is equivalent to the following, where F is pyspark.sql.functions:

>>> F.when(
...     F.col('state') == 'CA', 'California'
... ).when(
...     F.col('state') == 'NY', 'New York'
... ).otherwise('Other')
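
And a runnable version as a minimal sketch; the local SparkSession and the DataFrame of states are assumptions used for illustration:

from pyspark.sql import SparkSession
from sparkly.functions import switch_case

spark = SparkSession.builder.master('local[1]').getOrCreate()

df = spark.createDataFrame([('CA',), ('NY',), ('TX',)], schema=['state'])
df.withColumn(
    'state_name',
    switch_case('state', CA='California', NY='New York', default='Other'),
).show()
# +-----+----------+
# |state|state_name|
# +-----+----------+
# |   CA|California|
# |   NY|  New York|
# |   TX|     Other|
# +-----+----------+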