Column and DataFrame Functions¶
A counterpart of pyspark.sql.functions providing useful shortcuts:
- a cleaner alternative to chaining together multiple when/otherwise statements.
- an easy way to join multiple dataframes at once and disambiguate fields with the same name.
API documentation¶

sparkly.functions.multijoin(dfs, on=None, how=None, coalesce=None)[source]¶

Join multiple dataframes.
Parameters:

- dfs (list[pyspark.sql.DataFrame]) – dataframes to join.
- on – same as pyspark.sql.DataFrame.join.
- how – same as pyspark.sql.DataFrame.join.
- coalesce (list[str]) – column names to disambiguate by coalescing across the input dataframes. A column must be of the same type across all dataframes that define it; if different types appear, coalesce will make a best-effort attempt at merging them. The selected value is the first non-null one in the order of appearance of the dataframes in the input list. Default is None – don’t coalesce any ambiguous columns.
Returns: pyspark.sql.DataFrame, or None if the provided dataframe list is empty.
Example

Assume we have two DataFrames, the first is

first = [{'id': 1, 'value': None}, {'id': 2, 'value': 2}]

and the second is

second = [{'id': 1, 'value': 1}, {'id': 2, 'value': 22}]

Then collecting the DataFrame produced by

multijoin([first, second], on='id', how='inner', coalesce=['value'])

yields

[{'id': 1, 'value': 1}, {'id': 2, 'value': 2}]
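The coalesce rule above (take the first non-null value in the order the dataframes are listed) can be sketched in plain Python, independent of Spark. The helper name below is illustrative only and not part of the sparkly API; the real function operates on Spark DataFrames, not dicts:

```python
# Sketch of multijoin's coalesce semantics: for each ambiguous column,
# keep the first non-null value in the order the input rows appear.
def coalesce_rows(rows):
    """Merge per-dataframe rows (dicts keyed by column name),
    keeping the first non-None value seen for each column."""
    merged = {}
    for row in rows:
        for column, value in row.items():
            if merged.get(column) is None:
                merged[column] = value
    return merged

# Rows for id=1 from the example above: the first dataframe has a null
# 'value', so the second dataframe's value is selected.
print(coalesce_rows([{'id': 1, 'value': None}, {'id': 1, 'value': 1}]))
# {'id': 1, 'value': 1}
```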
sparkly.functions.switch_case(switch, case=None, default=None, **additional_cases)[source]¶

Switch/case style column generation.
Parameters:

- switch (str, pyspark.sql.Column) – column to “switch” on; its values are compared against the defined cases.
- case (dict) – case statements. When a key matches the value of the switch column in a given row, the corresponding value is assigned to the new column for that row. This is useful when your case condition constants are not strings.
- default – default value to be used when the value of the switch column doesn’t match any of the keys.
- additional_cases – additional “case” statements, kwargs style. Same semantics as case above. If both are provided, case takes precedence.
Returns: pyspark.sql.Column
Example

switch_case('state', CA='California', NY='New York', default='Other')

is equivalent to

>>> F.when(
...     F.col('state') == 'CA', 'California'
... ).when(
...     F.col('state') == 'NY', 'New York'
... ).otherwise('Other')
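On plain values, the chained when/otherwise above collapses to a dictionary lookup with a fallback. The sketch below shows only that semantics; it is not the sparkly implementation, which builds a pyspark.sql.Column expression, and the helper name is hypothetical:

```python
# Sketch of switch/case semantics on plain Python values: look the
# switch value up in the case mapping, falling back to the default.
# switch_case expresses the same logic as a Spark Column expression.
def switch_case_value(value, cases, default=None):
    return cases.get(value, default)

cases = {'CA': 'California', 'NY': 'New York'}
print(switch_case_value('NY', cases, default='Other'))  # New York
print(switch_case_value('TX', cases, default='Other'))  # Other
```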