Hive Metastore Utils

About Hive Metastore

The Hive Metastore is a database with metadata for Hive tables.

To configure SparklySession to work with an external Hive Metastore, you need to set the hive.metastore.uris option. You can do this via a hive-site.xml file in the Spark config ($SPARK_HOME/conf/hive-site.xml):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>

or set it dynamically via SparklySession options:

class MySession(SparklySession):
    options = {
        'hive.metastore.uris': 'thrift://<n.n.n.n>:9083',
    }

Tables management

Why: you need to check if tables exist, rename them, drop them, or even overwrite existing aliases in your catalog.

from sparkly import SparklySession


spark = SparklySession()

assert spark.catalog_ext.has_table('my_table') in {True, False}
spark.catalog_ext.rename_table('my_table', 'my_new_table')
spark.catalog_ext.create_table('my_new_table', path='s3://my/parquet/data', source='parquet', mode='overwrite')
spark.catalog_ext.drop_table('my_new_table')

Table properties management

Why: sometimes you want to assign custom attributes to your table, e.g. creation time, last update, purpose, or data source. The only way to interact with table properties in Spark is to use raw SQL queries. We implemented a more convenient interface to make your code cleaner.

from sparkly import SparklySession


spark = SparklySession()
spark.catalog_ext.set_table_property('my_table', 'foo', 'bar')
assert spark.catalog_ext.get_table_property('my_table', 'foo') == 'bar'
assert spark.catalog_ext.get_table_properties('my_table') == {'foo': 'bar'}

Note: properties are stored as strings. If you need other types, consider using a serialisation format, e.g. JSON.
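For example, a structured value can be round-tripped through JSON before it is stored (a minimal sketch; the property contents below are made up for illustration):

```python
import json

# Table properties are stored as strings, so serialise complex values to JSON
# before writing them, and parse them back after reading.
meta = {'created_at': 1700000000, 'sources': ['s3://raw/events', 's3://raw/users']}
encoded = json.dumps(meta)     # a plain string, safe to store as a property
decoded = json.loads(encoded)  # back to the original structure on read
assert decoded == meta
```

With a live session you would pass the encoded string to spark.catalog_ext.set_table_property and run json.loads on the value returned by get_table_property.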

Using non-default database

Why: to split your warehouse into logical groups (for example, by system components). All catalog_ext.* methods accept full table names in the form <db-name>.<table-name> and operate on them properly.

from time import time
from sparkly import SparklySession

spark = SparklySession()

if spark.catalog_ext.has_database('my_database'):
    spark.catalog_ext.rename_table(
        'my_database.my_badly_named_table',
        'new_shiny_name',
    )
    spark.catalog_ext.set_table_property(
        'my_database.new_shiny_name',
        'last_update_at',
        time(),
    )

Note: be careful with 'USE' statements like spark.sql('USE my_database'). They are stateful and may lead to weird errors if your code assumes a particular current database. Prefer fully-qualified table names instead.

API documentation

class sparkly.catalog.SparklyCatalog(spark)[source]

A set of tools to interact with HiveMetastore.

create_table(table_name, path=None, source=None, schema=None, **options)[source]

Create table in the metastore.

Extends SparkSession.Catalog.createExternalTable by accepting a mode='overwrite' option, which creates the table even if a table with the same name already exists. All other args are exactly the same.

Note

If the table exists, create two unique names, one for the new and one for the old instance, then try to swap names and drop the "old" instance. If any step fails, the metastore might be left in a broken state.

Parameters:mode (str) – if set to 'overwrite', drop any table of the same name from the metastore. Given as a kwarg. Default is to error out if the table already exists.
Returns:DataFrame associated with the created table.
Return type:pyspark.sql.DataFrame
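The swap described in the note can be sketched roughly as follows, with a plain dict standing in for the metastore (this is an illustration of the failure mode, not sparkly's actual implementation; the helper name is hypothetical):

```python
import uuid

def overwrite_table(catalog, name, new_table):
    # Create the new instance under a unique temporary name...
    tmp = '{}_{}'.format(name, uuid.uuid4().hex)
    catalog[tmp] = new_table
    # ...then swap it into place and drop the old instance. If anything
    # fails between these steps, the catalog is left in a broken state:
    # either with a stray temporary table or without the target name.
    old = catalog.pop(name, None)
    catalog[name] = catalog.pop(tmp)
    return old

catalog = {'my_table': 'old-instance'}
overwrite_table(catalog, 'my_table', 'new-instance')
assert catalog == {'my_table': 'new-instance'}
```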
drop_table(table_name, checkfirst=True)[source]

Drop table from the metastore.

Note

Follow the official documentation to understand DROP TABLE semantics: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropTable

Parameters:
  • table_name (str) – A table name.
  • checkfirst (bool) – Only issue DROPs for tables that are present in the database.
get_table_properties(table_name)[source]

Get table properties from the metastore.

Parameters:table_name (str) – A table name.
Returns:Key/value for properties.
Return type:dict[str,str]
get_table_property(table_name, property_name, to_type=None)[source]

Get table property value from the metastore.

Parameters:
  • table_name (str) – A table name. Might contain a db name, e.g. 'my_table' or 'default.my_table'.
  • property_name (str) – A property name to read value for.
  • to_type (function) – Cast value to the given type. E.g. int or float.
Returns:Any

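A sketch of the to_type behaviour on plain strings (cast_property is a hypothetical helper that mirrors the documented semantics; it is not part of sparkly):

```python
def cast_property(raw_value, to_type=None):
    # Property values come back from the metastore as strings; to_type
    # converts them to the desired Python type, e.g. int or float.
    if raw_value is None or to_type is None:
        return raw_value
    return to_type(raw_value)

assert cast_property('42', to_type=int) == 42
assert cast_property('0.5', to_type=float) == 0.5
assert cast_property('bar') == 'bar'
```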
has_database(db_name)[source]

Check if database exists in the metastore.

Parameters:db_name (str) – Database name.
Returns:bool
has_table(table_name)[source]

Check if table is available in the metastore.

Parameters:table_name (str) – A table name.
Returns:bool
rename_table(old_table_name, new_table_name)[source]

Rename table in the metastore.

Note

Follow the official documentation to understand ALTER TABLE semantics: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RenameTable

Parameters:
  • old_table_name (str) – The current table name.
  • new_table_name (str) – The new table name.
set_table_property(table_name, property_name, value)[source]

Set value for table property.

Parameters:
  • table_name (str) – A table name.
  • property_name (str) – A property name to set value for.
  • value (Any) – Will be automatically cast to string.
sparkly.catalog.get_db_name(table_name)[source]

Get database name from full table name.

sparkly.catalog.get_table_name(table_name)[source]

Get table name from full table name.
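Both helpers split a full <db-name>.<table-name> string into its parts. A hypothetical equivalent, assuming the database part is everything before the first dot and a bare table name has no database:

```python
def split_table_name(table_name):
    # partition returns ('', '', name) when there is no dot, so a bare
    # table name yields no database part.
    db, sep, table = table_name.partition('.')
    return (db, table) if sep else (None, db)

assert split_table_name('my_database.my_table') == ('my_database', 'my_table')
assert split_table_name('my_table') == (None, 'my_table')
```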