Hive Metastore Utils

About Hive Metastore

The Hive metastore is a database that stores metadata about Hive tables, which you operate on through your SparklyContext (a HiveContext subclass).

To configure a SparklyContext to work with your Hive Metastore, you have to set the hive.metastore.uris option. You can do this via a hive-site.xml file in the Spark config ($SPARK_HOME/conf/hive-site.xml), like this:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>

or set it dynamically in SparklyContext options, like this:

from sparkly import SparklyContext

class MySparklyContext(SparklyContext):
    options = {
        'hive.metastore.uris': 'thrift://<n.n.n.n>:9083',
    }

After this, your SparklyContext will operate on the configured Hive Metastore.

Use cases

Check for existence

Why: sometimes the logic of your program depends on the existence of a table in the Hive Metastore. For example: to decide whether we should create a new table or replace an existing one.

from sparkly import SparklyContext
hc = SparklyContext()
assert(hc.hms.table('my_table').exists() in {True, False})
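
Building on this, a create-or-replace decision might look like the following sketch (the table name, data location, and format here are illustrative assumptions):

from sparkly import SparklyContext

hc = SparklyContext()
df = hc.read_ext.by_url('parquet:s3://path/to/data/')  # illustrative source

if hc.hms.table('my_table').exists():
    # repoint the existing table at the new data
    hc.hms.replace_table('my_table', df, location='s3://path/to/data/')
else:
    # register the data as a brand new table
    hc.hms.create_table('my_table', df, location='s3://path/to/data/')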

Create a table in the Hive Metastore

Why: you may want to unify access to all your data via Hive Metastore tables. To do this, you generally need to perform a CREATE TABLE ... statement for each dataset you have. To simplify this, we implemented a method that generates the CREATE TABLE statements from the passed parameters and executes them on the Hive Metastore.

Input: a table name and data on some storage (HDFS or S3), stored in a specific format (parquet, avro, csv, etc.).

Output: a table available in the Hive Metastore.

from sparkly import SparklyContext
# input
hc = SparklyContext()
df = hc.read_ext.by_url('parquet:s3://path/to/data/')
# operation
hc.hms.create_table(
    'new_shiny_table',
    df,
    location='s3://path/to/data/',
    partition_by=['partition', 'fields'],
    table_format='parquet',
)
new_df = hc.read_ext.by_url('table://new_shiny_table')
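
If you also want to attach metadata at creation time, the documented properties parameter accepts a dict of string keys and values (a minimal sketch; the property name below is made up):

hc.hms.create_table(
    'new_shiny_table',
    df,
    location='s3://path/to/data/',
    table_format='parquet',
    properties={'data_source': 's3://path/to/data/'},  # hypothetical metadata
)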

Replace a table in the Hive Metastore

Why: sometimes you want to quickly replace the data underlying some table in the Hive Metastore. For example, if you exported a new snapshot of your data to a new location and want to point a Hive Metastore table to this new location. This method avoids downtime during which the data in the table would not be accessible: it first creates a new table separately (a slow operation) and then swaps it in by operating on metadata only (a quick renaming operation).

Input: table name to replace, data schema, location, partitioning, format.

Output: updated table in Hive Metastore.

from sparkly import SparklyContext
# input
hc = SparklyContext()
df = hc.read_ext.by_url('csv:s3://path/to/data/new/')
# operation
table = hc.hms.replace_table(
    'old_table',
    df,
    location='s3://path/to/data/',
    partition_by=['partition', 'fields'],
)
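
Since replace_table returns a Table object, you can verify the swap right away (a minimal follow-up sketch):

assert table.exists()
new_df = table.df()  # dataframe now reads from the new location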

Operating on table properties

Why: sometimes you want to assign metadata to your table, like creation time, last update, purpose, data source, etc. Table properties are a perfect place for this. Generally, you have to execute SQL queries and parse the results to manipulate table properties. We implemented a more convenient interface on top of this.

Set/Get property

from sparkly import SparklyContext
hc = SparklyContext()
table = hc.hms.table('my_table')
table.set_property('foo', 'bar')
assert table.get_property('foo') == 'bar'
assert table.get_all_properties() == {'foo': 'bar'}

Note: properties may only have string keys and values, so you have to handle serialization from other data types yourself.
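
For simple types, the to_type argument of get_property can coerce the stored string back for you; for anything richer, serialize manually, e.g. with JSON (a sketch with made-up property names):

import json

table.set_property('row_count', str(1000))
assert table.get_property('row_count', to_type=int) == 1000

table.set_property('columns', json.dumps(['foo', 'bar']))
assert json.loads(table.get_property('columns')) == ['foo', 'bar']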

API documentation

class sparkly.hive_metastore_manager.SparklyHiveMetastoreManager(hc)

A set of tools to interact with the Hive Metastore.

create_table(table_name, schema, location, partition_by=None, table_format=None, properties=None)

Creates a table in the Hive Metastore.

Parameters:
  • table_name (str) – name of the new table.
  • schema (pyspark.sql.dataframe.DataFrame) – dataframe that provides the schema.
  • location (str) – location of the data.
  • partition_by (list) – partitioning columns.
  • table_format (str) – storage format, parquet by default.
  • properties (dict) – properties to assign to the table.
Returns: Table

get_all_tables()

Returns all tables available in the metastore.

Returns: list
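
A quick usage sketch (assuming a configured SparklyContext; the exact shape of the list items is not documented here, so we simply print them):

from sparkly import SparklyContext

hc = SparklyContext()
for table in hc.hms.get_all_tables():
    print(table)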
replace_table(table_name, schema, location, partition_by=None, table_format=None)

Replaces the table table_name with the data represented by schema and location.

Parameters:
  • table_name (str) – table name.
  • schema (pyspark.sql.dataframe.DataFrame) – dataframe that provides the schema.
  • location (str) – data location, e.g. s3://path/to/data.
  • partition_by (list) – fields the data is partitioned by.
Returns: Table

class sparkly.hive_metastore_manager.Table(hms, table_name)

Represents a table in the Hive Metastore.

Provides metadata operations on a Table.

df()

Returns a dataframe for the managed table.

Returns: pyspark.sql.dataframe.DataFrame
exists()

Checks if the table exists.

Returns: bool
get_all_properties()

Returns all table properties.

Returns: property names mapped to values.
Return type: dict
get_property(name, to_type=None)

Gets a table property.

Parameters:
  • name (str) – name of the property.
  • to_type (type) – type to coerce to, str by default.
Returns: any

set_property(name, value)

Sets a table property.

Parameters:
  • name (str) – name of the property.
  • value (str) – value of the property.
Returns: self.
Return type: Table