Hive Metastore Utils¶
About Hive Metastore¶
The Hive Metastore is a database that stores metadata about Hive tables, which you operate on in your Sparkly (Hive) context. You can read more about it in the Apache Hive documentation.
To configure a SparklyContext to work with your Hive Metastore, you have to set the hive.metastore.uris option. You can do this via a hive-site.xml file in the Spark config directory ($SPARK_HOME/conf/hive-site.xml), like this:
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://<n.n.n.n>:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
or set it dynamically in the SparklyContext options, like this:
class MySparklyContext(SparklyContext):
    options = {
        'hive.metastore.uris': 'thrift://<n.n.n.n>:9083',
    }
After this, your Sparkly context will operate on the configured Hive Metastore.
Use cases¶
Check for existence¶
Why: sometimes the logic of your program depends on the existence of a table in the Hive Metastore. For example, to decide whether to create a new table or to replace an existing one.
from sparkly import SparklyContext

hc = SparklyContext()
assert hc.hms.table('my_table').exists() in {True, False}
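The existence check pairs naturally with the create/replace methods described below. A minimal sketch of the decision logic (the choose_action helper is hypothetical, not part of sparkly):

```python
def choose_action(table_exists):
    """Decide whether to create a new table or replace an existing one."""
    return 'replace' if table_exists else 'create'

# Hypothetical usage with a SparklyContext `hc`:
# action = choose_action(hc.hms.table('my_table').exists())
```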
Create a table in hive metastore¶
Why: you may want to unify access to all your data via Hive Metastore tables. To do this, you generally need to run a ‘CREATE TABLE ...’ statement for each dataset you have. To simplify this, we implemented a method that generates the CREATE TABLE statement from the passed parameters and executes it on the Hive Metastore.
Input: a table name and data on some storage (HDFS or S3), stored in a specific format (parquet, avro, csv, etc.)
Output: a table available in the Hive Metastore.
from sparkly import SparklyContext

# input
hc = SparklyContext()
df = hc.read_ext.by_url('parquet:s3://path/to/data/')

# operation
hc.hms.create_table(
    'new_shiny_table',
    df,
    location='s3://path/to/data/',
    partition_by=['partition', 'fields'],
    table_format='parquet',
)

new_df = hc.read_ext.by_url('table://new_shiny_table')
Replace table in hive metastore¶
Why: sometimes you want to quickly replace the data underlying a table in the Hive Metastore. For example, you exported a new snapshot of your data to a new location and want to point the Hive Metastore table at it. This method avoids downtime during which the table's data would be inaccessible: it first creates a new table separately (a slow operation) and then only manipulates metadata (a quick renaming operation).
Input: table name to replace, data schema, location, partitioning, format.
Output: updated table in Hive Metastore.
from sparkly import SparklyContext

# input
hc = SparklyContext()
df = hc.read_ext.by_url('csv:s3://path/to/data/new/')

# operation
table = hc.hms.replace_table(
    'old_table',
    df,
    location='s3://path/to/data/new/',
    partition_by=['partition', 'fields'],
)
Operating on table properties¶
Why: sometimes you want to attach metadata to your table, such as creation time, last update, purpose, or data source. Table properties are a perfect place for this. Generally you would have to execute SQL queries and parse the results to manipulate table properties; we implemented a more convenient interface on top of this.
Set/Get property
from sparkly import SparklyContext
hc = SparklyContext()
table = hc.hms.table('my_table')
table.set_property('foo', 'bar')
assert table.get_property('foo') == 'bar'
assert table.get_all_properties() == {'foo': 'bar'}
Note: properties may only have string keys and values, so you have to handle serialization from other data types yourself.
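One common way to store structured values as string properties (a sketch using the standard json module, not part of sparkly's API) is to serialize on write and deserialize on read:

```python
import json

def encode_property(value):
    """Serialize a JSON-compatible value into a string suitable for a table property."""
    return json.dumps(value)

def decode_property(raw):
    """Restore the original value from its string form."""
    return json.loads(raw)

meta = {'created_at': '2016-01-01', 'source': 's3://path/to/data/'}
encoded = encode_property(meta)

# Hypothetical usage with a sparkly Table object `table`:
# table.set_property('meta', encoded)
# assert decode_property(table.get_property('meta')) == meta
```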
API documentation¶
class sparkly.hive_metastore_manager.SparklyHiveMetastoreManager(hc)¶
A set of tools to interact with the Hive Metastore.
create_table(table_name, schema, location, partition_by=None, table_format=None, properties=None)¶
Creates a table in the Hive Metastore.
Parameters:
- table_name (str) – name of the new table.
- schema (pyspark.sql.dataframe.DataFrame) – a dataframe whose schema defines the table.
- location (str) – location of the data.
- partition_by (list) – partitioning columns.
- table_format (str) – storage format; default is parquet.
- properties (dict) – properties to assign to the table.
Returns: Table
replace_table(table_name, schema, location, partition_by=None, table_format=None)¶
Replaces the table table_name with the data represented by schema and location.
Parameters:
- table_name (str) – table name.
- schema (pyspark.sql.dataframe.DataFrame) – a dataframe whose schema defines the table.
- location (str) – data location, e.g. s3://path/to/data.
- partition_by (list) – fields the data is partitioned by.
Returns: Table
class sparkly.hive_metastore_manager.Table(hms, table_name)¶
Represents a table in the Hive Metastore and provides metadata operations on it.
get_all_properties()¶
Returns all table properties.
Returns: property names to values.
Return type: dict