Pythian Blog: Technical Track

Comparing schemas between hive clusters

  When running several different hive instances, we found the process of maintaining/confirming synchronization to be quite time consuming. While several tools made available by Cloudera and other frameworks do provide a visual interface for an individual instance of hive, there was no available tool for comparing across clusters. We came up with a useful way to compare hive schemas between clusters. Our process contains 3 main steps:
  1. Fetch the schema from all hive environments, store the schema as a dictionary in a file.
  2. Create a class to perform comparisons leveraging python list, dict and pandas dataframe structures.
  3. Display the results/visualize the changes using a web interface (this approach uses python Flask).
  Step 1- Fetch the Schema From all Environments To fetch the schema we use the hive metastore api via the hive-thrift.py package. First, we create a socket to connect to the hive thrift server. transport = TSocket.TSocket(host, port) transport = TTransport.TBufferedTransport(transport) protocol = TBinaryProtocol.TBinaryProtocol(transport) global metastore_client metastore_client = ThriftHiveMetastore.Client(protocol) transport.open() Once that is done, it's straightforward to get all of the tables in a db (there are lots of other great methods that have been documented). db = 'myDatabase' tables = metastore_client.get_all_tables(db) Once we get the tables, we can call the get_fields() method for table in tables: for field in metastore_client.get_fields(db, table): # field is a named tuple FieldSchema(comment, type, name) print field Using the methods above, we retrieve all of the values and store them in a dictionary, locally in a text file, one file per environment. {'myDatabase': { 'tbl_customer': [ FieldSchema(comment=None, type=int, name='user_id'), FieldSchema(comment=None, type='string', name='first_name'), FieldSchema(comment=None, type='string', name='last_name') ], 'tbl_orders': [ FieldSchema(comment=None, type=int, name='order_id'), FieldSchema(comment=None, type='string', name='item_id'), FieldSchema(comment=None, type='string', name='price') ], etc.. Note that thrift gives us a tuple for the ddl, so for ease, we declare this named tuple locally. FieldSchema = collections.namedtuple(‘FieldSchema’, [‘comment’,’type’, ‘name’])   Step 2 - Create Class to Compare Dictionaries Now that we have a separate schema file for each hive instance, we can load these files into dictionaries and begin to compare them. f_stage_dict = open('schema_files/stage_schema_thrift.out') f_prod_dict = open('schema_files/prod_schema_thrift.out') d_stage = eval(f_stage_dict.read()) d_prod = eval(f_prod_dict.read()) Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice python tools (lists, set comparison, pandas dataframes, etc..). More detail in the repo. Class DictCompare: def get_fields_by_table(self, db, table):... # returns list of fields for tableA and tableB def get_databases(self):... # union of all dbs in all schemas A,B def compare_fields(self, db, table)... # returns tablediff tuple showing all field differences def compare_dbs(self, db)... #returns list of differences between dbs tables_a = [table for table in self.dict_a[db]] tables_b = [table for table in self.dict_b[db]] db_diffs_a = set(tables_a).difference(set(tables_b)) db_diffs_b = set(tables_b).difference(set(tables_a)) ..... Instantiate the class by passing the path of the schema files, and the friendly names used in the config.json file. dc = sc.DictCompare(schema_files_path,"dev", "stage" ) dbs = dc.get_databases() # show all of the databases.. print "DBS=",dbs # get all diffs for a given database compare_dict = dc.compare_dbs("my_db")   Step 3 Display the Results / Visualize the Changes I used a quick and simple python flask website to display the differences. Note there is a form on the page that allows users to select the hive instances to compare. @app.route('/home.html') @app.route('/home') def homepage(): form=EnvForm(csrf_enabled=False) env_a = form.env_a.data # 'stage', for example env_b = form.env_b.data # 'prod', for example dc = sc.DictCompare(schema_files_path,env_a, env_b ) dbs = dc.get_databases() return render_template("home.html", dbs=dbs, a_name =dc.a_name, b_name = dc.b_name,form =form ) Source code (open sourced under Apache2 license): I Look forward to any comments or feedback.   Discover more about our expertise in Big Data and Infrastructure technologies.

No Comments Yet

Let us know what you think

Subscribe by email