
GoldenGate 12.2 big data adapters: part 4 - HBASE

This is the next post in my series about Oracle GoldenGate Big Data adapters. Here is a list of all the posts in the series:
  1. GoldenGate 12.2 Big Data Adapters: part 1 - HDFS
  2. GoldenGate 12.2 Big Data Adapters: part 2 - Flume
  3. GoldenGate 12.2 Big Data Adapters: part 3 - Kafka
  4. GoldenGate 12.2 Big Data Adapters: part 4 - HBASE
In this post I am going to explore the HBase adapter for GoldenGate. Let's start by recalling what we know about HBase. Apache HBase is a non-relational, distributed database modelled after Google's Bigtable. It provides read and write access to data and sits on top of Hadoop and HDFS. So, what does that tell us? First, we can write and change the data. Second, we need to remember that HBase is a non-relational database, and it takes a rather different approach to data than traditional relational databases do. You can think of HBase as a key-value store. We are not going to go deep into HBase architecture and internals here, since our main task is to test the Oracle GoldenGate adapter and see how it works.

Our configuration has an Oracle database as the source, with a GoldenGate extract, and a target system running Oracle GoldenGate for Big Data. There is more information about setting up the source and target in the first post in the series, covering the HDFS adapter. The source-side replication has already been configured and started: we have an initial trail file for data initialization and trail files for the ongoing replication, and we capture changes for all tables in the ggtest schema on the Oracle database. Now we need to prepare our target site.

Let's start with HBase. For my tests I used pseudo-distributed mode, which is effectively a fully-distributed mode running on a single host. That is not acceptable for any production configuration, but it will suffice for our tests. On the same box I have HDFS serving as the main storage. The Oracle documentation for the adapter states that HBase is supported from version 1.0.x. In my first attempt I used HBase version 1.0.0 (Cloudera 5.6), but it didn't work: I got errors in GoldenGate and the delivery process abended. Here is the error:

[code lang="text"]
2016-03-29 11:51:31  ERROR   OGG-15051  Oracle GoldenGate Delivery, irhbase.prm:  Java or JNI exception:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)Lorg/apache/hadoop/hbase/HTableDescriptor;.
2016-03-29 11:51:31  ERROR   OGG-01668  Oracle GoldenGate Delivery, irhbase.prm:  PROCESS ABENDING.
[/code]

So I installed another HBase version, and version 1.1.4 worked just fine. I used a simple, standard HBase configuration for pseudo-distributed mode, where the region server runs on the same host as the master and hbase.rootdir points to the local HDFS. Here is an example of the configuration:

[code lang="text"]
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/user/oracle/hbase</value>
  </property>
</configuration>

[root@sandbox conf]# cat regionservers
localhost
[root@sandbox conf]#
[/code]
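Before switching over to GoldenGate, it is worth checking that HBase itself is healthy. A quick sanity check could look something like this (the install path is my assumption based on the sandbox layout above; adjust it to yours):

[code lang="text"]
# assumes HBase 1.1.4 is unpacked under /u01/hbase and HDFS is already running
export HBASE_HOME=/u01/hbase
$HBASE_HOME/bin/start-hbase.sh

# both the master and the region server should show up
jps | egrep 'HMaster|HRegionServer'

# and the shell should be able to reach the cluster
echo "status" | $HBASE_HOME/bin/hbase shell
[/code]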
As soon as we have HBase set up and running, we can switch our attention to GoldenGate. We already have a trail file with the initial load, so now we need to prepare configuration files for the initial and ongoing replication. Let's go to our GoldenGate for Big Data home directory and prepare everything. First, we need an hbase.props file copied from the $OGG_HOME/AdapterExamples/big-data/hbase directory to $OGG_HOME/dirprm. I left everything as it was in the original file, changing only the gg.classpath parameter so it points to my HBase configuration files and libraries. Here is the resulting file:

[code lang="text"]
[oracle@sandbox oggbd]$ cat dirprm/hbase.props

gg.handlerlist=hbase
gg.handler.hbase.type=hbase
gg.handler.hbase.hBaseColumnFamilyName=cf
gg.handler.hbase.keyValueDelimiter=CDATA[=]
gg.handler.hbase.keyValuePairDelimiter=CDATA[,]
gg.handler.hbase.encoding=UTF-8
gg.handler.hbase.pkUpdateHandling=abend
gg.handler.hbase.nullValueRepresentation=CDATA[NULL]
gg.handler.hbase.authType=none
gg.handler.hbase.includeTokens=false
gg.handler.hbase.mode=tx

goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=INFO

gg.report.time=30sec

gg.classpath=/u01/hbase/lib/*:/u01/hbase/conf:/usr/lib/hadoop/client/*

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
[/code]

Second, we have to prepare a parameter file for the initial load. I used a simple file with a minimum of parameters:

[code lang="text"]
[oracle@sandbox oggbd]$ cat dirprm/irhbase.prm
-- passive REPLICAT irhbase
-- Trail file for this example is located in "./dirdat/initld" file
-- Command to add REPLICAT
-- run replicat irhbase:
-- ./replicat paramfile dirprm/irhbase.prm reportfile dirrpt/irhbase.rpt
SPECIALRUN
END RUNTIME
EXTFILE /u01/oggbd/dirdat/initld
TARGETDB LIBFILE libggjava.so SET property=dirprm/hbase.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP GGTEST.*, TARGET BDTEST.*;
[/code]

With that parameter file we can run the replicat in passive mode from the command line and see the result. Here is the initial state of HBase:

[code lang="text"]
hbase(main):001:0> version
1.1.4, r14c0e77956f9bb4c6edf0378474264843e4a82c3, Wed Mar 16 21:18:26 PDT 2016

hbase(main):001:0> list
TABLE
0 row(s) in 0.3340 seconds

=> []
hbase(main):002:0>
[/code]

Running the replicat:

[code lang="text"]
[oracle@sandbox oggbd]$ ./replicat paramfile dirprm/irhbase.prm reportfile dirrpt/irhbase.rpt
[oracle@sandbox oggbd]$
[/code]

Now we have two tables in HBase:

[code lang="text"]
hbase(main):002:0> list
TABLE
BDTEST:TEST_TAB_1
BDTEST:TEST_TAB_2
2 row(s) in 0.3680 seconds

=> ["BDTEST:TEST_TAB_1", "BDTEST:TEST_TAB_2"]
hbase(main):003:0>
[/code]
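A side note on those table names: the adapter maps the target "schema" from our MAP statement (BDTEST) to an HBase namespace and creates the tables inside it. This is plain HBase functionality rather than anything adapter-specific, and you can confirm it from the hbase shell, which should list the BDTEST namespace and our two tables in it:

[code lang="text"]
list_namespace
list_namespace_tables 'BDTEST'
[/code]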
Let's take a closer look at the structure and contents of the tables:

[code lang="text"]
hbase(main):004:0> describe 'BDTEST:TEST_TAB_1'
Table BDTEST:TEST_TAB_1 is ENABLED
BDTEST:TEST_TAB_1
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.2090 seconds

hbase(main):005:0> scan 'BDTEST:TEST_TAB_1'
ROW    COLUMN+CELL
 1     column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 1     column=cf:PK_ID, timestamp=1459269153102, value=1
 1     column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 1     column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 1     column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 2     column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 2     column=cf:PK_ID, timestamp=1459269153102, value=2
 2     column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 2     column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 2     column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 3     column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 3     column=cf:PK_ID, timestamp=1459269153102, value=3
 3     column=cf:RND_STR, timestamp=1459269153102, value=RXZT5VUN
 3     column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 3     column=cf:USE_DATE, timestamp=1459269153102, value=2013-09-04:23:32:56
 4     column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 4     column=cf:PK_ID, timestamp=1459269153102, value=4
 4     column=cf:RND_STR, timestamp=1459269153102, value=RXZT5VUN
 4     column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 4     column=cf:USE_DATE, timestamp=1459269153102, value=2013-09-04:23:32:56
4 row(s) in 0.1630 seconds

hbase(main):006:0> scan 'BDTEST:TEST_TAB_2'
ROW                              COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:ACC_DATE, timestamp=1459269153132, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:PK_ID, timestamp=1459269153132, value=7
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:RND_STR_1, timestamp=1459269153132, value=IJWQRO7T
1 row(s) in 0.0390 seconds

hbase(main):007:0>
[/code]

Everything looks good to me: we have the structure and the records as expected. Let's go forward and set up the ongoing replication. I created a parameter file for my replicat, using the initial load parameters as a basis:

[code lang="text"]
[oracle@sandbox oggbd]$ cat dirprm/rhbase.prm
REPLICAT rhbase
-- Trail file for this example is located in "dirdat/or" directory
-- Command to add REPLICAT
-- add replicat rhbase, exttrail dirdat/or
TARGETDB LIBFILE libggjava.so SET property=dirprm/hbase.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP ggtest.*, TARGET bdtest.*;
[/code]

Now we check our trail files and start the replicat from the latest one. By default, a replicat looks for a trail with sequence number 0; but since the purging policy on my GoldenGate deletes old files, I have to tell the replicat exactly where to start, which is what the EXTSEQNO 45 option does below (it points the replicat at or000045):

[code lang="text"]
[oracle@sandbox oggbd]$ ll dirdat/
total 4940
-rw-r-----. 1 oracle oinstall    3028 Feb 16 14:17 initld
-rw-r-----. 1 oracle oinstall 2015199 Mar 24 13:07 or000043
-rw-r-----. 1 oracle oinstall 2015229 Mar 24 13:08 or000044
-rw-r-----. 1 oracle oinstall 1018490 Mar 24 13:09 or000045
[oracle@sandbox oggbd]$ ggsci

Oracle GoldenGate Command Interpreter
Version 12.2.0.1.0 OGGCORE_12.2.0.1.0_PLATFORMS_151101.1925.2
Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.

GGSCI (sandbox.localdomain) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING

GGSCI (sandbox.localdomain) 2> add replicat rhbase, exttrail dirdat/or, EXTSEQNO 45
REPLICAT added.

GGSCI (sandbox.localdomain) 3> start replicat rhbase

Sending START request to MANAGER ...
REPLICAT RHBASE starting

GGSCI (sandbox.localdomain) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
REPLICAT    RUNNING     RHBASE      00:00:00      00:00:06

GGSCI (sandbox.localdomain) 5> info rhbase

REPLICAT   RHBASE    Last Started 2016-03-29 12:56   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:08 ago)
Process ID           27277
Log Read Checkpoint  File dirdat/or000045
                     2016-03-24 13:09:02.000274  RBA 1018490

GGSCI (sandbox.localdomain) 6>
[/code]

I inserted a number of rows into test_tab_1 on the Oracle side, and all of them were successfully replicated to HBase.
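The load itself was nothing special. A generator along these lines produces comparable rows (this is a sketch, not the exact script I used; the column list matches the scans above, and you should pick an offset that doesn't collide with existing pk_id values):

[code lang="sql"]
-- hypothetical test data generator: 1000 rows of random strings and dates
insert into ggtest.test_tab_1 (pk_id, rnd_str, rnd_str_1, use_date, acc_date)
select 2000000 + level,
       dbms_random.string('X', 8),
       dbms_random.string('X', 8),
       sysdate - dbms_random.value(0, 1000),
       sysdate - dbms_random.value(0, 1000)
  from dual
connect by level <= 1000;

commit;
[/code]

A count on the HBase side confirms the rows have arrived: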
[code lang="text"]
hbase(main):015:0> count 'BDTEST:TEST_TAB_1'
Current count: 1000, row: 1005694
Current count: 2000, row: 442
Current count: 3000, row: 6333
3473 row(s) in 1.0810 seconds

=> 3473
hbase(main):016:0>
[/code]

Note that the intermediate row keys in the count output look out of order because HBase sorts row keys lexicographically, as strings, not numerically. Let's have a closer look at test_tab_1 and test_tab_2:

[code lang="text"]
hbase(main):005:0> scan 'BDTEST:TEST_TAB_1'
ROW    COLUMN+CELL
 1     column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 1     column=cf:PK_ID, timestamp=1459269153102, value=1
 1     column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 1     column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 1     column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 2     column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 2     column=cf:PK_ID, timestamp=1459269153102, value=2
 2     column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 2     column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 2     column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
..............................................

hbase(main):006:0> scan 'BDTEST:TEST_TAB_2'
ROW                              COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:ACC_DATE, timestamp=1459269153132, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:PK_ID, timestamp=1459269153132, value=7
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:RND_STR_1, timestamp=1459269153132, value=IJWQRO7T
1 row(s) in 0.0390 seconds

hbase(main):007:0>
[/code]

You can see that the row identifier for test_tab_1 is the pk_id value, while for test_tab_2 it is a concatenation of the values of all the columns. Why is that? The difference is in the constraints on the tables: since we don't have a primary key or unique index for test_tab_2, the adapter uses all columns as the key value.
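A practical consequence of this: when a table has a proper key, point lookups in HBase are trivial, because the row key is simply the PK value. For test_tab_1 a get by the primary key returns the whole row, while for test_tab_2 you would have to know the full concatenated key (or scan with a filter):

[code lang="text"]
get 'BDTEST:TEST_TAB_1', '1'
get 'BDTEST:TEST_TAB_2', '7|IJWQRO7T|2013-07-07:08:13:52'
[/code]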
We can try to add a constraint and see the result:

[code lang="sql"]
select * from dba_constraints where owner='GGTEST' and table_name='TEST_TAB_2';

no rows selected

alter table ggtest.test_tab_2 add constraint pk_test_tab_2 primary key (pk_id);

Table altered.

insert into ggtest.test_tab_2 values(9,'PK_TEST',sysdate,null);

1 row created.

commit;

Commit complete.

orcl>
[/code]

And let us compare with the result in HBase:

[code lang="text"]
hbase(main):012:0> scan 'BDTEST:TEST_TAB_2'
ROW                                          COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52              column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52              column=cf:PK_ID, timestamp=1459275116849, value=7
 7|IJWQRO7T|2013-07-07:08:13:52              column=cf:RND_STR_1, timestamp=1459275116849, value=IJWQRO7T
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER  column=cf:ACC_DATE, timestamp=1459278884047, value=2016-03-29:15:14:37
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER  column=cf:PK_ID, timestamp=1459278884047, value=8
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER  column=cf:RND_STR_1, timestamp=1459278884047, value=TEST_INS1
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER  column=cf:TEST_COL, timestamp=1459278884047, value=TEST_ALTER
 9                                           column=cf:ACC_DATE, timestamp=1462473865704, value=2016-05-05:14:44:19
 9                                           column=cf:PK_ID, timestamp=1462473865704, value=9
 9                                           column=cf:RND_STR_1, timestamp=1462473865704, value=PK_TEST
 9                                           column=cf:TEST_COL, timestamp=1462473865704, value=NULL
3 row(s) in 0.0550 seconds

hbase(main):013:0>
[/code]

The behaviour is fully dynamic: the adapter changed the row key column on the fly. Will it also work with a unique index? Yes, it will:

[code lang="sql"]
delete from ggtest.test_tab_2 where pk_id=9;

1 row deleted.

alter table ggtest.test_tab_2 drop constraint pk_test_tab_2;

Table altered.

create unique index ggtest.ux_test_tab_2 on ggtest.test_tab_2 (pk_id);

Index created.

insert into ggtest.test_tab_2 values(10,'UX_TEST',sysdate,null);

1 row created.

commit;
[/code]

Here is the newly inserted row:

[code lang="text"]
hbase(main):017:0> scan 'BDTEST:TEST_TAB_2'
ROW                              COLUMN+CELL
 10                              column=cf:ACC_DATE, timestamp=1462474389145, value=2016-05-05:14:53:03
 10                              column=cf:PK_ID, timestamp=1462474389145, value=10
 10                              column=cf:RND_STR_1, timestamp=1462474389145, value=UX_TEST
 10                              column=cf:TEST_COL, timestamp=1462474389145, value=NULL
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52  column=cf:PK_ID, timestamp=1459275116849, value=7
[/code]

Note, though, that a plain non-unique index on the source makes no difference: if we want the adapter to identify a key for a table, we need at least a unique index or constraint. Of course, this is just the default behaviour for a schema-level replication, and we may use KEYCOLS to define a key for a particular table, as sketched below.
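A minimal sketch of what that could look like in the replicat parameter file (the table-specific entry has to come before the wildcard so it is matched first; the column choice here is just an illustration, not something from my test):

[code lang="text"]
-- illustrative only: force PK_ID to be used as the key for test_tab_2
MAP ggtest.test_tab_2, TARGET bdtest.test_tab_2, KEYCOLS (pk_id);
MAP ggtest.*, TARGET bdtest.*;
[/code]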
Interestingly, if we change the table structure, the change affects all newly inserted rows but does not touch existing ones, even if we update some of their values. It works this way as long as the row has a unique identifier and that identifier is not changed by the DDL operation. Here is an example. We have a column TEST_COL in the table test_tab_2; let's drop the column and then update the row. Keep in mind that our key column is PK_ID, so we are not modifying the key:

[code lang="sql"]
alter table ggtest.test_tab_2 drop column TEST_COL;

Table altered.

update ggtest.test_tab_2 set rnd_str_1='TEST_COL' where pk_id=9;

1 row updated.

commit;
[/code]

In HBase we can see the same set of columns as before:

[code lang="text"]
hbase(main):030:0> scan 'BDTEST:TEST_TAB_2'
ROW    COLUMN+CELL
 9     column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9     column=cf:PK_ID, timestamp=1462477794597, value=9
 9     column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9     column=cf:TEST_COL, timestamp=1462477581440, value=NULL
1 row(s) in 0.0200 seconds
[/code]

We still have the dropped column TEST_COL, even though we've updated the row. But any newly inserted row gets the new set of columns:

[code lang="sql"]
insert into ggtest.test_tab_2 values(10,'TEST_COL',sysdate);

1 row created.

commit;

Commit complete.
[/code]

And in HBase:

[code lang="text"]
hbase(main):031:0> scan 'BDTEST:TEST_TAB_2'
ROW    COLUMN+CELL
 10    column=cf:ACC_DATE, timestamp=1462477860649, value=2016-05-05:15:50:55
 10    column=cf:PK_ID, timestamp=1462477860649, value=10
 10    column=cf:RND_STR_1, timestamp=1462477860649, value=TEST_COL
 9     column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9     column=cf:PK_ID, timestamp=1462477794597, value=9
 9     column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9     column=cf:TEST_COL, timestamp=1462477581440, value=NULL
2 row(s) in 0.0340 seconds
[/code]

And, as in all the other cases, a truncate on the source table is not replicated: the operation is simply ignored on the target, and you have to truncate the table in HBase yourself (for example, with truncate 'BDTEST:TEST_TAB_2' in the hbase shell) to keep the data in sync. If you then insert data again, the existing data in HBase will be "updated", but other rows will not be deleted, so it behaves more like a "merge" operation. Here is an example:

[code lang="sql"]
truncate table ggtest.test_tab_2;

Table truncated.

insert into ggtest.test_tab_2 values(10,'TEST_COL2',sysdate);

1 row created.

commit;

Commit complete.

select * from ggtest.test_tab_2;

           PK_ID RND_STR_1  ACC_DATE
---------------- ---------- -----------------
              10 TEST_COL2  05/05/16 16:01:20

orcl>
[/code]

And in HBase:

[code lang="text"]
hbase(main):033:0> scan 'BDTEST:TEST_TAB_2'
ROW    COLUMN+CELL
 10    column=cf:ACC_DATE, timestamp=1462478485067, value=2016-05-05:16:01:20
 10    column=cf:PK_ID, timestamp=1462478485067, value=10
 10    column=cf:RND_STR_1, timestamp=1462478485067, value=TEST_COL2
 9     column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9     column=cf:PK_ID, timestamp=1462477794597, value=9
 9     column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9     column=cf:TEST_COL, timestamp=1462477581440, value=NULL
2 row(s) in 0.0300 seconds

hbase(main):034:0>
[/code]

I spent some time testing performance and found that the main bottleneck was my Oracle source rather than GoldenGate or HBase. I was able to sustain a transaction rate of up to 60 DML operations per second, at which point my Oracle database started to struggle to keep pace because of commit waits; HBase and the replicat were absolutely fine.
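If you want to watch the throughput on the GoldenGate side while running such a test, the usual GGSCI commands apply; for example, these show per-table operation totals and the current lag for the replicat:

[code lang="text"]
stats replicat rhbase, totalsonly *.*
lag replicat rhbase
[/code]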
I also checked how it handles big transactions, inserting about 2 billion rows in one transaction, and it worked fine. Of course, this doesn't prove that your production configuration will be free of performance issues; to conduct real performance tests I would need a much bigger environment.

In addition, I noticed one more minor error in the Oracle documentation for the adapter, related to the "keyValuePairDelimiter" parameter. In the documentation it is replaced by "keyValueDelimiter": a small typo where "keyValueDelimiter" appears twice, correctly the first time, and the second time in the place where "keyValuePairDelimiter" is supposed to be. Here is the link.

As a summary, I can say that despite some minor issues, the adapters and GoldenGate for Big Data look quite mature and ready for real work. I think it is good, robust technology, and hopefully its development will continue to improve it in new releases. I am looking forward to using it in a real production environment with a significant workload. In the following posts I will try to test different DDL operations and maybe some other datatypes. Stay tuned.