Pythian Blog: Technical Track

Calculating business days in HiveQL

One of the common tasks in data processing is to calculate the number of days between two given dates. You can easily achieve this by using Hive DATEDIFF function. You can also get weekday number by using this more obscure function:

SELECT FROM_UNIXTIME(your_date,'u') FROM some_table;

This will return 1 for Monday, 7 for Sunday and is based on Java SimpleDateFormat —
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

It becomes more challenging if you need to calculate the number of business days (excluding Saturdays and Sundays) between two dates. There is no built-in function in Hive to perform such calculation, but fortunately it is rather simple to write your own Hive UDFs. I couldn’t quickly find any existing open source functions to solve this problem, so I wrote my own using Scala — https://github.com/pythian/hive-udfs

There are actually three functions in hive-udfs package: CountBusinessDays, CountSaturdays and CountSundays. These functions accept start date and end date as UNIX_TIMESTAMP and return the count of different types of full days in this interval, excluding start and end points.

Here is how you can use this UDF in Hive:

ADD JAR hdfs:///user/hive/udfs/pythian-hive-udfs-assembly-0.1.jar;
CREATE TEMPORARY FUNCTION count_business_days AS 'com.pythian.udf.CountBusinessDays';
CREATE TEMPORARY FUNCTION count_saturdays AS 'com.pythian.udf.CountSaturdays';
CREATE TEMPORARY FUNCTION count_sundays AS 'com.pythian.udf.CountSundays';

SELECT count_business_days(UNIX_TIMESTAMP(start_date), UNIX_TIMESTAMP(end_date)) FROM some_table;

The code is open source and free to use. Comments and suggestions are always welcome.

No Comments Yet

Let us know what you think

Subscribe by email