The payoff

Your query

Suppose you have to write a (greatly simplified) query with the following requirements:

  • It runs every day
  • It gathers data over the past 5 days from a single table
  • It must accomodate dates specified the America/Los_Angeles, though the data is stored in UTC in Hive
  • It must additionally fetch data from the partitions for the 1 hour just before and after the time range the query covers
  • The ts column is a string data type representing unix time since the epoch in milliseconds

Middle of the month

Here’s what your query might look like for the middle of May 2016:

SET BEGIN='2016-05-15'
SET END='2016-05-20'
SELECT * FROM my_table WHERE
ts > CAST(unix_timestamp(${hiveconf:BEGIN}, 'yyyy-MM-dd') as bigint) * 1000 AND
ts < CAST(unix_timestamp(${hiveconf:END}, 'yyyy-MM-dd') as bigint) * 1000 AND
(YYYY=2016 AND MM=05 AND ((DD=14 AND HH>=16) OR (DD>14 AND DD<19) OR (DD=19 AND HH<18)))

At the end of the month, when you cross the May/June border, you’d have this:

SET BEGIN='2016-05-30'
SET END='2016-06-04'
SELECT * FROM my_table WHERE
ts > CAST(unix_timestamp(${hiveconf:BEGIN}, 'yyyy-MM-dd') as bigint) * 1000 AND
ts < CAST(unix_timestamp(${hiveconf:END}, 'yyyy-MM-dd') as bigint) * 1000 AND
(YYYY=2016 AND ((MM=05 AND DD=29 AND HH>=16) OR (MM=05 AND DD>29) OR (MM=06 AND DD<03) OR (MM=06 AND DD=03 AND HH<18)))

Or, you can create a hpdr template. It’s just a query with HPDR_ variables in it:

# q.hql
SELECT * FROM my_table WHERE
ts > ${HPDR_slop_begin_unixtime_ms} AND
ts < ${HPDR_slop_end_unixtime_ms} AND
${HPDR_range}

Then use it to create your query by using a hpdr date range. For example, with the main.py example:

> main.py -b 20160515 -e 20160520 --dzone America/Los_Angeles -s 1hours -f q.hql
-- tmp.hql
SELECT * FROM my_table WHERE
ts > 1463266800000 AND
ts < 1463706000000 AND
(YYYY=2016 AND MM=05 AND ((DD=14 AND HH>=16) OR (DD>14 AND DD<19) OR (DD=19 AND HH<18)))

And:

> main.py -b 20160530 -e 20160604 --dzone America/Los_Angeles -s 1hours -f q.hql
-- tmp.hql
SELECT * FROM my_table WHERE
ts > 1464562800000 AND
ts < 1465002000000 AND
(YYYY=2016 AND ((MM=05 AND DD=29 AND HH>=16) OR (MM=05 AND DD>29) OR (MM=06 AND DD<03) OR (MM=06 AND DD=03 AND HH<18)))

You can get a list of all the HPDR_ variables with the -v flag:

> main.py -b 20160530 -e 20160604 --dzone America/Los_Angeles -s 1hours -f /tmp/q.hql -v
 -- tmp.hql
 SELECT * FROM my_table WHERE
 ts > 1464562800000 AND
 ts < 1465002000000 AND
 (YYYY=2016 AND ((MM=05 AND DD=29 AND HH>=16) OR (MM=05 AND DD>29) OR (MM=06 AND DD<03) OR (MM=06 AND DD=03 AND HH<18)))
 -----------------------------------------------------------------------
 -- Parts of this query were auto-generated with hpdr (pip install hpdr)
 --
 --  /home/mlibucha/Envs/3hpdr/bin/python ../main.py -b 20160530 -e 20160604 --dzone America/Los_Angeles -s 1hours -f /tmp/q.hql -v
 --
 --
 -- Input:
 ---------
 -- -- tmp.hql
 -- SELECT * FROM my_table WHERE
 -- ts > ${HPDR_slop_begin_unixtime_ms} AND
 -- ts < ${HPDR_slop_end_unixtime_ms} AND
 -- ${HPDR_range}
 ----------
 -- Output:
 ----------
 -- -- tmp.hql
 -- SELECT * FROM my_table WHERE
 -- ts > 1464562800000 AND
 -- ts < 1465002000000 AND
 -- (YYYY=2016 AND ((MM=05 AND DD=29 AND HH>=16) OR (MM=05 AND DD>29) OR (MM=06 AND DD<03) OR (MM=06 AND DD=03 AND HH<18)))
 ----------
 --
 -- This is a complete list of the available template variables and their values:
 --
 -- variable                     value
 -- ---------------------------  -------------------
 -- HPDR_dzone                   UTC
 -- HPDR_qzone                   America/Los_Angeles
 -- HPDR_begin_ts                2016-05-29 17:00:00
 -- HPDR_end_ts                  2016-06-03 17:00:00
 -- HPDR_slop_begin_ts           2016-05-29 16:00:00
 -- HPDR_slop_end_ts             2016-06-03 18:00:00
 -- HPDR_begin_unixtime          1464566400
 -- HPDR_begin_unixtime_ms       1464566400000
 -- HPDR_begin_yyyymmdd          20160529
 -- HPDR_begin_yyyy              2016
 -- HPDR_begin_mm                05
 -- HPDR_begin_dd                29
 -- HPDR_begin_hh                17
 -- HPDR_begin_min               00
 -- HPDR_begin_sec               00
 -- HPDR_end_unixtime            1464998400
 -- HPDR_end_unixtime_ms         1464998400000
 -- HPDR_end_yyyymmdd            20160603
 -- HPDR_end_yyyy                2016
 -- HPDR_end_mm                  06
 -- HPDR_end_dd                  03
 -- HPDR_end_hh                  17
 -- HPDR_end_min                 00
 -- HPDR_end_sec                 00
 -- HPDR_slop_begin_unixtime     1464562800
 -- HPDR_slop_begin_unixtime_ms  1464562800000
 -- HPDR_slop_begin_yyyymmdd     20160529
 -- HPDR_slop_begin_yyyy         2016
 -- HPDR_slop_begin_mm           05
 -- HPDR_slop_begin_dd           29
 -- HPDR_slop_begin_hh           16
 -- HPDR_slop_begin_min          00
 -- HPDR_slop_begin_sec          00
 -- HPDR_slop_end_unixtime       1465002000
 -- HPDR_slop_end_unixtime_ms    1465002000000
 -- HPDR_slop_end_yyyymmdd       20160603
 -- HPDR_slop_end_yyyy           2016
 -- HPDR_slop_end_mm             06
 -- HPDR_slop_end_dd             03
 -- HPDR_slop_end_hh             18
 -- HPDR_slop_end_min            00
 -- HPDR_slop_end_sec            00
 --
 -- Note that all values have been shifted to the query time zone (HPDR_qzone)