I would ideally like to not have to store individual records in my database and instead use my ETL (which does 1-minute / 1-hour rollups in memory) to calculate the HLL state and store that in MemSQL.
Yes, this is possible. As long as you configure your HLL library to have the same precision as we do internally it will be compatible. Ours uses 2^14 bytes.
This python snippet gives an example of doing of generating the hll binary externally and merging them to get the final result in MemSQL as well as the opposite scenario:
from memsql.common import database
import hyperloglog
import random
vals = []
for i in range(100):
vals.append(random.randint(0,50))
c = database.connect(host='127.1', user='root')
c.query('drop database if exists db')
c.query('create database db')
c.query('use db')
c.query('create table vals(id int auto_increment primary key, a int)')
for v in vals:
c.query('insert into vals value(null, %d)' % v)
print c.query("select approx_count_distinct(a) d from vals")
hll1 = hyperloglog.HyperLogLog(.01)
for v in vals[50:]:
hll1.add(v)
hll2 = hyperloglog.HyperLogLog(.01)
for v in vals[:50]:
hll2.add(v)
c.query('create table synopsis(a varbinary(16384))')
c.query('insert into synopsis values(\'%s\')' % bytearray(hll1.M))
c.query('insert into synopsis values(\'%s\')' % bytearray(hll2.M))
print c.query('select approx_count_distinct_estimate(a) d from synopsis')
hll_final = hyperloglog.HyperLogLog(.01)
hll_temp = hyperloglog.HyperLogLog(.01)
hll_temp.M = list(bytearray(c.query('select approx_count_distinct_accumulate(a) a from vals where id < 50')[0]['a']))
hll_final.update(hll_temp)
hll_temp.M = list(bytearray(c.query('select approx_count_distinct_accumulate(a) a from vals where id >= 50')[0]['a']))
hll_final.update(hll_temp)
print hll_final.card()
Could you clarify what log2m / regwidth you’re using internally? From the example, it looks like you’re using a log2m of 14 and regwidth of 8, but I wanted to verify that was the case before I moved forward with my implementation.