Insert HyperLogLog data


#1

Hi -

Is it possible for my application to build the binary HyperLogLog data used by the APPROX_COUNT_DISTINCT? (the “state” referred to here: https://docs.memsql.com/sql-reference/v6.7/distinct-count-estimation-functions/)

I would ideally like to not have to store individual records in my database and instead use my ETL (which does 1-minute / 1-hour rollups in memory) to calculate the HLL state and store that in MemSQL.

Thanks for any guidance!


#2

Yes, this is possible. As long as you configure your HLL library to have the same precision as we do internally it will be compatible. Ours uses 2^14 bytes.

This python snippet gives an example of doing of generating the hll binary externally and merging them to get the final result in MemSQL as well as the opposite scenario:

from memsql.common import database
import hyperloglog
import random

vals = []
for i in range(100):
vals.append(random.randint(0,50))

c = database.connect(host='127.1', user='root')
c.query('drop database if exists db')
c.query('create database db')
c.query('use db')
c.query('create table vals(id int auto_increment primary key, a int)')
for v in vals:
c.query('insert into vals value(null, %d)' % v)

print c.query("select approx_count_distinct(a) d from vals")

hll1 = hyperloglog.HyperLogLog(.01)
for v in vals[50:]:
hll1.add(v)
hll2 = hyperloglog.HyperLogLog(.01)
for v in vals[:50]:
hll2.add(v)

c.query('create table synopsis(a varbinary(16384))')
c.query('insert into synopsis values(\'%s\')' % bytearray(hll1.M))
c.query('insert into synopsis values(\'%s\')' % bytearray(hll2.M))
print c.query('select approx_count_distinct_estimate(a) d from synopsis')

hll_final = hyperloglog.HyperLogLog(.01)
hll_temp = hyperloglog.HyperLogLog(.01)

hll_temp.M = list(bytearray(c.query('select approx_count_distinct_accumulate(a) a from vals where id < 50')[0]['a']))
hll_final.update(hll_temp)
hll_temp.M = list(bytearray(c.query('select approx_count_distinct_accumulate(a) a from vals where id >= 50')[0]['a']))
hll_final.update(hll_temp)

print hll_final.card()

#3

That’s perfect – Thanks rob!


#4

Hey rob!

Could you clarify what log2m / regwidth you’re using internally? From the example, it looks like you’re using a log2m of 14 and regwidth of 8, but I wanted to verify that was the case before I moved forward with my implementation.


#6

Additionally I’d be curious what hashing algorithm / variant is used. Thanks!


#7

From my testing, it appears as though the log2m is 14 and the regwidth is 6 bits, but each register is stored as a full byte.

It’s not obvious what hashing algorithm is used and whether the register index is read in little or big endian format from the hashed value.