Insert HyperLogLog data

Hi -

Is it possible for my application to build the binary HyperLogLog data used by the APPROX_COUNT_DISTINCT? (the “state” referred to here: SingleStoreDB Cloud · SingleStore Documentation)

I would ideally like to not have to store individual records in my database and instead use my ETL (which does 1-minute / 1-hour rollups in memory) to calculate the HLL state and store that in MemSQL.

Thanks for any guidance!

Yes, this is possible. As long as you configure your HLL library to have the same precision as we do internally it will be compatible. Ours uses 2^14 bytes.

This python snippet gives an example of doing of generating the hll binary externally and merging them to get the final result in MemSQL as well as the opposite scenario:

from memsql.common import database
import hyperloglog
import random

vals = []
for i in range(100):
vals.append(random.randint(0,50))

c = database.connect(host='127.1', user='root')
c.query('drop database if exists db')
c.query('create database db')
c.query('use db')
c.query('create table vals(id int auto_increment primary key, a int)')
for v in vals:
c.query('insert into vals value(null, %d)' % v)

print c.query("select approx_count_distinct(a) d from vals")

hll1 = hyperloglog.HyperLogLog(.01)
for v in vals[50:]:
hll1.add(v)
hll2 = hyperloglog.HyperLogLog(.01)
for v in vals[:50]:
hll2.add(v)

c.query('create table synopsis(a varbinary(16384))')
c.query('insert into synopsis values(\'%s\')' % bytearray(hll1.M))
c.query('insert into synopsis values(\'%s\')' % bytearray(hll2.M))
print c.query('select approx_count_distinct_estimate(a) d from synopsis')

hll_final = hyperloglog.HyperLogLog(.01)
hll_temp = hyperloglog.HyperLogLog(.01)

hll_temp.M = list(bytearray(c.query('select approx_count_distinct_accumulate(a) a from vals where id < 50')[0]['a']))
hll_final.update(hll_temp)
hll_temp.M = list(bytearray(c.query('select approx_count_distinct_accumulate(a) a from vals where id >= 50')[0]['a']))
hll_final.update(hll_temp)

print hll_final.card()

That’s perfect – Thanks rob!

Hey rob!

Could you clarify what log2m / regwidth you’re using internally? From the example, it looks like you’re using a log2m of 14 and regwidth of 8, but I wanted to verify that was the case before I moved forward with my implementation.

Additionally I’d be curious what hashing algorithm / variant is used. Thanks!

From my testing, it appears as though the log2m is 14 and the regwidth is 6 bits, but each register is stored as a full byte.

It’s not obvious what hashing algorithm is used and whether the register index is read in little or big endian format from the hashed value.