go-chd - Minimal Perfect Hash Function using Compress Hash Displace

What is it?

A library to create, query, and serialize/de-serialize minimal perfect hash functions ("MPHF").

This is an implementation of CHD - inspired by this gist.

The library exposes the following types:

  • ChdBuilder: Represents the construction phase of the MPHF, as described in the CHD paper.
  • Chd: Represents a frozen MPHF over a given set of keys. You can only do lookups on this type.
  • DBWriter: Used to construct a constant database of key-value pairs - where the lookup of a given key is done in constant time using ChdBuilder. Essentially, this type serializes a collection of key-value pairs using ChdBuilder as the underlying index.
  • DBReader: Used for looking up key-values from a previously constructed (serialized) database.

NOTE Minimal Perfect Hash functions take a fixed input set and generate a mapping to look up those items in constant time. In particular, they are NOT a replacement for a traditional hash table; i.e., they may yield false positives when queried with keys not present during construction. In concrete terms:

Let S = {k0, k1, ... kn} be your input key set.

If H: S -> {0, .. n} is a minimal perfect hash function, then H(kx) for kx NOT in S may yield an integer result (indicating that kx was successfully "looked up").

Thus, if users of Chd are unsure of the input being passed to such a Lookup() function, they should add an additional comparison against the actual key to verify. Look at dbreader.go:Find() for an example.
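A toy illustration of the false-positive issue and the key-comparison guard (not the real CHD algorithm; the hash h below happens to be minimal and perfect only for this fixed key set):

```go
package main

import "fmt"

// h is a toy minimal perfect hash for the fixed key set {10, 21, 32}:
// h(21)=0, h(10)=1, h(32)=2. Real CHD constructs such a function for
// arbitrary key sets; the false-positive behavior is the same.
func h(k uint64) uint64 { return k % 3 }

func main() {
	keys := []uint64{21, 10, 32} // stored so that keys[h(k)] == k

	// A key that was never added still hashes to a valid-looking slot:
	unknown := uint64(13)
	slot := h(unknown) // 13 % 3 == 1 — looks like a successful lookup

	// The guard: compare against the key actually stored in that slot.
	if keys[slot] == unknown {
		fmt.Println("genuine hit")
	} else {
		fmt.Println("false positive detected")
	}
}
```

This is exactly the comparison dbreader.go:Find() performs before returning a value.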

DBWriter optimizes the database if there are no values present - i.e., keys-only. This optimization significantly reduces the file-size.

How do I use it?

Like any other Go library: go get github.com/opencoff/go-chd.

Example Program

There is a working example of the DBWriter and DBReader interfaces in the file example/mphdb.go. This example demonstrates the following functionality:

  • add one or more space delimited key/value files (first field is key, second field is value)
  • add one or more CSV files (first field is key, second field is value)
  • Write the resulting MPH DB to disk
  • Read the DB and verify its integrity
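The space-delimited parsing in the first step amounts to something like the following sketch (parseKV is a hypothetical helper for illustration, not the example's exact code):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseKV splits one space-delimited line into a key (first field) and a
// value (second field). Lines with fewer than two fields yield an empty value.
func parseKV(line string) (key, val string) {
	fields := strings.Fields(line)
	if len(fields) > 0 {
		key = fields[0]
	}
	if len(fields) > 1 {
		val = fields[1]
	}
	return key, val
}

func main() {
	input := "host0.example.com 192.168.0.1\nhost1.example.com 192.168.0.2\n"
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		k, v := parseKV(sc.Text())
		fmt.Printf("key=%s value=%s\n", k, v)
	}
}
```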

First, let's run some tests and make sure chd is working correctly:

  $ git clone https://github.com/opencoff/go-chd
  $ cd go-chd
  $ make test

Now, let's build and run the example program:

  $ make
  $ ./mphdb -h

There is a helper python script to generate a very large text file of hostnames and IP addresses: genhosts.py. You can run it like so:

  $ python ./example/genhosts.py 192.168.0.0/16 > a.txt

The above example generates 65535 hostnames and corresponding IP addresses; each of the IP addresses is sequentially drawn from the given subnet.

NOTE If you use a "/8" subnet mask you will generate a lot of data (~430MB in size).

Once you have the input generated, you can feed it to the example program above to generate a MPH DB:

  $ ./mphdb foo.db a.txt
  $ ./mphdb -V foo.db

It is possible that "mphdb" fails to construct a DB after trying 1,000,000 times. In that case, try lowering the "load" factor (default is 0.85).

  $ ./mphdb -l 0.75 foo.db a.txt

Basic Usage of ChdBuilder

Assuming you have read your keys and hashed them into uint64 values, this is how you can use the library:

        builder, err := chd.New(0.9)
        if err != nil { panic(err) }

        for i := range keys {
            builder.Add(keys[i])
        }

        lookup, err := builder.Freeze()
        if err != nil { panic(err) }

        // Now, call Find() with each key to get its unique mapping.
        // Note: Find() returns values in the closed interval [1, len(keys)]
        for i, k := range keys {
                j := lookup.Find(k)
                fmt.Printf("%d: %#x maps to %d\n", i, k, j)
        }
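Chd expects uint64 keys; one possible way to derive them from string keys is the standard library's 64-bit FNV-1a hash (an assumption for illustration — any well-distributed 64-bit hash over your key set will do):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey maps an arbitrary string to a uint64 using FNV-1a from the
// standard library's hash/fnv package.
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func main() {
	for _, s := range []string{"host0.example.com", "host1.example.com"} {
		fmt.Printf("%s -> %#x\n", s, hashKey(s))
	}
}
```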

Writing a DB once, looking up many times

One can construct an on-disk constant-time lookup using ChdBuilder as the underlying indexing mechanism. Such a DB is useful in situations where the key/value pairs are NOT changed frequently; i.e., read-dominant workloads. The typical pattern in such situations is to build the constant-DB once for efficient retrieval and do lookups multiple times.

The example program in example/ has helper routines to add from a text or CSV delimited file: see example/text.go.

Implementation Notes

  • chd.go: The main implementation of the CHD algorithm. It has two types: one to construct and freeze a MPHF (ChdBuilder) and another to do constant time lookups from a frozen CHD MPHF (Chd).

  • dbwriter.go: Create a read-only, constant-time MPH lookup DB. It can store arbitrary byte stream "values" - each of which is identified by a unique uint64 key. The DB structure is optimized for reading on the most common architectures - little-endian: amd64, arm64 etc.

  • dbreader.go: Provides a constant-time lookup of a previously constructed CHD MPH DB. DB reads use mmap(2) to reduce I/O bottlenecks. For little-endian architectures, there is no data "parsing" of the lookup tables, offset tables etc. They are interpreted in-situ from the mmap'd data. To keep the code generic, every multi-byte int is converted to little-endian order before use. These conversion routines are in endian_XX.go.

  • mmap.go: Utility functions to map byte-slices to uintXX slices and vice versa.

  • marshal.go: Marshal/unmarshal the CHD MPHF.

License

GPL v2.0