go-chd - Minimal Perfect Hash Function using Compress Hash Displace

What is it?

A library to create, query, and serialize/de-serialize minimal perfect hash functions ("MPHF").

This is an implementation of CHD - inspired by this gist.

The library exposes the following types:

  • ChdBuilder: Represents the construction phase of the MPHF, as described in the CHD paper.
  • Chd: Represents a frozen MPHF over a given set of keys. You can only do lookups on this type.
  • DBWriter: Used to construct a constant database of key-value pairs - where the lookup of a given key is done in constant time using ChdBuilder. Essentially, this type serializes a collection of key-value pairs using ChdBuilder as the underlying index.
  • DBReader: Used for looking up key-values from a previously constructed (serialized) database.

NOTE Minimal Perfect Hash functions take a fixed input set and generate a mapping to look up those items in constant time. In particular, they are NOT a replacement for a traditional hash table; i.e., they may yield false positives when queried with keys not present during construction. In concrete terms:

Let S = {k0, k1, ... kn} be your input key set.

If H: S -> {0, .. n} is a minimal perfect hash function, then H(kx) for kx NOT in S may yield an integer result (indicating that kx was successfully "looked up").

Thus, if users of Chd are unsure of the input being passed to such a Lookup() function, they should add an additional comparison against the actual key to verify. Look at dbreader.go:Find() for an example.
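A toy illustration of the false-positive issue and the key-comparison guard (not the real CHD algorithm; the hash h below happens to be minimal and perfect only for this fixed key set):

```go
package main

import "fmt"

// h is a toy minimal perfect hash for the fixed key set {10, 21, 32}:
// h(21)=0, h(10)=1, h(32)=2. Real CHD constructs such a function for
// arbitrary key sets; the false-positive behavior is the same.
func h(k uint64) uint64 { return k % 3 }

func main() {
	keys := []uint64{21, 10, 32} // stored so that keys[h(k)] == k

	// A key that was never added still hashes to a valid-looking slot:
	unknown := uint64(13)
	slot := h(unknown) // 13 % 3 == 1 — looks like a successful lookup

	// The guard: compare against the key actually stored in that slot.
	if keys[slot] == unknown {
		fmt.Println("genuine hit")
	} else {
		fmt.Println("false positive detected")
	}
}
```

This is exactly the comparison dbreader.go:Find() performs before returning a value.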

DBWriter optimizes the database if there are no values present - i.e., keys-only. This optimization significantly reduces the file-size.

How do I use it?

Like any other Go library: go get github.com/opencoff/go-chd.

Example Program

There is a working example of the DBWriter and DBReader interfaces in the file example/mphdb.go. This example demonstrates the following functionality:

  • add one or more space delimited key/value files (first field is key, second field is value)
  • add one or more CSV files (first field is key, second field is value)
  • Write the resulting MPH DB to disk
  • Read the DB and verify its integrity
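The space-delimited parsing in the first step amounts to something like the following sketch (parseKV is a hypothetical helper for illustration, not the example's exact code):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseKV splits one space-delimited line into a key (first field) and a
// value (second field). Lines with fewer than two fields yield an empty value.
func parseKV(line string) (key, val string) {
	fields := strings.Fields(line)
	if len(fields) > 0 {
		key = fields[0]
	}
	if len(fields) > 1 {
		val = fields[1]
	}
	return key, val
}

func main() {
	input := "host0.example.com 192.168.0.1\nhost1.example.com 192.168.0.2\n"
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		k, v := parseKV(sc.Text())
		fmt.Printf("key=%s value=%s\n", k, v)
	}
}
```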

First, let's run some tests and make sure chd is working correctly:

  $ git clone https://github.com/opencoff/go-chd
  $ cd go-chd
  $ make test

Now, let's build and run the example program:

  $ make
  $ ./mphdb -h

There is a helper python script to generate a very large text file of hostnames and IP addresses: genhosts.py. You can run it like so:

  $ python ./example/genhosts.py 192.168.0.0/16 > a.txt

The above example generates 65535 hostnames and corresponding IP addresses; each of the IP addresses is sequentially drawn from the given subnet.

NOTE If you use a "/8" subnet mask you will generate a lot of data (~430MB in size).

Once you have the input generated, you can feed it to the example program above to generate a MPH DB:

  $ ./mphdb foo.db a.txt
  $ ./mphdb -V foo.db

It is possible that "mphdb" fails to construct a DB after trying 1,000,000 times. In that case, try lowering the "load" factor (default is 0.85).

  $ ./mphdb -l 0.75 foo.db a.txt

Basic Usage of ChdBuilder

Assuming you have read your keys and hashed them into uint64 values, this is how you can use the library:

        builder, err := chd.New(0.9)
        if err != nil { panic(err) }

        for i := range keys {
            builder.Add(keys[i])
        }

        lookup, err := builder.Freeze()
        if err != nil { panic(err) }

        // Now, call Find() with each key to get its unique mapping.
        // Note: Find() returns values in the closed interval [1, len(keys)]
        for i, k := range keys {
                j := lookup.Find(k)
                fmt.Printf("%d: %#x maps to %d\n", i, k, j)
        }
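Chd expects uint64 keys; one possible way to derive them from string keys is the standard library's 64-bit FNV-1a hash (an assumption for illustration — any well-distributed 64-bit hash over your key set will do):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey maps an arbitrary string to a uint64 using FNV-1a from the
// standard library's hash/fnv package.
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func main() {
	for _, s := range []string{"host0.example.com", "host1.example.com"} {
		fmt.Printf("%s -> %#x\n", s, hashKey(s))
	}
}
```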

Writing a DB once, looking up many times

One can construct an on-disk constant-time lookup using ChdBuilder as the underlying indexing mechanism. Such a DB is useful in situations where the key/value pairs are NOT changed frequently; i.e., read-dominant workloads. The typical pattern in such situations is to build the constant-DB once for efficient retrieval and do lookups multiple times.

The example program in example/ has helper routines to add from a text or CSV delimited file: see example/text.go.

Implementation Notes

  • chd.go: The main implementation of the CHD algorithm. It has two types: one to construct and freeze a MPHF (ChdBuilder) and another to do constant time lookups from a frozen CHD MPHF (Chd).

  • dbwriter.go: Create a read-only, constant-time MPH lookup DB. It can store arbitrary byte stream "values" - each of which is identified by a unique uint64 key. The DB structure is optimized for reading on the most common architectures - little-endian: amd64, arm64 etc.

  • dbreader.go: Provides a constant-time lookup of a previously constructed CHD MPH DB. DB reads use mmap(2) to reduce I/O bottlenecks. For little-endian architectures, there is no data "parsing" of the lookup tables, offset tables etc. They are interpreted in-situ from the mmap'd data. To keep the code generic, every multi-byte int is converted to little-endian order before use. These conversion routines are in endian_XX.go.

  • mmap.go: Utility functions to map byte-slices to uintXX slices and vice versa.

  • marshal.go: Marshal/unmarshal the CHD MPHF.

License

GPL v2.0