Serialisable partitioning spec #291

fjetter · 2020-06-02T17:30:07Z

Problem description

The physical layout and indexing of a dataset have a dominant impact on read performance. Datasets are often designed to support a rather specific use case, where many of the partitioning parameters must be set and even minor deviations or omissions cause severe changes in performance. We offer an increasing number of levers to control the dataset layout but no concise way to store, share, verify or reproduce it. Many of the performance-critical parameters are not easily reconstructable.

Things I have in mind that should be part of this specification:

Partition keys
Secondary indices
Bucket_by
Number of buckets
Columns the data is sorted by
Hash function used to calculate the buckets
Parquet chunk sizes used for write (assuming constant over the dataset)
Parquet compression algorithm
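
To make the list above more concrete, here is a minimal sketch of what such a spec could look like as a plain, JSON-serialisable object. The class name `PartitioningSpec`, all field names and the example values are assumptions made up for illustration; none of this is an existing API of the library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PartitioningSpec:
    """Hypothetical container for the layout parameters listed above."""

    partition_on: List[str] = field(default_factory=list)
    secondary_indices: List[str] = field(default_factory=list)
    bucket_by: List[str] = field(default_factory=list)
    num_buckets: Optional[int] = None
    sort_by: List[str] = field(default_factory=list)
    hash_function: Optional[str] = None  # name of the hash used for bucketing
    parquet_chunk_size: Optional[int] = None  # assumed constant over the dataset
    parquet_compression: Optional[str] = None

    def to_json(self) -> str:
        # Sorted keys give a stable representation that is easy to diff or share.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "PartitioningSpec":
        return cls(**json.loads(payload))


# Example values are invented purely for illustration.
spec = PartitioningSpec(
    partition_on=["country"],
    secondary_indices=["customer_id"],
    bucket_by=["customer_id"],
    num_buckets=16,
    sort_by=["timestamp"],
    hash_function="example_hash",
    parquet_chunk_size=100_000,
    parquet_compression="SNAPPY",
)
restored = PartitioningSpec.from_json(spec.to_json())
```

A flat structure like this would round-trip through JSON without custom encoders, which is what makes it easy to store next to a dataset or ship in a config file. Whether a flat layout is enough depends on the open question about inhomogeneous attributes.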

Benefits

Groundwork for more concise sanity checks, e.g. when updating a dataset
More efficient communication with consumers. So far we mostly communicate dataset schemas and rely on implicit knowledge about expected performance. With this information, consumers can make more informed decisions
Might offer a more streamlined interface (partition spec via config file?)
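
As a rough illustration of the sanity-check idea, an update could be compared against the spec the dataset was originally written with. This is only a sketch: the helper `diff_specs` and the dictionary keys are hypothetical, and it assumes the spec is available as a plain dictionary.

```python
from typing import Any, Dict, List


def diff_specs(stored: Dict[str, Any], requested: Dict[str, Any]) -> List[str]:
    """Return human-readable mismatches between two spec dictionaries."""
    problems = []
    for key in sorted(set(stored) | set(requested)):
        if stored.get(key) != requested.get(key):
            problems.append(
                f"{key}: stored={stored.get(key)!r}, requested={requested.get(key)!r}"
            )
    return problems


# Hypothetical specs: the update uses a different number of buckets.
stored_spec = {"partition_on": ["country"], "num_buckets": 16, "sort_by": ["timestamp"]}
update_spec = {"partition_on": ["country"], "num_buckets": 8, "sort_by": ["timestamp"]}

mismatches = diff_specs(stored_spec, update_spec)
if mismatches:
    print("Update deviates from the stored layout:", mismatches)
```

An update path could refuse to write, or at least warn, whenever the returned list is non-empty.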

Open questions

Do we persist this information with the dataset or merely offer this as an interface?
How would we handle inhomogeneous attributes (e.g. parquet attributes that differ between files)?

I'm curious to know whether other people consider this useful.

fjetter mentioned this issue Jul 22, 2020

New format specification for single table datasets #318

Open
