You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The physical layout and indexing of the dataset dominantly impacts read performances. Often dataset are designed in such a way to support a rather specific use case where many of the partitioning parameters must be set and even minor deviations or omittances would cause severe changes in performance. We offer increasingly many levers to control the dataset layout but do not offer a concise way to store, share, verify or reproduce this easily. Many of the performance critical parameters are not easily reconstructable
Things I have in mind which should be part of this specification are
Partition keys
Secondary indices
Bucket_by
Number of buckets
Columns we sorted the columns by
What hash function was used to calculate the buckets
Parquet chunk sizes used for write (assuming constant over the dataset)
Parquet compression algorithm
Benefits
Groundwork for more concise sanity checks, e.g. when updating a dataset
More efficient communication to consumers. So far we mostly communicate dataset schemas and rely on implicit knowledge about expected performance. With these information we can offer more informed decisions
Might offer a more streamlined interface (partition spec via config file?)
Open questions
Do we persist this information with the dataset or merely offer this as an interface?
How would we handle inhomogeneous attributes (e.g. parquet attributes)
I'm curious to know if other people consider this useful or not
The text was updated successfully, but these errors were encountered:
Problem description
The physical layout and indexing of the dataset dominantly impacts read performances. Often dataset are designed in such a way to support a rather specific use case where many of the partitioning parameters must be set and even minor deviations or omittances would cause severe changes in performance. We offer increasingly many levers to control the dataset layout but do not offer a concise way to store, share, verify or reproduce this easily. Many of the performance critical parameters are not easily reconstructable
Things I have in mind which should be part of this specification are
Benefits
Open questions
I'm curious to know if other people consider this useful or not
The text was updated successfully, but these errors were encountered: