Add parquet support #20

kevmo314 · 2024-01-02T14:19:53Z

This is going to be substantially more challenging than the CSV format, but it might be rewarding. We want to add Apache Parquet support because Parquet is used in a very large number of real-world data science applications.

One challenge right now is that we require a byte pointer to a specific record, whereas Parquet is columnar, meaning a record's fields are split across different locations in the file.
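To make the mismatch concrete, here is a toy sketch in Go. It is not Parquet's actual encoding (no row groups, pages, or compression), and `Span`, `RowLayout`, and `ColumnLayout` are hypothetical names used only for illustration. It assumes every record has the same number of fields.

```go
package main

// Span is a byte range inside a serialized buffer.
type Span struct {
	Offset int
	Length int
}

// RowLayout writes records one after another (CSV-like).
// Each record occupies exactly one contiguous span.
func RowLayout(records [][]string) ([]byte, []Span) {
	var buf []byte
	spans := make([]Span, 0, len(records))
	for _, rec := range records {
		start := len(buf)
		for _, field := range rec {
			buf = append(buf, field...)
		}
		spans = append(spans, Span{Offset: start, Length: len(buf) - start})
	}
	return buf, spans
}

// ColumnLayout writes all values of column 0, then all values of column 1,
// and so on. Each record now needs one span per column.
func ColumnLayout(records [][]string) ([]byte, [][]Span) {
	var buf []byte
	spans := make([][]Span, len(records))
	if len(records) == 0 {
		return buf, spans
	}
	for col := range records[0] {
		for row, rec := range records {
			spans[row] = append(spans[row], Span{Offset: len(buf), Length: len(rec[col])})
			buf = append(buf, rec[col]...)
		}
	}
	return buf, spans
}
```

With records `{"1","ann"}` and `{"2","bob"}`, the row layout gives the second record a single span, while the column layout scatters it over two non-adjacent spans. That is why a single byte pointer per record stops being enough.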

https://github.com/apache/parquet-format

We will likely need to write our own Parquet file parser to figure out the correct byte offsets, and then be quite particular in the JS library about how exactly that record is fetched and parsed. This might require returning additional metadata beyond just the byte offset, which we can do via an intermediate pointer in the index.

Anyway, let's talk about this one before working on it. It'll be super educational about how Parquet works, but I don't want us to get lost in the complexity.

kevmo314 assigned friendlymatthew Jan 2, 2024

friendlymatthew added the "feature" (New feature or request) label Jan 26, 2024
