Add parquet support #20

Open
kevmo314 opened this issue Jan 2, 2024 · 0 comments

kevmo314 commented Jan 2, 2024

This will be substantially more challenging than the CSV format, but it could be rewarding. We want to add Apache Parquet support because Parquet is widely used in real-world data science applications.

One challenge right now is that we require a byte pointer to a specific record, whereas Parquet is columnar, meaning a record's fields are split across different locations in the file.
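To make the mismatch concrete, here is a toy sketch in Python. It is not real Parquet: actual column chunks are split into encoded, usually compressed pages, so individual values do not have their own byte ranges like this. The sample rows and the helper names `read_row_record` and `read_col_record` are made up for illustration. It only shows why one `(offset, length)` pair is enough for a row-oriented file but not for a column-oriented one.

```python
# Toy data: three records with two fields each (hypothetical sample values).
rows = [("alice", 30), ("bob", 25), ("carol", 41)]
columns = ("name", "age")

# Row-oriented layout (CSV-like): each record is one contiguous byte range.
row_buf = bytearray()
row_offsets = []  # one (offset, length) pair per record
for name, age in rows:
    start = len(row_buf)
    row_buf += f"{name},{age}\n".encode()
    row_offsets.append((start, len(row_buf) - start))

# Column-oriented layout (toy stand-in for column chunks): all names are
# written first, then all ages, so one record's fields are far apart.
col_buf = bytearray()
col_offsets = {c: [] for c in columns}  # per column, one pair per record
for idx, col in enumerate(columns):
    for row in rows:
        start = len(col_buf)
        col_buf += f"{row[idx]}\n".encode()
        col_offsets[col].append((start, len(col_buf) - start))


def read_row_record(i):
    """Fetch record i from the row layout with a single byte range."""
    start, length = row_offsets[i]
    return row_buf[start:start + length].decode().rstrip("\n").split(",")


def read_col_record(i):
    """Fetch record i from the column layout: one byte range per column."""
    fields = []
    for col in columns:
        start, length = col_offsets[col][i]
        fields.append(col_buf[start:start + length].decode().rstrip("\n"))
    return fields
```

Both readers return the same record, but the columnar one needs as many byte ranges as there are columns, and those ranges are not adjacent.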

https://github.com/apache/parquet-format

We will likely need to write our own Parquet file parser to figure out the correct byte offsets, and then be pretty particular in the js library about exactly how that record is fetched and parsed. This might require returning additional metadata beyond just the byte offset, which we can do via an intermediate pointer in the index.
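As a starting point for discussion, here is a rough Python sketch of the first step such a parser would take, plus one possible shape for the intermediate pointer. Per the parquet-format spec, a plain Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the Thrift-encoded file metadata. The function `footer_range` and the types `ColumnSlice` and `RecordPointer` are hypothetical names, not part of any existing library or of this project, and the sketch stops short of decoding the Thrift metadata itself.

```python
from __future__ import annotations

import struct
from dataclasses import dataclass

MAGIC = b"PAR1"  # magic bytes at both ends of a plain Parquet file


def footer_range(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the Thrift-encoded file metadata.

    Only locates the footer; decoding it (row groups, column chunk
    offsets) would be the job of the actual parser.
    """
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a plain Parquet file")
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length


@dataclass(frozen=True)
class ColumnSlice:
    """Hypothetical: the byte range to fetch for one column of a record."""
    column: str
    offset: int
    length: int


@dataclass(frozen=True)
class RecordPointer:
    """Hypothetical intermediate pointer stored in the index.

    Instead of a single byte offset, it names the row group, the row's
    position inside it, and the byte ranges the client has to fetch.
    """
    row_group: int
    row_in_group: int
    slices: tuple[ColumnSlice, ...]
```

A client would fetch the last 8 bytes of the file first, then the footer range, and only then know which byte ranges to put into a pointer like this. Whether the ranges should point at column chunks or at individual pages is one of the things worth talking through.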

Anyway, let's talk about this one before working on it. It'll be super educational about how Parquet works, but I don't want us to get lost in the complexity.

@friendlymatthew friendlymatthew added the feature New feature or request label Jan 26, 2024