Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reader and writer for Puffin, indexes and stats file format #4537

Merged
merged 1 commit into from
Jun 15, 2022

Conversation

findepi
Copy link
Member

@findepi findepi commented Apr 11, 2022

Format documentation: #4944

build.gradle Outdated Show resolved Hide resolved
@findepi findepi force-pushed the findepi/stats-file branch from e823511 to 99ce6e5 Compare April 13, 2022 12:42
@findepi
Copy link
Member Author

findepi commented Apr 13, 2022

@rdblue i applied most of the comments (no more jackson-datatype-jdk8, json-related annotations, get in accessor names). I also switched over from base16-encoded test "fixtures" to test resources.

I switched writing to use ByteBuffer, this removes removed data copying upon compression.
This is as a fixup for now. Please let me know if you want me to use ByteBuffer in the reader/writer APIs as well?
It seems byte[] works there well (eg because uncompressed size is known, so no need to re-allocation), so please confirm before I apply further changes.

@rdblue
Copy link
Contributor

rdblue commented Apr 13, 2022

@findepi, I would prefer to use ByteBuffer everywhere. If the backing buffer was easy to allocate and work with, that's great. But you're not forced to reallocate if it is not.

@findepi findepi force-pushed the findepi/stats-file branch from ce65bcc to 5e88a18 Compare April 20, 2022 09:25
@findepi
Copy link
Member Author

findepi commented Apr 20, 2022

( squashed and rebased on top of current version of #4534, no other changes in this push )

@findepi findepi force-pushed the findepi/stats-file branch from 5e88a18 to 6086888 Compare April 20, 2022 09:52
@findepi
Copy link
Member Author

findepi commented Apr 20, 2022

AC
@rdblue @singhpk234 thanks for your review, let me know what else I can change.

@findepi findepi force-pushed the findepi/stats-file branch from 6086888 to 9e8bbc6 Compare April 20, 2022 09:59
@findepi findepi force-pushed the findepi/stats-file branch from 9e8bbc6 to bcdd2f1 Compare April 22, 2022 07:40
@findepi
Copy link
Member Author

findepi commented Apr 22, 2022

AC & rebased after #4534 merged.

@findepi findepi force-pushed the findepi/stats-file branch from bcdd2f1 to 1b15939 Compare April 22, 2022 14:46
@findepi findepi force-pushed the findepi/stats-file branch 2 times, most recently from 872e253 to 9430d36 Compare June 9, 2022 12:54
@findepi findepi requested review from rdblue and danielcweeks June 13, 2022 08:18
@findepi findepi force-pushed the findepi/stats-file branch from 9430d36 to b6472ec Compare June 14, 2022 08:31
@findepi
Copy link
Member Author

findepi commented Jun 14, 2022

Rebased after #5019 merged.

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM, just a few suggestions here and there

return type;
}

public List<Integer> inputFields() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the spec mentions that input fields are JSON longs, so I'm just wondering whether this should be a List<Long>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg field IDs are integers, so the implementation is chosen to be limited to integers.

We can maybe change fields | list of JSON long in the puffin spec to be list of integers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should update the spec to match the type used by the table spec for field IDs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi findepi force-pushed the findepi/stats-file branch 2 times, most recently from d3ca971 to be6bb83 Compare June 15, 2022 11:01
@findepi
Copy link
Member Author

findepi commented Jun 15, 2022

Thanks @rdblue and @nastra for your reviews!

Updated the code accordingly. Please take another look.

@findepi findepi requested review from danielcweeks, nastra and rdblue and removed request for rdblue and danielcweeks June 15, 2022 11:52
ByteBuffer footerPayload = PuffinFormat.compress(footerCompression, footerJson);
outputStream.write(MAGIC);
int footerPayloadLength = footerPayload.remaining();
writeFully(footerPayload);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: why not call IOUtil.writeFully(outputStream, footerPayload) directly rather than having a private writeFully method that just adds the output stream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!buffer.hasRemaining()) {
return;
}
byte[] chunk = new byte[WRITE_CHUNK_SIZE];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer? Alternatively, you could pass the temporary buffer in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could pass the temporary buffer in.

This poses sizing challenge. I.e. the caller needs to provide a reasonably sized buffer.

Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer?

Sure, this is feasible. Do you happen to know what would be the expected reuse ratio for such a buffer?

Alternatively we can have a static buffer pool that lends buffers to the current thread.
(assuming we have a problem that we want to fix here)

@rdblue rdblue merged commit a7f7c1a into apache:master Jun 15, 2022
@rdblue
Copy link
Contributor

rdblue commented Jun 15, 2022

Thanks, @findepi! This looks good to me. We can still follow up, but I think the majority of the changes are ready so I've merged it to avoid keeping a big PR outstanding.

@findepi findepi deleted the findepi/stats-file branch June 20, 2022 07:32
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants