-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reader and writer for Puffin, indexes and stats file format #4537
Conversation
984dce9
to
e823511
Compare
e823511
to
99ce6e5
Compare
@rdblue i applied most of the comments (no more I switched writing to use |
@findepi, I would prefer to use |
core/src/main/java/org/apache/iceberg/stats/FileMetadataParser.java
Outdated
Show resolved
Hide resolved
ce65bcc
to
5e88a18
Compare
( squashed and rebased on top of current version of #4534, no other changes in this push ) |
5e88a18
to
6086888
Compare
AC |
6086888
to
9e8bbc6
Compare
core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java
Outdated
Show resolved
Hide resolved
9e8bbc6
to
bcdd2f1
Compare
AC & rebased after #4534 merged. |
bcdd2f1
to
1b15939
Compare
872e253
to
9430d36
Compare
9430d36
to
b6472ec
Compare
Rebased after #5019 merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM, just a few suggestions here and there
return type; | ||
} | ||
|
||
public List<Integer> inputFields() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the spec mentions that input fields are JSON longs, so I'm just wondering whether this should be a List<Long>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg field IDs are integers, so the implementation is chosen to be limited to integers.
We can maybe change fields | list of JSON long
in the puffin spec to be list of integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should update the spec to match the type used by the table spec for field IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
core/src/test/java/org/apache/iceberg/puffin/TestFileMetadataParser.java
Show resolved
Hide resolved
d3ca971
to
be6bb83
Compare
be6bb83
to
ff02934
Compare
ByteBuffer footerPayload = PuffinFormat.compress(footerCompression, footerJson); | ||
outputStream.write(MAGIC); | ||
int footerPayloadLength = footerPayload.remaining(); | ||
writeFully(footerPayload); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: why not call IOUtil.writeFully(outputStream, footerPayload)
directly rather than having a private writeFully
method that just adds the output stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!buffer.hasRemaining()) { | ||
return; | ||
} | ||
byte[] chunk = new byte[WRITE_CHUNK_SIZE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than allocating every time this is called, can you create a ThreadLocal
to share this buffer? Alternatively, you could pass the temporary buffer in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, you could pass the temporary buffer in.
This poses sizing challenge. I.e. the caller needs to provide a reasonably sized buffer.
Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer?
Sure, this is feasible. Do you happen to know what would be the expected reuse ratio for such a buffer?
Alternatively we can have a static buffer pool that lends buffers to the current thread.
(assuming we have a problem that we want to fix here)
Thanks, @findepi! This looks good to me. We can still follow up, but I think the majority of the changes are ready so I've merged it to avoid keeping a big PR outstanding. |
Format documentation: #4944