Is your feature request related to a problem? Please describe.
This is very similar to #12122, but it is intended for configs like spark.sql.files.maxPartitionBytes.
The simpler part is that we know how to override this value for GPU reads; we have total control over that. The harder part is that we have less information up front, and may need to rely more on historical information, like #12121, to augment any heuristics we come up with.
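As a rough illustration of the "we have total control over that" part, here is a minimal Scala sketch of overriding the config at the session level based on an assumed compression ratio. In practice the plugin would apply an override like this internally during planning; the object name, parameters, and the 2 GiB target below are illustrative assumptions, not part of the plugin.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: pick a compressed-bytes-per-task limit so that the
// decompressed data per task lands near a target amount of GPU data.
object MaxPartitionBytesOverride {
  def apply(spark: SparkSession,
            targetDecompressedBytes: Long,
            estimatedCompressionRatio: Double): Unit = {
    // If the data expands ~4x when decompressed, reading target/4 compressed
    // bytes per task should produce roughly the target amount of GPU data.
    val compressedBytesPerTask =
      math.max(1L, (targetDecompressedBytes / estimatedCompressionRatio).toLong)
    spark.conf.set("spark.sql.files.maxPartitionBytes", compressedBytesPerTask.toString)
  }
}

// Example usage: target ~2 GiB of decompressed data per task, assuming a 4x ratio.
// MaxPartitionBytesOverride(spark, 2L * 1024 * 1024 * 1024, 4.0)
```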
The issue is that we know the size of the compressed data on disk, but we don't know the compression ratio of the data inside it. We don't know the selectivity of the predicate push down, nor the selectivity of the filter that runs after the push down. We also don't really know the row group sizes, which we would need in order to avoid launching tasks that are just going to read a footer and exit.
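To make the unknowns concrete, here is a small sketch of how the estimates could combine into a target for spark.sql.files.maxPartitionBytes. All of the inputs are things we do not know up front; the assumption is that they would come from historical statistics (see #12121) or conservative defaults. The names, default clamp, and structure are illustrative only.

```scala
object ScanSizeHeuristic {
  // Estimates we do not have directly and would need to learn or guess.
  case class ScanEstimates(
      compressionRatio: Double,    // decompressed bytes / compressed bytes on disk
      pushDownSelectivity: Double, // fraction of data kept by predicate push down
      filterSelectivity: Double)   // fraction of rows kept by the filter after push down

  def estimateMaxPartitionBytes(targetGpuBatchBytes: Long, est: ScanEstimates): Long = {
    // Expected decompressed bytes that survive per compressed byte read.
    val effectiveExpansion =
      est.compressionRatio * est.pushDownSelectivity * est.filterSelectivity
    // Read enough compressed bytes that the surviving decompressed data roughly
    // fills one target GPU batch; clamp to avoid degenerate values.
    math.max(1L, (targetGpuBatchBytes / math.max(effectiveExpansion, 1e-3)).toLong)
  }
}
```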
Just like in the other issues, we want to start by implementing the triple buffering feature (#11343) and then work on heuristics to estimate what the decompressed data sizes are going to look like. After that, the input size estimation should look very similar to how it ends up working for shuffle.