GPU implementation of the Kudo shuffle format. #2800
base: branch-25.04
Conversation
…a basis for shuffle_split and shuffle_assemble.
…fixed width types.
…assemble) to make it more manageable.
I am nowhere near done with the review. I am planning on writing some JNI APIs for this and putting together some kudo interoperability tests. Would you prefer to have them be a separate PR based on this one, or should I make a PR to your branch so they can be in one place?
Signed-off-by: Dave Baranec <[email protected]>
Signed-off-by: Dave Baranec <[email protected]>
…e some padding values and convert 64 bit header values into 32 bit. Make sure to swap header fields to big endian. Add clarifying comments on why shuffle_split_col_data is storing values as union (kud0 size purposes).
@nvdbaranec your other PR was merged in. Could you upmerge so I can better see the changes?
…gs inside of a struct. The 'branching' case similar to what happens in cudf::row_bit_count.
Upmerged. It... kind of helps, I guess. There are a few big obvious blocks of code that got removed. Lots of little changes spread throughout, though.
… in the source data isn't necessarily just (num_rows + 7) / 8. We have to account for leading rows/bits in the byte we are copying from.
Nits and some comments/questions for now. Still working through this.
namespace spark_rapids_jni {

// TODO: this is duplicated from cudf because the cudf function is not marked as constexpr, so it
// cannot be called on the gpu there is an issue filed against cudf to make
Will you link that issue here?
// - each partition is prepended with a metadata buffer
// the metadata is of the format:
// - 4 byte row count for the partition
// - for each string column in huffle_split_metadata.col_info
Suggested change:
- // - for each string column in huffle_split_metadata.col_info
+ // - for each string column in shuffle_split_metadata.col_info
 *
 * The result is returned as a blob of bytes representing the individual partitions resulting from
 * the splits, and a set of offsets indicating the beginning of each resulting partition in the
 * result. The function also returns a shuffle_split_metadat struct which contains additional
Suggested change:
- * result. The function also returns a shuffle_split_metadat struct which contains additional
+ * result. The function also returns a shuffle_split_metadata struct which contains additional
 * @param partitions A buffer of anonymous bytes representing multiple partitions of data to be
 * merged
 * @param partition_offsets Offsets into the partitions buffer indicating where each individual
 * partition begins. The number of partitions is partition_offsets.size()
Is this not the number of partitions + 1?
struct partition_header {
  uint32_t magic_number;
  uint32_t row_index;  // row index in the source table that this partition started at
  uint32_t num_rows;
Initially not supporting large tables?
// padding values for each validity type, as applied at the end of that data type
// in each partition. so for example all of the grouped-together validity buffers for
// a given partition will have a final 4 byte pad applied before the offset buffers begin
It isn't 4 bytes of padding as much as it is padded to align on a 4 byte boundary, right? I wonder if there is better wording we could use here.
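A minimal illustration of that reading (an assumption about the intent, not code from this PR):

```cpp
#include <cstddef>

// "padded to a 4-byte boundary": the pad is 0-3 bytes, whatever brings the running
// size of the validity section up to the next multiple of 4
constexpr size_t pad_to_4_byte_boundary(size_t validity_bytes)
{
  return (4 - (validity_bytes % 4)) % 4;  // e.g. 5 -> 3, 8 -> 0
}
```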
 */
constexpr size_t compute_per_partition_metadata_size(size_t total_columns)
{
  auto const has_validity_length = (total_columns + 7) / 8;  // has-validity bit per column
I believe you should prefer bitmask_allocation_size_bytes() here, but it isn't constexpr. Maybe use div_rounding_up_safe directly?
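A minimal sketch of that suggestion, assuming cudf::util::div_rounding_up_safe (from cudf/detail/utilities/integer_utils.hpp) can be used in a constexpr context in the pinned cudf version:

```cpp
#include <cstddef>
#include <cudf/detail/utilities/integer_utils.hpp>

// same intent as (total_columns + 7) / 8, with the round-up made explicit
constexpr size_t has_validity_length_bytes(size_t total_columns)
{
  return cudf::util::div_rounding_up_safe(total_columns, size_t{8});  // one has-validity bit per column
}
```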
"mauve", | ||
"ultraviolet"}}; | ||
cudf::test::strings_column_wrapper col2{ | ||
{"left", "up", "right", "down", "", "space", "", "delete", "end", "insert"}}; |
I'm disturbed by your lack of Contra cheat code.
Still going, only about halfway through shuffle_assemble unfortunately.
                          int& max_branch_depth)
{
  bool const is_struct = col->type == cudf::type_id::STRUCT;
  if (col->type == cudf::type_id::STRING || col->type == cudf::type_id::LIST) {
We're doing this twice, should we bool const is_string_or_list this thing?
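Something like this, as a sketch (hypothetical helper name; `type` stands in for col->type in the code above):

```cpp
#include <cudf/types.hpp>

// hypothetical helper for the repeated STRING/LIST check
constexpr bool is_string_or_list(cudf::type_id type)
{
  return type == cudf::type_id::STRING || type == cudf::type_id::LIST;
}
```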
                        cudf::device_span<size_t const> partition_offsets,
                        size_t per_partition_metadata_size)
{
  if (threadIdx.x != 0) { return; }
This makes me sad, but I assume there isn't a good way around it.
    validity_pad);
  size_type const* offsets = reinterpret_cast<size_type const*>(partitions.begin() + offsets_begin);

  auto base_col_index = column_metadata.size() * partition_index;
Should/can we const this?
      // fallthrough
      default: {
        if (rc_stack_pos >= 0) {
          auto const num_children = --rc_stack[(rc_stack_pos * 2) + 1];
A comment describing this calculation would be welcome.
// returns:
// - a vector of assemble_column_info structs representing the destination column data.
//   the vector is of length global_metadata.col_info.size() that is, the flattened list of columns
//   in the table.
//
// - the same vector as above, but in host memory.
//
// - a vector of assemble_column_info structs, representing the source column data.
//   the vector is of length global_metadata.col_info.size() * the # of partitions.
//
Suggested change: delete this comment block.
Seems out of date and out of place.
      partitions + partition_offsets[partition_index] + sizeof(partition_header));
    auto const col_index = i / num_partitions;

    return has_validity_buf[col_index / 32] & (1 << (col_index % 32)) ? 1 : 0;
Can we use a helper function in the null mask code for this computation?
  bitmask_type const* const has_validity_buf =
    reinterpret_cast<bitmask_type const*>(buf_start + sizeof(partition_header));
  cinstance_info.has_validity =
    has_validity_buf[col_index / 32] & (1 << (col_index % 32)) ? 1 : 0;
Here is that calculation again. Should we bump this out into a function?
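A sketch of what bumping it out might look like, assuming cudf::bit_is_set from cudf/utilities/bit.hpp is usable in device code here (it is marked host/device in cudf, but worth checking against the pinned version):

```cpp
#include <cudf/types.hpp>
#include <cudf/utilities/bit.hpp>

// replaces: has_validity_buf[col_index / 32] & (1 << (col_index % 32)) ? 1 : 0
__device__ inline bool column_has_validity(cudf::bitmask_type const* has_validity_buf,
                                           cudf::size_type col_index)
{
  return cudf::bit_is_set(has_validity_buf, col_index);
}
```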
  int shmem_limit_per_block;
  CUDF_CUDA_TRY(cudaDeviceGetAttribute(
    &shmem_limit_per_block, cudaDevAttrMaxSharedMemoryPerBlock, device_id));
  CUDF_EXPECTS(shmem_per_block <= shmem_limit_per_block,
What is the failure plan here? Will Java catch this and re-run on CPU? How deep of a table is required?
 *
 * @return Size of the required allocation in bytes
 */
constexpr size_t bitmask_allocation_size_bytes(size_type number_of_bits, int pad)
Are we reproducing this code to make it constexpr?
                      // we need to ignore them because the split-copy happens at
                      // byte boundaries, not bit/row boundaries. so we may have
                      // irrelevant rows at the very beginning.
  int dst_bit_shift;  // dest bit (left) shift
Does right shift not equal left shift? I would think the number of bits you ignore would be the number you shift back.
Still looking at this. I have a very basic Java API and I am just getting started testing compatibility between the two.
namespace spark_rapids_jni {

// TODO: this is duplicated from cudf because the cudf function is not marked as constexpr, so it
// cannot be called on the gpu there is an issue filed against cudf to make
nit: Could you include a link to that issue here? I don't see the issue when I search for it.
  if (input.num_columns() == 0) { return {}; }
  // empty inputs
  if (input.num_columns() == 0 || input.num_rows() == 0) {
    rmm::device_uvector<size_t> empty_offsets(1, stream, mr);
I am not sure if we want this to be an empty buffer or an error. Kudo CPU explicitly supports rows with no columns, but CUDF cannot support this, so we cannot do it here. Because of that we have to make sure that we use the CPU version for those cases, and an error would make it more obvious that we messed something up.
  uint32_t num_rows;
  uint32_t validity_size;
  uint32_t offset_size;
  uint32_t data_size;
The last value in here is supposed to be the number of columns.
uint32_t num_columns;
This feels redundant, seeing how we need the schema for this to be read back in. So should I remove it from the CPU code or should we add it here?
@liurenjie1024 the numColumns field in the header looks like it should be something that we do not need. Is there a reason we are including it instead of relying on the schema to get it?
Yes, it could be inferred from the schema. But this field helps simplify the code a little, since there is a bit array indicating whether each column has a validity buffer.
One extra use of num_cols is that it can identify a row-count-only record: 0 means row count only.
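For reference, a sketch of where that bit array dependency shows up (an assumption about the header layout based on this thread, not verified against the CPU reader):

```cpp
#include <cstddef>
#include <cstdint>

// the has-validity bit array that follows the fixed header fields can only be
// sized if the reader knows the flattened column count
constexpr size_t has_validity_bit_array_bytes(uint32_t num_columns)
{
  return (num_columns + 7) / 8;  // one bit per flattened column
}
```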
I am seeing another difference between this and the CPU code. It looks like this is padding the data after the header, whereas the CPU is not doing that. I think this is because the header is parsed totally on the host, and then only the data part is sent to the GPU. I was able to remove the columnCount from the header, so now I am going to try and add in padding too. But I would like some feedback from @liurenjie1024 and @nvdbaranec on what we want the final solution to look like.
@nvdbaranec can you please explain to me how padding works for this PR? It looks like there is padding after the header, but there is no padding in between columns? Is that right?
@liurenjie1024 can you please explain to me why we have different padding for concat vs over the wire? It is very confusing to see that some are padded to 4 bytes and others are padded to 64 bytes.
…te copy sizes and bit shifts.
Padding over the wire is 4 bytes to reduce padding size, which is very important for small partitions.
64 bytes was me being overly cautious. CUDA malloc does 64-byte alignment (I think) because, like malloc, it does not know the data types, so it cannot make any assumptions about how the data needs to be aligned. In reality we need each column to be aligned based on how it is accessed.
Offsets we could have as 4-byte aligned, but because CUDF supports longer offsets we might want to think about 8-byte alignment (unless we have a way to store what type of offset is being stored in the header). In JCUDF serialization I thought about reordering the columns to reduce the amount of padding that would be needed, but I never did it because it didn't feel like it saved that much; I wasn't looking at it from the perspective of thousands of partitions, though. @nvdbaranec what padding do the kernels in this patch need in order to work? Would it be alright if there was no padding at all? Or is the data copied in a non-byte-wise way?
  }
}

TEST_F(ShuffleSplitTests, Simple)
We need tests for bools and decimal32/64/128 (nothing here tests scale factor). Especially 128 if the data is really only going to be 8-byte-aligned.
  }
  auto temp_mr = cudf::get_current_device_resource_ref();

  size_t const num_partitions = splits.size() + 1;
So if the last split is at the end of the column, will we generate an empty partition at the end then? Is that what we want? Still reading everything.
@nvdbaranec I was looking for the hasValidity part of the header and I am not finding it in this code, but it is in the CPU code, just like the numColumns part (see spark-rapids-jni/src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableHeader.java, lines 79 to 82 at ef7bb92).
Is this intentional or is it an oversight?
That sounds reasonable to me. The CPU version has the following padding
@nvdbaranec I modified your patch to have a num_columns field.

```diff
diff --git a/src/main/cpp/src/shuffle_split.cu b/src/main/cpp/src/shuffle_split.cu
index 92f7dfc92..4ecbb3eef 100644
--- a/src/main/cpp/src/shuffle_split.cu
+++ b/src/main/cpp/src/shuffle_split.cu
@@ -637,6 +637,8 @@ __global__ void pack_per_partition_metadata_kernel(uint8_t* out_buffer,
   pheader->offset_size =
     cudf::hashing::detail::swap_endian(static_cast<uint32_t>(psize.offset_size));
   pheader->data_size = cudf::hashing::detail::swap_endian(static_cast<uint32_t>(psize.data_size));
+  pheader->num_flattened_columns =
+    cudf::hashing::detail::swap_endian(static_cast<uint32_t>(columns_per_partition));
 }

 bitmask_type* has_validity =
diff --git a/src/main/cpp/src/shuffle_split_detail.hpp b/src/main/cpp/src/shuffle_split_detail.hpp
index da1841e80..d53cfeeb4 100644
--- a/src/main/cpp/src/shuffle_split_detail.hpp
+++ b/src/main/cpp/src/shuffle_split_detail.hpp
@@ -61,6 +61,7 @@ struct partition_header {
   uint32_t validity_size;
   uint32_t offset_size;
   uint32_t data_size;
+  uint32_t num_flattened_columns;  // used by the CPU version, but not the GPU version
 };

 // padding values for each validity type, as applied at the end of that data type
@@ -92,4 +93,4 @@ enum class buffer_type { VALIDITY = 0, OFFSETS = 1, DATA = 2 };
 } // namespace detail
-} // namespace spark_rapids_jni
\ No newline at end of file
+} // namespace spark_rapids_jni
```

I have been trying to modify the CPU code to do padding like you are doing, but it is proving to be a little difficult. I'll keep trying to see if I can make this work.
{
  // if the input is empty, just generate an empty table
  if (partition_offsets.size() == 1) { return build_empty_table(metadata.col_info, stream, mr); }
Should we have a shortcut here for there being only one partition?
Has this been tested yet on complex customer data?
      // fallthrough
      default: {
        if (rc_stack_pos >= 0) {
          src_row_index = rc_stack[(rc_stack_pos * 2) + 2];
Doesn't this need to be rc_stack_pos * 3 instead of * 2?
@nvdbaranec I got an Illegal Memory Access issue when I added the following test (note it was in Java, but I ported it to C++):

```diff
diff --git a/src/main/cpp/tests/shuffle_split.cpp b/src/main/cpp/tests/shuffle_split.cpp
index 97f68f0e3..0d9a59c66 100644
--- a/src/main/cpp/tests/shuffle_split.cpp
+++ b/src/main/cpp/tests/shuffle_split.cpp
@@ -104,6 +104,23 @@ TEST_F(ShuffleSplitTests, Strings)
   }
 }

+TEST_F(ShuffleSplitTests, SimpleWithStrings)
+{
+  cudf::test::fixed_width_column_wrapper<int8_t> col0(
+    {0, 0xF0, 0x0F, 0xAA, 0},
+    {1, 0, 0, 0, 1});
+  cudf::test::strings_column_wrapper col1(
+    {"0xFF", "", "0x0F", "0xAA", "0x55"},
+    {0, 1, 0, 0, 0});
+
+  // 2 columns split once
+  {
+    cudf::table_view tbl{{static_cast<cudf::column_view>(col0),
+                          static_cast<cudf::column_view>(col1)}};
+    run_split(tbl, {5});
+  }
+}
+
 TEST_F(ShuffleSplitTests, Lists)
 {
   // list<uint64_t>
@@ -360,4 +377,4 @@ TEST_F(ShuffleSplitTests, NestedTypes)
   run_split(tbl, {2});
   run_split(tbl, {2, 4});
 }
-}
\ No newline at end of file
+}
```
        word = (word >> batch.src_bit_shift) & relevant_row_mask;

        // any bits that are not being stored in the current dest word get overflowed to the next copy
        prev_word[0] = word >> (32 - batch.dst_bit_shift);
Do any of the existing tests trigger these misaligned cases? I just want to make sure our tests are actually hitting them. Do printfs in here show all kinds of alignments? Ditto for the main loop below.
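For reference, a self-contained sketch of the shift-and-carry these lines appear to implement (a paraphrase of that reading, not the PR's code):

```cpp
#include <cstdint>

// Copy one 32-bit word of validity bits into a destination stream whose current
// word already holds dst_bit_shift bits. Bits that do not fit spill into `carry`
// for the next destination word.
inline void copy_validity_word(uint32_t word, int src_bit_shift, int dst_bit_shift,
                               uint32_t relevant_row_mask, uint32_t& dst_word, uint32_t& carry)
{
  word = (word >> src_bit_shift) & relevant_row_mask;           // drop leading, irrelevant rows
  dst_word |= word << dst_bit_shift;                            // low bits land in the current dest word
  carry = dst_bit_shift ? (word >> (32 - dst_bit_shift)) : 0u;  // the rest overflows to the next word
}
```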
 * included in the buffer. So the source buffer will be 2 bytes (10 bits) even though the we are
 * only using 8 bits of that data in the output. Therefore, ths size returned from this functor is
 * only valid for -source- data.
 *
Is there a guarantee that the src buffer validity bits are aligned to the beginning of a byte? Suppose we have 10 rows and an offset of 7, wouldn't we need 3 bytes? E.g.: xxxxxxxB BBBBBBBB Bxxxxxxx
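As a worked illustration of that concern (not the PR's math):

```cpp
// 10 rows starting at a source bit offset of 7 straddle 3 bytes, even though
// ceil(10 / 8) == 2:   xxxxxxxB BBBBBBBB Bxxxxxxx
constexpr int num_rows      = 10;
constexpr int src_bit_shift = 7;                                   // leading bits to skip
constexpr int src_bytes     = (src_bit_shift + num_rows + 7) / 8;  // == 3, not 2
static_assert(src_bytes == 3);
```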
This PR provides the GPU implementation of the split and assemble halves of the Kudo shuffle format. It is dependent on #2799. That PR exists solely to help reduce the amount of code that needs to get reviewed for this PR; shuffle_split.cu should be a little easier to review once it gets merged.

shuffle_split implements the sender-side half of the Kudo format, splitting an input table up into the final output binary and associated metadata. It is implemented in shuffle_split.cu and is based on cudf::contiguous_split.

shuffle_assemble implements the receiver side of the process, reassembling one or more blobs produced by one or more shuffle_split calls.

This is feature complete and in a reviewable state, but there is a known issue with struct<list> and struct<string> columns that I have to fix. I don't expect it to alter the code much. I also need to add more tests. But in the interest of getting reviews going, I'm putting this up in draft form.