Add initial Arrow PyCapsule support #517

jonmmease · 2024-10-12T13:59:26Z

Supersedes #501 now that everything needed has been updated.

Adopts arro3 / pyo3-arrow / narwhals for managing DataFrames in Python and passing them to Rust in Arrow format with zero copy.

Uses get_column_usage to determine which columns each inline input dataset needs, then uses narwhals to project away the unused columns before wrapping the result in an arro3 table to pass to Rust. This approach removes the need for the Python data source framework, so I removed it altogether.
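For illustration only, here is a rough stdlib sketch of the down-projection idea. It is not the code in this PR: a plain dict of lists stands in for the DataFrame, `downproject` is a hypothetical helper, and treating `None` as "usage could not be determined, keep everything" is an assumption. In the PR itself the column selection goes through narwhals and the result is wrapped in an arro3 table.

```python
from typing import Dict, Iterable, Mapping, Optional, Sequence


def downproject(
    table: Mapping[str, Sequence],
    used_columns: Optional[Iterable[str]],
) -> Dict[str, Sequence]:
    """Keep only the columns that the spec references (hypothetical helper).

    Assumption: used_columns=None means column usage is unknown,
    so every column is kept.
    """
    if used_columns is None:
        return dict(table)
    wanted = set(used_columns)
    # Preserve the table's original column order and skip names it lacks.
    # The column values are shared with the input, not copied.
    return {name: values for name, values in table.items() if name in wanted}


# Stand-in for an inline dataset; only two of its three columns are used.
movies = {
    "Title": ["A", "B", "C"],
    "Rating": [7.1, 6.4, 8.0],
    "Gross": [100, 250, 75],
}
projected = downproject(movies, ["Gross", "Title"])
```

Dropping unused columns before the handoff means less data to hash and hand across the Python/Rust boundary, which is the reason for doing the projection on the Python side first.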

Performance of this workflow is improved over the prior handling of pandas/polars, and it is faster than the duckdb code path in my limited testing. We'll want to do more comprehensive benchmarks before release.

jonmmease · 2024-10-12T14:32:57Z

vegafusion-python/tests/test_pretransform.py

-    for i in range(20):
-        # Break cache by removing one row each iteration
-        movies_inner = movies.iloc[i:]
+    movies["Title"] = movies["Title"].astype(str)


wish polars/pyarrow could handle loading mixed str/num columns from pandas

jonmmease · 2024-10-12T14:33:44Z

vegafusion-runtime/src/transform/timeunit.rs

+                func: Arc::new((*STR_TO_UTC_TIMESTAMP_UDF).clone()),
+                args: vec![field_col, lit(default_input_tz)],
+            })
+        }


I'm sure there are a lot more of these that will need to be updated, but just did this one to get tests passing

jonmmease added 5 commits October 12, 2024 09:55

Add initial Arrow PyCapsule support

9a09bdf

Bump python in actions to 3.11

bbfb563

fmt

c5652dc

update hang test to use polars with pycapsule path

6e83863

Add system python of 3.11

50c7163

jonmmease added 18 commits October 12, 2024 16:16

Add more efficient hashing and rechunk for DataFusion

751af26

fmt

8974fad

use narwhals to remove unused columns.

380de1e

toward removing arrow/pyarrow flag

67fb559

Remove arrow-rs pyarrow flag, use pyo3-arrow

b9bdf13

Rename pyarrow feature flag to py

68b8324

toward using narwhals to process transformed data

80fd0da

Remove python datasource

b9b65e1

Handle dict and non-narwhals pycapsule types

57e5efc

update default extraction to arro3

f9f7961

fix type checking

f27c7f7

build py before type checking

7d8b9ca

skip empty fields in window transform

47f675f

Try normalize category order

e029456

lower min Python back to 3.9

1dad584

clear wheel build dir first

9cefb4c

try rename artifacts

23efb08

cache?

56fa42a

jonmmease merged commit 12e6d97 into v2 Oct 15, 2024
19 checks passed

jonmmease mentioned this pull request Oct 15, 2024

Read Arrow C Stream from Arrow PyCapsule Interface #501

Closed

MarcoGorelli mentioned this pull request Oct 16, 2024

docs: add vegafusion to "used by" on readme narwhals-dev/narwhals#1191

Open
