feat: add `tab_spanner_delim()` method #604

jrycw · 2025-02-12T16:34:13Z

Fix issue: #582.

Hello team:

This PR sets up a scaffold for tab_spanner_delim().

Basic Logic:

Use str.split or str.rsplit to split column names, corresponding to delimi= and limit= parameters.
Iterate through selected columns, splitting each name and recording its level and root_col (original column).
Process the recorded info:
- If level != 0, call tab_spanner().
- If level == 0, call cols_label().

Example:

import polars as pl
from great_tables import GT

data = {
    "province.NL_ZH.pop": [1, 2, 3],
    "province.NL_ZH.gdp": [4, 5, 6],
    "province.NL_NH.pop": [7, 8, 9],
    "province.NL_NH.gdp": [10, 11, 12],
}

gt = GT(pl.DataFrame(data))
gt.tab_spanner_delim()

gt.tab_spanner_delim(limit=1)

gt.tab_spanner_delim(split="first", limit=1)

Remaining Tasks:

The reverse= logic is not implemented—need the team's help to clarify the expected behavior.
Docs are ported from {gt} but need refinement.
Tests are missing since the logic isn't fully settled yet.

codecov · 2025-02-12T16:35:53Z

Codecov Report

Attention: Patch coverage is 77.35849% with 12 lines in your changes missing coverage. Please review.

Project coverage is 90.70%. Comparing base (3d6ad09) to head (ef0a9ee).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
great_tables/_spanners.py	76.92%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #604      +/-   ##
==========================================
- Coverage   90.71%   90.70%   -0.01%     
==========================================
  Files          46       46              
  Lines        5417     5467      +50     
==========================================
+ Hits         4914     4959      +45     
- Misses        503      508       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

machow · 2025-02-13T18:25:57Z

@rich-iannone is going to circle around with this a bit fleshed out, but we paired on having an intermediate representation inside this function, to make it easier to communicate how column name splits end up at various spanner levels:

# case 1: simple ----

split(["span_1.A", "span_1.B.x"])

# produces

{"span_1.A": ["span_1", "A"], "span_1.B.x": ["span_1", "B", "x"]}


# case 2: reversed ----

split(["span_1.A", "span_1.B.x"], reverse=True)

# produces

{"span_1.A": ["A", "span_1"], "span_1.B.x": ["x", "B", "span_1"]}

# converted to DataFrame (full example) ----------
from itertools import zip_longest
import polars as pl

# d is the intermediate representation
d = {"span_1.A": ["A", "span_1"], "span_1.B.x": ["x", "B", "span_1"]}
rectangle = [list(d.keys()), *zip_longest(*d.values())]

# first column should be "column_name"
# other columns should be numbered (e.g. span_1, span_2, span_3)
pl.DataFrame(rectangle)

For example, the above DataFrame looks like this

┌────────────┬──────────┬──────────┬──────────┐
│ column_0   ┆ column_1 ┆ column_2 ┆ column_3 │
│ ---        ┆ ---      ┆ ---      ┆ ---      │
│ str        ┆ str      ┆ str      ┆ str      │
╞════════════╪══════════╪══════════╪══════════╡
│ span_1.A   ┆ A        ┆ span_1   ┆ null     │
│ span_1.B.x ┆ x        ┆ B        ┆ span_1   │
└────────────┴──────────┴──────────┴──────────┘

This first column is the original column names, with the other columns representing spanner levels (from lowest to highest). It seems that gt loops through the other columns from lowest to highest, and...

The lowest does a .cols_label()
Each other column run .tab_spanner() for each unique value in that column
No gathering is performed, so columns do not change position

I think having the intermediate dictionary of lists, and conversion to DataFrame will end up useful for us communicating gt's underlying spec and testing it (without having to snapshot whole tables)

rich-iannone · 2025-02-19T17:03:15Z

@jrycw Thank you so much for your work on this feature, it is a big one! Regarding the reverse= argument, I can explain its intent and how it could be implemented.

In your example you had:

import polars as pl
from great_tables import GT

data = {
    "province.NL_ZH.pop": [1, 2, 3],
    "province.NL_ZH.gdp": [4, 5, 6],
    "province.NL_NH.pop": [7, 8, 9],
    "province.NL_NH.gdp": [10, 11, 12],
}

gt = GT(pl.DataFrame(data))
gt.tab_spanner_delim()

Which is a sensible way to structure column names as we go from large to small elements, from left to right (e.g., province, NL_ZH, pop in the first column). Some tables may have this the other way around, like this:

data = {
    "pop.NL_ZH.province": [1, 2, 3],
    "gdp.NL_ZH.province": [4, 5, 6],
    "pop.NL_NH.province": [7, 8, 9],
    "gdp.NL_NH.province": [10, 11, 12],
}

It's just another way of structuring the elements, but in a different order (going from smallest to largest elements). Rendering that currently with tab_spanner_delim() returns this table:

But having the option to use reverse=True would give you the table at the top. And it might be more realistic to have a DataFrame like this:

data = {
    "pop.NL_ZH": [1, 2, 3],
    "gdp.NL_ZH": [4, 5, 6],
    "pop.NL_NH": [7, 8, 9],
    "gdp.NL_NH": [10, 11, 12],
}

where it makes sense to the user that pop or gdp might naturally go before NL_ZH or NL_NH, so having reverse= is useful there.

In terms of implementation, what's done in the R program is just this:

split the column name by delimiter into a list
reverse the contents of the list if reverse=True
continue...

Hope this helps clarify the intended use of reverse=!

jrycw · 2025-02-20T06:42:06Z

@machow and @rich-iannone, thanks for all the examples and explanations! I've tried to consolidate the logic you described into a class named SpannerTransformer (though there might be a better name). Now, the intermediate representation can be obtained from SpannerTransformer.split(), and the rectangle from SpannerTransformer.get_rectangle().

I've also added some tests, but I'm not entirely sure if my logic is correct—reverse= is still giving me a bit of trouble! :)

Add scaffold for tab_spanner_delim()

ebd5500

jrycw requested review from rich-iannone and machow as code owners February 12, 2025 16:34

jrycw marked this pull request as draft February 12, 2025 16:34

Simplify the spanner recording process

e6d2564

machow assigned rich-iannone Feb 13, 2025

jrycw added 2 commits February 20, 2025 14:31

Add SpannerTransformer implementation

a1fd624

Add tests for SpannerTransformer

5c6c2d2

Update SpannerTransformer.get_rectangle() implementation

ef0a9ee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add `tab_spanner_delim()` method #604

feat: add `tab_spanner_delim()` method #604

jrycw commented Feb 12, 2025

codecov bot commented Feb 12, 2025 •

edited

Loading

machow commented Feb 13, 2025

rich-iannone commented Feb 19, 2025

jrycw commented Feb 20, 2025

feat: add tab_spanner_delim() method #604

Are you sure you want to change the base?

feat: add tab_spanner_delim() method #604

Conversation

jrycw commented Feb 12, 2025

Basic Logic:

Example:

Remaining Tasks:

codecov bot commented Feb 12, 2025 • edited Loading

Codecov Report

machow commented Feb 13, 2025

rich-iannone commented Feb 19, 2025

jrycw commented Feb 20, 2025

feat: add `tab_spanner_delim()` method #604

feat: add `tab_spanner_delim()` method #604

codecov bot commented Feb 12, 2025 •

edited

Loading