Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add tab_spanner_delim() method #604

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jrycw
Copy link
Collaborator

@jrycw jrycw commented Feb 12, 2025

Fix issue: #582.

Hello team:

This PR sets up a scaffold for tab_spanner_delim().

Basic Logic:

  1. Use str.split or str.rsplit to split column names, corresponding to delimi= and limit= parameters.
  2. Iterate through selected columns, splitting each name and recording its level and root_col (original column).
  3. Process the recorded info:
    • If level != 0, call tab_spanner().
    • If level == 0, call cols_label().

Example:

import polars as pl
from great_tables import GT

data = {
    "province.NL_ZH.pop": [1, 2, 3],
    "province.NL_ZH.gdp": [4, 5, 6],
    "province.NL_NH.pop": [7, 8, 9],
    "province.NL_NH.gdp": [10, 11, 12],
}

gt = GT(pl.DataFrame(data))
gt.tab_spanner_delim()

image

gt.tab_spanner_delim(limit=1)

image

gt.tab_spanner_delim(split="first", limit=1)

image

Remaining Tasks:

  • The reverse= logic is not implemented—need the team's help to clarify the expected behavior.
  • Docs are ported from {gt} but need refinement.
  • Tests are missing since the logic isn't fully settled yet.

@jrycw jrycw marked this pull request as draft February 12, 2025 16:34
Copy link

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 77.35849% with 12 lines in your changes missing coverage. Please review.

Project coverage is 90.70%. Comparing base (3d6ad09) to head (ef0a9ee).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
great_tables/_spanners.py 76.92% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #604      +/-   ##
==========================================
- Coverage   90.71%   90.70%   -0.01%     
==========================================
  Files          46       46              
  Lines        5417     5467      +50     
==========================================
+ Hits         4914     4959      +45     
- Misses        503      508       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@machow
Copy link
Collaborator

machow commented Feb 13, 2025

@rich-iannone is going to circle around with this a bit fleshed out, but we paired on having an intermediate representation inside this function, to make it easier to communicate how column name splits end up at various spanner levels:

# case 1: simple ----

split(["span_1.A", "span_1.B.x"])

# produces

{"span_1.A": ["span_1", "A"], "span_1.B.x": ["span_1", "B", "x"]}


# case 2: reversed ----

split(["span_1.A", "span_1.B.x"], reverse=True)

# produces

{"span_1.A": ["A", "span_1"], "span_1.B.x": ["x", "B", "span_1"]}

# converted to DataFrame (full example) ----------
from itertools import zip_longest
import polars as pl

# d is the intermediate representation
d = {"span_1.A": ["A", "span_1"], "span_1.B.x": ["x", "B", "span_1"]}
rectangle = [list(d.keys()), *zip_longest(*d.values())]

# first column should be "column_name"
# other columns should be numbered (e.g. span_1, span_2, span_3)
pl.DataFrame(rectangle)

For example, the above DataFrame looks like this

┌────────────┬──────────┬──────────┬──────────┐
│ column_0column_1column_2column_3 │
│ ------------      │
│ strstrstrstr      │
╞════════════╪══════════╪══════════╪══════════╡
│ span_1.AAspan_1null     │
│ span_1.B.xxBspan_1   │
└────────────┴──────────┴──────────┴──────────┘

This first column is the original column names, with the other columns representing spanner levels (from lowest to highest). It seems that gt loops through the other columns from lowest to highest, and...

  • The lowest does a .cols_label()
  • Each other column run .tab_spanner() for each unique value in that column
  • No gathering is performed, so columns do not change position

I think having the intermediate dictionary of lists, and conversion to DataFrame will end up useful for us communicating gt's underlying spec and testing it (without having to snapshot whole tables)

@rich-iannone
Copy link
Member

@jrycw Thank you so much for your work on this feature, it is a big one! Regarding the reverse= argument, I can explain its intent and how it could be implemented.

In your example you had:

import polars as pl
from great_tables import GT

data = {
    "province.NL_ZH.pop": [1, 2, 3],
    "province.NL_ZH.gdp": [4, 5, 6],
    "province.NL_NH.pop": [7, 8, 9],
    "province.NL_NH.gdp": [10, 11, 12],
}

gt = GT(pl.DataFrame(data))
gt.tab_spanner_delim()
image

Which is a sensible way to structure column names as we go from large to small elements, from left to right (e.g., province, NL_ZH, pop in the first column). Some tables may have this the other way around, like this:

data = {
    "pop.NL_ZH.province": [1, 2, 3],
    "gdp.NL_ZH.province": [4, 5, 6],
    "pop.NL_NH.province": [7, 8, 9],
    "gdp.NL_NH.province": [10, 11, 12],
}

It's just another way of structuring the elements, but in a different order (going from smallest to largest elements). Rendering that currently with tab_spanner_delim() returns this table:

image

But having the option to use reverse=True would give you the table at the top. And it might be more realistic to have a DataFrame like this:

data = {
    "pop.NL_ZH": [1, 2, 3],
    "gdp.NL_ZH": [4, 5, 6],
    "pop.NL_NH": [7, 8, 9],
    "gdp.NL_NH": [10, 11, 12],
}

where it makes sense to the user that pop or gdp might naturally go before NL_ZH or NL_NH, so having reverse= is useful there.

In terms of implementation, what's done in the R program is just this:

  1. split the column name by delimiter into a list
  2. reverse the contents of the list if reverse=True
  3. continue...

Hope this helps clarify the intended use of reverse=!

@jrycw
Copy link
Collaborator Author

jrycw commented Feb 20, 2025

@machow and @rich-iannone, thanks for all the examples and explanations! I've tried to consolidate the logic you described into a class named SpannerTransformer (though there might be a better name). Now, the intermediate representation can be obtained from SpannerTransformer.split(), and the rectangle from SpannerTransformer.get_rectangle().

I've also added some tests, but I'm not entirely sure if my logic is correct—reverse= is still giving me a bit of trouble! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants