nested_dtypes.qmd

---
title: "Nested dtypes"

execute:
  eval: true
  warning: true
  error: true
  keep-ipynb: true
  cache: true
jupyter: python3
pdf-engine: lualatex
# theme: pandoc
html:
    code-tools: true
    fold-code: false
    author: Jonathan D. Rosenblatt
    date: 04-20-2024
    toc: true
    number-sections: true
    number-depth: 3
    embed-resources: true
---

```{python}
#| echo: false
# %pip install --upgrade pip
# %pip install --upgrade polars
# %pip install --upgrade pyarrow
# %pip install --upgrade Pandas
# %pip install --upgrade plotly
# %pip freeze > requirements.txt
```

```{python}
#| label: setup-env

# %pip install -r requirements.txt
```

```{python}
#| label: Polars-version
%pip show Polars # check your Polars version
```

```{python}
#| label: Pandas-version
%pip show Pandas # check your Pandas version
```

```{python}
#| label: preliminaries

import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs

import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline 
import matplotlib.pyplot as plt
from datetime import datetime

# The following two lines are only required to view plotly output when rendering from VS Code.
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"
```

Which Polars version and dependencies are installed?
```{python}
#| label: show-versions
pl.show_versions()
```

How many cores are available for parallelism?
```{python}
#| label: show-cores
pl.thread_pool_size()
```


# Preliminaries


## A Polars Frame Can Hold Anything

Fun fact: Polars, like Pandas, can store anything within a cell. For instance, a Polars frame can hold a Polars frame.

```{python}
#| label: make-polars-frame

df = pl.DataFrame(
    {
        "a": [
            pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}), 
            pl.DataFrame({"x": [7, 8, 9], "y": [10, 11, 12]})
            ],
        "b": ["a", "b"]
    }
)

print(df)

```

Things to note: 

- The dtype of column `a` is the Polars `Object` dtype. 


## Motivation For Nested dtypes

Consider the following scenarios:

1. I want to compute with parts of strings generated by a **split**. 

2. I want to store and compute with arrays of **varying lengths**. Examples include paths of cars on the globe, or paths of users through a website.

3. I want to **group columns** together, and compute with them as a unit.
   1. For convenience of access.
   2. Because I am missing a `foo_horizontal()` method.

4. I want to read a JSON/XML/YML with **nested structures**.


How can I go about it?

1. Since a Polars object, like a Python object, can hold anything, I could store the nested data as a **Polars object** (i.e. a Python object inside a Polars `Object` column).
2. I could generate **many columns**, and pad with nulls when lengths differ.
3. I could store each unit as a **Polars Series**.


The **Python object** option is always available, and I will try to **avoid it**. 
If I can avoid Polars (i.e. Rust) calling Python, I will.

The second option could work. It is actually a good one, if I can figure out the **column access** (regex?).

The third option would be great. Unlike the first, I would remain within Rust. Unlike the second, I would have a **named unit** to access. 

This is today's topic. 

::: {.callout-important}
The Polars nested dtypes are inherited from the Arrow format. 
So these are actually "Arrow Nested dtypes with Polars' functionality". 
:::


## Nested dtypes in Polars

The Polars nested dtypes, inherited from [Arrow](https://wesm.github.io/arrow-site-test/format/Layout.html):


1.  **Polars List** 
2.  **Polars Array** 
3.  **Polars Struct** 

A **Polars list** is a Polars Series within a cell: all elements in the cell must have the same dtype; the elements are unnamed, so they are accessed by index or by filtering.
Do not be confused by the name. A Polars list is not a Python list.  

A **Polars array** is a Polars list with a fixed length: all elements in the cell must have the same dtype, and all cells must have the same length; the elements are unnamed, so they are accessed by index or by filtering. Another mental model of a Polars array is a numpy 2D array within a Polars frame.


A **Polars struct** is a Polars DataFrame within a cell: All elements in the cell must have the same fields; the elements are named so accessed by field name. 
Do not think of the struct as a Python dict within a cell, because all rows must have the same fields.


# Polars List {#sec-list}


## Making a Polars List

Make a Polars list from a Python list.

```{python}
#| label: make-list

pl.DataFrame(
    {
        "a": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "b": ["a", "b", "c"]
    }
)

```

Things to note:

- `a` is not a Python list, rather, it is a Polars list. 
- Can a Polars list hold a Polars list? Yes. It can hold any Polars object, even nested ones (list of list of list, etc.); provided that all elements in the cell have the same dtype.

Make a Polars list of lists from nested Python lists:

```{python}
#| label: make-list-of-list

pl.DataFrame(
    {
        "a": [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]],
        "b": ["a", "b"]
    }
)

```


More often you will not make a Polars list directly, but rather it will be the result of some operation:

1. When grouping without aggregation. 
2. When splitting a string into an unknown number of parts (a known number of parts will return a Polars struct, and I expect in the future a Polars array).
3. When wrapping a bunch of columns with `pl.concat_list()`.
4. When "imploding" (aka "collapsing") a column.  


### List from Aggregation

A Polars `group_by()` will actually create a Polars list internally, before applying the `.agg()` method.

```{python}
#| label: list-from-aggregation

df = pl.DataFrame(
    {
        "a": [1, 1, 2, 2, 3, 3],
        "b": [1, 2, 3, 4, 5, 6]
    }
)

df.group_by("a").agg(pl.col("b"))

```


### List From Splitting a String

```{python}
#| label: make-frame-with-string-to-split

df_strings = pl.DataFrame(
    {
        "a": ["apple, banana, orange", "apple, banana", "apple"],
    }
)

df_strings
```

Now split `a` into... whatever Polars decides to return. 
Hint: a Polars list of Polars strings.

```{python}
(
    df_strings
    .select(
        pl.col('a').str.split(", ")
        )
)
```


### List from Imploding a Column

```{python}
#| label: list-from-imploding

df.with_columns(pl.col("b").implode())

```

Try replacing `.with_columns()` with `.select()` and see what happens.


Implode within `group_by()`:

```{python}
#| label: implode-within-group

df.group_by("a").agg(pl.col("b"))

```

Implode within `over()`:

```{python}
#| label: implode-over
#| eval: false

df.with_columns(pl.col("b").over("a")) # no good

df.with_columns(pl.col("b").implode().over("a")) # no good

df.with_columns(pl.concat_list('b').over("a")) # no good
```

```{python}
(
    df
    .with_columns(
        pl.col("b").over("a", mapping_strategy="join").alias('over_join'),
        
        pl.col("b").over("a", mapping_strategy="explode").alias('over_explode'),
        
        pl.col("b").over("a", mapping_strategy="group_to_rows").alias('over_group_to_rows'),
    )
)
# good!

```

Things to note:

- The `mapping_strategy` argument governs the way the within-group result is propagated to each row. Try changing its value to see the effect.
- See [here](https://github.com/pola-rs/polars/pull/6487) for more context on how `.over()` works and why.


### List From `pl.concat_list()`

```{python}
#| label: list-from-concatenation

df.select(pl.concat_list([pl.col("a"), pl.col("b")]))

```

### List From `.list.concat()` 

```{python}
(
    df
    .with_columns(
        pl.concat_list([pl.col("a"), pl.col("b")]).alias('ab')
        )
    .select(pl.col('ab').list.concat(['a','a','a']))
    .to_pandas()
)

```


## Operating on List Elements

Arguably, the only motivation for using a Polars list, is for easy access to the list and its elements. 
You are thus welcome to think of equivalent operations, without the `.list.` API. 


An element can be accessed and transformed with `list.eval(pl.element())`, which will expose (almost) all the methods of a `pl.Expr()` object.

```{python}
#| label: list-eval-element

df_with_list = (
    df
    .with_columns(
        pl.concat_list(['a','b']).alias('ab')
    )
)


(
    df_with_list
    .select(
        pl.col('ab').list.eval(
            pl.element().add(1000)
            )
        )
)
```


Things to note:

- `.eval()` belongs to the `.list` namespace. 
- `pl.element()` is a selector that selects the elements of the list. It has almost all the methods available to `pl.col()`.
- `.eval()` currently cannot access other columns. You will probably discover this is a considerable limitation.


```{python}
#| label: list-eval-element-methods
print(dir(pl.element()))
```

`list.eval(pl.element())`, however, does not seem to add much that cannot be done with `pl.col()` methods.
The true power of the `.list` namespace is in its pre-defined methods.


## List Methods


### Accessing Elements

Recall: list elements are unnamed. They are thus accessed using indices (`iloc[]` style).

```{python}
#| label: list-get-and-gather

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.get(0).alias('get_1'), 
        pl.col('ab').list.gather([0]).alias('gather_1'), # returns a list
        pl.col('ab').list.slice(0,1).alias('slice_1'), # returns a list
        pl.col('ab').list.gather([0, 1]).alias('gather_2'),
        pl.col('ab').list.slice(0,2).alias('slice_2'),
        pl.col('ab').list.gather_every(2).alias('gather_every_2'),
        )
)

```

Things to note:

- What returns a list, and what returns a single element?


First, Head, Last, Tail

```{python}
#| label: list-methods

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.get(0).alias('first_1'),
        pl.col('ab').list.first().alias('first_2'),
        pl.col('ab').list.head(1).alias('first_3'), # returns a list
        pl.col('ab').list.last().alias('last_2'), 
        pl.col('ab').list.tail(1).alias('last_1'), # returns a list
        pl.col('ab').list.tail(1).list.first().alias('last_3'),
        )
)

```

Shift

```{python}
#| label: list-shift

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.shift(1).alias('shift_1'),
        pl.col('ab').list.shift(-1).alias('shift_-1'),
        )
)

```

Sample

```{python}
#| label: list-sample

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.sample(10, with_replacement=True).alias('sample'),
        )
    .to_pandas()
)

```


Careful where you put the `.sample()`!

```{python}
(
    df_with_list
    .select(
        pl.col('ab'),
        
        pl.col('ab').list.sample(10, with_replacement=True).alias('within_row'),
        
        pl.col('ab').sample(6, with_replacement=True).alias('within_column'),
    )
    .to_pandas()
)
```


### Statistical Aggregations


```{python}
#| label: list-statistical-aggregations

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.min().alias('min'),
        pl.col('ab').list.max().alias('max'),
        pl.col('ab').list.sum().alias('sum'),
        pl.col('ab').list.mean().alias('mean'),
        pl.col('ab').list.median().alias('median'),
        pl.col('ab').list.var().alias('var'),
        pl.col('ab').list.std().alias('std'),
        pl.col('ab').list.n_unique().alias('n_unique'),
        pl.col('ab').list.unique().alias('unique'),
        pl.col('ab').list.len().alias('len'),
        )
)

```

### Ordering and Ranking


```{python}
#| label: list-ordering-and-ranking

(
    df_with_list
    .select(
        pl.col('ab'),
        # pl.col('ab').list.rank().alias('rank'),
        pl.col('ab').list.sort().alias('sort'),
        pl.col('ab').list.reverse().alias('reverse'),
        )
)

```


### Sequence Operations


```{python}
#| label: list-diff

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.diff().alias('diff'),
        )
)

```

What about `.pct_change()`?
The following will not work, but see @sec-pct-change for a workaround.

```{python}
#| eval: false
(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.pct_change().alias('pct_change'),
        )

)
```


### Logical Aggregations

```{python}
#| label: list-logical-aggregations

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.eval(pl.element().eq(1)).list.any().alias('any'),
        pl.col('ab').list.eval(pl.element().eq(1)).list.all().alias('all'),
        )
)

```


### String Operations

- `list.join()` will join the strings in the list. I admit, I would have expected a `.list.str.concat()` instead. 
- `list.contains()` will check if the list contains a string.
- `list.count_matches()` will count the number of times a string appears in the list.


```{python}
#| label: list-strings

df_2_with_list = pl.DataFrame(
    {
        "ab": [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],
        'sep': ['@', '#', '$']
    }
)

(
    df_2_with_list
    .select(
        pl.col('ab'),
        pl.col('sep'),
        pl.col('ab').list.join(pl.col('sep')).alias('joined'),
        pl.col('ab').list.contains('a').alias('contains_a'),
        pl.col('ab').list.count_matches('a').alias('count_a'),
        )
)

```


### Filtering

Currently there is no `.list.filter()` method (see [issue](https://github.com/pola-rs/polars/issues/9189)).
There are, however, many workarounds. 

#### Option 1: Using `pl.element().filter()`

```{python}
#| label: list-filter

(
    df_with_list
    .select(
        pl.col('ab'),
        
        pl.col('ab').list.eval(
            pl.element()
            .filter(pl.element().gt(2)),
        )
        .alias('gt_2'),
    )
)

```


#### Option 2: Using `.list.gather()` + `pl.arg_where()`

Let's try to filter elements of a list based on the length of the string.

```{python}

# Sample data
data = pl.DataFrame({
    'string_list': [
        ["apple", "banana", "orange", "grape", "pear"],
        ["apple", "banana", "orange", "grape"],
        ["apple", "banana", "orange"],
        ["apple", "banana"],
        ["apple"],
        ["apple", "banana", "orange", "grape", "pear"],
        ["apple", "banana", "orange", "grape"],
    ],
    }
    )

```

In each list, keep only strings of length 4.

Use `.list.gather()` with `pl.arg_where()`, and explode the list to access the `pl.col().str` methods. 

```{python}
(
    data
    .with_row_index('i')
    .with_columns(
        pl.col('string_list').list.gather(
            pl.arg_where(
                pl.col('string_list').explode().str.len_chars().eq(4))
                )
                .over('i')
            )
    .drop('i')
)
```

Without exploding: Using `list.gather()` to filter, and `list.eval()` to access `pl.element().str` methods.

```{python}
(
    data
    .select(
        pl.col('string_list').list.gather(
            pl.col('string_list').list.eval(
                pl.arg_where(
                    pl.element().str.len_chars().eq(4)
                    )
                )
            )
        )
)

```

Explode once, and reuse the exploded expression twice via the walrus operator: once as the input to `.filter()`, and once to access the `.str` methods.

```{python}
(
    data
    .with_row_index('i')
    .with_columns(
        (
            (cntr := pl.col('string_list').explode())
            .filter(cntr.str.len_chars().eq(4))
            )
        .implode()
        .over('i')
        )
    .drop('i')
)

```

Things to note:

- If you are unfamiliar with Python's (not Polars') walrus operator `:=`, see [here](https://realpython.com/python-walrus-operator/).


### Missing Methods

I believe anything missing from the `.list` namespace can be done with `.list.eval(pl.element().method())`.


### Set Operations

- `.list.set_intersection()`
- `.list.set_union()`
- `.list.set_difference()`
- `.list.set_symmetric_difference()`

```{python}
(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.set_intersection([1,2,3]).alias('intersection'),
        
        pl.col('ab').list.set_union(['11']).alias('union'),

        pl.col('ab').list.set_union(['a']).alias('union_2'),
        
        pl.col('ab').list.set_difference([2,3]).alias('difference'),
        
        pl.col('ab').list.set_symmetric_difference([1,2,3]).alias('symmetric_difference'),
        )
)

```

Things to note:

- We can pass a Python list and not a Python set. 
- Set union with an inappropriate dtype (e.g. strings into a list of integers) will return Null. Is that what you would expect? 


### Polars List to Array or Struct

```{python}
#| label: list-to-struct

(
    df_with_list
    .select(
        pl.col('ab'),

        pl.col('ab').list.to_struct().alias('ab_struct'),

        pl.col('ab').list.to_array(width=2).alias('ab_array'),
        )
)

```


Things to note:

- An array has a fixed length; it is not inferred, you must specify it.


## Exploding and Imploding {#sec-list-explode}

**Explode**: from a list to a column.
```{python}
(
    df_with_list
    .select(
        pl.col('ab').list.explode().alias('ab_exploded'),
        )
)

```


**Implode**: from column to list.
```{python}
(
    df_with_list
    .select(
        pl.col('ab').implode().alias('ab_imploded'),
        
        pl.col('ab').list.explode().implode().alias('ab_exploded_imploded'),
        )
)
```

::: {.callout-important}
The following is a great trick to use when the `.list` namespace does not have the method you need.
:::

Explode within a group will convert the list to a column, on which you can operate using usual `pl.col()` methods.

```{python}
(
    df_with_list
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').list.explode() 
        .alias('this_is_actually_a_column'),
        )
)

```


There is also `df.explode()`, alongside the `pl.col().explode()` method.
```{python}
#| label: df-explode
df_with_list.explode('ab')
```


## Examples 


#### pct_change {#sec-pct-change}

In the absence of a `.list.pct_change()` method, we can use the following workaround.

```{python}
(
    df_with_list
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').list.explode().pct_change().alias('pct_change'),
        ) 
)

```

Could I have done that with the multiple column approach?
I would have needed something like a `pct_change_horizontal()` function, which does not exist.


#### ECDF {#sec-ecdf}

How to compute the ECDF within a list?
In the absence of `.list.ecdf()` and `pl.order_horizontal()`, we can use the following.


```{python}
#| label: list-ecdf

(
    df_with_list
    .select(
        pl.col('ab'),
        
        # Ranks
        pl.col('ab').list.eval(pl.element().rank()).alias('ranks'),
        
        # Option 1:ECDF with hard coded length (makes sense for an array)
        pl.col('ab').list.eval(pl.element().rank().truediv(2)).alias('ecdf_1'),
        
        # Option 2: ECDF with length of list. Will not work. 
        # pl.col('ab').list.eval(pl.element().rank().truediv(pl.col('ab').list.len())).alias('ecdf_2'),
    )
)

```


Things to note:

- `.list.eval(pl.element())` can do more than pointwise operations. Think about the `rank()` example.

- Option 2 will fail because `.list.eval()` cannot reference another column. See [issue](https://github.com/pola-rs/polars/issues/7210).

- `.rank()` has a `method` argument that can be used to specify how to deal with ties.


Here is another attempt, which does not assume that the length of the list is known in advance. 
The following does not currently work, but this may be fixed in the future. 


```{python}

(
    df_with_list
    .with_row_index()
    .group_by('index')
    .agg(
        (abrank:=pl.col('ab').explode().rank()).truediv(abrank.len()),
    )
)

```


#### ECDF wrt Other Column

@sec-ecdf showed how to compute the ECDF of a list. Say we want to evaluate the ECDF of column `a` wrt column `b`.

TODO


```{python}
#| eval: false

(
    df_with_list
    .with_columns(pl.col('a').implode())    
    .with_row_index()
    .group_by('index')
    .agg(
        (
            a_ge_b_ind := pl.col('a').explode().ge(pl.col('b'))
        )
    )
    .with_columns(
        a_ge_b_ind.list.sum().truediv(a_ge_b_ind.list.len())
    )

)

```


#### arg_max_horizontal

Note, there currently is no `pl.arg_max_horizontal()` method.
We will try to make one by concatenating columns to list, and then using the `.list.arg_max()`.
In particular, we will want the arg_max col name, and not the index. 

```{python}
#| label: arg-max-horizontal

(
    df
    # Identify the arg_max
    .with_columns(
        pl.concat_list([pl.col("a"), pl.col("b")])
        .list.arg_max()
        .alias('arg_max'),
        )
    # Return the name of the argmax
    .select(
        'a',
        'b',
        (
            pl.col('arg_max')
            .map_elements(lambda i: df.columns[i], return_dtype=pl.Utf8)
            .alias('arg_max_col_name')
        )
        
    )
)

```

Things to note:

- Can you find a more efficient way to do this? Replacing the lambda?


#### Quantile

Recall there is no `.list.quantile()` method; so we need to brew our own. 

```{python}
#| label: quantile-example

df_for_quantile = (
    df_with_list
    .select(
        pl.col('ab').list.sample(100, with_replacement=True),
    )
)

df_for_quantile.to_pandas()

```

Option 1: `group_by` then `explode`

```{python}
(
    df_for_quantile
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').explode().quantile(0.2)
        )
)
```

Option 2: `list.eval()`

```{python}
(
    df_for_quantile
    .select(
        pl.col('ab').list.eval(pl.element().quantile(0.2))
        .list.first() # to extract a float from a list
        .alias('quantile_0.2'),
    )


)
```


# Polars Array {#sec-array}

Polars arrays are Polars lists with a fixed length.

## Making a Polars Array

Make a Polars array from a Python list.

```{python}
#| label: make-array

pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)

```

Things to note:

- The dtype of the array is specified in the schema. Otherwise, a Polars list would have been inferred. 


## Array Methods

Currently, list methods and array methods seem to be largely the same, although array methods live in the `.arr` namespace rather than in `.list`. 
I expect the two sets of methods to diverge in the future, because more operations can be defined when assuming a fixed length.


# Polars Struct {#sec-struct}

Quoting [RhoSignal](https://www.rhosignal.com/posts/nested-dtypes/):
> pl.Struct type is a nested collection of columns. The pl.Struct is really just a way of having a nested namespace for columns. The underlying columns are just normal Polars Series.

Also from the [Arrow documentation](https://wesm.github.io/arrow-site-test/format/Layout.html):
> A struct is a nested type parameterized by an ordered sequence of relative types (which can all be distinct), called its fields.

So in summary:

1. The Python dict analogy is not a good one.
2. A Polars struct is a column which consists of another Polars dataframe.
3. There is no difference in performance between a Polars struct and a set of columns (unless an entire field is null. See the [Arrow documentation](https://wesm.github.io/arrow-site-test/format/Layout.html)).


##  Making a Polars Struct

Make a Polars struct from a Python dict.
But recall: there is nothing dict-like about the struct!

```{python}
#| label: make-struct

pl.DataFrame(
    {
        "a": [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}],
        "b": ["a", "b", "c"]
    }
)

```


More often you will not make a Polars struct from Python dicts, but rather you will create one from a Polars DataFrame:

1. By directly creating a struct.
2. As the output of some operation, like `pl.Expr().value_counts()`, or all the horizontal cumulators like `pl.cum_sum_horizontal()`, `pl.cum_reduce()`, `pl.cum_fold()`, ...
3. From a Polars list. 
4. By splitting a string to a known length. 


### Making a Polars Struct Directly

`pl.struct()` with existing names:

```{python}
#| label: make-struct-directly
df_with_struct = df.select(
    pl.struct(['a','b'])
    .alias('struct')
    )

df_with_struct
```

Verify that the column is a struct:

```{python}
#| label: verify-struct
df_with_struct.schema
```

`pl.struct()` with field naming:

```{python}
df.select(pl.struct(aaa=pl.col('a'), bbb=pl.col('b'))).schema
```


### Struct As Output

```{python}
df.select(pl.col('a').value_counts())
# df.select(pl.col('a').value_counts().struct.json_encode())

```


### Struct From List

When a list has constant length, which you want to access by name, it makes sense to convert it to a struct.

```{python}
#| label: struct-from-list
df_with_list.select(pl.col('ab').list.to_struct().struct.json_encode())
```


### Split a String

If you want to govern the length of the output of a split, use `.str.splitn()` or `.str.split_exact()` instead of `.str.split()`.

```{python}
#| label: struct-from-splitn

(
    df_strings
    .select(
        pl.col('a').str.splitn(",",4)
        )
)

```

```{python}
#| label: struct-from-split-exact

(
    df_strings
    .select(
        pl.col('a').str.split_exact(",",4)
        )
)

```


## Struct Methods


### Extracting Elements

Getting a field is actually getting a column from a sub-frame.

```{python}
#| label: struct-get
(
    df_with_struct
    .select(
        pl.col('struct').struct.field('a'),
        pl.col('struct').struct.field('b'),
        )
)
```


Convert a struct to text in JSON format:

```{python}
#| label: struct-to-json
(
    df_with_struct
    .select(
        pl.col('struct').struct.json_encode()
        )
)

```


### The New (>= 0.20.27) Way to Access Fields

As of version 0.20.27, you can access fields using a `.with_fields()` context, akin to the `.with_columns()` context.

```{python}
(
    df_with_struct
    .select(
        pl.col('struct').struct.with_fields(
            new_field = pl.col('struct').struct.field('a') * 
            pl.col('struct').struct.field('b')
            )
        .struct.json_encode()
    )
)
```


## Struct to List

There is no `.struct.to_list()` method. 
This makes sense, since a struct does not require all fields to have the same dtype. 
In the case where all fields have the same dtype, you can use the following.

```{python}
#| label: struct-to-list
(
    df_with_struct
    .unnest('struct') #from struct to columns
    .select(pl.concat_list(pl.all())) # from columns to list
)

```

You can also consider element-by-element extractions.

```{python}
#| label: struct-to-list-element-by-element
(
    df_with_struct
    .select(
        pl.concat_list([
            pl.col('struct').struct.field('a'), 
            pl.col('struct').struct.field('b')
            ])
        )
)

```


## Examples


### Verifying Multi-Column Uniques (hashing)

The `pl.Expr.is_unique()` method does not accept multiple columns.
We can use a struct to group the columns, and then apply the single-column `.is_unique()` method.

```{python}
#| label: multi-column-uniques

(
    df
    .select(
        pl.struct(['a','b']).alias('struct')
        )
    .select(
        pl.col('struct').is_unique().alias('unique_ind'),
        pl.col('struct').hash().alias('hash'),
        )
)

```


# Discussion


### Q: Can I have a list of list of lists (i.e. more than 2 layers of nesting)?  

Yes. 

### Q: Can I use list methods on a struct (e.g. argmax)?

No. But you can convert the struct to a list, and then use list methods.


### Q: When a Polars List and when a set of columns?

The difference is mostly syntactic, so it is a matter of preference and convenience. See the following example. The same goes for Polars arrays. 

```{python}
# make a frame with lists of random length

max_length = int(1e6)

def make_list():
    
    result = pl.Series(list(string.ascii_letters)).sample(random.randint(1, max_length), with_replacement=True)
    
    return result

# make_list()
```


```{python}
df = pl.DataFrame(
    {
        "a": [make_list(), make_list(), make_list()],
        "b": ["a", "b", "c"]
    }
)

```

```{python}
df.estimated_size(unit='mb')
```

```{python}
df_2 = df.select(pl.col('a').list.to_struct(n_field_strategy='max_width')).unnest("a")
df_2.estimated_size(unit='mb')
```

```{python}
# %timeit df_2.select(pl.col('a').str.len_chars())
```


### Q: Can I export to CSV? To Parquet? 


```{python}
#| eval: false
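# Write into the operating system's temporary directory
import tempfile
temp_dir = tempfile.gettempdir()
csv_file = os.path.join(temp_dir, 'df_with_list.csv')
parquet_file = os.path.join(temp_dir, 'df_with_list.parquet')

df_with_list.write_csv(csv_file) # expected to raise: CSV has no representation for nested dtypes
df_with_list.write_parquet(parquet_file) # Parquet supports nested dtypes
pl.read_parquet(parquet_file).schema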
```


What is the dtype when I export to Pandas?
```{python}
#| label: export-to-pandas

df_with_list_pandas = df_with_list.to_pandas()
df_with_list_pandas.info()
```


### Q: What is the namespace for a list of lists?