---
title: "Nested dtypes"

execute:
  eval: true
  warning: true
  error: true
  keep-ipynb: true
  cache: true
jupyter: python3
pdf-engine: lualatex
# theme: pandoc
html:
    code-tools: true
    fold-code: false
    author: Jonathan D. Rosenblatt
    date: 04-12-2024
    toc: true
    number-sections: true
    number-depth: 3
    embed-resources: true
---

```{python}
#| echo: false
# %pip install --upgrade pip
# %pip install --upgrade polars
# %pip install --upgrade pyarrow
# %pip install --upgrade Pandas
# %pip install --upgrade plotly
# %pip freeze > requirements.txt
```

```{python}
#| label: setup-env

# %pip install -r requirements.txt
```

```{python}
#| label: Polars-version
%pip show Polars # check your Polars version
```

```{python}
#| label: Pandas-version
%pip show Pandas # check your Pandas version
```

```{python}
#| label: preliminaries

import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs

import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline 
import matplotlib.pyplot as plt
from datetime import datetime

# Following two lines only required to view plotly when rendering from VScode. 
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"
```

Which Polars modules and dependencies are installed?
```{python}
#| label: show-versions
pl.show_versions()
```

How many cores are available for parallelism?
```{python}
#| label: show-cores
pl.thread_pool_size()
```


# Introduction

Recall the nested dtypes:

1.  Polars Struct: Like a Python dict within a cell; Multiple named elements.
2.  Polars List: Similar to a Python list within a cell; Multiple unnamed elements. Unlike the Python list, in a Polars list, all elements must have the same dtype.
3.  Polars Array: Like a Polars list, but with a fixed length for all cells in the column.


# Polars List {#sec-list}


## Making a Polars List

Make a Polars list from a Python list.

```{python}
#| label: make-list

pl.DataFrame(
    {
        "a": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "b": ["a", "b", "c"]
    }
)

```

Things to note:

- `a` is not a Python list, rather, it is a Polars list. There are differences between the two. For instance, all elements in a Polars list must have the same dtype. Also, the Polars list is a columnar data structure, which is more efficient for certain operations. Finally, the Polars list is a first-class citizen in the Polars API, with its own methods and functions.
- Can a Polars list hold a Polars list? Yes. See the next example.

Make a Polars list of Polars lists:

```{python}
#| label: make-list-of-list

pl.DataFrame(
    {
        "a": [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]],
        "b": ["a", "b"]
    }
)

```


More often you will not make a Polars list directly, but rather you will create one from a Polars DataFrame:

1. When grouping without aggregation. 
2. When wrapping a bunch of columns with `pl.concat_list()`.
3. When "imploding" (aka "collapsing") a `pl.Expr`.

List from aggregation:
```{python}
#| label: list-from-aggregation

df = pl.DataFrame(
    {
        "a": [1, 1, 2, 2, 3, 3],
        "b": [1, 2, 3, 4, 5, 6]
    }
)

df.group_by("a").agg(pl.col("b"))

```

List from concatenation of `pl.Expr`s:

```{python}
#| label: list-from-concatenation

df.select(pl.concat_list([pl.col("a"), pl.col("b")]))

```

List from concatenation of Polars lists:

```{python}
(
    df
    .select(
        pl.concat_list([pl.col("a"), pl.col("b")]).alias('ab')
        )
    .with_columns(pl.col('ab').list.concat('ab'))
    .to_pandas()
)

```

List from imploding a `pl.Expr`:

```{python}
#| label: list-from-imploding

df.with_columns(pl.col("b").implode())

```

Implode within group:

```{python}
#| label: implode-within-group

df.group_by("a").agg(pl.col("b"))

```

Implode over:

```{python}
#| label: implode-over
#| eval: false
df.with_columns(pl.col("b").over("a")) # no good
df.with_columns(pl.col("b").implode().over("a")) # no good
df.with_columns(pl.concat_list('b').over("a")) # no good
```

```{python}
df.with_columns(pl.col("b").over("a", mapping_strategy="join")) # good!
```

See [here](https://github.com/pola-rs/polars/pull/6487) for more context. 


###  More on `.over(mapping_strategy=...)`
TODO


## Operating on List Elements

The most general way is to use `list.eval(pl.element())`:

```{python}
#| label: list-eval-element

df_with_list = df.with_columns(pl.concat_list([pl.col("a"), pl.col("b")]).alias('ab'))

(
    df_with_list
    .select(
        pl.col('ab').list.eval(pl.element().add(1000))
        )
)
```

Things to note:

- `.eval()` belongs to the `.list` namespace. 
- `pl.element()` is an expression that refers to the elements of the list. It has almost all the methods available to `pl.col()`.

```{python}
#| label: list-eval-element-methods
print(dir(pl.element()))
```


## List Methods

`.list.eval(pl.element())` applies an expression to the elements within each list. 
If you want to operate on the list as a whole, you can either use existing [list methods](https://docs.pola.rs/py-polars/html/reference/expressions/list.html), or use `pl.col()` methods, after exploding the list (see @sec-list-explode).

We now demonstrate some of the list methods.


### Selecting Elements

Get, Gather, Slice, Gather_Every

```{python}
#| label: list-get-and-gather

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.get(0).alias('get_1'), 
        pl.col('ab').list.gather([0]).alias('gather_1'), # returns a list
        pl.col('ab').list.slice(0,1).alias('slice_1'), # returns a list
        pl.col('ab').list.gather([0, 1]).alias('gather_2'),
        pl.col('ab').list.slice(0,2).alias('slice_2'),
        pl.col('ab').list.gather_every(2).alias('gather_every_2'),
        )
)

```


First, Head, Last, Tail

```{python}
#| label: list-methods

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.get(0).alias('first_1'),
        pl.col('ab').list.first().alias('first_2'),
        pl.col('ab').list.head(1).alias('first_3'), # returns a list
        pl.col('ab').list.last().alias('last_2'), 
        pl.col('ab').list.tail(1).alias('last_1'), # returns a list
        pl.col('ab').list.tail(1).list.first().alias('last_3'),
        )
)

```

Shift

```{python}
#| label: list-shift

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.shift(1).alias('shift_1'),
        pl.col('ab').list.shift(-1).alias('shift_-1'),
        )
)

```

Sample

```{python}
#| label: list-sample

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.sample(10, with_replacement=True).alias('sample'),
        )
    .to_pandas()
)

```


Careful where you put the `.sample()`!

```{python}
(
    df_with_list
    .select(
        pl.col('ab').list.sample(10, with_replacement=True).alias('within_row'),
        pl.col('ab').sample(6, with_replacement=True).alias('within_column'),
    )
    .to_pandas()
)
```


### Filtering

```{python}
#| label: list-filter

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.eval(pl.element().gt(2)).alias('gt_2'),
        # pl.col('ab').list.filter(pl.col('ab').list.gt(5)).alias('gt_5'), # not implemented yet
        )
)

```


### Statistical Aggregations

`.arg_max`, `.arg_min`, `.var`, `.std`, `.n_unique`, `.unique`, `.min`, `.max`, `.sum`, `.mean`, `.median`, `.len`

```{python}
#| label: list-statistical-aggregations

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.min().alias('min'),
        pl.col('ab').list.max().alias('max'),
        pl.col('ab').list.sum().alias('sum'),
        pl.col('ab').list.mean().alias('mean'),
        pl.col('ab').list.median().alias('median'),
        pl.col('ab').list.var().alias('var'),
        pl.col('ab').list.std().alias('std'),
        pl.col('ab').list.n_unique().alias('n_unique'),
        pl.col('ab').list.unique().alias('unique'),
        pl.col('ab').list.len().alias('len'),
        )
)

```

### Ordering and Ranking

`.sort`, `.reverse`

```{python}
#| label: list-ordering-and-ranking

(
    df_with_list
    .select(
        pl.col('ab'),
        # pl.col('ab').list.rank().alias('rank'),
        pl.col('ab').list.sort().alias('sort'),
        pl.col('ab').list.reverse().alias('reverse'),
        )
)

```


### Sequence Operations

.diff

```{python}
#| label: list-diff

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.diff().alias('diff'),
        )
)

```


### Logical Aggregations

```{python}
#| label: list-logical-aggregations

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.eval(pl.element().eq(1)).list.any().alias('any'),
        pl.col('ab').list.eval(pl.element().eq(1)).list.all().alias('all'),
        )
)

```


### String Operations

```{python}
#| label: list-strings

df_2_with_list = pl.DataFrame(
    {
        "ab": [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],
        'sep': ['@', '#', '$']
    }
)

(
    df_2_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.join(pl.col('sep')).alias('joined'),
        pl.col('ab').list.contains('a').alias('contains_a'),
        pl.col('ab').list.count_matches('a').alias('count_a'),
        )
)

```


### Missing

I believe everything can be done with `.list.eval(pl.element().method())`.


### Set Operations

- `.list.set_intersection`
- `.list.set_union`
- `.list.set_difference`
- `.list.set_symmetric_difference`


### Exporting to Other Nested Dtypes

```{python}
#| label: list-to-struct

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.to_struct().alias('ab_struct'),
        pl.col('ab').list.to_array(width=2).alias('ab_array'),
        )
)

```


### Examples

#### ECDF 

```{python}
#| label: list-ecdf

(
    df_with_list
    .select(
        pl.col('ab').alias('raw'),
        pl.col('ab').list.eval(pl.element().rank()).alias('ranks'),
        pl.col('ab').list.eval(pl.element().rank().truediv(2)).alias('ecdf_1'),
        # pl.col('ab').list.eval(pl.element().rank().truediv(pl.col('ab').list.len())).alias('ecdf_2'),
    )
)

```

Things to note:

- Currently, `.list.eval()` cannot reference another column. See [issue](https://github.com/pola-rs/polars/issues/7210).
- `.list.eval(pl.element())` can do more than pointwise operations. Think about the `rank()` example.
- `.rank()` has a `method` argument that can be used to specify how to deal with ties.


#### arg_max_horizontal

Note that there is currently no `pl.arg_max_horizontal()` function.
We will try to make one by concatenating columns into a list, and then using `.list.arg_max()`.
In particular, we want the name of the arg_max column, not its index. 

```{python}
#| label: arg-max-horizontal

(
    df
    .with_columns(
        pl.concat_list([pl.col("a"), pl.col("b")]).list.arg_max().alias('arg_max'),
        )
    .select(
        'a','b',
        (
            pl.col('arg_max')
            .map_elements(lambda i: df.columns[i], return_dtype=pl.Utf8)
            .alias('arg_max_col_name')
        )
        
    )
)

```

Things to note:

- Can you find a more efficient way to do this? Maybe using a struct?


## `.list.explode()` {#sec-list-explode}

Explode: "explode" the list onto a column.
```{python}
(
    df_with_list
    .select(
        pl.col('ab').list.explode().alias('ab_exploded'),
        )
)

```

Implode: "implode" the list back into a list.
```{python}
(
    df_with_list
    .select(
        pl.col('ab').implode().alias('ab_imploded'),
        pl.col('ab').list.explode().implode().alias('ab_exploded_imploded'),
        )
)
```

Group by a row index to work on each list as if it were a column.

```{python}
(
    df_with_list
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').list.explode().alias('this_is_actually_a_column'),
        )
)

```


There is also `df.explode()`:
```{python}
#| label: df-explode
df_with_list.explode('ab')
```


### Examples 

#### Quantile

Recall there is no `.list.quantile()` method; so we need to brew our own. 

```{python}
#| label: quantile-example

df_for_quantile = (
    df_with_list
    .select(
        pl.col('ab').list.sample(100, with_replacement=True),
    )
)

df_for_quantile.to_pandas()

```

```{python}
(
    df_for_quantile
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').explode().quantile(0.2)
        )
)
```

Things to note:

1. The explode within row index is a powerful trick to apply any `pl.col()` method to a list.
2. There is another way to apply `pl.col()` methods to a list... With `.list.eval(pl.element())` as in the next example.


```{python}
(
    df_for_quantile
    .select(
        pl.col('ab').list.eval(pl.element().quantile(0.2))
        .list.first() # to extract a float from a list
        .alias('quantile_0.2'),
    )


)
```


## Polars Lists of Polars Lists

```{python}
#| label: list-of-list

df_with_list_of_list = pl.DataFrame(
    {
        "a": [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]],
        "b": ["a", "b"]
    }
)

df_with_list_of_list

```

List operations will work one level deep.

```{python}
#| label: list-of-list-inner

(
    df_with_list_of_list
    .select(
        pl.col('a'),
        pl.col('a').list.get(0).alias('get_1'), 
        # pl.col('a').list.gather([0]).alias('gather_1'), # returns a list
        # pl.col('a').list.slice(0,1).alias('slice_1'), # returns a list
        # pl.col('a').list.gather([0, 1]).alias('gather_2'),
        # pl.col('a').list.slice(0,2).alias('slice_2'),
        # pl.col('a').list.gather_every(2).alias('gather_every_2'),
        )
)

```


## Examples

### Filter A List With A Predicate


```{python}

# Sample data
data = {
    'string_list': [
        ["apple", "banana", "orange", "grape", "pear"],
        ["apple", "banana", "orange", "grape"],
        ["apple", "banana", "orange"],
        ["apple", "banana"],
        ["apple"],
        ["apple", "banana", "orange", "grape", "pear"],
        ["apple", "banana", "orange", "grape"],
    ],
        }

# Create a polars DataFrame
df = pl.DataFrame(data)
df
```

In each list, keep only strings of length 4.

Syntax 1:

```{python}
(
    df
    .with_row_index('i')
    .with_columns(
        pl.col('string_list').list.gather(
            pl.arg_where(
                pl.col('string_list').explode().str.len_chars().eq(4))
                )
                .over('i')
            )
    .drop('i')
)
```

Syntax 2:

```{python}
(
    df
    .with_row_index('i')
    .with_columns(
        (
            (cntr:=pl.col('string_list').explode())
            .filter(cntr.str.len_chars().eq(4))
            )
        .implode()
        .over('i')
        )
    .drop('i')
)

```

Syntax 3:

```{python}
(
    df
    .select(
        pl.col('string_list').list.gather(
            pl.col('string_list').list.eval(
                pl.arg_where(
                    pl.element().str.len_chars().eq(4)
                    )
                )
            )
        )
)

```


# Polars Array {#sec-array}

Polars Arrays are Polars lists with a fixed length.

## Making a Polars Array

Make a Polars array from a Python list.

```{python}
#| label: make-array

pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)

```

Things to note:

- The dtype of the array is specified in the schema. Otherwise, a Polars list would have been inferred. 


## Array Methods

Currently, list methods (in the `.list` namespace) and array methods (in the `.arr` namespace) seem to be largely the same. 
I expect the two sets of methods to diverge in the future, because more operations can be defined when assuming a fixed length.


# Polars Struct {#sec-struct}

##  Making a Polars Struct

Make a Polars struct from a Python dict.

```{python}
#| label: make-struct

pl.DataFrame(
    {
        "a": [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}],
        "b": ["a", "b", "c"]
    }
)

```

What are the differences between a Polars struct and a Python dict? 

1. The Polars struct must have the same keys (called `fields`) in all rows. 
2. A Polars struct is a first-class citizen in the Polars API, with its own methods and functions.
3. Besides these, I am still figuring it out. 

What are the differences between a Polars struct and a Polars list?

1. A Polars struct has named elements, while a Polars list has unnamed elements.
2. Because the struct has the same fields in all rows, it must have the same length in all rows. This is not the case for a Polars list.

More often you will not make a Polars struct directly, but rather you will create one from a Polars DataFrame:

1. By directly creating a struct.
2. As the output of some operation, like `pl.Expr().value_counts()`, or all the horizontal cumulators like `pl.cum_sum_horizontal()`, `pl.cum_reduce()`, `pl.cum_fold()`, ...
3. From a Polars list. 


### Making a Polars Struct Directly

```{python}
#| label: make-struct-directly
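# `df` was overwritten in the list-filtering example above; rebuild the two-column frame.
df = pl.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})
df_with_struct = df.select(pl.struct(['a','b']).alias('struct'))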
df_with_struct
```

Verify that the column is a struct:

```{python}
#| label: verify-struct
df_with_struct.schema
```

Alternative constructor:

```{python}
df.select(pl.struct(aaa=pl.col('a'), bbb=pl.col('b'))).schema
```


### Struct As Output

```{python}
df.select(pl.col('a').value_counts())
# df.select(pl.col('a').value_counts()).schema
```


### Struct From List

```{python}
#| label: struct-from-list
df_with_list.select(pl.col('ab').list.to_struct())
```


## Struct Methods


### Extracting Elements

```{python}
#| label: struct-get
(
    df_with_struct
    .select(
        pl.col('struct').struct.field('a'),
        pl.col('struct').struct.field('b'),
        )
)
```

To text in JSON format:

```{python}
#| label: struct-to-json
(
    df_with_struct
    .select(
        pl.col('struct').struct.json_encode()
        )
)

```


## Struct to List

There is no `.struct.to_list()` method. 
This makes sense, since a struct does not require all fields to have the same dtype. 
In the case where all fields have the same dtype, you can use the following.

```{python}
#| label: struct-to-list
(
    df_with_struct
    .unnest('struct')
    .select(pl.concat_list(pl.all()))
)

```

You can also consider element-by-element extractions.

```{python}
#| label: struct-to-list-element-by-element
(
    df_with_struct
    .select(
        pl.concat_list([
            pl.col('struct').struct.field('a'), 
            pl.col('struct').struct.field('b')
            ])
        )
)

```


# Discussion

## General 

Q: Can I have a list of lists of lists (i.e. more than 2 layers of nesting)?  
A: Yes. The inner dtype of a Polars list can itself be a list, so nesting can go deeper than two levels.

Q: Can I use list methods on a struct (e.g. argmax)?  
A: No. But you can convert the struct to a list, and then use list methods.


Q: When should I use a Polars Array, and when a set of columns?  
A: I am still thinking about it. 


## Exporting

Can I export to CSV? To Parquet? 
```{python}
# build a path inside the operating system's temporary directory
import tempfile
temp_file = os.path.join(tempfile.gettempdir(), 'something.csv')

# CSV has no nested dtypes, so expect this to raise for the list column
df_with_list.write_csv(temp_file)
```


What is the dtype when I export to Pandas?
```{python}
#| label: export-to-pandas

df_with_list_pandas = df_with_list.to_pandas()
df_with_list_pandas.info()
```