Kartothek metadata does not distinguish int64 and Int64 #410

mlondschien · 2021-02-12T12:12:26Z

In [1]: from functools import partial
   ...: 
   ...: import pandas as pd
   ...: import storefact
   ...: from kartothek.core.common_metadata import empty_dataframe_from_schema
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
   ...: 
   ...: store = partial(storefact.get_store_from_url, "hfs:///tmp")
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "i": [0, 1],
   ...:         "I": [0, pd.NA],
   ...:         "o": ["a", None],
   ...:         "s": ["b", pd.NA],
   ...:         "b": [True, False],
   ...:         "B": [True, pd.NA],
   ...:     }
   ...: )
   ...: df["I"] = df["I"].astype("Int64")
   ...: df["s"] = df["s"].astype("string")
   ...: df["B"] = df["B"].astype("boolean")
   ...: 
   ...: df.dtypes
Out[1]: 
i      int64
I      Int64
o     object
s     string
b       bool
B    boolean
dtype: object

In [2]: dm = store_dataframes_as_dataset(
   ...:     dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
   ...: )
   ...: 
   ...: df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
   ...: df_meta.dtypes
Out[2]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [3]: ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
   ...: ddf._meta.dtypes
Out[3]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [4]: ddf.compute().dtypes
Out[4]: 
B    boolean
I      Int64
b       bool
i      int64
o     object
s     string
dtype: object

Copy pastable below:

from functools import partial

import pandas as pd
import storefact
from kartothek.core.common_metadata import empty_dataframe_from_schema
from kartothek.io.dask.dataframe import read_dataset_as_ddf
from kartothek.io.eager import read_table, store_dataframes_as_dataset

store = partial(storefact.get_store_from_url, "hfs:///tmp")

df = pd.DataFrame(
    {
        "i": [0, 1],
        "I": [0, pd.NA],
        "o": ["a", None],
        "s": ["b", pd.NA],
        "b": [True, False],
        "B": [True, pd.NA],
    }
)
df["I"] = df["I"].astype("Int64")
df["s"] = df["s"].astype("string")
df["B"] = df["B"].astype("boolean")

df.dtypes

dm = store_dataframes_as_dataset(
    dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
)

df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
df_meta.dtypes

ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
ddf._meta.dtypes

ddf.compute().dtypes

The dtype is incorrectly stored in the kartothek metadata:

In [12]: store().get(dataset/table/_common_metadata)
Out[12]: b'PAR1\x15\x04\x19|5\x00\x18\x06schema\x15\x0c\x00\x15\x00%\x02\x18\x01B\x00\x15\x04%\x02\x18\x01I\x00\x15\x00%\x02\x18\x01b\x00\x15\x04%\x02\x18\x01i\x00\x15\x0c%\x02\x18\x01o%\x00L\x1c\x00\x00\x00\x15\x0c%\x02\x18\x01s%\x00L\x1c\x00\x00\x00\x16\x00\x19\x0c\x19,\x18\x06pandas\x18\x99\x07{"column_indexes": [{"field_name": null, "metadata": {"encoding": "UTF-8"}, "name": null, "numpy_type": "object", "pandas_type": "unicode"}], "columns": [{"field_name": "B", "metadata": null, "name": "B", "numpy_type": "boolean", "pandas_type": "bool"}, {"field_name": "I", "metadata": null, "name": "I", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "bool", "pandas_type": "bool"}, {"field_name": "i", "metadata": null, "name": "i", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "o", "metadata": null, "name": "o", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "s", "metadata": null, "name": "s", "numpy_type": "string", "pandas_type": "unicode"}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 2}], "pandas_version": "1.1.3"}\x00\x18\x0cARROW:schema\x18\xe0\r/////yAFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAANADAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAJkDAAB7ImNvbHVtbl9pbmRleGVzIjogW3siZmllbGRfbmFtZSI6IG51bGwsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifSwgIm5hbWUiOiBudWxsLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNvbHVtbnMiOiBbeyJmaWVsZF9uYW1lIjogIkIiLCAibWV0YWRhdGEiOiBudWxsLCAibmFtZSI6ICJCIiwgIm51bXB5X3R5cGUiOiAiYm9vbGVhbiIsICJwYW5kYXNfdHlwZSI6ICJib29sIn0sIHsiZmllbGRfbmFtZSI6ICJJIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiSSIsICJudW1weV90eXBlIjogImludDY0IiwgInBhbmRhc190eXBlIjogImludDY0In0sIHsiZmllbGRfbmFtZSI6ICJiIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiYiIsICJudW1weV90eXBlIjogImJvb2wiLCAicGFuZGFzX3R5cGUiOiAiYm9vbCJ9LCB7ImZpZWxkX25hbWUiOiAiaSIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogImkiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCJ9LCB7ImZpZWxkX25hbWUiOiAibyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogIm8iLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9LCB7ImZpZWxkX25hbWUiOiAicyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogInMiLCAibnVtcHlfdHlwZSI6ICJzdHJpbmciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJpbmRleF9jb2x1bW5zIjogW3sia2luZCI6ICJyYW5nZSIsICJuYW1lIjogbnVsbCwgInN0YXJ0IjogMCwgInN0ZXAiOiAxLCAic3RvcCI6IDJ9XSwgInBhbmRhc192ZXJzaW9uIjogIjEuMS4zIn0AAAAGAAAA9AAAAKwAAACEAAAAVAAAACwAAAAEAAAANP///wAAAQUUAAAADAAAAAQAAAAAAAAAJP///wEAAABzAAAAWP///wAAAQUUAAAADAAAAAQAAAAAAAAASP///wEAAABvAAAAfP///wAAAQIcAAAADAAAAAQAAAAAAAAAsP///wAAAAFAAAAAAQAAAGkAAACo////AAABBhQAAAAMAAAABAAAAAAAAACY////AQAAAGIAAADM////AAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgAAAAAAAABQAAAAAEAAABJAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEGGAAAABAAAAAEAAAAAAAAAAQABAAEAAAAAQAAAEIAAAAAAAAA\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19l\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x00#\x0b\x00\x00PAR1'

The text was updated successfully, but these errors were encountered:

fjetter · 2021-02-17T13:01:30Z

IIUC, this affects only the schema but we're able to read the data properly?

fjetter · 2021-02-17T13:05:37Z

Looks like this is because arrow doesn't distinguish between the two and we're defining the _common_metadata purely via the arrow types

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"Int": pd.Series([1, pd.NA], dtype="Int64")})

In [4]: schema = pa.Schema.from_pandas(df)

In [5]: schema
Out[5]:
Int: int64

I'd be curious if there are any practical implications for this discrepancy or if this is rather a 'formal' error

mlondschien · 2021-02-17T18:52:59Z

We use Kartothek datasets to cache computation results. For validation, we check if the Kartothek metadata matches the expected dtypes. We've not had any issues loading data with such columns yet. So this is mostly an inconvenience (or "formal" error).

fjetter · 2021-02-18T09:34:47Z

That's very interesting and I'm positively surprised that this works generally. afaik, we do not have any tests using the nullables types in kartothek, yet (but it's about time). If you want to contribute on that front, I suggest to start with adding some nullable ints/bools to

kartothek/kartothek/serialization/testing.py

Line 27 in 6514c1f

def get_dataframe_not_nested(n):

and see what breaks

If I understand this correctly, it also only affects integers, correct? bool(eans) are correctly reconstructed.

I assume this is connected to us stripping the metadata from the schema. I'm just wondering why it works for bools

kartothek/kartothek/core/common_metadata.py

Lines 612 to 614 in 6514c1f

    
           reference_to_compare = _strip_columns_from_schema( 
        
               reference, null_columns 
        
           ).remove_metadata()

 dm.table_meta["table"].internal().empty_table().to_pandas().dtypes
Out[5]:
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

mlondschien · 2021-02-25T14:38:11Z

We're not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?

fjetter · 2021-03-11T10:22:05Z

Is there a specific reason you are not testing for categoricals

We do have tests for categoricals but not systematically as part of the "all types dataframes"

mlondschien mentioned this issue Feb 25, 2021

Test serialization of nullable types Int64 and string #425

Draft

mlondschien mentioned this issue Apr 19, 2021

BUG: the __from_arrow__ conversion for numeric arrays broken if buffer size doesn't match pandas-dev/pandas#40896

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kartothek metadata does not distinguish int64 and Int64 #410

Kartothek metadata does not distinguish int64 and Int64 #410

mlondschien commented Feb 12, 2021

fjetter commented Feb 17, 2021

fjetter commented Feb 17, 2021 •

edited

Loading

mlondschien commented Feb 17, 2021

fjetter commented Feb 18, 2021

mlondschien commented Feb 25, 2021

fjetter commented Mar 11, 2021

Kartothek metadata does not distinguish int64 and Int64 #410

Kartothek metadata does not distinguish int64 and Int64 #410

Comments

mlondschien commented Feb 12, 2021

fjetter commented Feb 17, 2021

fjetter commented Feb 17, 2021 • edited Loading

mlondschien commented Feb 17, 2021

fjetter commented Feb 18, 2021

mlondschien commented Feb 25, 2021

fjetter commented Mar 11, 2021

fjetter commented Feb 17, 2021 •

edited

Loading