Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kartothek metadata does not distinguish int64 and Int64 #410

Open
mlondschien opened this issue Feb 12, 2021 · 6 comments
Open

Kartothek metadata does not distinguish int64 and Int64 #410

mlondschien opened this issue Feb 12, 2021 · 6 comments

Comments

@mlondschien
Copy link
Contributor

In [1]: from functools import partial
   ...: 
   ...: import pandas as pd
   ...: import storefact
   ...: from kartothek.core.common_metadata import empty_dataframe_from_schema
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
   ...: 
   ...: store = partial(storefact.get_store_from_url, "hfs:///tmp")
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "i": [0, 1],
   ...:         "I": [0, pd.NA],
   ...:         "o": ["a", None],
   ...:         "s": ["b", pd.NA],
   ...:         "b": [True, False],
   ...:         "B": [True, pd.NA],
   ...:     }
   ...: )
   ...: df["I"] = df["I"].astype("Int64")
   ...: df["s"] = df["s"].astype("string")
   ...: df["B"] = df["B"].astype("boolean")
   ...: 
   ...: df.dtypes
Out[1]: 
i      int64
I      Int64
o     object
s     string
b       bool
B    boolean
dtype: object

In [2]: dm = store_dataframes_as_dataset(
   ...:     dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
   ...: )
   ...: 
   ...: df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
   ...: df_meta.dtypes
Out[2]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [3]: ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
   ...: ddf._meta.dtypes
Out[3]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [4]: ddf.compute().dtypes
Out[4]: 
B    boolean
I      Int64
b       bool
i      int64
o     object
s     string
dtype: object

Copy pastable below:

from functools import partial

import pandas as pd
import storefact
from kartothek.core.common_metadata import empty_dataframe_from_schema
from kartothek.io.dask.dataframe import read_dataset_as_ddf
from kartothek.io.eager import read_table, store_dataframes_as_dataset

store = partial(storefact.get_store_from_url, "hfs:///tmp")

df = pd.DataFrame(
    {
        "i": [0, 1],
        "I": [0, pd.NA],
        "o": ["a", None],
        "s": ["b", pd.NA],
        "b": [True, False],
        "B": [True, pd.NA],
    }
)
df["I"] = df["I"].astype("Int64")
df["s"] = df["s"].astype("string")
df["B"] = df["B"].astype("boolean")

df.dtypes

dm = store_dataframes_as_dataset(
    dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
)

df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
df_meta.dtypes

ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
ddf._meta.dtypes

ddf.compute().dtypes

The dtype is incorrectly stored in the kartothek metadata:

In [12]: store().get(dataset/table/_common_metadata)
Out[12]: b'PAR1\x15\x04\x19|5\x00\x18\x06schema\x15\x0c\x00\x15\x00%\x02\x18\x01B\x00\x15\x04%\x02\x18\x01I\x00\x15\x00%\x02\x18\x01b\x00\x15\x04%\x02\x18\x01i\x00\x15\x0c%\x02\x18\x01o%\x00L\x1c\x00\x00\x00\x15\x0c%\x02\x18\x01s%\x00L\x1c\x00\x00\x00\x16\x00\x19\x0c\x19,\x18\x06pandas\x18\x99\x07{"column_indexes": [{"field_name": null, "metadata": {"encoding": "UTF-8"}, "name": null, "numpy_type": "object", "pandas_type": "unicode"}], "columns": [{"field_name": "B", "metadata": null, "name": "B", "numpy_type": "boolean", "pandas_type": "bool"}, {"field_name": "I", "metadata": null, "name": "I", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "bool", "pandas_type": "bool"}, {"field_name": "i", "metadata": null, "name": "i", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "o", "metadata": null, "name": "o", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "s", "metadata": null, "name": "s", "numpy_type": "string", "pandas_type": "unicode"}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 2}], "pandas_version": "1.1.3"}\x00\x18\x0cARROW:schema\x18\xe0\r/////yAFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAANADAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAJkDAAB7ImNvbHVtbl9pbmRleGVzIjogW3siZmllbGRfbmFtZSI6IG51bGwsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifSwgIm5hbWUiOiBudWxsLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNvbHVtbnMiOiBbeyJmaWVsZF9uYW1lIjogIkIiLCAibWV0YWRhdGEiOiBudWxsLCAibmFtZSI6ICJCIiwgIm51bXB5X3R5cGUiOiAiYm9vbGVhbiIsICJwYW5kYXNfdHlwZSI6ICJib29sIn0sIHsiZmllbGRfbmFtZSI6ICJJIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiSSIsICJudW1weV90eXBlIjogImludDY0IiwgInBhbmRhc190eXBlIjogImludDY0In0sIHsiZmllbGRfbmFtZSI6ICJiIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiYiIsICJudW1weV90eXBlIjogImJvb2wiLCAicGFuZGFzX3R5cGUiOiAiYm9vbCJ9LCB7ImZpZWxkX25hbWUiOiAiaSIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogImkiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCJ9LCB7ImZpZWxkX25hbWUiOiAibyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogIm8iLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9LCB7ImZpZWxkX25hbWUiOiAicyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogInMiLCAibnVtcHlfdHlwZSI6ICJzdHJpbmciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJpbmRleF9jb2x1bW5zIjogW3sia2luZCI6ICJyYW5nZSIsICJuYW1lIjogbnVsbCwgInN0YXJ0IjogMCwgInN0ZXAiOiAxLCAic3RvcCI6IDJ9XSwgInBhbmRhc192ZXJzaW9uIjogIjEuMS4zIn0AAAAGAAAA9AAAAKwAAACEAAAAVAAAACwAAAAEAAAANP///wAAAQUUAAAADAAAAAQAAAAAAAAAJP///wEAAABzAAAAWP///wAAAQUUAAAADAAAAAQAAAAAAAAASP///wEAAABvAAAAfP///wAAAQIcAAAADAAAAAQAAAAAAAAAsP///wAAAAFAAAAAAQAAAGkAAACo////AAABBhQAAAAMAAAABAAAAAAAAACY////AQAAAGIAAADM////AAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgAAAAAAAABQAAAAAEAAABJAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEGGAAAABAAAAAEAAAAAAAAAAQABAAEAAAAAQAAAEIAAAAAAAAA\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19l\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x00#\x0b\x00\x00PAR1'
@fjetter
Copy link
Collaborator

fjetter commented Feb 17, 2021

IIUC, this affects only the schema but we're able to read the data properly?

@fjetter
Copy link
Collaborator

fjetter commented Feb 17, 2021

Looks like this is because arrow doesn't distinguish between the two and we're defining the _common_metadata purely via the arrow types

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"Int": pd.Series([1, pd.NA], dtype="Int64")})

In [4]: schema = pa.Schema.from_pandas(df)

In [5]: schema
Out[5]:
Int: int64

I'd be curious if there are any practical implications for this discrepancy or if this is rather a 'formal' error

@mlondschien
Copy link
Contributor Author

We use Kartothek datasets to cache computation results. For validation, we check if the Kartothek metadata matches the expected dtypes. We've not had any issues loading data with such columns yet. So this is mostly an inconvenience (or "formal" error).

@fjetter
Copy link
Collaborator

fjetter commented Feb 18, 2021

That's very interesting and I'm positively surprised that this works generally. afaik, we do not have any tests using the nullables types in kartothek, yet (but it's about time). If you want to contribute on that front, I suggest to start with adding some nullable ints/bools to

def get_dataframe_not_nested(n):
and see what breaks

If I understand this correctly, it also only affects integers, correct? bool(eans) are correctly reconstructed.

I assume this is connected to us stripping the metadata from the schema. I'm just wondering why it works for bools

reference_to_compare = _strip_columns_from_schema(
reference, null_columns
).remove_metadata()

 dm.table_meta["table"].internal().empty_table().to_pandas().dtypes
Out[5]:
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

@mlondschien
Copy link
Contributor Author

We're not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?

@fjetter
Copy link
Collaborator

fjetter commented Mar 11, 2021

Is there a specific reason you are not testing for categoricals

We do have tests for categoricals but not systematically as part of the "all types dataframes"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants