Kartothek metadata does not distinguish int64 and Int64 #410
Comments
IIUC, this affects only the schema but we're able to read the data properly?
Looks like this is because Arrow doesn't distinguish between the two, and we're defining the _common_metadata purely via the Arrow types:
In [1]: import pyarrow as pa
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"Int": pd.Series([1, pd.NA], dtype="Int64")})
In [4]: schema = pa.Schema.from_pandas(df)
In [5]: schema
Out[5]:
Int: int64
I'd be curious whether there are any practical implications of this discrepancy or whether this is rather a 'formal' error.
We use Kartothek datasets to cache computation results. For validation, we check if the Kartothek metadata matches the expected dtypes. We've not had any issues loading data with such columns yet. So this is mostly an inconvenience (or "formal" error).
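A possible workaround for this kind of validation, sketched under assumptions: compare dtypes after mapping nullable integer dtypes to their NumPy counterparts. The helpers `normalized_dtype_name` and `dtypes_match` are hypothetical and not part of kartothek; `expected` and `stored` are assumed to be pandas Series of dtypes such as `df.dtypes`.

```python
import pandas as pd


def normalized_dtype_name(dtype):
    """Return a dtype name in which nullable integers map to their NumPy name."""
    dtype = pd.api.types.pandas_dtype(dtype)
    is_extension = isinstance(dtype, pd.api.extensions.ExtensionDtype)
    if is_extension and pd.api.types.is_integer_dtype(dtype):
        # e.g. Int64Dtype().numpy_dtype is numpy's int64
        return dtype.numpy_dtype.name
    return dtype.name


def dtypes_match(expected, stored):
    """Compare two dtype Series, ignoring int64 vs. Int64 style differences."""
    if list(expected.index) != list(stored.index):
        return False
    return all(
        normalized_dtype_name(e) == normalized_dtype_name(s)
        for e, s in zip(expected, stored)
    )


expected = pd.DataFrame({"I": pd.Series([1, pd.NA], dtype="Int64")}).dtypes
stored = pd.DataFrame({"I": pd.Series([1, 2], dtype="int64")}).dtypes
print(dtypes_match(expected, stored))
```

This deliberately loosens the check for integers only; other dtypes still have to match by name.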
That's very interesting, and I'm positively surprised that this generally works. AFAIK, we do not have any tests using the nullable types in kartothek yet (but it's about time). If you want to contribute on that front, I suggest starting by adding some nullable ints/bools to
If I understand this correctly, it also only affects integers, correct? Booleans are correctly reconstructed. I assume this is connected to us stripping the metadata from the schema. I'm just wondering why it works for bools.
kartothek/kartothek/core/common_metadata.py Lines 612 to 614 in 6514c1f
dm.table_meta["table"].internal().empty_table().to_pandas().dtypes
Out[5]:
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object
We're not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?
We do have tests for categoricals but not systematically as part of the "all types dataframes"
Copy pastable below:
The dtype is incorrectly stored in the kartothek metadata: