
BUG: non-reproducible error FloatingPointError: overflow encountered in multiply in the following sequence: read_csv followed by to_datetime with pandas version 2.2.2 #58419

Open
kapytaine opened this issue Apr 25, 2024 · 14 comments · May be fixed by #61022
Labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member)

Comments

@kapytaine

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

for ii in range(10000):
    df = pd.read_csv("data.csv", dtype={"ts": float})   # data.csv provided in an attached file
    pd.to_datetime(df["ts"], unit="s", errors="coerce")

Issue Description

I sometimes get the following error with pandas 2.2.2 (I don't get this error with pandas 2.1.4):

Exception has occurred: FloatingPointError
overflow encountered in multiply
File ".../main.py", line 218, in
pd.to_datetime(df["ts"], unit="s", errors="coerce")
FloatingPointError: overflow encountered in multiply

The error is not repeatable, hence the loop. I reduced the input file as much as possible while still triggering the error, which is why the attached CSV file has only 200 rows. I don't know whether the issue comes from read_csv (I get the same problem with read_parquet) or from to_datetime. If read_csv is moved outside the loop and I make a deepcopy at the start of each iteration, the problem disappears, so my hunch is that it is linked to the reading process (read_csv in the example).
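
For reference, the variant that does not trigger the error for me looks roughly like this (a sketch of the experiment described above; the exact structure is a paraphrase):

import copy

import pandas as pd

# Read once outside the loop, then deepcopy on every iteration;
# structured this way, the FloatingPointError never appears.
df0 = pd.read_csv("data.csv", dtype={"ts": float})
for ii in range(10000):
    df = copy.deepcopy(df0)
    pd.to_datetime(df["ts"], unit="s", errors="coerce")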

Expected Behavior

I expect the loop body to behave the same way on every iteration: either it works every time or it fails every time.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-105-generic
Version : #115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

data.csv

@kapytaine added the Bug and Needs Triage labels on Apr 25, 2024
@SUCM

SUCM commented May 6, 2024

I also see this FloatingPointError: overflow encountered in multiply error from time to time, and it's always with pd.to_datetime where errors="coerce" or errors="ignore".

So far I have used code like the one below, and it always works even when the random FloatingPointError occurs, as confirmed by the prints. Basically, a retry succeeds:

try:
    c = df["a"].copy()
    df["a"] = pd.to_datetime(df["a"], ..., errors="coerce")
except FloatingPointError:
    print("pandas FloatingPointError on column a")
    df["a"] = pd.to_datetime(c, ..., errors="coerce")

You may want to try using Python pickle to preserve Python objects; I find pandas's CSV and Excel read/write to be unreliable.

As an example: take two str columns, one with phone numbers whose area codes all start with 0, and the other containing only empty strings. pandas will write them to CSV or Excel without problems, but on reading back you get two columns of timestamps and/or other non-str objects.
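
A minimal sketch of the pickle round trip (the file name is illustrative):

import pandas as pd

# Pickle preserves dtypes exactly; CSV/Excel re-infer dtypes on read,
# which is what mangles "0"-prefixed phone-number strings.
df = pd.DataFrame({"phone": ["0123456", "0456789"], "note": ["", ""]})
df.to_pickle("data.pkl")
df2 = pd.read_pickle("data.pkl")
assert df2["phone"].dtype == object  # the strings survive the round trip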

@lbortolotti

I'm seeing the exact same issue as the OP. There's no CSV read/write in my case, just a pd.to_datetime call that randomly fails but succeeds when re-run on the exact same data.

@lmuther8

lmuther8 commented May 8, 2024

I also have this issue. It occurs when reading a parquet file into a df and then attempting pd.to_datetime(df['col_name'], unit='s', utc=True, errors='ignore').

@ichatz

ichatz commented Jun 29, 2024

I also have the exact same behavior when using pd.to_datetime in combination with read_json.

@lmuther8

lmuther8 commented Jul 1, 2024

I converted the floats to integers and that got rid of the error, as far as I can tell. Maybe that works for someone else's use case too.
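
Roughly like this (a sketch; note that NaNs have to be handled first, e.g. via the nullable Int64 dtype, and that the cast drops any sub-second fraction):

import numpy as np
import pandas as pd

ser = pd.Series([1.740506e9, np.nan, 1.740529e9])
# astype(int) fails on NaN, so cast to the nullable Int64 dtype;
# NaN becomes <NA>, which to_datetime turns into NaT.
as_int = ser.astype("Int64")
pd.to_datetime(as_int, unit="s", errors="coerce")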

@drSylvia

drSylvia commented Aug 2, 2024

I have the same issue, occurring when I read from a csv file and then call
pd.to_datetime(x * 3600, unit="s", origin=origin, errors="coerce").
The FloatingPointError: overflow encountered in multiply error appears from time to time, and on different columns of the same dataframe.

@drSylvia

drSylvia commented Aug 2, 2024

> I converted the floats to integers and that got rid of the error, as far as I can tell.

Would this result in a loss of accuracy or precision?

@optiluca

Any updates on this? I've still got Pandas pinned to 2.1.4 because of this bug.

@LilMonk

LilMonk commented Oct 25, 2024

Facing the same issue while converting from long to timestamp:
pd.to_datetime(df['utc_long'], unit='s', utc=True)

Solutions that worked for me:

  1. Convert the column data type to int explicitly (@lmuther8's suggestion):

df['utc_timestamp'] = df['utc_long'].astype(int)
# Be careful with NA values: fill them first, then convert to int.

  2. Iterate over each record in the column and convert it to a timestamp:

utc_timestamp_list = []
for utc in df["utc_long"]:
    try:
        utc_timestamp_list.append(pd.to_datetime(utc, unit="s", utc=True))
    except Exception as e:
        print(f"Error converting utc_long value {utc}: {e}")
df["utc_timestamp"] = utc_timestamp_list

@jvahl

jvahl commented Jan 15, 2025

It looks like the error only occurs for inputs longer than 127.

This works fine:

import pandas as pd

for ii in range(10000):
    df = pd.read_csv("data.csv", dtype={"ts": float}).iloc[:127]
    pd.to_datetime(df["ts"], unit="s", errors="coerce")

While this leads to the FloatingPointError:

import pandas as pd

for ii in range(10000):
    df = pd.read_csv("data.csv", dtype={"ts": float}).iloc[:128]
    pd.to_datetime(df["ts"], unit="s", errors="coerce")

@Hnasar

Hnasar commented Feb 28, 2025

@mroeschke and @jbrockmendel, you two were listed as authors on #56037; could you take a look? This seems to be a nondeterministic FloatingPointError in the vectorized timestamp conversion for floats. I'm struggling to reproduce the examples above, but I'm consistently seeing it in my own code.

From the 2.2.0 what's new:

> Bug in the results of to_datetime() with an floating-dtype argument with unit not matching the pointwise results of Timestamp (GH 56037)

If I reorder function calls, or do arbitrary other things, the error sometimes disappears. It seems like memory corruption, but I'm not familiar enough with pandas to debug further.

Here's the Series I'm using:

(Pdb) ser
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
1789   NaN
1790   NaN
1791   NaN
1792   NaN
1793   NaN
Name: earliest_parent_ts, Length: 1794, dtype: float64
(Pdb) ser.dropna()
44      1.740506e+09
45      1.740506e+09
46      1.740506e+09
47      1.740506e+09
49      1.740505e+09
            ...     
1426    1.740529e+09
1430    1.740503e+09
1431    1.740503e+09
1432    1.740503e+09
1549    1.740587e+09
Name: earliest_parent_ts, Length: 102, dtype: float64

Basically if I call this twice:

pd.to_datetime(ser, utc=True, unit="s").dt  # If I remove .dt it works

then it pretty reliably fails, but only if I have some specific function calls before this (into my own code).

Batching the series into 127-sized chunks seems to work around the issue somehow, but I'm not super confident in it. Retrying doesn't help.

Could either of you take a closer look at the patch you landed?
Thank you for your hard work and attention!

edit: Btw, here's the workaround I'm using

try:
    ser = pd.to_datetime(ser, utc=True, unit=unit)
except FloatingPointError:
    # Pandas 2.2 has a non-deterministic error with large arrays, so
    # chunk into 127-sized parts to work around it.
    # https://github.com/pandas-dev/pandas/issues/58419

    match ser:
        case pd.Series():
            parts = [
                pd.to_datetime(ser.iloc[i : i + 127], utc=True, unit=unit)
                for i in range(0, len(ser), 127)
            ]
            ser = pd.concat(parts)
        case pd.Index() | np.ndarray():
            parts = [
                pd.to_datetime(ser[i : i + 127], utc=True, unit=unit)
                for i in range(0, len(ser), 127)
            ]
            ser = parts[0].append(parts[1:])
        case _:
            raise

@snitish
Contributor

snitish commented Mar 1, 2025

take

@snitish
Contributor

snitish commented Mar 1, 2025

A simpler example (fails about 80% of the time on my machine):

>>> for i in range(1000):
...     pd.to_datetime(pd.Series([np.nan]*10000 + [1712219033.0], dtype=np.float64), unit="s", errors="coerce")
FloatingPointError: overflow encountered in multiply

@snitish
Contributor

snitish commented Mar 1, 2025

Thanks for the analysis @Hnasar. The bug is indeed caused by accessing uninitialized memory containing garbage values. I've submitted a PR with a fix.
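
For anyone curious how this class of bug shows up nondeterministically, here is a minimal illustration (not the actual pandas code): np.empty returns whatever bytes already sit in the buffer, so arithmetic performed under errstate(over="raise") can overflow on one run and pass on the next.

import numpy as np

# Illustration only, not pandas source: the contents of an
# uninitialized buffer are undefined, so the same multiply can
# raise or succeed depending on what garbage it happens to hold.
buf = np.empty(10000, dtype=np.float64)
with np.errstate(over="raise"):
    try:
        buf * 1e300  # overflows only if the garbage values are large enough
    except FloatingPointError as exc:
        print(exc)  # "overflow encountered in multiply"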
