PERF: use blk.dtype in where() & _setitem_frame() #61014

Merged (5 commits) on Feb 28, 2025

Conversation

auderson
Contributor

@auderson auderson commented Feb 27, 2025

This PR is a one-line change to the code here:

pandas/pandas/core/generic.py

Lines 9735 to 9737 in d1ec1a4

for _dt in cond.dtypes:
    if not is_bool_dtype(_dt):
        raise TypeError(msg.format(dtype=_dt))

Performance comparison:

test script:

import numpy as np
import pandas as pd
import timeit

for width in [10, 1_000, 100_000, 10_000_000]:
    df = pd.DataFrame(np.random.randn(1, width))
    mask = df > 0.5
    tm = timeit.timeit("df.where(mask)", number=10, globals=globals())
    print(width, tm)

Results (width, then total seconds for 10 calls) with the original loop, for _dt in cond.dtypes:

10 0.002963045029900968
1000 0.006705133942887187
100000 0.40306550299283117
10000000 46.55275956704281

for _dt in cond.dtypes.unique():

10 0.0028260269900783896
1000 0.002695770002901554
100000 0.042065858957357705
10000000 6.068146598991007

for _dt in [blk.dtype for blk in cond._mgr.blocks]:

10 0.0009857049444690347
1000 0.0011893719201907516
100000 0.003112988080829382
10000000 0.13763279700651765
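A rough sketch of why the block-based variant does less work, assuming the wide single-dtype frame from the test script: cond.dtypes yields one dtype per column, while the block manager holds one dtype per block. Note that `_mgr` is private pandas API (it is what this PR uses internally) and may change between versions; the variable names here are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype

df = pd.DataFrame(np.random.randn(1, 1000))
cond = df > 0.5

# One dtype per column: the amount of work grows with the frame's width.
per_column = list(cond.dtypes)

# One dtype per internal block (private API, assumed stable only for this sketch).
per_block = [blk.dtype for blk in cond._mgr.blocks]

# Both views describe the same set of dtypes, so the boolean check is equivalent.
same_dtypes = set(per_column) == set(per_block)
all_bool = all(is_bool_dtype(dt) for dt in per_block)

print(len(per_column), len(per_block), same_dtypes, all_bool)
```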

@mroeschke mroeschke added the Performance (Memory or execution speed performance) label Feb 27, 2025
Co-authored-by: Matthew Roeschke
@auderson
Contributor Author

@mroeschke
Hi, do I need to add this in whatsnew? 2.3.0 or 3.0.0?

@auderson auderson changed the title from "PERF: use blk.dtype in where()" to "PERF: use blk.dtype in where() & _setitem_frame()" Feb 28, 2025
@auderson
Contributor Author

This pattern is also found in _setitem_frame:

pandas/pandas/core/frame.py

Lines 4276 to 4289 in 5da9eb7

def _setitem_frame(self, key, value) -> None:
    # support boolean setting with DataFrame input, e.g.
    # df[df > df2] = 0
    if isinstance(key, np.ndarray):
        if key.shape != self.shape:
            raise ValueError("Array conditional must be same shape as self")
        key = self._constructor(key, **self._construct_axes_dict(), copy=False)

    if key.size and not all(is_bool_dtype(dtype) for dtype in key.dtypes):
        raise TypeError(
            "Must pass DataFrame or 2-d ndarray with boolean values only"
        )

    self._where(-key, value, inplace=True)
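As a small illustration of the code path above, here is a sketch of how it is reached from user code, assuming a toy frame (df and the values in it are made up for this example). Assigning through a boolean DataFrame key goes to _setitem_frame, and a non-boolean key hits the dtype check:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [-4.0, 5.0, -6.0]})

# Boolean DataFrame key: every negative entry is overwritten with 0.0.
df[df < 0] = 0.0

# Non-boolean DataFrame key: rejected by the is_bool_dtype check.
try:
    df[df.astype("int64")] = 0.0
    raised = False
except TypeError:
    raised = True

print(df)
print("TypeError raised:", raised)
```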

[Image: before]

[Image: after]
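To reproduce a before/after comparison for this path locally, one option is to adapt the where() test script to boolean-frame assignment. This is only a sketch: the widths are reduced to keep it quick, the `timings` dict is an illustrative addition, and no particular numbers are implied.

```python
import timeit

import numpy as np
import pandas as pd

timings = {}
for width in [10, 1_000, 100_000]:
    df = pd.DataFrame(np.random.randn(1, width))
    mask = df > 0.5
    # df[mask] = 0 goes through _setitem_frame because the key is a DataFrame.
    timings[width] = timeit.timeit(
        "df[mask] = 0", number=10, globals={"df": df, "mask": mask}
    )
    print(width, timings[width])
```

Running it once on main before the change and once after gives the two columns to compare.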

@mroeschke
Member

Hi, do I need to add this in whatsnew? 2.3.0 or 3.0.0?

v3.0.0.rst please

@mroeschke mroeschke added this to the 3.0 milestone Feb 28, 2025
@mroeschke mroeschke merged commit 928fb7e into pandas-dev:main Feb 28, 2025
42 checks passed
@mroeschke
Member

Thanks @auderson

Labels
Performance (Memory or execution speed performance)
Development

Successfully merging this pull request may close these issues.

PERF: bottleneck in where()
2 participants