
Adding DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE #227

Open
wants to merge 9 commits into main
Conversation

PyExplorer (Collaborator) commented Oct 18, 2024

This is the initial version, without fixing the tests yet. The docs are still being fixed.

.. setting:: DOWNLOAD_MAXSIZE

DOWNLOAD_MAXSIZE
================
Member

I don't think we need to copy the documentation here; we should just mention that these two standard Scrapy settings are supported.

Comment on lines 207 to 211
expected_size = None
for header in api_response.get("httpResponseHeaders"):
    if header["name"].lower() == "content-length":
        expected_size = int(header["value"])
        break
Member

Why is computing expected size needed? We already got the httpResponseBody, and can check its real size.

Collaborator Author

The idea was to check it first (as this is faster), and if "content-length" exceeds the limit, return without checking the length of the real body.

kmike (Member) commented Oct 18, 2024

If you don't decode from base64 but use the 0.75 approximation, then using content-length will not be any faster; it'd be slower, and also less reliable, as content-length might lie.

Collaborator Author

Do you think we should remove this check for content-length entirely?
Actually, I added this because we consider checking both compressed and decompressed data (in Scrapy this is also mentioned for DOWNLOAD_MAXSIZE), and the only way I found to check the compressed data was to check content-length.

Member

Yes, drop it. In Scrapy it's different, because by checking content-length Scrapy can prevent download before it happens.

For decompression, there is also special support in Scrapy; it's unrelated to content-length. Scrapy decompresses in chunks, and keeps track of the total size of decompressed data. If the size grows over the limit, an exception is raised, and decompression is stopped. See https://github.com/scrapy/scrapy/blob/6d65708cb7f7b49b72fc17486fecfc1caf62e0af/scrapy/utils/_compression.py#L53. This also looks like something we can't apply here.
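For illustration, a minimal sketch of that chunked idea using zlib directly (the function name, chunk size, and plain ValueError are assumptions of this sketch, not the actual scrapy.utils._compression code):

import zlib


def decompress_with_limit(data: bytes, max_size: int, chunk_size: int = 65536) -> bytes:
    # Decompress in chunks and track the total decompressed size, so a
    # decompression bomb is aborted as soon as it grows over max_size.
    decompressor = zlib.decompressobj()
    output = []
    total = 0
    pending = data
    while pending:
        chunk = decompressor.decompress(pending, chunk_size)
        total += len(chunk)
        if total > max_size:
            raise ValueError(f"decompressed data exceeded {max_size} bytes")
        output.append(chunk)
        # Input that was not consumed because chunk_size was reached:
        pending = decompressor.unconsumed_tail
    tail = decompressor.flush()
    total += len(tail)
    if total > max_size:
        raise ValueError(f"decompressed data exceeded {max_size} bytes")
    output.append(tail)
    return b"".join(output)


payload = zlib.compress(b"x" * 1_000_000)
decompress_with_limit(payload, max_size=2_000_000)  # fine
# decompress_with_limit(payload, max_size=100_000)  # raises ValueError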

Collaborator Author

Got it, thanks!

    (maxsize and expected_size < maxsize)
    and (warnsize and expected_size < warnsize)
):
    expected_size = len(b64decode(api_response.get("httpResponseBody", b"")))
Member

Is there a way to get size of base64 data without decoding it? Decoding can be costly.

Contributor

*.75

Contributor

(assuming no linebreaks or ignoring them)

Member

It looks fine not to be byte-precise.
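For reference, the *.75 approximation can even be made byte-precise by accounting for padding; base64_decoded_size below is a hypothetical helper for illustration, not code from this PR, and assumes standard base64 without line breaks:

from base64 import b64decode


def base64_decoded_size(encoded: str) -> int:
    # Size of the decoded bytes without decoding: every 4 base64 characters
    # encode 3 bytes, minus one byte per padding character at the end.
    padding = encoded[-2:].count("=") if encoded else 0
    return len(encoded) * 3 // 4 - padding


body_b64 = "SGVsbG8sIHdvcmxkIQ=="  # base64 of b"Hello, world!"
assert base64_decoded_size(body_b64) == len(b64decode(body_b64)) == 13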

warnsize = request.meta.get("download_warnsize", default_warnsize)

if "browserHtml" in api_response:
    expected_size = len(api_response["browserHtml"].encode(_DEFAULT_ENCODING))
Member

Here, while trying to limit memory usage with DOWNLOAD_MAXSIZE, we might create an additional memory spike, because we temporarily create another copy of browserHtml in memory.

It seems we need to ensure _response_max_size_exceeded never makes copies of large data from the response.

Member

I think you can either use sys.getsizeof (and maybe subtract the fixed overhead Python Unicode objects have), or consider the length of the unicode object instead of the length of the actual data; that could be good enough as well (though worse). Maybe there is some other solution.
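To illustrate how those options compare (an example snippet, not code from the PR):

import sys

html = "<html>" + "a" * 1_000_000 + "</html>"

# Exact byte size, but .encode() creates a temporary copy of the whole body:
exact = len(html.encode("utf-8"))

# No copy, but includes the str object's fixed overhead and depends on the
# internal representation (1, 2 or 4 bytes per character):
approx_getsizeof = sys.getsizeof(html)

# No copy; counts characters rather than bytes, which is close enough for
# mostly-ASCII HTML:
approx_len = len(html)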

PyExplorer (Collaborator Author) commented Oct 18, 2024

We also calculate the size of the response body here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L114
and here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L145
What do you think about moving the check into these two functions and checking there separately? Then return None if the size is too big.
In this case we would only additionally need to check content-length in ZyteAPIResponse.

PyExplorer (Collaborator Author) commented Oct 18, 2024

Another approach is to calculate the size and the decoded/encoded version of the response body here, before calling from_api_response, and pass the prepared body to from_api_response. In this case we make this expensive calculation only once and reuse the calculated body here too https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197.
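Roughly like this, for example; the helper name below is made up, just to illustrate the idea of converting the body once and reusing it for both the size check and the response construction:

from base64 import b64decode

_DEFAULT_ENCODING = "utf-8"  # assumed for this sketch


def _prepare_body(api_response: dict) -> bytes:
    # Decode/encode the body once, so that both the size check and the
    # response object construction can reuse the same bytes object instead
    # of repeating the expensive conversion.
    if api_response.get("httpResponseBody"):
        return b64decode(api_response["httpResponseBody"])
    return api_response.get("browserHtml", "").encode(_DEFAULT_ENCODING)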

kmike (Member) commented Oct 18, 2024

I was also thinking about moving it down the stack: checking the size of the API response received by the client library, before the JSON decoding.

But it could make the library less compatible with Scrapy. Let's say you have an existing spider which uses some download limit. You switch to scrapy-zyte-api for downloads, and maybe also enable data extraction. But the API response is larger than the raw response size. So the limit becomes more aggressive, and you might drop some pages which were working before.

Because of this, the approach you've taken originally (checking the httpResponseBody size and the browserHtml size, ignoring everything else, e.g. structured data sizes or screenshot sizes) makes sense to me.
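Putting the pieces of this thread together, the check could look something like the sketch below. _response_max_size_exceeded is the name mentioned earlier in the review, but this body is only an illustration that combines the 0.75 approximation for httpResponseBody with len() of browserHtml; it is not the code in the PR:

def _response_max_size_exceeded(api_response: dict, maxsize: int) -> bool:
    # Only the body-like fields are considered, ignoring structured data,
    # screenshots, etc., to stay close to Scrapy's DOWNLOAD_MAXSIZE semantics.
    if not maxsize:
        return False
    body_b64 = api_response.get("httpResponseBody")
    if body_b64 and len(body_b64) * 3 // 4 > maxsize:  # base64 -> bytes, ~0.75
        return True
    browser_html = api_response.get("browserHtml")
    if browser_html and len(browser_html) > maxsize:  # characters ~ bytes
        return True
    return False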

PyExplorer (Collaborator Author) commented Oct 18, 2024

@kmike a new version is here (in this PR).
Now the number of encoding/decoding operations is the same as before. No additional calculations except for getting the length.
