
Adding DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE #227

Open
wants to merge 9 commits into main
Conversation

PyExplorer (Collaborator) commented Oct 18, 2024

This is the initial version, without fixing the tests yet. The docs are still being fixed.

.. setting:: DOWNLOAD_MAXSIZE

DOWNLOAD_MAXSIZE
================
Member

I don't think we need to copy the documentation here; we should just mention that these two standard Scrapy settings are supported.

Comment on lines 207 to 211
expected_size = None
for header in api_response.get("httpResponseHeaders"):
    if header["name"].lower() == "content-length":
        expected_size = int(header["value"])
        break
Member

Why is computing expected size needed? We already got the httpResponseBody, and can check its real size.

Collaborator Author

The idea was to check it first (as this is faster), and if "content-length" exceeds the limit, return without checking the length of the real body.

kmike (Member) commented Oct 18, 2024

If you don't decode from base64 but use the 0.75 approximation, then using content-length will not be any faster; it'd be slower, and also less reliable, as content-length might lie.

Collaborator Author

Do you think we should remove this check for content-length entirely?
Actually, I added this because we consider checking both compressed and decompressed data (in Scrapy this is also mentioned for DOWNLOAD_MAXSIZE), and the only way I found to check the compressed data was to check content-length.

Member

Yes, drop it. In Scrapy it's different, because by checking content-length Scrapy can prevent download before it happens.

For decompression, there is also special support in Scrapy; it's unrelated to content-length. Scrapy decompresses in chunks, and keeps track of the total size of decompressed data. If the size grows over the limit, an exception is raised, and decompression is stopped. See https://github.com/scrapy/scrapy/blob/6d65708cb7f7b49b72fc17486fecfc1caf62e0af/scrapy/utils/_compression.py#L53. This also looks like something we can't apply here.
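For illustration, a minimal sketch of that chunked idea using zlib directly (the function name, chunk size, and plain ValueError are assumptions of this sketch, not the actual scrapy.utils._compression code):

import zlib


def decompress_with_limit(data: bytes, max_size: int, chunk_size: int = 65536) -> bytes:
    # Decompress in chunks and track the total decompressed size, so a
    # decompression bomb is aborted as soon as it grows over max_size.
    decompressor = zlib.decompressobj()
    output = []
    total = 0
    pending = data
    while pending:
        chunk = decompressor.decompress(pending, chunk_size)
        total += len(chunk)
        if total > max_size:
            raise ValueError(f"decompressed data exceeded {max_size} bytes")
        output.append(chunk)
        # Input that was not consumed because chunk_size was reached:
        pending = decompressor.unconsumed_tail
    tail = decompressor.flush()
    total += len(tail)
    if total > max_size:
        raise ValueError(f"decompressed data exceeded {max_size} bytes")
    output.append(tail)
    return b"".join(output)


payload = zlib.compress(b"x" * 1_000_000)
decompress_with_limit(payload, max_size=2_000_000)  # fine
# decompress_with_limit(payload, max_size=100_000)  # raises ValueError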

Collaborator Author

Got it, thanks!

    (maxsize and expected_size < maxsize)
    and (warnsize and expected_size < warnsize)
):
    expected_size = len(b64decode(api_response.get("httpResponseBody", b"")))
Member

Is there a way to get size of base64 data without decoding it? Decoding can be costly.

Contributor

*.75

Contributor

(assuming no linebreaks or ignoring them)

Member

It looks fine not to be byte-precise.
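For reference, the *.75 approximation can even be made byte-precise by accounting for padding; base64_decoded_size below is a hypothetical helper for illustration, not code from this PR, and assumes standard base64 without line breaks:

from base64 import b64decode


def base64_decoded_size(encoded: str) -> int:
    # Size of the decoded bytes without decoding: every 4 base64 characters
    # encode 3 bytes, minus one byte per padding character at the end.
    padding = encoded[-2:].count("=") if encoded else 0
    return len(encoded) * 3 // 4 - padding


body_b64 = "SGVsbG8sIHdvcmxkIQ=="  # base64 of b"Hello, world!"
assert base64_decoded_size(body_b64) == len(b64decode(body_b64)) == 13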

warnsize = request.meta.get("download_warnsize", default_warnsize)

if "browserHtml" in api_response:
    expected_size = len(api_response["browserHtml"].encode(_DEFAULT_ENCODING))
Member

Here, while trying to limit memory usage with DOWNLOAD_MAXSIZE, we might create an additional memory spike, because we temporarily create another copy of browserHtml in memory.

It seems we need to ensure _response_max_size_exceeded never makes copies of large data from the response.

Member

I think you can either use sys.getsizeof (and maybe subtract the fixed overhead Python Unicode objects have), or consider the length of the unicode object instead of the length of the actual data; that could be good enough as well (though worse). Maybe there is some other solution.
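To illustrate how those options compare (an example snippet, not code from the PR):

import sys

html = "<html>" + "a" * 1_000_000 + "</html>"

# Exact byte size, but .encode() creates a temporary copy of the whole body:
exact = len(html.encode("utf-8"))

# No copy, but includes the str object's fixed overhead and depends on the
# internal representation (1, 2 or 4 bytes per character):
approx_getsizeof = sys.getsizeof(html)

# No copy; counts characters rather than bytes, which is close enough for
# mostly-ASCII HTML:
approx_len = len(html)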

PyExplorer (Collaborator Author) commented Oct 18, 2024

We also calculate the size of the response body here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L114
and here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L145
What do you think about moving the check into these two functions and checking there separately? Then return None if the size is too big.
In this case we would only additionally need to check content-length in ZyteAPIResponse.

PyExplorer (Collaborator Author) commented Oct 18, 2024

Another approach is to calculate the size and the decoded/encoded version of the response body here, before calling from_api_response, and pass the prepared body to from_api_response. In this case we make this expensive calculation only once and reuse the calculated body here too https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197.
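Roughly like this, for example; the helper name below is made up, just to illustrate the idea of converting the body once and reusing it for both the size check and the response construction:

from base64 import b64decode

_DEFAULT_ENCODING = "utf-8"  # assumed for this sketch


def _prepare_body(api_response: dict) -> bytes:
    # Decode/encode the body once, so that both the size check and the
    # response object construction can reuse the same bytes object instead
    # of repeating the expensive conversion.
    if api_response.get("httpResponseBody"):
        return b64decode(api_response["httpResponseBody"])
    return api_response.get("browserHtml", "").encode(_DEFAULT_ENCODING)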

kmike (Member) commented Oct 18, 2024

I was also thinking about moving it down the stack: checking the size of the API response received by the client library, before the JSON decoding.

But it could make the library less compatible with Scrapy. Let's say you have an existing spider which uses some download limit. You switch to scrapy-zyte-api for downloads, and maybe also enable data extraction. But the API response is larger than the raw response size. So the limit becomes more aggressive, and you might drop some pages which were working before.

Because of this, the approach you've taken originally (checking the httpResponseBody size and the browserHtml size, ignoring everything else, e.g. structured data sizes or screenshot sizes) makes sense to me.
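Putting the pieces of this thread together, the check could look something like the sketch below. _response_max_size_exceeded is the name mentioned earlier in the review, but this body is only an illustration that combines the 0.75 approximation for httpResponseBody with len() of browserHtml; it is not the code in the PR:

def _response_max_size_exceeded(api_response: dict, maxsize: int) -> bool:
    # Only the body-like fields are considered, ignoring structured data,
    # screenshots, etc., to stay close to Scrapy's DOWNLOAD_MAXSIZE semantics.
    if not maxsize:
        return False
    body_b64 = api_response.get("httpResponseBody")
    if body_b64 and len(body_b64) * 3 // 4 > maxsize:  # base64 -> bytes, ~0.75
        return True
    browser_html = api_response.get("browserHtml")
    if browser_html and len(browser_html) > maxsize:  # characters ~ bytes
        return True
    return False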

PyExplorer (Collaborator Author) commented Oct 18, 2024

@kmike a new version is here (in this PR).
Now the number of encoding/decoding operations is the same as before. No additional calculations except for getting the length.
