
Suspected regression in /vsicurl #9682

Closed
perliedman opened this issue Apr 16, 2024 · 9 comments · Fixed by #9694

@perliedman

What is the bug?

Using the latest GDAL (3.8.5), fetching a COG with Python/ReadAsArray takes about four times longer than the same code using GDAL 3.7.1.

Enabling CPL_CURL_VERBOSE=YES, I can see that 3.8.5 makes a lot more range requests than 3.7.1 (a rough way to count them is sketched after the list):

  • 3.8.5: 2796
  • 3.7.1: 114
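
For reference, a rough way to count those range requests from the verbose output (just a sketch; it assumes the curl verbose log echoes each Range request header to stderr):

CPL_CURL_VERBOSE=YES python -c "from osgeo import gdal; gdal.UseExceptions(); gdal.Open('/vsicurl/https://solvi-marketing-resources.s3.eu-west-1.amazonaws.com/public/ortho.tiff').ReadAsArray()" 2>&1 | grep -c 'Range: bytes'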

Steps to reproduce the issue

I fetch a COG through /vsicurl with this simple command:

time python -c "from osgeo import gdal; gdal.UseExceptions(); src=gdal.Open('/vsicurl/https://solvi-marketing-resources.s3.eu-west-1.amazonaws.com/public/ortho.tiff'); src.ReadAsArray()"

On my connection, this takes about 40 seconds.

Running the exact same code using GDAL 3.7.1, this takes about 10 seconds.

It seems something changed between these versions that caused a pretty major regression.

Versions and provenance

The GDAL 3.8.5 test was run in a Docker container running Debian GNU/Linux 11 (bullseye).

GDAL 3.7.1 was tested on my Ubuntu 23.10 machine.

Additional context

I stumbled upon this in our production environment, where the code is of course a lot more complex, but the same seems to hold even for a simple example like the one above.

@perliedman
Author

I have experimented with these environment variables without seeing any clear difference in timings (set together for a single run as sketched after the list):

  • GDAL_HTTP_MULTIRANGE=YES
  • GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES
  • GDAL_HTTP_MULTIPLEX=YES
  • GDAL_HTTP_VERSION=2
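
For a single run they can be combined like this (a sketch, simply prefixing the reproducer from the issue with the values listed above):

GDAL_HTTP_MULTIRANGE=YES GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES GDAL_HTTP_MULTIPLEX=YES GDAL_HTTP_VERSION=2 python -c "from osgeo import gdal; gdal.UseExceptions(); src=gdal.Open('/vsicurl/https://solvi-marketing-resources.s3.eu-west-1.amazonaws.com/public/ortho.tiff'); src.ReadAsArray()"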

@rcoup
Member

rcoup commented Apr 16, 2024

FWIW, I'm investigating a performance regression upgrading from 3.6.4 to 3.8.5, but I don't have a concrete reproducer or profile yet. The app serves tiles from VRTs pointing at S3 COGs, and is using roughly 40% more CPU and memory with about 50% lower throughput under 3.8 compared to 3.6. A few other things have changed too, though, so I'm wary of attributing it all to GDAL.

@jratike80
Collaborator

I ran a test with GDAL 3.8.5 from OSGeo4W. I suppose this is basically the same thing:

gdal_translate /vsicurl/https://solvi-marketing-resources.s3.eu-west-1.amazonaws.com/public/ortho.tiff out.tif --debug on

The command generated 58 range requests.

@rouault
Member

rouault commented Apr 16, 2024

I can't reproduce that in my dev environment (Ubuntu 20.04) without any special config option set. I get 114 requests with GDAL master or 3.8.5:

$ CPL_DEBUG=ON python -c "from osgeo import gdal; gdal.UseExceptions(); src=gdal.Open('/vsicurl/http://localhost:8080/ortho.tiff'); src.ReadAsArray()" 2>&1 | grep Downloading | wc -l
114

@perliedman
Author

perliedman commented Apr 17, 2024

Thanks for testing and reporting back! You are of course correct that this is not a regression in later GDAL versions. It helped me pinpoint the issue, which is not the GDAL version, but the fact that I had set GDAL_NUM_THREADS=4 in my Docker container and forgot about it.

So, running the code above with GDAL_NUM_THREADS=4 is what causes the increased number of range requests; without it, I consistently get 114 range requests in both 3.8.5 and 3.7.1. With four threads, the number of requests increases to 2796 and the read takes about four times longer.

After finding this, I also experimented with other values for GDAL_NUM_THREADS, and it seems the number of requests always ends up at 2796 for any value greater than 1, but the time it takes varies. With two threads it took 1 minute 20 seconds, and with 64 threads I could bring it down to about 4 seconds.

Is this the intended way for GDAL_NUM_THREADS to work? My naïve expectation was not that I would need values as high as 16 to match the performance of no threading, nor that it would increase the number of requests so enormously.
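
For reference, the per-setting request counts can be checked with the same counting approach as in the comment above, e.g.:

CPL_DEBUG=ON GDAL_NUM_THREADS=4 python -c "from osgeo import gdal; gdal.UseExceptions(); src=gdal.Open('/vsicurl/https://solvi-marketing-resources.s3.eu-west-1.amazonaws.com/public/ortho.tiff'); src.ReadAsArray()" 2>&1 | grep Downloading | wc -l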

@rouault
Member

rouault commented Apr 18, 2024

and it seems the number of requests always ends up at 2796 for any value greater than 1

This raster has indeed ceil(7255.0 / 128) * ceil(6237.0 / 128) = 2793 tiles. When you enable multi-threading, the driver then decides to split the reading into as many jobs as there are tiles, and each job is responsible for acquiring its own data. There is, however, logic where jobs declare in advance the data they will need, so that the /vsicurl/ layer can try to group consecutive requests together; but that is limited to a maximum of 100 MB read simultaneously, and here your file is bigger than that, hence no effort is made to merge HTTP GET requests. I suspect the GTiff driver implementation could be made smarter to split the RasterIO() request into smaller ones that don't exceed the 100 MB threshold.
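
A small back-of-the-envelope sketch of that job count (pure Python, using the numbers from the paragraph above; the gap between 2793 and the observed 2796 presumably being a few extra header/metadata requests is an assumption):

import math

xsize, ysize = 7255, 6237   # raster dimensions, per the formula above
block = 128                 # internal tile size of the COG

# With multi-threading enabled, the GTiff driver creates one read job per tile
jobs = math.ceil(xsize / block) * math.ceil(ysize / block)
print(jobs)                 # 2793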

@rouault rouault self-assigned this Apr 18, 2024
@vincentsarago
Contributor

vincentsarago commented Apr 18, 2024

Might be related: cogeotiff/rio-tiler#697

I'm seeing an increased number of HEAD and GET requests, especially some GetFileSize and GetFileList VSICURL requests, even when GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR is set.
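
For anyone wanting to check this on their side, something along these lines can surface those requests (a sketch only; the exact debug message names may vary between GDAL versions):

CPL_DEBUG=ON CPL_CURL_VERBOSE=YES GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR python -c "from osgeo import gdal; gdal.UseExceptions(); gdal.Open('/vsicurl/https://solvi-marketing-resources.s3.eu-west-1.amazonaws.com/public/ortho.tiff')" 2>&1 | grep -E 'HEAD|GetFileSize|GetFileList'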

@rouault
Member

rouault commented Apr 18, 2024

I suspect the GTiff driver implementation could be made smarter to split the RasterIO() request into smaller ones that don't exceed the 100 MB threshold

An improvement implementing that is in #9694.

@perliedman
Author

@rouault thanks a lot for explaining the logic behind this, it makes sense now! For our purposes, I think just not setting GDAL_NUM_THREADS will work well, but of course improvements will be welcome when they arrive.
