Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ogrct.cpp to remove shared shared proj context (m_psLastContex… #11846

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

samiermerchant
Copy link

@samiermerchant samiermerchant commented Feb 13, 2025

…t), which can lead to scenarios where one thread destroys the context while another is still using it

Undo 94ede75, which causes segfaults when using gdal in a multi-threaded environment such as a thread worker pool.

It might be the case that one thread "destroys" the context (m_psLastContext) while another thread is still using it, leading to a segfault. As you can see with the following line: m_psLastContext = OSRGetProjTLSContext();

What does this PR do?

Update ogrct.cpp to remove shared shared proj context (m_psLastContext), which can lead to scenarios where one thread destroys the context while another is still using it.

What are related issues/pull requests?

Tasklist

  • AI (Copilot or something similar) supported my development of this PR
  • Make sure code is correctly formatted (cf pre-commit configuration)
  • Add test case(s)
  • Add documentation
  • Updated Python API documentation (swig/include/python/docs/)
  • Review
  • Adjust for comments
  • All CI builds and checks have passed
  • ADD YOUR TASKS HERE

Environment

Provide environment details, if relevant:

  • OS:
  • Compiler:

…t), which can lead to scenarios where one thread destroys the context while another is still using it

Undo OSGeo@94ede75, which causes segfaults when using gdal in a multi-threaded environment such as a thread worker pool.

It might be the case that one thread "destroys" the context (m_psLastContext) while another thread is still using it, leading to a segfault. As you can see with the following line: m_psLastContext = OSRGetProjTLSContext();
Copy link

@fredzyda fredzyda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thank you!

@schwehr
Copy link
Member

schwehr commented Feb 13, 2025

This trouble is happening in the google3 mono repo with GDAL at head and a 5 year old version of proj - OSGeo/PROJ@cbf2eeb (I know... I'm working on the upgrade, but it's challenging).

LLVM within a couple weeks of head; python 3.11; tried both swig 4.2.1 and 4.4 (from head)

Stack trace from asan mode:

#0 pj_ctx_get_errno third_party/proj4/proj/src/ctx.cpp
#1 proj_errno third_party/proj4/proj/src/4D_api.cpp:1292:12
#2 proj_errno_reset third_party/proj4/proj/src/4D_api.cpp:1366:18
#3 OGRProjCT::TransformWithErrorCodes(unsigned long, double*, double*, double*, double*, int*) third_party/gdal/ogr/ogrct.cpp:2703:13
#4 OGRProjCT::Transform(unsigned long, double*, double*, double*, double*, int*) third_party/gdal/ogr/ogrct.cpp:2253:22
#5 OGRCoordinateTransformation::Transform(unsigned long, double*, double*, double*, int*) third_party/gdal/ogr/ogrct.cpp:2191:22
#6 GDALComputeAreaOfInterest(OGRSpatialReference*, double*, int, int, double&, double&, double&, double&) third_party/gdal/alg/gdaltransformer.cpp:1509:19
#7 GDALCreateGenImgProjTransformer2 third_party/gdal/alg/gdaltransformer.cpp:2044:13
#8 GDALWarpCreateOutput(int, void**, char const*, char const*, char**, char const* const*, GDALDataType, void**, bool, GDALWarpAppOptions*) third_party/gdal/apps/gdalwarp_lib.cpp:3870:13
#9 CreateOutput third_party/gdal/apps/gdalwarp_lib.cpp:1712:19
[SNIP]

@rouault rouault added the backport release/3.10 Backport to release/3.10 branch label Feb 13, 2025
@rouault
Copy link
Member

rouault commented Feb 13, 2025

If you say it fixes things, I guess I must trust you, but I feel stupid not understanding what's wrong exactly in the code that is reverted. Someone should withdraw my license to do C++ multi-threaded code... sigh

@schwehr
Copy link
Member

schwehr commented Feb 13, 2025

If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.

@schwehr
Copy link
Member

schwehr commented Feb 13, 2025

We have some more investigation to do on our side. I haven't fully figured out what's up yet.

@rouault
Copy link
Member

rouault commented Feb 13, 2025

@samiermerchant @fredzyda @schwehr This ticket initially appeared suspicious to me for following reasons:

  • @samiermerchant has never contributed to GDAL and his profile doesn't indicate any affiliation. But, weirdly enough, CI checks started, which shouldn't be the case for a first contribution (maybe because @fredzyda approved?)
  • @fredzyda also unknown to me, and also without affiliation, approve the pull request within minutes. My own pull requests can sit for months with anyone looking at them.
  • My mental health is not at its best, so it looked to me like a xz-utils type of social engineering
  • Luckily @schwehr showed up, so I assume you are all colleagues.

@schwehr
Copy link
Member

schwehr commented Feb 13, 2025

Hey Even,

Sorry for the surprise and thanks for your note. Mostly hoping you might think of something I missed about the threading issue as I feel over my head on this one.

I triggered the CI and have been working with them on the issue. I am also working with Owen who just posted to the list. My getting the google3 copy of GDAL to head has triggered issues in a number of teams. And am hoping to get more involvement from Google engineers with GDAL et al. on GitHub.

Never hurts to call out unusual events! I will work on making things more obvious / transparent as we go as we work through an enormous amount of tech dept.

I need to make sure I check that any new people follow the Google GitHub process and I need to make an internal doc for how best to document PRs. We use perforce and a very different change control UI inside, so it's jarring for folks who are new to GitHub or haven't done it for a long time.

I also haven't been good about sharing our local ports of autotest checks and the extra tests we have written. I will try to work my gdal-autotest2 repo back into my workflow. I know the tests are not the same code style and can't directly work in autotest, but being able to compare existing or forward port changes we have could be useful. I felt for a while that the amgdal-autotest2 repo had done its job as best as possible (except Geos is still on tut and I couldn't convince the Kakadu folk to do any unit tests), but times are always changing.

@fredzyda
Copy link

@rouault: apologies for the surprises. @samiermerchant and I are working together at Google to fix the bug we found, so he pinged me directly as soon as he put this PR up. I am in the Google open source and Googlers organizations on github, so I think I'm properly listed here as affiliated with Google.

Samier and I work on various forms of weather forecasting at Google and make extensive use of gdal in our data ingestion pipelines.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 70.037% (-4.3%) from 74.343%
when pulling d01d493 on samiermerchant:patch-1
into c0e4280 on OSGeo:master.

@schwehr
Copy link
Member

schwehr commented Feb 14, 2025

Dec 9th from before Kurt started to make the significant changes we see issues with. Cut down the logs to make it more obvious:

WARNING: ThreadSanitizer: data race (pid=3918202)
Write of size 4 at 0x564e5ac5d118 by thread T169:
#0 proj_errno_reset third_party/proj4/proj/src/4D_api.cpp:1370:14
#1 pj_ellipsoid(PJconsts*) third_party/proj4/proj/src/ell_set.cpp:77:15
#2 pj_init_ctx_with_allow_init_epsg(projCtx_t*, int, char**, int) third_party/proj4/proj/src/init.cpp:663:11

Previous write of size 4 at 0x564e5ac5d118 by thread T168:
#0 proj_errno_reset third_party/proj4/proj/src/4D_api.cpp:1370:14 
#1 OGRProjCT::TransformWithErrorCodes(unsigned long, double*, double*, double*, double*, int*) third_party/gdal/ogr/ogrct.cpp:2691:13
#2 OGRProjCT::Transform(unsigned long, double*, double*, double*, double*, int*) third_party/gdal/ogr/ogrct.cpp:2248:22

Location is global 'pj_errno' of size 4

SUMMARY: ThreadSanitizer: data race third_party/proj4/proj/src/4D_api.cpp:1370:14 in proj_errno_reset

with the fix cl with tsan still get an error

WARNING: ThreadSanitizer: data race
 Read of size 8 at 0x723c00215938 by thread T184:
GDALDriver::GetOpenCallback() third_party/gdal/gcore/gdal_priv.h:2227:16

Previous write of size 8 at 0x723c00215938 by thread T185:
#0 GDALDriver::Open(GDALOpenInfo*, bool) third_party/gdal/gcore/gdaldriver.cpp:98:13
#1 GDALOpenEx third_party/gdal/gcore/gdaldataset.cpp:3835:39

Location is heap block of size 232 at 0x723c002158e0 allocated by main thread:
#0 operator new(unsigned long) tsan_new_delete.cpp:64:3 
#1 GDALRegister_GTiff third_party/gdal/frmts/gtiff/geotiff.cpp:1228:28 
#2 GDALAllRegister third_party/gdal/frmts/gdalallregister.cpp:331:5

SUMMARY: ThreadSanitizer: data race third_party/gdal/gcore/gdal_priv.h:2227:16 in GDALDriver::GetOpenCallback()

The code:

typedef GDALDataset *(*OpenCallback)(GDALOpenInfo *);

OpenCallback pfnOpen = nullptr;

virtual OpenCallback GetOpenCallback()
{
        return pfnOpen;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport release/3.10 Backport to release/3.10 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants