Processing multiple body pairs at an instance #613

jenetscaria-mcw · 2023-07-07T06:58:11Z

jenetscaria-mcw
Jul 7, 2023

Hi @jrouwe,
While exploring the current pipeline, we saw that according to the current design, after creating a queue of active bodies and finding the corresponding colliding pairs in the queue, we process each pair individually using ProcessBodyPair function and accordingly SSE, AVX2 and AVX512 instructions are used to process individual pair of active bodies. Can we look at processing more body pairs from the queue at once so as to make a complete utilization of 512-bit registers and possibly get a higher performance.

jrouwe · 2023-07-07T11:10:48Z

jrouwe
Jul 7, 2023
Maintainer

Can we look at processing more body pairs from the queue at once so as to make a complete utilization of 512-bit registers and possibly get a higher performance.

Do you mean that you want to look into this or are you asking for me to look into this?

AVX-512 is not supported on the Intel 12th and 13th gen consumer processors, and while it may be coming back, it means that it will be at least another 10 years before we can reasonably assume that players will have hardware that supports AVX-512. So I don't want to spend much energy in supporting this as the only use case would be games where the server runs the simulation and the client doesn't. My 12th gen Intel doesn't support it, so I cannot test any AVX-512 code myself.

Besides that I think it will be very hard to introduce parallelism at the level of ProcessBodyPair. In that function the two shapes of two bodies are tested for collisions, but the shape types wildly vary from body to body (compounds, spheres, boxes, meshes, heighfields etc.) so if you wanted to run 4 collision tests in parallel using AVX-512 you would have to happen to get 4 shape pairs that have the exact same structure. Let's say that we manage to implement this (maybe as an additional queue) then I think AVX-512 only makes sense for certain collision pairs, e.g. testing a sphere vs sphere is so trivial that the setup cost of moving everything in 512 bit registers probably outweighs the benefit of the faster calculation. It only really makes sense when we're testing complex vs complex shapes (e.g. convex vs mesh). For convex vs mesh the next problem is going to be that the algorithm alternates between testing nodes and testing triangles, so unless the structure of the bounding volume trees for the 4 meshes we're testing in parallel are very similar I don't think we're going to get very good occupancy. The added code complexity for a system like this means that any future changes will become much harder and the code will become very hard to read.

A simpler solution would be to not try to test multiple body pairs at the same time, but to optimize the collision check in e.g. the convex vs mesh case. At the leaves of the bounding volume tree there are X triangles. These triangles all have to be tested against the same shape, so it would be possible to test 4 triangles at the same time. There's no guarantee that this will be much faster though, as testing a convex vs triangle requires an unknown amount of GJK / EPA iterations. In fact, I tried this idea years ago when Jolt was still a test frame work for ray vs triangle testing. This version didn't make it into the final article since it was not better than other implementations.

Finally, in my experience it is not the collision checking that is the bottleneck. It is my crappy LockFreeHashMap implementation that dies from cache misses because the key/values belonging to a certain hash are in a linked list. I've been meaning to write a better implementation for this.

0 replies

jrouwe · 2023-07-14T13:33:06Z

jrouwe
Jul 14, 2023
Maintainer

Moving to a discussion as I don't think this is something that can be implemented.

0 replies

jenetscaria-mcw · 2023-07-31T08:19:48Z

jenetscaria-mcw
Jul 31, 2023
Author

@jrouwe , thanks for the response,
I was just wondering what do you mean by Cache misses belonging to a certain hash are in a linked list for the LockFreeHashMap algorithm. From what I saw, in the contact constraint manager, we try to find the cached body pair and if the corresponding key value pair is not found, we create a new entry in the linked list which is inserted as a new element in the beginning of the linked list.
Can you tell what causes the misses here? Will usage of any other library or data structure be helpful here?

0 replies

jrouwe · 2023-07-31T14:19:49Z

jrouwe
Jul 31, 2023
Maintainer

LockFreeHashMap uses 'Separate Chaining' (see https://en.wikipedia.org/wiki/Hash_table) which means that to look something up in the hash map you first have a cache miss to get the linked list head pointer and then another one to actually get to the first entry (and in case there are multiple entries in the map you'll have more cache misses as you traverse the linked list structure). I think using a different technique like 'Open Addressing' could help but I haven't investigated how that would work for a lock free map.

0 replies

jenetscaria-mcw · 2023-08-01T19:50:46Z

jenetscaria-mcw
Aug 1, 2023
Author

@jrouwe, if we are able to implement Open Addressing for LockFreeHashMap and with that if we try to process multiple body pairs of similar structure together at once, will it be feasible? Open Addressing can help increase cache performance (maybe by using linear probing), but if we process more body pairs here, will it cause an increase in cache misses?

6 replies

jenetscaria-mcw Aug 2, 2023
Author

We were actually looking into any scope of vectorization or further optimization so as to contribute to Jolt Physics. That's why, I wanted to check with you whether it would be feasible to vectorize the code in order to process multiple body pairs of similar shapes together.

The batch of active bodies is being fetched in the JobFindCollisions function, so if we are able to process a batch of bodies with similar structures together, that might help to get a good boost in performance.
Also, for the LockFreeHashMap algorithm, will processing more body pairs lead to increase in cache misses as well?

jrouwe Aug 2, 2023
Maintainer

Also, for the LockFreeHashMap algorithm, will processing more body pairs lead to increase in cache misses as well?

No I don't think so. In any case, I would strongly advice against starting vectorization at ProcessBodyPairs level for reasons listed in my first reply. It is unlikely that I would merge this into the master branch as I don't see how it can be done without making the code unreadable/unmaintainable.

jenetscaria-mcw Aug 21, 2023
Author

@jrouwe , Thanks for the insights,
We are currently checking on Open Addressing method for implementing LockFreeHashMap and simultaneously for optimizing collision check for 4 triangles processing for ConvexVsMesh example.

For the collision check inside the Collide function, after finding the penetration depth, in order to perform the GJK and EPA calculations, will it be fine to modify GetPenetrationDepthStepGJK and GetPenetrationDepthStepEPA functions to process data from all 4 triangles and corresponding convex shapes together?

jrouwe Aug 21, 2023
Maintainer

I would prefer to keep the current implementation as reference, maybe you can create a new implementation like GJKClosestPoint4. I'm not sure if there will be any benefit in trying to run EPA on 4 triangles at a time because the algorithm only runs in a fraction of the cases. Please make sure to not use any intrinsics in the implementation, intrinsics are only allowed in classes like Vec4, UVec4 etc.

jrouwe Aug 21, 2023
Maintainer

B.t.w. it will be a lot easier to do a function like MeshShape::sCollideSphereVsMesh first.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Processing multiple body pairs at an instance #613

{{title}}

Replies: 5 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Processing multiple body pairs at an instance #613

jenetscaria-mcw Jul 7, 2023

Replies: 5 comments · 6 replies

jrouwe Jul 7, 2023 Maintainer

jrouwe Jul 14, 2023 Maintainer

jenetscaria-mcw Jul 31, 2023 Author

jrouwe Jul 31, 2023 Maintainer

jenetscaria-mcw Aug 1, 2023 Author

jenetscaria-mcw Aug 2, 2023 Author

jrouwe Aug 2, 2023 Maintainer

jenetscaria-mcw Aug 21, 2023 Author

jrouwe Aug 21, 2023 Maintainer

jrouwe Aug 21, 2023 Maintainer

jenetscaria-mcw
Jul 7, 2023

Replies: 5 comments 6 replies

jrouwe
Jul 7, 2023
Maintainer

jrouwe
Jul 14, 2023
Maintainer

jenetscaria-mcw
Jul 31, 2023
Author

jrouwe
Jul 31, 2023
Maintainer

jenetscaria-mcw
Aug 1, 2023
Author

jenetscaria-mcw Aug 2, 2023
Author

jrouwe Aug 2, 2023
Maintainer

jenetscaria-mcw Aug 21, 2023
Author

jrouwe Aug 21, 2023
Maintainer

jrouwe Aug 21, 2023
Maintainer