Processing multiple body pairs at an instance #613
Replies: 5 comments 6 replies
-
Do you mean that you want to look into this or are you asking for me to look into this? AVX-512 is not supported on the Intel 12th and 13th gen consumer processors, and while it may be coming back, it means that it will be at least another 10 years before we can reasonably assume that players will have hardware that supports AVX-512. So I don't want to spend much energy in supporting this as the only use case would be games where the server runs the simulation and the client doesn't. My 12th gen Intel doesn't support it, so I cannot test any AVX-512 code myself. Besides that I think it will be very hard to introduce parallelism at the level of ProcessBodyPair. In that function the two shapes of two bodies are tested for collisions, but the shape types wildly vary from body to body (compounds, spheres, boxes, meshes, heighfields etc.) so if you wanted to run 4 collision tests in parallel using AVX-512 you would have to happen to get 4 shape pairs that have the exact same structure. Let's say that we manage to implement this (maybe as an additional queue) then I think AVX-512 only makes sense for certain collision pairs, e.g. testing a sphere vs sphere is so trivial that the setup cost of moving everything in 512 bit registers probably outweighs the benefit of the faster calculation. It only really makes sense when we're testing complex vs complex shapes (e.g. convex vs mesh). For convex vs mesh the next problem is going to be that the algorithm alternates between testing nodes and testing triangles, so unless the structure of the bounding volume trees for the 4 meshes we're testing in parallel are very similar I don't think we're going to get very good occupancy. The added code complexity for a system like this means that any future changes will become much harder and the code will become very hard to read. A simpler solution would be to not try to test multiple body pairs at the same time, but to optimize the collision check in e.g. the convex vs mesh case. At the leaves of the bounding volume tree there are X triangles. These triangles all have to be tested against the same shape, so it would be possible to test 4 triangles at the same time. There's no guarantee that this will be much faster though, as testing a convex vs triangle requires an unknown amount of GJK / EPA iterations. In fact, I tried this idea years ago when Jolt was still a test frame work for ray vs triangle testing. This version didn't make it into the final article since it was not better than other implementations. Finally, in my experience it is not the collision checking that is the bottleneck. It is my crappy LockFreeHashMap implementation that dies from cache misses because the key/values belonging to a certain hash are in a linked list. I've been meaning to write a better implementation for this. |
Beta Was this translation helpful? Give feedback.
-
Moving to a discussion as I don't think this is something that can be implemented. |
Beta Was this translation helpful? Give feedback.
-
@jrouwe , thanks for the response, |
Beta Was this translation helpful? Give feedback.
-
LockFreeHashMap uses 'Separate Chaining' (see https://en.wikipedia.org/wiki/Hash_table) which means that to look something up in the hash map you first have a cache miss to get the linked list head pointer and then another one to actually get to the first entry (and in case there are multiple entries in the map you'll have more cache misses as you traverse the linked list structure). I think using a different technique like 'Open Addressing' could help but I haven't investigated how that would work for a lock free map. |
Beta Was this translation helpful? Give feedback.
-
@jrouwe, if we are able to implement Open Addressing for LockFreeHashMap and with that if we try to process multiple body pairs of similar structure together at once, will it be feasible? Open Addressing can help increase cache performance (maybe by using linear probing), but if we process more body pairs here, will it cause an increase in cache misses? |
Beta Was this translation helpful? Give feedback.
-
Hi @jrouwe,
While exploring the current pipeline, we saw that according to the current design, after creating a queue of active bodies and finding the corresponding colliding pairs in the queue, we process each pair individually using ProcessBodyPair function and accordingly SSE, AVX2 and AVX512 instructions are used to process individual pair of active bodies. Can we look at processing more body pairs from the queue at once so as to make a complete utilization of 512-bit registers and possibly get a higher performance.
Beta Was this translation helpful? Give feedback.
All reactions