When I went through the process of parallelizing numpy, I ran into Amdahl's law.
Assume in this thought experiment we parallelize the ufuncs, which account for about 50% of numpy loop time. Say we get a 5x speedup on average for that 50% of processing time. The other 50% of processing time we could not parallelize because the code is not in ufunc form (for example, routines like sort, stack, putmask, and fancy indexing).
Suppose a job takes 10 minutes. Our goal is for those 10 minutes to turn into 2 minutes. That would be great: 5x the speed! But instead the job takes 6 minutes... not even twice as fast (sad face). That is because 5 of the 10 minutes stay the same, spent in routines we could not parallelize because they were not exposed. This is Amdahl's law.
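For reference, the standard Amdahl's law arithmetic with parallel fraction p = 0.5 and a 5x speedup on that fraction:

```python
# Amdahl's law: if a fraction p of the runtime is sped up by a factor s,
# the overall speedup is 1 / ((1 - p) + p / s).
p, s = 0.5, 5.0
speedup = 1.0 / ((1.0 - p) + p / s)
print(speedup)        # ~1.67x overall, far short of 5x
print(10 / speedup)   # a 10-minute job drops only to 6.0 minutes
```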
When we parallelize only some routines, the end user will run into this lack of wow factor. The solution: parallelize more routines! However, numpy has not broken out all of its loops.
Therefore core numpy developers will have to make a choice:

1. Turn almost every numpy loop into an exported loop (to allow it to be optimized and parallelized). This means finding the spot in the numpy code where each such loop runs, adding a call to the exported loop there, and moving the once-internal loop into the new math lib or similar (see the sketch after this list).
2. Internalize threading (and avoid exporting the loops). This is easier: quick and dirty.
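To make choice 1 concrete, here is a minimal sketch of what an exported loop buys an outside package. The `(dst, src, start, stop)` loop signature and the names below are hypothetical, not numpy's actual API; the point is that once an inner loop is callable from outside, an external threading layer can chunk the work across it:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def exported_loop(dst, src, start, stop):
    # Stand-in for a once-internal numpy inner loop (here, a sqrt kernel).
    np.sqrt(src[start:stop], out=dst[start:stop])

def threaded_call(loop, dst, src, workers=4):
    # External threading layer: split the index range into chunks and run
    # the exported loop on each chunk. numpy releases the GIL inside its
    # inner loops, so plain threads give real parallelism here.
    bounds = np.linspace(0, len(src), workers + 1, dtype=np.intp)
    with ThreadPoolExecutor(workers) as pool:
        for start, stop in zip(bounds[:-1], bounds[1:]):
            pool.submit(loop, dst, src, start, stop)

src = np.random.rand(10_000_000)
dst = np.empty_like(src)
threaded_call(exported_loop, dst, src)
```

Routines whose loops stay internal (sort, putmask, fancy indexing) offer no such seam, which is why they remain in the serial 50% above.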
If choice 2 is taken, then this project will be a waste of time, because this project does its threading outside numpy and depends on the loops being exported.
If choice 1 is taken, then the world will benefit from a cross-platform, portable math lib.
If choice 1 is taken, I hope it can commence soon. We need to methodically go through the code and, loop by loop, start exporting them. This will happen only if there is rough consensus and a directive from the top to proceed down this path.