Move sterf to CPU; Add experimental parallelism for sterf #462
base: develop
Conversation
Thanks. I'm working on SYEVD performance myself, but this is a very different approach. I'm not able to review your changes right this moment, but I will take a look as soon as possible. I know @jzuniga-amd has talked about hybrid host/device algorithms before. Thus far, rocSOLVER has been a purely GPU library, but I don't think we've ruled out hybrid approaches. I wonder if an environment variable might be better than a compile-time flag for switching from a pure GPU algorithm to a hybrid algorithm. In general, that sort of decision may depend on the balance of CPU and GPU resources available, which is a very dynamic condition. It's a function of both the hardware capacity and the mix of other jobs running on the same hardware.
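If the runtime-switch route were taken, the dispatch could be as simple as reading an environment variable once per call. A minimal sketch; the variable name `ROCSOLVER_STERF_MODE` and the enum are hypothetical illustrations, not an existing rocSOLVER interface:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical runtime switch between the current pure-GPU path and a
// CPU/hybrid path. Defaulting to GPU keeps existing asynchronous workflows
// unaffected; users opt in to the hybrid path explicitly.
enum class SterfMode { gpu, hybrid };

SterfMode get_sterf_mode()
{
    const char* v = std::getenv("ROCSOLVER_STERF_MODE");
    if(v && std::strcmp(v, "hybrid") == 0)
        return SterfMode::hybrid;
    return SterfMode::gpu; // default: current pure-GPU behavior
}
```

A compile-time flag could still gate whether the hybrid path is built at all, with the environment variable choosing between the available paths at run time.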
Substituting it here with a simple sorting algorithm. If more performance is required in
the future, lasrt_increasing should be debugged or another quick-sort method
could be implemented) **/
for(int ii = 1; ii < n; ii++)
One idea may be to use shell sort (https://en.wikipedia.org/wiki/Shellsort) or to call LAPACK's lasrt.
Note that this is a copy-paste from sterf (rocauxiliary_sterf.hpp). The original code contained a TODO which mentioned problems in lasrt so I decided to leave sort as is. I can implement shell sort if needed though.
In any case, sorting takes a small fraction of the total kernel time (the most time-consuming part is the QR/QL iteration), so this is probably lower priority...
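For reference, a shell sort along the lines suggested above could look like the following. This is a standalone sketch with simple halving gaps; `shell_sort_increasing` is an illustrative name, not code from this PR:

```cpp
#include <vector>

// Shell sort: insertion sort over progressively shrinking gaps. It is
// in-place and allocation-free, which makes it a drop-in alternative to
// the plain insertion sort copied over from sterf, with better behavior
// on larger arrays.
template <typename T>
void shell_sort_increasing(std::vector<T>& d)
{
    const int n = static_cast<int>(d.size());
    for(int gap = n / 2; gap > 0; gap /= 2)
    {
        for(int i = gap; i < n; i++)
        {
            T tmp = d[i];
            int j = i;
            // shift larger elements of the gap-subsequence to the right
            for(; j >= gap && d[j - gap] > tmp; j -= gap)
                d[j] = d[j - gap];
            d[j] = tmp;
        }
    }
}
```

Better gap sequences (e.g. Ciura's) improve the constant factors, but as noted above the sort is a small fraction of the kernel time, so the simple halving sequence is likely sufficient.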
to compute the eigenvalues of a symmetric tridiagonal matrix given by D
and E on CPU, non-batched version**/
template <typename T>
void sterf_cpu(const rocblas_int n,
How about calling lapack dsterf directly?
Is there OpenMP parallelism for sterf on the CPU host, perhaps via "-fopenmp" as supported by clang/flang?
One may also consider OpenMP parallelism across batch on CPU host.
If it is a serial algorithm on a single CPU core, would sterf still be an expensive part or take too much time when n is large?
> How about calling lapack dsterf directly?
That might be nice to control with a CMake configuration option.
> Is there OpenMP parallelism for sterf on the CPU host, perhaps via "-fopenmp" as supported by clang/flang?
I'm not sure if support for CPU OpenMP with ROCm is something that just happens to work or if it's an officially supported feature. If it just happens to work, the feature might not be maintained and that could leave us in a tough spot in the future. OpenMP GPU offloading is explicitly supported with ROCm, but I'm not sure how OpenMP CPU and GPU offloading interact.
If you know more, I'd be interested to learn about how OpenMP is supposed to work in ROCm.
My understanding is that something like "#pragma omp parallel for" will be ignored or treated as a comment if the compiler does not support that pragma. There might be a warning at compile time, but it should not affect correctness, just performance. I think the current rocm/hipcc does support OpenMP via "-fopenmp". Perhaps the hipcc/clang compiler group can verify whether OpenMP will be supported. I suspect that OpenMP for the CPU host is mature, but support for OpenMP 5.0 with "target offload" to the GPU may still be under development.
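To illustrate the graceful-degradation point, a batch-level parallel loop might be sketched as below. The loop body is a hypothetical stand-in for a per-instance sterf_cpu call; without "-fopenmp" the pragma is ignored and the loop runs serially with identical results:

```cpp
#include <vector>

// Batch-level host parallelism: each batch instance is an independent
// serial computation, so a plain parallel-for across the batch index is
// race-free. When OpenMP is unavailable the pragma is skipped and the
// loop executes serially, producing the same output.
void process_batch(std::vector<std::vector<double>>& batch)
{
#pragma omp parallel for
    for(int b = 0; b < static_cast<int>(batch.size()); b++)
    {
        // stand-in for a per-instance serial kernel (e.g. sterf_cpu)
        for(double& x : batch[b])
            x *= 2.0;
    }
}
```

Because instances only touch their own data, no synchronization or scheduling hints are needed beyond the pragma itself.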
> How about calling lapack dsterf directly?
We could do that, but dsterf comes from lapack/cblas, which seems to be used only in tests and benchmarks. Is it ok to introduce such a dependency in rocSOLVER proper?
> If it is a serial algorithm on a single cpu core, would sterf still be an expensive part or take too much time when n is large?
Sterf on CPU seems to be much faster than GPU for all sizes that I was able to measure:
I think the routine dsterf() is part of LAPACK (liblapack.a).
One may need to call the routine with the "Fortran" version of the subroutine arguments.
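The Fortran calling convention mentioned here means every argument is passed by pointer, and the symbol usually carries a trailing underscore. A guarded sketch of what the binding could look like; the `ROCSOLVER_WITH_LAPACK` macro is hypothetical (e.g. set by a CMake option), and when it is not defined the wrapper simply reports that LAPACK is unavailable rather than calling anything:

```cpp
#include <vector>

// LAPACK's documented interface is SUBROUTINE DSTERF( N, D, E, INFO ):
// Fortran passes all arguments by reference, hence the pointer prototype
// and the (common, but platform-dependent) trailing underscore.
#ifdef ROCSOLVER_WITH_LAPACK
extern "C" void dsterf_(const int* n, double* d, double* e, int* info);
#endif

// Computes eigenvalues of the symmetric tridiagonal matrix (d = diagonal,
// e = off-diagonal) in-place in d. Returns false when LAPACK is not linked
// into this build.
bool sterf_via_lapack(int n, std::vector<double>& d, std::vector<double>& e)
{
#ifdef ROCSOLVER_WITH_LAPACK
    int info = 0;
    dsterf_(&n, d.data(), e.data(), &info);
    return info == 0;
#else
    (void)n; (void)d; (void)e;
    return false; // no LAPACK dependency in this configuration
#endif
}
```

Tying the macro to a CMake configuration option, as suggested earlier in the thread, would let packagers decide whether to take on the LAPACK dependency.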
const rocblas_int tid = hipThreadIdx_x;

l0 = l = split_ranges[2 * tid];
A minor comment: the letter "l" (el) can look like the digit "1" in some fonts, and "l0" can look like ten ("10").
Perhaps another name could be used to avoid the ambiguity?
Just a suggestion.
Ah, sure. I can fix this.
T* h_E = new T[n];
rocblas_int h_info = 0;

hipDeviceSynchronize();
Is there asynchronous kernel launching on streams?
If so, perhaps a stream synchronize could be used instead of a whole-device synchronize?
Yes, I'll fix this too.
Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.
It looks like there are failures in several functions unrelated to sterf. This seems to be affecting other PRs too, so it doesn't appear to have been caused by your changes.
I'm afraid it's not possible for community members to see the logs. If it were an actual issue, I'd copy the output into this thread. However, the cause of the failure has nothing to do with your changes. rocSOLVER PRs are tested using the latest build of the corresponding rocBLAS branch, and it seems there's been a bug in syr2k on the rocBLAS develop branch. The rocBLAS team is aware of the problem and they'll fix it before release, but it will cause failures in the rocSOLVER CI until it is fixed.
I believe I addressed all code review remarks. Let me know if anything else is needed.
The team is still discussing the best ways to introduce pure-CPU or hybrid algorithms into the library. There is nothing wrong with offloading computations to the CPU (especially when the problem/algorithm is not very suitable for GPUs), either by calling internal code (sterf_cpu) or linking to another CPU library (lapack sterf), but the change must be introduced carefully.

Currently, rocSOLVER is a 100% GPU library; calls to rocSOLVER APIs are asynchronous on the host; they can return immediately even if the computations on the device are not done yet and, in practice, no CPU cycles are used. This behavior is expected by some users, and changing it could be problematic for some workflows, especially if the switch is embedded in the build process (at compile time). Allowing users to purposely switch between CPU, GPU or hybrid modes at run time could provide more flexibility, but we need to carefully plan the design, as this is also related to other upcoming features in our roadmaps. (The part of this PR that optimizes/parallelizes the GPU code would be easier to review and merge at this time, and we can leave the CPU/hybrid work for a future PR, but it is up to you whether you want to divide your contribution into two different PRs.)

Normally, when we add a new function to the library, we initially focus on the correctness of the algorithm (accuracy of results, etc.) and the functionality of the API. For the optimization round, we want the performance gain to really justify the introduction of possibly more complicated code and its implications for the code-maintenance process. It is clear that the sterf_cpu version is faster in your measurements, but is there any difference between the in-house sterf_cpu and the LAPACK routine in terms of accuracy of the results? We also need to think of a rationale that justifies the proposed approach at a higher level in a workflow.
Advanced users may know whether their problems are better suited for CPU or GPU computations, and there is nothing preventing them from using different libraries according to their needs; rocSOLVER is not intended to be a single-entry-point, fit-all-cases solution. If a user doesn't get enough performance out of a GPU solution, they can directly call a CPU library instead (an AMD CPU library or others). Right now, with the memory model that we are using, I don't see why a user would want to transfer data to the GPU just to call a rocSOLVER API that will simply copy the data back to the host to perform the computations on the CPU (and this is essentially what sterf_cpu does). A scenario like this (with internal data transfers) makes more sense with a truly hybrid code, i.e. code that processes the data via GPU-CPU collaboration, which is not the case for the proposed sterf_cpu. In the meantime, please notice that
Hi, I've been trying to improve the performance of the SYEVD function lately. The sterf kernel is the most time-consuming part of the code. I tried two ways to improve it:

1. Adding experimental parallelism to the GPU sterf kernel.
2. Moving the sterf computation to the CPU.

The efficiency of the first approach depends on the input values of the tridiagonal matrix and doesn't provide a consistent improvement. The second variant provides significant acceleration in all cases (roughly a 10x speedup).
I am attaching the relevant patch.