
Hybrid execution of sterf #865

Open · wants to merge 10 commits into base: develop
Conversation

@tfalders (Collaborator) commented Dec 4, 2024

This is a revamp of #462 using the new hybrid infrastructure that was introduced for GESVD. Here's a brief summary of changes:

  • Added CPU execution path for STERF. Tests for STERF_HYBRID and SYEV_HYBRID have been added to verify correctness.
  • Rearranged functions in lib_device_helpers, so that device functions are no longer present in the kernels section.
  • is_device_pointer has been moved from a lambda function to a reusable function in lib_host_helpers.
  • Created a new class, rocsolver_hybrid_array, to assist with memory allocation and data transfers to and from the device. I have revised rocsolver_bdsqr_host_batch_template to use this new functionality.
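As a rough illustration of the kind of helper described in the last bullet (all names and the layout here are hypothetical sketches, not the actual `rocsolver_hybrid_array` interface, and device-transfer logic is omitted), a minimal RAII wrapper over a host-side strided batch buffer might look like:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical sketch of a host-side strided batch buffer; the real
// rocsolver_hybrid_array also manages transfers to and from the device.
template <typename T>
struct hybrid_buffer_sketch
{
    T* val_array = nullptr;
    int dim = 0;
    int batch_count = 0;

    hybrid_buffer_sketch() = default;
    // non-copyable, to avoid an accidental double free of val_array
    hybrid_buffer_sketch(const hybrid_buffer_sketch&) = delete;
    hybrid_buffer_sketch& operator=(const hybrid_buffer_sketch&) = delete;

    // allocate zero-initialized storage for batch_count instances of size dim
    bool init(int dim_, int batch_count_)
    {
        dim = dim_;
        batch_count = batch_count_;
        val_array = (T*)calloc((size_t)dim * batch_count, sizeof(T));
        return val_array != nullptr;
    }

    // pointer to the data for batch instance b
    T* operator[](int b) { return val_array + (size_t)b * dim; }

    ~hybrid_buffer_sketch()
    {
        free(val_array);
        val_array = nullptr;
    }
};
```

The point of the sketch is only the ownership shape: one owned allocation, per-batch access through `operator[]`, and cleanup tied to the destructor.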

@tfalders tfalders added the noOptimizations Disable optimized kernels for small sizes for some routines label Dec 4, 2024
@amd-jnovotny (Contributor) left a comment:

changelog ok

If uplo = rocblas_fill_upper, only the upper triangular part is copied
If uplo = rocblas_fill_lower, only the lower triangular part is copied **/
template <typename T, typename U, typename Mask = no_mask>
ROCSOLVER_KERNEL void copy_mat(copymat_direction direction,
Contributor:

Minor suggestion: if this is intended to be a frequently reused library routine, perhaps we might consider a more general version from (A,shiftA,lda,strideA) to (B,shiftB,ldb,strideB). The special case of a linear "buffer" would be shiftB = 0, strideB = (m*n), ldb = m. With a more general interface, perhaps we don't need the extra enum for "direction"; we would just reverse the order of the "A" and "B" arguments.

Perhaps consider a more suggestive name for Mask. It is not immediately obvious to me whether we perform the copy if the mask is non-zero, or skip the copy if the mask is non-zero.

Is there a convention to place action arguments before the data? Say in TRSM, we place the uplo, side, trans, and diag arguments before the data, instead of placing diag or uplo at the end of the argument list.

@tfalders (Collaborator, Author):

IIRC, this function has multiple overloads where B has fewer associated parameters than A. In these cases, specifying a direction is necessary since we can't easily swap A and B. It's not necessary for this particular overload, but I think we kept it for consistency.

I don't entirely remember how the mask works either, so we should definitely improve documentation at a minimum.

Conventionally, yes, those arguments are placed at the start, but when they have default values they can't be put before any arguments without default values. I seem to remember they were tacked on after the fact, and assigned default values to avoid breaking existing calls.
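For context, the constraint mentioned above is a core C++ rule: once a parameter has a default argument, every parameter after it must also have one, so optional flags retrofitted onto an existing signature naturally end up at the tail. A toy illustration (names are made up, not from the rocSOLVER code):

```cpp
#include <cassert>

// Original interface took only 'x'. The 'factor' flag was added later;
// giving it a default value keeps old one-argument call sites compiling,
// but it must go at the end: a defaulted parameter cannot precede
// non-defaulted ones.
int scale(int x, int factor = 2)
{
    return x * factor;
}
```

Existing calls like `scale(3)` keep working while new calls can pass `scale(3, 5)`, which matches the "tacked on after the fact" history described above.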

T* Ap = load_ptr_batch<T>(A, b, shiftA, strideA);
T* Bp = &buffer[b * strideB];

if(direction == copymat_to_buffer)
Contributor:

I wonder whether this library routine can be further generalized with a "trans" option to perform conjugate transpose, transpose, or none. MAGMA BLAS has a routine to perform efficient matrix transposes. Just a suggestion.
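As a sketch of the kind of generalization being suggested (hypothetical host-side code with made-up names, not the rocSOLVER kernel; a device kernel would map i,j to thread indices and a conjugate case would apply conj to each element):

```cpp
#include <cassert>
#include <vector>

enum class copy_trans { none, transpose };

// Copy column-major m x n matrix A (leading dimension lda) into B
// (leading dimension ldb), optionally transposing on the fly.
template <typename T>
void copy_mat_trans(copy_trans trans, int m, int n,
                    const T* A, int lda, T* B, int ldb)
{
    for(int j = 0; j < n; ++j)
        for(int i = 0; i < m; ++i)
        {
            if(trans == copy_trans::none)
                B[i + j * ldb] = A[i + j * lda]; // B(i,j) = A(i,j)
            else
                B[j + i * ldb] = A[i + j * lda]; // B(j,i) = A(i,j)
        }
}
```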

@tfalders (Collaborator, Author):

We have a copy_trans_mat function that does have this functionality.

ROCSOLVER_BEGIN_NAMESPACE

template <typename T, typename I, typename U>
struct rocsolver_hybrid_array
Contributor:

It would be nice to have more documentation or description, not of the internal implementation details, but of the interface: what it is intended to do and what functions or methods are available. Is it keeping a data structure (strided array) on the CPU and mirroring it on the GPU device, or vice versa? If so, perhaps the routines are trying to "sync"? Should there be both "async" and non-async versions, since the "_async" suffix on the routine names suggests that a non-async version exists? Are there calls to synchronize the stream in the "_async" routines? If these are not truly "async" routines, perhaps just leave out "_async"? I also wonder whether std::vector could be leveraged so there is less worry about memory leaks.

@tfalders (Collaborator, Author):

Yeah, the documentation is rather sparse at the moment. I can take another look and try to improve it.

I added the _async suffix mostly to indicate to anyone using the function that a hipStreamSynchronize needs to be done before using the results of the function. I could have synchronized within the functions themselves, but I wanted to give us the option of queuing a number of hipMemcpys that can be sync'ed all at once for better performance.
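The queue-then-sync idiom described above can be mocked entirely on the host (plain C++ stand-ins, purely illustrative: in the real code the queue is a HIP stream, enqueue corresponds to hipMemcpyAsync, and synchronize to hipStreamSynchronize):

```cpp
#include <cassert>
#include <cstring>
#include <functional>
#include <vector>

// Host-only mock of "queue several copies, synchronize once".
// Results of enqueued copies are not valid until synchronize() runs,
// which mirrors why the _async suffix is a useful warning to callers.
struct mock_stream
{
    std::vector<std::function<void()>> pending;

    void enqueue_copy(void* dst, const void* src, size_t bytes)
    {
        pending.push_back([=]() { std::memcpy(dst, src, bytes); });
    }

    void synchronize()
    {
        for(auto& op : pending)
            op();
        pending.clear();
    }
};
```

Batching several small transfers behind one synchronization point is the performance motivation given above: one sync amortizes its cost over all queued copies.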

@@ -103,11 +105,19 @@ class SYEV_HEEV : public ::TestWithParam<syev_heev_tuple>
}
};

class SYEV : public SYEV_HEEV
class SYEV : public SYEV_HEEV<0>
Contributor:

Minor: is there a convention to use all upper case for compile-time constants or #define macros? If so, perhaps consider using mixed case for the class name? Just a thought.

@tfalders (Collaborator, Author):

I believe that changing the capitalization of the test class will also change the capitalization of the test suite output. I like the all caps text in the test output as it makes it very easy to pick out the function name.

if(batch_array && (val_array || this->dim < 0))
free(batch_array);
}

Collaborator:

Since this class owns malloc-ed memory it might be a good idea to delete its implicit copy constructor and assignment operator to guard against a double free.

Collaborator:

It would be nicer to follow the rule of five and also explicitly default or delete the move constructor and the move assignment operator (they are implicitly deleted here, but you could also default them without any issues if that fits how the struct is meant to be used).
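For illustration, one rule-of-five treatment of a malloc-owning type (a hypothetical simplified struct, not the actual class): delete the copy operations to prevent double free, and either delete the moves or, as sketched here, implement them so ownership transfers and the source pointer is nulled out.

```cpp
#include <cassert>
#include <cstdlib>
#include <type_traits>
#include <utility>

// Simplified owner of malloc-ed memory following the rule of five.
struct owned_buffer
{
    double* data = nullptr;

    explicit owned_buffer(size_t n) { data = (double*)calloc(n, sizeof(double)); }

    // copies deleted: two owners of one malloc-ed block means double free
    owned_buffer(const owned_buffer&) = delete;
    owned_buffer& operator=(const owned_buffer&) = delete;

    // moves transfer ownership and null out the source
    owned_buffer(owned_buffer&& other) noexcept : data(other.data)
    {
        other.data = nullptr;
    }
    owned_buffer& operator=(owned_buffer&& other) noexcept
    {
        if(this != &other)
        {
            free(data);
            data = other.data;
            other.data = nullptr;
        }
        return *this;
    }

    ~owned_buffer() { free(data); }
};
```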

Contributor:

I wonder whether using thrust::host_vector and thrust::device_vector would potentially simplify the code. This is in the spirit of "eating our own dog food" or "drinking our own champagne" in re-using AMD rocThrust software. Just a thought.

@jmachado-amd (Collaborator) left a comment:

It looks good, thanks @tfalders!

The rocsolver_hybrid_array struct complements the implementation of the current hybrid methods well. I've just included a few comments there, let me know if you want me to clarify anything.

kernels may be called on each individual batch instance. A typical workflow will call init_pointers_only
to allocate a host buffer for the device pointers, and then execute device kernels on the pointers
provided by the [] operator. */
template <typename T, typename I, typename U>
Collaborator:

Since this struct uses C functions to manage memory "with type" T, it is good practice to make sure that T is_standard_layout, and maybe consider adding checks to make sure that it is not a pointer or reference type.
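A sketch of the kind of compile-time guards being suggested (helper name is hypothetical):

```cpp
#include <cassert>
#include <type_traits>

// Compile-time guards for a type that will be managed with C-style
// malloc/free and raw byte-wise transfers.
template <typename T>
struct hybrid_type_check
{
    static_assert(std::is_standard_layout<T>::value,
                  "T must be standard-layout for raw-memory management");
    static_assert(!std::is_pointer<T>::value,
                  "T must not be a pointer type");
    static_assert(!std::is_reference<T>::value,
                  "T must not be a reference type");
    static constexpr bool ok = true;
};
```

Any instantiation such as `hybrid_type_check<double>` either compiles cleanly or fails with a descriptive message, so misuse is caught before any allocation code runs.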

Comment on lines +87 to +90
if(val_array)
free(val_array);
if(batch_array && (val_array || this->dim < 0))
free(batch_array);
Collaborator:

This is problematic if a previous memory allocation fails and someone tries to use one of the init_* methods again. The common practice is to always follow a free(ptr) with ptr = nullptr. Moreover, since downstream users don't have access to those internal pointers, we should also memset the pointed-to memory to 0 before freeing it.
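The suggested pattern, sketched in isolation (function name is hypothetical):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Release an owned buffer safely: scrub contents, free, then null the
// pointer so a later cleanup or re-init cannot double-free it.
static void safe_release(double** ptr, size_t n)
{
    if(*ptr)
    {
        std::memset(*ptr, 0, n * sizeof(double)); // wipe before release
        free(*ptr);
        *ptr = nullptr; // makes repeated calls a harmless no-op
    }
}
```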

// allocate space on host for data from device
size_t dim_bytes = sizeof(T) * dim;
size_t val_bytes = sizeof(T) * dim * batch_count;
val_array = (T*)malloc(val_bytes);
Collaborator:

A few comments here: Surprisingly enough, the spec does not define the behaviour of malloc when val_bytes == 0. Moreover, we should check for possible bad allocs and pass those errors downstream; and also memset this memory to 0 before it is used.

On a different topic: we probably also want those memory allocations to be aligned for performance. If you don't want to go with an in-house method to return aligned memory, we can go with posix_memalign (on Linux) and _aligned_malloc (on Windows).
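A hedged sketch of checked, zeroed, aligned allocation on Linux combining the points above (function name is hypothetical; Windows would use _aligned_malloc/_aligned_free instead):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Allocate 'bytes' of zero-initialized, 64-byte-aligned memory,
// reporting failure instead of returning an unusable pointer.
// A zero-byte request is bumped to one byte, since the behaviour of
// a size-zero allocation is implementation-defined.
static void* checked_aligned_alloc(size_t bytes)
{
    if(bytes == 0)
        bytes = 1;
    void* ptr = nullptr;
    if(posix_memalign(&ptr, 64, bytes) != 0)
        return nullptr; // caller should translate this into an error status
    std::memset(ptr, 0, bytes); // hand back deterministic contents
    return ptr;
}
```

The caller checks the result and propagates an allocation-failure status downstream rather than dereferencing a null pointer.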


/* Used to read device pointers from a batched array for use on the host; no other data is read from the
device. */
rocblas_status init_pointers_only(U array, rocblas_stride stride, I batch_count, hipStream_t stream)
Collaborator:

Minor comment (code design): the behaviour of this struct when initialized with init_pointers_only is so different from the behaviour yielded by init_async that it makes sense to break it apart into two different structs. Maybe something to be pondered for the future?

to allocate a host buffer for the device pointers, and then execute device kernels on the pointers
provided by the [] operator. */
template <typename T, typename I, typename U>
struct rocsolver_hybrid_array
Collaborator:

Minor comment (code design): I am not sure that this struct satisfies all of the semantics of an array (e.g., it is not stored on the stack), thus I would change its name to something more general like rocsolver_hybrid_storage.

Comment on lines +53 to +58
I dim, batch_count;
rocblas_stride stride;

U src_array;
T** batch_array;
T* val_array;
Collaborator:

Minor question (code design): what would be the use case in which we would want to access those members directly?

5 participants