Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: rocblas_gemm_ex returns rocblas_status_internal_error in specific order #1529

Open
et16kr opened this issue Dec 16, 2024 · 6 comments
Assignees

Comments

@et16kr
Copy link

et16kr commented Dec 16, 2024

Describe the bug

  • When executed in a certain order, rocblas_gemm returns rocblas_status_internal_error.
  • Each works fine individually or in a different order.
  • There were other solution_ids occurring under similar conditions besides the provided reproducible example.

To Reproduce

  • install rocm :
wget https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb
apt-get install -y ./amdgpu-install_6.2.60200-1_all.deb
amdgpu-install --usecase=rocm,rocmdev,rocmdevtools,rocmdevtools,opencl,openclsdk,hip,hiplibsdk,openmpsdk,mllib,mlsdk -y
  • rocblas version: 4.2.0.54f305c1-dirty (/opt/rocm/include/rocblas/internal/rocblas-version.h)
  • compile
hipcc -o example example.cpp -lrocblas
  • run
./example
  • reproducible code (example.cpp)
#include <iostream>
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

void run_gemm(rocblas_handle handle, ulong solution_id,
	      int B, int M, int N, int K, bool trans_a, bool trans_b,
              float alpha = 1.0f, float beta = 0.0f) {

  auto transA = trans_a ? rocblas_operation_transpose : rocblas_operation_none;
  auto transB = trans_b ? rocblas_operation_transpose : rocblas_operation_none;

  float *dA, *dB, *dC;
  hipMalloc(&dA, B * M * K * sizeof(float));
  hipMalloc(&dB, B * N * K * sizeof(float));
  hipMalloc(&dC, B * M * N * sizeof(float));

  auto dtype = rocblas_datatype_bf16_r;
  rocblas_gemm_algo algo = solution_id > 0 ? rocblas_gemm_algo_solution_index : rocblas_gemm_algo_standard;

  rocblas_int lda = trans_a ? K : M;
  rocblas_int ldb = trans_b ? N : K;
  rocblas_int ldc = M;

  rocblas_status status;

  if ( B > 1 ) {
    rocblas_stride strideA = K * M;
    rocblas_stride strideB = N * K;
    rocblas_stride strideC = M * N;
    status = rocblas_gemm_strided_batched_ex(handle,
					     transA,
					     transB,
					     M, N, K,
					     &alpha,
					     dA, dtype, lda, strideA,
					     dB, dtype, ldb, strideB,
					     &beta,
					     dC, dtype, ldc, strideC,
					     dC, dtype, ldc, strideC,
					     B,
					     rocblas_datatype_f32_r,
					     algo, solution_id, rocblas_gemm_flags_none);
  } else {
    status = rocblas_gemm_ex(handle,
			     transA,
			     transB,
			     M, N, K,
			     &alpha,
			     dA, dtype, lda,
			     dB, dtype, ldb,
			     &beta,
			     dC, dtype, ldc,
			     dC, dtype, ldc,
			     rocblas_datatype_f32_r,
			     algo, solution_id, rocblas_gemm_flags_none);
  }

  std::cout << "rocblas_gemm_ex solution_id: " << solution_id;
  if (status != rocblas_status_success) {
    std::cerr << " failed - " << rocblas_status_to_string(status) << std::endl;
  } else {
    std::cout << " succeeded" << std::endl;
  }

  hipFree(dA);
  hipFree(dB);
  hipFree(dC);
}

int main() {
  rocblas_handle handle;
  rocblas_create_handle(&handle);

  run_gemm(handle, 685, 4, 1024, 64, 1, false, false);
  run_gemm(handle, 621285612, 32, 128, 1024, 1024, false, false);
  run_gemm(handle, 621285596, 1, 36864, 4096, 4608, false, false);

  rocblas_destroy_handle(handle);

  return 0;
}
  • The success or failure depends on the order of gemm calls
  • success
$ ./example
rocblas_gemm_ex solution_id: 621285612 succeeded
rocblas_gemm_ex solution_id: 621285596 succeeded
rocblas_gemm_ex solution_id: 685 succeeded
  • failure
$ ./example
rocblas_gemm_ex solution_id: 685 succeeded
rocblas_gemm_ex solution_id: 621285612 failed - rocblas_status_internal_error
rocblas_gemm_ex solution_id: 621285596 failed - rocblas_status_internal_error

Expected behavior

  • It should succeed regardless of the gemm call order.
$ ./example
rocblas_gemm_ex solution_id: 685 succeeded
rocblas_gemm_ex solution_id: 621285612 succeeded
rocblas_gemm_ex solution_id: 621285596 succeeded

Log-files

There are no logs.

Environment

Hardware description
CPU AMD EPYC 9474F 48-Core Processor
GPU AMD Instinct MI300X
Software version
rocm-core 6.2.0.60200-66~22.04
rocblas 4.2.0.60200-66~22.04

Attach environment.txt
environment.txt

Additional context

  • The solution_id was chosen from a list obtained using functions like rocblas_gemm_ex_get_solutions(), rocblas_gemm_strided_batched_ex_get_solutions(), or rocblas_gemm_ex_get_solutions_by_type(), based on the shortest duration.
@ppanchad-amd
Copy link

Hi @et16kr. Internal ticket has been created to investigate your issue. Thanks!

@NaveenElumalaiAMD
Copy link
Contributor

Hi @et16kr, thanks for reporting the issue. Could you also provide the ROCBLAS_TENSILE_COMMIT_ID (below the ROCBLAS_VERSION_TWEAK) in /opt/rocm/include/rocblas/internal/rocblas-version.h?

@et16kr
Copy link
Author

et16kr commented Dec 20, 2024

Hi @et16kr, thanks for reporting the issue. Could you also provide the ROCBLAS_TENSILE_COMMIT_ID (below the ROCBLAS_VERSION_TWEAK) in /opt/rocm/include/rocblas/internal/rocblas-version.h?

  • tensile commit id
    • "54f305c18f0d509466557e106ac4b1d7e42c85a5" , "dbc2062dced66e4cbee8e0591d76e0a1588a4c70"
/opt/rocm/include/rocblas/internal$ cat rocblas-version.h | grep TENSILE
#define ROCBLAS_TENSILE_COMMIT_ID   "54f305c18f0d509466557e106ac4b1d7e42c85a5" , "dbc2062dced66e4cbee8e0591d76e0a1588a4c70"

@babakpst
Copy link
Contributor

@et16kr Thanks for the comment. The solution id changes for each build (even if you use the same commit to build the rocBLAS). Are you running all different examples on the same system using only ONE build?

If so, please re-run all cases with TENSILE_DB=0X28000 ./example, and share the log.

@et16kr
Copy link
Author

et16kr commented Jan 21, 2025

@babakpst Which build are you referring to?

We did not build ROCm, rocBLAS, Instead, we used https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb.
The installation steps are described in the main text.

The example and the program used to obtain the solution_id are separate programs but use the same rocBLAS library.
So far, it seems that as long as the ROCm and rocBLAS libraries are the same, we can access the same kernel, so we've been using it that way.
If the ID changes every time the example is built, even with the same ROCm library, it would be a significant issue for us, so please let us know.

The information you requested is as follows:

  • failure (order: 685, 621285612, 621285596)
$ TENSILE_DB=0X28000  ./example
Running kernel: Cijk_Ailk_Bljk_BBS_BH_MT64x32x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_PKA0_SIA1_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_2_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW1_VS1_VW2_VWB2_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8
rocblas_gemm_ex solution_id: 685 succeeded
Running kernel: Cijk_Ailk_Bljk_BBS_BH_MT128x128x64_MI16x16x16x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM8_AF1EM8_AMAS3_ASE_ASGT3256_ASLT_ASM_ASAE01_ASCE01_ASEM64_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA8_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_64_TLDS1_UMLDSA0_UMLDSB1_USFGROn1_VAW1_VSn1_VW4_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_8_1_WGMn16
rocblas_gemm_ex solution_id: 621285612 failed - rocblas_status_internal_error
Running kernel: Cijk_Ailk_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM8_AF1EM8_AMAS3_ASE_ASGT3256_ASLT_ASM_ASAE01_ASCE01_ASEM64_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA8_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVW8_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB8_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA2_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_256_TLDS1_UMLDSA0_UMLDSB1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP1_WSGRA1_WSGRB0_WS64_WG128_2_1_WGM32
rocblas_gemm_ex solution_id: 621285596 failed - rocblas_status_internal_error
  • success (order: 621285612, 621285596, 685)
$ TENSILE_DB=0X28000  ./example
Running kernel: Cijk_Ailk_Bljk_BBS_BH_MT128x128x64_MI16x16x16x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM8_AF1EM8_AMAS3_ASE_ASGT3256_ASLT_ASM_ASAE01_ASCE01_ASEM64_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA8_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_64_TLDS1_UMLDSA0_UMLDSB1_USFGROn1_VAW1_VSn1_VW4_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_8_1_WGMn16
rocblas_gemm_ex solution_id: 621285612 succeeded
Running kernel: Cijk_Ailk_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM8_AF1EM8_AMAS3_ASE_ASGT3256_ASLT_ASM_ASAE01_ASCE01_ASEM64_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA8_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVW8_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB8_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA2_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_256_TLDS1_UMLDSA0_UMLDSB1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP1_WSGRA1_WSGRB0_WS64_WG128_2_1_WGM32
rocblas_gemm_ex solution_id: 621285596 succeeded
Running kernel: Cijk_Ailk_Bljk_BBS_BH_MT64x32x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_PKA0_SIA1_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_2_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW1_VS1_VW2_VWB2_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8
rocblas_gemm_ex solution_id: 685 succeeded

@et16kr
Copy link
Author

et16kr commented Feb 4, 2025

I did some debugging. Please review the cause and the workaround.

findAllSolutions() is used to load the solution list in runContractionProblem().
However, due to tensile_prob being set with B:4, M:1024, N:64, K:1, so codeObjectFilename is not assigned for some solutions.

if(algo == rocblas_gemm_algo_solution_index && solution_index > 0)
{
solution = library->getSolutionByIndex(solution_index - 1);
// load solution if not already loaded
if(!solution)
{
library->findAllSolutions(tensile_prob, *hardware);
solution = library->getSolutionByIndex(solution_index - 1);
}
}

https://github.com/ROCm/Tensile/blob/f940143bde4c1dc3b1c0ec6a86186d72a01da00e/Tensile/Source/lib/include/Tensile/PlaceholderLibrary.hpp#L194-L210

        virtual SolutionSet<MySolution> findAllSolutions(MyProblem const& problem,
                                                         Hardware const&  hardware) const override
        {
            if(!library)
            {
                loadPlaceholderLibrary();
            }

            auto solutions = library->findAllSolutions(problem, hardware);

            for(auto& solution : solutions)
            {
                solution->codeObjectFilename = getCodeObjectFileName(hardware, *solution);
            }

            return solutions;
        }
  • solution_id 685 (B:4, M:1024, N:64, K:1)
    • No solution exists for solution_id 685.
    • findAllSolutions() is executed to load the solution list.
    • In findAllSolutions(), the solution is assigned an object filename based on B:4, M:1024, N:64, K:1.
    • gemm executes and succeeds.
  • solution_id 621285612 (K:32, M:128, N:1024, K:1024)
    • A solution exists for solution_id 621285612.
    • findAllSolutions() is not executed.
    • solution_id 621285612 does not work for B:4, M:1024, N:64, K:1, so when findAllSolutions() was executed for solution_id 685, no object filename was assigned.
    • Failing to read the object file causes gemm to fail.
  • I think the issue is that runContractionProblem() calls library->findAllSolutions(tensile_prob, *hardware); to load all libraries, but since it passes a fixed B, M, N, K in tensile_prob, only some solutions get a codeObjectFilename assigned.

Workaround Used

  • I resolved this issue by adding the following code to our implementation:
+  rocblas_int size = 0;
+  auto dtype = rocblas_datatype_bf16_r;
+  rocblas_gemm_ex_get_solutions_by_type(handle, dtype, dtype, rocblas_datatype_f32_r, rocblas_gemm_flags_none, nullptr, &size);

   run_gemm(handle, 685, 4, 1024, 64, 1, false, false);
   run_gemm(handle, 621285612, 32, 128, 1024, 1024, false, false);
   run_gemm(handle, 621285596, 1, 36864, 4096, 4608, false, false);
  • rocblas_gemm_ex_get_solutions_by_type() is a function that retrieves all solution IDs for a given dtype.
  • I expected that calling this API would preload the solution list and assign a codeObjectFilename to each solution.
  • In practice, it worked as expected.
$ ./example 
rocblas_gemm_ex solution_id: 685 succeeded
rocblas_gemm_ex solution_id: 621285612 succeeded
rocblas_gemm_ex solution_id: 621285596 succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants