
[Issue]: Zero TCC_HIT_sum all the time #150

Open

RookieT0T opened this issue Dec 20, 2024 · 19 comments

@RookieT0T commented Dec 20, 2024

Problem Description

While using rocprofv2 to collect performance counters like TCC_HIT_sum and TCC_MISS_sum on Vega 20, I found that the value of TCC_HIT_sum is always 0, while TCC_MISS_sum shows some non-zero values, which I assume means it is working. If you could investigate why the hit information is always 0 (including the hit counts from all 16 cache banks) and double-check whether the TCC miss values are correct, that would be much appreciated. Note that this problem exists regardless of whether I collect the performance counters on ROCm 6.2.2-116 or in the 6.3.0 docker image.

Example output.csv returned from the profiler:
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,FlatVMemInsts,TCC_EA_RDREQ_sum,TCC_EA_RDREQ_32B_sum,TCC_HIT_sum,TCC_MISS_sum,TCC_MISS[12],TCC_MISS[13],TCC_MISS[14],TCC_MISS[15],TCC_HIT[0],TCC_HIT[1],TCC_HIT[2],TCC_HIT[3],TCC_HIT[4],TCC_HIT[5],TCC_HIT[6],TCC_HIT[7],TCC_HIT[8],TCC_HIT[9],TCC_HIT[10],TCC_HIT[11],TCC_HIT[12],TCC_HIT[13],TCC_HIT[14],TCC_HIT[15],TA_FLAT_WRITE_WAVEFRONTS_sum,TA_FLAT_READ_WAVEFRONTS_sum,TCC_EA_RDREQ[0],TCC_EA_RDREQ[1],TCC_EA_RDREQ[2],TCC_EA_RDREQ[3],TCC_EA_RDREQ[4],TCC_EA_RDREQ[5],TCC_EA_RDREQ[6],TCC_EA_RDREQ[7],TCC_EA_RDREQ[8],TCC_EA_RDREQ[9],TCC_EA_RDREQ[10],TCC_EA_RDREQ[11],TCC_EA_RDREQ[12],TCC_EA_RDREQ[13],TCC_EA_RDREQ[14],TCC_EA_RDREQ[15],TCC_MISS[0],TCC_MISS[1],TCC_MISS[2],TCC_MISS[3],TCC_MISS[4],TCC_MISS[5],TCC_MISS[6],TCC_MISS[7],TCC_MISS[8],TCC_MISS[9],TCC_MISS[10],TCC_MISS[11]
0,"kernel(int*) [clone .kd]",1,0,1,14761,14761,1,1,0,0,40,0,48,64,0x0,0x79eecbe84540,60.0000000000,68.0000000000,0.0000000000,0.0000000000,102.0000000000,4,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0000000000,60.0000000000,0,0,0,0,0,0,6,60,1,0,0,0,0,0,0,0,4,0,4,6,0,10,4,61,5,0,7,4

Operating System

Ubuntu 24.04.1 LTS

CPU

AMD Ryzen 9 3900X 12-Core Processor

GPU

gfx906 (AMD Vega 7nm also referred to as AMD Vega 20)

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @RookieT0T. Internal ticket has been created to assist with your issue. Thanks!

@zichguan-amd

Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.

RookieT0T reopened this Dec 31, 2024
@RookieT0T commented Dec 31, 2024

> Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.

My workload consists of a series of flat_load_dwordx2 instructions inside asm volatile blocks in the kernel function. The addresses specified in those load instructions should incur some cache hits. Also, the "glc" flag is specified at the end of each load instruction to force the accesses to bypass the L1 cache (TCP) and go directly to the L2 cache (TCC).

Example of kernel function with only one load instruction:

__global__ void kernel(int* arr) {
    uint64_t a = 0;

    asm volatile(
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        "buffer_wbinvl1\n\t"
        "flat_load_dwordx2 %[out0], %[in1] glc\n\t"
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        "s_nop 0\n\t"
        : [out0]"=v"(a)
        : [in1]"v"((uint64_t *)&arr[0])
        : "memory");
}
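
For reference, this is roughly how I collect the counters (a sketch of the usual rocprofv2 input-file workflow; the file and binary names here are placeholders, and exact flags may vary between versions):

# counters.txt -- list of performance counters to collect
pmc: TCC_HIT_sum TCC_MISS_sum

$ hipcc -o workload workload.cpp
$ rocprofv2 -i counters.txt ./workload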

@RookieT0T

> Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.

I am wondering: if the "glc" flag is added, will instruction cache hits also be counted as part of the TCC hit sum reported by the profiler, in addition to the data cache hits incurred by the program?

@RookieT0T

Is there any progress?

RookieT0T reopened this Jan 8, 2025
@jrmadsen commented Jan 9, 2025

@RookieT0T please try using rocprofv3 from the new rocprofiler-sdk package instead of rocprofv2. rocprofv2 was always a beta, and the design of the underlying rocprofiler v2 library was problematic and poorly tested. We plan to continue doing bug fixes for rocprof for a while, since rocprof reached a production release, but we do not plan to do this for rocprofv2, since v2 was scrapped.

v2 was rushed out the door and it shows. We took our time with rocprofv3 and rocprofiler-sdk and learned from the previous issues. Despite their “beta” status, they are already far more reliable and far better tested than any of their predecessors.

@RookieT0T commented Jan 9, 2025

> @RookieT0T please try using rocprofv3 from the new rocprofiler-sdk package instead of rocprofv2. […]

Hi there, I just tried rocprofv3 with this rocm docker image. Unfortunately, the cache hit info is still zero.
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value"
1,1,1,1,1415542,1415542,1,"kernel(int*)",1,0,0,36,32,"SQ_WAVES",1
1,1,1,1,1415542,1415542,1,"kernel(int*)",1,0,0,36,32,"TCC_HIT_sum",0
1,1,1,1,1415542,1415542,1,"kernel(int*)",1,0,0,36,32,"TCC_MISS_sum",33

In terms of my workload, 17 cache line accesses are performed: c_0, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8, c_9, c_10, c_11, c_12, c_13, c_14, c_15, and then c_15 again. The last access to c_15 must be a cache hit, yet it is never reported by the profiler.
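
To make the pattern concrete, below is a minimal hypothetical sketch of the access pattern, not my actual code; SET_STRIDE is an assumed placeholder for whatever stride maps all the lines to the same cache set on this hardware:

__global__ void probe(int* arr)
{
    // Placeholder: stride (in ints) assumed to map successive lines to one set.
    constexpr int SET_STRIDE = 4096 / sizeof(int);
    volatile int* varr = arr;  // volatile keeps every load in the generated code
    int sink = 0;
    for(int i = 0; i < 16; ++i)
        sink += varr[i * SET_STRIDE];  // c_0 .. c_15: expected TCC misses
    sink += varr[15 * SET_STRIDE];     // c_15 again: the one expected TCC hit
    arr[0] = sink;                     // keep the result observable
}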

@RookieT0T commented Jan 10, 2025

> @RookieT0T please try using rocprofv3 from the new rocprofiler-sdk package instead of rocprofv2. […]

If you want to test my workload, this is one example. When I ran this example workload on older rocm docker images like 4.0 or 3.7, the hit data was shown (the input.csv in my linked repository is the output file returned by the profiler from ROCm 3.7 running the example workload; one cache hit is expected). Hope this helps.

@ApoKalipse-V (Contributor)

Hi @RookieT0T
I'm able to reproduce the issue with rocprof in your image. It is fixed in 6.3+.
Can you confirm upgrading to 6.3 fixes the issue?

@RookieT0T commented Jan 13, 2025

Hi, I am glad to hear that. To reiterate my problem: the cache hit count is never reported in rocm docker images 6.3.0 and older (except versions 4.0 and 3.7), nor with the rocm 6.2.2 that is natively installed on my machine. My machine currently has rocm 6.2.2-116 installed, and I have run my example program both on that native rocm 6.2.2-116 (no docker) and in several docker images, including rocm 6.3.1.

Therefore, based on your reply, I need to first upgrade my native installation to rocm 6.3.1 and then use the upgraded rocprofiler to get the cache hits. I am wondering whether the rocm version installed natively and the rocm version used on Linux (like the docker images I used previously) must be the same to get the cache hits. Finally, when you said 6.3+, did you mean the upcoming rocm 6.3.2, or any rocm version starting with 6.3.x, like 6.3.0, 6.3.1, ...?

@bwelton (Contributor) commented Jan 14, 2025

The graphics driver used should make little difference for this issue, since the underlying problem was a ROCm software issue. Does this issue still exist if you run your application in a docker container with 6.3.1 (or in this container: https://hub.docker.com/r/rocm/dev-ubuntu-22.04)?

@RookieT0T commented Jan 15, 2025

Hi, all. I just tried the docker image with 6.3.1. Unfortunately, the result from rocprofv3 shows that the cache hit count is still 0. Have you tried my example workload on your machines, and did it show hit data?

Output:
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
1,1,1,1,131,131,1,16,"kernel(int*)",1,0,0,36,32,"SQ_WAVES",1.000000,807736593474300,807736593509020
1,1,1,1,131,131,1,16,"kernel(int*)",1,0,0,36,32,"TCC_HIT_sum",0.00000000e+00,807736593474300,807736593509020
1,1,1,1,131,131,1,16,"kernel(int*)",1,0,0,36,32,"TCC_MISS_sum",36.000000,807736593474300,807736593509020

I am not sure if I missed something or set things up incorrectly. When I typed "which rocprofv3" in the docker image, "/opt/rocm-6.3.1/bin/rocprofv3" was shown. The container was started with:

docker run --rm -it --device /dev/kfd --device /dev/dri -v ./:/workdir --security-opt seccomp=unconfined --group-add 1 --group-add 2 rocm/dev-ubuntu-22.04:6.3.1-complete bash

@jrmadsen commented Jan 15, 2025

@RookieT0T I looked into your assembly (note: please format it in a code block in the future) and it is unclear why you are expecting the data to be in the L2 cache. Since there are no previous accesses to int* arr, why would it already be loaded into the L2 cache? You won't get an L2 cache hit for data read in from global memory.

@jrmadsen

It should get loaded into L2 as a result of the read from global memory, but if our HW counted that as an L2 cache hit, what would be the point of the counter?

@RookieT0T commented Jan 15, 2025

> @RookieT0T I looked into your assembly (note: please format it in a code block in the future) and it is unclear why you are expecting the data to be in the L2 cache. Since there are no previous accesses to int* arr, why would it already be loaded into the L2 cache? You won't get an L2 cache hit for data read in from global memory.

The addresses specified in each load instruction are carefully calculated so that the cache lines containing the data (array elements) map to the same cache set, enabling subsequent load instructions to access the same cache lines and produce cache hits. The expectation that the data will be in the L2 cache arises from the glc flag appended to each load instruction, which forces loads to bypass the TCP (L1 cache) and access the TCC (L2 cache) directly. Additionally, not every element of int* arr is accessed; the access pattern is non-sequential and strategically designed to produce cache hits.

In general, you can think of the example workload as accessing a set of cache lines that all map to a single cache set and then analyzing the cache hit/miss statistics. This is why I need cache hit data. I believe there may still be something wrong with the rocprofiler, or with my local setup, because developers in my project group acquired both cache hit and miss data roughly a year ago using rocprofiler. Then our group upgraded to Ubuntu 24, and cache hit data was never reported again. Sorry, this is as much as I can tell.

@jrmadsen

@RookieT0T If I use this code:

#ifdef NDEBUG
#    undef NDEBUG
#endif

#include <hip/hip_runtime.h>

#include <array>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <vector>

#define HIP_API_CALL(CALL)                                                                         \
    {                                                                                              \
        hipError_t error_ = (CALL);                                                                \
        if(error_ != hipSuccess)                                                                   \
        {                                                                                          \
            fprintf(stderr,                                                                        \
                    "%s:%d :: HIP error : %s\n",                                                   \
                    __FILE__,                                                                      \
                    __LINE__,                                                                      \
                    hipGetErrorString(error_));                                                    \
            throw std::runtime_error("hip_api_call");                                              \
        }                                                                                          \
    }

__global__ void
kernel(uint64_t* arr)
{
    uint64_t a = 0;

    __asm volatile(R"(
        s_waitcnt vmcnt(0) & lgkmcnt(0)
        buffer_wbinvl1
        flat_load_dwordx2 %[out0], %[in1] glc
        s_waitcnt vmcnt(0) & lgkmcnt(0)
        s_nop 0
        )"
                   : [out0] "=v"(a)
                   : [in1] "v"((uint64_t*) &arr[0])
                   : "memory");

    // assert(a == 1);
}

int
main(int /*argc*/, char** /*argv*/)
{
    constexpr size_t length    = 4;
    constexpr size_t num_bytes = length * sizeof(uint64_t);

    uint64_t*  data = nullptr;
    auto out  = std::array<uint64_t, length>{};
    out.fill(1);

    HIP_API_CALL(hipMalloc(&data, num_bytes));
    HIP_API_CALL(hipMemset(data, 0, num_bytes));
    HIP_API_CALL(hipMemcpy(data, out.data(), num_bytes, hipMemcpyHostToDevice));

    kernel<<<1, 1>>>(data);
    HIP_API_CALL(hipDeviceSynchronize());
}

using rocprofv3 on my Vega20:

$ rocprofv3 --runtime-trace --pmc FlatVMemInsts TCC_EA_RDREQ_sum TCC_EA_RDREQ_32B_sum TCC_HIT_sum TCC_MISS_sum --kernel-include-regex "kernel.*" -d tcc-hit -o out -- ./bin/tcc-hit-assembly

I get TCC_HIT_sum = 0.0:

"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"FlatVMemInsts",1.000000,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_EA_RDREQ_32B_sum",0.00000000e+00,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_EA_RDREQ_sum",3.000000,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_HIT_sum",0.00000000e+00,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_MISS_sum",9.000000,239069576712464,239069576717424

However, if I change uint64_t a to be volatile:

    volatile uint64_t a = 0;

I get TCC_HIT_sum = 2.0:

"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"FlatVMemInsts",3.000000,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_32B_sum",0.00000000e+00,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_sum",12.000000,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_HIT_sum",2.000000,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_MISS_sum",14.000000,239497802778690,239497802785090

Furthermore, if I uncomment the assert(a == 1) and run rocprofv3 again, I get TCC_HIT_sum = 4.0:

"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"FlatVMemInsts",4.000000,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_32B_sum",0.00000000e+00,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_sum",10.000000,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_HIT_sum",4.000000,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_MISS_sum",13.000000,239625327580938,239625327585898

@ApoKalipse-V (Contributor)

The previous TCC_HIT you were seeing was likely instruction fetch or similar. This version should only hit on the second load:

__global__ void kernel(int* arr)
{
    uint64_t a = 0;

    asm volatile(
        "flat_load_dwordx2 %[out0], %[in1] glc\n"
        "s_waitcnt vmcnt(0)\n"
        : [out0]"=v"(a) : [in1]"v"((uint64_t *)&arr[0]) : "memory"
    );
    asm volatile(
        "flat_load_dwordx2 %[out0], %[in1] glc\n"
        "s_waitcnt vmcnt(0)\n"
        : [out0]"=v"(a) : [in1]"v"((uint64_t *)&arr[0]) : "memory"
    );
}
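
If it helps, the same rocprofv3 invocation used earlier in this thread should confirm the behavior (the binary name is a placeholder):

$ rocprofv3 --pmc TCC_HIT_sum TCC_MISS_sum --kernel-include-regex "kernel.*" -d tcc-hit -o out -- ./two-loads

With the two back-to-back glc loads above, the first access should show up in TCC_MISS and the second in TCC_HIT.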

@RookieT0T

> @RookieT0T If I use this code: […]

Thanks, I will take a look right now.

@RookieT0T

> The previous TCC_HIT you were seeing was likely instruction fetch or similar. This version should only hit on the second load: […]

Do you mean that the TCC_HIT values acquired by my project group members a year ago were likely instruction fetches (instruction cache hits)?
