[Issue]: Zero TCC_HIT_sum all the time #150
Comments
Hi @RookieT0T. An internal ticket has been created to assist with your issue. Thanks!
Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have a 0 L2 hit rate if your workload doesn't reuse any cached data.
My workload has a number of flat_load_dwordx2 instructions inside asm volatile blocks in the kernel function. The addresses specified in those load instructions should incur some cache hits. Also, the "glc" flag is appended to each load instruction to force the accesses to bypass the L1 cache (the TCP) and go directly to the L2 cache (the TCC). Example of a kernel function with only one load instruction:

__global__ void kernel(int * arr) { asm volatile(
}
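A minimal sketch of what such a single-load kernel could look like (the constraint strings, the 64-bit element type, and the surrounding boilerplate here are my assumptions, not the actual workload):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

__global__ void kernel(uint64_t* arr)
{
    uint64_t a = 0;
    // glc makes the load return data from the L2 (TCC) instead of the vector L1 (TCP)
    asm volatile(
        "flat_load_dwordx2 %[out0], %[in0] glc\n\t"
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        : [out0] "=v"(a)
        : [in0] "v"(&arr[0])
        : "memory");
}
```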
I am wondering: if the "glc" flag is added, will instruction cache hits also be counted in the TCC hits sum reported by the profiler, in addition to the data cache hits incurred by the program?
Is there any progress?
@RookieT0T please try using rocprofv3 from the new rocprofiler-sdk package instead of rocprofv2. rocprofv2 was always a beta, and the design of the underlying rocprofiler v2 library was problematic and poorly tested. We plan to continue doing bug fixes for rocprof for a while since rocprof reached production release, but we do not plan to do this for rocprofv2 since v2 was scrapped. v2 was rushed out the door and it shows. We took our time with rocprofv3 and rocprofiler-sdk and learned from the previous issues. Despite their "beta" status, they are already far more reliable and far better tested than any of their predecessors.
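For example, a basic counter collection with rocprofv3 looks like this (same --pmc/-d/-o syntax as the invocation shown later in this thread; the application path is a placeholder):

```
rocprofv3 --pmc TCC_HIT_sum TCC_MISS_sum -d out-dir -o out -- ./my_app
```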
Hi there, I just tried rocprofv3 with this rocm docker image. Unfortunately, the cache hit info is still zero. In terms of my workload, 17 cache line accesses are made: c_0, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8, c_9, c_10, c_11, c_12, c_13, c_14, c_15, c_15. The last access to c_15 must be a cache hit, but it is never reported by the profiler.
If you want to test my workload, this is one example. When I ran this example workload on older rocm docker images like 4.0 or 3.7, the hit data was shown (the input.csv in my linked repository is the output file returned by the 3.7 profiler running the example workload; one cache hit is expected). Hope this helps.
Hi @RookieT0T
Hi, I am glad to hear that. To reiterate my problem: the cache hit is never reported in docker images of rocm 6.3.0 and older versions (except 4.0 and 3.7), nor with the rocm 6.2.2 that is natively installed for my graphics card. The AMD graphics card that I am currently using has rocm 6.2.2-116 installed, and I tried to run my example program both on rocm 6.2.2-116 (no dockers used) and in several docker images, including rocm 6.3.1. Therefore, based on your reply, I need to first upgrade my graphics card's rocm to 6.3.1 and then use the upgraded rocprofiler to get the cache hits. I am wondering whether the version of rocm installed for the graphics card and the version of rocm we are using on Linux (like the docker images I used previously) must be the same to get the cache hits. The final question: when you said 6.3+ in your reply, did you mean the upcoming rocm 6.3.2 or any rocm version starting with 6.3.x, like 6.3.0, 6.3.1, ...?
The graphics driver used should make little difference for this issue, since the underlying problem was a ROCm software issue. Does this issue still exist if you run your application in a docker container with 6.3.1 (or in this container https://hub.docker.com/r/rocm/dev-ubuntu-22.04)?
Hi, all. I just tried the docker image with 6.3.1. Unfortunately, the result of using rocprofv3 showed that the cache hit count was still 0. Have you tried my example workload on your machines, and did it show the hit data? Output:

I am not sure if I missed something or set things up incorrectly. When I typed "which rocprofv3" inside the docker image, "/opt/rocm-6.3.1/bin/rocprofv3" was shown, and the following command was used to start the container:

docker run --rm -it --device /dev/kfd --device /dev/dri -v ./:/workdir --security-opt seccomp=unconfined --group-add 1 --group-add 2 rocm/dev-ubuntu-22.04:6.3.1-complete bash
@RookieT0T I looked into your assembly (note: please format it in a code block in the future) and it is unclear why you are expecting data to be in the L2 cache; since there are not any previous accesses to that memory, there is nothing for the first load to hit on.
It should get loaded into L2 as a result of the read from global memory, but if our HW counted that as an L2 cache hit, what would be the point of that counter?
The addresses specified in each load instruction are carefully calculated to ensure that the cache lines containing the data (array elements) map to the same cache set, enabling subsequent load instructions with calculated addresses to access the same cache lines, resulting in cache hits. The expectation that the data will be in the L2 cache comes from the glc flag appended to each load instruction, which forces loads to bypass the TCP (L1 cache) and access the TCC (the L2 cache). Additionally, not every element in the array int* arr is accessed, as the access pattern is non-sequential and strategically designed to maximize cache performance. In general, you can think of the example workload as accessing a set of cache lines that all map to a single cache set and then analyzing the cache hit/miss statistics; this is why I need the cache hit data. I believe there may still be something wrong with the rocprofiler or with how I set things up locally, because developers in my project group acquired both cache hit and miss data roughly one year ago using rocprofiler. Then our group upgraded to Ubuntu 24, and after that the cache hit data was never reported. Sorry, this is as much as I can tell.
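As an illustration of that same-set address calculation, here is a minimal host-side sketch, assuming a 64-byte cache line and a hypothetical number of sets (the real gfx906 TCC geometry is not taken from this issue); it just prints the set index each of the 17 accesses would map to:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Assumed geometry, for illustration only; not the actual gfx906 TCC layout.
    constexpr uint64_t line_size = 64;                    // bytes per cache line (assumption)
    constexpr uint64_t num_sets  = 256;                   // sets per cache (hypothetical)
    constexpr uint64_t stride    = line_size * num_sets;  // stride that keeps the set index fixed

    const uint64_t base = 0x7f0000000000ull;              // example base address
    for (int i = 0; i < 17; ++i)
    {
        // 16 distinct lines c_0..c_15, then a repeated access to c_15:
        // only that 17th access should be able to hit in the L2.
        const uint64_t addr = base + (uint64_t) (i < 16 ? i : 15) * stride;
        const uint64_t set  = (addr / line_size) % num_sets;
        std::printf("access %2d -> line %llu, set %llu\n", i,
                    (unsigned long long) (addr / line_size),
                    (unsigned long long) set);
    }
    return 0;
}
```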
@RookieT0T If I use this code:

#ifdef NDEBUG
# undef NDEBUG
#endif
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <vector>
#define HIP_API_CALL(CALL) \
{ \
hipError_t error_ = (CALL); \
if(error_ != hipSuccess) \
{ \
fprintf(stderr, \
"%s:%d :: HIP error : %s\n", \
__FILE__, \
__LINE__, \
hipGetErrorString(error_)); \
throw std::runtime_error("hip_api_call"); \
} \
}
__global__ void
kernel(uint64_t* arr)
{
uint64_t a = 0;
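// Wait for outstanding memory ops, invalidate the vector L1, then issue a
// single glc load so it is serviced by the L2 (TCC) rather than the L1 (TCP).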
__asm volatile(R"(
s_waitcnt vmcnt(0) & lgkmcnt(0)
buffer_wbinvl1
flat_load_dwordx2 %[out0], %[in1] glc
s_waitcnt vmcnt(0) & lgkmcnt(0)
s_nop 0
)"
: [out0] "=v"(a)
: [in1] "v"((uint64_t*) &arr[0])
: "memory");
// assert(a == 1);
}
int
main(int /*argc*/, char** /*argv*/)
{
constexpr size_t length = 4;
constexpr size_t num_bytes = length * sizeof(uint64_t);
uint64_t* data = nullptr;
auto out = std::array<uint64_t, length>{};
out.fill(1);
HIP_API_CALL(hipMalloc(&data, num_bytes));
HIP_API_CALL(hipMemset(data, 0, num_bytes));
HIP_API_CALL(hipMemcpy(data, out.data(), num_bytes, hipMemcpyHostToDevice));
kernel<<<1, 1>>>(data);
HIP_API_CALL(hipDeviceSynchronize());
}

using rocprofv3 on my Vega20:

$ rocprofv3 --runtime-trace --pmc FlatVMemInsts TCC_EA_RDREQ_sum TCC_EA_RDREQ_32B_sum TCC_HIT_sum TCC_MISS_sum --kernel-include-regex "kernel.*" -d tcc-hit -o out -- ./bin/tcc-hit-assembly

I get TCC_HIT_sum = 0.0:
However, if I change the declaration to volatile uint64_t a = 0; I get TCC_HIT_sum = 2.0:
Furthermore, if I uncomment the assert(a == 1); line
The previous TCC_HIT you were seeing is likely instruction fetch or similar. It should only hit on the second load:

__global__ void kernel(int * arr)
}
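A sketch of what that two-load kernel could look like, assuming it mirrors the single-load version posted earlier in this thread (the snippet above is truncated, so the operand choices here are mine): both loads target the same cache line, so only the second one should be counted as a TCC hit.

```cpp
__global__ void kernel(uint64_t* arr)
{
    uint64_t a = 0;
    uint64_t b = 0;
    asm volatile(R"(
        s_waitcnt vmcnt(0) & lgkmcnt(0)
        buffer_wbinvl1
        flat_load_dwordx2 %[out0], %[in0] glc
        s_waitcnt vmcnt(0) & lgkmcnt(0)
        flat_load_dwordx2 %[out1], %[in0] glc
        s_waitcnt vmcnt(0) & lgkmcnt(0)
    )"
                 // "=&v" (early clobber) keeps the outputs from reusing the address registers
                 : [out0] "=&v"(a), [out1] "=&v"(b)
                 : [in0] "v"((uint64_t*) &arr[0])
                 : "memory");
}
```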
Thanks, I will take a look right now. |
Do you mean the TCC_HIT values acquired by my project group members one year ago were likely instruction fetches (instruction hits)?
Problem Description
While using rocprofv2 to collect performance counters like TCC_HIT_sum and TCC_MISS_sum on Vega 20, I found that the value of TCC_HIT_sum is always 0 while TCC_MISS_sum shows some non-zero values, which I assume means it works. If you can investigate why the hit information is always 0 (including the hit information from all 16 cache banks) and double check whether the TCC miss values are correct, that would be much appreciated. BTW, this problem exists regardless of whether I collect the performance counters with ROCm 6.2.2-116 or in the 6.3.0 docker image.
Example output.csv returned from the profiler:
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,FlatVMemInsts,TCC_EA_RDREQ_sum,TCC_EA_RDREQ_32B_sum,TCC_HIT_sum,TCC_MISS_sum,TCC_MISS[12],TCC_MISS[13],TCC_MISS[14],TCC_MISS[15],TCC_HIT[0],TCC_HIT[1],TCC_HIT[2],TCC_HIT[3],TCC_HIT[4],TCC_HIT[5],TCC_HIT[6],TCC_HIT[7],TCC_HIT[8],TCC_HIT[9],TCC_HIT[10],TCC_HIT[11],TCC_HIT[12],TCC_HIT[13],TCC_HIT[14],TCC_HIT[15],TA_FLAT_WRITE_WAVEFRONTS_sum,TA_FLAT_READ_WAVEFRONTS_sum,TCC_EA_RDREQ[0],TCC_EA_RDREQ[1],TCC_EA_RDREQ[2],TCC_EA_RDREQ[3],TCC_EA_RDREQ[4],TCC_EA_RDREQ[5],TCC_EA_RDREQ[6],TCC_EA_RDREQ[7],TCC_EA_RDREQ[8],TCC_EA_RDREQ[9],TCC_EA_RDREQ[10],TCC_EA_RDREQ[11],TCC_EA_RDREQ[12],TCC_EA_RDREQ[13],TCC_EA_RDREQ[14],TCC_EA_RDREQ[15],TCC_MISS[0],TCC_MISS[1],TCC_MISS[2],TCC_MISS[3],TCC_MISS[4],TCC_MISS[5],TCC_MISS[6],TCC_MISS[7],TCC_MISS[8],TCC_MISS[9],TCC_MISS[10],TCC_MISS[11]
0,"kernel(int*) [clone .kd]",1,0,1,14761,14761,1,1,0,0,40,0,48,64,0x0,0x79eecbe84540,60.0000000000,68.0000000000,0.0000000000,0.0000000000,102.0000000000,4,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0000000000,60.0000000000,0,0,0,0,0,0,6,60,1,0,0,0,0,0,0,0,4,0,4,6,0,10,4,61,5,0,7,4
Operating System
Ubuntu 24.04.1 LTS
CPU
AMD Ryzen 9 3900X 12-Core Processor
GPU
gfx906 (AMD Vega 7nm also referred to as AMD Vega 20)
ROCm Version
ROCm 6.3.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response