
Profiling error with kineto on an ML workload #1030

Open
SKPsanjeevi opened this issue Jan 22, 2025 · 0 comments

I am profiling an ML workload using the PyTorch profiler. The code looks like this:

    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        main_args = parse_main_args()
        main(main_args, DETECTED_SYSTEM)
    prof.export_chrome_trace("torch_trace.json")
    # print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
    # print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))

The code runs fine without the profiler, and it also runs to completion with the profiler enabled. However, when execution reaches the export statement, I get the following error:

[mlperf-inference-skps-x86-64-29200:6413 :0:6413] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55ea1b76a8cc)
==== backtrace (tid:   6413) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000006743c49 libkineto::CuptiCallbackApi::__callback_switchboard()  ???:0
 2 0x00000000067441ba libkineto::callback_switchboard()  CuptiCallbackApi.cpp:0
 3 0x0000000000117456 cuptiEnableAllDomains()  ???:0
 4 0x000000000010f5c4 cuptiGetRecommendedBufferSize()  ???:0
 5 0x000000000010d3a8 cuptiGetRecommendedBufferSize()  ???:0
 6 0x00000000001b295d cudbgApiInit()  ???:0
 7 0x00000000001b393b cudbgApiInit()  ???:0
 8 0x00000000001ae05c cudbgApiInit()  ???:0
 9 0x00000000002d2188 cuStreamWaitEvent()  ???:0
10 0x0000000000027ee8 __cudaRegisterUnifiedTable()  ???:0
11 0x000000000002856d __cudaRegisterUnifiedTable()  ???:0
12 0x0000000000045495 secure_getenv()  ???:0
13 0x0000000000045610 exit()  ???:0
14 0x0000000000029d97 __libc_init_first()  ???:0
15 0x0000000000029e40 __libc_start_main()  ???:0
16 0x000000000024ec65 _start()  ???:0
=================================
/bin/bash: line 1:  6413 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/work/build/inference/loadgen/build python3.10 -m code.main --benchmarks=dlrm-v2 --scenarios=offline --action="run_harness" 2>&1
      6414 Done                    | tee /work/build/logs/2025.01.21-19.47.54/stdout.txt
make: *** [Makefile:46: run_harness] Error 139

How can I resolve this error? The machine is a DGX H200 with 8 GPUs, running Ubuntu 22.04.4 LTS (Jammy Jellyfish).
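
In case it helps narrow things down: the backtrace suggests the CUPTI callback fires during process teardown. One variant I have not yet verified (a sketch based on the public torch.profiler API; export_handler is just a name I picked) would be to write the trace from an on_trace_ready callback, so the export happens while the profiler session is still active rather than after it:

    from torch.profiler import profile, ProfilerActivity

    def export_handler(prof):
        # Invoked by the profiler once the trace is ready (here: when the
        # context manager exits), before interpreter/CUDA teardown begins.
        prof.export_chrome_trace("torch_trace.json")

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True,
                 on_trace_ready=export_handler):
        main_args = parse_main_args()
        main(main_args, DETECTED_SYSTEM)

I don't know whether that changes the CUPTI shutdown ordering in practice, but I can test it if it sounds plausible.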
