Working towards returning KernelArgumentHolder/PolymorphicVals from fusion execution. #3890

Merged
96 commits merged on Feb 20, 2025
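The diff below converts benchmark and test call sites from std::vector<c10::IValue> to KernelArgumentHolder. As a minimal before/after sketch of the calling pattern, assuming an existing FusionExecutorCache named executor_cache and two input tensors t0 and t1 (illustrative names, not taken from any one file):

```cpp
// Before: inputs gathered into a vector of c10::IValue.
std::vector<c10::IValue> aten_inputs = {t0, t1};

// After: inputs gathered directly into a KernelArgumentHolder, which can be
// built from an initializer list of tensors and scalars.
KernelArgumentHolder args = {t0, t1};
auto outputs = executor_cache->runFusionWithInputs(args);
```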
Changes from all commits

Commits (96)
04989fb
Move IValueToPolymorphicValue to polymorphic_value.cpp and remove Ker…
csarofeen Feb 11, 2025
17bc49d
Merge branch 'main' into kernel_arg_holder_constructor
csarofeen Feb 11, 2025
d0d35e9
Change arg to const ref, expose KAH ctor with NVF_API
jacobhinkle Feb 12, 2025
63abd45
Remove PolymorphicValue pointers.
csarofeen Feb 12, 2025
6e7d7a4
Last build issue, now segfaulting.
csarofeen Feb 12, 2025
b259e62
Continue working on removing PolymorphicValue pointers. Working on fu…
csarofeen Feb 13, 2025
330eec1
Merge branch 'main' into kernel_arg_holder_constructor
csarofeen Feb 13, 2025
8952a2d
Making progress.
csarofeen Feb 13, 2025
2542301
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 13, 2025
dd04cc1
Merge remote-tracking branch 'origin/kernel_arg_holder_constructor' i…
csarofeen Feb 13, 2025
63110e2
Fixes.
csarofeen Feb 13, 2025
b9729bc
Cleanup.
csarofeen Feb 13, 2025
a91d2be
Minor cleanup.
csarofeen Feb 13, 2025
c352c58
Merge branch 'kernel_arg_holder_constructor' of https://github.com/NV…
csarofeen Feb 13, 2025
f5861fc
Working on moving runFusionWithInputs to take in KernelArgumentHolder…
csarofeen Feb 14, 2025
011435b
Update csrc/serde/polymorphic_value.cpp
csarofeen Feb 14, 2025
565250c
Fix for checking args in erase.
csarofeen Feb 14, 2025
ade175b
Merge conflicts.
csarofeen Feb 14, 2025
580137e
Fix for empty arguments to segmenter.
csarofeen Feb 14, 2025
3a9d7bd
Debugging recursive calls into KernelArgumentHolder constructor.
csarofeen Feb 14, 2025
87e2522
Prevent recursive calls into KernelArgHolder constructor.
csarofeen Feb 14, 2025
57a7014
Fix tests, remove support for ArrayType temporarily as shown in AllIn…
csarofeen Feb 15, 2025
d229cb1
Fix multi-device segmentation.
csarofeen Feb 15, 2025
ba20902
Merge branch 'main' into polymorphic_kernel_arg
csarofeen Feb 15, 2025
216b2e6
Take optional for device argument in setDeviceIndex.
csarofeen Feb 15, 2025
088e2bc
Start moving cache system over to KernelArgumentHolder.
csarofeen Feb 15, 2025
285f8bf
Add optional support to KernelArgumentHolder.
csarofeen Feb 15, 2025
13b5791
Start moving tests off the deprecated fusion executor cache function.
csarofeen Feb 15, 2025
c70df80
Standardize as much as possible on initializer list for inputs.
csarofeen Feb 15, 2025
42b06a3
Standardize as much as possible on initializer list for inputs of tests.
csarofeen Feb 15, 2025
40a8871
In tests auto output type when possible, remove c10::IValue for input…
csarofeen Feb 16, 2025
c1364b6
Revert "Standardize as much as possible on initializer list for input…
csarofeen Feb 16, 2025
95bb1aa
Merge test cleanup PR.
csarofeen Feb 16, 2025
436a13b
Restore deprecated functions.
csarofeen Feb 16, 2025
a75b946
Standardize as much as possible on initializer list for inputs.
csarofeen Feb 15, 2025
2ad67de
Merge branch 'polymorphic_outs_step_1' of https://github.com/NVIDIA/F…
csarofeen Feb 16, 2025
c8acaf1
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 16, 2025
d97e923
Fix resharding test.
csarofeen Feb 16, 2025
1544505
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 16, 2025
b8c3ecc
C++ test fixes.
csarofeen Feb 16, 2025
9aa87a2
clang tidy
csarofeen Feb 16, 2025
27be8b0
Merge branch 'c10_tests' into polymorphic_outs_step_1
csarofeen Feb 16, 2025
95c4ccf
Merge branch 'polymorphic_kernel_arg' into polymorphic_outs_step_1
csarofeen Feb 16, 2025
e708ac2
Clang tidy.
csarofeen Feb 16, 2025
c379b79
Fix bad merge resolution.
csarofeen Feb 16, 2025
ca32ca5
Remove _deprecated uses with initializer list.
csarofeen Feb 16, 2025
50489c5
Test fix with new host ir exec.
csarofeen Feb 16, 2025
85ee62e
Merge branch 'main' of https://github.com/NVIDIA/Fuser into c10_tests
csarofeen Feb 16, 2025
6522214
Merge branch 'polymorphic_kernel_arg' into polymorphic_outs_step_1
csarofeen Feb 16, 2025
ec6a061
Merge branch 'c10_tests' into polymorphic_outs_step_1
csarofeen Feb 16, 2025
e5295d2
Test fix.
csarofeen Feb 17, 2025
2ce0a03
Lint
csarofeen Feb 17, 2025
f565664
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 17, 2025
9436eff
Clang.
csarofeen Feb 17, 2025
6a0ffd5
Add run_tests file
csarofeen Feb 17, 2025
687ea0c
Add run_tests file
csarofeen Feb 17, 2025
27b6878
Add dry run to test to check it's correct.
csarofeen Feb 17, 2025
65fc908
Fix parallel execution of tests, run long running tests first.
csarofeen Feb 17, 2025
48c1dd2
Add summary collection.
csarofeen Feb 17, 2025
f274e50
Suppress multidevice output to console.
csarofeen Feb 17, 2025
841f0eb
Remove false positive from test checking.
csarofeen Feb 17, 2025
1f82775
Fix multidevice tests, remove exception handling as it hides stack er…
csarofeen Feb 17, 2025
992e543
Adjust timeouts.
csarofeen Feb 17, 2025
1ec9ba4
Test fix.
csarofeen Feb 17, 2025
31b7ec3
Generalize run tests to any number of gpus.
csarofeen Feb 17, 2025
6b28091
Add symlink to latest results.
csarofeen Feb 17, 2025
5843282
Add license.
csarofeen Feb 17, 2025
42a46c1
Test matmul fix.
csarofeen Feb 17, 2025
03b86be
Merge branch 'test_infra' into polymorphic_outs_step_1
csarofeen Feb 17, 2025
dc3a64b
Test fix.
csarofeen Feb 17, 2025
7d41049
Merge branch 'test_infra' into polymorphic_outs_step_1
csarofeen Feb 17, 2025
768ee11
Fix running duplicate long tests. Increase long running timeout becau…
csarofeen Feb 17, 2025
1918dc5
Move benchmarks to using KernelArgHolder.
csarofeen Feb 17, 2025
3b48594
Fix profiling with new fusion executor cache function.
csarofeen Feb 17, 2025
5cf4666
Merge branch 'polymorphic_outs_step_1' of https://github.com/NVIDIA/F…
csarofeen Feb 17, 2025
311f097
Benchmark fix.
csarofeen Feb 17, 2025
cfaf765
Update tests to use non-deprecated run function.
csarofeen Feb 17, 2025
217c0fc
Convert python frontend.
csarofeen Feb 17, 2025
20d0527
Fix multidevice test.
csarofeen Feb 17, 2025
665b581
Remove FusionExecutorCache::runFusionWithInputs_deprecated.
csarofeen Feb 17, 2025
9cc5c62
Update test file.
csarofeen Feb 18, 2025
0d24ecc
lint
csarofeen Feb 18, 2025
6bc14b0
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 18, 2025
adfe996
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 18, 2025
89c2174
Cleanup.
csarofeen Feb 19, 2025
2df2a77
Clang, fix examples.
csarofeen Feb 19, 2025
3faca48
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen Feb 19, 2025
309accf
Remove test file as it's in the test_infra branch/PR.
csarofeen Feb 19, 2025
4bd6838
PR comments.
csarofeen Feb 19, 2025
4fd950e
Try removing optional tensor support in KernelArgumentHolder.
csarofeen Feb 19, 2025
0fb2bba
Get rid of some overloads for KernelArgumentHolder::push
csarofeen Feb 19, 2025
90dec9c
Pass by value.
csarofeen Feb 19, 2025
226dc53
Pass by value.
csarofeen Feb 19, 2025
7739452
Merge branch 'main' into polymorphic_outs_step_1
csarofeen Feb 19, 2025
7ba2654
Test fix.
csarofeen Feb 20, 2025
f527bf0
Merge branch 'main' into polymorphic_outs_step_1
csarofeen Feb 20, 2025
6 changes: 3 additions & 3 deletions benchmarks/cpp/batch_norm_channels_first.cpp
@@ -99,10 +99,10 @@ static void NvFuserScheduler_BatchNorm(
at::Tensor at_bias = at::zeros({input_shape[1]}, options);
at::Tensor at_run_mean = at::zeros({input_shape[1]}, fp32_options);
at::Tensor at_run_var = at::ones({input_shape[1]}, fp32_options);
std::vector<c10::IValue> aten_inputs(
{at_x, at_weight, at_bias, at_run_mean, at_run_var});
KernelArgumentHolder args = {
at_x, at_weight, at_bias, at_run_mean, at_run_var};

runBenchmarkIterations(benchmark_state, executor_cache, aten_inputs);
runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
int64_t(benchmark_state.iterations()) *
6 changes: 3 additions & 3 deletions benchmarks/cpp/batch_norm_channels_first_backward.cpp
@@ -112,10 +112,10 @@ static void NvFuserScheduler_BatchNorm_BWD(
at::Tensor save_mean = at::zeros({input_shape[1]}, fp32_options);
at::Tensor save_var = at::ones({input_shape[1]}, fp32_options);

std::vector<c10::IValue> aten_inputs(
{input, grad_out, weight, run_mean, run_var, save_mean, save_var});
KernelArgumentHolder args = {
input, grad_out, weight, run_mean, run_var, save_mean, save_var};

runBenchmarkIterations(benchmark_state, executor_cache, aten_inputs);
runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
int64_t(benchmark_state.iterations()) *
6 changes: 3 additions & 3 deletions benchmarks/cpp/batch_norm_channels_last.cpp
@@ -100,10 +100,10 @@ static void NvFuserScheduler_BatchNorm_nhwc(
at::Tensor at_bias = at::zeros({input_shape[3]}, options);
at::Tensor at_run_mean = at::zeros({input_shape[3]}, fp32_options);
at::Tensor at_run_var = at::ones({input_shape[3]}, fp32_options);
std::vector<c10::IValue> aten_inputs(
{at_x, at_weight, at_bias, at_run_mean, at_run_var});
KernelArgumentHolder args = {
at_x, at_weight, at_bias, at_run_mean, at_run_var};

runBenchmarkIterations(benchmark_state, executor_cache, aten_inputs);
runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
int64_t(benchmark_state.iterations()) *
6 changes: 3 additions & 3 deletions benchmarks/cpp/batch_norm_channels_last_backward.cpp
@@ -113,10 +113,10 @@ static void NvFuserScheduler_BatchNorm_nhwc_BWD(
at::Tensor save_mean = at::zeros({input_shape[3]}, fp32_options);
at::Tensor save_var = at::ones({input_shape[3]}, fp32_options);

std::vector<c10::IValue> aten_inputs(
{input, grad_out, weight, run_mean, run_var, save_mean, save_var});
KernelArgumentHolder args = {
input, grad_out, weight, run_mean, run_var, save_mean, save_var};

runBenchmarkIterations(benchmark_state, executor_cache, aten_inputs);
runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
int64_t(benchmark_state.iterations()) *
30 changes: 12 additions & 18 deletions benchmarks/cpp/bert.cpp
@@ -132,10 +132,9 @@ static void NvFuserScheduler_DivMaxSoftDropFwd(
at::Tensor t0 = at::randn({w, 1, 1, z}, options);
at::Tensor t1 = at::randn({w, x, y, z}, options);

std::vector<c10::IValue> at_inputs = {t0, t1};
KernelArgumentHolder args = {t0, t1};

auto bytes =
runBenchmarkIterations(benchmark_state, executor_cache, at_inputs);
auto bytes = runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
bytes * int64_t(benchmark_state.iterations()));
@@ -159,10 +158,9 @@ static void NvFuserScheduler_DivMaxSoftDropBwd(
at::Tensor t2 = at::randn({w, x, y, z}, options);
at::Tensor t3 = at::randn({w, x, y, z}, options).round().to(at::kBool);

std::vector<c10::IValue> at_inputs = {t0, t1, t2, t3};
KernelArgumentHolder args = {t0, t1, t2, t3};

auto bytes =
runBenchmarkIterations(benchmark_state, executor_cache, at_inputs);
auto bytes = runBenchmarkIterations(benchmark_state, executor_cache, args);

// For some reason t1 isn't used, ignore it.
bytes -=
@@ -244,10 +242,9 @@ static void NvFuserScheduler_BiasDropoutAddLayernormFwd(
at::Tensor t3 = at::randn({x, y, z}, options);
at::Tensor t4 = at::randn({z}, options);

std::vector<c10::IValue> at_inputs = {t0, t1, t2, t3, t4};
KernelArgumentHolder args = {t0, t1, t2, t3, t4};

auto bytes =
runBenchmarkIterations(benchmark_state, executor_cache, at_inputs);
auto bytes = runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
bytes * int64_t(benchmark_state.iterations()));
@@ -319,10 +316,9 @@ static void NvFuserScheduler_BiasDropoutAddLayernormBwd1(
at::Tensor t2 = at::randn({x, y, 1}, options);
at::Tensor t3 = at::randn({x, y, 1}, options);

std::vector<c10::IValue> at_inputs = {t0, t1, t2, t3};
KernelArgumentHolder args = {t0, t1, t2, t3};

auto bytes =
runBenchmarkIterations(benchmark_state, executor_cache, at_inputs);
auto bytes = runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
bytes * int64_t(benchmark_state.iterations()));
@@ -395,10 +391,9 @@ static void NvFuserScheduler_BiasDropoutAddLayernormBwd2(
at::Tensor t1 = at::randn({x, y, z}, options);
at::Tensor t8 = at::randn({x, y, z}, options);

std::vector<c10::IValue> at_inputs = {t4, t5, t1, t8};
KernelArgumentHolder args = {t4, t5, t1, t8};

auto bytes =
runBenchmarkIterations(benchmark_state, executor_cache, at_inputs);
auto bytes = runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
bytes * int64_t(benchmark_state.iterations()));
@@ -451,10 +446,9 @@ static void NvFuserScheduler_BiasDropoutAddLayernormBwd3(
at::Tensor t0 = at::randn({x, y, z}, options);
at::Tensor t21 = at::randn({x, y, z}, options);

std::vector<c10::IValue> at_inputs = {t0, t21};
KernelArgumentHolder args = {t0, t21};

auto bytes =
runBenchmarkIterations(benchmark_state, executor_cache, at_inputs);
auto bytes = runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
bytes * int64_t(benchmark_state.iterations()));
4 changes: 2 additions & 2 deletions benchmarks/cpp/broadcast.cpp
@@ -72,9 +72,9 @@ static void NvFuserScheduler_Broadcast(

at::Tensor t1 = at::randn({iter_size}, options);

std::vector<c10::IValue> aten_inputs({t0, t1});
KernelArgumentHolder args = {t0, t1};

runBenchmarkIterations(benchmark_state, executor_cache, aten_inputs);
runBenchmarkIterations(benchmark_state, executor_cache, args);

benchmark_state.SetBytesProcessed(
int64_t(benchmark_state.iterations()) *
43 changes: 20 additions & 23 deletions benchmarks/cpp/gelu_backward.cpp
@@ -77,7 +77,7 @@ static void setupFusion(Fusion* fusion) {
fusion->addOutput(t27);
}

static std::vector<c10::IValue> setupInputs() {
static KernelArgumentHolder setupInputs() {
at::manual_seed(0);

auto options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0);
@@ -112,12 +112,12 @@ static void NvFuserScheduler_GeluBackward_AutoSchedule(
benchmark_state.PauseTiming();
Fusion fusion;
setupFusion(&fusion);
std::vector<c10::IValue> inputs = setupInputs();
KernelArgumentHolder args = setupInputs();
benchmark_state.ResumeTiming();

// Auto-schedule
SchedulerEntry::scheduleWith(
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs));
&fusion, SchedulerType::PointWise, args.toC10Array());
}
}

@@ -134,10 +134,10 @@ static void NvFuserScheduler_GeluBackward_Lower(
setupFusion(&fusion);

// inputs
std::vector<c10::IValue> inputs = setupInputs();
KernelArgumentHolder args = setupInputs();

SchedulerEntry::scheduleWith(
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs));
&fusion, SchedulerType::PointWise, args.toC10Array());

for (auto _ : benchmark_state) {
GpuLower(&fusion).run();
@@ -156,14 +156,14 @@ static void NvFuserScheduler_GeluBackward_Compile(
setupFusion(&fusion);

// inputs
std::vector<c10::IValue> inputs = setupInputs();
KernelArgumentHolder args = setupInputs();

auto heuristic_params = SchedulerEntry::scheduleWith(
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs));
&fusion, SchedulerType::PointWise, args.toC10Array());

for (auto _ : benchmark_state) {
KernelExecutor ke;
ke.compile(&fusion, inputs, heuristic_params->lparams);
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams);
}
}

@@ -179,22 +179,21 @@ static void NvFuserScheduler_GeluBackward_RunFusion(
setupFusion(&fusion);

// inputs
std::vector<c10::IValue> inputs = setupInputs();
KernelArgumentHolder args = setupInputs();

// outputs
std::vector<at::Tensor> outputs;

auto heuristic_params = SchedulerEntry::scheduleWith(
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs));
&fusion, SchedulerType::PointWise, args.toC10Array());

KernelExecutor ke;
ke.compile(&fusion, inputs, heuristic_params->lparams);
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams);

C10_CUDA_CHECK(cudaDeviceSynchronize());

for (auto _ : benchmark_state) {
outputs =
ke.run(c10::ArrayRef<c10::IValue>(inputs), heuristic_params->lparams);
outputs = ke.run(args.toC10Array(), heuristic_params->lparams);
C10_CUDA_CHECK(cudaDeviceSynchronize());
clearL2Cache();
}
@@ -213,16 +212,15 @@ static void NvFuserScheduler_GeluBackward_RunFusion_GpuOnly(
setupFusion(&fusion);

// inputs
std::vector<c10::IValue> inputs = setupInputs();
KernelArgumentHolder args = setupInputs();

auto heuristic_params = SchedulerEntry::scheduleWith(
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs));
&fusion, SchedulerType::PointWise, args.toC10Array());
Collaborator: Any plan to update SchedulerEntry::scheduleWith to take KernelArgumentHolder?

Collaborator (Author): Yes, there will be follow-ups to incrementally convert to KernelArgumentHolder consistently in the codebase. #3916 is the next PR, which does the kernel executor, then the validator, then SchedulerEntry, and I'll look for other places.
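For illustration only, a sketch of what such a call site could look like once scheduleWith accepts the holder directly; the overload shown is an assumption about the follow-up, not an API defined in this PR:

```cpp
// Hypothetical follow-up (assumed, not in this PR): scheduleWith taking the
// KernelArgumentHolder directly, so the toC10Array() conversion disappears.
auto heuristic_params = SchedulerEntry::scheduleWith(
    &fusion, SchedulerType::PointWise, args);
```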


KernelExecutor ke;
ke.compile(&fusion, inputs, heuristic_params->lparams);
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams);
Collaborator: Any plan to update KernelExecutor::compile to take KernelArgumentHolder?

Collaborator (Author): Yes, already in the above-mentioned PR.
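Again purely as a sketch of the assumed follow-up API (not part of this change): compile and run would take the holder directly, dropping the toC10Array() calls seen in this diff.

```cpp
// Hypothetical call sites after the follow-up (assumed signatures):
KernelExecutor ke;
ke.compile(&fusion, args, heuristic_params->lparams);
auto outputs = ke.run(args, heuristic_params->lparams);
```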


runBenchmarkIterations(
benchmark_state, &ke, inputs, heuristic_params->lparams);
runBenchmarkIterations(benchmark_state, &ke, args, heuristic_params->lparams);
}

BENCHMARK(NvFuserScheduler_GeluBackward_RunFusion_GpuOnly)
@@ -239,21 +237,20 @@ static void NvFuserScheduler_GeluBackward_RunFusion_CpuOnly(
setupFusion(&fusion);

// inputs
std::vector<c10::IValue> inputs = setupInputs();
KernelArgumentHolder args = setupInputs();

// outputs
std::vector<at::Tensor> outputs;

auto heuristic_params = SchedulerEntry::scheduleWith(
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs));
&fusion, SchedulerType::PointWise, args.toC10Array());

KernelExecutor ke;
ke.setExecuteKernelFlag(false);
ke.compile(&fusion, inputs, heuristic_params->lparams);
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams);

for (auto _ : benchmark_state) {
outputs =
ke.run(c10::ArrayRef<c10::IValue>(inputs), heuristic_params->lparams);
outputs = ke.run(args.toC10Array(), heuristic_params->lparams);
}
}

4 changes: 2 additions & 2 deletions benchmarks/cpp/gelu_backward_reduction.cpp
@@ -110,9 +110,9 @@ static void NvFuserScheduler_GeluBackwardReduction(
(reduction_dim ? at::randn({iter_size, reduction_size}, options)
: at::randn({reduction_size, iter_size}, options));

std::vector<c10::IValue> aten_inputs = {aten_input_grad, aten_input_x};
KernelArgumentHolder args = {aten_input_grad, aten_input_x};

runBenchmarkIterations(benchmark_state, executor_cache, aten_inputs);
runBenchmarkIterations(benchmark_state, executor_cache, args);

// inputs: gradient tensor + input tensor
// outputs: output, output_of_reduction
12 changes: 4 additions & 8 deletions benchmarks/cpp/heuristic_cache.cpp
@@ -31,15 +31,13 @@ static void NvFuserScheduler_LayerNormBackward_HeuristicCache(

// PreAllocate
std::unique_ptr<FusionExecutorCache> executor_cache;
std::vector<c10::IValue> aten_inputs;
KernelArgumentHolder args;

std::vector<int64_t> shape{20, 100, 35, 67};
std::vector<int64_t> norm_shape{67};

auto runtime = getLayerBackwardNormRuntime(
std::move(fusion_ptr), executor_cache, aten_inputs, shape, norm_shape);

KernelArgumentHolder args(aten_inputs);
std::move(fusion_ptr), executor_cache, args, shape, norm_shape);

NVF_ERROR(runtime->getMaybeHeuristicsFor(args).has_value());

@@ -56,15 +54,13 @@ static void NvFuserScheduler_LayerNormForward_HeuristicCache(

// PreAllocate
std::unique_ptr<FusionExecutorCache> executor_cache;
std::vector<c10::IValue> aten_inputs;
KernelArgumentHolder args;

std::vector<int64_t> shape{20, 100, 35, 67};
std::vector<int64_t> norm_shape{67};

auto runtime = getLayerForwardNormRuntime(
std::move(fusion_ptr), executor_cache, aten_inputs, shape, norm_shape);

KernelArgumentHolder args(aten_inputs);
std::move(fusion_ptr), executor_cache, args, shape, norm_shape);

NVF_ERROR(runtime->getMaybeHeuristicsFor(args).has_value());

12 changes: 4 additions & 8 deletions benchmarks/cpp/heuristic_lookup.cpp
@@ -31,15 +31,13 @@ static void NvFuserScheduler_LayerNormBackward_HeuristicLookup(

// PreAllocate
std::unique_ptr<FusionExecutorCache> executor_cache;
std::vector<c10::IValue> aten_inputs;
KernelArgumentHolder args;

std::vector<int64_t> shape{20, 100, 35, 67};
std::vector<int64_t> norm_shape{67};

auto runtime = getLayerBackwardNormRuntime(
std::move(fusion_ptr), executor_cache, aten_inputs, shape, norm_shape);

KernelArgumentHolder args(aten_inputs);
std::move(fusion_ptr), executor_cache, args, shape, norm_shape);

NVF_ERROR(runtime->getMaybeHeuristicsFor(args).has_value());

@@ -56,15 +54,13 @@ static void NvFuserScheduler_LayerNormForward_HeuristicLookup(

// PreAllocate
std::unique_ptr<FusionExecutorCache> executor_cache;
std::vector<c10::IValue> aten_inputs;
KernelArgumentHolder args;

std::vector<int64_t> shape{20, 100, 35, 67};
std::vector<int64_t> norm_shape{67};

auto runtime = getLayerForwardNormRuntime(
std::move(fusion_ptr), executor_cache, aten_inputs, shape, norm_shape);

KernelArgumentHolder args(aten_inputs);
std::move(fusion_ptr), executor_cache, args, shape, norm_shape);

NVF_ERROR(runtime->getMaybeHeuristicsFor(args).has_value());
