-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Working towards returning KernelArgumentHolder/PolymorphicVal's from fusion execution. #3890
Merged
+649
−652
Merged
Changes from all commits
Commits
Show all changes
96 commits
Select commit
Hold shift + click to select a range
04989fb
Move IValueToPolymorphicValue to polymorphic_value.cpp and remove Ker…
csarofeen 17bc49d
Merge branch 'main' into kernel_arg_holder_constructor
csarofeen d0d35e9
Change arg to const ref, expose KAH ctor with NVF_API
jacobhinkle 63abd45
Remove PolymorphicValue pointers.
csarofeen 6e7d7a4
Last build issue, now segfualting.
csarofeen b259e62
Continue working on removing PolymorphicValue pointers. Working on fu…
csarofeen 330eec1
Merge branch 'main' into kernel_arg_holder_constructor
csarofeen 8952a2d
Making progress.
csarofeen 2542301
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen dd04cc1
Merge remote-tracking branch 'origin/kernel_arg_holder_constructor' i…
csarofeen 63110e2
Fixes.
csarofeen b9729bc
Cleanup.
csarofeen a91d2be
Minor cleanup.
csarofeen c352c58
Merge branch 'kernel_arg_holder_constructor' of https://github.com/NV…
csarofeen f5861fc
Working on moving runFusionWithInputs to take in KernelArgumentHolder…
csarofeen 011435b
Update csrc/serde/polymorphic_value.cpp
csarofeen 565250c
Fix for checking args in erase.
csarofeen ade175b
Merge conflicts.
csarofeen 580137e
Fix for empty argumnets to segmenter.
csarofeen 3a9d7bd
Debugging recursive calls into KernelArgumentHolder constructor.
csarofeen 87e2522
Prevent recursive calls into KernelArgHolder constructor.
csarofeen 57a7014
Fix tests, remove support for ArrayType temporarily as shown in AllIn…
csarofeen d229cb1
Fix multi-device segmentation.
csarofeen ba20902
Merge branch 'main' into polymorphic_kernel_arg
csarofeen 216b2e6
Take optional for device argument in setDeviceIndex.
csarofeen 088e2bc
Start moving cache system over to KernelArgumentHolder.
csarofeen 285f8bf
Add optional support to KernelArgumentHolder.
csarofeen 13b5791
Start moving tests off the deprecated fusion executor cache function.
csarofeen c70df80
Standardize as much as possible on initializer list for inputs.
csarofeen 42b06a3
Standardize as much as possible on initializer list for inputs of tests.
csarofeen 40a8871
In tests auto output type when possible, remove c10::IValue for input…
csarofeen c1364b6
Revert "Standardize as much as possible on initializer list for input…
csarofeen 95bb1aa
Merge test cleanup PR.
csarofeen 436a13b
Restore deprecated functions.
csarofeen a75b946
Standardize as much as possible on initializer list for inputs.
csarofeen 2ad67de
Merge branch 'polymorphic_outs_step_1' of https://github.com/NVIDIA/F…
csarofeen c8acaf1
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen d97e923
Fix resharding test.
csarofeen 1544505
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen b8c3ecc
C++ test fixes.
csarofeen 9aa87a2
clang tidy
csarofeen 27be8b0
Merge branch 'c10_tests' into polymorphic_outs_step_1
csarofeen 95c4ccf
Merge branch 'polymorphic_kernel_arg' into polymorphic_outs_step_1
csarofeen e708ac2
Clang tidy.
csarofeen c379b79
Fix bad merge resolution.
csarofeen ca32ca5
Remove _deprecated uses with initializer list.
csarofeen 50489c5
Test fix with new host ir exec.
csarofeen 85ee62e
Merge branch 'main' of https://github.com/NVIDIA/Fuser into c10_tests
csarofeen 6522214
Merge branch 'polymorphic_kernel_arg' into polymorphic_outs_step_1
csarofeen ec6a061
Merge branch 'c10_tests' into polymorphic_outs_step_1
csarofeen e5295d2
Test fix.
csarofeen 2ce0a03
Lint
csarofeen f565664
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen 9436eff
Clang.
csarofeen 6a0ffd5
Add run_tests file
csarofeen 687ea0c
Add run_tests file
csarofeen 27b6878
Add dry run to test to check it's correct.
csarofeen 65fc908
Fix parallel execution of tests, run long running tests first.
csarofeen 48c1dd2
Add summary collection.
csarofeen f274e50
Supress multidevice output to console.
csarofeen 841f0eb
Remove false positive from test checking.
csarofeen 1f82775
Fix multidevice tests, remove exception handling as it hides stack er…
csarofeen 992e543
Adjust timeouts.
csarofeen 1ec9ba4
Test fix.
csarofeen 31b7ec3
Generalize run tests to any number of gpus.
csarofeen 6b28091
Add symlink to latest results.
csarofeen 5843282
Add license.
csarofeen 42a46c1
Test matmul fix.
csarofeen 03b86be
Merge branch 'test_infra' into polymorphic_outs_step_1
csarofeen dc3a64b
Test fix.
csarofeen 7d41049
Merge branch 'test_infra' into polymorphic_outs_step_1
csarofeen 768ee11
Fix running duplicate long tests. Increase long running timeout becau…
csarofeen 1918dc5
Move benchmarks to using KernelArgHolder.
csarofeen 3b48594
Fix profiling with new fusion executor cache function.
csarofeen 5cf4666
Merge branch 'polymorphic_outs_step_1' of https://github.com/NVIDIA/F…
csarofeen 311f097
Benchmark fix.
csarofeen cfaf765
Update tests to use non-deprecated run function.
csarofeen 217c0fc
Convert python frontend.
csarofeen 20d0527
Fix multidevice test.
csarofeen 665b581
Remove FusionExecutorCache::runFusionWithInputs_deprecated.
csarofeen 9cc5c62
Update test file.
csarofeen 0d24ecc
lint
csarofeen 6bc14b0
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen adfe996
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen 89c2174
Cleanup.
csarofeen 2df2a77
Clang, fix examples.
csarofeen 3faca48
Merge branch 'main' of https://github.com/NVIDIA/Fuser into polymorph…
csarofeen 309accf
Remove test file as it's in the test_infra branch/PR.
csarofeen 4bd6838
PR comments.
csarofeen 4fd950e
Try removing optional tensor support in KernelArgumentHolder.
csarofeen 0fb2bba
Get rid of some overloads for KernelArgumentHolder::push
csarofeen 90dec9c
Pass by value.
csarofeen 226dc53
Pass by value.
csarofeen 7739452
Merge branch 'main' into polymorphic_outs_step_1
csarofeen 7ba2654
Test fix.
csarofeen f527bf0
Merge branch 'main' into polymorphic_outs_step_1
csarofeen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -77,7 +77,7 @@ static void setupFusion(Fusion* fusion) { | |
fusion->addOutput(t27); | ||
} | ||
|
||
static std::vector<c10::IValue> setupInputs() { | ||
static KernelArgumentHolder setupInputs() { | ||
at::manual_seed(0); | ||
|
||
auto options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0); | ||
|
@@ -112,12 +112,12 @@ static void NvFuserScheduler_GeluBackward_AutoSchedule( | |
benchmark_state.PauseTiming(); | ||
Fusion fusion; | ||
setupFusion(&fusion); | ||
std::vector<c10::IValue> inputs = setupInputs(); | ||
KernelArgumentHolder args = setupInputs(); | ||
benchmark_state.ResumeTiming(); | ||
|
||
// Auto-schedule | ||
SchedulerEntry::scheduleWith( | ||
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs)); | ||
&fusion, SchedulerType::PointWise, args.toC10Array()); | ||
} | ||
} | ||
|
||
|
@@ -134,10 +134,10 @@ static void NvFuserScheduler_GeluBackward_Lower( | |
setupFusion(&fusion); | ||
|
||
// inputs | ||
std::vector<c10::IValue> inputs = setupInputs(); | ||
KernelArgumentHolder args = setupInputs(); | ||
|
||
SchedulerEntry::scheduleWith( | ||
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs)); | ||
&fusion, SchedulerType::PointWise, args.toC10Array()); | ||
|
||
for (auto _ : benchmark_state) { | ||
GpuLower(&fusion).run(); | ||
|
@@ -156,14 +156,14 @@ static void NvFuserScheduler_GeluBackward_Compile( | |
setupFusion(&fusion); | ||
|
||
// inputs | ||
std::vector<c10::IValue> inputs = setupInputs(); | ||
KernelArgumentHolder args = setupInputs(); | ||
|
||
auto heuristic_params = SchedulerEntry::scheduleWith( | ||
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs)); | ||
&fusion, SchedulerType::PointWise, args.toC10Array()); | ||
|
||
for (auto _ : benchmark_state) { | ||
KernelExecutor ke; | ||
ke.compile(&fusion, inputs, heuristic_params->lparams); | ||
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams); | ||
} | ||
} | ||
|
||
|
@@ -179,22 +179,21 @@ static void NvFuserScheduler_GeluBackward_RunFusion( | |
setupFusion(&fusion); | ||
|
||
// inputs | ||
std::vector<c10::IValue> inputs = setupInputs(); | ||
KernelArgumentHolder args = setupInputs(); | ||
|
||
// outputs | ||
std::vector<at::Tensor> outputs; | ||
|
||
auto heuristic_params = SchedulerEntry::scheduleWith( | ||
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs)); | ||
&fusion, SchedulerType::PointWise, args.toC10Array()); | ||
|
||
KernelExecutor ke; | ||
ke.compile(&fusion, inputs, heuristic_params->lparams); | ||
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams); | ||
|
||
C10_CUDA_CHECK(cudaDeviceSynchronize()); | ||
|
||
for (auto _ : benchmark_state) { | ||
outputs = | ||
ke.run(c10::ArrayRef<c10::IValue>(inputs), heuristic_params->lparams); | ||
outputs = ke.run(args.toC10Array(), heuristic_params->lparams); | ||
C10_CUDA_CHECK(cudaDeviceSynchronize()); | ||
clearL2Cache(); | ||
} | ||
|
@@ -213,16 +212,15 @@ static void NvFuserScheduler_GeluBackward_RunFusion_GpuOnly( | |
setupFusion(&fusion); | ||
|
||
// inputs | ||
std::vector<c10::IValue> inputs = setupInputs(); | ||
KernelArgumentHolder args = setupInputs(); | ||
|
||
auto heuristic_params = SchedulerEntry::scheduleWith( | ||
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs)); | ||
&fusion, SchedulerType::PointWise, args.toC10Array()); | ||
|
||
KernelExecutor ke; | ||
ke.compile(&fusion, inputs, heuristic_params->lparams); | ||
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any plan to update There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, already in the above mentioned PR. |
||
|
||
runBenchmarkIterations( | ||
benchmark_state, &ke, inputs, heuristic_params->lparams); | ||
runBenchmarkIterations(benchmark_state, &ke, args, heuristic_params->lparams); | ||
} | ||
|
||
BENCHMARK(NvFuserScheduler_GeluBackward_RunFusion_GpuOnly) | ||
|
@@ -239,21 +237,20 @@ static void NvFuserScheduler_GeluBackward_RunFusion_CpuOnly( | |
setupFusion(&fusion); | ||
|
||
// inputs | ||
std::vector<c10::IValue> inputs = setupInputs(); | ||
KernelArgumentHolder args = setupInputs(); | ||
|
||
// outputs | ||
std::vector<at::Tensor> outputs; | ||
|
||
auto heuristic_params = SchedulerEntry::scheduleWith( | ||
&fusion, SchedulerType::PointWise, c10::ArrayRef<c10::IValue>(inputs)); | ||
&fusion, SchedulerType::PointWise, args.toC10Array()); | ||
|
||
KernelExecutor ke; | ||
ke.setExecuteKernelFlag(false); | ||
ke.compile(&fusion, inputs, heuristic_params->lparams); | ||
ke.compile(&fusion, args.toC10Array(), heuristic_params->lparams); | ||
|
||
for (auto _ : benchmark_state) { | ||
outputs = | ||
ke.run(c10::ArrayRef<c10::IValue>(inputs), heuristic_params->lparams); | ||
outputs = ke.run(args.toC10Array(), heuristic_params->lparams); | ||
} | ||
} | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any plan to update
SchedulerEntry::scheduleWith
to takeKernelArgumentHolder
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there will be follow ups to incrementally convert to KernelArgumentHolder consistently in the codebase. #3916 is the next PR which does kernel executor, then validator, then SchedulerEntry and I'll look for other places.