- launches a basic GEMM with single precision inputs and outputs
- demonstrates CUTLASS Utilities for allocating and initializing tensors
- debugging utilities for printing register and shared memory contents
- utility for visualizing all layout functions in CUTLASS
- example demonstrating an iterator over tiles in memory
- example demonstrating CUTLASS's batched strided GEMM operation
- example demonstrating CUTLASS's Split-K parallel reduction kernel
- example demonstrating mixed precision GEMM using Volta Tensor Cores
- example demonstrating integer GEMM using Turing Tensor Cores
- 09_turing_tensorop_conv2dfprop: example demonstrating integer implicit GEMM convolution (forward propagation) using Turing Tensor Cores
- example demonstrating planar complex GEMM kernels
- example demonstrating planar complex kernels with batch-specific problem sizes
- example demonstrating GEMM fused with bias and relu
- example demonstrating two GEMMs or convolutions fused in one kernel
- example demonstrating FP32 GEMM with implicit TF32 conversion
- 15_ampere_sparse_tensorop_gemm: example demonstrating usage of Sparse Tensor Cores
- 16_ampere_tensorop_conv2dfprop: example demonstrating forward convolution on tensors of NHWC layout
- example demonstrating convolution fused with per-channel bias and relu
- 18_ampere_fp64_tensorop_affine2_gemm: example demonstrating Affine-2 GEMM
- Canonical GEMM using Tensor Cores
- Canonical GEMM using SIMT
- example demonstrating Quaternion GEMM computations
- example demonstrating Quaternion convolution
- 23_ampere_gemm_operand_reduction_fusion: example demonstrating how to reduce one of the GEMM operands along the k-dimension while computing the GEMM
- example demonstrating a batch of GEMM operations with distinct problem sizes
- 25_ampere_fprop_mainloop_fusion: example demonstrating fusing the activation's per-channel scale+bias+relu into the fprop mainloop
- 26_ampere_wgrad_mainloop_fusion: example demonstrating fusing the activation's per-channel scale+bias+relu into the wgrad mainloop
- 27_ampere_3xtf32_fast_accurate_tensorop_gemm: example demonstrating emulation of a fast, accurate SGEMM with TF32 operations
- 28_ampere_3xtf32_fast_accurate_tensorop_fprop: example demonstrating emulation of a fast, accurate FP32 convolution with TF32 operations
- 29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm: example demonstrating emulation of a fast, accurate CGEMM with TF32 operations
- example demonstrating how to compute the conv2d gradient with respect to the weights (wgrad) together with Split-K
- example demonstrating Symmetric Rank-K update
- example demonstrating Triangular Matrix-Matrix multiplication
- 33_ampere_3xtf32_tensorop_symm: example demonstrating Symmetric Matrix-Matrix multiplication with FP32 emulation
- example demonstrating how to compute 2D transposed convolution, also known as deconvolution, using CUTLASS conv2d Dgrad kernels
- example demonstrating GEMM fused with Softmax in mixed precision using Ampere Tensor Cores
- example that fuses gather before GEMM and scatter after GEMM into the same GEMM kernel
- example that fuses gemm->layernorm->gemm into one kernel
- example demonstrating a batch of SYR2K operations with distinct problem sizes
- example demonstrating batched GEMM operations with output results permuted as reshaped tensors
- example demonstrating CUTLASS with the Python interface
- example demonstrating attention with non-fixed sequence length input
- example demonstrating how to run group convolution kernels on Tensor Cores using functions and data structures provided by CUTLASS
- example demonstrating a Block-Ell sparse GEMM
- example demonstrating fused multihead attention (fixed & variable sequence length) using shared memory
- example demonstrating how to fuse two GEMMs sharing the same left input matrix into one kernel
- example demonstrating depthwise 2D convolution kernels built from CUTLASS functions and data structures using SIMT instructions
- 47_ampere_gemm_universal_streamk: example contrasting the Stream-K parallel decomposition for GEMM threadblocks with the "classic data-parallel" and "Split-K" decompositions
- 48_hopper_warp_specialized_gemm: simple tensorop GEMM example using the CUTLASS 3.0 API, targeting the NVIDIA Hopper architecture
- 49_hopper_gemm_schedules_with_collective_builder: Hopper GEMM example leveraging collective operation builders to showcase the builder API and the various kernel schedules supported in CUTLASS 3.0, such as warp specialized persistent mainloops
- 50_hopper_gemm_with_epilogue_swizzle: Hopper GEMM example that creates a GEMM kernel with a custom collective mainloop and a custom vectorized epilogue
- Hopper GETT example illustrating the ease with which GETTs can be run due to CUTLASS 3.0's unified micro-kernels and CuTe's hierarchical layouts
- 52_hopper_gather_scatter_fusion: Hopper example that fuses gather before GEMM and scatter after GEMM into the same kernel
- Hopper example demonstrating the fusion of tensor permutation operations with a GEMM kernel
- 54_hopper_fp8_warp_specialized_gemm: Hopper example of instantiating and running an FP8 GEMM kernel
- Hopper GEMM example with different A and B data types using the CUTLASS 3.x API for DL kernels with fused dequantization
- 56_hopper_ptr_array_batched_gemm: Hopper Ptr-Array Batched GEMM example using the CUTLASS 3.x API
- Hopper Grouped GEMM using the CUTLASS 3.x API
- Ada GEMM kernel targeting Ada FP8 Tensor Cores via the CUTLASS 2.x API
- CuTe- and CUTLASS 3.x-based Ampere convolution fprop kernel capable of operating on both affine and gather/scatter tensors, showing how kernel authors can reuse CUTLASS 3.x collectives in their custom kernels
- 61_hopper_gemm_with_topk_and_softmax: Hopper GEMM kernel with Top-K and softmax epilogue fusion
- Simple dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using the CUTLASS 3.x API
- 71_blackwell_gemm_with_collective_builder: Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules and epilogue visitor tree (EVT) construction
- 72a_blackwell_narrow_precision_gemm: block-scaled dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using the CUTLASS 3.x API
- 73_blackwell_gemm_preferred_cluster: Blackwell SM100 GEMM kernel with the preferred cluster feature
- Blackwell SM100 GEMM kernel using the Stream-K scheduler
- Blackwell SM100 grouped GEMM kernel
- Simple convolution (fprop/dgrad/wgrad) example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using the CUTLASS 3.x API
- Blackwell SM100 FMHA kernel
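Several of the examples above (the Split-K parallel reduction, wgrad with Split-K, and the Stream-K comparison) rest on the same idea: the K dimension of C = A * B is partitioned into slices, each slice produces a partial product, and the partials are reduced at the end. The sketch below illustrates that idea on the host in plain C++; it is not CUTLASS code, and the names `gemm_slice` and `splitk_gemm` are invented for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Accumulate the K-range [k_begin, k_end) of row-major A (MxK) times
// row-major B (KxN) into row-major C (MxN).
void gemm_slice(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N,
                std::size_t K, std::size_t k_begin, std::size_t k_end) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = k_begin; k < k_end; ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
}

// Split-K: each of `splits` workers owns a contiguous K-slice and its own
// partial accumulator; a final reduction sums the partial products.
std::vector<float> splitk_gemm(const std::vector<float>& A,
                               const std::vector<float>& B, std::size_t M,
                               std::size_t N, std::size_t K,
                               std::size_t splits) {
  std::vector<std::vector<float>> partials(splits,
                                           std::vector<float>(M * N, 0.f));
  std::size_t chunk = (K + splits - 1) / splits;
  for (std::size_t s = 0; s < splits; ++s) {
    std::size_t k0 = s * chunk;
    std::size_t k1 = std::min(K, k0 + chunk);
    if (k0 < k1) gemm_slice(A, B, partials[s], M, N, K, k0, k1);
  }
  std::vector<float> C(M * N, 0.f);
  for (auto const& p : partials)
    for (std::size_t i = 0; i < M * N; ++i) C[i] += p[i];
  return C;
}
```

In CUTLASS the slices run concurrently on different threadblocks, so the reduction is a separate kernel (Split-K) or is interleaved with the MMA work (Stream-K); the sequential sketch only shows why the decomposition is correct.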

Examples that do not rely on CUTLASS and directly showcase the features of CuTe are located in `cutlass/examples/cute`.
Additionally, CuTe's core layout and layout algebra have their own test cases within `cutlass/test/unit/cute/core/` that users might find useful as examples of CuTe.
Examples leveraging CUTLASS's Python interface are located in `cutlass/examples/python`.
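For readers new to CuTe, the core idea behind its layouts is a function from logical coordinates to a linear offset described by a shape/stride pair. The flat, fixed-rank sketch below is plain C++ written only to illustrate that mapping; it does not use CuTe's actual types (such as `cute::Layout`) or its hierarchical, nested shapes, and `SimpleLayout` is a name invented here.

```cpp
#include <array>
#include <cstddef>

// A layout maps a coordinate to an offset: offset = sum_i coord[i] * stride[i],
// where each coord[i] ranges over [0, shape[i]).
template <std::size_t Rank>
struct SimpleLayout {
  std::array<std::size_t, Rank> shape;
  std::array<std::size_t, Rank> stride;

  std::size_t operator()(std::array<std::size_t, Rank> coord) const {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < Rank; ++i) offset += coord[i] * stride[i];
    return offset;
  }
};
```

With shape (4, 3), stride (3, 1) describes a row-major matrix and stride (1, 4) a column-major one; the same data pointer can be viewed through either. CuTe generalizes this to nested shapes/strides and an algebra for composing and tiling them.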