- launches a basic GEMM with single precision inputs and outputs
- demonstrates CUTLASS Utilities for allocating and initializing tensors
- debugging utilities for printing register and shared memory contents
- utility for visualizing all layout functions in CUTLASS
- example demonstrating an iterator over tiles in memory
- example demonstrating CUTLASS's batched strided GEMM operation
- example demonstrating CUTLASS's Split-K parallel reduction kernel
- example demonstrating mixed precision GEMM using Volta Tensor Cores
- example demonstrating integer GEMM using Turing Tensor Cores
- 09_turing_tensorop_conv2dfprop: example demonstrating integer implicit GEMM convolution (forward propagation) using Turing Tensor Cores
- example demonstrating planar complex GEMM kernels
- example demonstrating planar complex kernels with batch-specific problem sizes
- example demonstrating GEMM fused with bias and relu
- example demonstrating two GEMMs or convolutions fused in one kernel
- example demonstrating FP32 GEMM with implicit TF32 conversion
- 15_ampere_sparse_tensorop_gemm: example demonstrating usage of Sparse Tensor Cores
- 16_ampere_tensorop_conv2dfprop: example demonstrating forward convolution on tensors of NHWC layout
- example demonstrating convolution fused with per-channel bias and relu
- 18_ampere_fp64_tensorop_affine2_gemm: example demonstrating Affine-2 GEMM
- Canonical GEMM using Tensor Cores
- Canonical GEMM using SIMT
- example demonstrating Quaternion GEMM computations
- example demonstrating Quaternion convolution
- 23_ampere_gemm_operand_reduction_fusion: example demonstrating how to reduce one of the GEMM operands along the k-dimension while computing the GEMM
- example demonstrating a batch of GEMM operations with distinct problem sizes
- 25_ampere_fprop_mainloop_fusion: example demonstrating fusing the activation's per-channel scale+bias+relu into the fprop mainloop
- 26_ampere_wgrad_mainloop_fusion: example demonstrating fusing the activation's per-channel scale+bias+relu into the wgrad mainloop
- 27_ampere_3xtf32_fast_accurate_tensorop_gemm: example demonstrating emulation of a fast, accurate SGEMM with TF32 operations
- 28_ampere_3xtf32_fast_accurate_tensorop_fprop: example demonstrating emulation of a fast, accurate FP32 convolution with TF32 operations
- 29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm: example demonstrating emulation of a fast, accurate CGEMM with TF32 operations
- example demonstrating how to compute the conv2d gradient with respect to the weights (wgrad) together with Split-K
- example demonstrating Symmetric Rank-K update
- example demonstrating Triangular Matrix-Matrix multiplication
- 33_ampere_3xtf32_tensorop_symm: example demonstrating Symmetric Matrix-Matrix multiplication with FP32 emulation
- example demonstrating how to compute 2D transposed convolution, also known as deconvolution, using CUTLASS conv2d Dgrad kernels
- example demonstrating GEMM fused with Softmax in mixed precision using Ampere Tensor Cores
- example that fuses gather before GEMM and scatter after GEMM into the same GEMM kernel
- example that fuses gemm->layernorm->gemm into one kernel
- example demonstrating a batch of SYR2K operations with distinct problem sizes
- example demonstrating batched GEMM operations with output results permuted as reshaped tensors
- example demonstrating CUTLASS with the Python interface
- example demonstrating attention with non-fixed sequence length input
- example demonstrating how to run group convolution kernels on Tensor Cores using functions and data structures provided by CUTLASS
- example demonstrating a Block-Ell sparse GEMM
- example demonstrating fused multihead attention (fixed & variable sequence length) using shared memory
- example demonstrating how to fuse two GEMMs sharing the same left input matrix into one kernel
- example demonstrating depthwise 2D convolution kernels built from CUTLASS functions and data structures using SIMT instructions
- 47_ampere_gemm_universal_streamk: example contrasting the Stream-K parallel decomposition for GEMM threadblocks with the "classic data-parallel" and "Split-K" decompositions
- 48_hopper_warp_specialized_gemm: simple tensorop GEMM example using the CUTLASS 3.0 API, targeting the NVIDIA Hopper architecture
- 49_hopper_gemm_schedules_with_collective_builder: Hopper GEMM example leveraging collective operation builders to showcase the builder API and the various kernel schedules supported in CUTLASS 3.0, such as warp specialized persistent mainloops
- 50_hopper_gemm_with_epilogue_swizzle: Hopper GEMM example that creates a GEMM kernel with a custom collective mainloop and a custom vectorized epilogue
- Hopper GETT example illustrating the ease with which GETTs can be run due to CUTLASS 3.0's unified micro-kernels and CuTe's hierarchical layouts
- 52_hopper_gather_scatter_fusion: Hopper example that fuses gather before GEMM and scatter after GEMM into the same kernel
- Hopper example demonstrating the fusion of tensor permutation operations with a GEMM kernel
- 54_hopper_fp8_warp_specialized_gemm: Hopper example of instantiating and running an FP8 GEMM kernel
- Hopper GEMM example with different A and B data types using the CUTLASS 3.x API for DL kernels with fused dequantization
- 56_hopper_ptr_array_batched_gemm: Hopper Ptr-Array Batched GEMM example using the CUTLASS 3.x API
- Hopper Grouped GEMM using the CUTLASS 3.x API
- Ada GEMM kernel targeting Ada FP8 Tensor Cores via the CUTLASS 2.x API
- CuTe- and CUTLASS 3.x-based Ampere convolution fprop kernel capable of operating on both affine and gather/scatter tensors, showing how kernel authors can reuse CUTLASS 3.x collectives in their custom kernels
- 61_hopper_gemm_with_topk_and_softmax: Hopper GEMM kernel with Top-K and softmax epilogue fusion
- Simple dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using the CUTLASS 3.x API
- 71_blackwell_gemm_with_collective_builder: Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules and epilogue visitor tree (EVT) construction
- 72a_blackwell_narrow_precision_gemm: block-scaled dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using the CUTLASS 3.x API
- 73_blackwell_gemm_preferred_cluster: Blackwell SM100 GEMM kernel with the preferred cluster feature
- Blackwell SM100 GEMM kernel using the Stream-K scheduler
- Blackwell SM100 grouped GEMM kernel
- Simple convolution (fprop/dgrad/wgrad) example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using the CUTLASS 3.x API
- Blackwell SM100 FMHA kernel
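Several of the examples above (the Split-K parallel reduction, wgrad with Split-K, and the Stream-K comparison) rest on the same idea: the K dimension of C = A * B is partitioned into slices, each slice produces a partial product, and the partials are reduced at the end. The sketch below illustrates that idea on the host in plain C++; it is not CUTLASS code, and the names `gemm_slice` and `splitk_gemm` are invented for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Accumulate the K-range [k_begin, k_end) of row-major A (MxK) times
// row-major B (KxN) into row-major C (MxN).
void gemm_slice(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N,
                std::size_t K, std::size_t k_begin, std::size_t k_end) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = k_begin; k < k_end; ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
}

// Split-K: each of `splits` workers owns a contiguous K-slice and its own
// partial accumulator; a final reduction sums the partial products.
std::vector<float> splitk_gemm(const std::vector<float>& A,
                               const std::vector<float>& B, std::size_t M,
                               std::size_t N, std::size_t K,
                               std::size_t splits) {
  std::vector<std::vector<float>> partials(splits,
                                           std::vector<float>(M * N, 0.f));
  std::size_t chunk = (K + splits - 1) / splits;
  for (std::size_t s = 0; s < splits; ++s) {
    std::size_t k0 = s * chunk;
    std::size_t k1 = std::min(K, k0 + chunk);
    if (k0 < k1) gemm_slice(A, B, partials[s], M, N, K, k0, k1);
  }
  std::vector<float> C(M * N, 0.f);
  for (auto const& p : partials)
    for (std::size_t i = 0; i < M * N; ++i) C[i] += p[i];
  return C;
}
```

In CUTLASS the slices run concurrently on different threadblocks, so the reduction is a separate kernel (Split-K) or is interleaved with the MMA work (Stream-K); the sequential sketch only shows why the decomposition is correct.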

Examples that do not rely on CUTLASS and directly showcase the features of CuTe are located in `cutlass/examples/cute`.
Additionally, CuTe's core layout and layout algebra have their own test cases within `cutlass/test/unit/cute/core/` that users might find useful as examples of CuTe.
Examples leveraging CUTLASS's Python interface are located in `cutlass/examples/python`.
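For readers new to CuTe, the core idea behind its layouts is a function from logical coordinates to a linear offset described by a shape/stride pair. The flat, fixed-rank sketch below is plain C++ written only to illustrate that mapping; it does not use CuTe's actual types (such as `cute::Layout`) or its hierarchical, nested shapes, and `SimpleLayout` is a name invented here.

```cpp
#include <array>
#include <cstddef>

// A layout maps a coordinate to an offset: offset = sum_i coord[i] * stride[i],
// where each coord[i] ranges over [0, shape[i]).
template <std::size_t Rank>
struct SimpleLayout {
  std::array<std::size_t, Rank> shape;
  std::array<std::size_t, Rank> stride;

  std::size_t operator()(std::array<std::size_t, Rank> coord) const {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < Rank; ++i) offset += coord[i] * stride[i];
    return offset;
  }
};
```

With shape (4, 3), stride (3, 1) describes a row-major matrix and stride (1, 4) a column-major one; the same data pointer can be viewed through either. CuTe generalizes this to nested shapes/strides and an algebra for composing and tiling them.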