GemmKernels v0.2.0
Merged pull requests:
- Use native Float16 (#69) (@maleadt)
- Parallelized testing using XUnit.jl. (#71) (@maleadt)
- CompatHelper: bump compat for "CUDA" to "3.0" (#76) (@github-actions[bot])
- Fix layout fragment type mismatch (#80) (@smnbl)
- Fix CI (#82) (@maleadt)
- Replace StaticArrays with a simple immutable array type (#83) (@maleadt)
- update CUDA compat (#87) (@smnbl)
- Update README (#88) (@thomasfaingnaert)
- Update operator fusion benchmarks (#89) (@thomasfaingnaert)
- Cleanup kernel launch code (#90) (@thomasfaingnaert)
- Revert "Replace StaticArrays with a simple immutable array type (#83)" (#91) (@thomasfaingnaert)
- Disable codecov status (#92) (@thomasfaingnaert)
- Generalise WMMA Operator (#93) (@thomasfaingnaert)
- Add tensor contraction benchmark (#94) (@thomasfaingnaert)
- Replace GPUifyLoops with KernelAbstractions (#95) (@thomasfaingnaert)
- CompatHelper: bump compat for KernelAbstractions to 0.8, (keep existing compat) (#96) (@github-actions[bot])
- Re-land StaticArrays removal (#98) (@maleadt)
- FPU operator (#101) (@wardvermeulen)
- Add CI for Julia 1.9 (#102) (@thomasfaingnaert)
- Bump compat bounds to use newer CUDA.jl (#103) (@maleadt)
- CompatHelper: bump compat for LLVM to 6, (keep existing compat) (#106) (@github-actions[bot])
- Replace KernelAbstractions with LLVMLoopInfo. (#107) (@maleadt)
- Make LocalArray setindex convert. (#109) (@maleadt)
- Make vectorized store convert and perform multiple stores if required (#111) (@maleadt)
- Configure and check shared memory automatically. (#112) (@maleadt)
- Enable use of FPU operator in BLAS wrappers. (#113) (@maleadt)
- Add a benchmarks bot. (#116) (@maleadt)
- Commit the Manifest. (#118) (@maleadt)
- Introduce a helper macro to simplify immutable indexing. (#119) (@maleadt)
- Add zero layout to optimize alpha/beta=zero. (#120) (@maleadt)
- Use XUnit.jl for parallel testing. (#121) (@maleadt)
- Unify WMMA and FPU operator typevars [NFC] (#122) (@maleadt)
- Transform VecElement-contained values. (#123) (@maleadt)
- Simplify tests. (#124) (@maleadt)
- Update manifest (#126) (@github-actions[bot])
- Fix vector op indexing and add boundscheck. (#127) (@maleadt)
- BLAS: Convert alpha & beta to more appropriate types. (#129) (@maleadt)
- Add layouts for accessing unaligned or non tile-sized global memory. (#130) (@maleadt)
- Fix fragtypes of ColMajor and RowMajor fallback layouts. (#131) (@maleadt)
- Put the BLAS interface directly in the GemmKernels.jl module. (#132) (@maleadt)
- Add example. (#133) (@maleadt)
- Detect alignment issues and throw a Julia error. (#134) (@maleadt)
- Check if the warp doesn't index out of the tile subpartition. (#135) (@maleadt)
- Simplify config definition and usage. (#136) (@maleadt)
- Add a mechanism to expose execution details to callers. (#137) (@maleadt)
- Show kernel details on benchmark differences. (#138) (@maleadt)
- Update manifest (#139) (@github-actions[bot])
- Update manifest (#141) (@github-actions[bot])
- Update manifest (#144) (@github-actions[bot])
- Update manifest (#145) (@github-actions[bot])
- Update manifest (#146) (@github-actions[bot])
- Update manifest (#147) (@github-actions[bot])
- enable dependabot for GitHub actions (#148) (@ranocha)
- Bump peter-evans/create-pull-request from 3 to 5 (#149) (@dependabot[bot])
- Bump actions/checkout from 2 to 3 (#150) (@dependabot[bot])
- Update manifest (#151) (@github-actions[bot])
- Update manifest (#153) (@github-actions[bot])
- CompatHelper: bump compat for "CUDA" to "5" (#155) (@github-actions[bot])
- Update manifest (#156) (@github-actions[bot])
- Bump actions/checkout from 3 to 4 (#157) (@dependabot[bot])
- Update manifest (#158) (@github-actions[bot])
- Rework benchmarks and tests (#160) (@thomasfaingnaert)
- Add more flexible FPU operator (#161) (@wardvermeulen)
- Update manifest (#162) (@github-actions[bot])
- Update manifest (#163) (@github-actions[bot])
- Fix configuration heuristic. (#164) (@maleadt)
- Throw ConfigError for unsupported WMMA shapes (#166) (@thomasfaingnaert)
- Add a check for the block shape in the K dimension (#167) (@wardvermeulen)
- Adapt to CUDA.jl profile changes (#168) (@thomasfaingnaert)
- Compare with cuBLAS during benchmarking (#169) (@thomasfaingnaert)
- Refactor configs to use macros (#170) (@thomasfaingnaert)
- Test more WMMA configurations (#171) (@thomasfaingnaert)
- Improve heuristic for memcopy tile sizes (#172) (@thomasfaingnaert)
- Check number of stages for pipelined kernel (#173) (@thomasfaingnaert)
- Check number of threads before launching kernel (#174) (@thomasfaingnaert)
- Fix alignment check for non 16-byte alignments (#175) (@thomasfaingnaert)
- Do not hardcode vectorisation width in layouts (#176) (@thomasfaingnaert)
- Fix typo in parallelise function name (#178) (@thomasfaingnaert)
- Add script to tune parameters (#179) (@thomasfaingnaert)
- Check tile sizes in config (#180) (@thomasfaingnaert)
- FPUOp: Ensure the FMA operator is inlined. (#182) (@maleadt)
- Extend set of WMMA operator shapes (#183) (@thomasfaingnaert)
- Apply isapprox elementwise (#185) (@thomasfaingnaert)
- Get benchmarks working again (#186) (@thomasfaingnaert)
- Remove Julia 1.8 from CI (#187) (@thomasfaingnaert)
- Refactor tuning script (#190) (@maleadt)
- Bump julia-actions/setup-julia from 1 to 2 (#191) (@dependabot[bot])
Closed issues:
- Errors on small array inputs (#52)
- Feature request: support for matmul with integer matrices (#64)
- Feature request: support Matrix{Float32} = Matrix{Float32} × Matrix{Float32} (#75)
- Remove fragtype_a (#84)
- Replace GPUifyLoops.@unroll (#86)
- Use LLVMLoopInfo.jl (#104)
- Optimizations when alpha or beta is 0 (#110)
- Transform functions: pass values, not VecElements (#114)
- Benchmark bot (#115)
- Questions about usage of registers (#152)
- A wrong function name parallellise (#177)