GemmKernels v0.2.0
Merged pull requests:
- Use native Float16 (#69) (@maleadt)
- Parallelized testing using XUnit.jl. (#71) (@maleadt)
- CompatHelper: bump compat for "CUDA" to "3.0" (#76) (@github-actions[bot])
- Fix layout fragment type mismatch (#80) (@smnbl)
- Fix CI (#82) (@maleadt)
- Replace StaticArrays with a simple immutable array type (#83) (@maleadt)
- update CUDA compat (#87) (@smnbl)
- Update README (#88) (@thomasfaingnaert)
- Update operator fusion benchmarks (#89) (@thomasfaingnaert)
- Cleanup kernel launch code (#90) (@thomasfaingnaert)
- Revert "Replace StaticArrays with a simple immutable array type (#83)" (#91) (@thomasfaingnaert)
- Disable codecov status (#92) (@thomasfaingnaert)
- Generalise WMMA Operator (#93) (@thomasfaingnaert)
- Add tensor contraction benchmark (#94) (@thomasfaingnaert)
- Replace GPUifyLoops with KernelAbstractions (#95) (@thomasfaingnaert)
- CompatHelper: bump compat for KernelAbstractions to 0.8, (keep existing compat) (#96) (@github-actions[bot])
- Re-land StaticArrays removal (#98) (@maleadt)
- FPU operator (#101) (@wardvermeulen)
- Add CI for Julia 1.9 (#102) (@thomasfaingnaert)
- Bump compat bounds to use newer CUDA.jl (#103) (@maleadt)
- CompatHelper: bump compat for LLVM to 6, (keep existing compat) (#106) (@github-actions[bot])
- Replace KernelAbstractions with LLVMLoopInfo. (#107) (@maleadt)
- Make LocalArray setindex convert. (#109) (@maleadt)
- Make vectorized store convert and perform multiple stores if required (#111) (@maleadt)
- Configure and check shared memory automatically. (#112) (@maleadt)
- Enable use of FPU operator in BLAS wrappers. (#113) (@maleadt)
- Add a benchmarks bot. (#116) (@maleadt)
- Commit the Manifest. (#118) (@maleadt)
- Introduce a helper macro to simplify immutable indexing. (#119) (@maleadt)
- Add zero layout to optimize alpha/beta=zero. (#120) (@maleadt)
- Use XUnit.jl for parallel testing. (#121) (@maleadt)
- Unify WMMA and FPU operator typevars [NFC] (#122) (@maleadt)
- Transform VecElement-contained values. (#123) (@maleadt)
- Simplify tests. (#124) (@maleadt)
- Update manifest (#126) (@github-actions[bot])
- Fix vector op indexing and add boundscheck. (#127) (@maleadt)
- BLAS: Convert alpha & beta to more appropriate types. (#129) (@maleadt)
- Add layouts for accessing unaligned or non tile-sized global memory. (#130) (@maleadt)
- Fix fragtypes of ColMajor and RowMajor fallback layouts. (#131) (@maleadt)
- Put the BLAS interface directly in the GemmKernels.jl module. (#132) (@maleadt)
- Add example. (#133) (@maleadt)
- Detect alignment issues and throw a Julia error. (#134) (@maleadt)
- Check if the warp doesn't index out of the tile subpartition. (#135) (@maleadt)
- Simplify config definition and usage. (#136) (@maleadt)
- Add a mechanism to expose execution details to callers. (#137) (@maleadt)
- Show kernel details on benchmark differences. (#138) (@maleadt)
- Update manifest (#139) (@github-actions[bot])
- Update manifest (#141) (@github-actions[bot])
- Update manifest (#144) (@github-actions[bot])
- Update manifest (#145) (@github-actions[bot])
- Update manifest (#146) (@github-actions[bot])
- Update manifest (#147) (@github-actions[bot])
- enable dependabot for GitHub actions (#148) (@ranocha)
- Bump peter-evans/create-pull-request from 3 to 5 (#149) (@dependabot[bot])
- Bump actions/checkout from 2 to 3 (#150) (@dependabot[bot])
- Update manifest (#151) (@github-actions[bot])
- Update manifest (#153) (@github-actions[bot])
- CompatHelper: bump compat for "CUDA" to "5" (#155) (@github-actions[bot])
- Update manifest (#156) (@github-actions[bot])
- Bump actions/checkout from 3 to 4 (#157) (@dependabot[bot])
- Update manifest (#158) (@github-actions[bot])
- Rework benchmarks and tests (#160) (@thomasfaingnaert)
- Add more flexible FPU operator (#161) (@wardvermeulen)
- Update manifest (#162) (@github-actions[bot])
- Update manifest (#163) (@github-actions[bot])
- Fix configuration heuristic. (#164) (@maleadt)
- Throw ConfigError for unsupported WMMA shapes (#166) (@thomasfaingnaert)
- Add a check for the block shape in the K dimension (#167) (@wardvermeulen)
- Adapt to CUDA.jl profile changes (#168) (@thomasfaingnaert)
- Compare with cuBLAS during benchmarking (#169) (@thomasfaingnaert)
- Refactor configs to use macros (#170) (@thomasfaingnaert)
- Test more WMMA configurations (#171) (@thomasfaingnaert)
- Improve heuristic for memcopy tile sizes (#172) (@thomasfaingnaert)
- Check number of stages for pipelined kernel (#173) (@thomasfaingnaert)
- Check number of threads before launching kernel (#174) (@thomasfaingnaert)
- Fix alignment check for non 16-byte alignments (#175) (@thomasfaingnaert)
- Do not hardcode vectorisation width in layouts (#176) (@thomasfaingnaert)
- Fix typo in parallelise function name (#178) (@thomasfaingnaert)
- Add script to tune parameters (#179) (@thomasfaingnaert)
- Check tile sizes in config (#180) (@thomasfaingnaert)
- FPUOp: Ensure the FMA operator is inlined. (#182) (@maleadt)
- Extend set of WMMA operator shapes (#183) (@thomasfaingnaert)
- Apply isapprox elementwise (#185) (@thomasfaingnaert)
- Get benchmarks working again (#186) (@thomasfaingnaert)
- Remove Julia 1.8 from CI (#187) (@thomasfaingnaert)
- Refactor tuning script (#190) (@maleadt)
- Bump julia-actions/setup-julia from 1 to 2 (#191) (@dependabot[bot])
Closed issues:
- Errors on small array inputs (#52)
- Feature request: support for matmul with integer matrices (#64)
- Feature request: support Matrix{Float32} = Matrix{Float32} × Matrix{Float32} (#75)
- Remove fragtype_a (#84)
- Replace GPUifyLoops.@unroll (#86)
- Use LLVMLoopInfo.jl (#104)
- Optimizations when alpha or beta is 0 (#110)
- Transform functions: pass values, not VecElements (#114)
- Benchmark bot (#115)
- Questions about usage of registers (#152)
- A wrong function name parallellise (#177)