DeepGEMM - Efficient FP8 Matrix Multiplication Library
A clear and efficient FP8 matrix multiplication library with fine-grained scaling, optimized for NVIDIA Hopper architecture
DeepGEMM Architecture Design
DeepGEMM features a carefully designed architecture optimized for NVIDIA Hopper Tensor Cores, enabling efficient FP8 matrix multiplication.
Fully JIT Design
DeepGEMM employs a fully Just-In-Time (JIT) design with no install-time compilation. All kernels are compiled at runtime by a lightweight JIT module, which treats GEMM shapes, block sizes, and the number of pipeline stages as compile-time constants, saving registers and enabling more compiler optimizations.
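The idea can be illustrated with a small Python sketch (a conceptual stand-in, not DeepGEMM's actual JIT module): the kernel source is rendered with the problem shape and tuning parameters baked in as constants, compiled once per configuration, and cached for reuse.

# Conceptual sketch of shape-specialized JIT compilation (hypothetical helper, not DeepGEMM's jit module)
import functools

KERNEL_TEMPLATE = """
// Shapes and tuning parameters become compile-time constants:
constexpr int M = {m}, N = {n}, K = {k};
constexpr int BLOCK_M = {block_m}, BLOCK_N = {block_n}, STAGES = {stages};
// ... kernel body ...
"""

@functools.lru_cache(maxsize=None)
def build_kernel(m, n, k, block_m, block_n, stages):
    source = KERNEL_TEMPLATE.format(m=m, n=n, k=k, block_m=block_m,
                                    block_n=block_n, stages=stages)
    # A real JIT module would compile this source with NVCC/NVRTC here and load
    # the resulting binary; the compiled kernel is cached per configuration.
    return source

# One specialized kernel per (shape, config) combination, built on first use.
kernel_src = build_kernel(256, 7168, 4096, block_m=128, block_n=112, stages=5)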
Hopper TMA Features
DeepGEMM fully leverages the Hopper architecture's Tensor Memory Accelerator (TMA) for faster, asynchronous data movement, including TMA loads, stores, broadcasts (multicast), and descriptor prefetching, which significantly enhances performance.
Core Optimization Techniques
DeepGEMM employs several advanced optimization techniques that go beyond traditional GEMM libraries to achieve exceptional performance.
Persistent Thread Specialization
Following the CUTLASS design, DeepGEMM uses persistent warp specialization to overlap data movement, Tensor Core MMA instructions, and CUDA-core promotion (FP32 accumulation), improving computational efficiency.
Unaligned Block Sizes
Supports unaligned block sizes (such as BLOCK_N=112) to better utilize SMs. For example, with M=256 and N=7168, choosing BLOCK_M=128 and BLOCK_N=112 produces 2 x 64 = 128 output tiles, keeping 128 SMs busy, whereas the aligned choice BLOCK_N=128 produces only 2 x 56 = 112 tiles and leaves the remaining SMs idle.
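The tile-count arithmetic behind this example can be checked with a few lines of Python (132 SMs assumed here, as on H100/H800 SXM GPUs):

# Tile counts for aligned vs. unaligned BLOCK_N (132 SMs assumed, as on H100/H800 SXM)
from math import ceil

def tiles_and_busy_sms(m, n, block_m, block_n, num_sms=132):
    tiles = ceil(m / block_m) * ceil(n / block_n)   # number of output tiles to compute
    return tiles, min(tiles, num_sms)               # tiles, SMs busy in the first wave

print(tiles_and_busy_sms(256, 7168, 128, 128))      # (112, 112): aligned BLOCK_N leaves SMs idle
print(tiles_and_busy_sms(256, 7168, 128, 112))      # (128, 128): BLOCK_N=112 keeps 128 SMs busy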
FFMA SASS Interleaving
Modifies FFMA instructions in the compiled binary by flipping their yield and reuse bits, creating more opportunities to overlap MMA instructions with the promotion FFMA instructions and yielding a performance gain of 10% or more in some cases.
Unified Optimized Block Scheduler
A single block scheduler serves all non-grouped and grouped kernels and employs rasterization to enhance L2 cache reuse, improving overall performance.
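Rasterization here means emitting block indices so that consecutively scheduled blocks touch overlapping operand tiles. The Python sketch below shows one simple variant of the general technique (an illustration only, not DeepGEMM's actual scheduler): blocks walk all column tiles for a small group of row tiles before moving on, so the same RHS tiles stay hot in L2.

# Simplified rasterized (swizzled) block ordering for better L2 reuse
# (illustrates the general technique; not DeepGEMM's actual scheduler).
def rasterized_block_order(num_blocks_m, num_blocks_n, group_size=8):
    order = []
    for group_start in range(0, num_blocks_m, group_size):
        rows = range(group_start, min(group_start + group_size, num_blocks_m))
        for n in range(num_blocks_n):       # sweep all column tiles for this row group
            for m in rows:                  # a small group of row tiles reuses RHS tiles in L2
                order.append((m, n))
    return order

print(rasterized_block_order(4, 3, group_size=2))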
Code Explained
DeepGEMM's core kernel function is only about 300 lines of code; its clean design makes it a good reference for learning Hopper FP8 matrix multiplication and its optimization techniques.
// Simplified version of the core GEMM kernel
template <
int BLOCK_M,
int BLOCK_N,
int BLOCK_K,
int CLUSTER_M,
int CLUSTER_N,
int STAGES,
int WARPGROUP_SIZE,
bool ENABLE_INTERLEAVE
>
__global__ void gemm_fp8_fp8_bf16_nt_kernel(
const void* __restrict__ lhs,
const void* __restrict__ lhs_scale,
const void* __restrict__ rhs,
void* __restrict__ output,
const void* __restrict__ rhs_scale,
int m, int n, int k,
int lhs_stride, int rhs_stride, int output_stride) {
// Using TMA to load LHS, LHS scale factors, and RHS
// Using CUDA cores for two-level accumulation (promotion)
// Using TMA to store output
// Persistent thread specialization implementation
// Unaligned block size support
// FFMA SASS interleaving optimization
}
Code Structure Features
- Templated design supporting different block sizes and optimization configurations
- Full utilization of Hopper architecture's TMA features for efficient data movement
- Two-level accumulation to address the imprecise accumulation of FP8 tensor cores (see the sketch after this list)
- Persistent thread specialization implementation for optimized computation and data movement overlap
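To make the two-level accumulation idea concrete, here is a hedged NumPy illustration of the scheme (a CPU stand-in, not DeepGEMM's kernel code; the FP8 Tensor Core MMA is simulated with float32 and the per-block scale layout is simplified): partial products over each K block form the first accumulation level and are then promoted into an FP32 accumulator together with the fine-grained scale factors.

# Numerical illustration of two-level accumulation with fine-grained scaling
# (NumPy stand-in; the FP8 Tensor Core MMA is simulated with float32 here).
import numpy as np

def fp8_like_gemm(a_q, a_scale, b_q, b_scale, block_k=128):
    # a_q: (M, K) quantized LHS,  a_scale: (M, K // block_k) per-[1 x block_k] scales
    # b_q: (N, K) quantized RHS,  b_scale: (N, K // block_k) per-block scales (simplified layout)
    m, k = a_q.shape
    n = b_q.shape[0]
    acc = np.zeros((m, n), dtype=np.float32)                  # level 2: FP32 accumulator (CUDA cores)
    for i, k0 in enumerate(range(0, k, block_k)):
        # level 1: low-precision MMA over one K block (Tensor Cores in the real kernel)
        partial = a_q[:, k0:k0 + block_k] @ b_q[:, k0:k0 + block_k].T
        # promotion: apply the per-block scale factors and accumulate in full precision
        acc += partial.astype(np.float32) * np.outer(a_scale[:, i], b_scale[:, i])
    return acc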
Interface Design
- Normal Dense GEMM: Using the deep_gemm.gemm_fp8_fp8_bf16_nt function (a hedged usage sketch follows this list)
- Grouped GEMM (Contiguous Layout): Using the m_grouped_gemm_fp8_fp8_bf16_nt_contiguous function, suitable for MoE models where the experts share the same shape
- Grouped GEMM (Masked Layout): Using the m_grouped_gemm_fp8_fp8_bf16_nt_masked function, used during the inference decoding phase with CUDA graphs enabled, where the CPU does not know how many tokens each expert will receive
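For the dense case, the call pattern looks roughly like the sketch below. The exact tensor layouts, dtypes, and scale-factor shapes shown here are assumptions based on the 128-channel fine-grained scaling described above; check the library's own documentation and tests for the authoritative interface.

# Hedged usage sketch for the dense FP8 GEMM interface (shapes, dtypes, and the
# (tensor, scales) pairing are assumptions -- consult the DeepGEMM docs/tests).
import torch
import deep_gemm

m, n, k = 256, 7168, 4096
lhs        = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)            # FP8 activations
lhs_scales = torch.ones(m, k // 128, device='cuda', dtype=torch.float32)         # per-[1 x 128] scales
rhs        = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)            # FP8 weights (NT layout)
rhs_scales = torch.ones(n // 128, k // 128, device='cuda', dtype=torch.float32)  # per-[128 x 128] scales
out        = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)              # BF16 output

# D = A @ B^T with FP8 inputs, fine-grained scales, and a BF16 output
deep_gemm.gemm_fp8_fp8_bf16_nt((lhs, lhs_scales), (rhs, rhs_scales), out)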
Performance
DeepGEMM's performance is comparable to or better than that of expert-tuned libraries across a wide range of matrix shapes, with speedups of up to 2.7x in some cases.
Performance Highlights
- Normal GEMM: Up to 2.7x speedup on certain shapes
- Grouped GEMM with contiguous layout: Up to 1.2x speedup
- Grouped GEMM with masked layout: Also up to 1.2x speedup
- Compute throughput of up to 1358 TFLOPS, measured on H800 SXM5 GPUs
- High memory bandwidth utilization, reaching up to 2668 GB/s
Key Features of DeepGEMM
Exceptional Performance
DeepGEMM delivers up to 2.7x speedup compared to expert-tuned libraries on NVIDIA Hopper architecture, achieving over 1350 TFLOPS
Advanced Optimizations
DeepGEMM employs persistent thread specialization, unaligned block sizes, and FFMA SASS interleaving for maximum computational efficiency
Flexible Integration
DeepGEMM seamlessly integrates with deep learning frameworks and scientific computing libraries through clean, well-documented interfaces
Fine-grained Scaling
DeepGEMM applies fine-grained, per-block scale factors to its FP8 operands, preserving accuracy while retaining FP8 Tensor Core throughput
Native FP8 Support
DeepGEMM provides first-class support for FP8 precision, optimized specifically for NVIDIA Hopper Tensor Cores
High Memory Bandwidth
DeepGEMM achieves exceptional memory bandwidth utilization up to 2668 GB/s, maximizing hardware capabilities
Clean Code Design
DeepGEMM features a concise 300-line core kernel with clear design patterns, making it easy to understand and extend
Optimized Grouped GEMM
DeepGEMM excels at both contiguous and masked grouped GEMM operations, perfect for Mixture of Experts (MoE) models
Frequently Asked Questions
- What is DeepGEMM and what makes it special?
- DeepGEMM is an efficient FP8 matrix multiplication library optimized for NVIDIA Hopper architecture. What makes it special is its clean design (only ~300 lines of core code), exceptional performance (up to 2.7x speedup over expert-tuned libraries), and advanced optimization techniques like persistent thread specialization and FFMA SASS interleaving.
- What hardware is DeepGEMM optimized for?
- DeepGEMM is specifically optimized for NVIDIA Hopper architecture, taking full advantage of Hopper's Tensor Memory Accelerator (TMA) features and Tensor Cores for FP8 matrix multiplication. It has been tested on H800 SXM5 GPUs, achieving up to 1358 TFLOPS of computational performance.
- How does DeepGEMM achieve its performance gains?
- DeepGEMM achieves its performance through several advanced techniques: fully JIT design with compile-time constants, unaligned block sizes for better SM utilization, FFMA SASS interleaving, two-level accumulation to solve FP8 precision issues, and a unified optimized block scheduler with rasterization for enhanced L2 cache reuse.
- What types of GEMM operations does DeepGEMM support?
- DeepGEMM supports three main types of GEMM operations: normal dense GEMM (using deep_gemm.gemm_fp8_fp8_bf16_nt), grouped GEMM with contiguous layout, and grouped GEMM with masked layout. The latter two are particularly useful for Mixture of Experts (MoE) models.
- Is DeepGEMM suitable for Mixture of Experts (MoE) models?
- Yes, DeepGEMM is highly suitable for MoE models. It provides specialized grouped GEMM operations with both contiguous and masked layouts, which are optimized for the specific computation patterns in MoE architectures. These operations can deliver up to 1.2x speedup compared to other libraries.
- What precision does DeepGEMM support?
- DeepGEMM primarily focuses on FP8 precision, which is increasingly important for efficient AI model inference. It implements a two-level accumulation approach to solve the imprecise accumulation issues that can occur with FP8 tensor cores, ensuring both performance and accuracy.
- How does DeepGEMM's memory bandwidth utilization compare to other libraries?
- DeepGEMM achieves exceptional memory bandwidth utilization, reaching up to 2668 GB/s. This is possible through its efficient use of Hopper's TMA features for faster and asynchronous data movement, including TMA loads, stores, broadcasts, and descriptor prefetching.
- Can I learn from DeepGEMM's implementation to optimize my own CUDA code?
- Absolutely! DeepGEMM's core kernel is only about 300 lines of code with a clean design, making it an excellent learning resource. It demonstrates advanced CUDA optimization techniques, efficient use of Hopper architecture features, and clever approaches to matrix multiplication that can be applied to other high-performance computing tasks.