DeepGEMM - Efficient FP8 Matrix Multiplication Library

A clear and efficient FP8 matrix multiplication library with fine-grained scaling, optimized for NVIDIA Hopper architecture

DeepGEMM Matrix Multiplication Visualization

DeepGEMM Architecture Design

DeepGEMM features a carefully designed architecture optimized for NVIDIA Hopper Tensor Cores, enabling efficient FP8 matrix multiplication.

DeepGEMM Architecture Diagram

Fully JIT Design

DeepGEMM employs a fully Just-In-Time (JIT) design with no install-time compilation. All kernels are compiled at runtime by a lightweight JIT module that treats GEMM shapes, block sizes, and the number of pipeline stages as compile-time constants, which saves registers and enables more compiler optimizations.
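
To make this concrete, here is a minimal Python sketch of the general idea, not DeepGEMM's actual JIT module: the GEMM shape and tile sizes are rendered into the kernel source as compile-time constants, the source is compiled once, and the binary is cached under a hash of the generated code. The template contents, cache layout, and nvcc invocation are illustrative assumptions.

# Illustrative runtime-JIT sketch (not DeepGEMM's actual JIT module).
import hashlib
import subprocess
from pathlib import Path

KERNEL_TEMPLATE = """
// The shape and tile sizes below are literal constants, so the compiler can
// fully unroll loops and allocate registers for exactly this configuration.
__global__ void gemm_kernel(const void* lhs, const void* rhs, void* out) {{
    constexpr int M = {m}, N = {n}, K = {k};
    constexpr int BLOCK_M = {block_m}, BLOCK_N = {block_n}, BLOCK_K = {block_k};
    // ... the actual GEMM body would go here ...
}}
"""

_kernel_cache = {}

def build_kernel(m, n, k, block_m=128, block_n=128, block_k=128,
                 cache_dir=Path("./jit_cache")):
    source = KERNEL_TEMPLATE.format(m=m, n=n, k=k, block_m=block_m,
                                    block_n=block_n, block_k=block_k)
    key = hashlib.sha256(source.encode()).hexdigest()[:16]
    if key in _kernel_cache:                    # already compiled in this process
        return _kernel_cache[key]
    cache_dir.mkdir(parents=True, exist_ok=True)
    cu_path, cubin_path = cache_dir / f"{key}.cu", cache_dir / f"{key}.cubin"
    if not cubin_path.exists():                 # on-disk cache across processes
        cu_path.write_text(source)
        # Compile for Hopper (sm_90a); requires a CUDA 12 toolkit on PATH.
        subprocess.run(["nvcc", "-arch=sm_90a", "-cubin", str(cu_path),
                        "-o", str(cubin_path)], check=True)
    _kernel_cache[key] = cubin_path
    return cubin_path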

Hopper TMA Features

DeepGEMM fully leverages the Hopper architecture's Tensor Memory Accelerator (TMA) for faster, asynchronous data movement, including TMA loads, stores, broadcasts, and descriptor prefetching, which significantly enhances performance.

Core Optimization Techniques

DeepGEMM employs several advanced optimization techniques that go beyond traditional GEMM libraries to achieve exceptional performance.

DeepGEMM Optimization Techniques

Persistent Thread Specialization

Following the CUTLASS design, DeepGEMM overlaps data movement, Tensor Core MMA instructions, and CUDA-core promotion to improve computational efficiency.

Unaligned Block Sizes

DeepGEMM supports unaligned block sizes (such as 112) to better utilize SMs. For example, with M=256, N=7168, and BLOCK_M=128, an aligned BLOCK_N=128 yields only 2 × 56 = 112 output tiles, so at most 112 SMs do useful work, whereas BLOCK_N=112 yields 2 × 64 = 128 tiles and keeps 128 SMs busy.
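
The tile-count arithmetic can be reproduced with a few lines of Python; the 132-SM figure in the comment assumes an H800/H100 SXM GPU and is only context for why 128 tiles is a better fit than 112.

# Check the SM-utilization argument above: unaligned BLOCK_N=112 produces
# more output tiles than the aligned BLOCK_N=128 for the same problem,
# which maps better onto the 132 SMs of an H800/H100 SXM GPU.
from math import ceil

def num_tiles(m, n, block_m, block_n):
    return ceil(m / block_m) * ceil(n / block_n)

m, n = 256, 7168
print(num_tiles(m, n, 128, 128))  # 2 * 56 = 112 tiles -> 112 busy SMs
print(num_tiles(m, n, 128, 112))  # 2 * 64 = 128 tiles -> 128 busy SMs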

FFMA SASS Interleaving

DeepGEMM modifies FFMA instructions in the compiled binary, flipping their yield and reuse bits to create more opportunities for overlapping MMA instructions with promotion FFMA instructions, which yields more than 10% higher performance in some cases.

Unified Optimized Block Scheduler

A single scheduler serves all non-grouped and grouped kernels and employs rasterization to enhance L2 cache reuse, improving overall performance.
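
The sketch below illustrates the general idea behind block rasterization; the grouping scheme and group size are illustrative assumptions, not DeepGEMM's actual scheduler. Output tiles are visited in narrow row groups, so consecutively scheduled blocks share an RHS tile and reuse LHS tiles that are still resident in L2.

# Illustrative block rasterization (not DeepGEMM's actual scheduler):
# walk output tiles in narrow groups of rows instead of plain row-major
# order, so consecutive blocks share the same RHS column tile and the
# group's LHS row tiles stay hot in L2 across columns.
def rasterize(num_m_blocks, num_n_blocks, group_size=8):
    order = []
    for group_start in range(0, num_m_blocks, group_size):
        rows = min(group_size, num_m_blocks - group_start)
        for n in range(num_n_blocks):
            for m in range(group_start, group_start + rows):
                order.append((m, n))
    return order

print(rasterize(4, 4, group_size=2)[:8])
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]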

Code Explained

DeepGEMM's core kernel function is only about 300 lines of code, with a clean design that makes it easy to learn about Hopper FP8 matrix multiplication and optimization techniques.

// Simplified version of the core GEMM kernel
template <
    int BLOCK_M,
    int BLOCK_N,
    int BLOCK_K,
    int CLUSTER_M,
    int CLUSTER_N,
    int STAGES,
    int WARPGROUP_SIZE,
    bool ENABLE_INTERLEAVE
>
__global__ void gemm_fp8_fp8_bf16_nt_kernel(
    const void* __restrict__ lhs,
    const void* __restrict__ lhs_scale,
    const void* __restrict__ rhs,
    void* __restrict__ output,
    const void* __restrict__ rhs_scale,
    int m, int n, int k,
    int lhs_stride, int rhs_stride, int output_stride) {
    
    // Using TMA to load LHS, LHS scale factors, and RHS
    // Using CUDA cores for two-level accumulation (promotion)
    // Using TMA to store output
    
    // Persistent thread specialization implementation
    // Unaligned block size support
    // FFMA SASS interleaving optimization
}

Code Structure Features

  • Templated design supporting different block sizes and optimization configurations
  • Full utilization of Hopper architecture's TMA features for efficient data movement
  • Two-level accumulation to solve the imprecise accumulation issue of FP8 tensor cores (a numerical sketch follows the diagram below)
  • Persistent thread specialization implementation for optimized computation and data movement overlap

DeepGEMM Code Flow Diagram
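
As a rough numerical illustration of the two-level accumulation mentioned above, the following NumPy sketch promotes each K-block's low-precision partial product into an FP32 accumulator with its scale factors applied. It is a simplification: float16 stands in for FP8 (NumPy has no FP8 dtype), and a single scale per K-block slab is coarser than DeepGEMM's fine-grained scaling.

# Two-level accumulation sketch: low-precision partial products per K-block,
# promoted into an FP32 accumulator together with the block scale factors.
import numpy as np

def quantize_block(block, max_val=448.0):   # 448 is the largest finite e4m3 value
    scale = np.abs(block).max() / max_val + 1e-12
    return (block / scale).astype(np.float16), scale   # float16 stands in for FP8

def gemm_two_level(lhs, rhs, block_k=128):
    m, k = lhs.shape
    _, n = rhs.shape
    acc = np.zeros((m, n), dtype=np.float32)        # FP32 accumulator (promotion target)
    for k0 in range(0, k, block_k):
        a_q, a_scale = quantize_block(lhs[:, k0:k0 + block_k])
        b_q, b_scale = quantize_block(rhs[k0:k0 + block_k, :])
        # Low-precision partial product (the tensor cores' job in the real kernel)...
        partial = a_q.astype(np.float32) @ b_q.astype(np.float32)
        # ...promoted into the FP32 accumulator with the block scales applied.
        acc += (a_scale * b_scale) * partial
    return acc

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 512)).astype(np.float32)
B = rng.standard_normal((512, 128)).astype(np.float32)
print(np.abs(gemm_two_level(A, B) - A @ B).max())   # quantization error stays small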

Interface Design

  • Normal Dense GEMM: use the deep_gemm.gemm_fp8_fp8_bf16_nt function (see the usage sketch after this list)
  • Grouped GEMM (Contiguous Layout): use the m_grouped_gemm_fp8_fp8_bf16_nt_contiguous function, suited to MoE models where the experts share the same shape
  • Grouped GEMM (Masked Layout): use the m_grouped_gemm_fp8_fp8_bf16_nt_masked function, intended for the inference decoding phase with CUDA graphs enabled, where the CPU does not know how many tokens each expert will receive
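
Below is a hypothetical usage sketch for the normal dense GEMM entry point. The argument convention shown here (FP8 tensors paired with FP32 scale tensors and a BF16 output), the scale-tensor shapes, and the exact call signature are assumptions for illustration only; consult the DeepGEMM documentation for the real API and any required scale-tensor layout.

# Hypothetical usage sketch for deep_gemm.gemm_fp8_fp8_bf16_nt; tensor layouts,
# scale shapes, and the call signature are assumptions, not the documented API.
import torch
import deep_gemm

m, n, k = 4096, 7168, 2048

# FP8 (e4m3) operands; "nt" is assumed to mean a row-major LHS and a
# transposed RHS, so RHS is stored as an (n, k) tensor.
lhs = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)
rhs = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)

# Assumed fine-grained scale factors: per-row 1x128 scales for LHS and
# 128x128 block scales for RHS, both in FP32.
lhs_scales = torch.ones(m, k // 128, device="cuda", dtype=torch.float32)
rhs_scales = torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32)

out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)
deep_gemm.gemm_fp8_fp8_bf16_nt((lhs, lhs_scales), (rhs, rhs_scales), out)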

Performance

DeepGEMM's performance is comparable to or better than expert-tuned libraries across various matrix shapes, with up to a 2.7x speedup in some cases.

DeepGEMM Performance Comparison

Performance Highlights

  • Normal GEMM: Up to 2.7x speedup on certain shapes
  • Grouped GEMM with contiguous layout: Up to 1.2x speedup
  • Grouped GEMM with masked layout: Also up to 1.2x speedup
  • Up to 1358 TFLOPS of computational performance measured on an H800 SXM5 GPU
  • High memory bandwidth utilization, reaching up to 2668 GB/s

Key Features of DeepGEMM

Exceptional Performance

DeepGEMM delivers up to 2.7x speedup compared to expert-tuned libraries on NVIDIA Hopper architecture, achieving over 1350 TFLOPS

Advanced Optimizations

DeepGEMM employs persistent thread specialization, unaligned block sizes, and FFMA SASS interleaving for maximum computational efficiency

Flexible Integration

DeepGEMM seamlessly integrates with deep learning frameworks and scientific computing libraries through clean, well-documented interfaces

Fine-grained Scaling

DeepGEMM efficiently scales from small matrices to large-scale computations with optimized block scheduling and rasterization techniques

Native FP8 Support

DeepGEMM provides first-class support for FP8 precision, optimized specifically for NVIDIA Hopper Tensor Cores

High Memory Bandwidth

DeepGEMM achieves exceptional memory bandwidth utilization up to 2668 GB/s, maximizing hardware capabilities

Clean Code Design

DeepGEMM features a concise 300-line core kernel with clear design patterns, making it easy to understand and extend

Optimized Grouped GEMM

DeepGEMM excels at both contiguous and masked grouped GEMM operations, perfect for Mixture of Experts (MoE) models

Frequently Asked Questions

What is DeepGEMM and what makes it special?
DeepGEMM is an efficient FP8 matrix multiplication library optimized for NVIDIA Hopper architecture. What makes it special is its clean design (only ~300 lines of core code), exceptional performance (up to 2.7x speedup over expert-tuned libraries), and advanced optimization techniques like persistent thread specialization and FFMA SASS interleaving.
What hardware is DeepGEMM optimized for?
DeepGEMM is specifically optimized for NVIDIA Hopper architecture, taking full advantage of Hopper's Tensor Memory Accelerator (TMA) features and Tensor Cores for FP8 matrix multiplication. It has been tested on H800 SXM5 GPUs, achieving up to 1358 TFLOPS of computational performance.
How does DeepGEMM achieve its performance gains?
DeepGEMM achieves its performance through several advanced techniques: fully JIT design with compile-time constants, unaligned block sizes for better SM utilization, FFMA SASS interleaving, two-level accumulation to solve FP8 precision issues, and a unified optimized block scheduler with rasterization for enhanced L2 cache reuse.
What types of GEMM operations does DeepGEMM support?
DeepGEMM supports three main types of GEMM operations: normal dense GEMM (using deep_gemm.gemm_fp8_fp8_bf16_nt), grouped GEMM with contiguous layout, and grouped GEMM with masked layout. The latter two are particularly useful for Mixture of Experts (MoE) models.
Is DeepGEMM suitable for Mixture of Experts (MoE) models?
Yes, DeepGEMM is highly suitable for MoE models. It provides specialized grouped GEMM operations with both contiguous and masked layouts, which are optimized for the specific computation patterns in MoE architectures. These operations can deliver up to 1.2x speedup compared to other libraries.
What precision does DeepGEMM support?
DeepGEMM primarily focuses on FP8 precision, which is increasingly important for efficient AI model inference. It implements a two-level accumulation approach to solve the imprecise accumulation issues that can occur with FP8 tensor cores, ensuring both performance and accuracy.
How does DeepGEMM's memory bandwidth utilization compare to other libraries?
DeepGEMM achieves exceptional memory bandwidth utilization, reaching up to 2668 GB/s. This is possible through its efficient use of Hopper's TMA features for faster and asynchronous data movement, including TMA loads, stores, broadcasts, and descriptor prefetching.
Can I learn from DeepGEMM's implementation to optimize my own CUDA code?
Absolutely! DeepGEMM's core kernel is only about 300 lines of code with a clean design, making it an excellent learning resource. It demonstrates advanced CUDA optimization techniques, efficient use of Hopper architecture features, and clever approaches to matrix multiplication that can be applied to other high-performance computing tasks.