China has seemingly countered NVIDIA’s pared-down AI accelerators with an impressive breakthrough from a company named DeepSeek. The latest buzz surrounds a project that squeezes roughly eight times the typically cited TFLOPS throughput out of the Hopper H800 AI accelerator.
### Unleashing Power: DeepSeek’s FlashMLA
China is proving its autonomy in tech advancement, reducing its dependence on external giants like NVIDIA for faster hardware. The homegrown firm DeepSeek exemplifies this trend, using software ingenuity to wring more performance out of existing resources. Its recent innovation has created quite a stir: the team has extracted remarkable efficiency from NVIDIA’s “cut-down” Hopper H800 GPUs. The trick lies in optimizing memory usage and how resources are scheduled across inference tasks.
On Twitter, DeepSeek excitedly announced: “🚀 Day 1 of #OpenSourceWeek: FlashMLA – our efficient MLA decoding kernel for Hopper GPUs. Optimized for variable-length sequences and now in production.”
Here’s a bit of context: during its #OpenSourceWeek event, DeepSeek is releasing internal tech advancements to the public via GitHub. On day one, it launched FlashMLA, a specialized decoding kernel tailored for NVIDIA’s Hopper GPUs. Before diving into the technicalities, it’s worth spelling out what the release actually delivers.
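For the technically inclined, the repository’s README includes a short usage example showing how the kernel slots into a decoding loop. The sketch below reproduces that invocation pattern with light annotation; the head counts and dimensions are illustrative assumptions rather than prescribed values, and tensors such as `q_i`, `kvcache_i`, and `block_table` are prepared by the caller (the upstream example elides them as well).

```python
# Invocation pattern based on the FlashMLA GitHub README, lightly annotated.
# Assumes the flash_mla extension has been built from
# https://github.com/deepseek-ai/FlashMLA on a Hopper-class GPU.
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative sizes (assumptions, not fixed by the API):
s_q, h_q, h_kv, dv = 1, 128, 1, 512   # query tokens per step, query heads,
                                      # KV heads (1 for MLA), value head dim

# Plan the work split across SMs once per batch, given each sequence's
# current cache length (cache_seqlens: int32 tensor supplied by the caller).
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

for i in range(num_layers):
    # ... per-layer q_i / kvcache_i preparation elided, as in the upstream example
    o_i, lse_i = flash_mla_with_kvcache(
        q_i,                      # queries for this decoding step
        kvcache_i,                # paged latent KV cache for this layer
        block_table,              # maps each sequence to its cache blocks
        cache_seqlens, dv,
        tile_scheduler_metadata, num_splits,
        causal=True,
    )
```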
According to DeepSeek, FlashMLA reaches a staggering 580 TFLOPS for BF16 matrix operations on the Hopper H800 in compute-bound benchmarks, roughly eight times the 73.5 TFLOPS commonly cited as an industry average for such workloads (580 / 73.5 ≈ 7.9). Its memory efficiency is just as striking: the kernel reportedly sustains 3000 GB/s in memory-bound configurations, well above the 1681 GB/s figure being cited as the H800’s peak. What’s fascinating is that these gains are purely the result of smart software, not hardware tweaks.
Visionary x AI also remarked on Twitter: “This is crazy. Blazing fast: 580 TFLOPS on H800, ~8x industry avg (73.5 TFLOPS). Memory wizardry: Hits 3000 GB/s, surpassing H800’s 1681 GB/s peak.”
FlashMLA leans on a technique known as “low-rank key-value compression”: rather than caching full-size key and value tensors for every token, it stores a compact latent representation and reconstructs the keys and values on demand, which speeds up processing and reportedly cuts memory use by 40%-60%. It pairs this with a block-based paging scheme that hands out cache memory in fixed-size blocks as each sequence grows, so batches of variable-length sequences waste little space and overall throughput improves; a minimal sketch of both ideas follows below.
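To make those two ideas concrete, here is a minimal, self-contained PyTorch sketch. Everything in it (names, dimensions, the `PagedCache` class) is illustrative only; DeepSeek’s actual kernel fuses this logic into hand-tuned CUDA rather than Python.

```python
import torch

# Illustrative dimensions; 64 matches the paged-cache block size the
# FlashMLA README mentions, the rest are arbitrary choices for this sketch.
d_model, d_latent, block_size = 1024, 128, 64

# --- Low-rank key-value compression ---
# Cache one narrow latent vector per token instead of full-width K and V,
# and expand it back only when attention actually runs.
down_proj = torch.nn.Linear(d_model, d_latent, bias=False)   # compress
up_proj_k = torch.nn.Linear(d_latent, d_model, bias=False)   # rebuild K
up_proj_v = torch.nn.Linear(d_latent, d_model, bias=False)   # rebuild V

hidden = torch.randn(1, 10, d_model)            # 10 tokens of hidden states
latent_cache = down_proj(hidden)                # cached: d_latent per token
k, v = up_proj_k(latent_cache), up_proj_v(latent_cache)  # rebuilt on demand
print(f"cached floats per token: {d_latent} vs {2 * d_model} uncompressed")

# --- Block-based paged cache ---
# Memory is granted in fixed-size blocks as a sequence grows, so short and
# long sequences sharing a batch do not over-reserve space.
class PagedCache:
    def __init__(self, num_blocks: int, dim: int):
        self.pool = torch.zeros(num_blocks, block_size, dim)  # physical blocks
        self.free = list(range(num_blocks))                   # free-block list
        self.block_table = {}                    # seq_id -> list of block ids

    def append(self, seq_id: int, token_latent: torch.Tensor, pos: int):
        blocks = self.block_table.setdefault(seq_id, [])
        if pos % block_size == 0:                # current block is full,
            blocks.append(self.free.pop())       # so claim a fresh one
        self.pool[blocks[-1], pos % block_size] = token_latent

cache = PagedCache(num_blocks=16, dim=d_latent)
for pos in range(130):                           # 130 tokens span 3 blocks
    cache.append(seq_id=0, token_latent=torch.randn(d_latent), pos=pos)
print(len(cache.block_table[0]), "blocks allocated for 130 tokens")  # -> 3
```

The payoff in this toy version mirrors the real one: the latent cache is a fraction of the full K/V width, and blocks are only claimed once a sequence actually grows into them.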
DeepSeek’s innovation highlights how much AI computing performance lives in software rather than hardware alone. FlashMLA is currently tailored to Hopper GPUs, which naturally raises the question of what similar optimizations could unlock on the full-specification H100, or on other architectures if support is broadened.