Technology deep dive

FastFlowLM is built NPU-first

FastFlowLM retools the entire inference stack for AMD’s XDNA-based Ryzen™ AI silicon. Instead of mirroring GPU kernels, we partition prefill and decoding into tile-sized workloads, keep KV state resident on-chip, and stream attention math through the NPU’s 12 TFLOPS bf16 fabric.
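
A minimal sketch of that split, under stated assumptions: NpuKvPool, run_prefill_tile, and run_decode_step are hypothetical placeholders standing in for NPU kernels, not FastFlowLM's actual API.

```python
# Illustrative sketch of an NPU-first split: prefill runs as tile-sized
# workloads, decode streams tokens, and KV blocks stay in an on-chip pool.
# All names and numbers are placeholders, not FastFlowLM internals.
from dataclasses import dataclass, field

TILE_TOKENS = 128  # assumed prefill tile granularity


@dataclass
class NpuKvPool:
    """KV blocks kept resident in NPU SRAM instead of bouncing to LPDDR."""
    blocks: list = field(default_factory=list)


def run_prefill_tile(tokens, kv):
    # Stand-in for a fused attention kernel over one tile of prompt tokens.
    kv.blocks.append(("k_tile", "v_tile", len(tokens)))


def run_decode_step(token, kv):
    # Stand-in for a lightweight decode kernel reusing the on-chip KV blocks.
    return token + 1


def generate(prompt_tokens, max_new_tokens=8):
    kv = NpuKvPool()
    # Prefill: stage the prompt as contiguous tile-sized workloads.
    for start in range(0, len(prompt_tokens), TILE_TOKENS):
        run_prefill_tile(prompt_tokens[start:start + TILE_TOKENS], kv)
    # Decode: stream new tokens one at a time against the resident KV state.
    token = prompt_tokens[-1]
    for _ in range(max_new_tokens):
        token = run_decode_step(token, kv)
        yield token


print(list(generate(list(range(1000)))))  # 1000-token prompt, 8 streamed tokens
```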

The result: up to 5.2× faster prefill and 4.8× faster decoding than the iGPU, 33.5× and 2.2× gains over the CPU, while drawing 67× and 223× less power than the iGPU and CPU, respectively. We also lift the context ceiling from 2K to 256K tokens and cut Gemma 3 image TTFT from 8 s to 4 s on the same laptop-class part.

  • Prefill acceleration: 5.2× vs iGPU (Ryzen AI NPU prefill throughput).

  • Decoding acceleration: 4.8× vs iGPU (tile-aware token streaming).

  • Context window: 256K tokens, up from 2K in stock stacks.

  • Power draw: 67× / 223× less (NPU vs iGPU / CPU).

  • Image TTFT: 4 s for Gemma 3 vision, down from 8 s.

Parallel-by-design

Fine-grained orchestration on XDNA

The Ryzen AI chip blends an iGPU and an always-on NPU, both rated at 12 TFLOPS bf16 compute. While the GPU enjoys 125 GB/s of memory bandwidth, the NPU sits at 60 GB/s—so FastFlowLM had to attack the problem with software-led tiling, compression, and scheduling.
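
For a sense of why that bandwidth gap forces a software-led approach, a back-of-the-envelope bound for bandwidth-bound decoding follows; the 16 GB weight footprint is an illustrative assumption (roughly an 8B-parameter model in bf16), not a FastFlowLM configuration.

```python
# Rough roofline bound for bandwidth-bound decoding: each generated token
# must stream the weights it touches, so tokens/s <= bandwidth / bytes read.
npu_bw_gbs = 60     # NPU memory bandwidth cited above (GB/s)
igpu_bw_gbs = 125   # iGPU memory bandwidth cited above (GB/s)
model_gb = 16       # assumption: ~8B parameters in bf16 (2 bytes each)

print(f"NPU upper bound : {npu_bw_gbs / model_gb:.1f} tok/s")    # ~3.8 tok/s
print(f"iGPU upper bound: {igpu_bw_gbs / model_gb:.1f} tok/s")   # ~7.8 tok/s
# Closing (and beating) that ~2x raw-bandwidth deficit is why FastFlowLM leans
# on tiling, compression, and on-chip KV residency rather than raw bytes.
```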

  • AIE tile partitioning

    We map transformer blocks onto configurable compute tiles, fuse matmuls with their activations, and keep the compute fabric from waiting on host memory (see the sketch after this list).

  • Streaming KV residency

    Attention state stays inside NPU SRAM, enabling 256K-token prompts without bouncing to LPDDR.

  • Dynamic power envelopes

    Always-on inference taps the NPU’s low-leakage island, yielding the 67×/223× power savings cited above.
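
A toy sketch of the tile mapping and matmul + activation fusion described above; the array shape, tile sizes, round-robin assignment, and SiLU activation are assumptions chosen for illustration, not FastFlowLM's actual kernels or scheduler.

```python
# Illustrative tile partitioning: split a transformer block's weight matrix
# into tile-sized slabs, assign them round-robin across an AIE array, and
# apply the activation before anything returns to host memory. On hardware
# each assigned tile would run its slab in parallel; here jobs run serially.
import numpy as np

AIE_ROWS, AIE_COLS = 4, 8   # assumed AIE array shape
TILE_M, TILE_K = 64, 64     # assumed per-tile matmul footprint


def silu(x):
    return x / (1.0 + np.exp(-x))


def partition(weight):
    """Yield (aie_tile, row_slice, col_slice) assignments over the array."""
    aie_tiles = [(r, c) for r in range(AIE_ROWS) for c in range(AIE_COLS)]
    jobs, i = [], 0
    for m0 in range(0, weight.shape[0], TILE_M):
        for k0 in range(0, weight.shape[1], TILE_K):
            jobs.append((aie_tiles[i % len(aie_tiles)],
                         slice(m0, m0 + TILE_M), slice(k0, k0 + TILE_K)))
            i += 1
    return jobs


def fused_matmul_silu(x, weight):
    """Accumulate per-tile partial matmuls, then fuse the activation on-chip."""
    out = np.zeros((x.shape[0], weight.shape[0]), dtype=np.float32)
    for _aie_tile, rows, cols in partition(weight):
        out[:, rows] += x[:, cols] @ weight[rows, cols].T
    return silu(out)  # pre-activation values never round-trip to host memory


x = np.random.randn(1, 256).astype(np.float32)      # one token's hidden state
w = np.random.randn(1024, 256).astype(np.float32)   # up-projection weights
print(fused_matmul_silu(x, w).shape)                 # (1, 1024)
```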

Execution phases

FastFlowLM dissects inference into deterministic phases so the runtime can pipeline work across CPU, GPU, and NPU.

  • Prefill turbo

    Token embedding, rotary math, and large matmuls are staged across contiguous tiles for the 5.2× prefill gains.

  • Token streaming

    Lightweight decode kernels reuse on-chip KV blocks, hold steady at 4.8× faster than the iGPU, and avoid cache thrash.

  • Vision + multimodal

    Image TTFT drops from 8 s to 4 s by overlapping patch projection with text prefill on separate compute islands.
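
A toy illustration of the overlap in that last phase, using two worker threads to stand in for independent compute islands; the 0.4 s stage times are placeholders, not measured numbers.

```python
# Overlapping "patch projection" and "text prefill" instead of running them
# back-to-back: with two roughly equal stages, time-to-first-token halves.
import time
from concurrent.futures import ThreadPoolExecutor


def patch_projection():
    time.sleep(0.4)   # stand-in for projecting image patches on one island
    return "image_embeddings"


def text_prefill():
    time.sleep(0.4)   # stand-in for prefilling the text prompt on another island
    return "kv_blocks"


# Serial baseline: one stage waits for the other.
t0 = time.perf_counter()
patch_projection()
text_prefill()
serial = time.perf_counter() - t0

# Overlapped: each stage runs concurrently, as on separate compute islands.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(patch_projection), pool.submit(text_prefill)]
    _ = [f.result() for f in futures]
overlapped = time.perf_counter() - t0

print(f"serial: {serial:.2f}s  overlapped: {overlapped:.2f}s")
```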

Edge to rack

Scaling the FastFlowLM approach

Ryzen AI proves the concept locally, but the same architecture is already moving toward rack-scale NPU deployments.

Edge laptops

Always-on, efficient AI

NPUs stay cool handling assistants, copilots, and background perception without taxing the iGPU or CPU, so users avoid the thermal throttling that plagues CPU/iGPU-only stacks.

Rack roadmap

Qualcomm AI200/AI250 & AMD discrete NPU

Qualcomm is redefining rack inference with Hexagon NPUs, while AMD is building a discrete XDNA accelerator. FastFlowLM’s close-to-metal runtime is ready for both paths.

Projected scaling

10× GPU-class throughput

With equal compute and bandwidth, our rack-scale NPU model projects >10.4× faster prefill, >9.6× faster decoding, and >114× better TPS/W than GPU baselines.

Rack architecture

From chips to racks, the same tiled structure

Diagram: racks of NPU chips feeding an AIE + memory tile array

  • Racks of NPU Chips

    Photos on the left show the server trays and chassis that host NPUs in a rack.

  • AIE compute tiles

    In the schematic, the upper blocks labeled “AIE tile” represent an array of programmable compute tiles.

  • Memory tiles

    The darker blocks at the bottom labeled “Memory tile” show the on-chip memory region that sits alongside the AIE tiles.

  • FastFlowLM Technology

    Our FastFlowLM software scales up and out naturally to rack-level inference.