NPU-first runtime

The fastest, most efficient LLM inference on NPUs

FastFlowLM delivers an Ollama-style developer experience optimized for tile-structured NPU accelerators. Install in seconds, stream tokens instantly, and run context windows up to 256k — all with dramatically better efficiency than GPU-first stacks. Our GA release for AMD Ryzen™ AI NPUs is available today, with betas for Qualcomm Snapdragon and Intel Core Ultra coming soon.

  • Runtime size

    ~16 MB

  • Context

    Up to 256k tokens

  • Supported chips

    Ryzen™ AI (Strix, Halo, Kraken)

GPT-OSS on NPU

GPT-OSS-20B streaming fully on the Ryzen™ AI NPU

Runs GPT-OSS-20B at 19 TPS (tokens per second) with 10× the efficiency of a GPU baseline, the fastest MoE on any NPU.

Whisper on-device

Transcribe and summarize long-form audio locally

Transcribe hours of audio locally: FLM runs OpenAI Whisper fully on the NPU, keeping transcription fast, private, and efficient.

Llama 3.2 on WebUI

Interact with Llama 3.2-3B through the FastFlowLM WebUI

Runs Meta Llama 3.2-3B at 28 TPS (tokens per second) with over 10× the efficiency of a GPU baseline, the fastest on any NPU.

Install

From download to first token in under a minute

FastFlowLM ships as a 16 MB runtime with an Ollama-compatible CLI. No CUDA, no drivers, no guesswork—just run the installer, pull a model, and start chatting.

  • Zero-conf installer

    Signed FastFlowLM installers cover every Ryzen™ AI laptop—just download and run.

  • Drop-in APIs

    Compatible with Ollama, OpenAI, and Open WebUI endpoints for existing tooling.

  • Secure by default

    Local auth tokens, TLS, and offline mode keep your data on-device.

Quickstart

CLI
# Download and run the signed installer (PowerShell)
Invoke-WebRequest https://github.com/FastFlowLM/FastFlowLM/releases/latest/download/flm-setup.exe `
  -OutFile flm-setup.exe
Start-Process .\flm-setup.exe -Wait
# Pull a model, then start chatting with a 256k-token context window
flm pull llama3.2:3b
flm run llama3.2:3b --ctx 256k
APIs
# OpenAI-compatible endpoint
POST /v1/chat/completions
Authorization: Bearer $FLM_TOKEN

# Ollama-compatible endpoint
curl -s localhost:11434/api/generate \
  -d '{"model":"gemma3:4b","prompt":"hello"}'

Models

One CLI, every Ryzen-ready model

Pull curated FastFlowLM recipes. The runtime streams tokens via HTTP, WebSocket, or the Ollama-compatible API, so existing apps work without rewrites.
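
For streaming, the sketch below reads tokens from the Ollama-compatible generate route. It assumes the runtime listens on the default port and emits newline-delimited JSON chunks the way the Ollama API does; treat the field names as assumptions rather than a confirmed contract.

# Sketch: stream tokens over the Ollama-compatible HTTP API.
# Assumes the default port (11434) and newline-delimited JSON chunks,
# matching the Ollama API; field names are assumptions for FastFlowLM.
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "hello", "stream": True},
    stream=True,
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break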

Flagship reasoning

Llama 3.2 · DeepSeek · Qwen 3

Optimized kernels for models from 70B down to 1B parameters, with automatic quantization and smart context reuse.

Vision & speech

Gemma 3 VLM · Whisper · Gemma Audio

VLM and audio pipelines run on the NPU, enabling private multimodal assistants.

Edge fine-tuning

FLM MoE + Embedding suites

Use built-in adapters, LoRA checkpoints, and embedding endpoints for retrieval workflows.
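
As a sketch of the retrieval side, the snippet below embeds a few documents and a query through an OpenAI-compatible embeddings route and ranks them by cosine similarity. The route, port, and embedding model name are assumptions chosen for illustration; substitute whatever embedding model you have pulled.

# Sketch: rank documents for retrieval via an assumed OpenAI-compatible
# embeddings endpoint; the model name and port are placeholders.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="FLM_TOKEN")

def embed(text: str) -> list[float]:
    # "nomic-embed-text" is a hypothetical choice, not a documented default.
    return client.embeddings.create(model="nomic-embed-text", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = [
    "FastFlowLM keeps inference on the NPU, so prompts never leave the device.",
    "Whisper transcribes long-form audio locally.",
]
doc_vecs = [embed(d) for d in docs]
query_vec = embed("How do I transcribe audio without the cloud?")
best = max(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]))
print(docs[best])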

Benchmarks

Proof on silicon, not slides

FastFlowLM is tuned on real Ryzen™ AI hardware with synthetic and application-level workloads. Expect steady 40–80 tok/s on 7B models at under 10 W, plus deterministic latency for agentic chains; a quick local throughput check is sketched after the list below.

  • Full-stack telemetry

    Counters for NPU, CPU, and memory let you see exactly where cycles go.

  • Scenario-driven suites

    Instruction tuning, RAG, chat, and multimodal tests mirror real workloads.
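
To sanity-check throughput figures like these on your own machine, you can time a non-streamed generate call and read the token counters from the response. The endpoint and the eval_count / eval_duration fields below follow the Ollama API; whether FastFlowLM mirrors them exactly is an assumption.

# Rough sketch: measure decode throughput via the Ollama-compatible API.
# eval_count / eval_duration are fields the Ollama API reports; treating
# them as available here is an assumption.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Write a haiku about silicon.", "stream": False},
).json()

tokens = resp.get("eval_count", 0)
seconds = resp.get("eval_duration", 0) / 1e9  # duration is reported in nanoseconds
if seconds > 0:
    print(f"{tokens / seconds:.1f} tok/s over {tokens} generated tokens")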

Llama 3.2 3B @ 4-bit

72 tok/s

Ryzen™ AI 9 HX 370 · 8 ms median latency

Gemma 3 4B Vision

18 fps

Vision + text pipeline on XDNA2 with shared memory

Power draw

9.6 W

Full assistant stack vs ~45 W GPU baseline

Remote test drive

No Ryzen™ AI hardware yet? Launch the hosted FastFlowLM + Open WebUI sandbox and stream from a live AMD Ryzen™ AI box with 96 GB RAM.

  • Live hardware

    Same builds we use internally, refreshed with every release.

  • Guest access

    Instant login with rotating demo credentials.

  • Bring your apps

    Point your HTTP client at the public endpoint to try agent flows.