There is a quiet truth in the local-AI conversation that gets drowned out by every benchmark headline involving a 70-billion-parameter model on a $4,000 workstation: most people do not need that model. They need a 7-billion one that runs on the desktop they already own, answers in a couple of seconds, and does not require a discrete GPU, a server-room power budget, or a second mortgage. As of April 2026, the mini PC class that delivers exactly this — AMD Ryzen 7 8845HS and Ryzen 9 8945HS systems with 32 to 64 GB of DDR5 — is the most under-discussed piece of hardware in the local-LLM ecosystem.

The machines are familiar names: the Beelink SER8 at roughly $649, the Geekom A8 Max at about $721 in its review configuration (Liliputing), the Minisforum UM890 Pro, and the GMKtec K8 Plus, which has been spotted as low as $399 on sale. They share a class profile: a chassis of around 0.5 to 0.7 litres, an 8-core Zen 4 CPU, a Radeon 780M integrated GPU, dual SO-DIMM slots supporting up to DDR5-5600, and an XDNA 1 neural engine rated at 16 TOPS. Per AMD’s own specifications, the 8845HS and 8945HS differ mainly in CPU clock; the iGPU and NPU are the same silicon.

Memory bandwidth is the ceiling, not core count

Local LLM inference is, almost without exception, memory-bandwidth-bound at the batch sizes a single human uses. A dual-channel DDR5-5600 configuration tops out at roughly 89.6 GB/s of theoretical bandwidth, with real-world numbers in the 80–86 GB/s range under STREAM-class benchmarks. That is a fraction of an Apple M-series unified memory pool or an AMD Strix Halo platform, and it is the single number that explains every other number in this article. It is also, crucially, enough.
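
The headline figure is simple arithmetic. A minimal sketch of where 89.6 GB/s comes from, using the plain dual-channel reading of the bus (DDR5 technically splits each SO-DIMM into two 32-bit subchannels, but the total width per module is still 64 bits, so the sum is the same):

```python
# Theoretical peak bandwidth for dual-channel DDR5-5600.
# Each SO-DIMM presents 64 bits (8 bytes) per transfer; two modules run in parallel.
transfers_per_second = 5_600_000_000   # DDR5-5600 -> 5600 MT/s
bytes_per_transfer = 8                 # 64-bit module width
channels = 2                           # two SO-DIMMs, dual channel

peak_gb_s = transfers_per_second * bytes_per_transfer * channels / 1e9
print(f"theoretical peak: {peak_gb_s:.1f} GB/s")   # -> 89.6 GB/s
```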

A 7-billion-parameter model in a 5-bit quantisation (Q5_K_M, around 5.5 GB on disk) needs to stream those 5.5 GB through memory once per generated token. At 80 GB/s of usable bandwidth, the theoretical ceiling is roughly 14 tokens per second, and the practical ceiling sits a little below that because of overhead. Field reports on r/LocalLLaMA and llama.cpp’s benchmark discussions consistently land Llama 3.1 8B Q5_K_M at 8 to 14 tokens per second on 8845HS systems running the llama.cpp Vulkan backend against the Radeon 780M, with Mistral 7B Q5_K_M tracking the same envelope. That is faster than most people read, and it is happening on a fanless-adjacent box that idles at around 8 watts.
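
The same envelope arithmetic extends to any model and quantisation. A sketch, assuming the weights are the only per-token memory traffic (KV cache, activations, and scheduling overhead are why measured numbers land below these ceilings); the model sizes are approximate GGUF file sizes:

```python
# Rough decode-speed ceiling: every generated token streams the full set of
# quantised weights through memory once, so
#   tokens/s  <=  usable bandwidth / model size.
usable_bandwidth_gb_s = 80.0   # measured STREAM-class figure, not the 89.6 theoretical

models = {
    "7B  Q5_K_M (~5.5 GB)": 5.5,
    "8B  Q5_K_M (~5.7 GB)": 5.7,
    "14B Q4_K_M (~8.0 GB)": 8.0,
}

for name, size_gb in models.items():
    ceiling = usable_bandwidth_gb_s / size_gb
    print(f"{name}: <= {ceiling:.1f} tok/s")
# 7B Q5 ~14.5, 8B Q5 ~14, 14B Q4 ~10 -- field reports land a notch below each.
```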

What runs comfortably, what strains, what stalls

Stepping up to the 13- to 14-billion class is where the budget mini PC stops being a casual answer and becomes a deliberate one. Qwen 2.5 14B and the older Llama 2 13B in Q4_K_M quantisations weigh in at roughly 7 to 8 GB and stream at 4 to 8 tokens per second on the same hardware in community testing. Phi-3 medium 14B Q4 falls into the same band. Usable, in the sense that a one-paragraph answer arrives in under a minute; not snappy, in the sense that nobody will mistake it for a hosted API.

Outside text generation, the picture is friendlier. Whisper transcription with the medium model runs in real time or better. Sentence-transformer embedding models — the workhorse of any local RAG pipeline — finish a paragraph in milliseconds. What does not run comfortably is the 30-billion-parameter class, anything that overflows 32 GB of system RAM once context and KV cache are accounted for, and image generation: Stable Diffusion XL on the Radeon 780M remains slow, and Phoronix’s Vulkan testing makes clear that the 780M’s strength is streaming small quantised models, not the sustained parallel compute that diffusion workloads demand.
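
That 32 GB caveat is easy to put numbers on. Below is a sketch of the standard KV-cache arithmetic, assuming Llama 3.1 8B’s published architecture (32 transformer layers, 8 KV heads of dimension 128 under grouped-query attention) and an fp16 cache, which is llama.cpp’s default; other models substitute their own layer and head counts:

```python
# KV cache size: 2 tensors (K and V) per layer, each n_kv_heads * head_dim wide,
# stored for every token in the context window.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx:>7}: {gib:5.1f} GiB of KV cache")
# 8K ~1 GiB, 32K ~4 GiB, 128K ~16 GiB -- which is how a 5.5 GB model plus a long
# context can crowd a 32 GB machine once the OS takes its share.
```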

The software stack, honestly told

The CPU-only path is the boring one and also the most reliable: llama.cpp with no acceleration flags will run on every mini PC in this class out of the box, landing in the 5 to 10 tok/s range on a 7B Q5 model. The interesting story of 2025 and early 2026 is the Vulkan backend, which has matured to the point where the Radeon 780M is a credible accelerator for quantised inference — the llama.cpp Vulkan discussion thread shows community members reporting 12 to 18 tok/s on 7B Q4–Q5 models, a meaningful step up from CPU.

ROCm on the 780M remains the messier path. The chip reports as gfx1103, which is not on ROCm’s officially supported GPU list, so getting it running usually involves the HSA_OVERRIDE_GFX_VERSION workaround. It works, it is faster than CPU when it works, and it is the option most likely to break across a kernel or driver update.

The XDNA NPU — the 16 TOPS neural engine that AMD has marketed as the centrepiece of “Ryzen AI” — is the part of this story that requires the most candour. As ServeTheHome and others have documented, the general-purpose LLM software stack for XDNA in April 2026 is still immature. Most local-LLM users are not using the NPU at all. The community has settled, pragmatically, on Vulkan-on-iGPU as the path of least resistance.
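
As a concrete illustration of that path: the llama-cpp-python bindings expose the same layer-offload knob as the llama.cpp CLI. A minimal sketch, assuming the package was built against llama.cpp’s Vulkan backend (on a plain CPU build the offload request is simply ignored); the model path and prompt are placeholders:

```python
# Minimal sketch, assuming a llama-cpp-python build compiled with llama.cpp's
# Vulkan backend so the Radeon 780M iGPU handles the offloaded layers.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer; runs on CPU if no GPU backend is present
    n_ctx=8192,        # context window; see the KV-cache arithmetic above
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this meeting note in three bullets: ..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

Nothing about the call changes between the CPU and Vulkan builds; the difference shows up only in tokens per second, which is part of why the community treats the iGPU path as the default rather than a tuning exercise.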

The takeaway

For most people, a $700 Ryzen 8845HS mini PC with 32 GB of DDR5-5600 is more than enough local AI. It will run a Llama 3.1 8B-class assistant at human-readable speed, transcribe meetings in real time, embed a personal document corpus in seconds, and do all of it on a desk corner with a power budget closer to a lightbulb than a workstation. The honest caveat is that the iGPU is slower than a discrete GPU at the same price tier, and the NPU is not yet doing the work its specifications suggest it could. The honest reframing is that for the model class most users actually need, none of that matters. The question worth asking before buying a $2,000 GPU for local AI is not “how big a model can I run,” but “how big a model do I actually need.” For a great many readers, the answer fits in seven billion parameters, and it fits in a mini PC that already costs less than the GPU alone.