There is a version of this story that is genuinely worth telling without hedging, because for once the marketing line happens to match what the hardware actually does: a mini PC the size of a hardback book can now hold a 70-billion-parameter language model entirely in memory and answer you in real time. The machine in question is GMKtec’s EVO-X2, built around AMD’s Ryzen AI Max+ 395 — the chip the industry has been calling Strix Halo. It is not the fastest way to run a 70B model. It is, by a margin that matters, the cheapest way to run one in this form factor, and that is the part of the story that resets expectations for everyone else in the category.
What’s actually inside
The Ryzen AI Max+ 395 pairs a 16-core Zen 5 CPU with a Radeon 8060S integrated GPU built on 40 RDNA 3.5 compute units, plus a 50 TOPS XDNA 2 NPU — and crucially, up to 128 GB of LPDDR5X-8000 unified memory that all three engines can address as a single pool. NotebookCheck’s full review of the EVO-X2 called the result “one of the best mini PCs of 2025,” noting that the chassis stays comparatively cool and quiet under sustained load. Tom’s Hardware reached a similar verdict, framing the EVO-X2 as a “compact Strix Halo powerhouse” whose appeal is not so much raw peak speed as the size class it occupies.
The unified-memory point is the part that quietly redefines what’s possible. Until very recently, running a 70B-class model locally meant one of two things: an RTX 4090 desktop with 24 GB of VRAM, which forced you down to Q3 quantization with heavy CPU offload, or an Apple Mac Studio M2 Ultra with 192 GB of unified memory, starting at roughly $5,599. The Strix Halo platform inserts a third option below both.
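For scale, here is a back-of-the-envelope sketch of what a dense 70B model weighs in memory at common GGUF quantization levels. The bits-per-weight figures are rough averages rather than exact file sizes, and KV cache comes on top of the weights:

```python
# Approximate weight footprint of a dense 70B model at common GGUF
# quantization levels. Bits-per-weight values are rough averages,
# not exact llama.cpp file sizes.
PARAMS_B = 70  # billions of parameters

for name, bpw in [("Q3_K_M", 3.9), ("Q4_K_M", 4.8),
                  ("Q6_K", 6.6), ("Q8_0", 8.5)]:
    gb = PARAMS_B * bpw / 8
    print(f"{name}: ~{gb:.0f} GB")

# Q4_K_M lands around 42 GB: far past a 24 GB RTX 4090, but with
# roughly 80 GB to spare inside a 128 GB unified pool.
```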
The real numbers on 70B
Independent benchmark threads — Level1Techs’ running tally of Strix Halo LLM results, the Framework Desktop community’s GPU LLM tests, and a careful optimization writeup at Hardware Corner — converge on the same shape of result. On a dense Llama 3.3 70B at Q4–Q6 quantization, Strix Halo lands in roughly the 3–5 tokens-per-second range when the model is fully resident on the iGPU, with individual benchmark posts reporting around 3.7–3.8 t/s for Llama 3.3 70B Q6_K via GPU offload.
That is not fast. It is, however, comfortably inside the 3–5 t/s band that the local-LLM community generally treats as the floor for usable interactive chat. And the platform’s ceiling is not arbitrary: as Hardware Corner showed by walking through the math, a 42 GB dense model at the chip’s measured ~215 GB/s memory bandwidth caps out at a theoretical 5.1 t/s, and real runs hit 4.8 t/s — about 94% of the bandwidth ceiling. There is not a software fix coming that will double those numbers on a dense 70B. The bandwidth is the wall.
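The reasoning behind that ceiling is worth making explicit, because it is just a division: in a bandwidth-bound decode, every generated token has to stream the full weight set from memory once, so tokens per second cannot exceed bandwidth divided by model size. A minimal sketch, using the sizes and bandwidth cited above:

```python
# Bandwidth ceiling for dense-model decoding: every generated token
# must stream the full weight set from memory at least once, so
# tokens/s <= bandwidth / model size. Figures as cited above.
model_gb = 42.0   # Llama 3.3 70B at a Q4-class quantization
bw_real  = 215.0  # GB/s measured (256 GB/s on paper)

ceiling = bw_real / model_gb
print(f"theoretical ceiling: {ceiling:.1f} t/s")             # ~5.1
print(f"measured 4.8 t/s = {4.8 / ceiling:.0%} of ceiling")  # ~94%

# The same division on the ~58 GB Q6_K file gives ~3.7 t/s, which is
# exactly the 3.7-3.8 t/s band the benchmark posts report.
print(f"Q6_K (~58 GB) ceiling: {bw_real / 58:.1f} t/s")
```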
Where Strix Halo opens up is on the newer Mixture-of-Experts designs, where only a fraction of the weights activate per token. The Hardware Corner runs and the Level1Techs thread both report roughly 52 t/s on Qwen3-30B-A3B, and the same hardware comfortably hosts 100 GB+ MoE checkpoints that simply do not fit on consumer dGPUs. That, more than the dense-70B figure, is the workload Strix Halo was actually built for.
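The same bandwidth-bound sketch explains the gap: on an MoE model, only the active experts’ weights stream per token, so the ceiling scales with active parameters rather than total ones. In the sketch below, the ~3B active figure is read off the A3B name, and the bits-per-weight average is the same assumption as before:

```python
# On MoE models the bandwidth ceiling scales with *active* parameters.
# Sketch only: ~4.8 bits/weight is an approximate Q4-class average,
# and this ignores attention, routing, and KV-cache traffic.
BW = 215.0  # GB/s, measured Strix Halo bandwidth

def ceiling_tps(active_params_billions, bits_per_weight=4.8):
    gb_streamed_per_token = active_params_billions * bits_per_weight / 8
    return BW / gb_streamed_per_token

print(f"dense 70B:            ~{ceiling_tps(70):.1f} t/s")  # ~5.1
print(f"MoE, ~3B active:      ~{ceiling_tps(3):.0f} t/s")   # ~119
```

The measured ~52 t/s sits well under that ceiling, because routing, attention, and cache traffic all cost real bandwidth, but the order-of-magnitude jump over the dense 70B is exactly what the benchmarks show.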
How it stacks up
For an honest comparison: the Mac Studio M2 Ultra 192 GB runs Llama 2 70B Q4_K_M at roughly 12 t/s in published GPU-Benchmarks-on-LLM-Inference results, driven by 800 GB/s of memory bandwidth, about 3.7× Strix Halo’s measured 215 GB/s. An RTX 4090 desktop pushes north of 1 TB/s on its 24 GB of VRAM, but cannot hold a full 70B at Q4 without offloading, and offloading collapses the bandwidth advantage. Each tier represents a real engineering trade.
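Applying the same ceiling arithmetic across the tiers makes the trade visible; the ~41 GB weight size for Llama 2 70B Q4_K_M is an approximation, and all other figures are as cited above:

```python
# Bandwidth-ceiling math across tiers, using the figures cited above.
# Assumes ~41 GB of weights for a 70B Q4_K_M model (an approximation).
MODEL_GB = 41.0
tiers = {
    # name:                (bandwidth GB/s, measured t/s)
    "Strix Halo":          (215.0,  4.8),
    "Mac Studio M2 Ultra": (800.0, 12.0),
}
for name, (bw, measured) in tiers.items():
    ceiling = bw / MODEL_GB
    print(f"{name}: ceiling {ceiling:.1f} t/s, "
          f"measured {measured} t/s ({measured / ceiling:.0%})")
```

By these cited figures, Strix Halo converts about 92% of its bandwidth into tokens while the M2 Ultra converts about 62%, which is why the real-world speed gap (roughly 2.5×) is narrower than the raw bandwidth gap (roughly 3.7×).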
Where the EVO-X2 shifts the conversation is price. NotebookCheck and GMKtec’s own product listing place the 64 GB base model at $1,499, the 96 GB configuration at $1,799, and the 128 GB version — the only one that matters for 70B work — at $1,999. The Mac Studio M2 Ultra 192 GB starts at $5,599. An RTX 4090 desktop with comparable system RAM lands in similar territory once you build out the rest of the box, and still cannot fit the model on the GPU. For a buyer whose actual goal is “run a 70B model on my desk without renting cloud GPUs,” the EVO-X2’s price-to-capability ratio is, in this form factor, simply unprecedented.
Where the asterisks live
Three caveats deserve to ride along with the headline.
First, memory bandwidth is the ceiling, and it is fixed. 256 GB/s on paper, ~215 GB/s in practice. Dense 70B models will not get faster than the mid-single-digit t/s range on this platform, no matter how the software stack matures. Buyers expecting Mac-Studio-class throughput will be disappointed; buyers comparing to “no local 70B at all” will not.
Second, the NPU is mostly along for the ride on today’s LLM software. The 50 TOPS XDNA 2 unit is a real piece of silicon, but llama.cpp, Ollama, and LM Studio currently route inference through the iGPU, not the NPU. Whether that changes is a software story, not a hardware one. Phoronix has documented that the Linux stack is still actively maturing for this platform, with measurable gains landing as recently as Ubuntu 26.04.
Third, the 128 GB SKU is the one to buy if local LLMs are the reason you’re here. The 64 GB base model is a great mini PC; it is not a 70B mini PC. Make sure the listing matches the workload before clicking buy.
The takeaway
The EVO-X2 is not the fastest machine you can put on a desk for 70B inference, and nobody honest is claiming otherwise. What it is, for the first time in the mini PC category, is a serious one: a 1-liter chassis at $1,999 that holds a real 70B-class model in unified memory and answers in usable time. That has not been true at this price or this size before. The local-AI conversation has been waiting on hardware that makes it cheap enough to stop being a flex and start being a tool, and Strix Halo, in this enclosure, is the closest thing the market has yet shipped.