
Following the debut of the NVIDIA RTX Spark "Superchip," we conducted a comparative analysis to evaluate it against our existing AMD Ryzen AI Max+ workflows. This research examines what NVIDIA’s upcoming unified architecture is expected to deliver for local inference.
Large language models (LLMs) are no longer tested only in the cloud. For development teams working on AI systems, the ability to shift workloads directly to edge devices is becoming increasingly important to address corporate demands for data sovereignty, latency reduction, and the elimination of recurring cloud API fees.
However, executing modern frontier LLMs requires hardware capable of bypassing the traditional memory bottlenecks found in standard consumer systems. Below is a practical overview comparing two unified memory architecture (UMA) platforms: the x86-based AMD Ryzen AI Max+ 395 "Strix Halo" APU and the ARM-based NVIDIA RTX Spark "Superchip".
The hardware setups: AMD vs. NVIDIA
The AMD Ryzen AI Max+ 395 and the NVIDIA RTX Spark represent two distinct design philosophies for unified computing.
AMD Ryzen AI Max+ 395
AMD’s setup focuses on a multi-chiplet design. In practice, this platform:
- Is built on TSMC's 4 nm FinFET process.
- Features two CPU Core Complex Dies housing 16 high-performance "Zen 5" cores with 32 threads.
- Integrates a Radeon 8060S graphics engine, memory controllers, and an AMD XDNA 2 NPU (capable of 50 TOPS) on a central I/O die.
- Relies on high-speed internal buses on its monolithic I/O die to manage data.
- Operates on a configurable TDP of 45 W to 120 W.
NVIDIA RTX Spark
Co-developed with MediaTek, the RTX Spark takes a dual-chiplet "superchip" approach. This platform:
- Is built on TSMC's 3 nm EUV process.
- Couples a custom 20-core ARM "Grace" CPU with a client "Blackwell" RTX graphics processor.
- Features 48 Streaming Multiprocessors with 6,144 CUDA cores.
- Utilizes an on-package NVLink-C2C interconnect, providing up to 600 GB/s of coherent, low-latency bandwidth.
- Uses custom power management to dynamically scale up to 80 W, offering high power efficiency on battery.
Memory management: VRAM and bandwidth
Local LLM execution is primarily constrained by memory bandwidth rather than raw compute operations. Both platforms offer wider memory buses than standard x86 laptop platforms, which are typically restricted to a 128-bit interface.
Memory behavior on the AMD Ryzen AI Max+ 395:
- Features Quad-Channel LPDDR5X-8000 memory.
- Offers a unified memory bandwidth of 256 GB/s.
- Ships with BIOS settings that cap the memory selectable as dedicated VRAM at 96 GB.
- Allows developers using Linux to adjust Translation Table Manager (TTM) kernel parameters to bypass BIOS constraints, allocating up to 120 GB of a 128 GB system pool as usable VRAM.
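The TTM adjustment mentioned above comes down to expressing the desired VRAM ceiling as a page count. The following is a minimal sketch of that arithmetic, not a tested recipe: it assumes 4 KiB kernel pages, treats the 120 GB figure as GiB, and assumes `ttm.pages_limit` and `ttm.page_pool_size` are the relevant kernel parameters on your distribution. Check the kernel documentation for your version before editing the boot command line.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages, the usual x86-64 Linux default


def ttm_pages_for_gib(gib: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Return the number of kernel pages needed to cover `gib` GiB."""
    return gib * 1024**3 // page_size


def kernel_cmdline(vram_gib: int) -> str:
    """Build a kernel command-line fragment for the given VRAM ceiling.

    Assumption: the parameter names below are the ones exposed by the
    TTM module on your kernel; they are not taken from the article.
    """
    pages = ttm_pages_for_gib(vram_gib)
    return f"ttm.pages_limit={pages} ttm.page_pool_size={pages}"


# 120 GiB out of a 128 GiB pool, as in the list above.
print(kernel_cmdline(120))
```

Leaving a few gigabytes outside the limit keeps room for the operating system, which is why the example stops at 120 rather than 128.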
Memory behavior on the NVIDIA RTX Spark:
- Features unified LPDDR5X memory with a capacity of up to 128 GB.
- Delivers up to 300 GB/s of unified memory bandwidth.
- Employs a fully dynamic allocation system without manual partition adjustments.
- Allows the Blackwell GPU and Grace CPU to share the entire memory pool seamlessly, eliminating data copying overhead.
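To see why bandwidth dominates, it helps to work the numbers. The sketch below derives theoretical peak bandwidth from bus width and transfer rate, then estimates a rough ceiling on decode speed for a dense model by assuming every weight is read once per generated token. The 256-bit bus width is an assumption (it is the value that reproduces the quoted 256 GB/s at 8000 MT/s), and the 70B model at 4 bits per weight is a hypothetical workload. Real throughput will be lower because of KV-cache reads and imperfect bus utilization.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts / 1000


def decode_ceiling_tps(bandwidth_gbs: float,
                       params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s if all weights are streamed once per token."""
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gbs / weight_gb


# Assumed 256-bit LPDDR5X-8000 configuration for the AMD platform.
amd_bw = peak_bandwidth_gbs(256, 8000)
print(f"AMD peak bandwidth: {amd_bw:.0f} GB/s")

# Hypothetical dense 70B model quantized to 4 bits per weight.
for name, bw in [("AMD, 256 GB/s", amd_bw), ("NVIDIA, 300 GB/s", 300.0)]:
    print(f"{name}: at most {decode_ceiling_tps(bw, 70, 4):.1f} t/s")
```

Under these assumptions both platforms land in the single digits for a dense 70B model, so the gap between them is smaller than the gap between either one and a discrete GPU with much faster memory.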
Scaling up: Multi-node vs. hardware link
For workloads requiring memory pools larger than 128 GB, both architectures support scaling beyond a single unit, each with clear trade-offs.
- Multi-Node Scaling on AMD: Developers can connect multiple physical nodes via high-speed Ethernet and use llama.cpp's RPC protocol to split model layers. This allows a four-APU cluster to run a 1-trillion-parameter model locally.
- Hardware Link Scaling on NVIDIA: Two physical RTX Spark desktop units can be connected directly via an external interconnect. This creates a cohesive 256 GB shared-memory system without network overhead, enabling the local execution of models up to 200 billion parameters.
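Whether a model fits a given pool depends mostly on parameter count and bits per weight. The sketch below estimates weight footprint, the largest quantization width a pool can hold, and a proportional split of layers across nodes. It is an illustration only: `split_layers` is a hypothetical helper and not how llama.cpp's RPC backend actually assigns layers, and the estimates ignore KV cache and activation memory.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def max_bits_per_weight(pool_gb: float, params_billion: float,
                        reserve_fraction: float = 0.1) -> float:
    """Widest quantization that fits, keeping a fraction of the pool free."""
    usable_gb = pool_gb * (1 - reserve_fraction)
    return usable_gb * 8 / params_billion


def split_layers(n_layers: int, node_mem_gb: list) -> list:
    """Assign contiguous (start, end) layer ranges proportional to node memory."""
    total = sum(node_mem_gb)
    bounds = [0]
    running = 0
    for mem in node_mem_gb[:-1]:
        running += mem
        bounds.append(int(n_layers * running / total))
    bounds.append(n_layers)  # the last node takes whatever remains
    return list(zip(bounds[:-1], bounds[1:]))


# Four AMD nodes with 120 GB usable each, 1-trillion-parameter model.
print(max_bits_per_weight(4 * 120, 1000, reserve_fraction=0.0))

# Two linked NVIDIA units (256 GB), 200B model at 8 bits per weight.
print(weight_footprint_gb(200, 8))

# Hypothetical 64-layer model spread over four equal nodes.
print(split_layers(64, [120, 120, 120, 120]))
```

The first figure shows that a 1-trillion-parameter model on four 120 GB nodes needs to average below 4 bits per weight, which is worth keeping in mind when choosing a quantization format for a cluster of this size.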
Software frameworks: Vulkan, ROCm, and CUDA
The software environments of these two platforms dictate their ease of deployment, library support, and long-term stability.
The AMD stack: Vulkan vs. ROCm
- Vulkan: This API is highly stable on client APUs and can deliver good generation speeds.
- ROCm: Because it was originally designed for enterprise server GPUs, it can cause driver and library conflicts on client-facing APUs. Recent releases have greatly reduced these problems. For example, a Qwen coder 60B model generates at 36 t/s with Ollama running on ROCm.
- Hybrid execution: Using ONNX Runtime GenAI, the 50 TOPS NPU can handle the initial prefill phase silently on less than 15 W of power, leaving the GPU to process the high-bandwidth decode phase.
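The split between prefill and decode makes sense because prefill is compute-bound while decode is bandwidth-bound. A back-of-the-envelope sketch of the prefill side follows. It assumes roughly two operations per parameter per prompt token, treats the 50 TOPS rating as usable throughput scaled by a utilization factor, and uses a hypothetical 8B model with a 1,024-token prompt. None of these workload figures come from the measurements above, and real NPU utilization is typically well below peak.

```python
def prefill_seconds(params_billion: float, prompt_tokens: int,
                    tops: float, utilization: float = 1.0) -> float:
    """Estimated prefill time, assuming about 2 ops per parameter per token."""
    total_ops = 2 * params_billion * 1e9 * prompt_tokens
    return total_ops / (tops * 1e12 * utilization)


def energy_joules(seconds: float, watts: float) -> float:
    """Energy used at a constant power draw."""
    return seconds * watts


# Hypothetical workload: 8B parameters, 1,024-token prompt, 50 TOPS NPU.
ideal = prefill_seconds(8, 1024, 50)
realistic = prefill_seconds(8, 1024, 50, utilization=0.3)  # assumed utilization

print(f"ideal prefill: {ideal:.2f} s, at 30% utilization: {realistic:.2f} s")
print(f"energy at 15 W (ideal case): {energy_joules(ideal, 15):.1f} J")
```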
The NVIDIA stack: CUDA and OpenShell
- CUDA and TensorRT: Standard packages like PyTorch and vLLM run natively on the platform without requiring custom environment variables or version overrides.
- OpenShell: An open-source runtime that integrates with Windows security primitives to run autonomous agents in isolated, metered sandboxes. It automatically masks personal information in outgoing queries to protect user privacy.
- ARM Emulation: Because the CPU uses the ARM instruction set, running legacy x86 tools relies on Microsoft's Prism emulation layer. Compiling llama.cpp from source takes up to three times longer on the Grace ARM CPU than on a standard x86 Zen 5 CPU.
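Because toolchains behave differently on ARM and x86, build scripts often need to know which architecture they are running on. The following is a minimal sketch using Python's standard `platform` module. The `classify_arch` helper and its label sets are illustrative assumptions, and an interpreter that itself runs under emulation may report the emulated architecture, so treat the result as a first check only.

```python
import platform

# Common strings returned by platform.machine() on different systems.
_ARM_NAMES = {"arm64", "aarch64"}
_X86_NAMES = {"amd64", "x86_64", "x64"}


def classify_arch(machine: str) -> str:
    """Normalize a platform.machine() string to 'arm64', 'x86_64', or 'unknown'."""
    name = machine.lower()
    if name in _ARM_NAMES:
        return "arm64"
    if name in _X86_NAMES:
        return "x86_64"
    return "unknown"


arch = classify_arch(platform.machine())
if arch == "arm64":
    print("Native ARM64: prefer ARM builds of your tools to avoid emulation.")
elif arch == "x86_64":
    print("x86-64 reported: native x86, or an emulated interpreter.")
else:
    print(f"Unrecognized architecture: {platform.machine()!r}")
```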
Practical state of inference and benchmarks
When testing local inference speeds, raw computational performance matters less than quantization formats and memory architecture.
- The AMD platform proves highly efficient when running Mixture-of-Experts (MoE) models, which reduce memory bandwidth demands. For instance, a 120B parameter MoE model (GPT-OSS-120B) runs at a comfortable 47 t/s on this setup.
- The NVIDIA RTX Spark leverages its Blackwell Tensor Cores for native FP4 quantization, allowing the superchip to run large, dense models with a significantly reduced memory footprint.
- By utilizing optimized libraries like llama.cpp with Multi-Token Prediction (MTP), the RTX Spark achieves up to a 3x performance gain, allowing dense 70B+ models to generate at over 20 t/s.
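Two effects in the list above can be approximated with simple models. For MoE, only the active parameters are read per token, which raises the bandwidth ceiling. For Multi-Token Prediction, each forward pass can emit more than one token when the drafted tokens are accepted. The sketch below assumes 5.1B active parameters for GPT-OSS-120B (a figure from the model's public description, not from our tests), 4 bits per weight, and an independent per-token acceptance probability. These are simplifications, not measurements.

```python
def moe_decode_ceiling_tps(bandwidth_gbs: float,
                           active_params_billion: float,
                           bits_per_weight: float) -> float:
    """Upper bound on tokens/s when only active expert weights are read."""
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / active_gb


def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per forward pass: 1 + p + p^2 + ... + p^k.

    Assumes each drafted token is accepted independently with probability p
    and that generation stops at the first rejection.
    """
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


# MoE ceiling on the 256 GB/s AMD platform, with the assumed active size.
print(f"MoE ceiling: {moe_decode_ceiling_tps(256, 5.1, 4):.0f} t/s")

# MTP multiplier for a few assumed acceptance rates with 3 drafted tokens.
for p in (0.5, 0.8, 0.95):
    print(f"p={p}: {expected_tokens_per_step(p, 3):.2f} tokens per step")
```

Under this model a 3x gain requires both several drafted tokens and a high acceptance rate, so the best-case figure is more likely on predictable text such as code than on open-ended prose.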
Market reality: Availability and pricing
One practical insight from evaluating these setups is that the technically best option must always align with budget constraints and market availability.
- AMD Ryzen AI Max+ 395: Launched in January 2025, this platform is widely available and currently the more cost-effective choice. Fully functional 128 GB desktop workstations and laptops are available in the $2,700 to $3,400 range.
- NVIDIA RTX Spark: Scheduled for a Fall 2026 release, this platform is projected to launch at a significant premium. 128 GB consumer systems are expected to cost between $4,000 and $5,000, and lower-tier 64 GB models are expected to retail close to $4,000.
Outlook: Which edge workstation makes sense?
The right choice depends on your specific performance priorities, budget, and real-world requirements. Because the NVIDIA RTX Spark has not yet launched, AMD is the only option available today. Once the RTX Spark ships, we expect the verdict to be as follows.
The AMD Ryzen AI Max+ 395 is an excellent option for developers who need a cost-effective path to large local memory pools right now. While the ROCm software stack requires more manual configuration, developers can use the Vulkan backend and Linux TTM adjustments to bypass memory limits and run large models smoothly today.
The NVIDIA RTX Spark is the superior option for developers who prioritize software compatibility, native ecosystem support, and advanced agent security. With up to 300 GB/s of unified memory bandwidth and hardware-accelerated sandboxing via OpenShell, it provides a highly optimized environment, provided teams can accommodate the higher entry cost and the slower compilation times of the ARM-based CPU.
Our analysis will expand as the NVIDIA ecosystem matures; however, the Ryzen AI Max+ 395 remains a robust and practical foundation for current local inference architectures.

