Run a 1 Trillion Parameter LLM Locally: AMD Ryzen AI Max+ Cluster Guide

AMD's technical article details how to build a small-scale distributed inference cluster using four Framework Desktop systems with Ryzen AI Max+ 395 processors and run the Kimi K2.5 open-source model (1 trillion parameters, 375GB) using llama.cpp RPC. The setup treats the four machines as a single logical AI accelerator.

Hardware and Software Stack

Hardware: 4x Framework Desktop - AMD Ryzen AI Max+ 395 - 128GB
AI Framework: AMD ROCm
Inference Engine: llama.cpp RPC
OS: Ubuntu 24.04.3 LTS
Model: Kimi-K2.5 (UD_Q2_K_XL) (375GB)
Network: 5Gbps over Ethernet

Technical Setup: Extended VRAM Allocation

On each Ryzen AI Max+ system, first set iGPU Memory Size to 512MB in the BIOS. The BIOS alone can dedicate at most 96GB of VRAM per node (384GB total across four nodes). Raising the limits through Translation Table Manager (TTM) kernel parameters instead makes 120GB per node available to the GPU (480GB total).
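The totals quoted above are plain multiplication. A minimal sketch in shell arithmetic compares both configurations against the 375GB model file. The variable and function names are illustrative, and the headroom figure ignores KV cache and runtime overhead, which this summary does not quantify.

```shell
#!/bin/sh
# Illustrative names; the figures are taken from the text above.
NODES=4
MODEL_GB=375

# Print total cluster memory and what is left after loading the model file.
cluster_headroom() {
    per_node_gb=$1
    total_gb=$((NODES * per_node_gb))
    echo "total=${total_gb}GB headroom=$((total_gb - MODEL_GB))GB"
}

cluster_headroom 96    # BIOS-only allocation
cluster_headroom 120   # with TTM kernel parameters
```

With the BIOS-only setting the model file leaves 9GB spare across the whole cluster; with the TTM parameters it leaves 105GB.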

Configure kernel parameters:

sudo nano /etc/default/grub

Find the line starting with GRUB_CMDLINE_LINUX_DEFAULT= and append the two parameters inside the quotes, so that the value reads:

"quiet splash ttm.pages_limit=30720000 amdgpu.gttsize=120000"
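For repeating this edit on four nodes, the same change could be scripted rather than made in an editor. The sketch below is an assumption-laden illustration, not part of AMD's instructions: the helper name append_kernel_params is hypothetical, it only prints the modified text, and it is demonstrated on a scratch file rather than the live /etc/default/grub.

```shell
#!/bin/sh
# Print the contents of a GRUB defaults file with extra parameters appended
# inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT. The input file is not modified.
# Hypothetical helper; params must not contain '/' or '&'.
append_kernel_params() {
    file=$1
    params=$2
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${params}\"/" "$file"
}

# Demonstrate on a scratch copy rather than the real configuration file.
scratch=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$scratch"
append_kernel_params "$scratch" "ttm.pages_limit=30720000 amdgpu.gttsize=120000"
rm -f "$scratch"
```

Reviewing the printed line before writing it back keeps a typo from producing an unbootable kernel command line.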

TTM limits are expressed in 4 KB pages. Calculation for 120GB: (120 * 1024 * 1024) / 4.096 = 30720000. This matches the amdgpu.gttsize value, which is given in MB: 120000 * 1024 / 4 = 30720000 pages.
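The two kernel parameters describe the same amount of memory in different units, so one can be derived from the other. A minimal sketch, assuming a 4 KiB page size and that amdgpu.gttsize is given in MiB (consistent with the "120000M" that dmesg reports); the function name is illustrative.

```shell
#!/bin/sh
# Convert a GTT size in MiB into a count of 4 KiB pages for ttm.pages_limit.
# Assumes a 4 KiB page size, as stated in the text.
gtt_mib_to_pages() {
    mib=$1
    page_kib=4
    echo $((mib * 1024 / page_kib))
}

gtt_mib_to_pages 120000   # prints 30720000
```

Deriving the page count this way keeps the two values in step if a different GTT size is chosen.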

After saving and exiting, run:

sudo update-grub
sudo reboot

Verify configuration:

$ sudo dmesg | grep "amdgpu.*memory"
[drm] amdgpu: 512M of VRAM memory ready
[drm] amdgpu: 120000M of GTT memory ready.
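When checking several nodes, the GTT figure can be pulled out of the log instead of read by eye. The sketch below assumes the message format shown above; the function name gtt_ready_mib is hypothetical, and the sample input is copied from that output.

```shell
#!/bin/sh
# Extract the GTT size in MiB from amdgpu dmesg lines read on stdin.
# Prints nothing if no matching line is found.
gtt_ready_mib() {
    sed -n 's/.*amdgpu: \([0-9][0-9]*\)M of GTT memory ready.*/\1/p'
}

# Sample input copied from the output above; on a live node,
# pipe `sudo dmesg` into the function instead.
printf '%s\n' \
    '[drm] amdgpu: 512M of VRAM memory ready' \
    '[drm] amdgpu: 120000M of GTT memory ready.' | gtt_ready_mib
```

For the sample lines this prints 120000, which should equal the amdgpu.gttsize value set in GRUB.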

Setup Option 1: Lemonade SDK (Recommended)

Download pre-built binaries from: https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/

Download the archive matching your platform and GPU target: llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip

Extract and prepare:

unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
cd llama-bxxxx-ubuntu-rocm-gfx1151-x64
chmod +x llama-cli llama-server rpc-server

Verify GPU detection:

$ ./llama-cli --list-devices
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)

Setup Option 2: Manual Source Build

Install ROCm 7.0.2 on Ubuntu 24.04.3:

wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,

The article continues with additional setup steps and inference configuration details.

📖 Read the full source: HN LLM Tools