Running a 1 Trillion Parameter LLM Locally on AMD Ryzen AI Max+ Cluster

AMD's technical article details how to build a small-scale distributed inference cluster using four Framework Desktop systems with Ryzen AI Max+ 395 processors and run the Kimi K2.5 open-source model (1 trillion parameters, 375GB) using llama.cpp RPC. The setup treats the four machines as a single logical AI accelerator.
Hardware and Software Stack
- Hardware: 4x Framework Desktop - AMD Ryzen AI Max+ 395 - 128GB
- AI Framework: AMD ROCm
- Inference Engine: llama.cpp RPC
- OS: Ubuntu 24.04.3 LTS
- Model: Kimi-K2.5 (UD_Q2_K_XL) (375GB)
- Network: 5Gbps over Ethernet
Technical Setup: Extended VRAM Allocation
On each Ryzen AI Max+ system, first set iGPU Memory Size to 512MB in the BIOS. BIOS settings alone allow at most 96GB of dedicated VRAM per node (384GB across four nodes). Keeping the dedicated allocation at 512MB and instead raising the GTT limit through Translation Table Maps (TTM) kernel parameters makes 120GB per node available to the GPU (480GB total).
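The capacity figures above can be checked against the 375GB model size with a few lines of shell arithmetic. This is only a rough sketch using the numbers quoted in this summary. The variable names are illustrative, and the "left after weights" figure ignores the KV cache and compute buffers, which also need GPU-visible memory.

```shell
# Figures quoted in the article; variable names are illustrative.
nodes=4
bios_gb_per_node=96    # maximum dedicated VRAM via BIOS alone
ttm_gb_per_node=120    # with the TTM kernel parameters
model_gb=375           # Kimi-K2.5 (UD_Q2_K_XL)

bios_total=$(( nodes * bios_gb_per_node ))
ttm_total=$(( nodes * ttm_gb_per_node ))

echo "BIOS only: ${bios_total} GB total, $(( bios_total - model_gb )) GB left after weights"
echo "With TTM:  ${ttm_total} GB total, $(( ttm_total - model_gb )) GB left after weights"
```

With BIOS allocation alone the weights would leave only single-digit gigabytes free across the whole cluster, which suggests why the larger TTM limit is used.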
Configure kernel parameters:
sudo nano /etc/default/grub
Find the line starting with GRUB_CMDLINE_LINUX_DEFAULT= and append the two parameters inside the quotes so that the value reads:
"quiet splash ttm.pages_limit=30720000 amdgpu.gttsize=120000"
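TTM limits are expressed in 4 KB pages, while amdgpu.gttsize is given in MiB, so the two values describe the same amount of memory: 120000 MiB * 1024 / 4 = 30720000 pages. The article writes the same calculation as (120 * 1024 * 1024) / 4.096 = 30720000.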
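To derive both kernel parameters from a single target size, the page calculation can be scripted. The helper below is a minimal sketch. The function name ttm_params is hypothetical, and it assumes 4 KiB pages and a target expressed in MiB, as amdgpu.gttsize expects.

```shell
# Print the two kernel parameters for a given GTT size in MiB.
# Assumes 4 KiB pages; ttm_params is a hypothetical helper name.
ttm_params() {
  gtt_mib="$1"
  page_kib=4
  pages_limit=$(( gtt_mib * 1024 / page_kib ))
  echo "ttm.pages_limit=${pages_limit} amdgpu.gttsize=${gtt_mib}"
}

ttm_params 120000
```

For 120000 MiB this prints the same pair of values that appears in the GRUB line above.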
After saving and exiting, run:
sudo update-grub
sudo reboot
Verify configuration:
$ sudo dmesg | grep "amdgpu.*memory"
[drm] amdgpu: 512M of VRAM memory ready
[drm] amdgpu: 120000M of GTT memory ready.
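With four nodes to configure, it may help to turn this check into something scriptable. The sketch below parses the GTT line from dmesg output on stdin and compares it with the expected size. The function names are hypothetical, and the demo runs against the two sample lines quoted above rather than a live system.

```shell
# Extract the GTT size in MiB from amdgpu's boot messages (read on stdin).
gtt_mib_from_dmesg() {
  sed -n 's/.*amdgpu: \([0-9][0-9]*\)M of GTT memory ready.*/\1/p' | tail -n 1
}

# Succeed only if the reported GTT size matches the expected value.
check_gtt() {
  expected_mib="$1"
  actual_mib=$(gtt_mib_from_dmesg)
  if [ "$actual_mib" = "$expected_mib" ]; then
    echo "OK: GTT is ${actual_mib} MiB"
  else
    echo "MISMATCH: expected ${expected_mib} MiB, got '${actual_mib}'"
    return 1
  fi
}

# Demo on the sample lines from the article; on a real node, pipe in
# the output of `sudo dmesg` instead.
printf '%s\n' \
  '[drm] amdgpu: 512M of VRAM memory ready' \
  '[drm] amdgpu: 120000M of GTT memory ready.' | check_gtt 120000
```

On a configured node the equivalent call would be sudo dmesg | check_gtt 120000, and a non-zero exit status flags a node that still needs the kernel parameters.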
Setup Option 1: Lemonade SDK (Recommended)
Download pre-built binaries from: https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/
Download the archive matching your platform and GPU target, for example: llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
Extract and prepare:
unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
cd llama-bxxxx-ubuntu-rocm-gfx1151-x64
chmod +x llama-cli llama-server rpc-server
Verify GPU detection:
$ ./llama-cli --list-devices
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)
Setup Option 2: Manual Source Build
Install ROCm 7.0.2 on Ubuntu 24.04.3:
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,
The article continues with additional setup steps and inference configuration details.
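The inference steps are not reproduced in this summary, but the general shape of a llama.cpp RPC launch can be sketched. Everything specific below is an assumption, not taken from the article: the worker IP addresses, port 50052, the model path, and the choice of three workers plus one main node. The sketch only prints the command it builds, so nothing is started.

```shell
# On each worker node, from the extracted llama.cpp directory, an RPC
# server would be started with something like:
#   ./rpc-server -H 0.0.0.0 -p 50052
# (assumed port; rpc-server has no authentication, so keep it on a
# trusted private network)

# Join host arguments into the comma-separated host:port list that
# llama.cpp's --rpc option expects. Hypothetical helper name.
join_rpc_endpoints() {
  rpc_port="$1"; shift
  rpc_out=""
  for rpc_host in "$@"; do
    rpc_out="${rpc_out:+$rpc_out,}${rpc_host}:${rpc_port}"
  done
  printf '%s\n' "$rpc_out"
}

# Hypothetical worker addresses; the main node uses its own GPU
# alongside the remote ones.
rpc_list=$(join_rpc_endpoints 50052 192.168.1.11 192.168.1.12 192.168.1.13)

# Print (rather than run) the command for the main node.
echo "./llama-cli -m /path/to/model.gguf --rpc ${rpc_list} -ngl 99"
```

The -ngl 99 value is a common way to request that all layers be offloaded to GPU backends. The settings actually used in the article may differ.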
📖 Read the full source: HN LLM Tools