Preprint

REPOMIND: Reproducing 256K-context Repository-Scale Code Understanding on a Single AMD MI300X with FP8 KV Cache

Sardor Razikov, Independent ML Engineer, Tashkent, Uzbekistan

Zenodo (CERN European Organization for Nuclear Research), repository, 2026, en


Abstract

We present REPOMIND, an open-source repository-scale coding agent that runs on a single AMD MI300X accelerator (192 GB HBM3) at 256K context length using Qwen/Qwen3-Coder-Next-FP8 (80B parameters, 3B active MoE, FP8 weights and KV cache). We report 62 measured data points collected over 124 minutes of stress testing on AMD Developer Cloud infrastructure, covering throughput as a function of context length, concurrency scaling at 8K–256K context windows, long-context coherence via needle-in-a-haystack probes at three positions up to 200K, and end-to-end repository question answering on three real codebases, including pytorch/vision (~1.3M tokens, roughly 5× the 256K window). All 31 parallel users succeed at every realistic context length (8K–64K, 31/31), 144/144 outputs are clean on the default Triton attention backend, the needle probe passes at all three positions including 200K, and all 9 end-to-end repository questions are answered correctly. We also report a negative result in the interest of engineering honesty: AMD's ROCM_AITER_FA attention backend, advertised for higher throughput on MI300X, delivers 2–4× higher aggregate throughput when combined with FP8 KV cache, but degenerates outputs into repeating punctuation in 137 of 144 cells of our concurrency matrix for this specific model and configuration. The default ROCm Triton backend remains production-safe; AITER remains research-quality on this configuration as of vLLM 0.17.1 / ROCm 7.2.0. We file this regression with reproducible scripts and per-cell evidence in the public repository. Total compute cost: $4.12 ($1.99/hr × 2.07 hr across two sessions on AMD Developer Cloud).
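The clean-versus-degenerate tally reported above (144/144 on Triton, 137 of 144 degenerate on AITER) could be approximated with a simple heuristic. The sketch below is illustrative only: the function names `is_degenerate` and `count_clean`, and both thresholds, are assumptions introduced here and are not the criterion used in the preprint's scripts.

```python
import string


def is_degenerate(text: str, max_punct_ratio: float = 0.5, max_run: int = 20) -> bool:
    """Heuristic check for output that collapsed into repeating punctuation.

    Both thresholds are hypothetical defaults, not values from the preprint.
    """
    # Ignore whitespace so spaced-out punctuation is still caught.
    chars = "".join(text.split())
    if not chars:
        return True  # an empty completion is treated as a failure

    # Signal 1: most of the output is punctuation.
    punct_count = sum(ch in string.punctuation for ch in chars)
    if punct_count / len(chars) > max_punct_ratio:
        return True

    # Signal 2: one punctuation character repeated in a long unbroken run.
    longest = run = 1
    for prev, cur in zip(chars, chars[1:]):
        run = run + 1 if (cur == prev and cur in string.punctuation) else 1
        longest = max(longest, run)
    return longest >= max_run


def count_clean(outputs: list[str]) -> tuple[int, int]:
    """Return (clean, total) for one backend's set of completions."""
    clean = sum(not is_degenerate(o) for o in outputs)
    return clean, len(outputs)
```

Applied per cell of a concurrency matrix, `count_clean` would yield tallies in the same "clean/total" form as the figures quoted above; a real evaluation would likely also inspect the raw text, since source code is legitimately punctuation-heavy.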
We argue that a memory-architecture moat exists: the combined footprint of Qwen3-Coder-Next-FP8 weights (77.29 GiB), FP8 KV cache (94.58 GiB), and activations is approximated at ~143 GiB, which by VRAM accounting does not fit on a single NVIDIA H100 80GB card but leaves headroom within the MI300X's 192 GB. We further argue that this moat enables a category of on-premises repository-scale coding assistance that hosted SaaS coding tools cannot legally serve to compliance-locked enterprises (banks, defense, healthcare, IP-sensitive product teams).
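The VRAM accounting behind this argument can be sketched as a simple fit check. The footprint figures are the ones quoted in the abstract; treating each card's nominal capacity (80 and 192) as GiB is a simplifying assumption made here, and the helper names `fits` and `headroom_gib` are hypothetical.

```python
# Figures quoted in the abstract.
STATED_TOTAL_GIB = 143.0   # approximate weights + KV cache + activations
WEIGHTS_GIB = 77.29        # Qwen3-Coder-Next-FP8 weights alone

# Assumption: nominal card capacities taken at face value as GiB.
H100_NOMINAL_GIB = 80.0
MI300X_NOMINAL_GIB = 192.0


def fits(required_gib: float, capacity_gib: float) -> bool:
    """True if the required footprint is within the card's capacity."""
    return required_gib <= capacity_gib


def headroom_gib(required_gib: float, capacity_gib: float) -> float:
    """Capacity left over after the footprint (negative if it does not fit)."""
    return capacity_gib - required_gib


if __name__ == "__main__":
    for name, cap in [("H100 80GB", H100_NOMINAL_GIB), ("MI300X", MI300X_NOMINAL_GIB)]:
        print(
            f"{name}: fits={fits(STATED_TOTAL_GIB, cap)}, "
            f"headroom={headroom_gib(STATED_TOTAL_GIB, cap):.2f} GiB"
        )
```

Under these assumptions the stated footprint exceeds the 80 GiB card by 63 GiB and leaves 49 GiB free on the 192 GiB card. The weights alone would leave about 2.71 GiB on the smaller card for KV cache and activations, which is why a 256K context is out of reach there.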


Topics

Parallel Computing and Optimization Techniques; Advanced Data Storage Technologies; Big Data and Digital Economy

Identifiers

DOI: 10.5281/zenodo.20330467

Citations and references

0 citations, 0 references used
