Article

Evaluating LLMs on Kazakhstan's mathematics exam for university admission

Shirali Kadyrov, Department of General Education, New Uzbekistan University, Tashkent, Uzbekistan
Bolatbek Abdrasilov, National Test Center, Astana, Kazakhstan
Aslan Sabyrov, National Test Center, Astana, Kazakhstan
Nurseit Baizhanov, National Test Center, Astana, Kazakhstan
Alfira Makhmutova, Department of General Education, New Uzbekistan University, Tashkent, Uzbekistan
Patrick C. Kyllonen, Educational Testing Service, Princeton, NJ, United States

Frontiers in Artificial Intelligence · journal · 2025 · en

Abstract

Introduction: The rapid advancement of large language models (LLMs) has prompted their exploration in educational contexts, particularly in high-stakes standardized tests such as the mathematics component of Kazakhstan's Unified National Testing (UNT), which is critical for university admission. While most existing benchmarks for mathematical reasoning focus on English, concerns remain that LLMs may underperform in under-resourced or non-English languages. This study addresses this gap by evaluating LLM performance on 139 UNT multiple-choice mathematics questions administered entirely in Russian.

Methods: We assessed six LLMs (Claude, DeepSeek, Gemini, Llama, Qwen, and o1) on questions covering algebra, functions, geometry, inequalities, and trigonometry. Three evaluation conditions were employed: (1) zero-shot performance, (2) hybrid integration with SymPy for symbolic computation, and (3) a role-specific simulated multi-agent refinement framework that builds on existing self-correction techniques with targeted feedback.

Results: In zero-shot settings, DeepSeek, Gemini, Qwen, and o1 achieved near-perfect or perfect accuracy (91.2–100%) across all difficulty levels and topics, while Claude and Llama lagged (43.5–76.5%). The hybrid approach significantly improved the accuracy of Claude and Llama, by 27.4% and 39.9%, respectively. Under the multi-agent refinement condition, Claude showed substantial gains, reaching 97.8% accuracy, which represented a 58.1% improvement over zero-shot performance.

Discussion: These findings provide important empirical evidence that LLMs can perform competitively on mathematics tasks in non-English languages. The results challenge prior assumptions about limited performance in under-resourced linguistic settings and highlight the potential of LLMs to support bilingual education and promote equitable access to higher education.
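The abstract does not spell out how the SymPy hybrid condition was implemented, so the following is only a minimal sketch of one plausible form of symbolic checking for a multiple-choice item: each answer option is substituted into the equation and kept if the residual simplifies to zero. The function name `pick_option`, the sample equation, and the option values are hypothetical and are not taken from the study or from the UNT item pool.

```python
import sympy as sp


def pick_option(lhs_text, rhs_text, symbol_name, options):
    """Return the labels of the options that satisfy lhs == rhs.

    Hypothetical helper for illustration only; it is not the pipeline
    used in the study.
    """
    x = sp.Symbol(symbol_name)
    lhs = sp.sympify(lhs_text)
    rhs = sp.sympify(rhs_text)
    matches = []
    for label, value_text in options.items():
        candidate = sp.sympify(value_text)
        # Substitute the candidate and check that the residual vanishes.
        residual = sp.simplify((lhs - rhs).subs(x, candidate))
        if residual == 0:
            matches.append(label)
    return matches


# Made-up item: which option solves x**2 - 4 = 0?
sample_options = {"A": "1", "B": "2", "C": "-3", "D": "5"}
print(pick_option("x**2 - 4", "0", "x", sample_options))
```

In a setup like this, the language model would presumably be responsible for translating the Russian problem statement into the symbolic expression, while the symbolic engine handles the arithmetic; that division of labour is an assumption here, not a detail reported in the abstract.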

Topics

Intelligent Tutoring Systems and Adaptive Learning · Psychometric Methodologies and Testing · Educational Assessment and Pedagogy

Identifiers

DOI: 10.3389/frai.2025.1642570

Citations and references

Cited by 0 · 29 references
