Other

Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering

Anna Arnaudo (Polytechnic University of Turin), Van-Thanh Nguyen, Enrico Chen, Xiaoning Ma, XiaoQuan Ji, Minh-Thai Mai

Zenodo (CERN European Organization for Nuclear Research), repository, 2026, en

ABI

Abstract

30llm: Multi-Agent LLM Systems for Collaborative Test Case Generation

This project investigates how Large Language Model (LLM) agents can be used to automatically generate high-quality software test cases. Traditional automated testing tools struggle with tasks that require multi-perspective reasoning, such as understanding user intent, exploring edge cases, or applying domain knowledge. In contrast, multi-agent LLM architectures enable multiple specialized agents to collaborate, debate, or compete to produce more comprehensive test artefacts.

Main Goal

The primary goal of this project is to determine whether multi-agent LLM systems can outperform single-agent or traditional methods in generating comprehensive, diverse, and effective test cases.

This project builds upon the QAagent framework, a multi-agent system designed for unit test generation through natural language pseudocode. The original QAagent approach employs a two-stage pipeline: a code architect agent first generates an implementation plan in natural language and pseudocode, and a test generator agent then produces test cases based on this plan. This separation of concerns allows different perspectives to be incorporated into the test generation process, and the approach demonstrated superior performance on the HumanEval benchmark.

Here the framework is adapted and extended by modifying the prompting strategies and introducing additional interaction mechanisms to better suit function-level test generation. The modifications include support for different reasoning styles in the planning phase and multiple strategies for combining outputs from multiple agents, enabling a systematic comparison of collaborative and competitive multi-agent architectures.

Methodology

This project evaluates three distinct approaches to LLM-based test generation, progressing from a simple single-agent baseline to more sophisticated multi-agent architectures, and compares them with the QAagent framework.
Single Agent

The single-agent approach serves as a baseline, generating test cases directly from the problem description in a single inference pass. This approach is computationally efficient, requiring only one model invocation per problem with minimal token usage. However, it is inherently limited by single-perspective reasoning and may underrepresent challenging edge cases or uncommon execution paths, which motivates the exploration of multi-agent alternatives.

Multi-Agent Collaborative

The collaborative multi-agent system separates test generation into distinct planning and execution phases, mimicking real-world software development workflows. The system operates as a linear pipeline in which three code architect agents independently analyze each problem and generate natural language pseudocode describing likely implementations. Each architect employs a different reasoning strategy (few-shot Chain-of-Thought, zero-shot Chain-of-Thought, or few-shot ReAct) to maximize diversity in the generated plans. These plans are then consolidated and provided to a test generator agent, which produces comprehensive test suites covering both basic functionality and edge cases.

After test generation, a merge step reconciles the outputs using one of three strategies. The concat strategy performs basic concatenation of all generated tests after removing empty entries. The accuracy strategy extends the concat strategy with validation: it filters out tests with syntax errors, deduplicates tests based on their AST, and filters out tests that use incorrect function names. The LLM-based strategy employs a merger agent that is prompted to combine multiple test suites into a single high-quality output, following a set of predefined rules.

After merging, the merged test suite is evaluated against the canonical solution using metrics such as coverage, execution success rate, and token usage.
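The concat and accuracy merge strategies described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes each test suite is a list of strings holding one test statement each, and it interprets "incorrect function names" as "the test never calls the target function". The names `concat_merge`, `accuracy_merge`, and `function_name` are hypothetical.

```python
import ast

def concat_merge(suites):
    """Concat strategy: join all suites, dropping empty entries."""
    return [t for suite in suites for t in suite if t and t.strip()]

def accuracy_merge(suites, function_name):
    """Accuracy strategy: concat, then drop invalid, misnamed, and duplicate tests."""
    merged, seen = [], set()
    for test in concat_merge(suites):
        try:
            tree = ast.parse(test)
        except SyntaxError:
            continue  # filter out tests with syntax errors
        called = {
            node.func.id
            for node in ast.walk(tree)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        if function_name not in called:
            continue  # assumption: a valid test must call the target function
        key = ast.dump(tree)  # structural fingerprint, ignores whitespace
        if key in seen:
            continue  # AST-based deduplication
        seen.add(key)
        merged.append(test)
    return merged

suites = [
    ["assert add(1, 2) == 3", "assert add(1,2)==3", ""],
    ["assert ad(1, 2) == 3", "assert add(1, 2 == 3", "assert add(0, 0) == 0"],
]
concat_result = concat_merge(suites)
accuracy_result = accuracy_merge(suites, "add")
```

In this toy input, the concat strategy keeps five tests, while the accuracy strategy keeps only the two that parse, call `add`, and are structurally distinct.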
This architecture enables complementary reasoning strategies to be combined, potentially improving coverage and robustness on complex functions. The separation of planning and testing roles allows each agent to focus on its specialized task, while the merge phase ensures coherent final test suites.

Multi-Agent Competitive

The competitive multi-agent approach generates complete test suites independently from each agent configuration and selects the highest-quality output. Rather than combining outputs, agents compete to produce the best solution. Each agent follows the same two-stage pipeline as the collaborative approach: a code architect generates pseudocode using a specific reasoning strategy, and a test generator then produces test cases. However, each architect-generator pair operates independently, without sharing information during generation.

The final test suite is selected by an LLM-based Judge agent, which evaluates candidate suites as black boxes. The Judge can operate as a selector, choosing the best candidate based on the observable quality of the test suite (clarity, determinism, diversity of tested behaviors, and presence of meaningful edge cases), or as a scorer, assigning a score to each suite independently on the same criteria. All intermediate results from competing agents are preserved for analysis.

After selection, the chosen test suite is evaluated against the canonical solution using metrics such as coverage, execution success rate, and token usage. Candidate suites that were not selected are evaluated with the same metrics for post-hoc analysis. This provides insight into the relative quality of the different generation strategies and the effectiveness of the Judge, without affecting the selection process. The competitive architecture thus allows direct comparison of different reasoning strategies under identical conditions.
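The two Judge modes in the competitive approach can be sketched as follows. This is only an illustration under stated assumptions: the real Judge is an LLM, whereas here it is replaced by `toy_score`, a hypothetical stand-in that counts distinct assert lines. The function names and the candidate labels are also assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

def judge_as_scorer(
    candidates: Dict[str, str], score_suite: Callable[[str], float]
) -> Tuple[str, Dict[str, float]]:
    """Scorer mode: rate every suite on its own, then keep the top-rated one."""
    scores = {name: score_suite(suite) for name, suite in candidates.items()}
    winner = max(scores, key=scores.get)
    return winner, scores  # all scores are kept for post-hoc analysis

def judge_as_selector(
    candidates: Dict[str, str], choose: Callable[[List[str]], int]
) -> str:
    """Selector mode: show all suites at once and let the judge pick one index."""
    names = list(candidates)
    index = choose([candidates[name] for name in names])
    return names[index]

def toy_score(suite: str) -> float:
    """Stand-in for the LLM call: counts distinct assert lines (NOT a real judge)."""
    lines = {line.strip() for line in suite.splitlines()}
    return float(sum(1 for line in lines if line.startswith("assert")))

# Hypothetical candidate suites, one per architect reasoning strategy.
candidates = {
    "cot_few_shot": "assert f(1) == 1\nassert f(0) == 0\nassert f(-1) == 1",
    "cot_zero_shot": "assert f(1) == 1\nassert f(1) == 1",
    "react_few_shot": "assert f(1) == 1\nassert f(0) == 0",
}
winner, scores = judge_as_scorer(candidates, toy_score)
picked = judge_as_selector(
    candidates,
    lambda suites: max(range(len(suites)), key=lambda i: toy_score(suites[i])),
)
```

The scorer rates each suite in isolation, so its scores can be compared across problems, while the selector sees all candidates together and returns a single choice.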
By evaluating each approach independently, the system avoids potential quality degradation from merging incompatible test cases while ensuring delivery of the best-performing solution.

Experiments

Evaluation Metrics

Test quality is assessed using three complementary metrics that capture different aspects of test effectiveness:

Coverage: Line coverage percentage measures the proportion of source code lines executed during test execution.

Execution Success Rate: The proportion of generated tests that pass when executed against the canonical solution. This metric validates that generated tests correctly specify expected behavior and do not contain false positives (failures on a correct implementation). It is computed by executing each test case individually and recording pass/fail outcomes.

Tokens: Average total token usage (input + output) provides insight into the computational cost and efficiency of the different strategies.

These metrics provide complementary perspectives: coverage measures the thoroughness of test exploration, execution success rate measures the correctness of test specifications, and tokens measure resource efficiency. High-quality test suites achieve both comprehensive coverage and a high execution success rate while maintaining reasonable token usage.

Experimental Setup

All experiments (single agent, multi-agent collaborative, multi-agent competitive, QAagent) are conducted in two settings: a subset of 20 functions from the HumanEval benchmark and the complete HumanEval dataset (164 problems). In both cases, results are reported as the average over 10 independent runs. In the single-agent setup, each run generates a single test suite directly from the problem description. In the multi-agent setups, each run produces multiple planning perspectives via independent architect agents; these outputs are then either merged (collaborative) or evaluated competitively (competitive) to produce the final test suite for that run.
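The execution success rate and the averaging over runs described above can be sketched as follows. This is a minimal illustration, assuming each test is a single assert statement stored as a string and that the canonical solution is available as source code. The function name `execution_success_rate` is hypothetical, and a real harness would run generated code in a sandbox with timeouts rather than calling `exec` directly.

```python
from statistics import mean

def execution_success_rate(solution_source, tests):
    """Percentage of tests that pass against the canonical solution."""
    namespace = {}
    exec(solution_source, namespace)  # define the canonical solution
    passed = 0
    for test in tests:
        try:
            # Each test runs individually in a copy of the namespace,
            # so one test cannot affect another.
            exec(test, dict(namespace))
            passed += 1
        except Exception:
            pass  # assertion failures and runtime errors both count as fails
    return 100.0 * passed / len(tests) if tests else 0.0

# Toy canonical solution and generated tests (illustrative only).
solution = "def add(a, b):\n    return a + b\n"
tests = [
    "assert add(1, 2) == 3",    # passes
    "assert add(1, 2) == 4",    # wrong expected value
    "assert add('a') == 'a'",   # raises TypeError
    "assert add(0, 0) == 0",    # passes
]
rate = execution_success_rate(solution, tests)

# Reported numbers are averages over independent runs.
average_rate = mean([rate, 100.0])
```

Running each test in isolation matches the per-test pass/fail recording in the metric definition; a suite-level run would instead stop at the first failing assertion.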
Prompt Strategy

Single-Agent, available prompts:
Rule-Augmented Few-Shot (assign role, task, rules, formatting rule, few-shots)
Standard Few-Shot (assign role, task, formatting rule, few-shots)
Zero-Shot (assign role, task, formatting rule)

Multi-Agent collaborative and Multi-Agent competitive:
Architect: few-shot Chain-of-Thought, zero-shot Chain-of-Thought, and few-shot ReAct
Generator: Rule-Augmented Few-Shot (assign role, task, rules, formatting rule, few-shots) or Standard Few-Shot (assign role, task, formatting rule, few-shots)

QAagent:
Architect: Chain-of-Thought (assign role, task, formatting rule, few-shots)
Generator: Rule-Augmented Few-Shot (assign role, task, rules, formatting rule, few-shots) or Standard Few-Shot (assign role, task, formatting rule, few-shots)

Model Choice

The model selected for these experiments is nvidia/nemotron-3-nano-30b-a3b, for the following reasons.

Using a single fixed model isolates the impact of strategy and prompt changes, making comparisons fairer.

Strengths highlighted in the model card:
Open model family with open weights, training data, and recipes.
Hybrid MoE architecture (Mamba-2 + attention) with 3.5B active parameters and 30B total parameters, favoring efficiency.
Unified reasoning and non-reasoning model with configurable reasoning traces (accuracy vs. direct-answer trade-off).
Long-context support: the model card notes a context size of up to 1M tokens (HF default 256k due to VRAM needs).
Fine-tuned for code, math, science, tool calling, instruction following, and structured outputs.
Multilingual support (English, German, Spanish, French, Italian, Japanese) and marked as ready for commercial use.

All experiments use consistent decoding parameters (temperature, top-p) across configurations to isolate the effects of architectural choices.
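The three single-agent prompt variants listed under the prompt strategy differ only in which components they include. A minimal sketch of that composition follows; the `build_prompt` helper and all placeholder texts are hypothetical and do not reproduce the project's actual prompts.

```python
def build_prompt(role, task, formatting_rule, rules=None, few_shots=None):
    """Assemble a prompt in the order: role, task, rules, formatting rule, few-shots."""
    parts = [role, task]
    if rules:
        parts.append("Rules:\n" + "\n".join(f"- {rule}" for rule in rules))
    parts.append(formatting_rule)
    if few_shots:
        parts.append("Examples:\n" + "\n\n".join(few_shots))
    return "\n\n".join(parts)

# Placeholder components (illustrative wording only).
role = "You are a software test engineer."
task = "Write unit tests for the function described below."
formatting_rule = "Return only Python assert statements."
rules = ["Cover edge cases.", "Call the function by its exact name."]
few_shots = ["Problem: add(a, b) returns a + b\nTests: assert add(1, 2) == 3"]

zero_shot = build_prompt(role, task, formatting_rule)
standard_few_shot = build_prompt(role, task, formatting_rule, few_shots=few_shots)
rule_augmented = build_prompt(
    role, task, formatting_rule, rules=rules, few_shots=few_shots
)
```

Building every variant from the same components keeps the comparison controlled: any difference in results can be attributed to the added rules or examples rather than to incidental wording changes.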
Results

HumanEval - 20 Selected Problems (Average of 10 Runs)

Strategy                                Prompt                Execution Success Rate (%)  Coverage (%)  Tokens/Problem
Single Agent                            Zero Shot (Baseline)  61.55                       68.03         2,372.58
Single Agent                            Standard Few-Shot     88.38                       96.57         2,287.66
QAagent                                 Standard Few-Shot     88.47                       97.17         3,651.18
Multi-Agent Competitive (LLM scorer)    Standard Few-Shot     93.94                       98.12         19,047.33
Multi-Agent Competitive (LLM selector)  Standard Few-Shot     93.07                       98.04         17,783.04
Multi-Agent Merge (Accuracy)            Standard Few-Sho


Identifiers

DOI: 10.5281/zenodo.20304176
