Other

Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering

Anna M. Arnaudo (Turin Polytechnic University), Van-Thanh Nguyen, Enrico Chen, Xiaoning Ma, XiaoQuan Ji, Minh-Thai Mai

Zenodo (CERN European Organization for Nuclear Research), repository, 2026


Abstract

Replication package for the study "Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering".

# 30llm: Multi-Agent LLM Systems for Collaborative Test Case Generation

This project investigates how Large Language Model (LLM) agents can be used to automatically generate high-quality software test cases. Traditional automated testing tools struggle with tasks that require multi-perspective reasoning, such as understanding user intent, exploring edge cases, or applying domain knowledge. In contrast, multi-agent LLM architectures enable multiple specialized agents to collaborate, debate, or compete to produce more comprehensive test artefacts.

## Main Goal

The primary goal of this project is to determine whether multi-agent LLM systems can outperform single-agent or traditional methods in generating comprehensive, diverse, and effective test cases.

This project builds upon the [QAagent framework](https://github.com/AkhilDeo/QAagent), a multi-agent system designed for unit test generation through natural language pseudocode. The original QAagent approach employs a two-stage pipeline: a code architect agent first generates an implementation plan in natural language and pseudocode, and a test generator agent then produces test cases based on this plan. This separation of concerns allows different perspectives to be incorporated into the test generation process, demonstrating superior performance on the HumanEval benchmark.

This framework is adapted and extended by modifying the prompting strategies and introducing additional interaction mechanisms to better suit function-level test generation. The modifications include support for different reasoning styles in the planning phase and multiple strategies for combining outputs from multiple agents, enabling systematic comparison of collaborative versus competitive multi-agent architectures.
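
The two-stage pipeline described above (architect first, test generator second) can be sketched as follows. This is a minimal illustration, not the QAagent or 30llm implementation: the `LLM` callable type, the prompt wording, the function name `two_stage_generate`, and the offline `stub_llm` are all assumptions introduced here so the sketch runs without a model.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt to completion text.
LLM = Callable[[str], str]


def two_stage_generate(problem: str, llm: LLM) -> Dict[str, str]:
    """Stage 1 asks for a plan; stage 2 asks for tests based on that plan."""
    plan = llm(
        "ROLE: code architect\n"
        "TASK: describe an implementation plan in natural language and pseudocode.\n\n"
        f"PROBLEM:\n{problem}"
    )
    tests = llm(
        "ROLE: test generator\n"
        "TASK: write unit tests as Python assert statements.\n\n"
        f"PROBLEM:\n{problem}\n\nPLAN:\n{plan}"
    )
    return {"plan": plan, "tests": tests}


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a model call, used only to make the sketch runnable."""
    if prompt.startswith("ROLE: code architect"):
        return "1. Add the two arguments.\n2. Return the sum."
    return "assert add(1, 2) == 3\nassert add(-1, 1) == 0"


result = two_stage_generate("def add(a, b): return the sum of a and b", stub_llm)
print(result["tests"])
```

Passing the model as a plain callable keeps the pipeline independent of any particular client library, so the same function could be driven by a real model or by a stub in tests.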
## Methodology

This project evaluates three distinct approaches to LLM-based test generation, progressing from a simple single-agent baseline to more sophisticated multi-agent architectures, and compares them with the QAagent framework.

### Single Agent

The single-agent approach serves as a baseline, generating test cases directly from the problem description in a single inference pass. This approach is computationally efficient, requiring only a single model invocation per problem with minimal token usage. However, it is inherently limited by single-perspective reasoning and may underrepresent challenging edge cases or uncommon execution paths, motivating the exploration of multi-agent alternatives.

### Multi-Agent Collaborative

The collaborative multi-agent system separates test generation into distinct planning and execution phases, mimicking real-world software development workflows. The system operates through a linear pipeline in which three code architect agents independently analyze each problem and generate natural language pseudocode describing likely implementations. Each architect employs a different reasoning strategy (few-shot Chain-of-Thought, zero-shot Chain-of-Thought, and few-shot ReAct) to maximize diversity in the generated plans. These plans are then consolidated and provided to a test generator agent, which produces comprehensive test suites covering both basic functionality and edge cases.

After test generation, a merger agent reconciles outputs using one of two strategies. The **concat** strategy concatenates all generated tests after removing empty entries. The **accuracy** strategy extends the **concat** strategy with validation steps: it filters out tests with syntax errors, deduplicates tests by comparing their ASTs, filters out tests that use incorrect function names, and retains only the tests that pass when executed against the canonical solution.
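
The two merge strategies can be sketched as below. This is a simplified illustration under stated assumptions, not the project's merger agent: each test is assumed to be a single Python `assert` statement held in a string, the names `merge_concat`, `merge_accuracy`, and `_calls_function` are hypothetical, and running tests with `exec` is used only for brevity (untrusted generated code would normally run in a sandbox).

```python
import ast


def merge_concat(suites):
    """Concatenate all tests from all suites, dropping empty entries."""
    return [t.strip() for suite in suites for t in suite if t and t.strip()]


def _calls_function(tree, name):
    """Return True if the parsed test calls a function with the given name."""
    return any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
        for node in ast.walk(tree)
    )


def merge_accuracy(suites, solution_src, func_name):
    """Extend concat with syntax, name, duplicate, and execution filters."""
    namespace = {}
    exec(solution_src, namespace)  # defines the canonical function
    kept, seen = [], set()
    for test in merge_concat(suites):
        try:
            tree = ast.parse(test)
        except SyntaxError:
            continue  # drop tests that do not parse
        if not _calls_function(tree, func_name):
            continue  # drop tests that target the wrong function name
        key = ast.dump(tree)  # structural form, ignores whitespace differences
        if key in seen:
            continue  # drop structural duplicates
        seen.add(key)
        try:
            exec(compile(tree, "<test>", "exec"), dict(namespace))
        except Exception:
            continue  # drop tests that fail against the canonical solution
        kept.append(test)
    return kept


suites = [
    ["assert add(1, 2) == 3", "", "assert add(1,2)==3"],
    ["assert add(2, 2) == 5", "assert ad(1, 2) == 3", "assert add(1, 2 == 3"],
]
solution = "def add(a, b):\n    return a + b\n"
print(len(merge_concat(suites)))                # 5
print(merge_accuracy(suites, solution, "add"))  # ['assert add(1, 2) == 3']
```

Because the accuracy filter keeps only tests that pass against the canonical solution, it can shrink the suite considerably, which is one plausible reason for a merged suite to end up with lower coverage than plain concatenation.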
This architecture enables complementary reasoning strategies to be combined, potentially improving coverage and robustness on complex functions. The separation of planning and testing roles allows each agent to focus on its specialized task, while the merge phase ensures coherent final test suites.

### Multi-Agent Competitive

The competitive multi-agent approach generates complete test suites independently from each agent configuration and selects the highest-quality output. Rather than combining outputs, agents compete to produce the best solution. Each agent follows the same two-stage pipeline as the collaborative approach: a code architect generates pseudocode using a specific reasoning strategy, followed by a test generator producing test cases. However, each agent pair operates independently without sharing information during generation.

All agent outputs are evaluated against the canonical solution using coverage and execution success rate metrics. The final test suite is selected by ranking agents according to total line coverage as the primary criterion and test execution success rate as a tiebreaker. This ensures the system delivers the most comprehensive and correct test suite from among all candidates. All intermediate results from competing agents are preserved for analysis.

This competitive architecture allows direct comparison of different reasoning strategies under identical conditions. By evaluating each approach independently, the system avoids potential quality degradation from merging incompatible test cases while ensuring delivery of the best-performing solution.

## Experiments

### Evaluation Metrics

Test quality is assessed using three complementary metrics that capture different aspects of test effectiveness:

- **Coverage:** Line coverage percentage measures the proportion of source code lines executed during test execution. Coverage is reported for both the first five generated tests and the complete test suite.
- **Execution Success Rate:** Test execution success rate is defined as the proportion of generated tests that pass when executed against the canonical solution. This metric validates that generated tests correctly specify expected behavior and do not contain false positives. Execution success rate is computed by executing each test case individually and recording pass/fail outcomes.
- **Tokens:** Average total token usage (input + output) provides insight into the computational cost and efficiency of different strategies.

These metrics provide complementary perspectives: coverage measures thoroughness of test exploration, execution success rate measures correctness of test specifications, and tokens measure resource efficiency. High-quality test suites achieve both comprehensive coverage and a high execution success rate while maintaining reasonable token usage.

### Experimental Setup

All experiments (single agent, multi-agent collaborative, multi-agent competitive, QAagent) are conducted on 20 functions selected from the HumanEval benchmark (average of 10 runs), then on the complete HumanEval benchmark (1 run). For single-agent experiments, each problem is evaluated once with the baseline configuration. Multi-agent experiments generate multiple independent planning perspectives per problem, which are then either merged or evaluated competitively depending on the architecture being tested.
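
The execution success rate and the competitive selection rule (line coverage first, success rate as tiebreaker) can be illustrated with a short sketch. It is not the project's evaluation harness: the names `execution_success_rate`, `Candidate`, and `select_best` are hypothetical, each test is assumed to be a standalone `assert` statement, coverage values are assumed to have been measured beforehand by a separate tool, and the candidate numbers are invented for the example.

```python
from dataclasses import dataclass


def execution_success_rate(tests, solution_src):
    """Percentage of tests that pass when run one by one against the solution."""
    namespace = {}
    exec(solution_src, namespace)  # defines the canonical function
    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))  # fresh copy so tests cannot interfere
            passed += 1
        except Exception:
            pass  # any error or failed assertion counts as a failing test
    return 100.0 * passed / len(tests) if tests else 0.0


@dataclass
class Candidate:
    name: str            # label of the agent configuration
    coverage: float      # total line coverage in percent (measured elsewhere)
    success_rate: float  # execution success rate in percent


def select_best(candidates):
    """Rank by coverage first and use success rate only to break ties."""
    return max(candidates, key=lambda c: (c.coverage, c.success_rate))


solution = "def add(a, b):\n    return a + b\n"
tests = [
    "assert add(1, 2) == 3",
    "assert add(2, 2) == 5",   # wrong expectation, fails
    "assert add(0, 0) == 0",
    "assert add(-1, 1) == 0",
]
print(execution_success_rate(tests, solution))  # 75.0

candidates = [
    Candidate("cot_few_shot", coverage=95.0, success_rate=90.0),
    Candidate("cot_zero_shot", coverage=100.0, success_rate=80.0),
    Candidate("react_few_shot", coverage=100.0, success_rate=85.0),
]
print(select_best(candidates).name)  # react_few_shot
```

Using a tuple as the sort key makes the tiebreak explicit: a candidate with lower coverage can never win, however high its success rate.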
### Prompt strategy

- **Single-Agent**, available prompts are:
  - **Augmented Few-Shot** (assign role, task, rules, formatting rule, few-shots)
  - **Zero-Shot** (assign role, task, formatting rule)
  - **original** (assign role, task, formatting rule, few-shots)
- **Multi-Agent collaborative** and **Multi-Agent competitive**:
  - **Architect**: Chain-of-Thought with few-shots and zero-shot, and ReAct with few-shots
  - **Generator**: **Augmented Few-Shot** (assign role, task, rules, formatting rule, few-shots) or **Standard Few-Shot** (assign role, task, formatting rule, few-shots)
- **QAagent**:
  - **Architect**: Chain-of-Thought (assign role, task, plan formatting, few-shots)
  - **Generator**: **Augmented Few-Shot** (assign role, task, rules, formatting rule, few-shots) or **Standard Few-Shot** (assign role, task, formatting rule, few-shots)

### Model Choice

The model selected for these experiments is **nvidia/nemotron-3-nano-30b-a3b** because:

* Using a single fixed model isolates the impact of strategy and prompt changes, making comparisons fairer.

Strengths highlighted in the model card:

* Open model family with open weights, training data, and recipes.
* Hybrid MoE architecture (Mamba-2 + attention) with 3.5B active parameters and 30B total parameters, favoring efficiency.
* Unified reasoning and non-reasoning model with configurable reasoning traces (accuracy vs. direct-answer trade-off).
* Long-context support: model card notes up to a 1M context size (HF default 256k due to VRAM needs).
* Fine-tuned for code, math, science, tool calling, instruction following, and structured outputs.
* Multilingual support (English, German, Spanish, French, Italian, Japanese) and marked as ready for commercial use.

All experiments use consistent decoding parameters (temperature, top-p) across configurations to isolate the effects of architectural choices.
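
The generator prompt variants listed under Prompt strategy differ only in which components they include. The sketch below shows one way such prompts could be assembled. The component lists follow the text; everything else is an assumption made for illustration, including the dictionary names, the `build_prompt` helper, and all placeholder wording, none of which is taken from the replication package.

```python
# Components per prompt variant, as listed in the Prompt strategy section.
PROMPT_COMPONENTS = {
    "zero_shot": ["role", "task", "formatting_rule"],
    "standard_few_shot": ["role", "task", "formatting_rule", "few_shots"],
    "augmented_few_shot": ["role", "task", "rules", "formatting_rule", "few_shots"],
}

# Placeholder wording, invented for illustration only.
PARTS = {
    "role": "You are a software tester.",
    "task": "Write unit tests for the function described below.",
    "rules": "Cover typical inputs, boundary values, and invalid inputs.",
    "formatting_rule": "Return one Python assert statement per line.",
    "few_shots": "Example:\nassert add(1, 2) == 3",
}


def build_prompt(variant, problem, parts=PARTS):
    """Join the components of the chosen variant, then append the problem."""
    sections = [parts[name] for name in PROMPT_COMPONENTS[variant]]
    sections.append("Problem:\n" + problem)
    return "\n\n".join(sections)


print(build_prompt("augmented_few_shot", "def add(a, b): return the sum"))
```

Keeping the variants as data rather than as separate prompt strings makes it easy to confirm that two configurations differ in exactly one component, which matters when the goal is to isolate the effect of a single prompt change.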
## Results

### HumanEval - 20 Selected Problems (Average of 10 Runs)

| Strategy | Prompt | Execution Success Rate (%) | Coverage (%) | Tokens/Problem |
| :------------------------------------- | :------------------------ | ---------------------: | -------: | -------------: |
| Single Agent | Zero Shot (Baseline) | 61.55 | 68.03 | 2,372.58 |
| Single Agent | Standard Few-Shot | 88.38 | 96.57 | 2,287.66 |
| QA Agent | Standard Few-Shot | 88.47 | 97.17 | 3,651.18 |
| Multi-Agent Competitive (LLM scorer) | Standard Few-Shot | 93.94 | 98.12 | 19,047.33 |
| Multi-Agent Competitive (LLM selector) | Standard Few-Shot | 93.07 | 98.04 | 17,783.04 |
| Multi-Agent Merge (Accuracy) | Standard Few-Shot | 86.59 | 91.78 | 12,278.32 |
| Multi-Agent Merge (Concat) | Standard Few-Shot | 90.09 | 97.05 | 12,183.05 |
| Multi-Agent Merge (LLM) | Standard Few-Shot | 90.86 | 98.76 | 21,144.7


Identifiers

DOI: 10.5281/zenodo.18999792

Citations and sources

Citations: 0. Sources used: 0
