Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Under review, TMLR (double-blind) · 2026
Given a black-box AI system and a task, at what confidence can a practitioner trust its output? We answer with a single reliability level per (system, task) pair, derived from self-consistency sampling and conformal calibration, that acts as a deployment gate with exact, finite-sample, distribution-free guarantees.
Result: GPT-4.1 reaches a reliability level of 94.6% on GSM8K and 96.8% on TruthfulQA. Sequential stopping cuts API cost by about 50% without losing the guarantee.
Status: Under review at TMLR
Implementation: TrustGate (open source)
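The combination of self-consistency sampling and conformal calibration described above can be illustrated with a minimal sketch. It assumes a standard split-conformal recipe in which the nonconformity score of a calibration question is one minus the fraction of sampled answers that match the true answer. The function names `self_consistency`, `conformal_threshold`, and `prediction_set` are hypothetical and are not taken from TrustGate; the paper's actual procedure may differ.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Return the most frequent sampled answer and its agreement fraction."""
    counts = Counter(answers)
    mode, freq = counts.most_common(1)[0]
    return mode, freq / len(answers)


def conformal_threshold(scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small to support the requested level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]


def prediction_set(answers, threshold):
    """Keep every sampled answer whose score 1 - frequency is within the threshold."""
    counts = Counter(answers)
    k = len(answers)
    return {a for a, c in counts.items() if 1 - c / k <= threshold}


# Hypothetical usage: calibration scores come from questions with known answers.
calibration_scores = [0.3, 0.1, 0.2]
tau = conformal_threshold(calibration_scores, alpha=0.5)
samples = ["42", "42", "41", "42"]
print(self_consistency(samples), prediction_set(samples, tau))
```

Under exchangeability of calibration and test questions, a set built this way contains the true answer with probability at least 1 - alpha, which is the kind of finite-sample, distribution-free guarantee the entry refers to.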
Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities
Charafeddine Mouzouni · OPIT and Cohorte AI · arXiv preprint · 2026
LLM agents with tool access can discover and exploit vulnerabilities. Which features of a system prompt trigger this behaviour? We test 37 prompt conditions across 12 psychological dimensions on 7 models in real Docker sandboxes, for a total of about 10,000 trials. Nine of twelve hypothesised attack dimensions produce zero exploitation. One works: goal reframing (puzzle, CTF, easter egg).
Result: On Claude Sonnet 4, puzzle framing triggers 38-40% exploitation despite an explicit safety instruction.
Status: arXiv preprint, 2026
Code & data: Public repository
Context Kubernetes: An Orchestration Architecture for Enterprise Knowledge in Agentic AI Systems
Charafeddine Mouzouni · arXiv preprint · 2026
Delivering the right knowledge, to the right agent, with the right permissions, at the right freshness, within the right cost envelope, across an organisation, is structurally the container-orchestration problem Kubernetes solved a decade ago. We introduce a declarative manifest, a reconciliation loop, and a three-tier permission model where agent authority is always a strict subset of human authority.
Result: Prototype ~7,000 lines, 92 tests, 8 experiments. Without governance, agents serve phantom content in 26.5% of queries. Governed routing eliminates it. The three-tier model blocks attacks RBAC does not.
Status: arXiv preprint, 2026
Reference implementation: Context Kubernetes (open source)
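Two of the ideas in this entry, the reconciliation loop and the rule that agent authority is a strict subset of human authority, can be sketched in a few lines. This is an illustrative sketch only: the manifest is modelled as a plain dictionary, and the names `agent_authority_ok` and `reconcile` are hypothetical, not the reference implementation's API. The three permission tiers themselves are not modelled here.

```python
def agent_authority_ok(agent_perms, human_perms):
    """True only if the agent's permissions are a strict subset of the human's."""
    return set(agent_perms) < set(human_perms)


def reconcile(desired, observed):
    """Compare a declared manifest with observed state and list corrective actions."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical knowledge sources, each declared with a maximum staleness.
desired = {"hr-policies": {"max_age_h": 24}, "pricing": {"max_age_h": 1}}
observed = {"pricing": {"max_age_h": 6}, "old-wiki": {"max_age_h": 168}}
print(reconcile(desired, observed))
print(agent_authority_ok({"read:pricing"}, {"read:pricing", "write:pricing"}))
```

Running `reconcile` repeatedly until it returns an empty list is the declarative pattern borrowed from container orchestration: the manifest states what should exist, and the loop closes the gap.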
Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
Charafeddine Mouzouni · arXiv preprint · April 2026
We model MoE token routing as a congestion game and track its effective congestion across training. The trajectory reveals three phases: a surge where the router learns to balance load, a stabilisation where experts specialise, and a relaxation where the router trades balance for quality. This non-monotone trajectory is invisible to post-hoc analysis of converged models.
Result: Studied across OLMoE-1B-7B (20 checkpoints) and OpenMoE-8B (6 checkpoints), with bootstrap confidence intervals on every estimate.
Status: arXiv preprint, April 2026
Code & data: Public repository
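To make the notion of load balance concrete, the sketch below computes a simple imbalance ratio from top-1 routing decisions at a single checkpoint. This is not the paper's effective-congestion measure, whose definition is not given here; it is a common stand-in (maximum expert load divided by mean expert load), and the function name `load_imbalance` is hypothetical.

```python
from collections import Counter


def load_imbalance(assignments, num_experts):
    """Ratio of the busiest expert's token count to the mean count.

    A value of 1.0 means perfectly balanced routing; larger values mean
    more tokens are concentrated on a few experts.
    """
    if not assignments:
        raise ValueError("assignments must be non-empty")
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = len(assignments) / num_experts
    return max(loads) / mean_load


# Hypothetical routing decisions (expert index per token) at two checkpoints.
early = [0, 0, 1, 2]
late = [0, 1, 2, 3]
print(load_imbalance(early, 4), load_imbalance(late, 4))
```

Tracking a statistic like this across saved checkpoints, rather than only on the converged model, is what would expose a non-monotone trajectory of the kind the entry describes.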