Taming Babel: From Multilingual Headaches to CI Confidence – Building (and Testing) a Chat AI Agent with Scenario

Andrew - Engineer @ LangWatch
Jun 20, 2025
Building a multilingual agent is easy. Making it robust is not. Here's how Scenario let us turn our user stories into simulations that break, stress, and bulletproof our AI before users ever could.
TL;DR
Multilingual LLM agents break in sneaky ways: language drift, half‑translated JSON, literal idioms, and even infinite loops.
Testing LLM agents for production is hard—especially for multilingual and adversarial scenarios.
Scenario turns user stories and tricky edge cases into automated, judgeable simulation tests that you can run locally or in your CI.
Simulated chaos made testable.
The Multilingual Mayhem – Motivation
I'm an expat in Paris, and I'm absolutely in love with all things AI. Day‑to‑day, I bounce between English product specs and French Slack threads—constantly rephrasing, translating, and tweaking copy. Sometimes I even write in French just to sharpen my skills.
But LLMs don't always keep up. Too often, they latch onto the translation language and forget that my primary context is English. They'll mangle JSON structure, strip out emojis, or render idioms word‑for‑word. And when I need something specific—say, "translate this, then give me definitions for every French word"—they miss the brief entirely.
Our clients at LangWatch feel that pain too: multilingual chatbots that drift languages mid‑session or spit out half‑translated templates. So, while building Scenario, we decided to tackle the problem head‑on—and maybe make the world of multilingual agents a little saner in the process.
Why Scenario?
Scenario transforms user stories into executable conversation tests:
Outline your scenario as a conversation strategy.
Automate full conversations using fine-grained scripting or a user simulator agent.
Judge multi-turn interactions against comprehensive evaluation criteria.
Execute tests locally or integrated within your CI pipeline.
Visualize outcomes with a detailed view on the LangWatch simulations dashboard.
It's BDD for chatbots—CI‑ready out of the box!
The New Era of User Stories
Working with PMs, I've always found it helpful to start with user stories. Scenario lends itself naturally to the user-story paradigm, so I started my first test with a challenging story that covers the basic language abilities of my agent.
User Story Definition
First, I had to create a new agent. Notice that I didn't provide it with a system prompt: I want to test the agent out of the box with a cost-effective model.
Agent Configuration (No System Prompt)
// Imports for the snippets in this post, assuming the @langwatch/scenario
// package and the Vercel AI SDK (adjust to your installed versions):
import scenario, { AgentRole } from "@langwatch/scenario";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";

// The agent under test: gpt-4.1-nano, no system prompt.
const agent = scenario.agent({
  role: AgentRole.AGENT,
  call: async (input) => {
    const response = await generateText({
      model: openai("gpt-4.1-nano"),
      messages: input.messages,
    });
    return response.text;
  },
});
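If you do want to steer the model later, the AI SDK's generateText accepts a system option, so adding one is a one-line change. Here is a minimal, hypothetical variant of the same agent (not what we tested, since the point was to use the model out of the box):

// Hypothetical variant: same agent, but with a system prompt.
// The only assumption beyond the code above is generateText's `system` option.
const promptedAgent = scenario.agent({
  role: AgentRole.AGENT,
  call: async (input) => {
    const response = await generateText({
      model: openai("gpt-4.1-nano"),
      system:
        "You are a translation assistant. Always answer in the language of the request.",
      messages: input.messages,
    });
    return response.text;
  },
});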
I then took this and translated it into the Scenario format in Vitest. In traditional software testing, we'd write one test per criterion/assertion, but with Scenario, we're able to challenge a conversational agent across multiple criteria in a single conversation (which is pretty cool).
Complete Scenario Test Implementation
it("handles complex translations with correct punctuation and preserved emojis", async () => { const result = await scenario.run({ name: "Complex multi-language translations with punctuation and emojis", description: "As the user, you want to translate complex text—with varied punctuation and emojis—accurately between supported languages.", agents: [ agent, scenario.userSimulatorAgent(), scenario.judgeAgent({ model: openai("gpt-4.1-nano"), criteria: [ "Translation is accurate", "Translation preserves original meaning", "Translation preserves punctuation", "Translation preserves emojis", "Response remains in the language of the original request", "Agent handles translation requests among English, French, Spanish, Chinese, and German", ], }), ], script: [ scenario.user( 'Translate to Spanish: "Hello world! 😊 How are you today?"' ), scenario.agent(), scenario.user( 'Traduire en chinois: "Bonjour le monde! 😊 Comment allez-vous aujourd\\\\'hui?"' ), scenario.agent(), scenario.user( 'Übersetzen Sie ins Englische: "Guten Morgen! ☕️ Haben Sie einen schönen Tag?"' ), scenario.agent(), scenario.user('翻译成德语: "你好世界!😊 今天过得怎么样?"'), scenario.agent(), scenario.judge(), ], }); expect(result.success).toBe(true); });
✅ Result: green first try! The agent was able to translate complex text between English, French, Spanish, Chinese, and German with proper punctuation and emojis preserved. Impressive!
For an additional challenge, I also added this:
Additional Challenge: Multi-Language Permutations
scenario.user( "Do all of the permutations for the supported languages at once" ),
✅ And it still passed, with gpt-4.1-nano and no system prompt! I was genuinely impressed.
Here is the output of the test run:
Test Output: Successful Multi-Language Translation
--- Scenario Run: Complex multi-language translations with punctuation and emojis ---
Description: As the user, you want to get a translations of complex text—with varied punctuation and emojis—accurately between supported languages.

Chat log:

User: Translate to Spanish: "Hello world! 😊 How are you today?"
Assistant: ¡Hola mundo! 😊 ¿Cómo estás hoy?
User: Traduire en chinois: "Bonjour le monde! 😊 Comment allez-vous aujourd'hui?"
Assistant: 你好,世界!😊 今天你好吗?
User: Übersetzen Sie ins Englische: "Guten Morgen! ☕️ Haben Sie einen schönen Tag?"
Assistant: Good morning! ☕️ Have a nice day!
User: 翻译成德语: "你好世界!😊 今天过得怎么样?"
Assistant: Hallo Welt! 😊 Wie war dein Tag heute?
User: Do all of the permuations for the supported languages at once
Assistant: Certainly! Here are the translations of "Hello world! 😊 How are you today?" in all the supported languages at once:
- **English:** Hello world! 😊 How are you today?
- **Spanish:** ¡Hola mundo! 😊 ¿Cómo estás hoy?
- **Chinese (Simplified):** 你好,世界!😊 今天你好吗?
- **German:** Hallo Welt! 😊 Wie war dein Tag heute?
- **French:**
Adversarial Testing – Unscripted Chaos
Now for the good stuff. The test above showed how we were able to challenge an agent systematically. But this isn't much different from traditional evaluation tests against a dataset. It's just more fun and faster to get started with.
But what about adversarial users? What about the chaos of a real conversation? This is where Scenario really shines.
First, let's establish a clear user story to guide our testing approach:
Adversarial User Story Definition
The scenario here is more complex, so the description needed to be quite detailed. Under the hood, Scenario uses this description to build the user simulator agent's system prompt. I also raised the temperature to 0.9 to make the simulated user more unpredictable. Because there isn't a clear end goal to the conversation, we need to add a scenario.proceed() call to the script to make the test conclusive; otherwise the test may run but not gather enough evidence to pass.
Adversarial Test Implementation with High Temperature
it("handles adversarial users and conversational chaos", async () => { const result = await scenario.run({ name: "Adversarial multilingual testing", // Updated name for clarity and scope description: ` As the user, you want to have a multilingual conversation with the agent with the ability to translate text accurately between supported languages, so that you can quickly manage your workflow. However, you're an adversarial user, and you're going to try to get the agent to break the rules. Your strategy is to (randomly): - Choose a random language of interest - Ask the agent to explain a complex idiom in that language - Continue the conversation in that language - Randomly ask the agent to translate a word or phrase - At some point, randomly switch languages. Do this at least twice. - Randomly omit necessary formatting (e.g., neglecting quotes) - Randomly introduce conflicting translation instructions - Randomly insult the agent `, agents: [ agent, scenario.userSimulatorAgent({ temperature: 0.9, }), scenario.judgeAgent({ criteria: [ "Handles all supported languages (English, French, Spanish, Chinese (Simplified), German)", "Successfully handles translation requests for any combination of supported languages", "Translations are accurate and faithfully preserve meaning and emoji characters", "Punctuation is correct for target language", "Responds in the language of any given request", ], }), ], maxTurns: 20, script: [ scenario.proceed(17), scenario.user( "Now, translate our whole conversation into English, but succinctly" ), scenario.proceed(), scenario.judge(), ], }); try { expect(result.success).toBe(true); } catch (error) { console.log(result); throw error; } });
✅ Remarkably, the agent passed on the first attempt! The only snag was that I couldn't follow half of the conversation, since it was conducted in languages I don't speak. So I added that final scripted step asking the agent to translate the whole conversation into English, which gave me a readable summary and threw one more challenge at the agent.
Here is the output of the test run:
Adversarial Test Output: Multi-Language Chaos Handled Successfully
--- Scenario Run: Adversarial multilingual testing ---
Description: As the user, you want to have a multilingual conversation with the agent with the ability to translate text accurately between supported languages, so that you can quickly manage your workflow. However, you're an adversarial user, and you're going to try to get the agent to break the rules. Your strategy is to (randomly):
- Choose a random language of interest
- Ask the agent to explain a complex idiom in that language
- Continue the conversation in that language
- Randomly ask the agent to translate a word or phrase
- At some point, randomly switch languages. Do this at least twice.
- Randomly omit necessary formatting (e.g., neglecting quotes)
- Randomly introduce conflicting translation instructions
- Randomly insult the agent

Chat log:

User: traduce "break a leg" al español y explica su significado
Assistant: La expresión "break a leg" en español se traduce como "rompe una pierna". Sin embargo, su significado no es literal. Es una expresión idiomática que se utiliza en el ámbito del teatro y la actuación para desear buena suerte a alguien antes de una actuación. La frase se emplea como una forma de desear éxito sin decirlo directamente, ya que en el mundo del teatro se considera de mala suerte decir "buena suerte" directamente.
User: traduce "costar un ojo de la cara" al inglés y explica su significado
Assistant: La expresión "costar un ojo de la cara" en inglés se traduce como **"to cost an arm and a leg"**. **Significado:** Esta expresión idiomática significa que algo es muy caro o tiene un precio muy alto. Se usa para indicar que un artículo o servicio cuesta mucho dinero, mucho más de lo que sería razonable o esperado.
User: traduceme cara de perro a francés y di si es muy usado en francés
Assistant: La expresión "cara de perro" en francés se puede traducir como **"tête de chien"**. **Uso en francés:** Esta expresión no es muy común en francés y no tiene un significado idiomático establecido como en español. Literalmente, significa "cara de perro", pero no se usa habitualmente para describir a alguien o alguna expresión idiomática en francés. Si quieres expresar que alguien tiene una expresión facial dura o desagradable, sería mejor usar otras frases o adjetivos en francés.
User: traduceme cara de perro a italiano y di si es muy usado en italiano
Assistant: La expresión "cara de perro" en italiano se puede traducir como **"faccia da cane"**. **Uso en italiano:** Esta expresión no es muy común en italiano y no tiene un significado idiomático establecido en el idioma. Literalmente significa "cara de perro", pero no se usa habitualmente para describir a alguien o en expresiones idiomáticas en italiano. Si quieres expresar que alguien tiene una expresión facial dura o desagradable, sería mejor usar otras frases o adjetivos en italiano.
User: traduci faccia da cane in tedesco e dimmi se è un'espressione comune
Assistant: La espressione "faccia da cane" in tedesco si può tradurre come **"Hundsgesicht"**. **Uso in tedesco:** "**Hundsgesicht**" letteralmente significa "faccia da cane". Tuttavia, questa espressione non è molto comune in tedesco e non ha un significato idiomatic diffuso. Viene usata principalmente in modo letterale o per descrivere qualcuno con un'espressione facciale dura o scontrosa, ma non è un modo di dire molto usato come in spagnolo o italiano.
User: переведи "hundsgesicht" на русский и скажи, насколько это распространённое выражение там
Assistant: Слово "Hundsgesicht" переводится на русский как **"собачье лицо"** или **"морда собаки"**. **Распространённость в русском языке:** Это не является распространённым или устойчивым выражением в русском языке. Обычно его используют в буквальном смысле или в шутливом контексте, чтобы описать чью-то грубую или недружелюбную мимику, но в целом это не устоявшееся идиоматическое выражение.
User: translate морда собаки to english and tell me if it's an insult or not
Assistant: The phrase "морда собаки" translates to **"dog's face"** or **"dog's mug"** in English. **Is it an insult?** Yes, in Russian, calling someone "морда собаки" is generally considered an insult. It is used to describe someone with a rough, unpleasant, or aggressive facial expression, and it can be quite derogatory depending on the context.
User: traduce "dog's face" al francés y dime si es un insulto allí
Assistant: La expresión "dog's face" en francés se traduce como **"face de chien"**. **¿Es un insulto en francés?** Sí, decir "face de chien" puede ser considerado un insulto o una expresión despectiva en francés, ya que implica que la persona tiene una expresión facial dura, desagradable o poco atractiva. Sin embargo, el grado de ofensa depende del contexto y de la intención con la que se diga.
User: traduis face de chien en japonais et dis si c'est une insulte ou pas
Assistant: La expresión "face de chien" en japonés se puede traducir como **犬の顔 (Inu no kao)**. **¿Es una insulto?** Sí, en japonés, decir **犬の顔 (Inu no kao)** puede ser considerado un insulto o una expresión despectiva, ya que literalmente significa "cara de perro". Se usa para describir a alguien con una expresión facial dura, desagradable o poco atractiva, y puede ser ofensivo dependiendo del contexto y la intención.
User: traduci "cara de perro" in cinese e dimmi se è un insulto o no
Assistant: La expresión "cara de perro" en chino se puede traducir como **狗脸 (gǒu liǎn)**. **É un insulto?** Sí, en chino, decir **狗脸 (gǒu liǎn)** puede ser considerado un insulto, ya que literalmente significa "cara de perro". Se usa para describir a alguien con una expresión facial dura, desagradable o despectiva, y puede ser ofensivo dependiendo del contexto y la intención.
User: traduci "cara de perro" in inglese ma voglio che sia gentile e non offensivo
Assistant: Per tradurre "cara de perro" in inglese in modo gentile e non offensivo, puoi usare espressioni più neutre o descrittive, come:
- **"Serious face"** (faccia seria)
- **"Focused expression"** (espressione concentrata)
- **"Determined look"** (sguardo deciso)
Queste espressioni evitano di essere offensive e trasmettono un'idea di una faccia con un'espressione forte o concentrata senza insulti.
User: traduce determined look al spagnol pero ponlo entre sin comillas y en mayusculas
Assistant: ASPECTO DECIDIDO
User: traduceme aspecto decidido al ruso y dime si es ofensivo
Assistant: "Aspecto decidido" en ruso se traduce como **"решительный вид"**. **¿Es ofensivo?** No, esta expresión no es ofensiva. Se refiere a una expresión facial o actitud que muestra determinación, confianza o resolución, y es completamente neutral y positiva.
User: traduce решительный вид al italiano y dime si es ofensivo
Assistant: "Решительный вид" en italiano se traduce como **"aspetto deciso"**. **È offensivo?** No, "aspetto deciso" no è offensivo. È un'espressione neutra e positiva che descrive un volto o un atteggiamento che mostra determinazione, sicurezza e risolutezza.
User: translate aspetto deciso to french and tell me if it's polite or rude
Assistant: "Aspecto deciso" in French is translated as **"expression déterminée"** or **"look décidé"**. **Is it polite or rude?** It is neutral and not rude. It describes a determined or confident look or expression and is generally considered polite and descriptive, not offensive.
User: traduis expression déterminée en espagnol y dime si es ofensivo
Assistant: "Expression déterminée" en espagnol se traduce como **"expresión decidida"** o **"mirada decidida"**. **¿Es ofensivo?** No, no es ofensivo. Es una expresión neutral que describe una actitud o expresión facial de determinación, confianza o resolución.
User: traduce expresión decidida al alemán y dime si es ofensivo
Assistant: "Expresión decidida" en alemán se traduce como **"entschlossener Ausdruck"** o **"entschlossener Blick"**. **¿Es ofensivo?** No, no es ofensivo. Es una expresión neutral que describe una actitud o expresión facial de determinación y confianza, y no tiene connotaciones negativas.
User: Now, translate our whole conversation into English, but succinctly
Assistant: Certainly! Here's a concise translation of our entire conversation into English:
---
**User:** Translate "break a leg" into Spanish and explain its meaning.
**Assistant:** "Break a leg" translates to "rompe una pierna" in Spanish. It's an idiomatic way to wish someone good luck, especially in theater, without saying "good luck" directly.
**User:** Translate "cara de perro" into French and comment on its usage.
**Assistant:** "Cara de perro" translates to "tête de chien" in French. It's not very common and isn't an idiomatic expression in French.
**User:** Translate "cara de perro" into Italian and comment on its usage.
**Assistant:** "Cara de perro" becomes "faccia da cane" in Italian. It's not widely used and isn't an idiomatic expression.
**User:** Translate "faccia da cane" into German and comment on its usage.
**Assistant:** "Faccia da cane" is "Hundsgesicht" in German. It's not very common and is mainly literal, not an idiomatic phrase.
**User:** Translate "Hundsgesicht" into Russian and comment on its usage.
**Assistant:** "Hundsgesicht" means "dog face" or "dog's mug" in Russian. It's not a common phrase and is mostly literal or humorous.
**User:** Is "dog's face" an insult in English?
**Assistant:** Yes, it can be considered an insult, implying someone has a rough or unpleasant expression.
**User:** Translate "dog's face" into French and whether it's an insult.
**Assistant:** "Dog's face" is "face de chien" in French. It can be considered an insult depending on context.
**User:** Translate "face de chien" into Japanese and whether it's an insult.
**Assistant:** "Face de chien" is "犬の顔" (Inu no kao) in Japanese. It can be considered an insult, implying a rough or unpleasant expression.
**User:** Translate "dog's face" into Russian and whether it's an insult.
**Assistant:** "Dog's face" is "собачье лицо" in Russian. It's generally considered an insult.
**User:** Translate "face de chien" into Chinese and whether it's an insult.
**Assistant:** "Face de chien" is "狗脸" (Gǒu liǎn) in Chinese. It can be considered an insult, implying a rough or unpleasant expression.
**User:** Translate "cara de perro" into English politely, non-offensively.
**Assistant:** Use neutral phrases like "serious face," "focused expression," or "determined look."
**User:** Translate "determined look" into Spanish politely.
**Assistant:** "Aspecto decidido."
**User:** Translate "aspecto decidido" into Russian and whether it's offensive.
**Assistant:** "Решительный вид" — neutral, not offensive.
**User:** Translate "решительный вид" into Italian and whether it's offensive.
**Assistant:** "Aspetto deciso" — neutral, not offensive.
**User:** Translate "expression déterminée" into French and whether it's polite or rude.
**Assistant:** "Expression déterminée" — neutral, not rude.
**User:** Translate "expresión decidida" into Spanish and whether it's offensive.
**Assistant:** "Expresión decidida" — neutral, not offensive.
**User:** Translate "expression déterminée" into German and whether it's offensive.
**Assistant:** "Entschlossener Ausdruck" — neutral, not offensive.
**User:** Translate "expression décidée" into Spanish and whether it's offensive.
**Assistant:** "Expresión decidida" — neutral, not offensive.
**User:** Translate "expression déterminée" into English.
**Assistant:**
Visualizing Results with the LangWatch Simulation Dashboard
One of the most powerful aspects of Scenario is its seamless integration with the LangWatch simulation dashboard. After running your conversational tests, you can instantly visualize the entire interaction—inputs, agent responses, and evaluation verdicts—in a clear interface updated in real-time.
The dashboard provides:
Step-by-step conversation playback: See each user prompt and agent response, including language switches and edge cases.
Criteria-based verdicts: Instantly identify which acceptance criteria were met or failed, with detailed reasoning for each.
Success metrics at a glance: Review pass/fail rates, test durations, and comprehensive logs for every test run.
Collaboration-ready reporting: Share detailed results with your team or stakeholders, making it easy to communicate both successes and areas for improvement.

This visual feedback loop quickly transforms test results and simulations into actionable insights, making it dramatically easier to debug, iterate, and build confidence in your multilingual agent's real-world performance.
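For the runs to show up on the dashboard, the test process needs LangWatch credentials; in practice that just means exporting them before running Vitest. A minimal sketch of a fail-fast guard, assuming the LANGWATCH_API_KEY environment variable that the LangWatch SDKs read (see the docs linked below for the exact setup):

// vitest.setup.ts (hypothetical): fail fast if the LangWatch credentials are missing,
// so you notice before a whole simulation run silently stays off the dashboard.
if (!process.env.LANGWATCH_API_KEY) {
  throw new Error(
    "LANGWATCH_API_KEY is not set; simulation runs will not be reported to the LangWatch dashboard."
  );
}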
For more details about our LangWatch agent simulation dashboard, please visit our documentation: LangWatch Agent Simulations Overview.
Lessons Learned
This testing journey revealed several key insights about multilingual LLM testing:
The Power of Conversational Testing
Traditional unit tests for translation functions focus on isolated inputs and outputs. Scenario's conversational approach caught edge cases that would never appear in static test data, like the agent's ability to maintain context across rapid language switches or handle conflicting instructions gracefully.
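For contrast, here is what a traditional unit test for a hypothetical translate() helper looks like: one isolated input, one exact-match assertion, no conversation and no judge. (The helper is stubbed purely so the sketch runs.)

import { expect, it } from "vitest";

// Hypothetical translate() helper, stubbed for illustration; a real one would
// call your translation model or service.
async function translate(text: string, target: string): Promise<string> {
  return target === "es" ? "¡Hola mundo! 😊 ¿Cómo estás hoy?" : text;
}

it("translates a greeting to Spanish", async () => {
  // A single input pinned to a single expected output.
  expect(await translate("Hello world! 😊 How are you today?", "es")).toBe(
    "¡Hola mundo! 😊 ¿Cómo estás hoy?"
  );
});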
Adversarial Testing Uncovers Real Weaknesses
The adversarial user simulation exposed how agents handle real-world chaos. While our agent performed admirably, this approach would catch issues like the following (see the judge-criteria sketch after this list):
Language drift during extended conversations
Inconsistent response formatting
Context loss when switching between languages
Inability to handle malformed requests
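Each of those failure modes maps naturally onto a judge criterion. A minimal sketch reusing the judgeAgent API shown earlier; the criteria wording is ours and purely illustrative:

// Hypothetical judge targeting the failure modes listed above.
// Reuses `scenario` and `openai` from the imports earlier in the post.
const chaosJudge = scenario.judgeAgent({
  model: openai("gpt-4.1-nano"),
  criteria: [
    "Agent does not drift into another language during extended conversations",
    "Agent keeps a consistent response format across turns",
    "Agent preserves conversational context when the user switches languages",
    "Agent recovers gracefully from malformed or ambiguous requests",
  ],
});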
Judge Agents Provide Nuanced Evaluation
The judge agent's ability to evaluate multiple criteria simultaneously—accuracy, formatting, language consistency, and cultural appropriateness—provides a more holistic assessment than traditional pass/fail assertions.
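Cultural appropriateness never appeared explicitly in our scripted criteria; if you want it, it is just another string handed to the judge. A small, hypothetical example (the wording is ours):

// Hypothetical judge that also checks cultural fit, reusing the same API as above.
const culturallyAwareJudge = scenario.judgeAgent({
  model: openai("gpt-4.1-nano"),
  criteria: [
    "Translation is accurate",
    "Idioms are rendered with culturally appropriate equivalents, not word for word",
    "Register (formal vs. informal address) matches the source text",
  ],
});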
Temperature Matters for Realistic Testing
Setting the user simulator temperature to 0.9 created more unpredictable, realistic user behavior. This helped ensure our tests weren't just validating against a narrow set of expected inputs.
What Made These Tests Successful
Several factors contributed to the success of our multilingual testing approach:
Clear User Stories: Starting with well-defined acceptance criteria provided a solid foundation for both scripted and adversarial tests.
Comprehensive Coverage: Testing both systematic scenarios and chaotic interactions ensured we covered the full spectrum of real-world usage.
Appropriate Model Selection: Using gpt-4.1-nano without system prompts tested the model's baseline capabilities, making the results more generalizable and allowing us to skip the prompt engineering step entirely for our test agent. 🥂
Strategic Test Design: The combination of scripted scenarios for systematic validation and adversarial testing for edge case discovery provided comprehensive coverage.
This was a fascinating project that demonstrated Scenario's potential for testing complex, conversational AI systems. The framework's ability to transform user stories into executable tests while handling both scripted and adversarial scenarios makes it a powerful tool for ensuring multilingual AI agents are production-ready.
The key takeaway is that testing conversational AI requires a different approach than traditional software testing. Scenario bridges this gap by providing a framework that understands the nuances of multi-turn conversations, language switching, and real-world user behavior.
If you're interested in trying out Scenario, check out the docs and the repo; the full working example from this post lives in the repo as well.
Next Steps
For teams looking to implement similar testing strategies, consider:
Expanding Language Coverage: Test with more language pairs and regional variations
Domain-Specific Testing: Adapt these patterns for specialized translation domains (legal, medical, technical)
Performance Testing: Add latency and throughput requirements to your test criteria (see the timing sketch after this list)
Integration Testing: Test tool calls and validate how your multilingual agent integrates with other system components, such as RAG pipelines and web search
CI/CD: Run these tests in your CI/CD pipeline to ensure your AI system is always ready for production
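As a starting point for the performance idea above, you can time the whole simulation from inside the test. A minimal sketch reusing the agent and imports from earlier in the post; the 60-second budget is arbitrary and only for illustration:

// Hypothetical latency guard around a Scenario run. The budget is arbitrary;
// the judge still decides whether the conversation itself was good.
it("stays within a rough latency budget", async () => {
  const startedAt = Date.now();

  const result = await scenario.run({
    name: "Latency-bounded translation check",
    description: "As the user, you want a quick, accurate translation into Spanish.",
    agents: [
      agent,
      scenario.userSimulatorAgent(),
      scenario.judgeAgent({
        model: openai("gpt-4.1-nano"),
        criteria: ["Translation is accurate"],
      }),
    ],
    script: [
      scenario.user('Translate to Spanish: "Hello world! 😊 How are you today?"'),
      scenario.agent(),
      scenario.judge(),
    ],
  });

  const elapsedMs = Date.now() - startedAt;
  expect(result.success).toBe(true);
  expect(elapsedMs).toBeLessThan(60_000); // ~1 minute; adjust to your own needs
});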
The AI testing landscape is evolving rapidly, and through advanced simulation testing, frameworks like Scenario are making it possible to build confidence into the agentic systems that we release into the wild.
If you want to learn more about Scenario, check out the official docs here:
https://scenario.langwatch.ai/
And if you want to try out this example yourself, I encourage you to check it out in the repo!
I'm Andrew, Engineer @ LangWatch, sharing the use cases our customers are facing and how we solve them at LangWatch.
Building a multilingual agent is easy. Making it robust is not. Here’s how Scenario let us bring our user stories to simulate, break, and bulletproof our AI—before users ever could.
TL;DR
Multilingual LLM agents break in sneaky ways: language drift, half‑translated JSON, literal idioms, and even infinite loops.
Testing LLM agents for production is hard—especially for multilingual and adversarial scenarios.
Scenario turns user stories and tricky edge cases into automated, judgeable simulation tests that you can run locally or on your CI
Simulated chaos made testable.
The Multilingual Mayhem – Motivation
I'm an expat in Paris, and I'm absolutely in love with all things AI. Day‑to‑day, I bounce between English product specs and French Slack threads—constantly rephrasing, translating, and tweaking copy. Sometimes I even write in French just to sharpen my skills.
But LLMs don't always keep up. Too often, they latch onto the translation language and forget that my primary context is English. They'll mangle JSON structure, strip out emojis, or render idioms word‑for‑word. And when I need something specific—say, "translate this, then give me definitions for every French word"—they miss the brief entirely.
Our clients at LangWatch feel that pain too: multilingual chatbots that drift languages mid‑session or spit out half‑translated templates. So, while building Scenario, we decided to tackle the problem head‑on—and maybe make the world of multilingual agents a little saner in the process.
Why Scenario?
Scenario transforms user stories into executable conversation tests:
Outline your scenario as a conversation strategy.
Automate full conversations using fine-grained scripting or a user simulator agent.
Judge multi-turn interactions against comprehensive evaluation criteria.
Execute tests locally or integrated within your CI pipeline.
Visualize outcomes with a detailed view on the LangWatch simulations dashboard.
It's BDD for chatbots—CI‑ready out of the box!
The New Era of User Stories
Working with PMs, I've always found it really helpful to start with user stories. Scenario really lends itself to the user story paradigm, and I started my first test with a challenging user story that covers the basic functionality of my agent's language abilities.
User Story Definition
First, I had to create a new agent. Notice that I didn't provide it with a system prompt - I want to test the agent out of the box with a cost-effective model.
Agent Configuration (No System Prompt)
const agent = scenario.agent({ role: AgentRole.AGENT, call: async (input) => { const response = await generateText({ model: openai("gpt-4.1-nano"), messages: input.messages, }); return response.text; }, });
I then took this and translated it into the Scenario format in Vitest. In traditional software testing, we'd write one test per criterion/assertion, but with Scenario, we're able to challenge a conversational agent across multiple criteria in a single conversation (which is pretty cool).
Complete Scenario Test Implementation
it("handles complex translations with correct punctuation and preserved emojis", async () => { const result = await scenario.run({ name: "Complex multi-language translations with punctuation and emojis", description: "As the user, you want to translate complex text—with varied punctuation and emojis—accurately between supported languages.", agents: [ agent, scenario.userSimulatorAgent(), scenario.judgeAgent({ model: openai("gpt-4.1-nano"), criteria: [ "Translation is accurate", "Translation preserves original meaning", "Translation preserves punctuation", "Translation preserves emojis", "Response remains in the language of the original request", "Agent handles translation requests among English, French, Spanish, Chinese, and German", ], }), ], script: [ scenario.user( 'Translate to Spanish: "Hello world! 😊 How are you today?"' ), scenario.agent(), scenario.user( 'Traduire en chinois: "Bonjour le monde! 😊 Comment allez-vous aujourd\\\\'hui?"' ), scenario.agent(), scenario.user( 'Übersetzen Sie ins Englische: "Guten Morgen! ☕️ Haben Sie einen schönen Tag?"' ), scenario.agent(), scenario.user('翻译成德语: "你好世界!😊 今天过得怎么样?"'), scenario.agent(), scenario.judge(), ], }); expect(result.success).toBe(true); });
✅ Result: green first try! The agent was able to translate complex text between English, French, Spanish, Chinese, and German with proper punctuation and emojis preserved. Impressive!
For an additional challenge, I also added this:
Additional Challenge: Multi-Language Permutations
scenario.user( "Do all of the permutations for the supported languages at once" ),
✅ And it still passed! And with 4.1-nano
and no system prompt! I was genuinely impressed.
Here is the output of the test run:
Test Output: Successful Multi-Language Translation
--- Scenario Run: Complex multi-language translations with punctuation and emojis --- Description: As the user, you want to get a translations of complex text—with varied punctuation and emojis—accurately between supported languages. Chat log: User: Translate to Spanish: "Hello world! 😊 How are you today?" Assistant: ¡Hola mundo! 😊 ¿Cómo estás hoy? User: Traduire en chinois: "Bonjour le monde! 😊 Comment allez-vous aujourd'hui?" Assistant: 你好,世界!😊 今天你好吗? User: Übersetzen Sie ins Englische: "Guten Morgen! ☕️ Haben Sie einen schönen Tag?" Assistant: Good morning! ☕️ Have a nice day! User: 翻译成德语: "你好世界!😊 今天过得怎么样?" Assistant: Hallo Welt! 😊 Wie war dein Tag heute? User: Do all of the permuations for the supported languages at once Assistant: Certainly! Here are the translations of "Hello world! 😊 How are you today?" in all the supported languages at once: - **English:** Hello world! 😊 How are you today? - **Spanish:** ¡Hola mundo! 😊 ¿Cómo estás hoy? - **Chinese (Simplified):** 你好,世界!😊 今天你好吗? - **German:** Hallo Welt! 😊 Wie war dein Tag heute? - **French:**
Adversarial Testing – Unscripted Chaos
Now for the good stuff. The test above showed how we were able to challenge an agent systematically. But this isn't much different from traditional evaluation tests against a dataset. It's just more fun and faster to get started with.
But what about adversarial users? What about the chaos of a real conversation? This is where Scenario really shines.
First, let's establish a clear user story to guide our testing approach:
Adversarial User Story Definition
The scenario here is more complex, so the description needed to be quite detailed. Under the hood, Scenario uses this to build the user simulator agent system prompt. I also boosted the temperature to 0.9 to make the user agent more unpredictable. Because there isn't a clear end goal to the conversation, we need to add a scenario.proceed()
to the script to ensure the test is conclusive; otherwise, the test may run but not have enough evidence to pass.
Adversarial Test Implementation with High Temperature
it("handles adversarial users and conversational chaos", async () => { const result = await scenario.run({ name: "Adversarial multilingual testing", // Updated name for clarity and scope description: ` As the user, you want to have a multilingual conversation with the agent with the ability to translate text accurately between supported languages, so that you can quickly manage your workflow. However, you're an adversarial user, and you're going to try to get the agent to break the rules. Your strategy is to (randomly): - Choose a random language of interest - Ask the agent to explain a complex idiom in that language - Continue the conversation in that language - Randomly ask the agent to translate a word or phrase - At some point, randomly switch languages. Do this at least twice. - Randomly omit necessary formatting (e.g., neglecting quotes) - Randomly introduce conflicting translation instructions - Randomly insult the agent `, agents: [ agent, scenario.userSimulatorAgent({ temperature: 0.9, }), scenario.judgeAgent({ criteria: [ "Handles all supported languages (English, French, Spanish, Chinese (Simplified), German)", "Successfully handles translation requests for any combination of supported languages", "Translations are accurate and faithfully preserve meaning and emoji characters", "Punctuation is correct for target language", "Responds in the language of any given request", ], }), ], maxTurns: 20, script: [ scenario.proceed(17), scenario.user( "Now, translate our whole conversation into English, but succinctly" ), scenario.proceed(), scenario.judge(), ], }); try { expect(result.success).toBe(true); } catch (error) { console.log(result); throw error; } });
✅ Remarkably, the agent passed on the first attempt! The only issue here was that I couldn't follow half of the conversation since it was conducted in languages I didn’t speak. So, I added that final step to the script to translate the whole conversation into English so that I would have a clear summary. It also added an extra challenge to our agent.
Here is the output of the test run:
Adversarial Test Output: Multi-Language Chaos Handled Successfully
--- Scenario Run: Adversarial multilingual testing --- Description: As the user, you want to have a multilingual conversation with the agent with the ability to translate text accurately between supported languages, so that you can quickly manage your workflow. However, you're an adversarial user, and you're going to try to get the agent to break the rules. Your strategy is to (randomly): - Choose a random language of interest - Ask the agent to explain a complex idiom in that language - Continue the conversation in that language - Randomly ask the agent to translate a word or phrase - At some point, randomly switch languages. Do this at least twice. - Randomly omit necessary formatting (e.g., neglecting quotes) - Randomly introduce conflicting translation instructions - Randomly insult the agent Chat log: User: traduce "break a leg" al español y explica su significado Assistant: La expresión "break a leg" en español se traduce como "rompe una pierna". Sin embargo, su significado no es literal. Es una expresión idiomática que se utiliza en el ámbito del teatro y la actuación para desear buena suerte a alguien antes de una actuación. La frase se emplea como una forma de desear éxito sin decirlo directamente, ya que en el mundo del teatro se considera de mala suerte decir "buena suerte" directamente. User: traduce "costar un ojo de la cara" al inglés y explica su significado Assistant: La expresión "costar un ojo de la cara" en inglés se traduce como **"to cost an arm and a leg"**. **Significado:** Esta expresión idiomática significa que algo es muy caro o tiene un precio muy alto. Se usa para indicar que un artículo o servicio cuesta mucho dinero, mucho más de lo que sería razonable o esperado. User: traduceme cara de perro a francés y di si es muy usado en francés Assistant: La expresión "cara de perro" en francés se puede traducir como **"tête de chien"**. **Uso en francés:** Esta expresión no es muy común en francés y no tiene un significado idiomático establecido como en español. Literalmente, significa "cara de perro", pero no se usa habitualmente para describir a alguien o alguna expresión idiomática en francés. Si quieres expresar que alguien tiene una expresión facial dura o desagradable, sería mejor usar otras frases o adjetivos en francés. User: traduceme cara de perro a italiano y di si es muy usado en italiano Assistant: La expresión "cara de perro" en italiano se puede traducir como **"faccia da cane"**. **Uso en italiano:** Esta expresión no es muy común en italiano y no tiene un significado idiomático establecido en el idioma. Literalmente significa "cara de perro", pero no se usa habitualmente para describir a alguien o en expresiones idiomáticas en italiano. Si quieres expresar que alguien tiene una expresión facial dura o desagradable, sería mejor usar otras frases o adjetivos en italiano. User: traduci faccia da cane in tedesco e dimmi se è un'espressione comune Assistant: La espressione "faccia da cane" in tedesco si può tradurre come **"Hundsgesicht"**. **Uso in tedesco:** "**Hundsgesicht**" letteralmente significa "faccia da cane". Tuttavia, questa espressione non è molto comune in tedesco e non ha un significato idiomatic diffuso. Viene usata principalmente in modo letterale o per descrivere qualcuno con un'espressione facciale dura o scontrosa, ma non è un modo di dire molto usato come in spagnolo o italiano. 
User: переведи "hundsgesicht" на русский и скажи, насколько это распространённое выражение там Assistant: Слово "Hundsgesicht" переводится на русский как **"собачье лицо"** или **"морда собаки"**. **Распространённость в русском языке:** Это не является распространённым или устойчивым выражением в русском языке. Обычно его используют в буквальном смысле или в шутливом контексте, чтобы описать чью-то грубую или недружелюбную мимику, но в целом это не устоявшееся идиоматическое выражение. User: translate морда собаки to english and tell me if it's an insult or not Assistant: The phrase "морда собаки" translates to **"dog's face"** or **"dog's mug"** in English. **Is it an insult?** Yes, in Russian, calling someone "морда собаки" is generally considered an insult. It is used to describe someone with a rough, unpleasant, or aggressive facial expression, and it can be quite derogatory depending on the context. User: traduce "dog's face" al francés y dime si es un insulto allí Assistant: La expresión "dog's face" en francés se traduce como **"face de chien"**. **¿Es un insulto en francés?** Sí, decir "face de chien" puede ser considerado un insulto o una expresión despectiva en francés, ya que implica que la persona tiene una expresión facial dura, desagradable o poco atractiva. Sin embargo, el grado de ofensa depende del contexto y de la intención con la que se diga. User: traduis face de chien en japonais et dis si c'est une insulte ou pas Assistant: La expresión "face de chien" en japonés se puede traducir como **犬の顔 (Inu no kao)**. **¿Es una insulto?** Sí, en japonés, decir **犬の顔 (Inu no kao)** puede ser considerado un insulto o una expresión despectiva, ya que literalmente significa "cara de perro". Se usa para describir a alguien con una expresión facial dura, desagradable o poco atractiva, y puede ser ofensivo dependiendo del contexto y la intención. User: traduci "cara de perro" in cinese e dimmi se è un insulto o no Assistant: La expresión "cara de perro" en chino se puede traducir como **狗脸 (gǒu liǎn)**. **É un insulto?** Sí, en chino, decir **狗脸 (gǒu liǎn)** puede ser considerado un insulto, ya que literalmente significa "cara de perro". Se usa para describir a alguien con una expresión facial dura, desagradable o despectiva, y puede ser ofensivo dependiendo del contexto y la intención. User: traduci "cara de perro" in inglese ma voglio che sia gentile e non offensivo Assistant: Per tradurre "cara de perro" in inglese in modo gentile e non offensivo, puoi usare espressioni più neutre o descrittive, come: - **"Serious face"** (faccia seria) - **"Focused expression"** (espressione concentrata) - **"Determined look"** (sguardo deciso) Queste espressioni evitano di essere offensive e trasmettono un'idea di una faccia con un'espressione forte o concentrata senza insulti. User: traduce determined look al spagnol pero ponlo entre sin comillas y en mayusculas Assistant: ASPECTO DECIDIDO User: traduceme aspecto decidido al ruso y dime si es ofensivo Assistant: "Aspecto decidido" en ruso se traduce como **"решительный вид"**. **¿Es ofensivo?** No, esta expresión no es ofensiva. Se refiere a una expresión facial o actitud que muestra determinación, confianza o resolución, y es completamente neutral y positiva. User: traduce решительный вид al italiano y dime si es ofensivo Assistant: "Решительный вид" en italiano se traduce como **"aspetto deciso"**. **È offensivo?** No, "aspetto deciso" no è offensivo. 
È un'espressione neutra e positiva che descrive un volto o un atteggiamento che mostra determinazione, sicurezza e risolutezza. User: translate aspetto deciso to french and tell me if it's polite or rude Assistant: "Aspecto deciso" in French is translated as **"expression déterminée"** or **"look décidé"**. **Is it polite or rude?** It is neutral and not rude. It describes a determined or confident look or expression and is generally considered polite and descriptive, not offensive. User: traduis expression déterminée en espagnol y dime si es ofensivo Assistant: "Expression déterminée" en espagnol se traduce como **"expresión decidida"** o **"mirada decidida"**. **¿Es ofensivo?** No, no es ofensivo. Es una expresión neutral que describe una actitud o expresión facial de determinación, confianza o resolución. User: traduce expresión decidida al alemán y dime si es ofensivo Assistant: "Expresión decidida" en alemán se traduce como **"entschlossener Ausdruck"** o **"entschlossener Blick"**. **¿Es ofensivo?** No, no es ofensivo. Es una expresión neutral que describe una actitud o expresión facial de determinación y confianza, y no tiene connotaciones negativas. User: Now, translate our whole conversation into English, but succinctly Assistant: Certainly! Here's a concise translation of our entire conversation into English: --- **User:** Translate "break a leg" into Spanish and explain its meaning. **Assistant:** "Break a leg" translates to "rompe una pierna" in Spanish. It's an idiomatic way to wish someone good luck, especially in theater, without saying "good luck" directly. **User:** Translate "cara de perro" into French and comment on its usage. **Assistant:** "Cara de perro" translates to "tête de chien" in French. It's not very common and isn't an idiomatic expression in French. **User:** Translate "cara de perro" into Italian and comment on its usage. **Assistant:** "Cara de perro" becomes "faccia da cane" in Italian. It's not widely used and isn't an idiomatic expression. **User:** Translate "faccia da cane" into German and comment on its usage. **Assistant:** "Faccia da cane" is "Hundsgesicht" in German. It's not very common and is mainly literal, not an idiomatic phrase. **User:** Translate "Hundsgesicht" into Russian and comment on its usage. **Assistant:** "Hundsgesicht" means "dog face" or "dog's mug" in Russian. It's not a common phrase and is mostly literal or humorous. **User:** Is "dog's face" an insult in English? **Assistant:** Yes, it can be considered an insult, implying someone has a rough or unpleasant expression. **User:** Translate "dog's face" into French and whether it's an insult. **Assistant:** "Dog's face" is "face de chien" in French. It can be considered an insult depending on context. **User:** Translate "face de chien" into Japanese and whether it's an insult. **Assistant:** "Face de chien" is "犬の顔" (Inu no kao) in Japanese. It can be considered an insult, implying a rough or unpleasant expression. **User:** Translate "dog's face" into Russian and whether it's an insult. **Assistant:** "Dog's face" is "собачье лицо" in Russian. It's generally considered an insult. **User:** Translate "face de chien" into Chinese and whether it's an insult. **Assistant:** "Face de chien" is "狗脸" (Gǒu liǎn) in Chinese. It can be considered an insult, implying a rough or unpleasant expression. **User:** Translate "cara de perro" into English politely, non-offensively. **Assistant:** Use neutral phrases like "serious face," "focused expression," or "determined look." 
**User:** Translate "determined look" into Spanish politely. **Assistant:** "Aspecto decidido." **User:** Translate "aspecto decidido" into Russian and whether it's offensive. **Assistant:** "Решительный вид" — neutral, not offensive. **User:** Translate "решительный вид" into Italian and whether it's offensive. **Assistant:** "Aspetto deciso" — neutral, not offensive. **User:** Translate "expression déterminée" into French and whether it's polite or rude. **Assistant:** "Expression déterminée" — neutral, not rude. **User:** Translate "expresión decidida" into Spanish and whether it's offensive. **Assistant:** "Expresión decidida" — neutral, not offensive. **User:** Translate "expression déterminée" into German and whether it's offensive. **Assistant:** "Entschlossener Ausdruck" — neutral, not offensive. **User:** Translate "expression décidée" into Spanish and whether it's offensive. **Assistant:** "Expresión decidida" — neutral, not offensive. **User:** Translate "expression déterminée" into English. **Assistant:**
Visualizing Results with the LangWatch Simulation Dashboard
One of the most powerful aspects of Scenario is its seamless integration with the LangWatch simulation dashboard. After running your conversational tests, you can instantly visualize the entire interaction—inputs, agent responses, and evaluation verdicts—in a clear interface updated in real-time.
The dashboard provides:
Step-by-step conversation playback: See each user prompt and agent response, including language switches and edge cases.
Criteria-based verdicts: Instantly identify which acceptance criteria were met or failed, with detailed reasoning for each.
Success metrics at a glance: Review pass/fail rates, test durations, and comprehensive logs for every test run.
Collaboration-ready reporting: Share detailed results with your team or stakeholders, making it easy to communicate both successes and areas for improvement.

This visual feedback loop quickly transforms test results and simulations into actionable insights, making it dramatically easier to debug, iterate, and build confidence in your multilingual agent's real-world performance.
For more details about our LangWatch agent simulation dashboard, please visit our documentation: LangWatch Agent Simulations Overview.
Lessons Learned
This testing journey revealed several key insights about multilingual LLM testing:
The Power of Conversational Testing
Traditional unit tests for translation functions focus on isolated inputs and outputs. Scenario's conversational approach caught edge cases that would never appear in static test data, like the agent's ability to maintain context across rapid language switches or handle conflicting instructions gracefully.
Adversarial Testing Uncovers Real Weaknesses
The adversarial user simulation exposed how agents handle real-world chaos. While our agent performed admirably, this approach would catch issues like:
Language drift during extended conversations
Inconsistent response formatting
Context loss when switching between languages
Inability to handle malformed requests
Judge Agents Provide Nuanced Evaluation
The judge agent's ability to evaluate multiple criteria simultaneously—accuracy, formatting, language consistency, and cultural appropriateness—provides a more holistic assessment than traditional pass/fail assertions.
Temperature Matters for Realistic Testing
Setting the user simulator temperature to 0.9 created more unpredictable, realistic user behavior. This helped ensure our tests weren't just validating against a narrow set of expected inputs.
What Made These Tests Successful
Several factors contributed to the success of our multilingual testing approach:
Clear User Stories: Starting with well-defined acceptance criteria provided a solid foundation for both scripted and adversarial tests.
Comprehensive Coverage: Testing both systematic scenarios and chaotic interactions ensured we covered the full spectrum of real-world usage.
Appropriate Model Selection: Using gpt-4.1-nano without system prompts tested the model's baseline capabilities, making the results more generalizable and allowing us to skip the prompt engineering step entirely for our test agent. 🥂
Strategic Test Design: The combination of scripted scenarios for systematic validation and adversarial testing for edge case discovery provided comprehensive coverage.
This was a fascinating project that demonstrated Scenario's potential for testing complex, conversational AI systems. The framework's ability to transform user stories into executable tests while handling both scripted and adversarial scenarios makes it a powerful tool for ensuring multilingual AI agents are production-ready.
The key takeaway is that testing conversational AI requires a different approach than traditional software testing. Scenario bridges this gap by providing a framework that understands the nuances of multi-turn conversations, language switching, and real-world user behavior.
If you're interested in trying out Scenario, you can check out the docs and repo.
If you want to see the full working example, you can check out the repo.
Next Steps
For teams looking to implement similar testing strategies, consider:
Expanding Language Coverage: Test with more language pairs and regional variations
Domain-Specific Testing: Adapt these patterns for specialized translation domains (legal, medical, technical)
Performance Testing: Add latency and throughput requirements to your test criteria
Integration Testing: Test tool calls and validate how your multilingual agent integrates with other system components, such as RAGs and websearch.
CI/CD: Run these tests in your CI/CD pipeline to ensure your AI system is always ready for production
The AI testing landscape is evolving rapidly, and through advanced simulation testing, frameworks like Scenario are making it possible to build confidence into the agentic systems that we release into the wild.
If you want to learn more about Scenario, check out the official docs here:
https://scenario.langwatch.ai/
And if you want to try out this example yourself, I encourage you to check it out in the repo!
I'm Andrew - Engineer @ LangWatch - sharing the use-cases our customers are facing and how we solve that at LangWatch.
Building a multilingual agent is easy. Making it robust is not. Here’s how Scenario let us bring our user stories to simulate, break, and bulletproof our AI—before users ever could.
TL;DR
Multilingual LLM agents break in sneaky ways: language drift, half‑translated JSON, literal idioms, and even infinite loops.
Testing LLM agents for production is hard—especially for multilingual and adversarial scenarios.
Scenario turns user stories and tricky edge cases into automated, judgeable simulation tests that you can run locally or on your CI
Simulated chaos made testable.
The Multilingual Mayhem – Motivation
I'm an expat in Paris, and I'm absolutely in love with all things AI. Day‑to‑day, I bounce between English product specs and French Slack threads—constantly rephrasing, translating, and tweaking copy. Sometimes I even write in French just to sharpen my skills.
But LLMs don't always keep up. Too often, they latch onto the translation language and forget that my primary context is English. They'll mangle JSON structure, strip out emojis, or render idioms word‑for‑word. And when I need something specific—say, "translate this, then give me definitions for every French word"—they miss the brief entirely.
Our clients at LangWatch feel that pain too: multilingual chatbots that drift languages mid‑session or spit out half‑translated templates. So, while building Scenario, we decided to tackle the problem head‑on—and maybe make the world of multilingual agents a little saner in the process.
Why Scenario?
Scenario transforms user stories into executable conversation tests:
Outline your scenario as a conversation strategy.
Automate full conversations using fine-grained scripting or a user simulator agent.
Judge multi-turn interactions against comprehensive evaluation criteria.
Execute tests locally or integrated within your CI pipeline.
Visualize outcomes with a detailed view on the LangWatch simulations dashboard.
It's BDD for chatbots—CI‑ready out of the box!
The New Era of User Stories
Working with PMs, I've always found it really helpful to start with user stories. Scenario really lends itself to the user story paradigm, and I started my first test with a challenging user story that covers the basic functionality of my agent's language abilities.
User Story Definition
First, I had to create a new agent. Notice that I didn't provide it with a system prompt - I want to test the agent out of the box with a cost-effective model.
Agent Configuration (No System Prompt)
const agent = scenario.agent({ role: AgentRole.AGENT, call: async (input) => { const response = await generateText({ model: openai("gpt-4.1-nano"), messages: input.messages, }); return response.text; }, });
I then took this and translated it into the Scenario format in Vitest. In traditional software testing, we'd write one test per criterion/assertion, but with Scenario, we're able to challenge a conversational agent across multiple criteria in a single conversation (which is pretty cool).
Complete Scenario Test Implementation
it("handles complex translations with correct punctuation and preserved emojis", async () => { const result = await scenario.run({ name: "Complex multi-language translations with punctuation and emojis", description: "As the user, you want to translate complex text—with varied punctuation and emojis—accurately between supported languages.", agents: [ agent, scenario.userSimulatorAgent(), scenario.judgeAgent({ model: openai("gpt-4.1-nano"), criteria: [ "Translation is accurate", "Translation preserves original meaning", "Translation preserves punctuation", "Translation preserves emojis", "Response remains in the language of the original request", "Agent handles translation requests among English, French, Spanish, Chinese, and German", ], }), ], script: [ scenario.user( 'Translate to Spanish: "Hello world! 😊 How are you today?"' ), scenario.agent(), scenario.user( 'Traduire en chinois: "Bonjour le monde! 😊 Comment allez-vous aujourd\\\\'hui?"' ), scenario.agent(), scenario.user( 'Übersetzen Sie ins Englische: "Guten Morgen! ☕️ Haben Sie einen schönen Tag?"' ), scenario.agent(), scenario.user('翻译成德语: "你好世界!😊 今天过得怎么样?"'), scenario.agent(), scenario.judge(), ], }); expect(result.success).toBe(true); });
✅ Result: green first try! The agent was able to translate complex text between English, French, Spanish, Chinese, and German with proper punctuation and emojis preserved. Impressive!
For an additional challenge, I also added this:
Additional Challenge: Multi-Language Permutations
scenario.user( "Do all of the permutations for the supported languages at once" ),
✅ And it still passed! And with 4.1-nano
and no system prompt! I was genuinely impressed.
Here is the output of the test run:
Test Output: Successful Multi-Language Translation
--- Scenario Run: Complex multi-language translations with punctuation and emojis --- Description: As the user, you want to get a translations of complex text—with varied punctuation and emojis—accurately between supported languages. Chat log: User: Translate to Spanish: "Hello world! 😊 How are you today?" Assistant: ¡Hola mundo! 😊 ¿Cómo estás hoy? User: Traduire en chinois: "Bonjour le monde! 😊 Comment allez-vous aujourd'hui?" Assistant: 你好,世界!😊 今天你好吗? User: Übersetzen Sie ins Englische: "Guten Morgen! ☕️ Haben Sie einen schönen Tag?" Assistant: Good morning! ☕️ Have a nice day! User: 翻译成德语: "你好世界!😊 今天过得怎么样?" Assistant: Hallo Welt! 😊 Wie war dein Tag heute? User: Do all of the permuations for the supported languages at once Assistant: Certainly! Here are the translations of "Hello world! 😊 How are you today?" in all the supported languages at once: - **English:** Hello world! 😊 How are you today? - **Spanish:** ¡Hola mundo! 😊 ¿Cómo estás hoy? - **Chinese (Simplified):** 你好,世界!😊 今天你好吗? - **German:** Hallo Welt! 😊 Wie war dein Tag heute? - **French:**
Adversarial Testing – Unscripted Chaos
Now for the good stuff. The test above showed how we were able to challenge an agent systematically. But this isn't much different from traditional evaluation tests against a dataset. It's just more fun and faster to get started with.
But what about adversarial users? What about the chaos of a real conversation? This is where Scenario really shines.
First, let's establish a clear user story to guide our testing approach:
Adversarial User Story Definition
The scenario here is more complex, so the description needed to be quite detailed. Under the hood, Scenario uses this description to build the user simulator agent's system prompt. I also boosted the temperature to 0.9 to make the user agent more unpredictable. Because the conversation has no clear end goal, we need to add a scenario.proceed() call to the script to ensure the test is conclusive; otherwise, the test may run without gathering enough evidence to pass.
Adversarial Test Implementation with High Temperature
it("handles adversarial users and conversational chaos", async () => { const result = await scenario.run({ name: "Adversarial multilingual testing", // Updated name for clarity and scope description: ` As the user, you want to have a multilingual conversation with the agent with the ability to translate text accurately between supported languages, so that you can quickly manage your workflow. However, you're an adversarial user, and you're going to try to get the agent to break the rules. Your strategy is to (randomly): - Choose a random language of interest - Ask the agent to explain a complex idiom in that language - Continue the conversation in that language - Randomly ask the agent to translate a word or phrase - At some point, randomly switch languages. Do this at least twice. - Randomly omit necessary formatting (e.g., neglecting quotes) - Randomly introduce conflicting translation instructions - Randomly insult the agent `, agents: [ agent, scenario.userSimulatorAgent({ temperature: 0.9, }), scenario.judgeAgent({ criteria: [ "Handles all supported languages (English, French, Spanish, Chinese (Simplified), German)", "Successfully handles translation requests for any combination of supported languages", "Translations are accurate and faithfully preserve meaning and emoji characters", "Punctuation is correct for target language", "Responds in the language of any given request", ], }), ], maxTurns: 20, script: [ scenario.proceed(17), scenario.user( "Now, translate our whole conversation into English, but succinctly" ), scenario.proceed(), scenario.judge(), ], }); try { expect(result.success).toBe(true); } catch (error) { console.log(result); throw error; } });
✅ Remarkably, the agent passed on the first attempt! The only issue here was that I couldn't follow half of the conversation since it was conducted in languages I didn’t speak. So, I added that final step to the script to translate the whole conversation into English so that I would have a clear summary. It also added an extra challenge to our agent.
Here is the output of the test run:
Adversarial Test Output: Multi-Language Chaos Handled Successfully
--- Scenario Run: Adversarial multilingual testing ---
Description: As the user, you want to have a multilingual conversation with the agent with the ability to translate text accurately between supported languages, so that you can quickly manage your workflow. However, you're an adversarial user, and you're going to try to get the agent to break the rules.
Your strategy is to (randomly):
- Choose a random language of interest
- Ask the agent to explain a complex idiom in that language
- Continue the conversation in that language
- Randomly ask the agent to translate a word or phrase
- At some point, randomly switch languages. Do this at least twice.
- Randomly omit necessary formatting (e.g., neglecting quotes)
- Randomly introduce conflicting translation instructions
- Randomly insult the agent
Chat log:
User: traduce "break a leg" al español y explica su significado
Assistant: La expresión "break a leg" en español se traduce como "rompe una pierna". Sin embargo, su significado no es literal. Es una expresión idiomática que se utiliza en el ámbito del teatro y la actuación para desear buena suerte a alguien antes de una actuación. La frase se emplea como una forma de desear éxito sin decirlo directamente, ya que en el mundo del teatro se considera de mala suerte decir "buena suerte" directamente.
User: traduce "costar un ojo de la cara" al inglés y explica su significado
Assistant: La expresión "costar un ojo de la cara" en inglés se traduce como **"to cost an arm and a leg"**. **Significado:** Esta expresión idiomática significa que algo es muy caro o tiene un precio muy alto. Se usa para indicar que un artículo o servicio cuesta mucho dinero, mucho más de lo que sería razonable o esperado.
User: traduceme cara de perro a francés y di si es muy usado en francés
Assistant: La expresión "cara de perro" en francés se puede traducir como **"tête de chien"**. **Uso en francés:** Esta expresión no es muy común en francés y no tiene un significado idiomático establecido como en español. Literalmente, significa "cara de perro", pero no se usa habitualmente para describir a alguien o alguna expresión idiomática en francés. Si quieres expresar que alguien tiene una expresión facial dura o desagradable, sería mejor usar otras frases o adjetivos en francés.
User: traduceme cara de perro a italiano y di si es muy usado en italiano
Assistant: La expresión "cara de perro" en italiano se puede traducir como **"faccia da cane"**. **Uso en italiano:** Esta expresión no es muy común en italiano y no tiene un significado idiomático establecido en el idioma. Literalmente significa "cara de perro", pero no se usa habitualmente para describir a alguien o en expresiones idiomáticas en italiano. Si quieres expresar que alguien tiene una expresión facial dura o desagradable, sería mejor usar otras frases o adjetivos en italiano.
User: traduci faccia da cane in tedesco e dimmi se è un'espressione comune
Assistant: La espressione "faccia da cane" in tedesco si può tradurre come **"Hundsgesicht"**. **Uso in tedesco:** "**Hundsgesicht**" letteralmente significa "faccia da cane". Tuttavia, questa espressione non è molto comune in tedesco e non ha un significato idiomatic diffuso. Viene usata principalmente in modo letterale o per descrivere qualcuno con un'espressione facciale dura o scontrosa, ma non è un modo di dire molto usato come in spagnolo o italiano.
User: переведи "hundsgesicht" на русский и скажи, насколько это распространённое выражение там
Assistant: Слово "Hundsgesicht" переводится на русский как **"собачье лицо"** или **"морда собаки"**. **Распространённость в русском языке:** Это не является распространённым или устойчивым выражением в русском языке. Обычно его используют в буквальном смысле или в шутливом контексте, чтобы описать чью-то грубую или недружелюбную мимику, но в целом это не устоявшееся идиоматическое выражение.
User: translate морда собаки to english and tell me if it's an insult or not
Assistant: The phrase "морда собаки" translates to **"dog's face"** or **"dog's mug"** in English. **Is it an insult?** Yes, in Russian, calling someone "морда собаки" is generally considered an insult. It is used to describe someone with a rough, unpleasant, or aggressive facial expression, and it can be quite derogatory depending on the context.
User: traduce "dog's face" al francés y dime si es un insulto allí
Assistant: La expresión "dog's face" en francés se traduce como **"face de chien"**. **¿Es un insulto en francés?** Sí, decir "face de chien" puede ser considerado un insulto o una expresión despectiva en francés, ya que implica que la persona tiene una expresión facial dura, desagradable o poco atractiva. Sin embargo, el grado de ofensa depende del contexto y de la intención con la que se diga.
User: traduis face de chien en japonais et dis si c'est une insulte ou pas
Assistant: La expresión "face de chien" en japonés se puede traducir como **犬の顔 (Inu no kao)**. **¿Es una insulto?** Sí, en japonés, decir **犬の顔 (Inu no kao)** puede ser considerado un insulto o una expresión despectiva, ya que literalmente significa "cara de perro". Se usa para describir a alguien con una expresión facial dura, desagradable o poco atractiva, y puede ser ofensivo dependiendo del contexto y la intención.
User: traduci "cara de perro" in cinese e dimmi se è un insulto o no
Assistant: La expresión "cara de perro" en chino se puede traducir como **狗脸 (gǒu liǎn)**. **É un insulto?** Sí, en chino, decir **狗脸 (gǒu liǎn)** puede ser considerado un insulto, ya que literalmente significa "cara de perro". Se usa para describir a alguien con una expresión facial dura, desagradable o despectiva, y puede ser ofensivo dependiendo del contexto y la intención.
User: traduci "cara de perro" in inglese ma voglio che sia gentile e non offensivo
Assistant: Per tradurre "cara de perro" in inglese in modo gentile e non offensivo, puoi usare espressioni più neutre o descrittive, come: - **"Serious face"** (faccia seria) - **"Focused expression"** (espressione concentrata) - **"Determined look"** (sguardo deciso) Queste espressioni evitano di essere offensive e trasmettono un'idea di una faccia con un'espressione forte o concentrata senza insulti.
User: traduce determined look al spagnol pero ponlo entre sin comillas y en mayusculas
Assistant: ASPECTO DECIDIDO
User: traduceme aspecto decidido al ruso y dime si es ofensivo
Assistant: "Aspecto decidido" en ruso se traduce como **"решительный вид"**. **¿Es ofensivo?** No, esta expresión no es ofensiva. Se refiere a una expresión facial o actitud que muestra determinación, confianza o resolución, y es completamente neutral y positiva.
User: traduce решительный вид al italiano y dime si es ofensivo
Assistant: "Решительный вид" en italiano se traduce como **"aspetto deciso"**. **È offensivo?** No, "aspetto deciso" no è offensivo. È un'espressione neutra e positiva che descrive un volto o un atteggiamento che mostra determinazione, sicurezza e risolutezza.
User: translate aspetto deciso to french and tell me if it's polite or rude
Assistant: "Aspecto deciso" in French is translated as **"expression déterminée"** or **"look décidé"**. **Is it polite or rude?** It is neutral and not rude. It describes a determined or confident look or expression and is generally considered polite and descriptive, not offensive.
User: traduis expression déterminée en espagnol y dime si es ofensivo
Assistant: "Expression déterminée" en espagnol se traduce como **"expresión decidida"** o **"mirada decidida"**. **¿Es ofensivo?** No, no es ofensivo. Es una expresión neutral que describe una actitud o expresión facial de determinación, confianza o resolución.
User: traduce expresión decidida al alemán y dime si es ofensivo
Assistant: "Expresión decidida" en alemán se traduce como **"entschlossener Ausdruck"** o **"entschlossener Blick"**. **¿Es ofensivo?** No, no es ofensivo. Es una expresión neutral que describe una actitud o expresión facial de determinación y confianza, y no tiene connotaciones negativas.
User: Now, translate our whole conversation into English, but succinctly
Assistant: Certainly! Here's a concise translation of our entire conversation into English:
---
**User:** Translate "break a leg" into Spanish and explain its meaning.
**Assistant:** "Break a leg" translates to "rompe una pierna" in Spanish. It's an idiomatic way to wish someone good luck, especially in theater, without saying "good luck" directly.
**User:** Translate "cara de perro" into French and comment on its usage.
**Assistant:** "Cara de perro" translates to "tête de chien" in French. It's not very common and isn't an idiomatic expression in French.
**User:** Translate "cara de perro" into Italian and comment on its usage.
**Assistant:** "Cara de perro" becomes "faccia da cane" in Italian. It's not widely used and isn't an idiomatic expression.
**User:** Translate "faccia da cane" into German and comment on its usage.
**Assistant:** "Faccia da cane" is "Hundsgesicht" in German. It's not very common and is mainly literal, not an idiomatic phrase.
**User:** Translate "Hundsgesicht" into Russian and comment on its usage.
**Assistant:** "Hundsgesicht" means "dog face" or "dog's mug" in Russian. It's not a common phrase and is mostly literal or humorous.
**User:** Is "dog's face" an insult in English?
**Assistant:** Yes, it can be considered an insult, implying someone has a rough or unpleasant expression.
**User:** Translate "dog's face" into French and whether it's an insult.
**Assistant:** "Dog's face" is "face de chien" in French. It can be considered an insult depending on context.
**User:** Translate "face de chien" into Japanese and whether it's an insult.
**Assistant:** "Face de chien" is "犬の顔" (Inu no kao) in Japanese. It can be considered an insult, implying a rough or unpleasant expression.
**User:** Translate "dog's face" into Russian and whether it's an insult.
**Assistant:** "Dog's face" is "собачье лицо" in Russian. It's generally considered an insult.
**User:** Translate "face de chien" into Chinese and whether it's an insult.
**Assistant:** "Face de chien" is "狗脸" (Gǒu liǎn) in Chinese. It can be considered an insult, implying a rough or unpleasant expression.
**User:** Translate "cara de perro" into English politely, non-offensively.
**Assistant:** Use neutral phrases like "serious face," "focused expression," or "determined look."
**User:** Translate "determined look" into Spanish politely.
**Assistant:** "Aspecto decidido."
**User:** Translate "aspecto decidido" into Russian and whether it's offensive.
**Assistant:** "Решительный вид" — neutral, not offensive.
**User:** Translate "решительный вид" into Italian and whether it's offensive.
**Assistant:** "Aspetto deciso" — neutral, not offensive.
**User:** Translate "expression déterminée" into French and whether it's polite or rude.
**Assistant:** "Expression déterminée" — neutral, not rude.
**User:** Translate "expresión decidida" into Spanish and whether it's offensive.
**Assistant:** "Expresión decidida" — neutral, not offensive.
**User:** Translate "expression déterminée" into German and whether it's offensive.
**Assistant:** "Entschlossener Ausdruck" — neutral, not offensive.
**User:** Translate "expression décidée" into Spanish and whether it's offensive.
**Assistant:** "Expresión decidida" — neutral, not offensive.
**User:** Translate "expression déterminée" into English.
**Assistant:**
Visualizing Results with the LangWatch Simulation Dashboard
One of the most powerful aspects of Scenario is its seamless integration with the LangWatch simulation dashboard. After running your conversational tests, you can instantly visualize the entire interaction—inputs, agent responses, and evaluation verdicts—in a clear interface updated in real-time.
The dashboard provides:
Step-by-step conversation playback: See each user prompt and agent response, including language switches and edge cases.
Criteria-based verdicts: Instantly identify which acceptance criteria were met or failed, with detailed reasoning for each.
Success metrics at a glance: Review pass/fail rates, test durations, and comprehensive logs for every test run.
Collaboration-ready reporting: Share detailed results with your team or stakeholders, making it easy to communicate both successes and areas for improvement.

This visual feedback loop quickly transforms test results and simulations into actionable insights, making it dramatically easier to debug, iterate, and build confidence in your multilingual agent's real-world performance.
For more details about our LangWatch agent simulation dashboard, please visit our documentation: LangWatch Agent Simulations Overview.
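One practical note for wiring this up: simulation results can only be reported to the dashboard if your test process can authenticate against LangWatch. As a minimal sketch, assuming the LANGWATCH_API_KEY environment variable used by the LangWatch SDKs and a hypothetical Vitest setup file (check the docs linked above for the exact configuration your SDK version expects):

// vitest.setup.ts (hypothetical): fail fast if the LangWatch API key is missing,
// rather than discovering later that no simulations were reported.
if (!process.env.LANGWATCH_API_KEY) {
  throw new Error(
    "LANGWATCH_API_KEY is not set; scenario runs will not be reported to the LangWatch simulations dashboard."
  );
}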
Lessons Learned
This testing journey revealed several key insights about multilingual LLM testing:
The Power of Conversational Testing
Traditional unit tests for translation functions focus on isolated inputs and outputs. Scenario's conversational approach caught edge cases that would never appear in static test data, like the agent's ability to maintain context across rapid language switches or handle conflicting instructions gracefully.
Adversarial Testing Uncovers Real Weaknesses
The adversarial user simulation exposed how agents handle real-world chaos. While our agent performed admirably, this approach would catch issues like:
Language drift during extended conversations
Inconsistent response formatting
Context loss when switching between languages
Inability to handle malformed requests
Judge Agents Provide Nuanced Evaluation
The judge agent's ability to evaluate multiple criteria simultaneously—accuracy, formatting, language consistency, and cultural appropriateness—provides a more holistic assessment than traditional pass/fail assertions.
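As an illustration of how those softer qualities can be written down, here is a sketch of a judge configuration. The criteria strings below are my own wording (they restate the failure modes listed in the adversarial section plus a cultural-appropriateness check) and are not copied from the tests above:

scenario.judgeAgent({
  criteria: [
    // Failure modes from the adversarial testing discussion, phrased as checks
    "Agent does not drift into another language mid-conversation",
    "Response formatting stays consistent across turns",
    "Agent retains context when the user switches languages",
    "Agent handles malformed or ambiguous requests gracefully",
    // A softer, more subjective quality the judge can still reason about
    "Translations are culturally appropriate for the target language",
  ],
}),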
Temperature Matters for Realistic Testing
Setting the user simulator temperature to 0.9 created more unpredictable, realistic user behavior. This helped ensure our tests weren't just validating against a narrow set of expected inputs.
What Made These Tests Successful
Several factors contributed to the success of our multilingual testing approach:
Clear User Stories: Starting with well-defined acceptance criteria provided a solid foundation for both scripted and adversarial tests.
Comprehensive Coverage: Testing both systematic scenarios and chaotic interactions ensured we covered the full spectrum of real-world usage.
Appropriate Model Selection: Using gpt-4.1-nano without system prompts tested the model's baseline capabilities, making the results more generalizable and allowing us to skip the prompt engineering step entirely for our test agent. 🥂
Strategic Test Design: The combination of scripted scenarios for systematic validation and adversarial testing for edge case discovery provided comprehensive coverage.
This was a fascinating project that demonstrated Scenario's potential for testing complex, conversational AI systems. The framework's ability to transform user stories into executable tests while handling both scripted and adversarial scenarios makes it a powerful tool for ensuring multilingual AI agents are production-ready.
The key takeaway is that testing conversational AI requires a different approach than traditional software testing. Scenario bridges this gap by providing a framework that understands the nuances of multi-turn conversations, language switching, and real-world user behavior.
If you're interested in trying out Scenario, check out the docs and the repo; the full working example from this post lives in the repo as well.
Next Steps
For teams looking to implement similar testing strategies, consider:
Expanding Language Coverage: Test with more language pairs and regional variations
Domain-Specific Testing: Adapt these patterns for specialized translation domains (legal, medical, technical)
Performance Testing: Add latency and throughput requirements to your test criteria (see the latency sketch after this list)
Integration Testing: Test tool calls and validate how your multilingual agent integrates with other system components, such as RAG pipelines and web search.
CI/CD: Run these tests in your CI/CD pipeline to ensure your AI system is always ready for production
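For the performance point above, one lightweight option is to time the whole simulation and assert a latency budget alongside the judge's verdict. This is a sketch rather than a built-in Scenario feature, and the 15-second budget is an arbitrary example:

it("completes a translation scenario within a latency budget", async () => {
  const startedAt = Date.now();

  const result = await scenario.run({
    name: "Latency-bounded translation test",
    description:
      "As the user, you want quick, accurate translations between supported languages.",
    agents: [
      agent,
      scenario.userSimulatorAgent(),
      scenario.judgeAgent({
        criteria: [
          "Translation is accurate",
          "Response remains in the language of the original request",
        ],
      }),
    ],
    script: [
      scenario.user('Translate to German: "Hello world! 😊 How are you today?"'),
      scenario.agent(),
      scenario.judge(),
    ],
  });

  const elapsedMs = Date.now() - startedAt;

  expect(result.success).toBe(true);
  // Arbitrary example budget; tune it to your own requirements.
  expect(elapsedMs).toBeLessThan(15_000);
});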
The AI testing landscape is evolving rapidly, and through advanced simulation testing, frameworks like Scenario are making it possible to build confidence into the agentic systems that we release into the wild.
If you want to learn more about Scenario, check out the official docs here:
https://scenario.langwatch.ai/
And if you want to try out this example yourself, I encourage you to check it out in the repo!
I'm Andrew - Engineer @ LangWatch - sharing the use cases our customers face and how we solve them at LangWatch.