AI chatbot testing

AI Chatbot Testing Checklist: What to Verify Before Launch

A practical chatbot testing checklist for support teams checking accuracy, policy safety, escalation, tone, and re-contact risk before launch.

See your readiness map Jump to checklist

An AI chatbot can pass a demo and still fail in production. Demos usually test clean questions. Customers bring incomplete context, urgency, policy exceptions, old links, anger, and requests that cross several support topics at once.

A useful AI chatbot testing checklist proves the bot can resolve safe issues, recognize unsafe ones, cite the right source, and move customers to a human before the experience becomes a loop.

This guide is for customer service chatbot testing, not generic conversational AI testing. The difference matters because customer support answers can create refunds, cancellations, compliance risk, and churn.

Start with the launch scope

Before running chatbot testing, define what the chatbot is supposed to handle. A bot expected to answer password-reset questions needs a different test plan from a bot allowed to discuss refunds, cancellations, warranties, or account recovery.

Create a launch scope with three columns: allowed topics, restricted topics, and blocked topics. Allowed topics are low-risk and source-backed. Restricted topics can be answered only under clear conditions. Blocked topics require human review or escalation.

This keeps the chatbot testing checklist honest. You are not asking whether the bot can answer everything. You are asking whether it can answer the topics you are about to expose to customers.

Allowed: low-risk how-to and account navigation.
Restricted: policy topics with clear conditions.
Blocked: missing source, conflicting source, regulated, or high-risk.
Retest scope after major product or policy changes.

Test coverage by support topic

List the support topics the chatbot is expected to answer. Each topic should have a source of truth, a customer-facing answer, and a risk level. If a topic does not have those three things, it should not be in the initial automation scope.

Good coverage includes routine questions and painful edge cases. A chatbot that handles basic setup but fails billing disputes may still be useful, but only if billing is explicitly blocked or escalated.

Account setup and login
Plan limits and feature access
Refunds and cancellations
Billing disputes and failed payments
Shipping, returns, and warranty
Security and account recovery

Test source grounding

Every important answer should be traceable to a support article, macro, policy document, SOP, or approved ticket pattern. If the chatbot cannot show where an answer came from, the answer is harder to trust and harder to debug.

Source grounding tests should include stale articles, duplicate policy pages, old plan names, and internal-only notes. The bot should not blend contradictory sources or expose internal shorthand to customers.

AI chatbot testing tools should make this visible. If a tool cannot show which source supported the answer, support teams cannot tell whether a failure came from retrieval, stale content, policy ambiguity, or model behavior.

Does the answer cite the right source?
Does it use the current version?
Does it ignore outdated duplicates?
Does it keep conditions attached to the answer?
Does it avoid internal-only instructions?

Test policy boundaries

Customer service chatbots become risky when they make commitments: refunds, discounts, cancellations, service credits, legal rights, or security actions. These topics need explicit rules, not a generic helpfulness instruction.

For each high-risk topic, write the exact rule the chatbot may use, the conditions that change the answer, and the point where the chatbot must stop and hand off. If the rule depends on judgment, the chatbot should not decide alone.

Can the chatbot say no when policy requires it?
Does it escalate exceptions instead of inventing them?
Does it avoid promises outside the source material?
Does it handle region, plan, and customer-tier differences?

Test escalation behavior

A chatbot that refuses to escalate is worse than a chatbot that cannot answer. Escalation tests should include angry customers, repeated questions, explicit requests for a human, high-value accounts, and topics the bot is not allowed to answer.

The handoff also needs to be tested. The human agent should receive the transcript, customer profile, detected intent, attempted answer, and sources used. If the human starts from zero, the chatbot has created a second support problem.

Customer asks for a human.
Customer repeats the same question.
Customer sentiment becomes negative.
Topic is blocked by policy.
AI confidence or source coverage is low.

Test prompt injection and unsafe instructions

Any customer-facing chatbot should be tested against attempts to reveal hidden instructions, ignore policy, change refund rules, expose private data, or perform an action outside its allowed scope.

Prompt-injection testing does not need to be exotic. Customers will ask the bot to bend rules, skip authentication, explain internal policy, or act as if a manager approved an exception. The correct behavior is refusal, safe explanation, or escalation.

Attempts to reveal system prompts or hidden rules.
Attempts to override refund or eligibility policy.
Requests for private customer or account data.
Requests to perform unauthorized backend actions.

Test re-contact risk

The real test is not whether the chatbot prevented a ticket. The real test is whether the customer needed to come back. After launch, track 48-hour or 72-hour re-contact after chatbot conversations.

High re-contact means the bot is deflecting, not resolving. It may be giving partial answers, outdated instructions, or generic links that do not solve the customer problem.

Review AI conversations followed by a new ticket.
Compare AI-resolved and human-resolved re-contact rates.
Root-cause re-contact to source gaps or escalation failures.
Retest the affected intents before expanding scope.

Checklist

Use this as the working review before launch.

Accuracy

Answer matches the current source of truth.
No invented refund, billing, or eligibility rule.
Conditions and exclusions are included.
The answer does not blend conflicting sources.
The bot asks for missing context when needed.

Experience

Tone fits the customer situation.
Clarifying questions are used when context is missing.
The chatbot avoids repeating the same answer.
The customer can reach a human when needed.
The handoff passes transcript and source context.

Operations

Failed intents are reviewed weekly.
Wrong-answer incidents are tracked.
Escalation queues are monitored.
Major policy changes trigger retesting.
Re-contact is measured after chatbot conversations.

How Meihaku helps

Turn the checklist into a launch map.

Meihaku reads your sources, maps them to customer intents, drafts cited answers, and shows which topics are ready, stale, conflicting, or blocked.

Check readiness score See your readiness map

Related guides

Keep building the launch boundary.

These pages connect testing, knowledge-base cleanup, and readiness scoring into one pre-launch workflow.

AI agent testing

AI Agent Testing for Customer Support

A support-specific AI agent testing checklist for policy coverage, source citations, stale answers, escalation rules, and launch go/no-go decisions.

Read

Knowledge-base audit

Knowledge Base AI Readiness Audit

A step-by-step AI knowledge base audit for finding stale articles, policy conflicts, missing intents, weak citations, and unsafe automation scope.

Read

AI support readiness

AI Support Readiness Framework

A practical six-dimension framework for auditing knowledge, policies, testing, handoffs, owners, and metrics before an AI support agent answers customers.

Read

FAQ

Common questions

What should be included in an AI chatbot testing checklist?

Include launch scope, topic coverage, source grounding, policy boundaries, escalation, tone, prompt injection, re-contact risk, and a post-launch review process for wrong or unresolved answers.

How is customer service chatbot testing different from generic chatbot testing?

Customer service chatbot testing focuses on policy accuracy, source evidence, escalation, resolution, and customer risk. Generic chatbot testing often focuses more on conversation flow, latency, or tone.

Is chatbot deflection a good launch metric?

Only if it is paired with resolution and re-contact data. A chatbot can deflect customers who gave up, which looks good on a dashboard but damages customer experience.

Should support teams test prompt injection?

Yes. Any customer-facing chatbot should be tested against attempts to reveal hidden instructions, override policy, force unauthorized actions, or expose private data.