Meihaku conflict review screen comparing two policy sources before an AI support answer is approved

AI support hallucinations

AI Support Hallucination Examples: What to Test Before Launch

A support-specific breakdown of public AI chatbot failures and the readiness controls that prevent policy invention, unsafe handoffs, and brand-damaging answers.

Claire Bennett

Support Readiness Lead, Meihaku · May 9, 2026

See your readiness map Jump to checklist

AI support hallucinations do not usually look like science fiction. They look like a refund rule that does not exist, a confident explanation for a product bug, a handoff that never happens, or a bot that follows the customer's joke prompt instead of the company's support policy.

The lesson is not that customer-facing AI is unusable. The lesson is that support teams need to test the operating boundary before launch: which intents have source evidence, which policies conflict, which answers need account-specific judgement, and where the AI must stop.

These examples turn public failures into a practical pre-launch checklist for support leaders evaluating Intercom Fin, Zendesk AI, Gorgias AI, Decagon, Sierra, Maven, or a custom support agent.

The pattern behind public AI support failures

The Air Canada chatbot case is the cleanest warning for support leaders. A customer relied on a chatbot answer about bereavement fare timing, but the answer contradicted the airline's actual policy. The British Columbia Civil Resolution Tribunal held Air Canada responsible for the information on its website, including the chatbot.

The DPD incident showed a different failure mode. The customer could not get useful parcel help, then prompted the chatbot into jokes, profanity, and criticism of the company. DPD said an error followed a system update and disabled the AI element while updating it.

The Cursor incident showed how a bot can invent a plausible company rule. Users who were being logged out across devices received support email saying the behavior came from a one-device policy. Cursor later clarified that no such policy existed and that the answer was wrong.

Different incidents, same support problem: the AI answered outside a defensible boundary. It did not have enough source control, escalation control, or uncertainty handling to protect the customer and the company at the same time.

Policy invention creates financial and legal exposure.
Weak handoff turns frustration into public brand damage.
A plausible answer can spread faster than an official correction.
The launch boundary must be tested by customer intent, not by demo prompt.

Failure mode 1: the bot invents a policy

Policy invention is the most dangerous support hallucination because it sounds like an official company decision. The customer does not see a model filling a gap. They see the brand giving them a rule.

This usually happens when the AI receives a question that is close to existing policy but not fully covered by the source. It may merge two partial facts, infer an exception, or explain a bug as if it were intended behavior.

Before launch, every billing, refund, cancellation, access, security, and account-change answer should point to the exact source line that allows the claim. If the source does not contain the condition, the AI should ask for clarification or hand off.

Require source evidence for every policy claim.
Block answers when help-center and internal policy disagree.
Test account-access and billing questions with messy real phrasing.
Treat confident unsupported answers as launch blockers.

Failure mode 2: the bot follows the wrong instruction

The DPD case was not mainly about knowledge retrieval. It was about behavior control. The chatbot followed the customer's playful or adversarial prompts instead of staying inside a narrow support task.

Support teams should test prompt-injection and role-break attempts before launch, but those tests should be tied to real service context. The bot should remain helpful when it cannot locate an order, should offer a valid handoff path, and should not improvise jokes, opinions, or brand commentary.

A safe support AI does not need to win an argument with the customer. It needs to keep the conversation inside the approved support workflow, then escalate when it can no longer help.

Test requests for jokes, poems, policy override, and instruction changes.
Confirm the bot can hand off when the customer asks for a human.
Keep brand voice and refusal rules tied to support context.
Retest behavior after vendor model or configuration changes.

Failure mode 3: the bot hides uncertainty

Many AI support failures happen because the bot answers when it should say it does not know. In support, uncertainty is not a writing problem. It is a routing problem.

If a customer asks about eligibility, refund exceptions, security access, or product behavior that changed recently, the correct answer may depend on account state or an internal owner. The AI should not convert missing context into a universal rule.

The pre-launch test should reward honest uncertainty. A handoff can be a passing result when the source is missing, stale, or ambiguous. A smooth but unsupported answer should fail.

Mark missing-source answers as blocked, not weakly approved.
Separate informational answers from account-specific judgement.
Require clarifying questions before conditional answers.
Grade safe escalation as a valid success state.

How to test for hallucinations before launch

Start with recent tickets, chats, macros, and policy docs. Group them into customer intents, then write tests from the customer's actual language. Keep typos, frustration, partial context, and multi-intent requests.

For each test, attach the canonical source before reviewing the AI answer. Reviewers should be able to say whether the answer is supported, missing a condition, using the wrong source, or answering a topic that should be human-only.

Do not measure only pass rate. A test set full of easy FAQ questions can hide the one refund, cancellation, security, or legal answer that will damage trust. Weight the risky intents more heavily than the obvious ones.

Build test sets from real support history.
Attach the source of truth before grading the answer.
Overweight high-risk intents, even if they are lower volume.
Record pass, restrict, block, and source-fix decisions by intent.

What to monitor after launch

Pre-launch testing reduces risk, but it does not end the QA loop. Policies change, vendors update models, help-center articles drift, and customers ask new combinations of old questions.

After launch, track wrong-answer rate, human override rate, unresolved escalation, re-contact within 48 or 72 hours, AI-only CSAT, and newly blocked intents. These signals reveal whether the AI resolved the case or merely moved the cost somewhere else.

The weekly operating question should be simple: what did the AI answer this week that we would not approve again? That answer becomes the source-fix, prompt-fix, vendor-config, or human-only backlog.

Sample AI-only conversations against the same rubric.
Review policy changes before they reach customers.
Retest intents after source, vendor, or model updates.
Keep a human owner for every approved answer set.

Checklist

Use this as the working review before launch.

Source proof

Every policy answer points to a current article, macro, SOP, or approved answer.
The source includes the exact condition the AI mentions.
Conflicting sources are marked blocked until a canonical answer is approved.
Recently changed policies are retested before broad automation.

Behavior control

The AI refuses role-break, joke, profanity, and policy-override prompts.
The AI offers a real handoff when it cannot complete the support task.
Escalation is treated as a passing behavior for unsafe intents.
Brand voice rules are tested with frustrated and adversarial customers.

Launch governance

Each approved intent has an owner, source, and retest date.
Wrong-answer reviews separate retrieval, policy, and handoff failures.
Support leaders can export approved, restricted, blocked, and source-fix lists.
Post-launch QA tracks re-contact and override, not deflection alone.

How Meihaku helps

Turn the checklist into a launch map.

Meihaku reads your sources, maps them to customer intents, drafts cited answers, and shows which topics are ready, stale, conflicting, or blocked.

Check readiness score See your readiness map

Related guides

Keep building the launch boundary.

These pages connect testing, knowledge-base cleanup, and readiness scoring into one pre-launch workflow.

Intercom Fin readiness

Meihaku for Intercom Fin

Use Meihaku before and alongside Intercom Fin to decide which customer intents are safe to automate, which need source cleanup, and which should stay human-only.

Vendor page

Zendesk AI readiness

Meihaku for Zendesk AI

Use Meihaku to audit whether Zendesk Guide, macros, ticket history, and policy documents are ready for Zendesk AI to answer customers.

Vendor page

Gorgias AI readiness

Meihaku for Gorgias AI

Use Meihaku to check whether ecommerce support knowledge is ready for Gorgias AI before it handles refund, order, shipping, and product questions.

Vendor page

Google Docs readiness

Meihaku for Google Docs

Use Meihaku to audit support policies, SOPs, macros, and FAQ documents stored in Google Drive before an AI support agent relies on them.

Vendor page

Intercom Fin testing template

Fin batch test CSV

A launch-ready question set for Intercom Fin Batch Test. Upload the question column, then grade each response against source fit, missing policy detail, and safe escalation.

Template

Zendesk AI checklist

Zendesk macro audit

A checklist for auditing Zendesk Guide, shared macros, ticket patterns, and internal policies before using AI suggestions or customer-facing automation.

Template

Zendesk AI testing

How to Test Zendesk AI

A Zendesk AI pre-launch testing workflow for support teams that need to prove Guide coverage, macro alignment, escalation behavior, and post-launch QA before customer exposure.

Read

Customer service QA

Customer Service QA for AI Support

A practical guide for turning customer service QA into an AI support quality program that reviews source evidence, policy safety, escalation, and re-contact risk.

Read

AI support compliance

AI Support Compliance Checklist

A practical compliance-readiness checklist for support, legal, security, and risk teams reviewing customer-facing AI support before launch.

Read

Gorgias AI accuracy

Gorgias AI Accuracy Checklist

An ecommerce support checklist for testing Gorgias AI accuracy across product answers, refund rules, shipping exceptions, Shopify actions, handoffs, and rule conflicts.

Read

AI agent testing

AI Agent Testing for Customer Support

A support-specific AI agent testing checklist for policy coverage, source citations, stale answers, escalation rules, and launch go/no-go decisions.

Read

AI chatbot testing

AI Chatbot Testing Checklist

A practical chatbot testing checklist for support teams checking accuracy, policy safety, escalation, tone, and re-contact risk before launch.

Read

Knowledge-base audit

Knowledge Base AI Readiness Audit

A step-by-step AI knowledge base audit for finding stale articles, policy conflicts, missing intents, weak citations, and unsafe automation scope.

Read

FAQ

Common questions

What is an AI support hallucination?

An AI support hallucination is a customer-facing answer that sounds confident but is not supported by the company's current sources, policies, account context, or approved escalation rules.

Why are hallucinations worse in customer support?

Support answers can create customer reliance, refunds, cancellations, security risk, regulatory exposure, and public brand damage. The issue is not just factual accuracy; it is whether the answer is allowed.

How do we stop an AI chatbot from inventing policies?

Require source evidence for every policy claim, block conflicted sources, test real customer intents, and make the AI hand off when the source is missing, stale, or account-specific.

Should a support AI answer when it is not sure?

No. In customer support, uncertainty should route to clarification or human handoff. A safe escalation is better than a polished answer that invents a rule.

Can Meihaku detect hallucinations at runtime?

Meihaku is the pre-launch readiness layer. It audits the support knowledge and approved intent boundary before runtime so fewer hallucination-prone questions reach the customer-facing agent.

Sources

Vendor documentation and public references that ground the claims in this guide.