
Speaking with Machines: The Art of Prompting Voice AI

Why Prompting Feels Different in Voice

When you read a poorly written chatbot response, you skim past the awkward phrasing. Your eyes jump to the relevant information. You mentally edit as you go.

When you hear a poorly prompted voice agent, there’s no skimming. Every stumble lands. Every pause feels like an eternity. Every mismatched tone feels wrong.

That’s the thing about voice AI: prompting mistakes that seem minor in text become glaring in speech.

Consider two voice assistants built on the same underlying model, handling the same customer support scenario. One responds in three seconds with a calm, direct answer. The other takes six seconds, starts with “Absolutely! I’d be more than happy to help you with that!”, and then rambles through context the user already knows.

Same model. Same infrastructure. Completely different experiences.

The difference is the prompt.

In text interfaces, users are forgiving. They scroll, they skim, they move on. In voice, they can’t skip ahead. They have to wait. They have to listen. And if what they hear doesn’t match their expectations (if the tone is wrong, if the pacing is off, if the response takes too long) the illusion of a helpful assistant shatters instantly.

Latency amplifies everything. A 500-millisecond delay in a chat window is imperceptible. In spoken conversation, 500ms is the upper edge of normal. Push past 800–1000ms, and the pause starts to feel unnatural. Past a second or two, the silence reads as a failure. Add processing overhead from an overly complex prompt, and you’ve crossed into uncomfortable territory.

Prompting for voice AI requires a different mindset. You’re not writing instructions for a text generator. You’re directing a real-time interaction.

How Voice AI Works

Before diving into prompting strategies, it helps to understand what happens between a user speaking and an AI responding.

The real-time loop looks like this:

Audio in → The user speaks into a microphone. Raw audio streams to the system.

Voice activity detection → The system detects when the user starts and stops speaking.

Transcription → Speech-to-text converts audio into text, typically in chunks as the user speaks.

LLM reasoning → The transcribed text hits your prompt and the model generates a response.

Speech synthesis → Text-to-speech converts the model’s output back into audio.

Audio out → The synthesized speech streams back to the user.

This is the standard cascading pipeline architecture. It’s still the most common approach and the easiest to debug, since you can inspect text at each stage. It’s also more flexible: you can insert RAG lookups, PII redaction, or other processing between transcription and the LLM. Native audio models that skip text entirely do exist, but they can be more expensive and may offer less flexibility depending on the use case. This article focuses on the orchestrated pipeline.
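The steps above can be sketched as one function per stage. This is an illustrative skeleton with stubbed stages, not a real STT, LLM, or TTS integration:

```python
# Illustrative skeleton of the cascading pipeline. transcribe, generate,
# and synthesize are stand-ins for real STT, LLM, and TTS services.

def transcribe(audio_chunk: bytes) -> str:
    """Speech-to-text stage (stubbed: pretends the audio is UTF-8 text)."""
    return audio_chunk.decode("utf-8")

def generate(system_prompt: str, user_text: str) -> str:
    """LLM reasoning stage (stubbed)."""
    return f"[reply to: {user_text}]"

def synthesize(text: str) -> bytes:
    """Text-to-speech stage (stubbed)."""
    return text.encode("utf-8")

def handle_turn(system_prompt: str, audio_in: bytes) -> bytes:
    """One pass through the loop: audio in -> text -> reply -> audio out."""
    user_text = transcribe(audio_in)                 # STT
    reply_text = generate(system_prompt, user_text)  # LLM
    return synthesize(reply_text)                    # TTS

audio_out = handle_turn("You are a concise hotel assistant.",
                        b"What time is checkout?")
```

In production each stage streams rather than blocking, and a RAG lookup or redaction step can slot in between transcription and the LLM, which is exactly the flexibility the cascading design buys you.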

Where Knowledge Comes From

The prompt tells the agent how to behave. It doesn’t give the agent domain knowledge. That usually comes from a retrieval system that pulls relevant documentation, knowledge base articles, or customer data at runtime and injects it into the context.

Your prompt should be written to work with retrieved context, not to contain all the answers. Instructions like “Answer based on the provided documentation” or “If the answer isn’t in the context, say you don’t know” help the model use retrieved knowledge appropriately.
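As a rough sketch, the behavioral prompt and the retrieved context can be assembled at request time. The function name and wording here are illustrative, not a standard:

```python
def build_prompt(base_prompt: str, retrieved_docs: list[str]) -> str:
    """Combine stable behavioral instructions with per-turn retrieved context.
    The phrasing and separator format are illustrative."""
    context = "\n\n".join(retrieved_docs) if retrieved_docs else "(no documents found)"
    return (
        f"{base_prompt}\n\n"
        "Answer based only on the documentation below. "
        "If the answer isn't in it, say you don't know.\n\n"
        f"--- Documentation ---\n{context}"
    )

prompt = build_prompt(
    "You are a concise hotel assistant. Keep answers to one or two sentences.",
    ["Checkout is at 11 AM. Late checkout is available on request."],
)
```

The base prompt stays small and stable; only the documentation section changes from turn to turn.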

Each step in this pipeline adds latency. Transcription takes time. Model inference takes time. Synthesis takes time. Network round trips take time. The total end-to-end latency, from when the user stops speaking to when they hear the first word of the response, determines whether the conversation feels natural or robotic. Human conversation typically has response gaps of 200–600 milliseconds. Voice AI systems that exceed about one second can start to feel unresponsive. Beyond two seconds, users start wondering if something broke.
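A quick back-of-the-envelope budget makes this concrete. The per-stage numbers below are illustrative placeholders, not benchmarks:

```python
# Illustrative end-to-end latency budget for one turn, in milliseconds.
# These stage figures are examples, not measured values.

budget_ms = {
    "endpointing": 200,      # confirming the user actually stopped speaking
    "transcription": 150,    # final STT output for the turn
    "llm_first_token": 300,  # model time to the first streamable token
    "tts_first_audio": 150,  # synthesis time to the first audio chunk
    "network": 100,          # round trips between services
}

total_ms = sum(budget_ms.values())  # time until the first audible word
feels_natural = total_ms <= 1000    # the ~1 second threshold above
```

If any single stage balloons, say a slow first token caused by a verbose prompt, the whole turn crosses the one-second line.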

Orchestration matters as much as model choice. You can use the most capable model available, but if your pipeline adds unnecessary latency, the experience suffers. Low-latency real-time platforms handle the networking and streaming complexity, but the prompt itself is still your responsibility. Verbose prompts pile on constraints, which often leads the model to generate longer responses. In a streaming pipeline, every filler word at the start of a response delays the delivery of the actual answer. Complex reasoning instructions (like “think step-by-step”) add latency, since the model may generate additional internal tokens before producing the first streamable output.

Every unnecessary token adds latency, but don’t sacrifice clarity for shorter prompts. An ambiguous prompt can cost more time than the tokens you saved.

Core Prompting Principles for Voice

Be Explicit About Role, Tone, and Pacing

Text prompts often leave tone implicit. You write “You are a helpful assistant” and let the model figure it out.

Voice prompts cannot afford ambiguity. You need to specify:

  • Who the assistant is (role, expertise level, personality)
  • How they should sound (calm, energetic, formal, casual)
  • What pace they should maintain (brief responses, detailed explanations, or adaptive)

A prompt like “You are a helpful customer service agent” produces generic results. A prompt like “You are a calm, patient support specialist. Speak in short, clear sentences. Avoid filler phrases. Get to the point quickly” produces voice-ready responses.

Design for Speech, Not Text

Written language and spoken language follow different rules. Sentences that read well often sound awkward when spoken aloud.

Guide your model toward speech-friendly output:

  • Short sentences over compound structures
  • Direct phrasing over hedged language
  • Concrete words over abstract ones
  • Active voice over passive constructions

“The information you requested regarding your account status can be found by navigating to the settings panel” reads fine. Spoken aloud, it’s exhausting.

“Your account status is in Settings. Want me to walk you through it?” This works better.

Format for TTS (and What to Avoid)

Text-to-speech systems are sensitive to the visual structure of text. Characters that look fine in text can cause a voice agent to stutter or read literal code.

Test how your TTS handles numbers and symbols: Don’t assume “spelled out” is always better. While older engines required “twenty-three,” many modern neural engines (like ElevenLabs or Cartesia) have sophisticated normalization layers. They may prefer formats like “23” or “$5” to correctly trigger normalization for dates or currency, depending on the TTS engine. Run a quick A/B test with your specific provider to see which produces the most natural rhythm.

Avoid markdown formatting: Explicitly instruct your LLM to strip out asterisks, hashes, and bullets (**bold**, ### Header, * list). Some TTS engines will ignore them, but others will literally say “Hash hash hash” or “Asterisk,” completely shattering the user’s immersion.

Punctuate for breath: TTS engines use periods, commas, and dashes as cues for natural pauses and pitch shifts. If a response sounds like a wall of sound, use your prompt to enforce more frequent punctuation to give the digital voice room to breathe.

Use phonetic spellings for tricky names and industry slang: If your brand or product has an unusual pronunciation (e.g., “Agora” vs “Ag-ora”), you may need a conditional rule: “When mentioning our brand, spell it phonetically as [your phonetic spelling] to ensure correct pronunciation.”

Use Conditional Rules

Voice interactions are unpredictable. Users interrupt, change topics, ask unexpected questions, or go silent. Your prompt needs to handle these gracefully.

Structure behavioral guidance as clear conditionals:

  • “If the user interrupts, don’t try to finish your previous thought. Address their new input directly.”
  • “If the user gives a partial answer, ask for the missing information directly.”
  • “If you don’t know the answer, say so directly. Don’t speculate.”
  • “If the user seems frustrated, acknowledge it before continuing.”

These rules give the model a decision logic without requiring you to anticipate every scenario.

Keep Prompts Lightweight

Longer prompts can lead to longer outputs, depending on how the instructions are written.

That doesn’t mean stripping your prompt to nothing. But you should be strict about what’s actually necessary.

Ask yourself:

  • Does this instruction change behavior in a meaningful way?
  • Could this be shorter without losing clarity?
  • Is this rule actually triggered in real conversations?

A 2,000-token system prompt that covers every edge case might produce slightly better responses in rare scenarios. A 400-token prompt that covers the common cases well will usually be faster, but only if it’s clear. A short, ambiguous prompt can cost more than a longer, precise one.

For voice interactions, lower latency directly improves the experience.

Example Transcripts

Theory only takes you so far. Here’s what different prompting choices sound like in practice.

Concise vs. Rambling

Prompt A (minimal guidance): “You are a helpful voice assistant for a hotel.”

User: What time is checkout?

Assistant: Great question! So, checkout time at our hotel is typically 11 AM, although I should mention that this can sometimes vary depending on your specific room type and any arrangements you may have made at the time of booking. If you need a late checkout, that’s often something that can be arranged depending on availability, so you might want to check with the front desk about that option if you’re interested.

Prompt B (voice-optimized): “You are a concise voice assistant for a hotel. Give brief, direct answers. One to two sentences maximum unless the user asks for details.”

User: What time is checkout?

Assistant: Checkout is at 11 AM. Need a late checkout? The front desk can help with that.

The first response runs over 70 words. The second is under 20. Both answer the question. Only one respects the user’s time.

Calm vs. Overly Energetic

Prompt A: “You are an enthusiastic and friendly assistant helping users with their fitness tracking app. Be upbeat and encouraging!”

User: I missed my step goal yesterday.

Assistant: Oh no, but that’s totally okay! Every day is a brand new opportunity to crush those goals! Missing one day doesn’t define your journey, what matters is that you’re here now, ready to get back on track! You’ve got this! Want me to help you set up a reminder so you can smash today’s goal?

Prompt B: “You are a supportive assistant for a fitness tracking app. Be warm but calm. Avoid excessive enthusiasm. Match the user's energy level.”

User: I missed my step goal yesterday.

Assistant: That happens. One day doesn’t change much in the long run. Want to check what today’s goal is?

The first response might work for some users. For many, especially those already feeling discouraged, it reads as tone-deaf. The second meets the user where they are.

Latency-Friendly vs. Bloated

Prompt A (exhaustive): “You are an AI assistant for TechCorp's customer support line. Your role is to help customers with questions about their accounts, billing, technical issues, product information, shipping, returns, warranty claims, and general inquiries.

When responding:
- Always greet the customer warmly
- Acknowledge their question before answering
- Provide complete information
- Offer additional help
- Use the customer's name if known
- Be empathetic and understanding
- Never say "I don't know" without offering an alternative
- Always end with asking if there's anything else
- Reference relevant policies when applicable
- Escalate to human support if the issue is complex
- Document the interaction type for analytics
- Maintain a professional yet friendly tone
- Use active listening techniques
- Validate customer feelings
- Provide step-by-step instructions when needed

You have access to the knowledge base for: account management, billing systems, product catalog, shipping policies, return procedures, warranty information, and troubleshooting guides.”

Prompt B (focused): “You're a support agent for TechCorp. Be helpful, brief, and direct. If you can't solve something, say so and offer to connect them with a specialist.”

Prompt A runs close to 200 words. Prompt B is under 30.

In testing, Prompt B produced faster responses for the vast majority of queries. A shorter prompt gives the model less to process and fewer competing instructions to balance.

Building Natural Voice Experiences

Good prompts matter, but they’re not the whole picture. Natural voice experiences require thinking beyond individual responses to the full conversational flow.

Turn-Taking

Human conversations have rhythm. People signal when they’re done speaking. They pause for responses. They know when it’s their turn.

Voice AI needs the same awareness. Your prompt should guide the model on when to speak and when to listen.

“Keep responses brief enough that users can easily interject. For complex explanations, deliver one part at a time and end with a check-in question. Let the user respond before continuing.”

Without this guidance, models tend toward monologue. They’ll deliver comprehensive answers that technically address the question but don’t leave room for the natural back-and-forth of conversation.

Interruptibility

Users will interrupt. They’ll realize mid-sentence that they asked the wrong question. They’ll want to redirect. They’ll get impatient.

Build interruption handling into your prompt:

“If the user interrupts, acknowledge it briefly if needed, then address their new input. Don’t try to finish your previous thought.”

Sounds obvious, right? But without explicit instruction, many models will try to complete their thought before addressing the interruption. That’s exactly the opposite of natural conversation.

Context Management

Voice conversations accumulate context quickly. Ten exchanges in, there’s a lot of history. But loading all of that into every prompt increases latency and can confuse the model.

Think about:

  • What context is essential for the current turn?
  • What can be summarized?
  • What can be dropped entirely?

“Focus on the user’s most recent question. Reference earlier conversation only if directly relevant.”
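One common implementation pattern is a rolling summary plus the last few turns. A minimal sketch, where the function name, limits, and message format are all illustrative:

```python
def trim_history(turns, max_recent=4, summary=None):
    """Keep an optional rolling summary plus only the most recent turns.
    `turns` is a list of (role, text) pairs; the limits are illustrative."""
    messages = []
    if summary:
        messages.append(("system", f"Summary of earlier conversation: {summary}"))
    messages.extend(turns[-max_recent:])  # drop everything older
    return messages
```

Ten exchanges in, the model sees one short summary line and the last few turns instead of the full transcript, which keeps per-turn latency flat as the conversation grows.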

Safety Boundaries

Voice interactions feel more personal than text. Users might share sensitive information, make inappropriate requests, or become abusive. Your prompt needs clear boundaries.

“If users share personal information unprompted, don’t repeat it back. If asked to do something outside your role, decline politely and redirect. If a user becomes abusive, stay calm and offer to end the conversation.”

These aren’t just ethical requirements. They’re user experience requirements. An assistant that handles edge cases gracefully maintains trust.

The Limits of Prompting

Prompts shape what the LLM says, but much of what makes voice conversations feel natural happens at the orchestration layer, before the LLM is ever invoked.

Consider what’s actually required for smooth turn-taking: the system needs to detect when the user has stopped speaking (endpointing), decide whether a pause means the user finished speaking (voice activity detection), and determine if the user is trying to interrupt (barge-in detection). These are real-time audio and orchestration problems in the pipeline, not language model problems. No amount of prompt engineering can make an LLM listen to audio or decide when to stop talking mid-sentence.

Orchestration platforms like Agora’s Conversational AI Engine handle these concerns at the agent level. They provide built-in algorithms for VAD, endpointing, and interruption handling that run independently of your LLM. The orchestration layer decides when to send audio to transcription, when to invoke the LLM, and when to cut off TTS playback. Your prompt controls what the model says once it’s called; the orchestration layer controls everything else.

This is why choosing the right infrastructure matters as much as writing the right prompt. You need both working together. Agora provides tools such as the Conversational AI Performance Lab to help teams evaluate low-latency combinations across AI models — https://www.agora.io/en/conversational-ai-performance-lab/.

Testing and Iteration

Prompting voice AI isn’t a one-time task. You’ll be testing, measuring, and refining throughout the life of your product.

Golden Test Scripts

Create a set of standard interactions that cover your core use cases. Run them regularly against prompt changes.

A good test set includes:

  • Happy path interactions (common questions, expected flows)
  • Edge cases (unusual requests, ambiguous input)
  • Stress tests (long conversations, topic switches, interruptions)
  • Adversarial inputs (attempts to break character, inappropriate requests)

Record both the transcripts and the audio. Reading responses and hearing them are different experiences.
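A golden-script harness can stay very small. In this sketch, run_agent is a stand-in for your actual pipeline call, and the checks are examples of voice-readiness properties worth asserting:

```python
# Hypothetical golden-script harness: replay scripted user turns and
# flag responses that would sound bad through TTS.

GOLDEN_SCRIPTS = {
    "happy_path": ["What time is checkout?", "Can I get a late checkout?"],
    "edge_case": ["Uh, the thing with the, you know, the room?"],
}

def run_agent(user_turn: str) -> str:
    """Stand-in for the real pipeline call; returns a canned reply here."""
    return "Checkout is at 11 AM."

def check_response(text: str, max_words: int = 40) -> list[str]:
    """Return a list of problems; empty means the response passes."""
    problems = []
    if len(text.split()) > max_words:
        problems.append("too long")
    if any(ch in text for ch in "*#"):
        problems.append("markdown leaked into TTS text")
    return problems

def run_suite():
    """Run every script and collect problems per turn."""
    return {name: [check_response(run_agent(t)) for t in turns]
            for name, turns in GOLDEN_SCRIPTS.items()}
```

Run the suite on every prompt change; a response that suddenly fails the length or markdown check is a regression even if it is factually correct.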

A/B Testing Prompts

When possible, test prompt variations with real users. Small wording changes can produce measurably different outcomes.

Track:

  • Task completion rates
  • Conversation length (shorter is usually better for transactional tasks)
  • User-initiated interruptions (high rates may indicate pacing problems)
  • Explicit feedback (if you ask for it)

Don’t just test for correctness. Test for feel.

Metrics That Matter

For voice specifically, pay attention to:

Time-to-first-token: How long from user input to the first word of response? This is your latency floor. Both prompt length and clarity affect this. Longer prompts take more time to process, and ambiguous prompts can increase the model’s “thinking” time before it starts generating.
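Measuring this is straightforward with a streaming response. A sketch using a fake token stream to stand in for a real LLM call:

```python
import time

def time_to_first_token(stream):
    """Seconds from request start to the first streamed token.
    `stream` is any iterator of tokens, e.g. an LLM streaming response."""
    start = time.monotonic()
    first = next(stream, None)
    return first, time.monotonic() - start

# Fake stream simulating 50 ms of model "thinking" before the first token.
def fake_stream():
    time.sleep(0.05)
    yield "Checkout"
    yield " is at 11 AM."

token, ttft = time_to_first_token(fake_stream())
```

Because generators don’t run until the first next() call, the simulated delay lands inside the timed region, just as a model’s prompt-processing time would in a real streaming call.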

Response length consistency: Are responses predictably sized, or do they vary wildly? Inconsistency makes conversations hard to follow.

Interruption rate: How often do users cut off the assistant? Some interruption is natural. Excessive interruption suggests responses are too long or off-target.

Conversation completion: Do users accomplish their goals? Do they give up partway through?

Treat prompting as product iteration. Every change is a hypothesis. Test it, measure it, learn from it.

Conclusion

Prompting voice AI is closer to directing a real-time interaction than writing documentation.

You’re shaping how an agent sounds, when it speaks, how long it talks, and how it handles the unexpected. Every word in your prompt influences the user’s experience. Not just the content of responses, but their rhythm, their tone, their timing.

The principles aren’t complicated:

  • Be explicit about how you want the agent to behave
  • Design for speech, not text
  • Keep prompts lean to minimize latency
  • Test with your ears, not just your eyes
  • Iterate based on real conversations

Infrastructure matters too. The best prompt in the world can’t overcome a pipeline that adds seconds of latency. Low-latency real-time platforms handle the hard networking problems, but they can’t fix a prompt that produces rambling, tone-deaf responses.

Good prompts and solid infrastructure work together. You can’t skip either one.

If you’re building voice AI, start by listening. Record your system handling real interactions. Play them back. Note where conversations feel natural and where they feel forced. Let those observations guide your prompt iterations.

The goal isn’t a prompt that looks good in text. It’s a conversation that feels good in the ear.
