Real-Time Speech to Text API for Live Transcription

HOW TO INTEGRATE?

Streamlined 3-step integration process:

01

Activate Agora Conversational AI Engine

Unlock real-time Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities, enabling seamless conversational interactions.

02

Integrate Agora Edge Chip on Hardware

Optimize microphone, speaker, and system efficiency to ensure ultra-low-latency and high-fidelity conversations.

03

Deploy AI Voice Agents

Enable interactive, multilingual, and user-customized conversations for a wide range of IoT applications.

Frequently asked questions

How does Agora improve the experience in comparison with other solutions for voice interaction with AI?

Agora enables more natural voice conversations with AI, thanks to low-latency responses and real-time interruption handling. Agora’s built-in background noise suppression, echo cancelation, and selective attention locking allow AI to hear the user clearly in any environment. Agora’s global real-time network ensures connectivity and performance in any location.

What LLMs can be connected to Agora’s conversational AI platform?

Agora's Conversational AI Engine offers support for a wide range of large language models (LLMs), including:

OpenAI
OpenAI Realtime API
Azure OpenAI
Google Gemini
Google Vertex AI
Anthropic Claude
Dify
Custom LLM

Review our documentation on connecting LLMs here: https://docs.agora.io/en/conversational-ai/models/llm/overview

What automatic-speech-recognition (ASR) / speech-to text (STT) models are supported?

Agora’s Conversational AI Engine currently supports the following ASR providers:

ARES (default)
Microsoft Azure
Deepgram

Review our documentation on connecting ASR models here: https://docs.agora.io/en/conversational-ai/models/asr/overview

What text-to-speech (TTS) models are supported?

Agora’s Conversational AI Engine currently supports the following TTS providers:

Microsoft Azure
ElevenLabs
Cartesia (Beta)
OpenAI (Beta)
Hume AI (Beta)

Review our documentation on connecting TTS models here: https://docs.agora.io/en/conversational-ai/models/tts/overview

What avatar providers are supported?

Agora’s Conversational AI Engine currently supports the following AI avatar providers:

Akool (Beta)
HeyGen (Alpha)

Review our documentation on connecting avatar providers here: https://docs.agora.io/en/conversational-ai/models/avatar/overview

What additional technology is required to implement a voice AI agent?

To implement a voice AI agent, you need to connect an LLM and a text-to-speech service to Agora’s Conversational AI Engine. This enables full customization of the experience, with the LLM and voice of your choice.

What is a “chained” or “cascade” model” in relation to conversational voice AI?

The chained or cascade model refers to the processing flow of the user’s voice being processed by automatic speech recognition (ASR) technology that converts speech to text, then that text being processed by the LLM, then the LLM’s response being processed by text-to-speech technology and ultimately outputting the AI agent’s voice response.

Does Agora’s Conversational AI Engine enable the creation of an AI model or LLM?

No, Agora’s Conversational AI Engine requires an existing AI model or LLM. The Engine enables customized voice interaction with the LLM but is not capable of creating or training an LLM.

FAQs

What is Agora Voice Calling?

Agora Voice Calling is a real-time voice API that lets developers embed high-quality, ultra-low latency voice chat into any application. It supports one-to-one calls, group voice chat, and large-scale audio rooms across devices and platforms.

Which platforms does Agora Voice Calling support?

Agora Voice Calling supports Android, iOS, Web, Windows, Electron, Flutter, React Native, Unity, and Unreal Engine. This allows teams to build consistent voice experiences across mobile, web, desktop, and immersive environments.

How does Agora deliver HD audio quality with low latency?

Agora uses a 48 kHz sampling rate with full-bandwidth audio capture and intelligent routing over its global real-time network. This minimizes latency, jitter, and packet loss to deliver clear, stable voice calls—even on unstable networks.

Does Agora support AI-powered voice features?

Yes. Agora Voice Calling includes AI-powered features such as Noise Suppression, Real-Time Speech to Text, and seamless integration with large language models and text-to-speech engines to enable intelligent, voice-driven experiences.

Can I record voice calls and audio sessions?

Yes. Agora supports flexible voice recording in the cloud or on premises. Developers control audio formats, storage locations, and recording quality to support playback, analytics, moderation, or compliance needs.

What is 3D Spatial Audio and when should I use it?

3D Spatial Audio simulates real-world sound positioning, making conversations feel more immersive and natural. It’s commonly used in gaming, social audio rooms, virtual workspaces, and metaverse-style experiences.

How quickly can I launch a voice calling experience?

You can integrate Agora Voice Calling within hours using SDKs, documentation, and sample apps. For teams that want to move faster, Agora App Builder offers a no-code option to deploy voice chat without custom development.

What applications are best suited for Agora Voice Calling?

Agora Voice Calling is ideal for education platforms, multiplayer games, social apps, collaboration tools, live shopping, customer engagement, and IoT devices—any use case that requires reliable, real-time voice communication at global scale.

FAQs

What is Agora Video Calling?

Agora Video Calling is a real-time video API that lets developers embed high-quality, low-latency video calls into web, mobile, and native applications. It supports everything from 1:1 calls to large-scale video experiences with full customization.

Which platforms are supported by Agora’s Video Calling SDK?

Agora Video Calling supports Android, iOS, Web, Windows, Electron, Flutter, React Native, Unity, and Unreal Engine—making it easy to deliver consistent video experiences across devices and operating systems.

How does Agora ensure reliable video quality in poor network conditions?

Agora uses intelligent routing and adaptive video optimization to reduce jitter, lag, and packet loss. The platform dynamically adjusts video quality in real time to maintain smooth, uninterrupted calls—even on slow or unstable networks.

What collaboration features are available with Agora Video Calling?

Agora supports advanced collaboration features such as screen sharing, interactive whiteboards, multi-user video layouts, and real-time messaging. These features make it well suited for meetings, education, telehealth, and collaborative work apps.

Can I record video calls and meetings?

Yes. Agora provides flexible video call recording options, allowing you to record securely to the cloud or on local servers. Developers control video format, resolution, storage location, and access permissions to meet compliance and operational needs.

Does Agora support multi-camera or multi-audio setups?

Yes. Agora supports multi-track audio and video, making it possible to publish multiple camera feeds or microphone streams within a single session. This is ideal for live production workflows, virtual events, and advanced conferencing scenarios.

How fast can I launch a video calling experience?

You can ship a video calling app within hours using Agora SDKs, documentation, and sample apps. For even faster deployment, Agora App Builder provides a no-code option to launch video, voice, and live streaming features without custom development.

What use cases are best suited for Agora Video Calling?

Agora Video Calling is ideal for education, remote work, gaming, social apps, live shopping, and telehealth. Any application that requires scalable, real-time video communication with global reach and low latency can benefit from Agora’s platform.

FAQs

What is Agora Real-Time Chat?

Agora Real-Time Chat is a customizable chat SDK that lets developers add secure, scalable messaging to real-time video, voice, and live streaming applications. It supports one-to-one messaging, group chat, and large community channels.

Which platforms are supported by Agora’s Chat SDK?

Agora’s Chat SDK supports Android, iOS, Web, Windows, Flutter, React Native, and Unity, making it easy to deliver consistent messaging experiences across mobile, desktop, and cross-platform apps.

What messaging features does Agora Chat support?

Agora Chat supports rich media messaging including emojis, images, files, GPS locations, structured messages, and voice notes. Core messaging features also include offline messaging, message recall and deletion, read receipts, typing indicators, presence, and push notifications.

How does Agora ensure chat security and compliance?

Agora Chat uses TLS/SSL encryption for data in transit and encrypted file storage to protect user data. The platform also supports privacy compliance features such as user data deletion and secure message handling.

Does Agora Chat include moderation and community safety tools?

Yes. Agora Chat includes built-in content moderation to help filter profanity, offensive language, and inappropriate images or text. Developers can also integrate third-party moderation tools for additional control.

Can Agora Chat support multilingual users?

Yes. Agora Chat supports multilingual message translation with automatic, on-demand, or push-based translation options, enabling users to communicate in their preferred language.

How quickly can I launch a chat experience with Agora?

Developers can launch a chat experience within hours using Agora SDKs, documentation, and sample apps. For faster implementation, Agora UI Kit provides a low-code option to add messaging with minimal development effort.

What use cases are best suited for Agora Real-Time Chat?

Agora Real-Time Chat is ideal for education platforms, gaming communities, social apps, collaboration tools, live commerce, and telehealth—any application that requires reliable, secure, and engaging real-time messaging.

FAQs

What is Agora Real-Time Speech to Text?

Agora Real-Time Speech to Text is a cloud-based live transcription and subtitling service that converts real-time audio into accurate text for live audio and video applications. It enables captions, transcripts, and AI-powered workflows without impacting real-time performance.

How does Real-Time Speech to Text work in live audio and video sessions?

Agora’s cloud-based transcription processes audio streams in real time and converts speech into text with low latency. Transcripts can be delivered as live captions to participants, stored for later review, or exported for downstream processing.

Can I integrate Real-Time Speech to Text with large language models (LLMs)?

Yes. Real-time transcripts can be integrated with large language models to generate summaries, meeting notes, action items, feedback, or translations. Transcripts can also be exported as .vtt files for seamless LLM processing without affecting RTC performance.

Does Agora support multiple speakers and overlapping speech?

Yes. Agora supports real-time speaker recognition and labeling for up to three simultaneous speakers. Each speaker can be transcribed separately, improving accuracy in conversations with interruptions or overlapping dialogue.

What languages are supported by Agora’s Real-Time Speech to Text?

Agora supports all major languages and regional dialects. Each channel can transcribe up to two languages simultaneously, making it ideal for multilingual meetings, events, and global applications.

Can I generate captions for recorded audio or video?

Yes. Agora supports transcription for cloud-recorded audio and video, enabling closed captions (CC) during playback and searchable transcripts for reviewing important discussion points.

How does Agora ensure transcription accuracy at scale?

Agora uses advanced AI techniques to reduce silence, lower Word Error Rate (WER), and maintain accuracy even with accents, overlapping speech, poor audio quality, or unstable networks. The solution scales from one-to-one sessions to millions of participants with consistent accuracy.

Is Real-Time Speech to Text secure and compliant?

Yes. Agora is ISO and SOC 2 certified and supports compliance with GDPR, CCPA, and HIPAA. Live captions and transcripts can be encrypted using the same security mechanisms as Agora’s real-time audio and video streams.

FAQs

What is Agora Recording?

Agora Recording is an extension that allows developers to record audio streams, video streams, interactive content, and web pages for archive, review, compliance, or redistribution. It supports cloud, on-premises, and webpage recording options.

What types of content can I record with Agora?

Agora Recording can capture audio, video, screen content, whiteboards, chat messages, and live streaming elements. You can record single streams or multiple streams separately, making it easy to edit, combine, or repurpose content later.

What’s the difference between single-stream and multi-stream recording?

Single-stream recording combines audio, video, and content into one synchronized file. Multi-stream recording captures each audio, video, or content stream separately, giving you greater flexibility for post-production, analysis, or moderation workflows.

Where are recordings stored?

Recordings can be stored in the cloud or on-premises, depending on your deployment needs. Agora supports third-party cloud storage providers such as Amazon S3, Microsoft Azure, Google Cloud, Alibaba Cloud, Tencent Cloud, and others.

Can Agora Recording support moderation and compliance requirements?

Yes. Agora Recording supports screenshots for moderation, customizable capture intervals, digital watermarks, and content moderation tools. These features help enforce community guidelines, protect intellectual property, and meet regulatory or organizational requirements.

How secure is Agora Recording?

Agora Recording is built with enterprise-grade security, including end-to-end encryption for calls, transmission, and storage. It supports globally distributed clusters, automatic backups, proxy services, and LAN deployment to meet strict data security and privacy needs.

How quickly can I integrate recording into my application?

Developers can integrate Agora Recording in as little as 30 minutes using RESTful APIs. The service is designed to be easy to embed, test, and deploy, with automatic uploading and backup to ensure recordings are not lost.

What use cases are best suited for Agora Recording?

Agora Recording is ideal for virtual events and webinars, large-scale live streaming, customer service quality assurance, education and online classes, and telehealth consultations—any scenario where capturing, reviewing, or distributing real-time interactions is essential.

TEN

App Builder

Flexible Classroom

Download SDKs

Support Plans and Pricing

Real-Time Speech to Text

Real-Time Speech to Text

Features

Cloud-based live transcription

LLM integration

Transcribing and labeling simultaneous speakers

Captioning for cloud recordings

Multi-language support

Enterprise-grade security and compliance

Talk to a voice agent powered by the Conversational AI Engine

Your vision, unrestricted.

See OpenAI's Realtime API in action

Instantly transcribe speech to text for live audio and video

Reduce cost and increase efficiency

Reduce cost and increase efficiency

Get the most accurate results at scale

Get the most accurate results at scale

Integrate with ease

Integrate with ease

Recording options for:

Agora Media Services

Made for developers

Quickstart guide

How the Conversational AI Engine works

Made for developers

Agora SDK

Agora SDK

App Builder

Agora UI Kit

Agora SDK

Agora UI Kit

Documentation

Agora SDK

App Builder

Agora SDK

Fastboard

Integrated chipset and module

Transcribe speech to text for any real-time application

Education

Telehealth

Events

Live shopping

Virtual meetings

Social & metaverse

Fastboard

Frequently asked questions

FAQs

FAQs

FAQs

FAQs

FAQs

Get started with 300 free minutes

Talk to Us

Developer Resources