EXTENSION

Real-Time Speech to Text

Create a better user experience and integrate with large language models (LLMs) using the most accurate cloud-based live transcription and subtitling.
A live video conference interface showing a woman presenting with real-time transcription and meeting notes displayed, including participant thumbnails and a transcription panel with highlighted key points.
Supported Platforms
RESTful API
Customers building with
Agora and OpenAI
grepp, WYZE, kileon, kumu, Scaler, Parallel, JorJin, AnotherBall, Ellie, zigbang

Features

Cloud-based live transcription

Cloud-based transcription converts audio to text for active or selected hosts in real time. Text can be distributed as live captions to all participants in the channel.

LLM integration

Integrate real-time transcription with large language models (LLMs) like GPT to generate summaries, notes, or feedback—without impacting RTC performance. Export transcripts as .vtt files for seamless processing.

Transcribing and labeling simultaneous speakers

Accurately identify and label up to three simultaneous speakers with real-time speaker recognition. Each host is transcribed separately, which improves accuracy and lets you choose which hosts to transcribe.

Captioning for cloud recordings

Transcribe audio to text on video or audio recordings to enable closed captions (CC) on playback or review important discussion items in the transcript.

Multi-language support

Real-time transcription supports all major languages and dialects, and each channel can support audio-to-text transcription for up to two languages simultaneously. 

Enterprise-grade security and compliance

Agora is ISO and SOC 2 certified and meets compliance standards for regional privacy laws and industry regulations, including GDPR, CCPA, and HIPAA. Live captions and transcription can be encrypted in the same way as encrypted RTC audio or video.

Talk to a voice agent powered by the Conversational AI Engine

Try it now
One real-time view for the metrics that matter the most
Use a single dashboard to monitor every active session around the world. Track the metrics that are most important to you, from concurrent users and channels to network latency and so much more.

Your vision, unrestricted.

With Interactive Whiteboard, you can build a collaborative app fast—with custom branding and full of features. Our platform makes it easy to create a customized and engaging learning environment.
  • Flexible APIs support custom branding and extensive digital whiteboard features.
  • Easily integrate real-time voice and video calling, interactive streaming and signaling.
  • Save users’ bandwidth by preloading, sharing, and annotating files, and retain all the dynamic content.
And have peace of mind with HIPAA, GDPR, and CCPA compliance.

See OpenAI's Realtime API in action

Instantly transcribe speech to text for live audio and video

Agora’s Real-Time Speech to Text provides accurate live transcription and subtitling services at a low cost.

Reduce cost and increase efficiency

More efficient and cost-effective than traditional client-side live transcription, Agora’s solution uses advanced technology to remove silence, reduce Word Error Rate (WER), and distribute live captions to all participants in a channel.

Get the most accurate results at scale

Cutting-edge AI ensures the highest accuracy even with overlapping speech, regional accents, and poor network conditions. Scale from one-to-one meetings to sessions with millions of participants with the same accuracy.

Integrate with ease

Agora’s Real-Time Speech to Text is highly integrated with Agora’s network (SD-RTN™), providing global user transcription and real-time text distribution even in poor network environments.

Recording options for:

Cloud recording
Store, retrieve and share recordings in the cloud.
Go to Docs
On-premise recording
Store on a local server for security and confidentiality.
Go to Docs
Webpage recording
Record the entire web browser screen experience.
Go to Docs

Agora Media Services

Recording
Record audio streams, video streams and web pages for archive, review, or distribution.
Media Gateway
Directly push media streams into Agora voice and video channels using the RTMP/SRT protocol and enable advanced transcoding processing on media streams to facilitate distribution.
Cloud Transcoding
Beta
Obtain audio and video source streams from hosts in RTC channels and perform transcoding, audio mixing, and video compositing.
Media Pull
Add engagement to your Agora sessions by pulling live or recorded video and audio content and ingesting it directly into your Agora channel.
Media Push
Expand your audience with hybrid engagement experiences by pushing audio and video streams from Agora channels to Content Delivery Networks (CDN).

Made for developers

Quickstart guide

View the quickstart guide to get up and running with Agora and OpenAI.

How the Conversational AI Engine works

Made for developers

Your Code

Agora SDK

Customize your experience from the start with our flexible SDK.
Your Code

Agora SDK

Build and integrate real-time video into your app with the most flexibility and customization using Agora's Video SDK.
NO CODE

App Builder

Agora’s App Builder is the fastest and easiest way to add real-time video to your product using our no-code visual designer.
Go to Docs
low code

Agora UI Kit

Add real-time video to your app with only a few lines of code using low-code UI Kit libraries.
Go to Docs
your code

Agora SDK

Customize your experience from the start with our flexible SDK.
RESTful API
Go to Docs
low code

Agora UI Kit

Integrate real-time communication and streaming using only a few lines of code with low-code UI Kit libraries.
Go to Docs

Documentation

This project provides a set of API examples to help you understand how to use Agora APIs.
Platform-agnostic RESTful APIs make it easy to add highly accurate and cost-effective real-time speech-to-text capabilities.
RESTful API
Go to Docs

Activate the AI Noise Suppression extension on the Agora Console.

Activate the Real-Time Speech to Text extension in the Agora Console.

your code

Agora SDK

Build and integrate Live Streaming with the most flexibility and full customization using Agora's Video SDK.
RESTful API
Go to Docs
NO code

App Builder

Agora’s App Builder is the fastest and easiest way to add real-time voice chat, video chat, and live streaming into your product.
Go to Docs
your code

Agora SDK

Build and integrate real-time visual collaboration features into your application with the most flexibility and full customization using Agora's Interactive Whiteboard SDK.
RESTful API
Go to Docs
LOW code

Fastboard

Build real-time visual collaboration faster with a pre-built UI and the ability to include custom plug-ins.
Try it Now
Security, privacy and compliance
Agora is certified to the ISO/IEC 27001, 27017, 27018, 27701 and SOC 2 security standards and meets privacy regulations like GDPR, CCPA, COPPA, and HIPAA. Agora doesn’t collect or store any end-user data aside from Internet Protocol (IP) addresses and operational information necessary for providing our services.
ISO 27001:2022
ISO 27017:2015
ISO 27018:2019
ISO 27701:2019
HIPAA
GDPR
SOC 2 Type 1 & 2
CCPA
COPPA
HOW TO INTEGRATE?
Streamlined 3-step integration process:
01
Activate Agora Conversational AI Engine
Unlock real-time Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities, enabling seamless conversational interactions. 
02
Integrate Agora Edge Chip on Hardware
Optimize microphone, speaker, and system efficiency to ensure ultra-low-latency and high-fidelity conversations.
03
Deploy AI Voice Agents
Enable interactive, multilingual, and user-customized conversations for a wide range of IoT applications.

Integrated chipset and module

By building our Conversational AI technology into RiseLink's high-performance IoT chip modules, the turnkey solution makes it easy to integrate voice AI into any connected toy.
“With Agora’s conversational AI technology and our optimized AI hardware, we’re enabling the next generation of toys to think, respond, and interact naturally. We are excited to usher in the future of robotics and toys, ones that can react to the environment around them and interact fluently with users.” 
Pengfei Zhang
CEO, RiseLink
Use cases

Transcribe speech to text for any real-time application

Securely transcribe and record real-time audio or video and organize recordings and transcripts to speed up workflows.
An online classroom with real-time captioning powered by speech-to-text transcription and subtitling.

Education

Give faculty and students real-time captions and analyze them with an LLM to provide lesson summaries and suggestions for further learning.
A live video call with a doctor and speech-to-text transcription services.

Telehealth

Keep secure records of virtual appointments for Minimum Effective Response (MER) and cross-reference telehealth knowledge bases.
A live basketball game showing a player soaring through the air and making a slam dunk in front of a packed arena. Overlay text via speech-to-text reads "Unbelievable move! The score is now 68-65."

Events

Empower your event with real-time, accurate notes, ensuring a more accessible, searchable, and engaging event experience.
A speech-to-text enriched live shopping session with a woman detailing a veggie basket product offering.

Live shopping

Use virtual assistants to improve accessibility and reach a wider audience by offering detailed product information, personalized recommendations, and guiding customers through the purchasing process.
A virtual meeting between four people with real-time automated notes and documented outstanding questions and action items via an LLM.

Virtual meetings

Provide real-time automated notes in meetings and document outstanding questions and action items via an LLM.
An influencer on a social channel sharing a review of a sandwich with speech-to-text translations into Vietnamese.

Social & metaverse

Eliminate communication barriers for people with different languages or disabilities. Extract conversation for business optimization, advertising, and moderation.
Robopoet's Fuzzoo, an AI companion robot, leverages Agora's ConvoAI Device Kit to deliver real-time emotional support and personalized interaction.
"Agora’s AI technology enables toys and robots to interact in a way that feels natural and engaging. With real-time voice processing, emotional AI, and advanced speech capabilities, Agora makes seamless human-machine interaction possible and ensures exceptional performance and reliability." 
Yuna Pan
Co-Founder and CTO
Mouse cursor illustration

Fastboard

Easily build and integrate Agora’s Interactive Whiteboard with our newest Fastboard SDK that delivers all the same whiteboard features with a pre-built UI and the ability to include custom plug-ins.
Try it Now
“Agora’s Real-Time Speech to Text enabled us to integrate with AI to automate translation and feedback, providing substantial improvements in the overall language learning experience.”
Zackery Ngai
CEO, HelloTalk
Request more information
Connect with our experts to get answers to your questions, discuss requirements, and learn more about the ConvoAI Device Kit.

Frequently asked questions

How does Agora improve the experience in comparison with other solutions for voice interaction with AI?

Agora enables more natural voice conversations with AI, thanks to low-latency responses and real-time interruption handling. Agora’s built-in background noise suppression, echo cancelation, and selective attention locking allow AI to hear the user clearly in any environment. Agora’s global real-time network ensures connectivity and performance in any location.

What LLMs can be connected to Agora’s conversational AI platform?

Agora's Conversational AI Engine offers support for a wide range of large language models (LLMs), including:

  • OpenAI
  • OpenAI Realtime API
  • Azure OpenAI
  • Google Gemini
  • Google Vertex AI
  • Anthropic Claude
  • Dify
  • Custom LLM

Review our documentation on connecting LLMs here: https://docs.agora.io/en/conversational-ai/models/llm/overview

What automatic-speech-recognition (ASR) / speech-to-text (STT) models are supported?

Agora’s Conversational AI Engine currently supports the following ASR providers:

  • ARES (default)  
  • Microsoft Azure
  • Deepgram

Review our documentation on connecting ASR models here: https://docs.agora.io/en/conversational-ai/models/asr/overview

What text-to-speech (TTS) models are supported?

Agora’s Conversational AI Engine currently supports the following TTS providers:

  • Microsoft Azure
  • ElevenLabs
  • Cartesia (Beta)
  • OpenAI (Beta)
  • Hume AI (Beta)

Review our documentation on connecting TTS models here: https://docs.agora.io/en/conversational-ai/models/tts/overview

What avatar providers are supported?

Agora’s Conversational AI Engine currently supports the following AI avatar providers:

  • Akool (Beta)
  • HeyGen (Alpha)

Review our documentation on connecting avatar providers here: https://docs.agora.io/en/conversational-ai/models/avatar/overview

What additional technology is required to implement a voice AI agent?

To implement a voice AI agent, you need to connect an LLM and a text-to-speech service to Agora’s Conversational AI Engine. This enables full customization of the experience, with the LLM and voice of your choice.

What is a “chained” or “cascade” model in relation to conversational voice AI?

In a chained (or cascade) model, the user’s voice passes through three stages in sequence: automatic speech recognition (ASR) converts the speech to text, the LLM processes that text and generates a response, and text-to-speech (TTS) converts the response into the AI agent’s spoken reply.
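
The ASR, LLM, and TTS flow described above can be sketched in a few lines of Python. This is only an illustration of the chaining idea, not Agora's implementation: the CascadePipeline class and the three fake_* stages are hypothetical stand-ins for whichever providers you connect.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadePipeline:
    """Chains three stages: speech -> text -> reply text -> speech."""
    asr: Callable[[bytes], str]   # speech-to-text stage
    llm: Callable[[str], str]     # text-in, text-out stage
    tts: Callable[[str], bytes]   # text-to-speech stage

    def respond(self, user_audio: bytes) -> bytes:
        transcript = self.asr(user_audio)   # 1. recognize what the user said
        reply_text = self.llm(transcript)   # 2. generate a text reply
        return self.tts(reply_text)         # 3. synthesize the spoken reply


# Hypothetical stand-in stages. Real stages would call an ASR, LLM, and TTS provider.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadePipeline(asr=fake_asr, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.respond(b"hello")
```

Because each stage is just a callable, any one of them can be swapped for a different provider without touching the other two, which is the main practical appeal of the chained approach.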

Does Agora’s Conversational AI Engine enable the creation of an AI model or LLM?

No, Agora’s Conversational AI Engine requires an existing AI model or LLM. The Engine enables customized voice interaction with the LLM but is not capable of creating or training an LLM.

FAQs

What is Agora Voice Calling?

Agora Voice Calling is a real-time voice API that lets developers embed high-quality, ultra-low latency voice chat into any application. It supports one-to-one calls, group voice chat, and large-scale audio rooms across devices and platforms.

Which platforms does Agora Voice Calling support?

Agora Voice Calling supports Android, iOS, Web, Windows, Electron, Flutter, React Native, Unity, and Unreal Engine. This allows teams to build consistent voice experiences across mobile, web, desktop, and immersive environments.

How does Agora deliver HD audio quality with low latency?

Agora uses a 48 kHz sampling rate with full-bandwidth audio capture and intelligent routing over its global real-time network. This minimizes latency, jitter, and packet loss to deliver clear, stable voice calls—even on unstable networks.

Does Agora support AI-powered voice features?

Yes. Agora Voice Calling includes AI-powered features such as Noise Suppression, Real-Time Speech to Text, and seamless integration with large language models and text-to-speech engines to enable intelligent, voice-driven experiences.

Can I record voice calls and audio sessions?

Yes. Agora supports flexible voice recording in the cloud or on premises. Developers control audio formats, storage locations, and recording quality to support playback, analytics, moderation, or compliance needs.

What is 3D Spatial Audio and when should I use it?

3D Spatial Audio simulates real-world sound positioning, making conversations feel more immersive and natural. It’s commonly used in gaming, social audio rooms, virtual workspaces, and metaverse-style experiences.

How quickly can I launch a voice calling experience?

You can integrate Agora Voice Calling within hours using SDKs, documentation, and sample apps. For teams that want to move faster, Agora App Builder offers a no-code option to deploy voice chat without custom development.

What applications are best suited for Agora Voice Calling?

Agora Voice Calling is ideal for education platforms, multiplayer games, social apps, collaboration tools, live shopping, customer engagement, and IoT devices—any use case that requires reliable, real-time voice communication at global scale.

FAQs

What is Agora Video Calling?

Agora Video Calling is a real-time video API that lets developers embed high-quality, low-latency video calls into web, mobile, and native applications. It supports everything from 1:1 calls to large-scale video experiences with full customization.

Which platforms are supported by Agora’s Video Calling SDK?

Agora Video Calling supports Android, iOS, Web, Windows, Electron, Flutter, React Native, Unity, and Unreal Engine—making it easy to deliver consistent video experiences across devices and operating systems.

How does Agora ensure reliable video quality in poor network conditions?

Agora uses intelligent routing and adaptive video optimization to reduce jitter, lag, and packet loss. The platform dynamically adjusts video quality in real time to maintain smooth, uninterrupted calls—even on slow or unstable networks.

What collaboration features are available with Agora Video Calling?

Agora supports advanced collaboration features such as screen sharing, interactive whiteboards, multi-user video layouts, and real-time messaging. These features make it well suited for meetings, education, telehealth, and collaborative work apps.

Can I record video calls and meetings?

Yes. Agora provides flexible video call recording options, allowing you to record securely to the cloud or on local servers. Developers control video format, resolution, storage location, and access permissions to meet compliance and operational needs.

Does Agora support multi-camera or multi-audio setups?

Yes. Agora supports multi-track audio and video, making it possible to publish multiple camera feeds or microphone streams within a single session. This is ideal for live production workflows, virtual events, and advanced conferencing scenarios.

How fast can I launch a video calling experience?

You can ship a video calling app within hours using Agora SDKs, documentation, and sample apps. For even faster deployment, Agora App Builder provides a no-code option to launch video, voice, and live streaming features without custom development.

What use cases are best suited for Agora Video Calling?

Agora Video Calling is ideal for education, remote work, gaming, social apps, live shopping, and telehealth. Any application that requires scalable, real-time video communication with global reach and low latency can benefit from Agora’s platform.

FAQs

What is Agora Real-Time Chat?

Agora Real-Time Chat is a customizable chat SDK that lets developers add secure, scalable messaging to real-time video, voice, and live streaming applications. It supports one-to-one messaging, group chat, and large community channels.

Which platforms are supported by Agora’s Chat SDK?

Agora’s Chat SDK supports Android, iOS, Web, Windows, Flutter, React Native, and Unity, making it easy to deliver consistent messaging experiences across mobile, desktop, and cross-platform apps.

What messaging features does Agora Chat support?

Agora Chat supports rich media messaging including emojis, images, files, GPS locations, structured messages, and voice notes. Core messaging features also include offline messaging, message recall and deletion, read receipts, typing indicators, presence, and push notifications.

How does Agora ensure chat security and compliance?

Agora Chat uses TLS/SSL encryption for data in transit and encrypted file storage to protect user data. The platform also supports privacy compliance features such as user data deletion and secure message handling.

Does Agora Chat include moderation and community safety tools?

Yes. Agora Chat includes built-in content moderation to help filter profanity, offensive language, and inappropriate images or text. Developers can also integrate third-party moderation tools for additional control.

Can Agora Chat support multilingual users?

Yes. Agora Chat supports multilingual message translation with automatic, on-demand, or push-based translation options, enabling users to communicate in their preferred language.

How quickly can I launch a chat experience with Agora?

Developers can launch a chat experience within hours using Agora SDKs, documentation, and sample apps. For faster implementation, Agora UI Kit provides a low-code option to add messaging with minimal development effort.

What use cases are best suited for Agora Real-Time Chat?

Agora Real-Time Chat is ideal for education platforms, gaming communities, social apps, collaboration tools, live commerce, and telehealth—any application that requires reliable, secure, and engaging real-time messaging.

FAQs

What is Agora Real-Time Speech to Text?

Agora Real-Time Speech to Text is a cloud-based live transcription and subtitling service that converts real-time audio into accurate text for live audio and video applications. It enables captions, transcripts, and AI-powered workflows without impacting real-time performance.

How does Real-Time Speech to Text work in live audio and video sessions?

Agora’s cloud-based transcription processes audio streams in real time and converts speech into text with low latency. Transcripts can be delivered as live captions to participants, stored for later review, or exported for downstream processing.

Can I integrate Real-Time Speech to Text with large language models (LLMs)?

Yes. Real-time transcripts can be integrated with large language models to generate summaries, meeting notes, action items, feedback, or translations. Transcripts can also be exported as .vtt files for seamless LLM processing without affecting RTC performance.
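
As a rough illustration of the export step, the sketch below strips the timing lines from a WebVTT document so that only the spoken text is passed to an LLM prompt. It assumes the exported file follows the standard WebVTT cue layout; the sample transcript, the vtt_to_text helper, and the prompt wording are hypothetical, and the call to an LLM is left out.

```python
import re

# A cue timing line, e.g. "00:00:01.000 --> 00:00:03.500" (the hours part is optional).
TIMING = re.compile(r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2,}:)?\d{2}:\d{2}\.\d{3}")


def vtt_to_text(vtt: str) -> str:
    """Return only the spoken text of a WebVTT document, one cue per line."""
    cues = []
    # Blocks in a WebVTT file are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for idx, line in enumerate(lines):
            if TIMING.match(line.strip()):
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[idx + 1:] if part.strip())
                # Drop inline tags such as <v Speaker> or <b>.
                text = re.sub(r"<[^>]+>", "", text)
                if text:
                    cues.append(text)
                break
        # Blocks without a timing line (the WEBVTT header, NOTE, STYLE) are skipped.
    return "\n".join(cues)


# Hypothetical sample transcript, used only for illustration.
SAMPLE_VTT = """WEBVTT

1
00:00:01.000 --> 00:00:03.500
Welcome to the weekly sync.

2
00:00:03.600 --> 00:00:06.000
First item: the release date.
"""

prompt = "Summarize this meeting transcript:\n" + vtt_to_text(SAMPLE_VTT)
```

Note that removing inline tags also removes any speaker labels carried in voice tags, so keep them if the LLM needs to know who said what.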

Does Agora support multiple speakers and overlapping speech?

Yes. Agora supports real-time speaker recognition and labeling for up to three simultaneous speakers. Each speaker can be transcribed separately, improving accuracy in conversations with interruptions or overlapping dialogue.

What languages are supported by Agora’s Real-Time Speech to Text?

Agora supports all major languages and regional dialects. Each channel can transcribe up to two languages simultaneously, making it ideal for multilingual meetings, events, and global applications.

Can I generate captions for recorded audio or video?

Yes. Agora supports transcription for cloud-recorded audio and video, enabling closed captions (CC) during playback and searchable transcripts for reviewing important discussion points.
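
To show what a closed-caption file for playback can look like, here is a minimal sketch that writes transcript segments as WebVTT cues, using WebVTT voice tags for speaker labels. The Segment structure and helper names are assumptions made for this example and do not reflect the format Agora returns.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment: times are seconds from the start of the recording."""
    start: float
    end: float
    speaker: str
    text: str


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments: list[Segment]) -> str:
    """Build a WebVTT document with one cue per segment."""
    parts = ["WEBVTT", ""]
    for seg in segments:
        # Escape characters that have special meaning in WebVTT cue text.
        text = seg.text.replace("&", "&amp;").replace("<", "&lt;")
        parts.append(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}")
        parts.append(f"<v {seg.speaker}>{text}")  # voice tag carries the speaker label
        parts.append("")  # a blank line ends the cue
    return "\n".join(parts)


captions = segments_to_vtt([
    Segment(0.0, 2.0, "Host", "Hello"),
    Segment(2.5, 5.0, "Guest", "Thanks for having me"),
])
```

Most HTML5 video players can load a file like this through a track element, which is one common way to offer closed captions on playback.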

How does Agora ensure transcription accuracy at scale?

Agora uses advanced AI techniques to reduce silence, lower Word Error Rate (WER), and maintain accuracy even with accents, overlapping speech, poor audio quality, or unstable networks. The solution scales from one-to-one sessions to millions of participants with consistent accuracy.
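
For readers unfamiliar with the metric, Word Error Rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below shows that standard calculation with a word-level edit distance; it is a generic illustration, not a description of how Agora measures accuracy, and the lowercase normalization is an assumption of this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
wer = word_error_rate("the score is now 68-65", "the score is not 68-65")
```

A lower value is better, and a WER above 1.0 is possible when the transcript contains many extra words.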

Is Real-Time Speech to Text secure and compliant?

Yes. Agora is ISO and SOC 2 certified and supports compliance with GDPR, CCPA, and HIPAA. Live captions and transcripts can be encrypted using the same security mechanisms as Agora’s real-time audio and video streams.

FAQs

What is Agora Recording?

Agora Recording is an extension that allows developers to record audio streams, video streams, interactive content, and web pages for archive, review, compliance, or redistribution. It supports cloud, on-premises, and webpage recording options.

What types of content can I record with Agora?

Agora Recording can capture audio, video, screen content, whiteboards, chat messages, and live streaming elements. You can record single streams or multiple streams separately, making it easy to edit, combine, or repurpose content later.

What’s the difference between single-stream and multi-stream recording?

Single-stream recording combines audio, video, and content into one synchronized file. Multi-stream recording captures each audio, video, or content stream separately, giving you greater flexibility for post-production, analysis, or moderation workflows.

Where are recordings stored?

Recordings can be stored in the cloud or on-premises, depending on your deployment needs. Agora supports third-party cloud storage providers such as Amazon S3, Microsoft Azure, Google Cloud, Alibaba Cloud, Tencent Cloud, and others.

Can Agora Recording support moderation and compliance requirements?

Yes. Agora Recording supports screenshots for moderation, customizable capture intervals, digital watermarks, and content moderation tools. These features help enforce community guidelines, protect intellectual property, and meet regulatory or organizational requirements.

How secure is Agora Recording?

Agora Recording is built with enterprise-grade security, including end-to-end encryption for calls, transmission, and storage. It supports globally distributed clusters, automatic backups, proxy services, and LAN deployment to meet strict data security and privacy needs.

How quickly can I integrate recording into my application?

Developers can integrate Agora Recording in as little as 30 minutes using RESTful APIs. The service is designed to be easy to embed, test, and deploy, with automatic uploading and backup to ensure recordings are not lost.

What use cases are best suited for Agora Recording?

Agora Recording is ideal for virtual events and webinars, large-scale live streaming, customer service quality assurance, education and online classes, and telehealth consultations—any scenario where capturing, reviewing, or distributing real-time interactions is essential.