
How to Set Up OpenClaw Discord Voice Realtime Mode (2026.5.10)

A practical guide to configuring OpenClaw's new Discord voice realtime modes: agent-proxy, STT/TTS, and bidi realtime with barge-in control.

Filed under Guides | 4 min read | Updated Jun 28, 2026

By Cody, May 10, 2026

OpenClaw 2026.5.10 ships a complete rework of Discord voice. If you run an OpenClaw bot in a Discord voice channel, your setup is now both more capable and more configurable. This is a practical walkthrough of the three voice modes, when to use each, and how to tune barge-in for your room.

Prerequisites

OpenClaw gateway at v2026.5.10 or newer
Node.js 22.16 or higher (required by this release)
A Discord bot with Connect, Speak, and Read Message History permissions in your target voice channel
An OpenAI API key with Realtime API access (for realtime-talk-buffer and bidi-realtime modes)
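To confirm the Node.js prerequisite from a script, a small check like the one below can parse the output of `node --version`, which prints a string such as `v22.16.0`. This is a hypothetical Python helper, not part of OpenClaw; the function names are illustrative.

```python
import re
import subprocess

# Minimum Node.js version (major, minor) stated in the prerequisites.
MIN_NODE = (22, 16)


def parse_node_version(text: str) -> tuple[int, int, int]:
    """Parse `node --version` output like 'v22.16.0' into (22, 16, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized Node.js version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def node_meets_minimum(text: str, minimum: tuple[int, int] = MIN_NODE) -> bool:
    """True if the parsed (major, minor) is at or above the minimum."""
    major, minor, _ = parse_node_version(text)
    return (major, minor) >= minimum


def installed_node_ok() -> bool:
    """Run the local `node --version` and compare it to the minimum."""
    result = subprocess.run(
        ["node", "--version"], capture_output=True, text=True, check=True
    )
    return node_meets_minimum(result.stdout)
```

Calling `installed_node_ok()` on the gateway host returns `False` for anything older than 22.16, such as `v22.15.1` or `v20.19.0`.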

Before you start, run a channel capability audit to catch permission issues early:

openclaw channels capabilities --probe discord

Any missing voice-channel permissions will surface here before you try to join.

Understanding the Three Voice Modes

OpenClaw 2026.5.10 introduces three distinct /vc modes, and agent-proxy (the realtime talk-buffer mode) is now the default. Pick a mode based on your use case:

| Mode | Best for | Latency | Requires |
|---|---|---|---|
| `stt-tts` | Simple Q&A bots, server bots | Low | Any TTS/STT provider |
| `realtime-talk-buffer` | Conversational agents with memory | Medium | OpenAI Realtime API |
| `bidi-realtime` | Full agentic sessions, tool use | Higher | OpenAI Realtime API + `openclaw_agent_consult` |
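The table's trade-offs can be read as a simple decision rule. The sketch below is a hypothetical helper, not an OpenClaw API; it returns the `mode` strings used in this guide's config examples and assumes that tool-heavy voice work is the only reason to pay for bidi-realtime's extra consult call.

```python
def choose_voice_mode(has_realtime_access: bool, tools_central_to_voice: bool) -> str:
    """Pick a /vc mode string following the trade-offs in the mode table."""
    if not has_realtime_access:
        # Both realtime modes need OpenAI Realtime API access.
        return "stt-tts"
    if tools_central_to_voice:
        # Costs an extra agent consult per turn: more credits, more latency.
        return "bidi-realtime"
    # Default realtime talk-buffer mode, configured as "agent-proxy".
    return "agent-proxy"
```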

Configuring Voice in gateway.json

Add a voice block to your Discord channel config:

{
  "channels": {
    "discord": {
      "voice": {
        "mode": "agent-proxy",
        "autoJoin": ["YOUR_VOICE_CHANNEL_ID"],
        "captureSilenceGraceMs": 2500,
        "interruptResponseOnInputAudio": true,
        "minBargeInAudioEndMs": 0,
        "realtime": {
          "model": "gpt-4o-realtime-preview",
          "voice": "alloy"
        }
      }
    }
  }
}
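Before restarting the gateway, a quick sanity check on the voice block can catch typos in field names or values. This Python sketch is a hypothetical helper, not OpenClaw's own schema validation: the accepted mode strings and field types are only those that appear in this guide, so treat an "unrecognized mode" result as a prompt to check the official docs rather than a hard error.

```python
import json

# Mode values used in this guide's examples (assumption: not an official schema).
KNOWN_MODES = {"stt-tts", "agent-proxy", "bidi-realtime"}
# Millisecond fields described in this guide.
MS_FIELDS = ("captureSilenceGraceMs", "minBargeInAudioEndMs")


def check_voice_config(raw: str) -> list[str]:
    """Return problems found in channels.discord.voice of a gateway.json string."""
    config = json.loads(raw)
    voice = config.get("channels", {}).get("discord", {}).get("voice")
    if not isinstance(voice, dict):
        return ["missing channels.discord.voice block"]

    problems: list[str] = []
    mode = voice.get("mode")
    if mode is not None and mode not in KNOWN_MODES:
        problems.append(f"unrecognized mode: {mode!r}")

    for field in MS_FIELDS:
        value = voice.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        bad_type = isinstance(value, bool) or not isinstance(value, int)
        if value is not None and (bad_type or value < 0):
            problems.append(f"{field} should be a non-negative integer, got {value!r}")

    flag = voice.get("interruptResponseOnInputAudio")
    if flag is not None and not isinstance(flag, bool):
        problems.append("interruptResponseOnInputAudio should be true or false")

    auto_join = voice.get("autoJoin", [])
    if not isinstance(auto_join, list) or not all(isinstance(c, str) for c in auto_join):
        problems.append("autoJoin should be a list of channel ID strings")
    return problems
```

Pass it the contents of your gateway.json and print the result; an empty list means nothing obviously wrong was found in the fields this guide covers.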

Key fields explained

captureSilenceGraceMs (default: 2500 in 2026.5.10) — how long OpenClaw waits after a speaker goes quiet before treating the utterance as complete. Increase this in noisy rooms or if users are getting cut off mid-sentence.

interruptResponseOnInputAudio — when true (default), any speech-start event from the server VAD will interrupt active playback. Set to false in echo-heavy rooms where your speaker output is triggering false barge-in.

minBargeInAudioEndMs — minimum milliseconds of audio silence before allowing a barge-in interruption. Set to a higher value (e.g. 300) in rooms with significant echo or reverb.
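To build intuition for captureSilenceGraceMs, here is a simplified model of silence-based endpointing as described above: pauses shorter than the grace period stay part of the same utterance, and the turn ends only after a full grace period of silence. It is an illustrative sketch, not OpenClaw's implementation; `utterance_end_ms` and the span representation are hypothetical.

```python
def utterance_end_ms(speech_spans: list[tuple[int, int]], grace_ms: int = 2500) -> int:
    """Return the time (ms) at which the utterance is treated as complete.

    speech_spans are (start_ms, end_ms) intervals of detected speech, sorted
    by start time. A pause shorter than grace_ms does not end the turn; the
    first pause that lasts at least grace_ms does.
    """
    if not speech_spans:
        raise ValueError("no speech detected")
    _, last_end = speech_spans[0]
    for start, end in speech_spans[1:]:
        if start - last_end >= grace_ms:
            break  # silence already lasted the full grace period
        last_end = max(last_end, end)
    return last_end + grace_ms


spans = [(0, 1200), (2000, 3500)]  # speaker pauses for 800 ms mid-sentence
print(utterance_end_ms(spans))                # 6000: one utterance
print(utterance_end_ms(spans, grace_ms=500))  # 1700: cut off after the first span
```

With the 2500 ms default the 800 ms pause is absorbed into one utterance; with a 500 ms grace the same pause ends the turn early, which is the "cut off mid-sentence" symptom that a larger value addresses.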

Mode 1: STT/TTS (Explicit Fallback)

This is the classic mode: speech in, text processed by your agent, speech back out. It works with any TTS and STT provider you have configured.

{
  "voice": {
    "mode": "stt-tts"
  }
}

Use this when you do not have Realtime API access, or when you need the lowest possible latency and do not need the agent to maintain voice-aware state between turns.
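The stt-tts flow is a three-stage pipeline. The sketch below shows its shape with hypothetical `SpeechToText` and `TextToSpeech` protocols and toy in-memory providers; it does not reflect OpenClaw's actual provider interfaces. Each call is independent, mirroring the point above that this mode does not carry voice-aware state between turns.

```python
from typing import Callable, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def stt_tts_turn(
    audio: bytes, stt: SpeechToText, agent: Callable[[str], str], tts: TextToSpeech
) -> bytes:
    """One stateless voice turn: speech in, agent text, speech out."""
    transcript = stt.transcribe(audio)
    reply = agent(transcript)
    return tts.synthesize(reply)


# Toy providers for illustration: "audio" is just UTF-8 text here.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def toy_agent(text: str) -> str:
    return f"You said: {text}"


reply_audio = stt_tts_turn(b"hello", EchoSTT(), toy_agent, BytesTTS())
```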

Mode 2: Realtime Talk Buffer (Default Agent-Proxy)

This is the new default. OpenClaw acts as the microphone-and-speaker extension of the routed agent session, so your agent's full memory, tools, and skills are available in voice turns. The user is speaking to the same agent they interact with in text.

{
  "voice": {
    "mode": "agent-proxy",
    "realtime": {
      "model": "gpt-4o-realtime-preview",
      "voice": "shimmer",
      "instructions": "Keep responses concise and avoid reading out URLs or code."
    }
  }
}

The new talk.realtime.instructions config field (added in 2026.5.10) lets you append voice-specific guidance without touching your main agent system prompt, which makes it the place to adjust response style for spoken delivery.
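One way to picture "append without touching the system prompt" is below. This is a hypothetical model: the guide does not say how OpenClaw joins the two strings, so the blank-line separator and the ordering here are assumptions.

```python
from typing import Optional


def compose_voice_instructions(agent_prompt: str, voice_instructions: Optional[str]) -> str:
    """Reuse the agent prompt verbatim and append voice-only guidance after it.

    Assumption: guidance goes after the prompt, separated by a blank line.
    """
    if not voice_instructions or not voice_instructions.strip():
        return agent_prompt  # nothing to add; the agent prompt is unchanged
    return f"{agent_prompt.rstrip()}\n\n{voice_instructions.strip()}"
```

The design point is that the agent prompt stays the single source of truth for text and voice, while spoken-delivery tweaks live in a separate field.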

Mode 3: Bidi Realtime with Agent Consult

This is the most powerful option, and the most resource-intensive. It runs a full bidirectional realtime session using openclaw_agent_consult: the realtime model consults your full OpenClaw agent, which can use any configured tool, before speaking.

{
  "voice": {
    "mode": "bidi-realtime"
  }
}

In this mode, the realtime voice model acts as the voice interface and defers tool use, memory lookups, and multi-step reasoning to the agent brain. The voice model stays quiet while the agent is working, and queues answers for playback once the agent finishes.

Note: this mode requires more API credits (two model calls per voice turn) and adds latency. Reserve it for agents where tool use is central to the voice experience.
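The consult-then-speak behavior can be sketched with an asyncio queue: the voice front end awaits the agent on each turn, enqueues the answer only once the agent returns, and a playback task speaks queued answers in order. `fake_agent_consult` and `bidi_session` are hypothetical stand-ins; the real openclaw_agent_consult flow runs inside OpenClaw and the OpenAI Realtime session.

```python
import asyncio
from typing import Awaitable, Callable

Consult = Callable[[str], Awaitable[str]]


async def fake_agent_consult(question: str) -> str:
    """Hypothetical stand-in for the full agent (tools, memory, reasoning)."""
    await asyncio.sleep(0.01)  # simulate tool calls and multi-step work
    return f"answer: {question}"


async def bidi_session(questions: list[str], consult: Consult) -> list[str]:
    """Voice front end that defers every turn to the agent and queues replies."""
    playback: asyncio.Queue = asyncio.Queue()
    spoken: list[str] = []

    async def speaker() -> None:
        # Stays silent while the queue is empty; plays answers in arrival order.
        for _ in questions:
            spoken.append(await playback.get())

    async def front_end() -> None:
        for question in questions:
            answer = await consult(question)  # the voice side waits on the agent here
            await playback.put(answer)        # queued only after the agent finishes

    await asyncio.gather(speaker(), front_end())
    return spoken


print(asyncio.run(bidi_session(["weather?", "open PRs?"], fake_agent_consult)))
# prints ['answer: weather?', 'answer: open PRs?']
```

Each turn here costs one front-end step plus one agent consult, which is the doubling of model calls the note above warns about.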

Joining a Voice Channel

Once configured, use the /vc join command in Discord:

/vc join

Or configure autoJoin in your gateway config (shown above) to have the bot join automatically on gateway startup.
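If you manage gateway.json from a script, the helper below adds a channel to autoJoin without disturbing the rest of the file. It is a hypothetical Python sketch that assumes the file layout shown earlier in this guide. Channel IDs are kept as JSON strings because Discord snowflakes are 64-bit integers, larger than JavaScript numbers can represent exactly.

```python
import json
from pathlib import Path


def add_auto_join(path: Path, channel_id: str) -> dict:
    """Add a voice channel ID to channels.discord.voice.autoJoin in a JSON config file."""
    if not channel_id.isdigit():
        # Discord channel IDs are numeric snowflakes, stored here as strings.
        raise ValueError(f"expected a numeric Discord channel ID, got {channel_id!r}")
    config = json.loads(path.read_text()) if path.exists() else {}
    voice = config.setdefault("channels", {}).setdefault("discord", {}).setdefault("voice", {})
    auto_join = voice.setdefault("autoJoin", [])
    if channel_id not in auto_join:  # idempotent: re-running does not duplicate
        auto_join.append(channel_id)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```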

Auditing Voice Permissions First

The 2026.5.10 release adds voice permission auditing to channels status --probe:

openclaw channels status --probe discord

This will surface missing Connect, Speak, or Read Message History permissions for any configured autoJoin targets before you try to join live.
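If you want to double-check a permission set by hand, for example the bot's computed permissions for the voice channel, the bit flags for these three permissions are fixed in the Discord API: Connect is `1 << 20`, Speak is `1 << 21`, and Read Message History is `1 << 16`. The helper below is a hypothetical sketch, not part of OpenClaw's audit; it treats Administrator (`1 << 3`) as granting everything, which matches Discord's semantics.

```python
from typing import Union

# Discord permission bit flags (Discord API permissions reference).
ADMINISTRATOR = 1 << 3
READ_MESSAGE_HISTORY = 1 << 16
CONNECT = 1 << 20
SPEAK = 1 << 21

REQUIRED = {
    "Connect": CONNECT,
    "Speak": SPEAK,
    "Read Message History": READ_MESSAGE_HISTORY,
}
REQUIRED_MASK = CONNECT | SPEAK | READ_MESSAGE_HISTORY


def missing_voice_permissions(permissions: Union[int, str]) -> list[str]:
    """Return required permission names absent from a Discord permissions bitfield.

    Discord serializes permission bitfields as strings, so accept either form.
    """
    bits = int(permissions)
    if bits & ADMINISTRATOR:
        return []  # Administrator implies every permission
    return [name for name, flag in REQUIRED.items() if not bits & flag]


print(REQUIRED_MASK)  # 3211264
```

The same mask, 3211264, is what you would pass as the `permissions` value when generating the bot's OAuth2 invite URL.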

Tuning for Echo-Heavy Rooms

If you are running OpenClaw in a room where the speaker output echoes back into the microphone (common with open speakers), use these settings:

{
  "voice": {
    "interruptResponseOnInputAudio": false,
    "minBargeInAudioEndMs": 500,
    "captureSilenceGraceMs": 3000
  }
}

Disabling interruptResponseOnInputAudio prevents the echo from triggering a false barge-in. Raising minBargeInAudioEndMs adds a gate so only sustained human speech can interrupt playback.
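To keep a room-specific profile separate from your base settings, you can layer the echo-heavy overrides on top of the voice block before writing it out. The deep-merge helper below is a generic, hypothetical Python sketch; this guide does not describe config profiles in OpenClaw itself, so the merged result is meant to be written back into gateway.json.

```python
import copy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied recursively (overrides win)."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Echo-heavy room settings from this section.
ECHO_HEAVY = {
    "interruptResponseOnInputAudio": False,
    "minBargeInAudioEndMs": 500,
    "captureSilenceGraceMs": 3000,
}

# Base voice block from the gateway.json example earlier in this guide.
base_voice = {
    "mode": "agent-proxy",
    "autoJoin": ["YOUR_VOICE_CHANNEL_ID"],
    "captureSilenceGraceMs": 2500,
    "interruptResponseOnInputAudio": True,
    "minBargeInAudioEndMs": 0,
    "realtime": {"model": "gpt-4o-realtime-preview", "voice": "alloy"},
}

tuned = deep_merge(base_voice, ECHO_HEAVY)  # base_voice itself is left unchanged
```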

The Discord voice rework in 2026.5.10 is one of the most significant feature additions to the platform in recent months. The full changelog is available at github.com/openclaw/openclaw/releases/tag/v2026.5.10-beta.2. For feedback and questions, the #voice channel on the OpenClaw Discord is the right place.
