OpenAI's WebRTC problem

186

aatgctg 1 day ago 50 commentsRead Article on moq.dev

⚡ Community Insights

Discussion Sentiment

45% Positive

Analyzed from 2570 words in the discussion.

Discussion (50 Comments)Read Original on HackerNews

Sean-Der•about 3 hours ago

Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc is going though.

> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

This is the opposite of the feedback I get. Users want instant responses. If you have delay in generating responses/interruptions it kills the magic. You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)

> TTS is faster than real-time

https://research.nvidia.com/labs/adlr/personaplex/ Voice AI for the latest/aspirational is moving away from what the author describes. It is trickled in/out at 20ms

> We really hope the user’s source IP/port never changes, because we broke that functionality.

That is supported. When new IP for ufrag comes in its supported

> It takes a minimum of 8* round trips (RTT)

That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/

> I’d just stream audio over WebSockets

You lose stuff like AEC. You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with Realtime API + web sockets (lots of code and having to do stuff by hand)

----

I think if I had my choice I would pick Offer/Answer model and then doing QUIC instead of DTLS+SCTP. Maybe do RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers clients) with a much large code footprint.

toast0•28 minutes ago

> . You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily.

WebRTC is complex, even if it's a library (even if it's a library built into the browser they're already using). For a client/server voice interaction, I don't see why you would willingly use it. Ship voice samples over something else; maybe borrow some jitter buffer logic for playback.

My job currently involves voice and video conferencing and 1:1 calls, and WebRTC is so much complexity... it got our product going quickly, but when it does unreasonable things, it's a challenge to fix it; even though we fork it for our clients.

I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.

[1] Turn should allocate a rendesvous id rather than an ephemeral port when the turn client requests an allocation. Then their peer would connect to the turn server on the service port and request a connection to the rendesvous id, without needing the client to know the peer address and add a permission. It would require less communication to get to an end to end relayed connection. Advanced clusters could encode stuff in the id so the client and peer could each contact a turn server local to them and the servers could hook things up; less advanced clusters would need to share the turn server ip and service port(s) with the id.

kixelated•44 minutes ago

HELLO MR SEAN,

1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.

2. I'm obviously not an TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?

3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.

4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again some can be pipelined so the real number is a bit lower. But like the real issue is the mandatory signaling server which causes a double TLS handshake just in case P2P is being used.

5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.

I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.

The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.

pocksuppet•23 minutes ago

Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.

adgjlsfhk1•about 1 hour ago

> You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)

You only need to send ~1 second at a time. There's no reason to send 20ms or 10 min at a time. Both are stupid.

sbrother•about 1 hour ago

> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

I disagree with this SO strongly. I find the conversational voice mode to be a game changer because you can actually have an almost normal conversation with it. I'd be thrilled if they could shave off another 50-100ms of latency, and I might stop using it if they added 200ms. If I want deep research I'll use text and carefully compose my prompt; when I'm out and about I want to have a conversation with the Star Trek computer.

Interestingly I'm involved with a related effort at a different tech company and when I voiced this opinion it was clear that there was plenty of disagreement. This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.

kixelated•14 minutes ago

To clarify, I meant waiting an extra 200ms if the alternative was dropping part of the prompt. During periods of zero congestion, the latency would be the same.

bredren•18 minutes ago

It is very important, the low latency.

I prompt orchestrations most of the day, and am very particular about the fidelity of my context stack.

Yet I’ve used advanced voice mode on ChatGPT via the iOS app a lot. And I have not had a problem with it understanding my requests or my side of the conversation.

I have looked at the dictation of my side and seen it has blatant mistakes, but I think the models have overcome that the same way they do conference audio stt transcripts.

I have had times where the ~sandbox of those conversations and their far more limited ability to build useful corpus of context via web searches or by accessing prior conversation content.

The biggest problem I have had with adv voice was when I accidentally set the personality to some kind of non emotional setting. (The current config seems much more nuanced)

The AI who normally speaks with relative warmth and easy going nature turned into an emotionless and detached entity.

It was unable to explain why it was acting this way. I suspect the low latency did disservice there because when it is paired with something adversarial it was deeply troubling.

jedberg•about 2 hours ago

I have a lot of experience in this area (and some patent applications). For Alexa, the device established a connection back to the server and then kept that open, sending basically HTTP2/SPDY/Something like it over the wire after it detected the wake word. This allowed the STT start processing before you finish talking, so there is only a small delay in processing the last few chunks of your utterance.

The answer came back over the same connection.

In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP2 from the phone and both iOS and Android will pretty much take care of that connection magically.

The author is absolutely right, a real time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms. Especially in the age of mobile phones, where most people are used to their real time human to human communications to have a delay.

(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)

Aeroi•about 1 hour ago

I run the gemini live api over a mesh hosted managed webrtc cloud. works fantastic, and Ive been running it for 2 years. you can try websocket, handle ephemeral keys, ect ect. but when you speak with people running voice agents at scale in this space, many of the issues are solved with webRTC and pipecat and the many resources allocated to solved problems in this space. It certainly feels overkill, and it probably is, but once connection is established, it's pretty magical. the startup time and buffering has been solved for quicker voice connections too, https://github.com/pipecat-ai/pipecat-examples/tree/main/ins... (video is harder)

awkii•about 3 hours ago

This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ice-candidates, offers, peer-to-peer protocols, and the complex handshake that is implemented from scratch each time. I can't imagine re-writing the whole trenchcoat of protocols and unintended "best-practices".

jgalt212•about 3 hours ago

Have you attempted to use the Microsoft Graph API to interact with email?

tempaccount5050•about 2 hours ago

It's way better than the old powershell modules imo. What don't you like?

edoceo•about 3 hours ago

Ugh. Who's decided to Graph all the things.

Sean-Der•about 3 hours ago

What platforms were you targeting that you found it painful! Sorry it was frustrating.

I hope it’s getting better with education/more libraries. It’s also amazing how easy Codex etc… can burn through it now

moomoo11•about 3 hours ago

i like livekit for this reason and their ceo is cool

hnav•10 minutes ago

Exactly what I thought when I read the original article, though to be fair WebTransport is barely now entering the mainstream with Safari shipping support this year.

schappim•26 minutes ago

"WebRTC is the problem" is bait; his real claim is "WebRTC has annoying transport-layer characteristics that hurt cloud Voice AI scaling"...

Having just had to tackle this again for my own startup, I'm reminded about what you would lose by ditching WebRTC - the audio DSP pipeline, transmit side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity etc.

fidotron•about 3 hours ago

> WebRTC is designed to degrade and drop my prompt during poor network conditions

You want real time that's what you are going to deal with. If you don't want real time and instead imagine everything as STT -> Prompt -> TTS then maybe you shouldn't even be sending audio on the wire at all.

kixelated•about 2 hours ago

Hello Mr Author here. Apologies that my comment replies aren't as funny.

Every low-latency application has to decide the user experience trade-off between quality and latency. Congestion causes queuing (aka latency) and to avoid that, something needs to be skipped (lower quality).

The WebRTC latency vs. quality knob is fixed. It's great at minimizing latency, but suffers from a lack of flexibility. We still (try to) use WebRTC anyway, because like you implied, browser support has made it one of the only options.

Until now of course! WebTransport means you can achieve WebRTC-like behavior via a generic protocol. Choose how long you want to wait before dropping/resetting a stream, instead of that decision being made for you.

And yeah my point in the blog is that often the user wants streaming, but not dropping. Obviously you can stream audio input/output without WebRTC. The application should be able to decide when audio packets are lost forever... is it 50ms or 500ms or 5000ms? My argument is that voice AI shouldn't pick the 50ms option.

cowsandmilk•about 3 hours ago

> You want real time

Isn’t the point that OpenAI’s use case does not require realtime?

When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.

Sean-Der•about 2 hours ago

That is not the case. See get-realtime-translate[0 that's doing it as a trickle instead (not turn based).

[0] https://developers.openai.com/api/docs/models/gpt-realtime-t...

telman17•about 3 hours ago

Yep. Maybe there's some additional configuration I'm missing to mitigate the delay but clients don't seem to want to deal with the delay with STT -> Prompt -> TTS. They'll happily suffer occasional quality issues if the conversation feels "real".

DonHopkins•5 minutes ago

>Yep. Maybe there's some [dropped] issues if the conversation feels "real".

Can you repeat that please? It didn't make any sense. This conversation doesn't feel "real".

r2vcap•about 4 hours ago

This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.

Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.

tekacs•about 3 hours ago

You might have noticed that the author started the blog post explaining themselves:

  Like 6 years ago I wrote a WebRTC SFU at Twitch.
  Originally we used Pion (Go) just like OpenAI,
  but forked after benchmarking revealed that it was too slow.
  I ended up rewriting every protocol, because of course I did!

  Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust.
  Because of course I did! You’re probably noticing a trend.

  Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s.
  And some de-facto standards that are technically drafts (ex. TWCC, REMB).
  Not a fun fact when you have to implement them all.

  You should consider me a Certified WebRTC Expert.
  Which is why I never, never want to use WebRTC again.

I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?

sam1r•about 2 hours ago

Yes,agreed. I also found it apparently obvious that they have proven their experts worth on this subject matter. Many times, over and over.

BonoboIO•about 1 hour ago

But ChatGPT said …

charcircuit•about 3 hours ago

QUIC is also a standard.

Waterluvian•about 3 hours ago

“How hard can it be?” the strawman asked.

It’s 2026 and teleconferencing is still such a shit show. There’s billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I’ve never not seen teleconferencing be a ham handed mess.

fragmede•about 2 hours ago

Facetime does alright in the consumer segment.

lpln3452•about 3 hours ago

I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating part. Simply backgrounding the app (and the web version too) and resuming it causes the response or the conversation with an assigned ID to disappear. Haha.

Sean-Der•about 3 hours ago

I believe Gemini is Websockets? I have the same experience with heavy/custom applications that try to roll their own media stuff.

You run into issues around AudioContext and resumption etc... it's a PITA to have to handle all those corner cases :(

Aeroi•about 2 hours ago

there are a lot of extremely smart people that have come back to webRTC time and time again because it continues to solve problems other methods and protocols can't. with saying that, quic is certainly interesting going forward, but i primarily stream voice + vision at 1fps so it just makes sense, and websockets fail and are insecure at scale for this use case (see https://www.daily.co/videosaurus/websockets-and-webrtc/) . also just listen to sean in this thread, dude knows whats up.

fy20•about 2 hours ago

Nice fun article. Gives me Why The Lucky Stiff vibes.

sam1r•about 2 hours ago

>> ... I say hi to <strike> Scarlett Johansson <strike>

Had a nice chuckle.

spongebobstoes•about 3 hours ago

this misses a few key things but hits on many others

webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result

I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat

> and then a GPU pretends to talk to you via text-to-speech

OpenAI is speech-to-speech, there is no TTS in voice mode

> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection

signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further

ultimately though, it comes down to

> It’s not like LLMs are particularly responsive anyway

I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced

to be fair, the new models were released the day after this MoQ blog was published

Terretta•17 minutes ago

> OpenAI is speech-to-speech, there is no TTS in voice mode

Which results in the interesting situation where the transcript isn't what was said:

Q: Why do the voice transcripts sometimes not match the conversation I had?

A: Voice conversations are inherently multimodal, allowing for direct audio exchange between you and the model. As a result, when this audio is transcribed, the transcription might not always align perfectly with the original conversation.

keizo•about 2 hours ago

interesting read albeit over my head, but i spent half of yesterday comparing Gemini Live (websockets) vs gpt-realtime-2 and while gpt is super good, seemingly more robust. Gemini connects faster.

giancarlostoro•about 4 hours ago

Probably because WebTransport is the lesser known alternative to WebRTC.

est•about 3 hours ago

WebTransport requires some speicific server setup.

cldouflare doesn't support WebTransport well.

brcmthrowaway•about 1 hour ago

This is interesting. Does niche knowledge in this area command $1mn salary?

hnav•16 minutes ago

It can, in general knowing how to shuffle packets according to RFCs is a pretty decent gig. Pretty much every hyperscaler ends up building various LBs and the learning curve is too steep to just toss randos at it unsupervised, but at the same time it's not necessarily inventing anything new most of the time.

ec109685•39 minutes ago

> “Here’s a million dollars to implement WebRTC for the fourth time”

“Hell no”

> “Umm…”

Giefo6ah•about 3 hours ago

Yet another victim of IPv4, and you still find countless detractors of IPv6 on every thread where it's mentioned.

spongebobstoes•about 3 hours ago

IPv4 support is necessary, but IPv6 isn't

whattheheckheck•about 3 hours ago

How would ipv6 handle it

tardedmeme•about 3 hours ago

You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.

pocksuppet•19 minutes ago

It's not really relevant in this case since one endpoint is a massive server farm.