Microsoft VibeVoice: Open-Source Frontier Voice AI

143

ttosh • about 3 hours ago • 95 comments • Read Article on github.com

Discussion (95 Comments) • Read Original on HackerNews

maxloh•about 2 hours ago

I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and never revealed.

https://github.com/microsoft/VibeVoice/issues/102

simonw•26 minutes ago

I'm reserving that complaint for "open source" models which are released under non-open-source licenses.

I care that I know what I can DO with the project when I see it described as "open source".

data-ottawa•8 minutes ago

That would be “permissive license”

Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.

It’s simple enough an idea.
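
The cue card could be sketched as a small record type. This is only an illustration: the class name, field names, and `summary` format are made up here, and the example values are placeholders taken from claims made in this thread (MIT license, unreleased training code, a 1.5B variant), not verified facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCueCard:
    """Hypothetical cue card with the five fields proposed above."""
    vendor: str
    name: str
    size: str                  # parameter count as a label, e.g. "1.5B"
    open_weights: bool         # weights can be downloaded
    open_source: bool          # training code (and data) are published
    permissive_license: bool   # e.g. MIT or Apache-2.0

    def summary(self) -> str:
        def mark(flag: bool) -> str:
            return "yes" if flag else "no"

        return (
            f"{self.vendor}/{self.name} ({self.size}): "
            f"open weights={mark(self.open_weights)}, "
            f"open source={mark(self.open_source)}, "
            f"permissive license={mark(self.permissive_license)}"
        )


# Placeholder values based on what commenters say in this thread.
card = ModelCueCard("microsoft", "VibeVoice", "1.5B", True, False, True)
print(card.summary())
```

Keeping "open weights", "open source", and "permissive license" as three separate booleans is the point: it stops one label from standing in for the other two.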

yjftsjthsd-h•20 minutes ago

> I care that I know what I can DO with the project when I see it described as "open source".

Yes, the first of which is that you should be able to build it from source. Which requires the source code, and in this case data.

jcmfernandes•about 1 hour ago

Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.

MarsIronPI•about 1 hour ago

If you're going to apologize to Stallman, you should apologize for conflating open source with software freedom. ;D

jcmfernandes•24 minutes ago

I totally get you, but this is yet another thick layer away.

psychoslave•about 1 hour ago

With free libre software, where freedom and liberty are about what the end user is empowered with actually, the software is mostly metonymic. Free software, free society, because there are free people in the middle of course.

JumpCrisscross•about 2 hours ago

> we should stop calling this type of model open source. They are indeed "open weight”

This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.

andy_ppp•about 2 hours ago

I think you mean GIF.

giancarlostoro•about 2 hours ago

It's the same as GIS, you wouldn't say jizz now would you?

DoctorOW•about 2 hours ago

I absolutely do, every single time it comes up.

ziml77•about 1 hour ago

I hadn't thought about how to pronounce GIS, but do you have a problem with the pronunciation of the Japanese Industrial Standards: JIS?

kevin_thibedeau•about 1 hour ago

The developer of the format declared the pronunciation 30+ years ago. It has always been jif.

dijksterhuis•about 1 hour ago

i am absolutely going to from now on

notabotiswear•about 2 hours ago

I take it that you haven’t met the Arcgees people…

pardon_me•about 1 hour ago

How do you pronounce giraffe?

WarmWash•about 2 hours ago

And "hallucination" which should have been "delusion".

Way early on (spring 2023) people tried to stop it, but no luck.

MagicMoonlight•31 minutes ago

Why would it be delusion? It’s making something up which isn’t there and describing it.

bitvvip•41 minutes ago

What you said makes a lot of sense. Free software should not be confused with open source

scotty79•12 minutes ago

Open weights is not exactly right either because we do get source of the software that uses those open weights.

Maybe open inference?

But we often also get source code for fine-tuning the model.

So maybe it's closer to open source than to anything else?

Isn't it a bit like not calling a game open source because the engine tooling used to make it isn't open source and they didn't publish .psd files with asset designs?

btown•about 1 hour ago

At least it's MIT licensed! As much as non-open training data irks me, restrictive licensing irks me more!

giancarlostoro•about 2 hours ago

I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).

Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.

I do think "Open Weight" should be more commonly used. There's definitely communities that spring up that build the training infrastructure and inference infrastructure around open models on the other hand.

jrm4•about 1 hour ago

I'm genuinely torn on this one; I get technically why not, but why I think I have no problem with it is the wishy-washiness of "open source" generally.

As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then after that, go deeper and talk about why Stallman was right, how "Free Software" was first, etc.

notabotiswear•about 2 hours ago

Openwashing is the new greenwashing, which, coincidentally, seems to have gone out of fashion a few hundred datacentres ago.

dist-epoch•about 1 hour ago

it was replaced with abundancewashing

Geezus_42•about 1 hour ago

What is "abundancewashing"?

steinvakt2•about 2 hours ago

This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow in inference. It's also bad in multilingual.

Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

scotty79•9 minutes ago

You just saved me an afternoon.

gagan2020•15 minutes ago

It is not good for text to speech (TTS) either. I have been trying it for a few days. First of all, documentation for the 1.5B model is not there. The 0.5B realtime model is shit. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".

I'm really disappointed with this model, to say the least.

lblock•about 2 hours ago

Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too

xnx•34 minutes ago

Simonw (who has a bit of a Midas touch for posts here) just posted about it https://simonwillison.net/2026/Apr/27/vibevoice/

GuinansEyebrows•15 minutes ago

there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.

i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.

[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...

ramon156•about 2 hours ago

well duh, they updated the news section

https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...

which is microsoft for "we removed two dead links". AI innovation knows no limits!

Vinnl•about 1 hour ago

Interestingly that seems to be in response to [1], which might indeed be the trigger for this.

[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...

SecretDreams•about 2 hours ago

I think this was all covered when they said it was released by Microsoft?

NobleLie•about 1 hour ago

The nuance is lost on LLM agentic dominant partakers.

aqme28•about 2 hours ago

Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.

accrual•about 2 hours ago

Especially when "vibe coded" can have a negative connotation meaning quickly put together without understanding.

ryandrake•4 minutes ago

In my mind, Vibe-anything means "some slop carelessly thrown together to ship as fast as possible." Wild that it's being used in a serious product name!

Barbing•about 1 hour ago

I’m just surprised they put the name of the e-waste slop company in their product

altmanaltman•about 2 hours ago

Which makes it even more weird they get offended when people use Microslop. They are the ones leaning into the marketing

Vinnl•about 1 hour ago

"get offended" is just what the clickbait news cycle made of it. It was based on the post at [1], and this is all it said:

> We need to get beyond the arguments of slop vs sophistication and develop a new equilibrium in terms of our “theory of the mind” that accounts for humans being equipped with these new cognitive amplifier tools as we relate to each other

[1] https://snscratchpad.com/posts/looking-ahead-2026/

altmanaltman•32 minutes ago

When a CEO says "We need to get beyond the arguments of X" it is universally a polite, PR-scrubbed way of saying, "Please stop talking about X, it is hurting our business" which is how the media interpreted it.

xnx•31 minutes ago

Still waiting for the open weights model that conclusively beats the multi-year old Whisper in accuracy, features, and performance.

scotty79•7 minutes ago

It's crazy that a lot is happening in open models for stt, but there's very little progress when it comes to results, esp multilingual.

Void_•about 2 hours ago

In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com):

- Cohere Transcribe (self hosted)

- Grok Speech To Text (they provide an API, only $0.10/hr!)

They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

olejorgenb•about 2 hours ago

I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)

Barbing•about 1 hour ago

Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?

2ndorderthought•about 2 hours ago

Have you tried qwen?

SecretDreams•about 2 hours ago

Any non-Musk alternatives that are comparable in quality and cost?

jayphen•about 1 hour ago

Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
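
To put these per-minute prices on the same footing as the $0.10/hr figure quoted upthread, a quick conversion sketch follows. The prices are exactly as quoted by commenters in this thread and are not verified against any vendor's price list; the helper name `per_hour` is made up for illustration.

```python
from decimal import Decimal

MINUTES_PER_HOUR = 60


def per_hour(per_minute: Decimal) -> Decimal:
    """Convert a per-minute price to a per-hour price."""
    return per_minute * MINUTES_PER_HOUR


# Prices as quoted in this thread (assumed, not checked against vendors).
quoted = {
    "Voxtral": per_hour(Decimal("0.003")),
    "Speechmatics": per_hour(Decimal("0.004")),
    "Grok Speech To Text": Decimal("0.10"),
}

# Print cheapest first.
for name, price in sorted(quoted.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${price:.2f}/hr")
```

On the quoted numbers, Voxtral works out to $0.18/hr and Speechmatics to $0.24/hr, so both sit above the $0.10/hr figure. `Decimal` is used instead of floats so the small prices multiply exactly.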

Void_•about 2 hours ago

Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.

embedding-shape•about 2 hours ago

Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?

542458•about 2 hours ago

Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it other places), but the STT/ASR, long form TTS, and streaming TTS models are newer.

infecto•about 2 hours ago

It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.

Barbing•about 1 hour ago

[off topic]

When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls

People will also post their own interpretations in response to comments, and quickly find out they missed something.

… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.

[on topic]

(OK I’m done making excuses, time to read the article… thanks for the encouragement!)

I thought this was not explained in the readme directly but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:

“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”

CubsFan1060•about 3 hours ago

Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/

542458•about 2 hours ago

Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.

JumpCrisscross•about 2 hours ago

“VibeVoice can only handle up to an hour of audio”

Why?

Anonyneko•about 2 hours ago

You have selected Microsoft Sam as the computer's default voice.

accrual•about 2 hours ago

My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.

ryukoposting•about 2 hours ago

Holy moly, a Microsoft AI product that isn't named Copilot!

DoctorOW•about 2 hours ago

Missed opportunity to call it Vopilot

silverwind•25 minutes ago

Slopilot

frangonf•about 1 hour ago

I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice now has this feature.

My conclusion back then (which only came from shallow research on the topic and 0 real experience, mind you) was that Whisper + Pyannote was the "stable" approach.

Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?

podgietaru•about 3 hours ago

So we've really just settled on Vibe as the verb for AI then?

giarc•about 2 hours ago

I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?

internet_points•about 2 hours ago

it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse

pryanshu89•about 2 hours ago

Why use precise technical language when you can just vibe with your AI system?

JumpCrisscross•about 2 hours ago

What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?

yreg•about 1 hour ago

Locally maybe https://voicebox.sh/

Elevenlabs in the cloud.

chrsw•about 2 hours ago

Local? No idea. Cloud? Eleven Labs, probably. But it's described as "cloning" not "training". Not sure what the distinction is or why it matters if the end result is you can generate any TTS that sounds like you. There might very well be an important one, I just don't know it.

khimaros•about 1 hour ago

open weights i would say S2: https://github.com/rodrigomatta/s2.cpp

pluc•about 2 hours ago

Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243

Mobius01•about 1 hour ago

Microsoft has historically made poor choices in product naming, but this has to be a new low.

khimaros•about 1 hour ago

looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested

BlastBash192•about 2 hours ago

Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.

chaosprint•about 1 hour ago

Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:

https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...

Zopieux•26 minutes ago

English only?

mistic92•about 2 hours ago

For me it's giving very poor results

ChrisArchitect•about 1 hour ago

Previously:

Sept 2025 https://news.ycombinator.com/item?id=45114245

simonw•22 minutes ago

That was about the text-to-speech model, the speech-to-text one was released in January.

starkeeper•about 1 hour ago

Microsoft is famous for choosing terrible names but how could they be this terrible.

walthamstow•about 2 hours ago

Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?

The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck