Discussion (95 Comments) | Read Original on HackerNews
https://github.com/microsoft/VibeVoice/issues/102
I care that I know what I can DO with the project when I see it described as "open source".
Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.
It’s a simple enough idea.
Yes, the first of which is that you should be able to build it from source, which requires the source code and, in this case, the data.
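A rough sketch of what such a cue card could look like as a data structure. The class name `ModelCueCard`, the `summary` format, and the example values are all made up for illustration; they do not describe any real model or existing standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCueCard:
    """Hypothetical cue card with the fields suggested in the comment above."""
    vendor: str
    name: str
    size: str                 # free-form, e.g. a parameter count
    open_weights: bool        # weights can be downloaded
    open_source: bool         # training code and data are published
    permissive_license: bool  # license allows broad reuse

    def summary(self) -> str:
        """Render the card as a single line of text."""
        def mark(flag: bool) -> str:
            return "yes" if flag else "no"

        return (
            f"{self.vendor}/{self.name} ({self.size}): "
            f"open weights={mark(self.open_weights)}, "
            f"open source={mark(self.open_source)}, "
            f"permissive license={mark(self.permissive_license)}"
        )


# Placeholder values only, not a claim about any actual release.
card = ModelCueCard(
    vendor="example-vendor",
    name="example-model",
    size="1.5B",
    open_weights=True,
    open_source=False,
    permissive_license=True,
)
print(card.summary())
```

Keeping "open weights" and "open source" as separate flags is the whole point: a model can be one without being the other.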
This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.
Way early on (spring 2023) people tried to stop it, but no luck.
Maybe open inference?
But we often also get source code for fine-tuning the model.
So maybe it's closer to open source than to anything else?
Isn't it a bit like not calling a game open source because the engine tooling used to make it isn't open source and they didn't publish the .psd files with the asset designs?
Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.
I do think "Open Weight" should be more commonly used. On the other hand, there are definitely communities that spring up and build the training and inference infrastructure around open models.
As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then, after that, go deeper and talk about why Stallman was right, how "Free Software" came first, etc.
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
I'm really disappointed with this model, to say the least.
i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.
[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...
https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...
which is microsoft for "we removed two dead links". AI innovation knows no limits!
[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...
> We need to get beyond the arguments of slop vs sophistication and develop a new equilibrium in terms of our “theory of the mind” that accounts for humans being equipped with these new cognitive amplifier tools as we relate to each other
[1] https://snscratchpad.com/posts/looking-ahead-2026/
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?
When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls
People will also post their own interpretations in response to comments, and quickly find out they missed something.
… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.
[on topic]
(OK I’m done making excuses, time to read the article… thanks for the encouragement!)
I thought this was not explained in the readme directly, but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:
“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”
Why?
My conclusion back then (which came only from shallow research on the topic and 0 real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?
Elevenlabs in the cloud.
https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...
Sept 2025 https://news.ycombinator.com/item?id=45114245
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck