Gemini Omni
Discussion (139 Comments)
As such I always use this prompt as a test: "A video of a jenga brick tower falling over as a brick is removed. The physics of each brick must be realistic."
It gave me a video where bricks suddenly disappear or morph into others[1]. The linked video is after 2-3 iterations of me insisting on realistic physics. If you were just glancing at it, you would believe it is realistic.
That said, this is still very impressive and one more step towards .. IDK what. But I am a bit reassured that at least my job won't be fully replaced with AI :)
[1] https://streamable.com/2em1r3
I honestly can't comment with certainty that training from videos alone and whatever tokenization scheme they're using will ever get perfect dynamics.
However it is worth noting that transformers can do a pretty good job at learning dynamics with the right pipeline (not video): https://arxiv.org/pdf/2605.15305 https://arxiv.org/pdf/2605.09196
My point here being that representationally, it might be possible to learn good dynamics without a radically different approach/arch. There are already models that extract 3D tracking points from videos, so they could possibly be leveraged for learning dynamics (which on its own gives precedent for end-to-end approaches also possibly working).
* You could instruct your LLM to interact with a simulator to run experiments and infer behaviour
* You could edit the transformer model and inject spatially relevant data rather than text, as is done in the papers above
* You could change the architecture to be more conducive to representing a world state, e.g. LeCun's JEPA world model.
* You could further enhance some of the above by using a differentiable physics engine (e.g. NVIDIA Newton) to calculate losses directly.
But at the end of the day if a model has any hope to always produce realistic physics, it HAS to learn the laws of nature in some form or other. It looks to me that the next big leap could be achieved by combining the last two approaches.
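To make the "calculate losses from physics" bullet concrete, here is a minimal sketch. It is a toy stand-in, not NVIDIA Newton or any real engine: the name `physics_residual_loss`, the 1D free-fall setup, and the finite-difference scheme are all assumptions made for illustration. It scores a tracked height sequence by how far its implied acceleration deviates from gravity.

```python
def physics_residual_loss(heights, dt, g=9.81):
    """Mean squared violation of free-fall dynamics for a 1D height track.

    heights: object heights in metres, sampled every dt seconds
             (hypothetical output of a tracker run on generated video).
    """
    if len(heights) < 3:
        raise ValueError("need at least 3 samples to estimate acceleration")
    residuals = []
    for i in range(1, len(heights) - 1):
        # Central finite difference approximates vertical acceleration.
        accel = (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt ** 2
        # In free fall the acceleration should equal -g, so accel + g is the error.
        residuals.append((accel + g) ** 2)
    return sum(residuals) / len(residuals)
```

A brick that hangs motionless in mid-air scores about g squared (roughly 96.2), while a true free-fall track scores near zero. A differentiable engine would let this kind of error be backpropagated through the simulation itself rather than estimated by finite differences.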
P.S.: I like discussing such topics. If anyone knows a forum or discord with like-minded people, please let me know :)
Unironically twitter (and only use the "Following" tab as opposed to the "For You")
Make an account that only follows university affiliated researchers with less than 1000 followers. In my experience discord servers get suffocated by beginners and crackpots because conversations don't naturally self-organize into their own threads.
I’ve often thought it would be very handy to have a proper simulator for being able to simulate and identify inefficiencies in one’s technique, but no idea whether it would be feasible to do.
Proper simulators for those exist; you essentially need an engine with a compliant contact model. MuJoCo is the go-to here, see:
https://mujoco.readthedocs.io/en/stable/modeling.html#muscle... https://mujoco.readthedocs.io/en/stable/computation/fluid.ht...
These explicitly model biological muscles. IIRC it was originally created to model human hands (I could be misremembering though).
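For anyone wondering what "compliant contact" means in practice, here is a minimal sketch. It is not MuJoCo's contact solver (that uses its own soft-constraint formulation); it is the simplest textbook variant, a spring-damper penalty force on a 1D point mass. The function name `simulate_drop` and every parameter value are made up for illustration.

```python
def simulate_drop(height=1.0, mass=1.0, stiffness=5000.0, damping=50.0,
                  dt=1e-4, duration=3.0, g=9.81):
    """Drop a point mass onto the ground plane z = 0 with penalty contact.

    Returns (final_height, final_velocity, deepest_penetration).
    """
    z, v = height, 0.0
    deepest = 0.0
    for _ in range(round(duration / dt)):
        force = -mass * g
        if z < 0.0:
            # Compliant contact: the ground pushes back like a damped spring.
            contact = -stiffness * z - damping * v
            # Clamp at zero so the ground can push but never pull (no adhesion).
            force += max(0.0, contact)
        # Semi-implicit Euler: update velocity first, then position.
        v += (force / mass) * dt
        z += v * dt
        deepest = min(deepest, z)
    return z, v, deepest
```

With these parameters the mass bounces a few times and comes to rest slightly below z = 0, at a depth of about mass * g / stiffness. That small resting penetration is the signature of a compliant contact; stiffer springs shrink it but force smaller time steps.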
Really depends on the fidelity you want.
Edit: I also work in rigid body simulation for robotics.
Robotics folks probably want speed and accuracy. I'm from the video game industry so I generally look for speed and stability.
Note: This is a loose analogy and recent techniques are already blurring the lines between these axes.
We were sharing game clips with each other and after a while realised our old clips were just gone, being deleted after 30 or 90 days or something.
Considering just how pretty and detailed this whole thing looks, that imo points at a fundamental issue with how these things are trained. It's as if there's no structure to its knowledge and training. An artist learning to draw would first try to understand simple 2D composition, then perspective, then light and shadow, mastering each concept and gradually building up a hierarchical understanding. The model seems like it's trying to learn everything at once.
I would rather see an AI model that I could give a floorplan of a building and it would generate an accurate flythrough on any path, even if it looked like butt.
I'm not just talking out of my arse. I worked for a while in data science/engineering, and one of the big lessons people needed to be reminded of is to clean and downsample the data. A dataset of a million samples could very well take 1000x as long to process as the same data downsampled to a couple of thousand samples, and we could reach the same conclusions with a fraction of the time and effort.
I'm sure there's a similar logic in RL, that if you dump a trillion samples into the datacenter that consumes the same power as a city, what the model learns is what it could've learned with a much more curated training set and directed approaches.
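A quick sketch of the downsampling point, under stated assumptions: the data here is synthetic Gaussian noise, the helper `compare_means` is hypothetical, and the "conclusion" being compared is just the mean. It only shows that a simple aggregate survives heavy random downsampling; it says nothing about whether the same holds for training a video model.

```python
import random
import statistics


def compare_means(n_full=1_000_000, n_subset=2_000, seed=0):
    """Return (mean of the full dataset, mean of a random subset)."""
    rng = random.Random(seed)
    # Synthetic stand-in for a real dataset: mean 50, standard deviation 10.
    full = [rng.gauss(50.0, 10.0) for _ in range(n_full)]
    # Uniform random downsample without replacement.
    subset = rng.sample(full, n_subset)
    return statistics.fmean(full), statistics.fmean(subset)


if __name__ == "__main__":
    full_mean, subset_mean = compare_means()
    print(f"full: {full_mean:.3f}  subset: {subset_mean:.3f}")
```

The subset mean carries a standard error of roughly 10 / sqrt(2000), about 0.22, so the two numbers typically agree to within a fraction of a unit while the subset is 500x smaller.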
The other problem is Seedance is heavily censored because of copyright concerns.
There's got to be a reason this is phrased so insanely, right?
> Prompt: A skeuomorphism stop motion explainer about how the brain hippocampus works with a compelling voiceover. Don’t add seahorses. No voice cuts at the end. Don’t add text
Seahorses???
Oh god...
Video, more than anything else, is the place where I really care if something is AI or not. If I could get a TikTok that had no AI usage -- I'd be in. Which is weird for me, because I'm typically the guy who is all-in on AI.
Funny enough, this is actually one of the few things which has bothered me with the AI boom, and I'm mostly pro-acceleration. A lot of what's happening seems inevitable. But surprisingly, knowing that cat or dog or bird or lizard or butterfly or whatever has a strong chance of being generated really does take something out of it to my mind. And I say that also knowing the extreme amount of staging which has long gone on with traditional nature videography. Somehow, knowing the animal is real means something... I'm still trying to figure out how to better understand and express this.
Now you can have people producing videos without needing a crew of people.
Why are you assuming that a majority of people don't already have the means to make videos? Many people have access to a phone, laptop, and stable internet connection. What else do they really need? What's stopping them from using their phones to shoot home movies, making animations with MS Paint, recording themselves talking about a subject they're genuinely interested in, etc.?
>Now you can have people producing videos without needing a crew of people.
This is conflating production values with creativity. Mr. Beast's videos cost millions of dollars to film and produce, yet they're creatively bankrupt.
I eventually picked one and opened the comments and the top comment was something like "This is obviously an AI video. Who watches this?" and the reply was along the lines of "me because I like seeing thieves get what's coming to them".
So you, like me, aren't interested in AI videos but I think there's a lot of people who don't care if it's real or not.
Thankfully, YouTube eventually stopped showing those to me. Now it thinks I'm interested in road rage videos. My YouTube feed outside of the three or four channels I've subscribed to is terrible.
I really wish a subject matter expert would pitch in to tell us what this is about?
like a totally made up thing that is fake, somehow gives a sense of justice and satisfaction?
is it something about imagining it happening in reality, or what?
for me, if I see that something is AI, it's like I just feel nothing. because there's nothing in it, it has nothing of real value? like it doesn't evoke anything in me, it doesn't make me think "this was a great find!" or make me want to send a link over to my friends, etc.
Where is this amazing stuff? Social media is a marketplace of ideas supposedly, so why haven't we seen a new wave of creators rise up in popularity?
model card: https://deepmind.google/models/model-cards/gemini-omni-flash...
I did not create any videos yet.
Google, building great AI that nobody can try out.
But thx for the press release.
This tech won’t change anything.
Creators can use these video gen AI tools in various ways. There are some YouTube channels of people using these in creative workflows that are really impressive, from mocap replacement, character insertion, background replacement, changing camera angle in post, animating/inserting characters from character boards, animating between stills generated with traditional methods, etc. It's not just "prompt and generate". It can be, because it's easy, but it also doesn't have to be. It's a tool.
Do you have any examples of those creative workflows that have made it into Hollywood for example?
[0] e.g. Don't Look Up
That was just a bad, mildly entertaining movie.
I have not used Gemini in a month.
It could make the comments section even more fun.
https://blog.google/innovation-and-ai/products/identifying-a...
(and the previous SynthID: https://deepmind.google/blog/identifying-ai-generated-images...)
But it very much is "close the barn door after the horse has bolted and the barn has otherwise burned down".
From a technical perspective, it's very impressive, no doubt. But from an artistic perspective I thought all of these examples on the site look bad.
Certainly not me - you have to be a great artist/designer to even imagine what to do with it.
I used to joke that was the moment we discovered "for most people that's a pretty big limit."