Reimagining the mouse pointer for the AI era

ddevhouse • about 3 hours ago • 43 comments • deepmind.google

Discussion (43 Comments) on HackerNews

arjie•about 1 hour ago

Oh interesting, this is very cool. At first I thought it was just focus-follows-mouse but it's more interesting. You have certain keywords trigger "add to prompt". Ignoring the voice functionality (which is admittedly crucial for now because other inputs currently take over focus), I've often wanted to just have a continuous conversation with the LLM as I 'point and click' (or tab over and select) at various things. Might be neat to have text input focus continue to go to the LLM where I'm typing text etc.

Sometimes I go to a different page to take a screenshot, other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well: selecting text in the terminal auto-focused the Cursor agent textbox, so you could talk to the agent, then select some text, and you didn't have to re-select the original agent textbox again. The agent is a top-level function in that system, not "just another app I have to switch to" and bring my context along to.

I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs, and things like this to aid but while my vim-fu is pretty good and I function inside things very well with it, my cross-application interface isn't.

In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to manoeuvre faster through our machines and get our ideas into them.

Cool research. I wonder what we'll end up with.

why_at•about 1 hour ago

My first impression coming away from this is skepticism.

Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.

Most of their examples seem like they could have been done with a right-click drop-down menu, so they don't really need to "re-invent the mouse pointer".

So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?

Does it cost the user anything for the LLM use? If it's free will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.

There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.

AirMax98•about 1 hour ago

Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?

nolist_policy•about 1 hour ago

The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.

chromacity•about 1 hour ago

My reaction to the first demo (recipe) is that it was slower than typing the same thing on your keyboard.

The second demo seems to be a wash: there's no time saved in saying "move this" versus "move crab". And an app-specific contextual menu would probably be faster.

The third demo doesn't seem to warrant the use of a pointer at all, since there is only one way to interpret the prompt.

None of this means that this approach will not be successful, but there's a reason why so many attempts to revolutionize user interfaces ended up going nowhere. Talking to your computer was always supposed to be the future, but in practice, it's slower and more finicky than typing.

In fact, the only new UI paradigm of the past 28+ years appears to have been touchscreens and swipe gestures on phones. But they are a matter of necessity. No one wants to finger-paint on a desktop screen.

joe5150•36 minutes ago

Talking to your computer can only ever work for people in atomized work-from-home silos, surely. I can't really imagine living in a world where everybody is just muttering commands to the computer all the time.

kjellsbells•about 1 hour ago

I sense a privacy problem brewing.

It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the user's control.

What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped to Google and is subject to a warrant or discovery or building your advertising fingerprint.

Maybe the idea is that the data is sent to AI only when you right click, but that seems like a very thin firewall that a product manager will breach in the interests of delivering "predictive AI" via some kind of precomputed results.

gobdovan•32 minutes ago

This is how I always imagined FE development would work once ChatGPT 3 came out. Then Cursor appeared and seeing how successful they were with just a chat and a few tool calls, I thought I was over-complicating things.

Anyway, I built a prototype on this idea, but instead of relying only on hover, I press Option to select a node in a custom AST-ish semantic layer I designed around a minimalist UI grammar, and Option + up/down arrows to move to parent/child node. This way, I have an accurate pointer to the element I want to talk about, plus a minimal context window (parent component, state, a few navigation-related queries).
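
For illustration only, the select-then-walk idea described above might be sketched in Python roughly as follows. This is not the commenter's prototype: the `Node` and `Cursor` classes, their fields, and the sample tree are all hypothetical, and the "context window" here is reduced to the selected node, its state, and its parent's kind.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node in a hypothetical semantic UI tree (all names are illustrative)."""
    kind: str
    state: dict = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)
    # Excluded from repr/compare to avoid parent <-> child recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


class Cursor:
    """Tracks the selected node; up/down stand in for Option + arrow keys."""

    def __init__(self, node: Node):
        self.node = node

    def up(self) -> Node:
        # Move to the parent; stay put at the root.
        if self.node.parent is not None:
            self.node = self.node.parent
        return self.node

    def down(self, index: int = 0) -> Node:
        # Move to a child; stay put on a leaf or an out-of-range index.
        if 0 <= index < len(self.node.children):
            self.node = self.node.children[index]
        return self.node

    def context(self) -> dict:
        # Minimal context: the selected node, its state, and its parent's kind.
        parent = self.node.parent
        return {
            "selected": self.node.kind,
            "state": self.node.state,
            "parent": parent.kind if parent else None,
        }


# Hypothetical sample tree: App > LoginForm > SubmitButton
root = Node("App")
form = root.add(Node("LoginForm", state={"submitting": False}))
button = form.add(Node("SubmitButton", state={"disabled": True}))

cursor = Cursor(button)
print(cursor.context())  # context for the button
cursor.up()
print(cursor.context())  # context for the enclosing form
```

Keeping the cursor separate from the tree means the same structure could in principle back both the keyboard navigation and whatever gets serialized into the LLM prompt.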

What I learned from using it, though, is that the killer use case isn't necessarily the flashy "talk to this UI element" interaction shown in the Google demos. I do use it that way too; I have `Option + Shift + click` to copy a selector to the clipboard, so I can give an LLM connected to the live medium a precise reference to the element I want to discuss.

But the place where it has been most useful day to day is much simpler: source navigation. Point at the thing in the UI, jump to the code that is responsible for it. The difficult part is jumping to the code you care about (the code for UI or for the semantic element?), but in my system that distinction turned out to be usually obvious, which is what makes the interaction useful.
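
As a rough sketch of the "UI code or semantic code?" choice mentioned above, and assuming each pointed-at element carries two optional source annotations, a resolver could look something like this in Python. The `SourceLoc` and `Element` types, the `jump_target` function, and the file paths are hypothetical names invented for the example, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceLoc:
    path: str
    line: int


@dataclass
class Element:
    """A pointed-at UI element with hypothetical source annotations."""
    name: str
    ui_source: Optional[SourceLoc] = None        # where the markup/styling lives
    semantic_source: Optional[SourceLoc] = None  # where the behaviour is defined


def jump_target(element: Element, prefer: str = "semantic") -> Optional[SourceLoc]:
    """Pick a location to open, falling back to the other layer if one is missing."""
    if prefer not in ("semantic", "ui"):
        raise ValueError(f"unknown layer: {prefer!r}")
    if prefer == "semantic":
        first, second = element.semantic_source, element.ui_source
    else:
        first, second = element.ui_source, element.semantic_source
    return first if first is not None else second


# Hypothetical element annotated with both layers.
submit = Element(
    "SubmitButton",
    ui_source=SourceLoc("src/components/SubmitButton.tsx", 12),
    semantic_source=SourceLoc("src/actions/login.ts", 40),
)
print(jump_target(submit))               # semantic layer by default
print(jump_target(submit, prefer="ui"))  # explicit request for the UI layer
```

The fallback is a design guess: if only one layer is annotated, jumping there is probably more useful than doing nothing.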

jpatten•about 2 hours ago

Reminds me of Put That There https://m.youtube.com/watch?v=RyBEUyEtxQo

juancn•about 1 hour ago

Please don't.

I like text selection exactly how it is. I want precise controls.

It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.

nolist_policy•about 2 hours ago

Wiggle at CAPTCHAs, wiggle at Termux, wiggle at Emacs, wiggle at the Godot Editor, wiggle at my remote desktop.

(Not going to happen)

loaderchips•about 2 hours ago

It's beautiful how the human mind can take something very obvious but overlooked and make it into this fantastic innovation. Fab stuff.

dandaka•about 1 hour ago

The next generation of OSes should have constant video and audio recognition by an on-device LLM. This will provide valuable context for a lot of scenarios. So instead of the frequent copy-pasting we are used to, we can let agents access the context of our whole workflows from different apps.

But Google is a very ill-positioned candidate for such an OS. I would rather trust Apple and local-first on-device models.

tintor•about 2 hours ago

Of course, it isn't a Google Demo if you can't use it to book a table at a restaurant. (shown at the bottom of the page)

maheenaslam•about 1 hour ago

The concept is good, but accuracy in a cluttered environment can be a concern, and misinterpreting context can also be a problem.

AbuAssar•about 2 hours ago

so Google will be monitoring whatever is on the screen continuously, or only when the user says the magic words (this, that, here, there)?

EdgeExplorer•about 2 hours ago

Indeed. "AI-enabled pointer" is misdirection. This isn't an AI-enabled pointer; it's sending the screen to the AI, which yes, includes the pointer position. The AI doesn't live in the pointer. The AI lives, apparently, so thoroughly in the system that it can see and do anything, and the pointer is just a way of giving it context.

OtomotO•about 2 hours ago

Google Recall. Hey, it's all about the marketing.

jaccola•about 2 hours ago

This seems like one of those things that is usable infrequently enough to be forgotten/poorly developed/never used. (Even before accounting for the actual failure rate of the LLM, which will be non-zero.)

Perhaps a text box and file upload isn’t the perfect interface for every use case, but it is versatile, which is a huge barrier to overcome.

hmokiguess•about 1 hour ago

Don't build these things, instead build protocols and expose system level APIs for application developers to build things.

iridione•about 2 hours ago

Interesting! I wonder how UI will evolve in the long-term? If there are browser-use/computer-use and clicky-clones automating pointer actions, do we really need complex UI anymore? If yes, when?

Ancapistani•about 2 hours ago

I've been playing with writing a visionOS app that allows an AI agent to be aware of what you're looking at at any given time.

At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.

strgrd•about 2 hours ago

No thanks

SirFatty•about 2 hours ago

It only took Google and their AI offering to come up with Graffiti.

mcookly•about 2 hours ago

I wonder what sort of monstrous power would be unleashed if Google used Plan9 as a foundation.

bitwize•about 1 hour ago

They'd half-finish it then bury it, like they did with Fuchsia which is heavily Plan-9-inspired.

xiphias2•about 1 hour ago

Google needs to beat OpenAI and Anthropic in coding models because that's where the big money is going. I love using the Gemini pro model for quick questions, but that's not where I'm spending the real money.

They have so many great software engineers but are unable to use them to speed up coding AI research. Hopefully with Sergey's focus it will get better.

This cursor thing is just another experiment nobody cares about.

Joker_vD•about 1 hour ago

Just seven hours ago there was a plea on HN [0] to please not do this. Seriously, what are they smoking at Google right now?

[0] https://news.ycombinator.com/item?id=48107027

mvdtnz•about 2 hours ago

Both of the text based demos would have been simpler and faster with traditional mouse and keyboard interactions. What is the AI adding?

hyperhello•about 2 hours ago

They’re going to take your abilities to do anything and spread them across many places so you have to run around to use them, same as all the moneyed technology.

wartywhoa23•about 1 hour ago

Hype-flavored surveillance!

dfxm12•about 1 hour ago

It tracks what's on the screen and sends it back to Alphabet. If you're watching a video about BBQ, enjoy a bunch of ads for Omaha Steaks and Big Green Egg in your Gmail.

On a less serious note, the audience for this is people who want to optimize for what seems like the least amount of effort.

slopinthebag•about 2 hours ago

It feels like everything modern is like this. No value added, just the appearance of it.

jinkuan•about 2 hours ago

being able to make precise edits would be huge for AI

LocalH•about 2 hours ago

do not want

simondw•about 1 hour ago

Maybe I'm misunderstanding, but what is new about the pointer itself? Seems to be functionally the same as selecting + tooltips / context menus.

kwertyoowiyop•about 1 hour ago

Shush, how is anyone going to get promoted with that kind of talk!?

DaiPlusPlus•about 1 hour ago

> but what is new about the pointer itself?

I'm hoping for a const-reference joke.

OtomotO•about 2 hours ago

Like a dream come true...

Nightmares are dreams as well and this is a nightmare like Windows Recall.

Technically wonderful though.

pmarreck•about 1 hour ago

There's already a product that does this lol

Aaaaand now I can't remember the name of it

themafia•about 2 hours ago

> We’ve been exploring new AI-powered capabilities to help the pointer not only understand what it’s pointing at, but also why it matters to the user.

We couldn't quite track you well enough before. So we're fixing that under the guise of "AI powered capabilities."

SirMaster•about 1 hour ago

Thanks, I hate it

brgsk•about 1 hour ago

what the hell is going on at google