ES version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
67% Positive
Analyzed from 637 words in the discussion.
Trending Topics
#models#glm#model#click#browser#harness#coding#agent#turbo#play

Discussion (16 Comments)Read Original on HackerNews
I tested Qwen3.6, Gemma4, Nemotron3-nano-omni. They fully hallucinate x,y coords. (did not try GLM-5V yet)
GPT-5.5 can easily do it. But also Vocaela, a tiny 500M model, is quite good at it. Hope they improve the training for x,y clicking soon on the smallish multi-modals.
Recently slopped a http service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser etc) https://github.com/julius/vocaela-click-coords-http
Have you tried doing a two step: review the image, then render a vector?
At one point I had some smaller model draw bounding boxes around everything that looked interactable and labels like "e3" ... then asked the model to tell me "click on e3". Did not work in my tests was pretty much as bad as x,y.
Comprehensive evaluation results at https://gertlabs.com/rankings
>Comprehensive evaluation results at https://gertlabs.com/rankings
But if you go to the linked site, it seems like the only thing that's part of the evaluation is how well the models play various games? I suppose that counts as "reasoning", but I don't see how coding ability tested?
Coding is what we test for most heavily. Testing this via a game format (instead of correct/incorrect answers) allows us to score code objectively, scale to smarter models, and directly compare performance to other models. When we built the first iteration last year, I was surprised by how well it mapped to subjective experience with using models for coding. Games really are great for measuring intelligence.
However, both Kimi and GLM can end up in doom loops so be careful how you use them. Without a proper harness the agent can easily get into some tricky situations with no escape.
We had to develop new heuristics in our cloud harness just because of this but I am really grateful that we did as the platform feels now more robust.
A small price to pay for model plug & play!
Turbo makes a huge difference in everyday use because it saves you time and you are not in the mood always to wait endlessly.