
Discussion (283 Comments)
When I opened it up, I assumed the author would have at least attempted a calculation service, maybe even fed something like the size of the meal into an actual model, integrating pre-existing tools that are (slightly more) accurate. Hell - most food is literally required to have calorie information, and you can query open source data for the rest!
But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?
This is akin to the Instagram reels where people talk to ChatGPT and ask it to time how long their run is. Except those are treated as funny jokes rather than being turned into studies.
I'd like to see this study done using any kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting methodology and result in that.
There are very popular apps on the App Store right now that are going viral among non-techie people that do exactly this, and they have no concept of how AI works. My wife was talking about one and I had to give her a reality check that the AI had no idea what ingredients were used to make the food. And she's a licensed nutritionalist.
Studies like this create something to point at for people who are confused and serve as a springboard for a conversation in the media.
I think I'm just disappointed that this study doesn't go deep enough, and stays at a surface-level statistical analysis of frontier models.
None of those apps have magic. They cannot do better than the frontier models.
The "pic of packaged food --> LLM --> nutrition DB call" pipeline is workable, but many users of these apps are using them for fresh prepared foods, which is just an unworkable problem without either an understanding of the preparation process or a bomb calorimeter.
Nutritionist?
In the real world you need to calibrate your behavior with the results. Are you gaining weight? You'll need to eat less if you want to lose any. You can do all the math with nutrition labels and macros you want but that's all theoretical.
See this study below for the 20% figure, as well as their experimental results on real food items (some even exceeded this threshold though most were within it). https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/?st_source=...
Either way, if you count calories and compare to your weight gain/loss over a few weeks and adjust your calorie target as warranted, assuming the types of food you are eating do not change drastically (e.g. you calibrated on regular diet and now have started an elimination diet), the error bars can be basically ignored.
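A back-of-envelope sketch of that calibration loop, assuming the common ~7700 kcal per kg of body mass approximation (all numbers illustrative):

```python
# Sketch of calibrating a calorie target against observed weight change,
# using the rough ~7700 kcal-per-kg approximation. Values are illustrative.

KCAL_PER_KG = 7700  # rough energy content of a kg of body mass

def adjusted_target(current_target: float, kg_change: float, days: int,
                    desired_kg_per_week: float) -> float:
    actual_daily_balance = kg_change * KCAL_PER_KG / days
    desired_daily_balance = desired_kg_per_week * KCAL_PER_KG / 7
    # Shift the target by the gap between what happened and what you wanted.
    return current_target + (desired_daily_balance - actual_daily_balance)

# Logged 2200 kcal/day, gained 0.5 kg in 3 weeks, want to lose 0.25 kg/week:
print(round(adjusted_target(2200, 0.5, 21, -0.25)))  # ~1742 kcal/day
```

Note that any systematic counting error cancels out of this loop, which is why the error bars can be ignored.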
It's indeed like astrology. Simply thinking about personality traits and thinking through your life and your desires and goals and current situation is already beneficial to take charge and navigate your life.
Do you want to count calories, or do you want to lose weight? Sounds like it's possible to hyper-optimize calorie counting to the point that it becomes counter-productive...
Some good news from it. If you weigh the food instead of depending on the package size then the labels become much more accurate!!
"Serving size, by weight, exceeded label statements by 1.2% [median] (25th percentile −1.4, 75th percentile 4.3, p=0.10). When differences in serving size were accounted for, metabolizable calories were 6.8 kcal (0.5, 23.5, p=0.0003) or 4.3% (0.2, 13.7, p=0.001) higher than the label statement."
If you look at the table "Deviation of metabolizable calories from label calories" [https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/figure/F1/] you'll see that most labels, even for serving size, are pretty good, and there are some that are really bad.
If you look at one of the worst offenders, Tostitos, the label has "Tostitos Tortilla Chips - serving size 24 chips", but chips vary a lot in size, so you could have a huge variance in weight. If instead you weighed them, which I do with my chips, I bet the calories are much closer to the label.
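The arithmetic behind weighing rather than counting, with made-up label numbers:

```python
# Scaling label calories by measured weight instead of counting chips.
# Label numbers here are made up for illustration.

label_serving_g = 28   # hypothetical label: "about 24 chips (28 g)"
label_kcal = 140       # hypothetical calories per labelled serving

weighed_g = 45         # what the scale says for your actual bowl
kcal = label_kcal * weighed_g / label_serving_g
print(f"{kcal:.0f} kcal")  # 225 kcal, independent of chip size
```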
Body composition comes down to routine. I've found I love to eat, but I pretty much eat the same meals week over week, and that makes it extremely easy for me to lose or gain weight depending on my goals.
So when you cook yourself and you weigh the ingredients used for cooking, you can know the real calorie content with far more accuracy than when buying ready-to-eat food.
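For illustration, a worked example of that approach, with hypothetical label/database figures:

```python
# Per-serving calories from weighed raw ingredients (figures illustrative).
# Weigh raw, sum label/database kcal, then take your share of the cooked
# weight; water lost in cooking carries no calories.

kcal_per_100g = {"dry rice": 350, "chicken thigh": 180, "olive oil": 884}
weights_g = {"dry rice": 200, "chicken thigh": 400, "olive oil": 30}

total_kcal = sum(kcal_per_100g[k] * weights_g[k] / 100 for k in weights_g)
cooked_weight_g = 950   # weigh the finished pot
my_portion_g = 380      # weigh your plate

print(round(total_kcal * my_portion_g / cooked_weight_g))  # ~674 kcal
```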
It's just not that easy to change the nutritional content of a kilogram of a known cultivar of dry rice when it's passed all the standard checks for moisture content, protein content, etc.
And the more natural a food is, the more inaccurate the results will be because of natural fluctuations. Think of the amount of fat a chicken can have. So making this percentage stricter would only benefit foods that are all chemicals.
The usual ultra-processed slop made of shit ingredients allows for very low tolerances. Mayo has some lab-grade soy oil, lab-grade yolk, and perhaps some lab-grade starch as thickener. Yippee, we have a tolerance of 0.1% in calories. But how do you reach that level of accuracy with a roast chicken with no added stuff?
What has more calories: 1 lb of peanuts, OR 1 lb of peanuts ground into peanut butter?
I can't find the study, but the peanut butter has more calories since it's pre-ground and more bioavailable. Peanuts get chomped up, but larger pieces still remain and are not captured by the body.
Had the author written the article themselves rather than an LLM their motivation probably would have been clearer.
Yeah, for sure there are. And people will just ask ChatGPT as well.
The funny thing is that for people who are just trying to lose weight without managing any health issues precisely, this type of extreme variance doesn't really matter, because the mere act of consciously quantifying food consumption is, based on my experience counting calories, the single biggest factor in success with weight loss.
Once or twice a year I spend a few weeks meticulously measuring ingredients/cooked foods and recording calories and on complex recipes apps are next to useless at getting accurate data. You're trying to input five or ten relevant ingredients, and then weighing your cooked outcome to try and divide the ingredients by proportion. Frankly it's a mess and most people aren't doing it for home cooked meals, and are getting very lossy outcomes (weighing cooked chicken and marking it as raw chicken, etc)
With reasoning and tool calling (combined with me meticulously weighing before and after), it's producing fine data for my purposes.
> The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.
https://github.com/Artificial-Pancreas/iAPS
I think these are the prompts in the app: https://github.com/Artificial-Pancreas/iAPS/tree/5eabe22e7e2...
> The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.
This study is to prove that you should not rely on LLMs
The part where they talk about sampling multiple runs is interesting - it suggests to me that in the next few years as the reasoning process is improved the models may be able to do that autonomously.
My mind really goes to using dedicated object detection models fine-tuned with nutrition information, but I don't think there's a fundamental reason LLMs can't eventually manage this use case, except perhaps the size of the needed weights being prohibitively large.
That has nothing to do with the question being asked: can you rely on an LLM today to help you track carbs as a diabetic?
This is very explicitly what the article is all about. Potential future LLMs are entirely irrelevant.
You can out-exercise almost any diet, but it takes 3-4 hours a day of a hard workout.
If calories in, calories out was useful advice rather than a banal statement of physics, nobody would be fat.
This is about people suffering from diabetes tracking their insulin needs. You can outrun any diet, but not insulin shots.
They’re writing in a neutral way that reaches their audience without lecturing or being condescending. They lead the reader to the conclusion rather than shoving it at them. I think that’s why it’s triggering so many angry comments on HN, but it’s effective for the audience they’re writing for (non technical people who may need convincing but don’t like being preached at)
There’s a gap between what the tool will allow you to tell it to do, and what it’s good at. The feedback mechanism to tell the difference is deficient compared to a hammer.
If not, then perhaps there's a problem in your analogy.
The article explains this: There are apps targeting people with diabetes that claim to count your carbs with AI.
> If you’re using AI carb counting in a diabetes app
Before you dismiss a study, try to understand where it’s coming from.
The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.
Yeah. I think it is under-appreciated that much of science is intended for debugging purposes. Sure, you and I know that X is positive, but what's its actual value? Can we find the causes that make it that way? Et cetera.
If there are apps targeting people with diabetes that claim to count your carbs with AI, why haven't those been analysed? That would make for a far more effective claim.
I based my take on the study off of the clickbait article they wrote about it - I'll read through the study to see whether they analyse that, but it would be far more effective to see if the 'carb-counting' AI apps return similar results to the frontier models - that's an interesting result that could actually move the discussion forward.
Because the apps aren’t going to let you submit 29,000 automated requests for statistical analysis.
And if you did, the authors of those apps would just release an update saying they changed models and try to dismiss the study.
The vitriol against this article on HN is sad. Commenters who agree with the article and its conclusions are grasping for reasons to be angry about it anyway
The fact that you somehow perceived this as an attack on LLMs as a technology is a failure entirely on your part. There is nothing in the article that suggests that people shouldn't use LLMs for other purposes - just a statistical verification of the fact that they shouldn't be used for this one particular thing.
If there are commercial services where you take pictures of food and are promised a realistic (paid for) response, then yes. And there are.
Having counted calories for years, I don’t think I could reliably estimate the calories or carbs in the example picture of a cheese sandwich. I can make assumptions about the bread and the cheese, but I might easily be off by 2-3x. Calorie counting apps that use text descriptions also have huge variance for the same thing. The problem might be the belief that a picture or description is enough, regardless of who or what is guessing…?
Edit: Ah, I see from sibling thread you meant commercial services are LLMs, I thought you meant there were human-backed services to compare to. Anyway, I totally agree there’s a problem if people rely on AI for safety, but I’m not sure LLMs are the core issue here, it seems like using vague information and guessing is the core issue.
You seem to be missing the context that this isn't just about diet apps - this is about apps claiming to be able to track carbs sufficiently accurately to be used in a medical context to dose insulin (a substance which can be lethal if incorrectly dosed)
The opening to the actual paper is quite explicit that (i) other studies have already tested commercial apps with unimpressive results and (ii) a popular open source app for carb counting directly relies on API calls to these frontier models, and this research batch-tested the images using the exact same models and prompts as that app.
Funny thing is 4o did look up calories but I guess it was too good for this world
So we’re trying to define, through trial and error, what problems “AI” will actually solve, and this paper is one of the many cobblestones on that road.
"AI can solve this one problem, but it needs X, Y, Z, because it's not a omnipotent god entity"
"I tried it without any of those things and it didn't work - this is worthless tech!"
I don't know if more accurate calorie counting using AI exists - but it's like being upset that the screwdriver isn't gluing wood. AI is far more than frontier LLMs.
LLMs are certainly not worthless, that’s a strawman in the same way my statement “AI will solve all of our problems” is a strawman. The question of their worth is being explored.
“AI is a black box that can solve problems”. Which problems? How consistently? At what cost? How quickly?
0 advertisements from openai or anthropic say this. They all sell you an omnipotent god entity.
I would occasionally check the estimates, maybe once every few days for meals I wasn't already pretty sure of, and it was generally accurate. Where it was extremely inaccurate was on portions, and anyone who has dealt with computer vision could tell you, you can't get scale from a picture. So I'd have to weigh some meals or ingredients, which would generally make things more accurate again.
So, I think it's possible, but you need multimodal data and grounded with regular checks.
It'll defend both sides (mutually contradictory) to the death. NOTHING will budge it from its initial stance.
Some people do well on 6 small meals, others do well on no breakfast and two large ones. Studies can't tell you anything useful about that, you have to experiment and find out what works best for you
Reminds me of that one YouTube video (I forget whose it is, so I have no idea how to pull it up) where he turns on his phone's camera for ChatGPT and asks it what everything it sees weighs, then puts it on a scale, and ChatGPT was never right, ever. Which makes sense - I couldn't tell you what most things weigh on sight alone either - but ChatGPT was often dramatically off. I got the feeling he thought it was terrible AI for this, but I don't think a model looking at an image of something and trying to guess its weight / calories / etc. is a reason to call an AI model bad...
It also exemplifies how current AI offerings are still quite limited in their capabilities, because one would expect that they’d do the intelligent thing on their own that you had expected, instead of the user having to come up with a working methodology.
This is a problem with the companies selling the AI models, not the customers. It is their responsibility to inform consumers about the limits of their services, and to train the models to say "I don't know, there is not enough information".
There are dozens of ios/android apps with 100-300k+ ratings and god knows how many millions of installs which do exactly this
"Cal AI - Food Calorie Tracker: Just snap a photo and our smart AI calorie tracker analyzes your meal instantly."
308k ratings on iOS, 264k ratings on Android; easily 5-10m installs across both platforms.
You say this (and I agree), but I know of quite a few companies in this area, including a couple accelerated by YCombinator, and that's pretty much 100% of what they do in their backend.
If you are counting calories, you don't want the answer to "how many calories are in the average avocado?", you want to know how many calories are in this avocado. Remember that bodyweight is roughly linear with BMR, so a 10% error in calorie counting is an extra 10% of bodyweight.
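A quick sketch of the commenter's linearity argument, with illustrative numbers only:

```python
# Back-of-envelope version of the linearity argument: at equilibrium,
# intake equals TDEE, and TDEE scales roughly linearly with bodyweight,
# so a persistent 10% counting error shifts the equilibrium weight by
# roughly 10%. All numbers are illustrative.

maintenance_kcal = 2000                      # TDEE at current weight
weight_kg = 70
kcal_per_kg = maintenance_kcal / weight_kg   # ~28.6 kcal/day per kg

error = 0.10                                 # you eat 10% more than you log
extra_kcal = maintenance_kcal * error        # 200 kcal/day unaccounted for
new_equilibrium = weight_kg + extra_kcal / kcal_per_kg
print(round(new_equilibrium, 1))             # 77.0 kg, i.e. ~10% heavier
```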
There is a shocking number of computer vision tasks where scientists claim you can get X info from a picture of Y, and it's like: even with ML/AI you can't extract data where there isn't any. The fact that I can add an arbitrary amount of high-calorie fat to a meal without changing its appearance shows, by definition, that it's pointless. A 1000-calorie and a 100-calorie milkshake can look identical, and you'd have no way of working that out from an image even with a super-intelligent system.
Similarly, I see it in serious research papers on things like extracting the material of an object from an image of it, which cannot be done for the same reason: how an object looks has very little to do with what it's made of, else painting and other art would clearly be impossible. The information is just not there in the data.
I laughed, but you nailed it. Sadly so many people lack even a basic understanding of LLMs and the ViT tower that makes one a VLM, that I expect a whole industry, similar to fortune telling, to emerge out of it.
Oh! Do the vendors offer trainings to make sure the users understand how LLMs work? If not, surely, the LLM itself is trained to know its limitations and politely decline in situations like that?...
The #1 use case for this tech is "here's a problem I don't feel like solving, let's have a computer do magic". It's how it's advertised on TV, it's how it promoted in the software I already use. Food preparation? Travel planning? Shopping? Tutoring your children? You can do anything now!
I just talked to a realtor who will make a killing on a real estate transaction. Instead of offering human insights, they sent me "AI reviews" of several properties. The AI has never been to any of these properties and has no idea what they actually look like. But I guess it's how we operate now as a society.
If you go to eBay, every other listing description for used items is AI-generated. This is an official platform feature for sellers. The AI doesn't know the condition of the item or what's included or missing. Doesn't matter, it's magic. It's AGI, it will figure it out.
Most of the uses of AI I encounter as a consumer are like that, and the companies selling this tech are 100% complicit.
https://academy.openai.com/public/content
https://www.commonsense.org/education/articles/practical-tip...
Quote
>Getting the most out of generative AI depends on what you put in. To quote our Outreach team, "It's a tool, not magic!" As the technology evolves, more and more chatbots are designed for specific purposes.
https://www.anthropic.com/learn https://anthropic.skilljar.com/ai-fluency-framework-foundati...
https://grow.google/ai
All of the above are completely free; you could start three courses today that specifically teach you how AI tools work in practice. Yes, this is different from the marketing that some of these tools use, but these resources are there, free and available.
Maybe we need to create a form of driving license for responsible AI use, but saying the resources don't exist is not accurate
Outside our tech-enabled bubble, there are folks who have been sold the idea that ChatGPT et al. are miracle workers capable of replacing dieticians, gym coaches, psychologists, etc.
So it's VERY plausible to believe that there are folks out there snapping pics of their meals and asking GPT to spit out nutritional values.
I suppose I just expected this study to be a little less 'water is wet', which left me disappointed, but that may be coming at it from a more technical perspective.
This is because the people who promote these technologies, and the companies that sell these technologies, engage in a massive amount of puffery (aka hyperbolizing aka just straight telling lies).
These technologies are painted as the magical solution to whatever problem you have (all it costs you is a few tens of thousands of tokens, aka your water supply). There is literally nothing they CAN'T do, if you will just let us build these gigantic small-town-destroying, noise-polluting, water- and electricity-hungry 'AI data-centers'. So that we can use those datacenters to sell you more tokens to put into their slot machines.
The aim of the study was to understand the variation in results returned by models and how that could cause risks for patients using those models. The main result was measuring within-model variation.
From the pre-print (https://www.diabettech.com/wp-content/uploads/2026/04/diabet...):
We aimed to characterise the within-image reproducibility of carbohydrate estimates from four large language model (LLM) vision APIs and to quantify the clinical risk for insulin dosing, stratifying accuracy by reference value quality.
Methods
Thirteen food photographs were each submitted 495–561 times to four LLM vision APIs (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) using an identical structured prompt adapted from the iAPS automated insulin delivery system (26,904 total queries, temperature 0.01). The primary outcome was within-image variation (coefficient of variation [CV], range, distributional normality). Secondary outcomes included accuracy against reference values for nine images, stratified by quality tier (packet label, weighed/measured, portioned, or visual estimate). Clinical risk was translated at an insulin-to-carbohydrate ratio of 1:10.
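For readers unfamiliar with the primary outcome, this is roughly the computation the paper runs per image, sketched here with made-up numbers (the actual study used 495-561 queries per image):

```python
# Sketch of the paper's primary outcome: within-image coefficient of
# variation (CV) across repeated queries of the same photo. The estimates
# below are invented for illustration.

import statistics

carb_estimates_g = [38, 45, 40, 52, 35, 48, 41, 60, 39, 44]  # hypothetical

mean = statistics.mean(carb_estimates_g)
sd = statistics.stdev(carb_estimates_g)
cv = sd / mean * 100

print(f"mean {mean:.1f} g, SD {sd:.1f} g, CV {cv:.1f}%, "
      f"range {min(carb_estimates_g)}-{max(carb_estimates_g)} g")
```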
>> I'd like to see this study done using any kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting result methodology in that
The ground truth was established by the author. There's an appendix in the pre-print (Appendix I) that describes the methodology. Methods are described in page 4 of the pre-print:
Reference values for accuracy analysis
For nine of the thirteen images, the author estimated the carbohydrate content using methods described in Appendix 1. Reference quality was categorised into four tiers:
Tier 1 (packet label): Carbohydrate values derived from manufacturer nutrition labelling. Two images (cheese sandwich, soup with bread) used bread with labelled carbohydrate content of 20 g per slice.
Tier 2 (weighed/measured): Portions directly weighed and cross-referenced with established composition data. Three images (Bakewell tart, bakery cookie, breakfast burrito).
Tier 3 (portioned): Portions estimated by the author (not weighed) and combined with USDA composition data. Three images (roast dinner, chilli con carne with rice, stuffed pork loin).
Tier 4 (visual estimate): Portions and composition estimated from visual inspection. One image (churros).
For the four restaurant dishes (pizza capricciosa, eggs benedict, crema catalana, paella), no reference value was established. These images were used for the primary reproducibility analysis only.
Carbohydrate values follow the EU convention with dietary fibre excluded.
The public's education comes from the incessant marketing from AI companies that their models are the panacea for everything.
If someone sent me a picture of a meal and asked me what the macros were or how many carbs this is, I would say "I can't tell from a photo. Nobody can". The problem is that current LLM chatbots don't seem to have a concept of telling you "I don't know", "you can't do that" or even "you're wrong".
You can say that somebody shouldn't trust an LLM for this, but it's going to be a problem that LLMs give nonsensical answers. What I find particularly amusing is that there are still technical people (generally, not anyone specifically) who seem unable to acknowledge that LLMs hallucinate and lie.
There was a post on here recently that I couldn't find with some quick searching but the premise basically was that chatbots were trained like neurotypical people: A lot of affirmation and basically lying. Separately someone else characterized this NT style of communication as "tone poems" [1]. I keep thinking about that because to me that's so accurate.
Dunning-Kruger is a common refrain on HN, for good reason. Another way to put this is how often people are confidently wrong. I really wonder if this is an inevitable consequence of NT communication because most neurodivergent ("ND") people I know are incredibly intentional in what they say and mean.
[1]: https://news.ycombinator.com/item?id=47832952
If I ask the AI about some health issue, it says something along the lines of "warning: I'm not a doctor", etc. So if I show it a picture and ask it to tell me the carbs, how about a warning telling me it can try, but that it probably won't be very accurate.
We've been seeing examples of this constantly since 2022. How many more do we need?
No jobs, ai Jesus is coming, and if you use ai it will use all of the worlds compute power to try to convince you it's correct even when it's not.
https://www.youtube.com/shorts/B7c9qJcRnVk
<https://nces.ed.gov/pubs2019/2019179/index.asp>.
There's a related study of adult technical literacy conducted in 33 OECD nations:
<https://www.oecd-ilibrary.org/education/skills-matter_978926...>.
Both show that only a small fraction (5-10%) of adults operate at high levels of literacy (whether of text, numeracy, or technology), and that a large fraction (roughly 50%) operate at a minimal or below-minimal level.
Even if you _know_ the debit card transaction is safe, there’s no reason to risk it when a weirdo is filming you with some wild contraption.
AI is a very complicated calculator - you give it an input, magic happens, it gives you an output. Really no different, to a layman.
People who are unskilled at a task are unaware of what that task looks like when performed correctly. So somebody who can't count calories is unable to tell that the AI can't perform the task correctly either.
[1]: https://pubmed.ncbi.nlm.nih.gov/10626367/
Which is a good thing because it means we can talk like normal humans ("people don't know that it's unreliable") instead of acting like we're making such a profound claim that it needs a citation and psychological dissection.
They absolutely won't be 100% correct (bread sizes e.g. are going to be an estimate), but unless it's a trick sandwich drenched in olive oil or with hollow cheese, they're probably going to be in the right ballpark.
I don't think it's outside the realm of possibility for an LLM to be in the right ballpark as well, but that doesn't seem to be where we're at now.
That said, it's notable that diabetes education materials often suggest estimating glycemic loads by rough portion size / plate ratios. Which is to say that absent accurate weight measurements (themselves subject to variations in ingredients, moisture levels, etc.) current clinical recommendations are themselves pretty rough.
Photons don't carry that information? Sure. But you don't just have photons to go by. You can rely on a large database of prior knowledge about how food is usually made and with what ingredients.
Other people who have to rely on their imperfect human senses to decide what they can and can't eat: people with allergies, people with heart problems, hypertensives, kidney patients, etc. etc.
It could be much different - it could be one of those breads with weird macros, or fake cheese, or it could be hollowed out and packed full of hidden vegetables. But a human is going to give you the answer for two slices of plain white bread.
I am pretty good at this and the cheese sandwich example threw me, I would have estimated around 10-15g of carb for each slice. So the 28g is fairly consistent with that, not 40g. The only real way would be to weigh it and use the labeling. Another thing that often gets people is the labeling often has a serving size of say 2 slices and a weight that does not reflect the actual weight of 2 slices.
Luckily with good tools the significance is reduced, people using closed loop insulin pumps will automatically correct for that. Lots more room to wiggle.
> There is not enough information to make an accurate estimate, but if you'd like, I can take a stab at it. If so, how much effort to put into it?
> Yes, go ahead and spend up to 5mins and $1 to analyze it.
> Done, I've had 100 subagents analyze the image and have arrived at a 95% confidence interval of the portion containing ...
Many of the comments here assume the authors are stupid and were surprised by the result, but the point of the article is to inform readers that AI carb counting apps don’t work. That’s why they did the study.
Your cheese sandwich may contain a lot more or a lot fewer calories, even if you take the numbers from the packaging and calculate the correct ratios by weight. The calories on the label are based on an average, and individual packages may contain more or less of any listed nutrient within some margin. Of course, counting calories is meaningless if not done on a long-term scale anyway, but on a long-term scale the LLM doesn't need to guess the correct amount either.
Not “Here’s a random guess that I just pulled out of my ass.”
LLMs have picked up from scientists the bad habit of trying to give an answer when no answer can be given; scientists, overall, don't say "I don't know" nearly as often as they should.
You need to write a specific prompt to avoid any warnings.
Of course a lot of people don't know what limitations LLMs have, so there's some value to a blog post about it, but it's not as black-and-white as the article might suggest with its graphs.
The prompt (documented here: https://www.diabettech.com/wp-content/uploads/2026/04/Supple...) lists specific instructions and a specific output format that doesn't allow the LLM any room for explanation or warning in processable data (only in notes fields). In fact, the prompt explicitly tells the LLM to ignore visual inferencing for some statistics and to rely on a nutrition authority instead.
Even in that intentionally restricted format, the English language output uses words like "roughly" and "estimated" in the LLMs I've tested.
Sure, if you take the numeric values and plot them in graphs, you get wildly inconsistent results, but that research method intentionally restricts the usefulness and reliability of the LLMs being researched.
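As an illustration of why a schema leaves no room for hedging, here is a minimal sketch of a constrained output schema; the field names are illustrative, not copied from the study's supplementary prompt:

```python
# Minimal sketch of a schema-constrained output: when the only legal
# fields are numbers, there is nowhere for the model to say "I can't
# tell from a photo". Field names are illustrative.

import json

carb_schema = {
    "type": "object",
    "properties": {
        "food_name": {"type": "string"},
        "carbs_g":   {"type": "number"},
        "fat_g":     {"type": "number"},
        "protein_g": {"type": "number"},
    },
    "required": ["food_name", "carbs_g", "fat_g", "protein_g"],
    "additionalProperties": False,
}

# Passed via e.g. OpenAI's structured-output mode (or any equivalent
# constrained-decoding feature), the decoder can only emit JSON of this
# shape, so a confident number is the only possible answer.
print(json.dumps(carb_schema, indent=2))
```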
What's much more troubling is this line from the preprint:
> The open-source iAPS automated insulin delivery (AID) system now offers food analysis through APIs from OpenAI, Anthropic and Google [8]
The linked app does seem to have a disclaimer, though:
> "AI nutritional estimates are approximations only. Always consult with your healthcare provider for medical decisions. Verify nutritional information whenever possible. Use at your own risk."
From the paper, they're using structured JSON schema mode as opposed to freeform answers, so it can't. Models do typically caveat their answers for questions like this, in my experience.
They’re algorithms and they were designed this way.
This is targeted at people with diabetes because there are AI carb counting apps appearing in app stores
> If you’re using AI carb counting in a diabetes app
These apps are probably not even using the mainstream models used in the study because they would be too expensive for cheap or free apps, and they’re probably forcing structured output to get a response without any of the warnings that an LLM might include if you ask it directly.
Like, are people actually using LLMs for this? Please do not, it won't work.
Does the model say it can't do that when asked? No, it answers confidently.
Also it's easy to trust it if you don't know how it works
I came across a LinkedIn post a couple days ago where someone had asked ChatGPT, "What are the top things you get asked about $NICHE_INDUSTRY_THING_I_AM_SELLING?"
As if there is introspection like that at the meta level, where ChatGPT could actually provide hard numbers around its own usage and request patterns.
The fact that these products work with natural language beguiles people into thinking they are, indeed, magic oracles.
Anthropic's trillion dollar valuation hinges on the idea that it is just that, a magic oracle that can replace any worker for any type of task. Any programmer, any author, any musician, any kind of clerical work. All we've asked here is "sudo evaluate me a sandwich", the sort of estimation task that humans with internet resources might reasonably be expected to do, and it's given up?
(It would be fun to compare this to sending the picture out on Mechanical Turk and asking humans to eyeball the calorie count of said sandwich...)
Cal AI, which claims to generate a nutritional breakdown based off a photo, has $30 million in annual recurring revenue.
https://techcrunch.com/2025/03/16/photo-calorie-app-cal-ai-d...
As far as consumers know, LLMs can identify the towns where pictures were taken (without metadata), can summarize entire movies, generate clips of your kid flying a rocket to the moon, can translate images from any language imaginable, but somehow they cannot estimate the calories in a cheese sandwich.
The supposed professional posting about an LLM deleting their prod database for their non-existent company asked the AI to explain itself. That's the level of LLM knowledge you should expect from most people that actually work with these tools.
Truth is, the LLM is good at making intelligent decisions. But in order to make an intelligent decision, you need context.
If you give proper context -> ask the LLM -> get almost perfect result every time.
Anything else is rolling dice, a very special type of dice, but dice anyhow. Not magic.
And a person with sufficient knowledge could easily give a rough estimate of the calories. A slice of store bought sandwich bread of a given thickness generally has calories within a certain range. So do cheese slices. It's elementary school health class material. We all learn how to calculate calories in a meal. Packaging on food also always has calories, so clearly people know how to estimate it fairly accurately.
If a fifth grader can calculate it but an AI can't, that says a lot about how bad these AIs are. We'll get another series of paid and bought articles saying "AI analyzed IMPOSSIBLE math problem beyond human comprehension and solved it with FACTS and LOGIC", while at the same time being told "bro no you can't expect an ai to calculate calories in a sandwich bro that's impossible bro if you even try that then you're insane for even thinking ai should be used that way bro". These companies need to decide: is AI smart enough to solve hard questions, or is it too useless to calculate something any kid could do by googling calories in a slice of bread and doing some basic arithmetic?
That's not done by looking at it and guessing (or at least it _shouldn't_ be; manufacturers have been known to do this but it's bad practice and may cause them regulatory problems). In an ideal world it's done with one of these: https://en.wikipedia.org/wiki/Calorimeter ; less ideally it can be estimated based on the ingredients.
But does training LLMs to be better at this improve their world model, or does it only make changes at the surface?
The problem itself is unsolvable given the data provided.
You could conceivably make it better at making guesses, but they will inherently always be guesses that will sometimes be wildly off.
https://www-users.york.ac.uk/~ss44/joke/3.htm "There is at least one field, containing at least one sheep, of which at least one side is black."
Extreme example perhaps, but no, you can't just turn pixels into calories. Right now I'd be impressed if we could reliably estimate volume to within 30% from a photo, but even with that correct the contents of the food can easily be way off without visible sign.
I'm sure one could produce a CV model that was a lot better at guessing here than these LLMs are, but fundamentally it is still guessing.
They are surprised and upset when the Oracle is not perfect
Go ahead and search around on hacker news you’ll see precisely the same pattern with people who are ostensibly engineers and hackers
It’s actually pretty mind boggling but then again humans never fail to surprise and disappoint
Some people have a very poor understanding of what LLMs are good for. Some people do see them as magic oracles.
Well, firstly, the average IQ is 100. And also because people market products to consumers that claim to be able to count carbs from images. If you don't know the limitations of LLMs, there would be little reason to doubt it for an uninformed or below-average-intelligence person, of which there are hundreds of millions.
----
Wikipedia for Crema catalana:
Crema catalana (Catalan for 'Catalan cream'), or crema cremada ('burnt cream'), is a Catalan dessert consisting of a custard topped with a layer of caramelized sugar.[1] It is "virtually identical"[2] to the French crème brûlée. It is made from milk, egg yolks, and sugar. Crema catalana and crème brûlée are made in the same way.
---
Oh no, my AI can't detect that an obscure clone of a famous dish is indeed the obscure clone, and not the commonly known version.
They are both covered by burned sugar and therefore indistinguishable(!) visually.
Yes, people are using LLMs for this kind of thing. Lots of people. All the time. I've met plenty of them, and there are loads of apps that offer this kind of "service". The authors are well aware that people are doing this and probably anticipated the result.
Why do the study at all? Because it's important to demonstrate and measure things, even obvious ones. Because it's not obvious to everyone, like the people who are already consulting LLMs for dietary information to manage their health. Because it's easier to enact official policies when there's hard evidence.
It'd be really interesting if it evaluated humans on the exact same image sets. The correct answer is just to feed in more data, such as the exact food itself, but the post makes it sound like the model is the only risk in this approach to counting carbs.
Any nutrition facts these models might use trace back either to data from this database or to FatSecret. Anything custom, like estimating meals at most restaurants, is going to involve adding and multiplying stuff, and we know how great LLMs are at that.
Was it always correct? Certainly not. But it helped me lose 30kg of weight since keeping even some track of calories was so much easier with LLM than any app I had used before.
Also of course it didn’t matter if I was exactly on point since it wasn’t about any kind of medicine
Seems that in this case a traditional approach would be more precise and more environmentally efficient to get to the same results.
Much easier for me to take pictures of the packets while making the food, then weigh the final bulk product, and then when I eat just weigh the plate and say "500g of casserole"; the LLM spits out the calories and keeps track of the daily consumption.
Curious, what model are you using? I have found Qwen Flash to be really great for this - tool calling works well, it's smart enough, and very cheap.
> 42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.
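That dose figure follows from the paper's stated insulin-to-carbohydrate ratio of 1:10; a quick sketch of the translation (the low-end figure is illustrative):

```python
# The paper translates carb estimates into insulin doses at an
# insulin-to-carbohydrate ratio (ICR) of 1:10, i.e. one unit per 10 g of
# carbohydrate, so the spread of estimates maps directly onto a spread
# of doses. The high figure matches the article's quote; the low one is
# illustrative.

ICR_G_PER_UNIT = 10

def dose_units(carbs_g: float) -> float:
    return carbs_g / ICR_G_PER_UNIT

low_g, high_g = 45, 429
print(dose_units(low_g), dose_units(high_g))  # 4.5 vs 42.9 units
```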
“What is the armour value for the Leather Shirt in the game Stravaeger?”
It confidently got it wrong.
“You can find the game at https://stravaeger.com”
Different confident answers, also wrong.
“You’ll find it in a table on this page: https://stravaeger.com/docs.html?inventory_item=LEATHER_SHIR...“
Oh, sorry. I was inferring from other similar games. Here is a different confidently wrong number.
“It’s also in the .json file linked on that page”
And another wrong value. Random numbers should have got it right by now, but no. And the confident, authoritative tone never changed. Every model I tried was the same story.
> I Asked AI to Count My Carbs 27,000 Times. It Couldn’t Give Me the Same Answer Twice.
If you look at the image https://www.diabettech.com/i-asked-ai-to-count-my-carbs-2700... it clearly shows some repeated values. I guess the AI likes multiples of 5 or 10 or something. It would be nice to look at the raw tables.
> A cheese sandwich on a plate. Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.
Real cheese or fake cheese that is actually flour paste with gum and colorant? Does it have mayo? I like mayo! Real mayo or fake mayo that is actually flour paste with less gum and another colorant? Does it have a slice of jam that is totally covered by the bread? Real jam or illegal fake jam that is actually ground pork with flour paste, more gum, and yet another colorant?
> The models don’t always know what they’re looking at. [...] Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.
Can someone from Europe tell me the difference? I like it (at least one of them), and I eat it from time to time (like once a year, in a restaurant), but looking at the Wikipedia page of both I can't tell the difference.
It’s tractable I think, but not from a pic alone.
There is already a solution to this that would be very hard to beat (and one can choose to use or not use an LLM to assist): prepare food yourself and use the information provided by the manufacturer.
However for diabetes accuracy is likely preferred and I’m not sure any computer vision would be palatable.
No you wouldn't, not if you have a basic understanding of how LLMs work and what "temperature" is. They are stochastic algorithms picking the next token based on a highly structured (and often very useful) coin flip.
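For anyone who hasn't seen it spelled out, here is a toy sketch of temperature sampling over next-token scores (the logits are invented, not from any real model):

```python
# Toy sketch of temperature sampling: logits are scaled by 1/T before a
# softmax, then the next token is drawn from the resulting distribution.
# Even T=0.01 leaves a (tiny) chance of picking a non-argmax token.

import math, random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())                        # subtract max for stability
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    r = random.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

logits = {"40": 2.1, "35": 1.9, "55": 1.7}          # toy next-token scores
print([sample_next_token(logits, 1.0) for _ in range(5)])  # varies run to run
```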
Besides, AI grossly over/underestimates values even when you give it a photo of the packaging with the nutritional table and tell it the weight you used.
The other thing that surprised me, at least until I read up on how LLMs actually work, was how it would confidently BS you on your daily total.
Even when the chat/messages are just "Ate ABC with XYZ values, what's my daily total?"
I guess a new chat for each day, or some MCP for storing and retrieving records/meals, would've helped with those daily totals.
The total would still be wrong, though - unless you explicitly specified each of the values you need tracked (e.g. carbs, fat, protein, kcal) to be put into records.
At which point, of course, you're not really using an AI/LLM but basically a CRUD application.
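A minimal sketch of that CRUD endpoint of the argument: whoever parses the meal, the total comes from storage and arithmetic, not from the model's memory (all names hypothetical):

```python
# Sketch of the "basically a CRUD application" the commenter lands on:
# whatever parses the meal (LLM or human), the daily total comes from
# deterministic storage and arithmetic, not from the model's context.

from dataclasses import dataclass, field

@dataclass
class DayLog:
    entries: list[dict] = field(default_factory=list)

    def add_meal(self, name: str, kcal: float, carbs_g: float) -> None:
        self.entries.append({"name": name, "kcal": kcal, "carbs_g": carbs_g})

    def totals(self) -> dict:
        return {
            "kcal": sum(e["kcal"] for e in self.entries),
            "carbs_g": sum(e["carbs_g"] for e in self.entries),
        }

log = DayLog()
log.add_meal("porridge", 320, 54)
log.add_meal("cheese sandwich", 420, 40)
print(log.totals())  # {'kcal': 740, 'carbs_g': 94}
```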
That is why I believe this piece from Tim is remarkable: it shows the limitations in a language the diabetes community can understand, and this is why I posted it.
Already the first paragraph highlights the issue; unless you set temperature=0.0 and the model can actually do reproducible inference, none of the "answers" you get are deterministic!
But it's a very common misconception that "same question gets same answer" holds, when it's almost by accident that you get the same answer for the same question. The problem is that people expect this; most platforms are not built to provide that experience. Of course you get different responses - it's on purpose!
I'd be OK with it if I was generating a picture of X, or some word salad about Y, but not for code. Never for code.
But, if what you're doing right now works for you, do continue as-is if you so wish, I have no stake in if people use LLMs or not, just hope people make choices based on good information :)
I use AI to estimate calories / macros multiple times per week. I always ask both ChatGPT and Gemini, and then I use my brain to decide what I actually want to log in my calorie tracking app.
About 80% of the time, ChatGPT and Gemini give estimates that are very close to one another.
And this... really was (and will be) the only way for this to ever work.
In the real world, the acceptable failure rates in many cases are a lot lower than what we now accept. One in a thousand could be too high if you run the process, say, a thousand times. So in reality a good-enough error rate should be one in a million or rarer...
https://kg.enzom.dev/
You specify your foods in grams with plaintext (no pictures).
I never liked the "take a picture to measure calories" approach, as you could have 10 table spoons of olive oil which would drastically change the calories but would not show in a picture.
The reported variance in Sonnet 4.6's estimates here is actually quite low, and in general terms, not so bad across models. Damn paella.
This does seem like a task well suited to a for-purpose training run against a bunch of labelled data. Is there any reason they wouldn't improve at it?
Maybe they should ask: what are the worst case and best case numbers for this lunch?
This idea is seriously being implemented in a production app? And people are using that app to make health choices? Oh god...
We should not allow companies to lie blatantly to the customers.
Edit: s/blame/lie/
No shit sherlock, but the AI gurus are just telling people that this fucking parrot CAN DO EVERY FUCKING THING.
Why wouldn't an ordinary guy just ask these question to an AI when everybody is telling him that AI is intelligent enough to answer accurately?
There is general interest across a variety of disciplines in kicking the tires of LLMs with respect to their competence in DOMAIN_X. This is good in general terms, but, especially with larger studies, they tend to be out of date by the time of publication, and badly out of date by the time they hit the media circuit. Out of date here means testing against models one or two or more generations behind SOTA.
The DOMAIN_X experts do have a lot to offer in terms of defining success criteria across domain tasks, but the studies (snapshots in time) could be much more impactful if they were instead packaged as benchmarks (that could track model progress over time, and even steer it).
AI community / industry could probably do some outreach work to streamline or standardize methods for general researchers to produce reusable benchmarks.
Shit like this is why you shouldn't involve AI output in your writing process. It's especially ironic in an article about LLMs being unreliable... though it's harmless enough here, since the pre-print seems just fine, at least to my eyes.
I mean these models are inherently probabilistic.
If you run enough samples you'll get results matching the learned probability distribution, the more you sample the higher the chances that you'll land on an unlikely response.
1. If I feed the exact same image in, it does not deterministically give me the exact same result every time.
2. Or is this about calories? Because even if a package label says "200 Calories", if you were to measure every package, each one would be different: 198, 199, 200, 201, 202. Plus/minus a pretty big range.
>>> answered own question. " It’s the same photo, the same model, the same question. But you won’t get the same answer"
Also, if LLMs worked as they are often advertised, they should have easily been able to answer "there isn't enough information in this picture to give you an accurate estimate. Try taking a picture of the label, or at least of the inside of the sandwich, or list the ingredients used".
The GP, and much of the thread, is basically acting as if it should be obvious to anyone who is not an idiot that this is not possible to do precisely; and that the researcher just made up some use case that LLMs can't do and wrote a paper about it to disparage AI.
Why assume trick ingredients?