311 points | joshdickson | 24 comments

Hi HN!

Today I’m excited to launch OpenNutrition: a free, ODbL-licenced nutrition database of everyday generic, branded, and restaurant foods, a search engine that can browse the web to import new foods, and a companion app that bundles the database and search as a free macro tracking app.

Consistently logging the foods you eat has been shown to support long-term health outcomes (1)(2), but doing so easily depends on having a large, accurate, and up-to-date nutrition database. Free, public databases are often out-of-date, hard to navigate, and missing critical coverage (like branded restaurant foods). User-generated databases can be unreliable or closed-source. Commercial databases come with ongoing, often per-seat licensing costs, and usage restrictions that limit innovation.

As an amateur powerlifter and long-term weight loss maintainer, helping others pursue their health goals is something I care about deeply. After exiting my previous startup last year, I wanted to investigate the possibility of using LLMs to create the database and infrastructure required to make a great food logging app that was cost engineered for free and accessible distribution, as I believe that the availability of these tools is a public good. That led to creating the dataset I’m releasing today; nutritional data is public record, and its organization and dissemination should be, too.

What’s in the database?

- 5,287 common everyday foods, 3,836 prepared and generic restaurant foods, and 4,182 distinct menu items from ~50 popular US restaurant chains; foods have standardized naming, consistent numeric serving sizes, estimated micronutrient profiles, descriptions, and citations/groundings to USDA, AUSNUT, FRIDA, CNF, etc., when possible.

- 313,442 of the most popular US branded grocery products with standardized naming, parsed serving sizes, and additive/allergen data, grounded in branded USDA data; the most popular 1% have estimated micronutrient data, with the goal of full coverage.

Even the largest commercial databases can be frustrating to work with when searching for foods or customizations without existing coverage. To solve this, I created a real-time version of the same approach used to build the core database that can browse the web to learn about new foods or food customizations if needed (e.g., a highly customized Starbucks order). There is a limited demo on the web, and in-app you can log foods with text search, via barcode scan, or by image, all of which can search the web to import foods for you if needed. Foods discovered via these searches are fed back into the database, and I plan to publish updated versions as coverage expands.

- Search & Explore: https://www.opennutrition.app/search

- Methodology/About: https://www.opennutrition.app/about

- Get the iOS App: https://apps.apple.com/us/app/opennutrition-macro-tracker/id...

- Download the dataset: https://www.opennutrition.app/download

OpenNutrition’s iOS app offers free essential logging and a limited number of agentic searches, plus expenditure tracking and ongoing diet recommendations like best-in-class paid apps. A paid tier ($49/year) unlocks additional searches and features (data backup, prioritized micronutrient coverage for logged foods), and helps fund further development and broader library coverage.

I’d love to hear your feedback, questions, and suggestions—whether it’s about the database itself, a really great/bad search result, or the app.

1. Burke et al., 2011, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3268700/

2. Patel et al., 2019, https://mhealth.jmir.org/2019/2/e12209/

1. Cheer2171 ◴[] No.43571183[source]
> Final nutritional data is generated by providing a reasoning model with a large corpus of grounding data. The LLM is tasked with creating complete nutritional values, explicitly explaining the rationale behind each value it generates. Outputs undergo rigorous validation steps, including cross-checking with advanced auditing models such as OpenAI’s o1-pro, which has proven especially proficient at performing high-quality random audits. In practice, o1-pro frequently provided clearer and more substantive insights than manual audits alone.

This is not a dataset. This is an insult to the very idea of data. This is the most anti-scientific post I have ever seen voted to the top of HN. Truth about the world is not derived from three LLMs stacked on top of each other in a trenchcoat.

replies(5): >>43571372 #>>43571620 #>>43572927 #>>43573851 #>>43574297 #
2. rmah ◴[] No.43571452[source]
It doesn't matter how accurate the models are, it's not a "data set" (in the scientific sense), it's more of a conclusion set. Maybe the conclusions are spot on. Maybe not. I have no idea.
replies(3): >>43571574 #>>43571860 #>>43583026 #
3. joshdickson ◴[] No.43571574{3}[source]
I envisioned many lines of inquiry from HN but the idea that a compressed TSV of nutritional data is not a "dataset" (definition: a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer) was unexpected.
replies(6): >>43571753 #>>43571858 #>>43572048 #>>43572719 #>>43573108 #>>43582379 #
4. tmpz22 ◴[] No.43571620[source]
Imagine how much more efficient government would be if we just generate all the data with LLMs.
replies(1): >>43572039 #
5. Cheer2171 ◴[] No.43571753{4}[source]
Your response is such a perfect example of why the "data science" movement is a cancer on actual science. So many graduate from programs and boot camps (or just read blog posts) that teach them all the technical mechanics of working with data, but nothing about actual science.
replies(1): >>43572115 #
6. ratmice ◴[] No.43571858{4}[source]
FWIW, I like that you include water content; libraries like Google's Health Connect seem to have completely separate data structures for nutrition and hydration.
replies(1): >>43571930 #
7. Cheer2171 ◴[] No.43571860{3}[source]
Right. At my most generous, this is a dataset about LLM behavior when asked to infer nutritional value. It is in no way a nutrition dataset. It is perhaps useful as half of a benchmark for accuracy, compared to actual ground truth. Unlike a scientist, you're not motivated or resourced enough to create the ground truth dataset. So you took a shortcut and hid it from the landing page.

This workflow, this motivation, this business model, this marketing is an affront to truth itself.

8. joshdickson ◴[] No.43571930{5}[source]
Thank you :)
9. NewJazz ◴[] No.43572039[source]
Stop. Giving. Them. Ideas.

https://www.reddit.com/r/ABoringDystopia/comments/1jq8kzl/th...

10. TechDebtDevin ◴[] No.43572048{4}[source]
Ignore them. Congratz on finishing your project!
11. thi2 ◴[] No.43572056[source]
Tried it with unsweetened oat milk and the info was off in nearly every column.

Not representative, because I don't have US food, but since it's AI-enhanced I can't compare my stuff with the stuff in the "dataset" and be sure that's a US vs. Germany thing.

replies(1): >>43572246 #
12. TechDebtDevin ◴[] No.43572115{5}[source]
You sound like you're having a bad day. Go take a walk, it's just someone's side project on HN. They aren't trying to destroy science for you, they were simply sharing something they enjoyed building. You don't have to use it or like it, but it has nothing to do with "science". It's not that deep bro.
13. joshdickson ◴[] No.43572246{3}[source]
Would you mind posting/messaging me in some way (links in bio) what you expected it to show?

It looks like for unsweetened oat milk:

https://www.opennutrition.app/search/unsweetened-oat-milk-mt...

...it is leaning into a citation from the Australian Nutrient Database (e.g. Oat beverage, fluid, unfortified. Australian Nutrient Database. Public Food Key F006132. ), which is what I instructed it to do if it thought there was an exact match from a governmental database.

It's possible this is a poor general source for oat milk or that's not the beverage intended for the entry to stand for. I'll check it out, thank you for the report.

replies(1): >>43572337 #
14. thi2 ◴[] No.43572337{4}[source]
I'll check it later to give more constructive feedback. Also, it seems like you are hammering a backend request with each keystroke (?). I can't verify it on mobile, but you might consider debouncing the user input a bit to ease off the load.
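
For readers unfamiliar with the term, debouncing means waiting until input has been quiet for a short interval before sending a request, so only the last keystroke in a burst reaches the backend. A minimal sketch of the idea in Python with asyncio follows; the `Debouncer` class, `fake_search`, and the 50 ms delay are hypothetical names and values chosen for illustration, and nothing here reflects how the OpenNutrition site is actually built.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class Debouncer:
    """Run `action` only after `delay` seconds pass with no newer input."""

    def __init__(self, action: Callable[[str], Awaitable[None]], delay: float) -> None:
        self._action = action
        self._delay = delay
        self._pending: Optional[asyncio.Task] = None

    def submit(self, text: str) -> None:
        # A newer keystroke supersedes whatever is still waiting (or in flight).
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.get_running_loop().create_task(self._fire(text))

    async def _fire(self, text: str) -> None:
        await asyncio.sleep(self._delay)
        await self._action(text)


async def demo() -> List[str]:
    sent: List[str] = []

    async def fake_search(query: str) -> None:
        # Stand-in for a real backend request (hypothetical).
        sent.append(query)

    debouncer = Debouncer(fake_search, delay=0.05)
    # Simulate a user typing quickly: five keystrokes with no pause between them.
    for partial in ["o", "oa", "oat", "oat m", "oat milk"]:
        debouncer.submit(partial)
    await asyncio.sleep(0.2)  # give the last timer time to fire
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch five keystrokes result in a single search call for the final text. A browser front end would do the same thing with a timer that is reset on every input event.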
15. csdvrx ◴[] No.43572719{4}[source]
There are many HN users who are opposed to LLM.

Some of them are fundamentalists, and no amount of reason will reach them (read the comments on the Ghibli-style images to get a sample), others are opposed for very self-interested reasons: "It is difficult to get a man to understand something when his income depends on his not understanding it"

Yesterday, I vibe coded a DNS server in python from scratch in half a day (!) and it works extremely well after spending a few minutes on manually improving a specific edge case for reverse DNS using AAAA records: dig -x requests use the exploded form in the ip6.arpa, while I think it's better for the AAAA entries to keep using the compressed form, and I wanted to generate the reverse algorithmically from AAAA and A records.
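
The reverse-name derivation described above can be sketched with Python's standard `ipaddress` module, whose `reverse_pointer` attribute yields the nibble-by-nibble `ip6.arpa` name (or the `in-addr.arpa` name for IPv4). The helper names and the record layout below are assumptions for illustration, not the commenter's actual code.

```python
import ipaddress
from typing import Dict, List, Tuple


def reverse_name(address: str) -> str:
    """Return the PTR owner name for an IPv4 or IPv6 address string."""
    return ipaddress.ip_address(address).reverse_pointer


def build_ptr_records(forward: List[Tuple[str, str]]) -> Dict[str, str]:
    """Derive PTR records from (hostname, address) pairs taken from A/AAAA data.

    The forward data may keep IPv6 addresses in compressed form; the reverse
    name is always computed from the fully expanded address.
    """
    return {reverse_name(address): hostname for hostname, address in forward}


if __name__ == "__main__":
    records = build_ptr_records([
        ("host.example.", "2001:db8::1"),   # AAAA, compressed form
        ("host.example.", "192.0.2.10"),    # A
    ])
    for owner, target in records.items():
        print(owner, "PTR", target)
```

Because the reverse name is computed from the parsed address rather than from its text, the compressed and exploded spellings of the same AAAA record map to the same PTR owner name.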

Just ignore them, as your approach is sound: I have experience creating, curating and improving datasets with LLMs.

Like vibe coding, it works very well if you know what you are doing: here, you just have to use statistics to leverage the non deterministic aspects of AI in your favor.

Good luck with your app!

replies(1): >>43577531 #
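
The comment above does not say which statistics it has in mind. One possible reading, sketched here purely as an assumption, is to sample the same estimate several times, take the median as the value, and flag the item for manual review when the samples disagree too much. The `aggregate` function, the `Estimate` type, and the 10% threshold are all hypothetical.

```python
import statistics
from typing import List, NamedTuple


class Estimate(NamedTuple):
    value: float        # median of the repeated samples
    spread: float       # median absolute deviation (MAD) around that median
    needs_review: bool  # True when the samples disagree too much


def aggregate(samples: List[float], max_relative_spread: float = 0.10) -> Estimate:
    """Combine repeated estimates of one quantity into a single robust value."""
    if len(samples) < 3:
        raise ValueError("need at least three samples to judge agreement")
    center = statistics.median(samples)
    spread = statistics.median([abs(s - center) for s in samples])
    if center == 0:
        needs_review = spread > 0
    else:
        needs_review = spread > max_relative_spread * abs(center)
    return Estimate(center, spread, needs_review)


if __name__ == "__main__":
    # Placeholder numbers, not real nutrition data: four runs agree, one is an outlier.
    print(aggregate([48.0, 50.0, 51.0, 49.0, 120.0]))
    # Placeholder numbers where the runs disagree badly.
    print(aggregate([10.0, 50.0, 90.0]))
```

The median and MAD are used because a single wild sample barely moves them, whereas a mean would be pulled toward the outlier. Agreement between samples says nothing about whether the agreed value matches a measured ground truth, which is the objection raised elsewhere in the thread.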
16. creativeCak3 ◴[] No.43572927[source]
I agree so much with you. This is not a dataset. This is the vomit of an LLM making stuff up. Like...why couldn't you just collect the data that already exist?? Why do you need an LLM?

Adding an LLM to this just adds an unnecessary layer of complexity, for what benefit? For street cred?

replies(1): >>43573242 #
17. Mordisquitos ◴[] No.43573108{4}[source]
> a compressed TSV of nutritional data

What is the source of that nutritional data?

18. joshdickson ◴[] No.43573242[source]
There's an in-depth review of the reasoning for undertaking this project in general and this approach in particular in the Methodology/About section below, see "Current State of Nutritional Data".

Millions of people use food logging apps to drive behavioral change and help adhere to healthy lifestyles. I believe there is immense societal good in continuing to offer improved tools to accomplish this, especially for free, and that's why I created the project and chose to open source the data.

https://www.opennutrition.app/about#current-state-of-nutriti...

19. ZunarJ5 ◴[] No.43573851[source]
As soon as I saw "AI enhanced for Accuracy" I laughed and wondered if this was a belated April Fools joke.
20. justsid ◴[] No.43574297[source]
I find this actually very upsetting. My wife does calorie counting and all of the apps for it are horrible, especially the market leaders. But those have one thing going for them: Databases of nutritional information, which can be used for easy meal calorie counting. Just enter the ingredients (usually you can scan a barcode) and how much you ate of the total and it tells you where you are standing on caloric and nutritional intake. But even those datasets aren’t always bang on, especially here in Canada where some products share bar codes with US products but they have different nutritional values. Reading the title, I was very excited about the ability to make my wife a better app to support her needs. Unfortunately this is not at all usable for this use case or really any? What’s the point of having data that you just can’t trust at all?
replies(1): >>43574649 #
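
The meal-logging arithmetic described in the comment above (enter the ingredients, then the share of the total you ate) comes down to scaling per-100 g values by weight and then by the fraction eaten. A minimal sketch follows; the `Ingredient` type and all numbers are made-up placeholders, not values from any database.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Ingredient:
    name: str
    grams: float          # amount that went into the whole dish
    kcal_per_100g: float  # energy density as listed in a nutrition database


def meal_kcal(ingredients: List[Ingredient], fraction_eaten: float) -> float:
    """Energy of the portion eaten, given the whole recipe and the share consumed."""
    if not 0.0 <= fraction_eaten <= 1.0:
        raise ValueError("fraction_eaten must be between 0 and 1")
    total = sum(i.grams * i.kcal_per_100g / 100.0 for i in ingredients)
    return total * fraction_eaten


if __name__ == "__main__":
    # Placeholder ingredients with round numbers chosen for easy checking.
    dish = [
        Ingredient("ingredient A", grams=200.0, kcal_per_100g=100.0),
        Ingredient("ingredient B", grams=50.0, kcal_per_100g=400.0),
    ]
    print(meal_kcal(dish, fraction_eaten=0.25))
```

Every term in the sum depends directly on the per-100 g figure from the database, so an error in one entry carries through to the logged total in proportion to that ingredient's weight.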
21. blooalien ◴[] No.43577531{5}[source]
> Like vibe coding, it works very well if you know what you are doing (emphasis mine)

This is true of so very many things involving computers (and tools in general, really) and LLMs are no exception. Just like any tool, "knowing what you are doing" is the really important part, but so many folks are convinced that these "AI" things can do the thinking part for them, and it's just not the case (yet). You gotta know what you're doing and how to properly use the tool to avoid a lotta the "foot-guns" and get the most benefit outta these things.

22. rendaw ◴[] No.43578117{3}[source]
By 100+ comment discussion I assume you mean this HN post as a whole? People here aren't checking the facts, so the fact that only one person found an issue doesn't mean much.
23. jcgl ◴[] No.43582379{4}[source]
The problem is that it’s _not_ simply data. Definition: data is information collected from the world.

This is data from the world that has been altered and augmented with stuff from a model. The informational content has been altered by stuff not from the world. Therefore it’s no longer data, according to the above definition.

That isn’t to say that it can’t be useful, or anything like that. But it’s _not_ information collected from the world. And that’s why people who care about science and a strict definition of data would be offended by calling this a dataset.

24. pmichaud ◴[] No.43583026{3}[source]
I think there is a real conversation to be had about “data” in a post-LLM world, but I actually don’t care about debating definitions here; I care about whether the product works within a reasonable margin of error.