{"id":17000030,"url":"https://github.com/chambliss/foodbert","last_synced_at":"2026-03-07T09:33:14.642Z","repository":{"id":99078650,"uuid":"303014132","full_name":"chambliss/foodbert","owner":"chambliss","description":"FoodBERT: Food Extraction with DistilBERT","archived":false,"fork":false,"pushed_at":"2020-10-31T20:58:38.000Z","size":150,"stargazers_count":53,"open_issues_count":1,"forks_count":8,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-04T08:22:10.984Z","etag":null,"topics":["bert","bert-model","distilbert","food","food-extraction","information-extraction","natural-language-processing","nlp","nlp-machine-learning","python","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chambliss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-11T00:35:19.000Z","updated_at":"2025-03-26T13:18:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"f893fc6b-211e-49cc-bc32-97d1d0179d02","html_url":"https://github.com/chambliss/foodbert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/chambliss/foodbert","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chambliss%2Ffoodbert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chambliss%2Ffoodbert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chambliss%2Ffoodbert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/chambliss%2Ffoodbert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chambliss","download_url":"https://codeload.github.com/chambliss/foodbert/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chambliss%2Ffoodbert/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30210852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T09:02:10.694Z","status":"ssl_error","status_checked_at":"2026-03-07T09:02:08.429Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-model","distilbert","food","food-extraction","information-extraction","natural-language-processing","nlp","nlp-machine-learning","python","transformers"],"created_at":"2024-10-14T04:10:49.912Z","updated_at":"2026-03-07T09:33:14.618Z","avatar_url":"https://github.com/chambliss.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FoodBERT: Food Extraction with DistilBERT\n\n## The first-ever deep learning model for automatic food detection and extraction!*\n\n\\* (to my knowledge, as of Oct 2020)\n\n## Quickstart\n\n### Setup\n\n1. Clone the repo\n\n    ```bash\n    git clone git@github.com:chambliss/foodbert.git\n    ```\n\n2. 
Set up and activate the environment\n\n    ```bash\n    cd foodbert\n    conda env create -f environment.yml\n    conda activate hf-nlp\n    ```\n\n3. Pip install the modules\n   \n    ```bash\n    pip install -e .\n    ```\n\n### Load the trained model from the `transformers` model zoo\n\nLoading the trained model from HuggingFace can be done in a single line:\n```python\nfrom food_extractor.food_model import FoodModel\nmodel = FoodModel(\"chambliss/distilbert-for-food-extraction\")\n```\n\nThis downloads the model from HF's S3 bucket and means you will always be using the best-performing/most up-to-date version of the model.\n\nYou can also load a model from a local directory using the same syntax.\n\n\n\n### Extract foods from some text\n\nThe model is especially good at extracting ingredients from lists of recipe ingredients, since there are many training examples of this format:\n\n```python\n\u003e\u003e\u003e examples = \"\"\"3 tablespoons (21 grams) blanched almond flour\n... ¾ teaspoon pumpkin spice blend\n... ⅛ teaspoon baking soda\n... ⅛ teaspoon Diamond Crystal kosher salt\n... 1½ tablespoons maple syrup or 1 tablespoon honey\n... 1 tablespoon (15 grams) canned pumpkin puree\n... 1 teaspoon avocado oil or melted coconut oil\n... ⅛ teaspoon vanilla extract\n... 1 large egg\"\"\".split(\"\\n\")\n\n\u003e\u003e\u003e model.extract_foods(examples[0])\n[{'Product': [], 'Ingredient': [{'text': 'almond flour', 'span': [34, 46], 'conf': 0.9803279439608256}]}]\n\n\u003e\u003e\u003e model.extract_foods(examples)\n[{'Product': [], 'Ingredient': [{'text': 'almond flour', 'span': [34, 46], 'conf': 0.9803279439608256}]}, \n{'Product': [], 'Ingredient': [{'text': 'pumpkin spice blend', 'span': [11, 30], 'conf': 0.8877270460128784}]}, \n{'Product': [], 'Ingredient': [{'text': 'baking soda', 'span': [11, 22], 'conf': 0.89898481965065}]}, \n{'Product': [{'text': 'Diamond Crystal kosher salt', 'span': [11, 38], 'conf': 0.7700592577457428}], 'Ingredient': []}, \n... 
(further results omitted for brevity)\n]\n```\n\nIt also works well on standard prose:\n```python\n\u003e\u003e\u003e text = \"\"\"Swiss flavor company Firmenich used artificial intelligence (AI) in partnership with Microsoft to optimize flavor combinations and create a lightly grilled beef taste for plant-based meat alternatives, according to a release.\"\"\"\n\n\u003e\u003e\u003e model.extract_foods(text)\n[{'Product': [], \n'Ingredient': [{'text': 'beef', 'span': [156, 160], 'conf': 0.9615312218666077}, \n{'text': 'plant', 'span': [171, 176], 'conf': 0.8789700269699097}, \n{'text': 'meat', 'span': [183, 187], 'conf': 0.9639666080474854}]}]\n```\n\nTo get raw predictions, you can also use `model.predict` directly. But note that `extract_foods` has a couple of heuristics added to remove low-quality predictions, so `model.predict` is likely to give slightly worse performance.\n\nThat said, it is useful for examining the raw labels/probabilities/etc. from the forward pass.\n\n```python\n# Using the same text as the previous example\n\u003e\u003e\u003e predictions = model.predict(text)[0]\n\n# All data available from the example\n\u003e\u003e\u003e predictions.keys()\ndict_keys(['tokens', 'labels', 'offsets', 'probabilities', 'avg_probability', 'lowest_probability', 'entities'])\n\n\u003e\u003e\u003e for t, p in zip(predictions['tokens'], predictions['probabilities']):\n...     
print(t, round(p, 3))\nSwiss 0.991\nflavor 0.944\ncompany 0.998\nFi 0.952\n...\n\n# Get the token the model was least confident in predicting\n\u003e\u003e\u003e least_confident = predictions['probabilities'].index(predictions['lowest_probability'])\n\u003e\u003e\u003e predictions['tokens'][least_confident]\n'plant'\n\n# Get the dict of ingredients and products\n\u003e\u003e\u003e predictions['entities']\n{'Product': [],\n 'Ingredient': [{'text': 'beef',\n   'span': [156, 160],\n   'conf': 0.9615312218666077},\n   ...\n\n```\n\n\n### Larger-scale prediction\n\nTo predict on many examples, you can use `food_model.do_preds`. I usually use this for generating model predictions to correct in [LabelStudio](https://labelstud.io/), the tagging platform used for this project. Calling it looks like this:\n```python\nfrom food_extractor.food_model import do_preds\n\ndo_preds(\"chambliss/distilbert-for-food-extraction\", # model path\n        texts, # your examples - a list of strings\n        \"./whatever.json\", # output file\n        format=\"json\") # format - JSON, LabelStudio, and BIO are supported\n```\n\nNote that this will run each example through the model individually rather than batching. This results in better performance (the model is more confident in its predictions on shorter sequences, so by not having to pad the examples to be the same length as the longest example in the batch, the accuracy is increased slightly). \n\nSince we're using DistilBERT, prediction is still very fast (especially on GPU), but if you prefer to batch the examples, it should be relatively easy to amend the code in `do_preds` to do so.\n\nAlso note that these are **raw predictions** from the model, not the quality-filtered predictions you will get from `extract_foods`. 
\n\n### Model Stats\n\n\n**Model Type**: DistilBERT, base-cased\n\n**Model Size**: 260.8MB\n\n**Inference Time**: 0.03s for batch size 1 on CPU, 0.06s for batch size 9 on CPU\n\n**Performance**\n\nThe model performs best on the Ingredient tag, reaching over 90% relaxed precision and over 75% relaxed recall. Products were not common in the training data, and thus have significantly worse performance. \n\nIf you have a production use case in mind for this, the model should perform well enough (with some data cleaning) to systematically extract ingredients, but I would not recommend using the Product results for production use cases at the moment.\n\n|Label    |p_strict|p_loose|r_strict|r_loose|\n|----------|--------|-------|--------|-------|\n|Ingredient|0.787   |0.912  |0.681   |0.789  |\n|Product   |0.211   |0.649  |0.171   |0.529  |\n\nDescription of metrics:\n* p_strict: Strict, exact-match precision. \n* p_loose: Relaxed precision, where \"partial overlap\" errors are allowed. For this task, it is usually more useful to look at relaxed precision rather than strict.\n* r_strict: Strict, exact-match recall. For products in particular, it is probably most useful to look at strict recall.\n* r_loose: Relaxed recall, where if part of an ingredient/product was retrieved, it was counted.\n\nQuick example to clarify the difference between **strict** and **loose** precision and recall:\n\n* If the model predicted \"blanched almond flour,\" but the actual ingredient label was \"almond flour,\" this would count AGAINST strict precision, but it would be an allowable prediction for loose precision. \n* Similarly, if the actual product was \"Raspberry Red ice cream\" and the model only predicted \"Red ice cream,\" this would be allowable for measuring loose recall, but it would NOT count for strict recall. 
\n\n----\n## Training and Evaluation Data\n\n### Training \n\nThe model was trained on 715 examples, most of them on the shorter side (many were extracted ingredient entries from web-scraped recipes). The data is BIO-formatted (begin-inside-outside), and looks like this:\n```text\nG       B-Ingredient\n##ar    I-Ingredient\n##lic   I-Ingredient\nis      O\nextremely       O\nhealthy O\nand     O\ncan     O\nbe      O\nused    O\nin      O\na       O\nvariety O\nof      O\nrecipes O\n.       O\n```\n\nThe training data is small enough to be included in this repo, but of course you should not store any future training data with git. Ideally, use a data version control system such as DVC. \n\n### Evaluation \n\nThe evaluation data is provided in [LabelStudio](https://labelstud.io/) format, because that is what I used to label it. (I would highly recommend LS for solo labeling projects, by the way.) It has 138 examples and looks like this:\n\n```text\n[\n  {\n    \"completions\": [\n      {\n        \"created_at\": 1600712341,\n        \"honeypot\": true,\n        \"id\": 1209001,\n        \"lead_time\": 10.012,\n        \"result\": [\n          {\n            \"from_name\": \"label\",\n            \"id\": \"AnlIUSC81r\",\n            \"parent_id\": null,\n            \"source\": \"$text\",\n            \"to_name\": \"text\",\n            \"type\": \"labels\",\n            \"value\": {\n              \"end\": 12,\n              \"labels\": [\n                \"Ingredient\"\n              ],\n              \"start\": 9,\n              \"text\": \"oil\"\n            }\n          },\n          ...\n```\n\nIf you want to import it directly into your own [LabelStudio](https://labelstud.io/) project, this is the config I used in my project:\n```xml\n\u003cView\u003e\n  \u003cLabels name=\"label\" toName=\"text\"\u003e\n    \u003cLabel value=\"Ingredient\" background=\"#5CC8FF\"/\u003e\n    \u003cLabel value=\"Product\" background=\"#7D70BA\"/\u003e\n\u003c/Labels\u003e\n\n  
\u003cText name=\"text\" value=\"$text\"/\u003e\n\u003c/View\u003e\n```\n\n### Labeling Rules\n\nLabeling for this task was surprisingly difficult, but there are a few rules that I tried to abide by.\n\n* Ingredients should be stripped down to their basic form. For example, prefer \"almond flour\" over \"blanched almond flour.\" \n* Avoid including modifiers unless it would result in information loss (prefer \"crimini mushrooms\" over just \"mushrooms,\" for example).\n* Labeled products should include both the brand name and the actual food, for example \"CLIF energy bars\" rather than just \"CLIF.\"\n* Labeling parts of words was allowed, for example \"[plant]-based\" or \"[meat]less.\"\n\n----\n\n## Going Further\n\n### Training your own model\n\nYou can easily train a new model or fine-tune this one using the [training script](https://github.com/chambliss/foodbert/blob/master/food_extractor/train.py#L8). You will need to label some data and convert it to BIO format. A utility function for converting LabelStudio data to BIO format is provided in the [data_utils](https://github.com/chambliss/foodbert/blob/master/food_extractor/data_utils.py#L112) module.\n\n### Evaluating a model\n\nI've created a set of evaluation utilities [eval_utils.py](https://github.com/chambliss/foodbert/blob/master/food_extractor/eval_utils.py#L216) that can do a comprehensive evaluation for you. 
From the `eval_utils.evaluate_model` definition:\n```python\ndef evaluate_model(\n    model_path: str, eval_file_path: str, no_product_labels: bool = False\n):\n\n    \"\"\"\n    Standalone function that takes a model path, eval data path, and save \n    directory, and fully evaluates the model's performance on the data.\n    The metrics are then saved in a directory under `data/performance/{model_path}`.\n    Note that an existing directory with the same name will be overwritten.\n\n    Outputs include: \n    - a CSV of the precision/recall/F1 on each label (\"eval_PRF1.csv\"),\n    - raw counts of which mistake types were made (\"eval_raw_stats.csv\"),\n    - the raw counts in percentage format (\"eval_percentages.csv\"),\n    - a log file enumerating each example and the model's mistakes on the example\n    (\"preds.log\")\n    \"\"\"\n```\n\nThe `preds.log` file looks like this, and makes it easier to qualitatively understand what kinds of mistakes your model is making.\n```text\nTEXT: 1.Cut leaf from stem of green bok choy. Cut leaves into length of about 5 cm. Cut stem lengthwise into six equal parts and immerse in water separately. If dirt is found in the root part of stem, scrape out with tip of bamboo skewer (PHOTO A). Remove core from garlic and slice thinly.\nPartial overlap: engulfs_true_label. Predicted [green bok choy], actual entity was [bok choy]\nNot a named entity: [bamboo] (label: Ingredient)\nMissed entity: bok choy\nMissed entity: water\n```\n\n### Developing on the code\n\nIf you want to make changes to the code and ensure that things still work, I've included some basic tests to run that will surface any major errors with training, evaluation, data preprocessing, or generating predictions from the model. 
\n\nThey are not meant to be comprehensive, so there may still be some \"silent failures\" you'll need to debug yourself, but they should be a good first line of defense against potentially breaking changes.\n\nTo run the tests, make sure `pytest` is installed, then run `pytest test` from the root directory.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchambliss%2Ffoodbert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchambliss%2Ffoodbert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchambliss%2Ffoodbert/lists"}