{"id":13787129,"url":"https://github.com/notAI-tech/Anuvaad","last_synced_at":"2025-05-12T00:30:43.925Z","repository":{"id":57410727,"uuid":"314321061","full_name":"notAI-tech/Anuvaad","owner":"notAI-tech","description":"State of the art open-source translation for Indic languages.","archived":false,"fork":false,"pushed_at":"2021-04-11T06:16:31.000Z","size":65,"stargazers_count":5,"open_issues_count":3,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-28T20:46:18.991Z","etag":null,"topics":["hindi","india","indic-languages","kannada","malayalam","marathi","mt5","multilingual","nlp","tamil","telugu","transformer","transformers","translation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/notAI-tech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-11-19T17:26:46.000Z","updated_at":"2024-12-09T10:15:13.000Z","dependencies_parsed_at":"2022-08-28T01:20:10.654Z","dependency_job_id":null,"html_url":"https://github.com/notAI-tech/Anuvaad","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notAI-tech%2FAnuvaad","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notAI-tech%2FAnuvaad/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notAI-tech%2FAnuvaad/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notAI-tech%2FAnuvaad/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/notAI-tech","download_u
rl":"https://codeload.github.com/notAI-tech/Anuvaad/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253655761,"owners_count":21943068,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hindi","india","indic-languages","kannada","malayalam","marathi","mt5","multilingual","nlp","tamil","telugu","transformer","transformers","translation"],"created_at":"2024-08-03T20:00:29.319Z","updated_at":"2025-05-12T00:30:43.505Z","avatar_url":"https://github.com/notAI-tech.png","language":"Python","funding_links":[],"categories":["**Tools, Libraries, Models**"],"sub_categories":["Translation"],"readme":"# Anuvaad\nState of the art open-source translation models for Indic languages.\n\n\n# Installation\n\n```bash\n# CPU pytorch will be installed if torch is not installed\npip install --upgrade anuvaad\n```\n\n# Usage\n\n**As a Python module**\n\n```python\nfrom anuvaad import Anuvaad\nanu = Anuvaad(\"english-telugu\")\n\n# Single sentence translation\n# beam_size is optional and defaults to 4\nanu.anuvaad(\"YS Jagan is the chief minister of Andhra Pradesh.\")\n# \"వైఎస్ జగన్ ఆంధ్రప్రదేశ్ ముఖ్యమంత్రి.\"\n\n# Batch translation\nanu.anuvaad([\"YS Jagan is the chief minister of Andhra Pradesh.\",\n            \"Nara Lokesh suffered a humiliating defeat in Mangalagiri.\"])\n# ['వైఎస్ జగన్ ఆంధ్రప్రదేశ్ ముఖ్యమంత్రి.', 'మంగళగిరిలో నారా లోకేష్కు అవమానకరమైన ఓటమి ఎదురైంది.']\n\n```\n\n**As a service**\n```bash\n# Starting the api service\ndocker run -it -e BATCH_SIZE=1 -p 8080:8080 notaitech/anuvaad:english-telugu\n\n# Running a 
prediction\ncurl -d '{\"data\": [\"YS Jagan is the chief minister of Andhra Pradesh.\"]}' -H \"Content-Type: application/json\" -X POST http://localhost:8080/sync\n```\n\n\n|Available Models   |   Anuvaad BLEU    |  Google BLEU    |\n|--------|:--------------:|--------|\n|english-telugu |   12.721173743764009   |  6.841437460383768 |\n|english-tamil | 12.737036149214694 | 5.558450942590664 |\n|english-malayalam |   17.785746646721996    | 19.569069412553812  |\n|english-kannada |   7.888886041933815    | 3.2803251953567893  |\n|english-marathi |    23.02755955392518   |  12.888112016722792 |\n|english-hindi |   29.175892213216954    |  18.130893478614375 |\n|english-bengali |       |   |\n|english-punjabi |       |   |\n|english-gujarati |       |   |\n\n\n - Google BLEU is calculated from translations generated by the [GOOGLETRANSLATE() function](https://support.google.com/docs/answer/3093331?hl=en) in Google Sheets.\n\n - The testing scripts and the Tatoeba test data are available at https://github.com/notAI-tech/Anuvaad-testing-scripts\n\n - https://docs.google.com/spreadsheets/d/1tYYZObELj-k6mJCnM6uf7xg3JChbSkjs8YvZsOgHacQ/edit?usp=sharing is the sheet containing the predictions from Anuvaad and the GOOGLETRANSLATE function on the Tatoeba data, from which the above scores are calculated.\n\n# My thoughts on the evaluation/accuracy of the model(s):\n\n1. Unlike classification or sequence-labelling tasks, it is very hard to quantify the accuracy of open-domain translation or summarization systems through numbers.\n2. This is because most accuracy measurements actually measure the overlap of character/word n-grams between the expected output and the predicted output.\n3. These scores definitely help when evaluating/comparing multiple models on a particular dataset, but the numbers don't translate well for open-domain models.\n4. For example, Anuvaad translates the sentence ***An advance is placed with the Medical Superintendents of such hospitals who then provide assistance on a case to case basis.*** (taken from the http://data.statmt.org/pmindia/v1/parallel corpus) to ***ऐसे अस्पतालों के चिकित्सा अधीक्षकों के साथ एडवांस रखा जाता है, जिसके बाद मामले के आधार पर सहायता प्रदान की जाती है।*** whereas the expected translation of the sentence from the dataset is ***अग्रिम धन राशि इन अस्पतालों को चिकित्सा निरीक्षकों को दी जाएगी, जो हर मामले को देखते हुए सहायता प्रदान करेंगे।***.\n5. In the above example, although Anuvaad's translation is correct (in the sense that the translation conveys the same thing as the original sentence), the BLEU score with n=3 will be 0.\n6. Similarly, a model trained on the pmindia dataset will score poorly on a different dataset that uses a different style of writing, even if the translation is semantically correct.\n7. Our aim in building Anuvaad is to build a general-purpose, open-domain translation module that can flexibly translate text from various domains.\n8. https://docs.google.com/spreadsheets/d/1_TTtBEvVgemQfGbRBSZYkECMMt5r7L9-dt0FGVUbmOY/edit?usp=sharing is a sheet comparing translations from Anuvaad, ilmulti (https://github.com/jerinphilip/ilmulti) and Google Translate (the =GOOGLETRANSLATE(text, \"en\", \"language\") function in Google Sheets) on 100 randomly selected English sentences from Tatoeba.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FnotAI-tech%2FAnuvaad","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FnotAI-tech%2FAnuvaad","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FnotAI-tech%2FAnuvaad/lists"}