{"id":17191652,"url":"https://github.com/thammegowda/junkdetect","last_synced_at":"2025-03-25T06:20:40.357Z","repository":{"id":62573316,"uuid":"255238204","full_name":"thammegowda/junkdetect","owner":"thammegowda","description":"Junk-not-junk, a detector that supports 100 natural languages. ","archived":false,"fork":false,"pushed_at":"2020-06-18T23:52:33.000Z","size":29,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-04T01:37:17.946Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.org/project/junkdetect/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thammegowda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-13T05:20:46.000Z","updated_at":"2020-06-20T19:38:21.000Z","dependencies_parsed_at":"2022-11-03T18:33:41.549Z","dependency_job_id":null,"html_url":"https://github.com/thammegowda/junkdetect","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thammegowda%2Fjunkdetect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thammegowda%2Fjunkdetect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thammegowda%2Fjunkdetect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thammegowda%2Fjunkdetect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thammegowda","download_url":"https://codeload.github.com/thammegowda/junkdetect/tar.gz/refs/heads/mast
er","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245408647,"owners_count":20610379,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T01:26:58.194Z","updated_at":"2025-03-25T06:20:40.324Z","avatar_url":"https://github.com/thammegowda.png","language":"Python","readme":"# Junk, Not-Junk Detector\n\nThis tool is built to do just one simple task: detect junk and not-junk texts across a variety of languages.\nJust like that famous [hotdog not-hotdog](https://www.youtube.com/watch?v=pqTntG1RXSY), but applied to natural language text.\nIt can be very useful for testing tools that extract, decompress, and/or decrypt natural language texts.\n\n\n# Setup\nUses [fairseq](https://github.com/pytorch/fairseq)\n\n```bash\n# Optionally, create a brand new conda environment for this\n#conda create -n junkdetect python=3.7\n#conda activate junkdetect\n\n# Install: use only one of these methods\n# 1. from PyPI; recommended\npip install junkdetect\n\n# 2. latest master branch\npip install git+https://github.com/thammegowda/junkdetect\n\n# 3. for development\ngit clone https://github.com/thammegowda/junkdetect \\\n     \u0026\u0026 cd junkdetect \\\n     \u0026\u0026 pip install --editable .\n```\n## How to use\nOnce installed via pip, `junkdetect` or `python -m junkdetect` can be invoked from the command line:\n```bash\nprintf \"This is a good sentence. 
\\nT6785*\u0026^T is 747658 you T\u0026*^\\n\" | junkdetect\n0.999824\tThis is a good sentence.\n0.0747487\tT6785*\u0026^T is 747658 you T\u0026*^\n```\nThe output is one line per input, with two columns separated by `\\t`. \nThe first column has `perplexity`: a lower value (i.e., close to 0.0) means junk and a higher value (close to 1.0) means not-junk. If you don't want the input sentences back in the output, cut them out -- just use `junkdetect | cut -f1 \u003e scores.txt`\n\n# How does this work\n**[junkdetect](https://github.com/thammegowda/junkdetect)** looks like only a few lines of Python code, but under the hood, it hides a great deal of complexity.  \nIt uses perplexity from neural (masked/auto-regressive) language models that were trained on terabytes of web text from hundreds of languages.   \nSpecifically, it uses Facebook Research's [XLM-R](https://github.com/facebookresearch/XLM/) retrieved from [torch.hub](https://pytorch.org/hub/).\nQuoting the original developers of XLM-R and [their paper (see Table 6)](https://arxiv.org/pdf/1911.02116.pdf):\n\u003e XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu 
Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.\n\n\n## Back Story and Acknowledgements:\n- This idea came out of a discussion with [Tim Allison](https://twitter.com/_tallison).\nHe said it was hard to tell whether text had been correctly extracted from files like PDFs using Apache Tika.\nThanks to him for making me think of something like this.\n- I had read Facebook's very nice [XLM-R paper by Conneau et al.](https://arxiv.org/abs/1911.02116) and it was at the top of my mind. \nAlthough the XLM folks [didn't help me get perplexity, and I had to dig it out of their code myself](https://github.com/facebookresearch/XLM/issues/272), \n I would still like to thank them for making such useful pretrained models easy to use via `torch.hub`.\n\n## Developers:\n- [Thamme Gowda](https://twitter.com/thammegowda)  (wrote version 0.1)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthammegowda%2Fjunkdetect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthammegowda%2Fjunkdetect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthammegowda%2Fjunkdetect/lists"}