{"id":14964656,"url":"https://github.com/pro-genai/autopuredata","last_synced_at":"2026-03-05T05:32:03.654Z","repository":{"id":254828447,"uuid":"846907558","full_name":"Pro-GenAI/AutoPureData","owner":"Pro-GenAI","description":"Automated Filtering of Undesirable Web Data to Update LLM Knowledge","archived":false,"fork":false,"pushed_at":"2024-10-11T13:38:02.000Z","size":888,"stargazers_count":7,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T06:21:38.124Z","etag":null,"topics":["ai","ai-research","artificial-intelligence","arxiv","automated-data-analysis","continuous-learning","continuous-training","gen-ai","genai","generative-ai","language-model","large-language-models","llms","natural-language-processing","nlp","python","rag","research","research-paper","research-project"],"latest_commit_sha":null,"homepage":"https://www.onlinescientificresearch.com/articles/autopuredata-automated-filtering-of-undesirable-web-data-to-update-llm-knowledge.pdf","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Pro-GenAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-24T09:39:21.000Z","updated_at":"2025-03-05T17:55:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"6a9aad52-0a65-4e4d-9823-fa354b8acaf2","html_url":"https://github.com/Pro-GenAI/AutoPureData","commit_stats":{"total_commits":3,"total_committers":2,"mean_commits":1.5,"dds":"0.33333333333333337","last_synced_commit":"3d0969452667cd239975ea359ef0a42d0de41692"},"previous_names":["pro-genai/autopuredata"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Pro-GenAI/AutoPureData","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pro-GenAI%2FAutoPureData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pro-GenAI%2FAutoPureData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pro-GenAI%2FAutoPureData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pro-GenAI%2FAutoPureData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Pro-GenAI","download_url":"https://codeload.github.com/Pro-GenAI/AutoPureData/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pro-GenAI%2FAutoPureData/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271123414,"owners_count":24703225,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-0
5-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-19T02:00:09.176Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-research","artificial-intelligence","arxiv","automated-data-analysis","continuous-learning","continuous-training","gen-ai","genai","generative-ai","language-model","large-language-models","llms","natural-language-processing","nlp","python","rag","research","research-paper","research-project"],"created_at":"2024-09-24T13:33:35.216Z","updated_at":"2026-03-05T05:32:03.601Z","avatar_url":"https://github.com/Pro-GenAI.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- Copyright (c) 2024 Praneeth Vadlapati --\u003e\n\n# \u003cimg src=\"./files/logo_small.png\" align=\"left\" width=\"200\" alt=\"AutoPureData\" /\u003e Auto*Pure*Data\n\nAutomated Filtering of Undesirable Web Data to Update LLM Knowledge\n\n[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-yellow.svg?style=for-the-badge)](./LICENSE.md)\n[![DOI](https://img.shields.io/badge/DOI-10.47363%2FJMCA%2F2024%283%29E121-darkgreen?style=for-the-badge)](https://doi.org/10.47363/JMCA/2024(3)E121)\n[![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge\u0026logo=python\u0026logoColor=ffdd54)](https://www.python.org/)\n\u003c!-- [![arxiv 2406.19271](https://img.shields.io/badge/arXiv-2406.19271-B31B1B?logo=arxiv\u0026style=for-the-badge)](https://arxiv.org/abs/2406.19271) --\u003e\n\nCreated 
by Praneeth Vadlapati ([@prane-eth](https://github.com/prane-eth))\n\n\u003e [!NOTE]\n\u003e Please star :star: the repository to show your support. \u003cbr\u003e\n\n#### Why AutoPureData?\nLLMs (Generative AI) like ChatGPT do not have up-to-date information.\nThey are not automatically updated with the latest data because the web contains a lot of unsafe or unwanted text.\n\nThis project automatically collects web data and filters out unwanted text using AI and LLMs.\nThe auto-filtered data can be used to automatically update the knowledge of LLMs.\n\n\n#### _What is filtered:_\n- **Unsafe content** :biohazard:: Toxic, threat, insult, discrimination, political, self-harm,\n\treligious, violence, sexual, profanity, flirtation, spam, scam, misleading, and more\n- **Content from unreliable sources** :newspaper:: Unsafe websites and unindexed domains (that are not crawled by search engines)\n- **Personal details** :bust_in_silhouette:: Phone, address, credit card, SSN, IP address, and more\n- **Attacks** :shield:: Adversarial attack attempts (with Data Poisoning)\n\nLanguages supported: Only English for now (more languages will be added when contributors are available)\n\n\n## :page_facing_up: Research Paper\nA published research paper is available at [JMCA/2024(3)E121](https://doi.org/10.47363/JMCA/2024(3)E121) \u003cbr\u003e\n\n\n## :bookmark_tabs: Citation\nTo reference the paper, please cite it as follows:\n```bibtex\n@article{vadlapati2024autopuredata,\n\ttitle={{AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge}},\n\tauthor={{Praneeth Vadlapati}},\n\tjournal={{Journal of Mathematical \\\u0026 Computer Applications}},\n\tvolume={3},\n\tnumber={4},\n\tpages={1--4},\n\tyear={2024},\n\tmonth={July},\n\tdoi={10.47363/JMCA/2024(3)E121},\n\tissn={2754-6705}\n}\n```\n\n\n## :rocket: Quick Start\n```bash\npip install -r requirements.txt\ncp .env.example .env\n```\nNow, edit the `.env` file and add your API keys. 
\u003cbr\u003e\nRun the file [Data_flagging.ipynb](Data_flagging.ipynb)\n\tto collect and filter the latest web data.\nRun the file [Analytics_and_Filtering.ipynb](Analytics_and_Filtering.ipynb)\n\tto manually correct the flagging.\n\nAfter the filtering process, the data can be used with an LLM as mentioned in [Usage_with_LLMs.ipynb](Usage_with_LLMs.ipynb)\n- This file pushes the filtered data to Pinecone DB and uses it with an LLM.\n\n\n## :computer: More Projects\nFor more projects, open the profile: **[@Pro-GenAI](https://github.com/Pro-GenAI)** \u003cbr\u003e\n\n\n## :hammer_and_wrench: Contributing\nContributions are welcome! Feel free to create an issue for any bug reports or suggestions. \u003cbr\u003e\nPlease contribute to the code by adding more filters and making the code more efficient. \u003cbr\u003e\nTo contribute, star :star: the repository and create an Issue. If I can't solve it, I will allow anyone to create a pull request.\u003cbr\u003e\n\n\n## :identification_card: License\nCopyright (c) 2024 Praneeth Vadlapati \u003cbr\u003e\nPlease refer to the [LICENSE](./LICENSE.md) file for more information.\n\n\n## :warning: Disclaimer\nThe code is not intended for use in production environments.\nThis code is for educational and research purposes only.\n\nNo author is responsible for any misuse or damage caused by this code.\nUse it at your own risk. 
The code is provided as is without any guarantees or warranty.\n\n\u003e [!NOTE]\n\u003e The results were not updated using Llama 3.1, as the same accuracy was achieved using Llama 3.\n\n## :globe_with_meridians: Acknowledgements\n- Special thanks to **Groq** (https://groq.com/) for a fast Llama 3 inference engine\n- Dataset: HuggingFace **FineWeb** https://huggingface.co/datasets/HuggingFaceFW/fineweb\n- Unsafe text detections: Meta **Llama Guard 2** https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md\n- Unwanted text detections using LLM: Meta **Llama 3** (70B) https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md\n- Analytics page: Gradio https://gradio.app/\n- Vector DB: Pinecone https://www.pinecone.io/\n\n\n## :email: Contact\nFor personal queries, please find my contact details here: [linktr.ee/prane.eth](https://linktr.ee/prane.eth)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpro-genai%2Fautopuredata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpro-genai%2Fautopuredata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpro-genai%2Fautopuredata/lists"}