{"id":31144126,"url":"https://github.com/idalin6127/Module3-Mini-Pretraining-Data-Local-Voice-Assistant-OCR-Web-ASR-LLM-TTS","last_synced_at":"2025-12-30T21:22:08.546Z","repository":{"id":309630506,"uuid":"1036995007","full_name":"idalin6127/Ida_Lin_Module3","owner":"idalin6127","description":"Week 3 project combining a mini pretraining data pipeline (web scraping, OCR, cleaning, deduplication) and a local real-time voice assistant (ASR, LLM, TTS).","archived":false,"fork":false,"pushed_at":"2025-09-14T22:10:49.000Z","size":1995,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-15T00:18:22.913Z","etag":null,"topics":["asr","cozyvoice","data-cleaning","deduplication","fastapi","llama3","nlp","ocr","python3","surya","tesseract","tts","voice-agent","web-scraping","whisper"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/idalin6127.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-12T22:48:39.000Z","updated_at":"2025-09-14T22:10:52.000Z","dependencies_parsed_at":"2025-08-13T01:10:52.254Z","dependency_job_id":"e6f7faba-4171-4889-b158-5df52f8f5dc2","html_url":"https://github.com/idalin6127/Ida_Lin_Module3","commit_stats":null,"previous_names":["idalin6127/ida_lin_module3"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/idalin6127/Ida_Lin_Module3","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idalin6127%2FIda_Lin_Mod
ule3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idalin6127%2FIda_Lin_Module3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idalin6127%2FIda_Lin_Module3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idalin6127%2FIda_Lin_Module3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/idalin6127","download_url":"https://codeload.github.com/idalin6127/Ida_Lin_Module3/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idalin6127%2FIda_Lin_Module3/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275781327,"owners_count":25527352,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-18T02:00:09.552Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","cozyvoice","data-cleaning","deduplication","fastapi","llama3","nlp","ocr","python3","surya","tesseract","tts","voice-agent","web-scraping","whisper"],"created_at":"2025-09-18T14:15:39.946Z","updated_at":"2025-12-30T21:22:08.541Z","avatar_url":"https://github.com/idalin6127.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"logo/logo.png\" alt=\"Project Logo\" width=\"300\"/\u003e\n\u003c/p\u003e\n# Module 3 Project: 
Pretraining Data Pipeline \u0026 Voice Agent Development\n\n## 🚀 Quick Summary\nBuilt a **two-part project**:  \n1. A **pretraining data pipeline** that scrapes scientific papers, extracts text via OCR, and cleans/deduplicates data.  \n2. A **real-time voice agent** supporting 5-turn multi-round conversations using ASR (Whisper), LLM (LLaMA 3), and TTS (CozyVoice).  \nDeliverables include a **clean dataset** for LLM training and a **local FastAPI server** for interactive voice dialogue.  \nDemonstrates skills in **data engineering, NLP preprocessing, multimodal pipelines, and conversational AI development**.  \n\n---\n\n## 📖 Project Description\nThis project was designed to simulate **real-world AI workflows** in two areas:  \n\n1. **Pretraining Data Pipeline** – Building a scalable, high-quality dataset for LLM pretraining, emphasizing **data quality, deduplication, and multi-source diversity**.  \n2. **Voice Agent Development** – Creating a lightweight local voice assistant capable of **real-time dialogue**, integrating speech recognition, language modeling, and speech synthesis.  \n\nThe project highlights the importance of **data quality for model performance** and showcases the integration of multiple AI components into a single interactive system.  \n\n---\n\n## 🎯 Objectives\n\n### Pretraining Data Pipeline\n- Scrape scientific papers from arXiv on selected topics (e.g., NLP, AI safety).  \n- Extract text from PDFs using OCR tools (Tesseract, Surya, GPT-4o Vision API).  \n- Clean and filter data:  \n  - Deduplicate with MinHash  \n  - Remove PII (emails, phone numbers, credit cards)  \n  - Filter non-English and low-quality text  \n- Produce a **clean, diverse dataset** simulating state-of-the-art LLM training data.  \n\n### Voice Agent Development\n- Build a FastAPI server for audio input/output.  \n- Use **Whisper** for Automatic Speech Recognition (ASR).  \n- Integrate **LLaMA 3** for dialogue generation with conversation state tracking.  
\n- Synthesize speech with **CozyVoice** for natural TTS output.  \n- Support **5-turn multi-round conversations** with history preservation.  \n\n---\n\n## 🛠️ Tech Stack\n- **Programming Language**: Python  \n- **Web/Data**: requests, BeautifulSoup, scrapy, pandas, regex, langdetect  \n- **OCR**: Tesseract, pytesseract, Surya  \n- **Deduplication**: datasketch (MinHash)  \n- **ASR**: Whisper  \n- **Dialogue Generation**: LLaMA 3  \n- **TTS**: CozyVoice  \n- **Server Framework**: FastAPI, Uvicorn  \n- **Testing Tools**: curl, Postman  \n\n---\n\n## 🔥 Architecture / Workflow Diagram \nflowchart LR\n  subgraph Data Pipeline\n    A[Scrape PDFs] --\u003e B[OCR (Tesseract/Surya)]\n    B --\u003e C[Cleaning (langdetect/regex)]\n    C --\u003e D[MinHash Dedup]\n  end\n  subgraph Voice Agent\n    E[Audio Upload] --\u003e F[ASR(Whisper)]\n    F --\u003e G[LLM(LLaMA-3)+State]\n    G --\u003e H[TTS(CozyVoice)]\n  end\n\n---\n\n## 📂 Deliverables\n- `clean_dataset/` → pretraining-ready text corpus (deduplicated, PII-free).  \n- `scraper/` → arXiv scraping and cleaning scripts.  \n- `ocr_pipeline/` → PDF-to-text OCR processing scripts.  \n- `voice_agent/` → FastAPI-based real-time voice assistant code.  \n- Example outputs:  \n  - `stats.md` → dataset statistics (token counts, % removed).  \n  - Conversation transcripts (JSON).  \n\n---\n\n\n## 🔥 How to Run / Quick Start \n# Data pipeline\npip install -r requirements.txt\npython build_corpus.py --topic \"AI safety\" --out dataset/\n\n# Voice agent\nuvicorn voice_agent.api:app --reload --port 8001\n# Test\ncurl -X POST -F \"file=@sample.wav\" http://localhost:8001/talk\n\n---\n\n## 🌟 Highlights\n- **End-to-end pretraining pipeline** for scientific text.  \n- **Multi-modal integration**: web, PDFs, audio → unified text corpus.  \n- **Privacy-aware cleaning** with PII removal and deduplication.  \n- **Modular voice agent**: supports async processing, scalable to UI or custom voices.  
\n- Combines **research-oriented data engineering** with **applied conversational AI**.  \n\n---\n\n## 🚀 Skills Demonstrated\n- **Data Engineering \u0026 NLP Preprocessing** – scraping, OCR, deduplication, and cleaning.  \n- **Pipeline Design** – building modular, end-to-end workflows.  \n- **Conversational AI Development** – ASR + LLM + TTS integration in real time.  \n- **System Deployment** – FastAPI server design, API testing with curl/Postman.  \n- **Research-to-Production Thinking** – simulating SOTA LLM pretraining workflows.  \n\n---\n\n## 🚀 Future Improvements\nVAD/endpointing; speaker profiles; RAG grounding for factuality; latency tuning.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidalin6127%2FModule3-Mini-Pretraining-Data-Local-Voice-Assistant-OCR-Web-ASR-LLM-TTS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fidalin6127%2FModule3-Mini-Pretraining-Data-Local-Voice-Assistant-OCR-Web-ASR-LLM-TTS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidalin6127%2FModule3-Mini-Pretraining-Data-Local-Voice-Assistant-OCR-Web-ASR-LLM-TTS/lists"}