{"id":28463796,"url":"https://github.com/arjunravi26/rag_news_extractor","last_synced_at":"2026-04-09T17:35:18.729Z","repository":{"id":296972815,"uuid":"994878525","full_name":"arjunravi26/rag_news_extractor","owner":"arjunravi26","description":"A rag project that aims to summarize the news and also clarify questions regarding to it.","archived":false,"fork":false,"pushed_at":"2025-06-03T06:27:59.000Z","size":143,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-27T14:47:55.858Z","etag":null,"topics":["beautifulsoup","faiss","googlenews","groq","langchain","llama","python","requests","sentence-transformers","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arjunravi26.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS_data/description_1.txt","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-02T16:10:01.000Z","updated_at":"2025-06-03T06:29:38.000Z","dependencies_parsed_at":"2025-06-03T18:08:21.602Z","dependency_job_id":"a8316efd-91de-4315-8130-b3c93a77337c","html_url":"https://github.com/arjunravi26/rag_news_extractor","commit_stats":null,"previous_names":["arjunravi26/rag_news_extractor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/arjunravi26/rag_news_extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arjunravi26%2Frag_news_extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arjunravi26%2Frag_news_extra
ctor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arjunravi26%2Frag_news_extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arjunravi26%2Frag_news_extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arjunravi26","download_url":"https://codeload.github.com/arjunravi26/rag_news_extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arjunravi26%2Frag_news_extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279005033,"owners_count":26083827,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","faiss","googlenews","groq","langchain","llama","python","requests","sentence-transformers","streamlit"],"created_at":"2025-06-07T05:00:42.545Z","updated_at":"2025-10-10T19:40:01.176Z","avatar_url":"https://github.com/arjunravi26.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG News Extractor\n\nA Retrieval-Augmented Generation (RAG) system that extracts, ingests, and serves relevant news content from Google News. 
Leveraging web scraping, text chunking, embedding, and a vector database, this project allows users to ask about any news topic, receive a summarized overview, and follow up with streaming answers.\n\n## Table of Contents\n\n- [Features](#features)\n- [Architecture \u0026 Workflow](#architecture--workflow)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Dependencies](#dependencies)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Features\n\n- **Automated News Extraction**\n  • Query Google News for a given topic\n  • Scrape individual news articles using Beautiful Soup\n\n- **Document Processing \u0026 Chunking**\n  • Clean and normalize raw HTML content\n  • Chunk documents using LangChain’s `RecursiveCharacterTextSplitter`\n\n- **Dense Embedding \u0026 Vector Storage**\n  • Embed each text chunk with Sentence-Transformers\n  • Store embeddings in a FAISS vector database for fast retrieval\n\n- **Retrieval \u0026 Summarization**\n  • On user query, retrieve the top-k relevant chunks from FAISS\n  • Summarize retrieved chunks via Groq LLaMA model (streaming output)\n\n- **Follow-Up Questions**\n  • Maintain context to answer follow-up queries about the same news topic\n  • Stream responses as the model generates them\n\n- **Web Interface (Streamlit)**\n  • Simple, interactive UI to enter topics and view summaries\n  • Real-time, streaming answer display\n\n## Architecture \u0026 Workflow\n\n1. **User Input**\n   - User enters a news topic in the Streamlit UI.\n\n2. **News Retrieval**\n   1. Query Google News for top headlines/links matching the topic.\n   2. For each news link, scrape the article text using Beautiful Soup.\n   3. Store raw text locally.\n\n3. **Text Chunking**\n   - Use LangChain’s `RecursiveCharacterTextSplitter` to split each article into manageable chunks.\n\n4. **Embedding \u0026 Indexing**\n   1. 
For each text chunk, compute a dense embedding via a Sentence-Transformer model (e.g., `all-MiniLM-L6-v2`).\n   2. Insert embeddings into a FAISS index.\n\n5. **Querying \u0026 Summarization**\n   1. When the user asks “Summarize the latest news on [topic]”:\n      - Embed the user question.\n      - Retrieve the top-k closest chunks from FAISS.\n      - Concatenate/re-rank retrieved chunks as needed.\n      - Stream the summarization from Groq LLaMA (prompted via LangChain).\n   2. For follow-up questions:\n      - Context window includes system prompt + retrieved chunks.\n      - Generate a streaming reply via the same Groq LLaMA pipeline.\n\n6. **Streamlit Frontend**\n   - Displays:\n     • A text input for “Enter a news topic”\n     • A “Submit” button to trigger the RAG pipeline\n     • A live, streaming text area to show the LLaMA-generated summary/answer\n\n## Installation\n\n1. **Clone the repository**\n   ```bash\n   git clone https://github.com/arjunravi26/rag_news_extractor.git\n   cd rag_news_extractor\n   ```\n\n2. **Create a Python virtual environment**\n\n   ```bash\n   python3 -m venv .venv\n   source .venv/bin/activate       # On Windows: .venv\\Scripts\\activate\n   ```\n\n3. **Install required packages**\n\n   ```bash\n   pip install --upgrade pip\n   pip install -r requirements.txt\n   ```\n\n4. **Set up environment variables** (if applicable)\n\n   * If your project requires API keys (e.g., custom Google News API, LLaMA credentials), create a `.env` file in the root directory:\n\n     ```dotenv\n     GROQ_API_KEY=your_groq_api_key_here\n     ```\n   * Ensure that `.env` is listed in `.gitignore` to avoid committing secrets.\n\n5. **Initialize FAISS Index**\n   \n   * If starting fresh, the first run of the Streamlit app will build the FAISS database from scratch.\n\n## Usage\n\n1. 
**Run the Streamlit App**\n\n   ```bash\n   streamlit run app.py\n   ```\n\n   * A browser window/tab will open at `http://localhost:8501/`.\n   * Enter a news topic (e.g., “Artificial Intelligence”) and click **Submit**.\n\n2. **Interact with the System**\n\n   * The app will display a streaming summary of the latest news related to your topic.\n   * After the initial summary, you can enter follow-up questions in a text box (e.g., “What are the main challenges?”).\n   * The answer will stream in real time, leveraging the existing context window + retrieved chunks.\n\n\n## Dependencies\n\n* **Core Libraries**\n\n  * [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/) — Web scraping\n  * [LangChain](https://github.com/langchain-ai/langchain) — Text splitting \u0026 prompt templates\n  * [Sentence Transformers](https://www.sbert.net/) — Dense text embeddings\n  * [FAISS](https://github.com/facebookresearch/faiss) — Vector database\n  * [Groq LLaMA](https://github.com/groq/groq-llama) — LLM for summarization \u0026 Q\\\u0026A\n  * [Streamlit](https://streamlit.io/) — Web interface\n\n\n* **Python Version**\n\n  * Tested with Python 3.9+. Higher versions may work but ensure compatibility with FAISS and Sentence-Transformers.\n\nInstall everything via:\n\n```bash\npip install -r requirements.txt\n```\n\n## Contributing\n\nContributions are welcome! If you’d like to:\n\n1. **Report a Bug**\n   • Open an issue and provide a clear description of the problem and steps to reproduce.\n\n2. **Request a Feature**\n   • Open an issue labeled “enhancement” explaining the feature goal and use case.\n\n3. **Submit a Pull Request**\n\n   1. Fork the repository.\n   2. Create a new branch:\n\n      ```bash\n      git checkout -b feature/your-feature-name\n      ```\n   3. Make sure to update tests (if any) and add documentation if your change affects the user interface or CLI.\n   4. Run a quick formatting check (e.g., `flake8` or `black`).\n   5. 
Submit a pull request against the `main` branch.\n\nThank you for helping improve this project!\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\n\n---\n\n*Built by [arjunravi26](https://github.com/arjunravi26)*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farjunravi26%2Frag_news_extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farjunravi26%2Frag_news_extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farjunravi26%2Frag_news_extractor/lists"}