{"id":30701467,"url":"https://github.com/happyrao78/scrap-articles","last_synced_at":"2026-05-17T17:02:51.311Z","repository":{"id":295117875,"uuid":"989173660","full_name":"happyrao78/scrap-articles","owner":"happyrao78","description":"A Python-powered web scraping tool with built-in article summarization, CLI and API interfaces, and Docker support. Ideal for extracting and managing web content at scale.","archived":false,"fork":false,"pushed_at":"2025-05-24T15:57:28.000Z","size":29,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-02T13:57:29.050Z","etag":null,"topics":["beautifulsoup","cli-app","docker","gemini","langchain-python","langgraph-python","langsmith-tracing","llm-embeddings","pineconedb","python","sqlite3-database","vectordb","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/happyrao78.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-23T17:13:49.000Z","updated_at":"2025-05-29T05:04:35.000Z","dependencies_parsed_at":"2025-05-23T18:37:32.457Z","dependency_job_id":"d8abb181-b5e6-4a0b-9173-e5d59ecdd965","html_url":"https://github.com/happyrao78/scrap-articles","commit_stats":null,"previous_names":["happyrao78/scrap-articles"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/happyrao78/scrap-articles","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/happyrao78%2Fscrap-articles","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/happyrao78%2Fscrap-articles/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/happyrao78%2Fscrap-articles/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/happyrao78%2Fscrap-articles/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/happyrao78","download_url":"https://codeload.github.com/happyrao78/scrap-articles/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/happyrao78%2Fscrap-articles/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33147339,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T09:28:26.183Z","status":"ssl_error","status_checked_at":"2026-05-17T09:27:52.702Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","cli-app","docker","gemini","langchain-python","langgraph-python","langsmith-tracing","llm-embeddings","pineconedb","python","sqlite3-database","vectordb","webscraping"],"created_at":"2025-09-02T13:49:00.589Z","updated_at":"2026-05-17T17:02:51.282Z","avatar_url":"https://github.com/happyrao78.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📰 Scrap Articles\n\nA Python-based web scraping tool designed to **extract**, **summarize**, and **store** articles from various websites. Scrap Articles provides both **CLI** and **RESTful API** interfaces, supporting flexible usage across different environments.\n\n---\n\n##  Features\n\n| Feature             | Description                                                                 |\n|---------------------|-----------------------------------------------------------------------------|\n| Web Scraping        | Extracts titles, authors, and content from websites using BeautifulSoup.     |\n| Summarization       | Summarizes content using the Google Gemini API.                             |\n| Database Integration| Stores articles in SQLite via SQLAlchemy ORM.                               |\n| CLI Interface       | Command-line access to all major functionalities using Click.               |\n| API Interface       | FastAPI-powered REST endpoints for programmatic access.                     |\n| Docker Support      | Containerized deployment using Docker and Docker Compose.                   |\n\n---\n\n##  Tech Stack\n\n| Component          | Technology           |\n|--------------------|----------------------|\n| Backend Framework  | FastAPI              |\n| Scraping Library   | BeautifulSoup, Requests |\n| Database           | SQLite + SQLAlchemy  |\n| CLI Tool           | Click                |\n| Summarization API  | Google Gemini API    |\n| Containerization   | Docker, Docker Compose |\n| Env Management     | Python-dotenv        |\n\n---\n\n## 🔧 Installation\n\n### 1. Clone the Repository\n\n```bash\ngit clone https://github.com/happyrao78/scrap-articles.git\ncd scrap-articles\n````\n\n### 2. Set Up the Environment\n\n#### Using Virtual Environment\n\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\npip install -r requirements.txt\n```\n\n#### Using Docker\n\n```bash\ndocker-compose up --build\n```\n\n---\n\n## ⚙️ Configuration\n\nCreate a `.env` file in the root directory with the following content or check out .env.example:\n\n```env\nDATABASE_URL=sqlite:///./articles.db\nGEMINI_API_KEY=your_google_gemini_api_key\n```\n\n---\n\n## 🖥 Usage\n\n### 1. CLI Commands\n\n| Command           | Description                      | Example                                                              |\n| ----------------- | -------------------------------- | -------------------------------------------------------------------- |\n| `init-database`   | Initialize the database          | `python cli.py init-database`                                        |\n| `scrape`          | Scrape articles from a given URL | `python cli.py scrape --url \"https://quotes.toscrape.com\" --limit 5` |\n| `list-articles`   | List all stored articles         | `python cli.py list-articles --limit 10`                             |\n| `get-summary`     | Get the summary of an article    | `python cli.py get-summary --id 1`                                   |\n| `delete-article`  | Delete a single article          | `python cli.py delete-article --id 1`                                |\n| `delete-articles` | Delete multiple articles         | `python cli.py delete-articles --ids 1 --ids 2  --ids 3`                          |\n| `test-gemini`     | Test Gemini API integration      | `python cli.py test-gemini`                                          |\n\n### 2. API Endpoints\n\n| Endpoint                       | Method | Description                       | Example                                                                                                                                                      |\n| ------------------------------ | ------ | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |\n| `/api/v1/scrape-and-summarize` | POST   | Scrape and summarize from a URL   | `curl -X POST -H \"Content-Type: application/json\" -d '{\"url\": \"https://quotes.toscrape.com\", \"limit\": 5}' http://localhost:8000/api/v1/scrape-and-summarize` |\n| `/api/v1/articles`             | GET    | List all stored articles          | `curl -X GET http://localhost:8000/api/v1/articles?skip=0\u0026limit=10`                                                                                          |\n| `/api/v1/get-summary/{id}`     | GET    | Get summary of a specific article | `curl -X GET http://localhost:8000/api/v1/get-summary/1`                                                                                                     |\n| `/api/v1/articles/{id}`        | DELETE | Delete an article by ID           | `curl -X DELETE http://localhost:8000/api/v1/articles/1`                                                                                                     |\n| `/health`                      | GET    | Check application health          | `curl -X GET http://localhost:8000/health`                                                                                                                   |\n\n---\n\n##  How to Run\n\n### Using CLI\n\n```bash\npython cli.py init-database\npython cli.py scrape --url \"https://quotes.toscrape.com\" --limit 5\npython cli.py list-articles\n```\n\n### Using Docker\n\n```bash\ndocker-compose up --build\n```\n\n* API Root: `http://localhost:8000/`\n* API Docs: [http://localhost:8000/docs](http://localhost:8000/docs)\n\n---\n\n## 🔮 Future Enhancements\n\n| Feature       | Goal                                                     | Status            |\n| ------------- | -------------------------------------------------------- | ----------------- |\n| **LangChain** | Build QA agent for scraped articles                      | Research ongoing  |\n| **LangGraph** | Graph-based contextual representation of articles        | Exploring options |\n| **LangSmith** | Debugging and monitoring the QA agents                   | Tool integration  |\n| **Pinecone**  | Store and retrieve vector embeddings for semantic search | In progress       |\n\n---\n\n##  Key Benefits\n\n* **Versatile**: Use CLI locally or API for integrations.\n* **Scalable**: Dockerized setup for consistent deployment.\n* **Persistent**: SQLite ensures articles are stored across sessions.\n* **Extensible**: Modular design supports feature additions.\n\n---\n\nFor major features or bugs, [open an issue](https://github.com/happyrao78/scrap-articles/issues) first to discuss.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhappyrao78%2Fscrap-articles","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhappyrao78%2Fscrap-articles","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhappyrao78%2Fscrap-articles/lists"}