{"id":34851455,"url":"https://github.com/shiningflash/web-scraping","last_synced_at":"2026-04-20T20:02:52.824Z","repository":{"id":111950773,"uuid":"424351348","full_name":"shiningflash/web-scraping","owner":"shiningflash","description":"Web scraping with Python using the Beautiful Soup and Scrapy.","archived":false,"fork":false,"pushed_at":"2024-12-31T03:02:29.000Z","size":132,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-09T20:02:04.033Z","etag":null,"topics":["beautifulsoup","python","scrapy","selenium","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shiningflash.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-03T19:16:38.000Z","updated_at":"2024-12-31T03:02:32.000Z","dependencies_parsed_at":null,"dependency_job_id":"e1f18c27-85fe-4ff6-adf4-68d65328c7ba","html_url":"https://github.com/shiningflash/web-scraping","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shiningflash/web-scraping","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fweb-scraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fweb-scraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fweb-scraping/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fweb-scraping
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shiningflash","download_url":"https://codeload.github.com/shiningflash/web-scraping/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fweb-scraping/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28035466,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-25T02:00:05.988Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","python","scrapy","selenium","web-scraping"],"created_at":"2025-12-25T19:20:01.872Z","updated_at":"2025-12-25T19:20:27.288Z","avatar_url":"https://github.com/shiningflash.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping Projects\n\n## Overview\nThis repository contains several web scraping projects built with Python libraries such as **Beautiful Soup**, **Scrapy**, and **Playwright**.\n\nWeb scraping, also referred to as screen scraping, web harvesting, or web crawling, is the automated extraction of data from websites. 
These projects focus on implementing best practices for extracting and managing structured data efficiently.\n\n---\n\n## Tools \u0026 Technologies\n\n- **Python**: Core programming language used for scraping.\n- **Beautiful Soup**: Library for parsing HTML and XML documents.\n- **Scrapy**: Framework for building web crawlers.\n- **Playwright**: Open-source automation library for browser testing and web scraping.\n- **JSON**: Structured format for storing extracted data.\n- **Pandas**: Data manipulation and cleaning (optional, for analysis).\n- **Requests**: HTTP library for interacting with web pages.\n- **Git**: Version control for project collaboration.\n\n---\n\n### Key Directories:\n\n1. **article_scraper**:\n   - A Scrapy-based project to scrape articles from sources like Wikipedia and Yahoo News.\n   - Includes spiders and configurations.\n\n2. **flipkart_scraper**:\n   - Scrapes laptop data from Flipkart.\n\n3. **timesjobs_scraper**:\n   - Scrapes job listings from TimesJobs.\n\n---\n\n## Installation\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/shiningflash/web-scraping.git\n   cd web-scraping\n   ```\n\n2. Install dependencies:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n---\n\n## Usage\n\n### Flipkart Scraper\n\nNavigate to the `flipkart_scraper` directory and run:\n```bash\npython main.py\n```\nThe output is saved to `laptops.json`.\n\n### Article Scraper (Using Scrapy)\n\nNavigate to the `article_scraper` directory and follow these steps. Steps 1 and 2 are only needed when starting a project from scratch; the repository already includes the project and its spiders.\n\n1. **Create a new Scrapy project**:\n   ```bash\n   scrapy startproject article_scraper\n   ```\n\n2. **Generate a new spider**:\n   ```bash\n   scrapy genspider wikipedia https://en.wikipedia.org\n   ```\n\n3. **Run the spider**:\n   ```bash\n   scrapy runspider spiders/wikipedia_spider.py -o articles.json -t json\n   ```\n\n   **Notes:**\n   - `-o` specifies the output file; Scrapy infers the format from the file extension.\n   - `-t` specifies the output format (JSON, CSV, XML, etc.). It is deprecated in recent Scrapy releases and can be omitted when the file extension already identifies the format.\n\n4. 
**Custom Settings in Spiders**:\n   ```python\n   # FEEDS replaces FEED_URI and FEED_FORMAT, which are deprecated since Scrapy 2.1.\n   custom_settings = {\n       \"FEEDS\": {\"articles.json\": {\"format\": \"json\"}},\n   }\n   ```\n\n### TimesJobs Scraper\n\nNavigate to the `timesjobs_scraper` directory and run:\n```bash\npython main.py\n```\nThe output is saved to `jobs.json`.\n\n---\n\n## Best Practices for Scraping\n\n1. **Respect robots.txt**:\n   - Always set `ROBOTSTXT_OBEY = True` in the Scrapy project's `settings.py`.\n\n2. **Use Pipelines**:\n   - Clean, validate, and store scraped items in Scrapy item pipelines.\n\n3. **Handle Pagination**:\n   - Follow next-page links so the scraper covers every page of a listing.\n\n4. **Error Handling**:\n   - Handle failed HTTP requests and missing or malformed page elements so that one bad page does not stop the run.\n\n5. **Data Storage**:\n   - Store data in structured formats like JSON or CSV for analysis.\n\n---\n\n## Future Improvements\n\n1. Implement cloud-based scraping solutions.\n2. Add database support (e.g., MySQL, MongoDB) for data storage.\n3. Integrate with CI/CD pipelines for automated scraping.\n\n---\n\n## Contact\n\nFor any inquiries or suggestions:\n\n- **Author**: Amirul Islam Al Mamun\n- **GitHub**: [shiningflash](https://github.com/shiningflash)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshiningflash%2Fweb-scraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshiningflash%2Fweb-scraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshiningflash%2Fweb-scraping/lists"}