{"id":21047139,"url":"https://github.com/nerohin/millions-crawler","last_synced_at":"2026-02-09T07:06:00.020Z","repository":{"id":152680562,"uuid":"611114970","full_name":"NeroHin/millions-crawler","owner":"NeroHin","description":"Homework III of NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION , I've used the distribute crawler to crawling over miliion web page.","archived":false,"fork":false,"pushed_at":"2023-10-28T07:41:59.000Z","size":697,"stargazers_count":8,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T10:36:27.646Z","etag":null,"topics":["crawler","distributed","scrapy","spider","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NeroHin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-08T06:20:09.000Z","updated_at":"2024-12-10T06:51:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"f96d88bd-a0f1-4264-b8ca-9c6acf04cd7f","html_url":"https://github.com/NeroHin/millions-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeroHin%2Fmillions-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeroHin%2Fmillions-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeroHin%2Fmillions-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeroHin%2
Fmillions-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NeroHin","download_url":"https://codeload.github.com/NeroHin/millions-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248594143,"owners_count":21130313,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","distributed","scrapy","spider","web-crawler"],"created_at":"2024-11-19T14:35:36.940Z","updated_at":"2026-02-09T07:06:00.015Z","avatar_url":"https://github.com/NeroHin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# millions-crawler\n\nThis is homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION. The target is to create a crawler application that crawls millions of webpages.\n\n![](/image/What%20is%20a%20Web%20Crawler.jpg)\n[image source](https://www.simplilearn.com/what-is-a-web-crawler-article)\n\n## Part of the homework:\n[Medium Article](https://medium.com/@NeroHin/%E7%88%AC%E8%9F%B2%E6%9C%89%E5%B0%88%E6%94%BB-%E5%88%9D%E6%8E%A2-scrapy-%E7%88%AC%E8%9F%B2-%E4%BB%A5%E7%88%AC%E5%8F%96-15-%E8%90%AC%E7%AD%86%E7%B7%9A%E4%B8%8A%E9%86%AB%E7%99%82%E5%92%A8%E8%A9%A2-qa-%E7%82%BA%E4%BE%8B%E5%AD%90-39a6383a2de4)\n\n# Homework Scope\n\n1. **Crawl millions of webpages**\n2. **Remove non-HTML pages**\n3. 
**Performance optimization**\n   - How many pages can be crawled per hour\n   - Total time to crawl millions of pages\n\n# Project architecture\n\n### Distributed architecture\n\n![distributed_architecture](./image/scrapy-redis.png)\n\n### Each spider\n![spider](./image/Scrapy_architecture.png)\n\n### Spider with [台灣 E 院](https://sp1.hso.mohw.gov.tw/doctor/Index1.php)\n\n![tweh_parse_flowchat](./image/%E8%87%BA%E7%81%A3%20E%20%E9%99%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)\n\n### Spider with [問 8 健康諮詢](https://tw.wen8health.com/)\n\n![w8h_parse_flowchat](./image/%E5%95%8F%208%20%E5%81%A5%E5%BA%B7%E5%92%A8%E8%A9%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)\n\n### Spider with [Wiki](https://en.wikipedia.org/wiki/Main_Page)\n\n![wiki_parse_flowchat](./image/Wiki%20%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)\n\n### Anti-Anti-Spider\n\n1. Skip robots.txt\n\n```python\n# edit settings.py\nROBOTSTXT_OBEY = False\n```\n\n2. Use a random user-agent\n\n```bash\npip install fake-useragent\n```\n\n```python\n# edit middlewares.py\nfrom fake_useragent import UserAgent\nfrom scrapy.downloadermiddlewares.useragent import UserAgentMiddleware\n\n\nclass FakeUserAgentMiddleware(UserAgentMiddleware):\n    def __init__(self, user_agent=''):\n        self.user_agent = user_agent\n        # Build the user-agent pool once instead of on every request\n        self.ua = UserAgent()\n\n    def process_request(self, request, spider):\n        # Overwrite the User-Agent header with a random value\n        request.headers['User-Agent'] = self.ua.random\n```\n\n```python\n# edit settings.py\nDOWNLOADER_MIDDLEWARES = {\n   \"millions_crawler.middlewares.FakeUserAgentMiddleware\": 543,\n}\n```\n\n# Result\n\n## Single spider on 2023/03/21\n\n| Spider | Total Pages | Total Time (hrs) | Pages per Hour |\n| :----: | :--------: | :--------------: | :-----------: |\n|  tweh  |  152,958   |       1.3        |    117,409    |\n|  w8h   |   4,759    |       0.1        |    32,203     |\n|  wiki*  | 13,000,320 |       43       |    30,240    |\n\n\n## Distributed spider (4 spiders) on 2023/03/24\n| Spider | Total Pages | Total Time (hrs) | Pages per Hour |\n| :----: | :--------: | :--------------: | :-----------: |\n|  tweh  |  153,288   |       0.52       |    -    |\n|  w8h   |   4,921    | 
      0.16        |    -     |\n|  wiki*  | 4,731,249 |       43.2       |    109,492    |\n\n\n# How to use\n\n0. Create a .env file\n\n```bash\nbash create_env.sh\n```\n\n1. Install [Redis](https://redis.io/)\n\n```bash\nsudo apt-get install redis-server\n```\n\n2. Install [MongoDB](https://www.mongodb.com/)\n\n```bash\nsudo apt-get install mongodb\n```\n\n3. Run Redis\n\n```bash\nredis-server\n```\n\n4. Run MongoDB\n\n```bash\nsudo service mongod start\n```\n\n5. Run spider\n\n```bash\ncd millions-crawler\nscrapy crawl [$spider_name] # $spider_name = tweh, w8h, wiki\n```\n\n# Requirements\n\n```bash\npip install -r requirements.txt\n```\n\n# References\n\n1. [GitHub | fake-useragent](https://github.com/fake-useragent/fake-useragent)\n2. [GitHub | scrapy](https://github.com/scrapy/scrapy)\n3. [【Day 20】反反爬蟲](https://ithelp.ithome.com.tw/articles/10224979)\n4. [Documentation of Scrapy](https://docs.scrapy.org/en/latest/index.html)\n5. [解决 Redis 之 MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist o...](https://www.jianshu.com/p/3aaf21dd34d6)\n6. [Ubuntu Linux 安裝、設定 Redis 資料庫教學與範例](https://officeguide.cc/ubuntu-linux-redis-database-installation-configuration-tutorial-examples/)\n7. [如何連線到遠端的 Linux + MongoDB 伺服器？](https://magiclen.org/mongodb-remote)\n8. [Scrapy-redis 之終結篇](https://www.twblogs.net/a/5ef9b649952deac88f79c670)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnerohin%2Fmillions-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnerohin%2Fmillions-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnerohin%2Fmillions-crawler/lists"}