Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nerohin/millions-crawler
Homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION; I used a distributed crawler to crawl over a million web pages.
- Host: GitHub
- URL: https://github.com/nerohin/millions-crawler
- Owner: NeroHin
- License: mit
- Created: 2023-03-08T06:20:09.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-28T07:41:59.000Z (about 1 year ago)
- Last Synced: 2023-10-28T08:26:47.989Z (about 1 year ago)
- Topics: crawler, distributed, scrapy, spider, web-crawler
- Language: Python
- Homepage:
- Size: 681 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
# millions-crawler
This is homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION; the target is to build a crawler application that crawls millions of web pages.
![](/image/What%20is%20a%20Web%20Crawler.jpg)
[image source](https://www.simplilearn.com/what-is-a-web-crawler-article)

## Part of the homework
[Medium Article](https://medium.com/@NeroHin/%E7%88%AC%E8%9F%B2%E6%9C%89%E5%B0%88%E6%94%BB-%E5%88%9D%E6%8E%A2-scrapy-%E7%88%AC%E8%9F%B2-%E4%BB%A5%E7%88%AC%E5%8F%96-15-%E8%90%AC%E7%AD%86%E7%B7%9A%E4%B8%8A%E9%86%AB%E7%99%82%E5%92%A8%E8%A9%A2-qa-%E7%82%BA%E4%BE%8B%E5%AD%90-39a6383a2de4)

# Homework Scope
1. **Crawl millions of webpages**
2. **Remove non-HTML pages**
3. **Performance optimization**
- How many pages can be crawled per hour
- Total time to crawl millions of pages

# Project architecture
### Distributed architecture
![distributed_architecture](./image/scrapy-redis.png)
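Under the hood, scrapy-redis replaces Scrapy's scheduler and duplicate filter with Redis-backed versions, so every spider process pulls requests from, and deduplicates against, the same Redis instance. A minimal sketch of the relevant `settings.py` entries (the `REDIS_URL` value assumes a local Redis and is not taken from this repo):

```python
# settings.py - minimal scrapy-redis wiring (sketch; REDIS_URL assumes a local Redis)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # shared request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared fingerprint set for dedup
SCHEDULER_PERSIST = True                                     # keep queue/dedup data between runs
REDIS_URL = "redis://localhost:6379"
```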
### Each spider
![spider](./image/Scrapy_architecture.png)
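The Scrapy architecture above includes item pipelines, and since results go to MongoDB (installed in the steps below), a pipeline along these lines is the standard pattern. This is a sketch, not the repo's actual code; the URI, database name, and per-spider collections are assumptions:

```python
# pipelines.py - illustrative MongoDB item pipeline (names are assumptions)
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="millions_crawler"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name

    def open_spider(self, spider):
        # one client per spider process
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # a collection per spider keeps tweh / w8h / wiki data separate
        self.db[spider.name].insert_one(dict(item))
        return item
```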
### Spider with [台灣 E 院](https://sp1.hso.mohw.gov.tw/doctor/Index1.php)
![tweh_parse_flowchart](./image/%E8%87%BA%E7%81%A3%20E%20%E9%99%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
### Spider with [問 8 健康諮詢](https://tw.wen8health.com/)
![w8h_parse_flowchart](./image/%E5%95%8F%208%20%E5%81%A5%E5%BA%B7%E5%92%A8%E8%A9%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
### Spider with [Wiki](https://en.wikipedia.org/wiki/Main_Page)
![wiki_parse_flowchart](./image/Wiki%20%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
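All three spiders share the same shape: take start URLs from a Redis list, parse the page, and feed newly discovered links back through the shared scheduler. A minimal sketch of the wiki case, assuming scrapy-redis's `RedisSpider` base class; the `redis_key` and selectors are illustrative, not the repo's actual code:

```python
# spiders/wiki.py - illustrative RedisSpider (sketch, not the repo's code)
from scrapy_redis.spiders import RedisSpider


class WikiSpider(RedisSpider):
    name = "wiki"
    redis_key = "wiki:start_urls"  # Redis list that seeds this spider

    def parse(self, response):
        # homework step 2: drop non-HTML responses
        if b"text/html" not in response.headers.get("Content-Type", b""):
            return
        yield {"url": response.url, "title": response.css("title::text").get()}
        # follow outgoing links; the shared Redis dupefilter prevents revisits
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```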
### Anti-Anti-Spider
1. Ignore robots.txt
```python
# edit settings.py
ROBOTSTXT_OBEY = False
```
2. Use a random User-Agent
```bash
pip install fake-useragent
```

```python
# edit middlewares.py
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class FakeUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent
        self.ua = UserAgent()  # build the User-Agent pool once, not per request

    def process_request(self, request, spider):
        # set a random real-browser User-Agent on every outgoing request
        request.headers['User-Agent'] = self.ua.random
```

```python
DOWNLOADER_MIDDLEWARES = {
"millions_crawler.middlewares.FakeUserAgentMiddleware": 543,
}
```

# Result
## Single spider (2023/03/21)
| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| :----: | :--------: | :--------------: | :-----------: |
| tweh | 152,958 | 1.3 | 117,409 |
| w8h | 4,759 | 0.1 | 32,203 |
| wiki* | 13,000,320 | 43 | 30,240 |

## Distributed spider (4 spiders, 2023/03/24)
| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| :----: | :--------: | :--------------: | :-----------: |
| tweh | 153,288 | 0.52 | - |
| w8h | 4,921 | 0.16 | - |
| wiki* | 4,731,249 | 43.2 | 109,492 |

# How to use
0. Create a .env file
```bash
bash create_env.sh
```
1. Install [Redis](https://redis.io/)
```bash
sudo apt-get install redis-server
```
2. Install [MongoDB](https://www.mongodb.com/)
```bash
sudo apt-get install mongodb
```
3. Run Redis
```bash
redis-server
```
4. Run MongoDB
```bash
sudo service mongod start
```
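Before launching any spiders, it's worth confirming that both services are actually reachable. A quick sanity check with redis-py and pymongo, assuming default local ports:

```python
# check_services.py - verify Redis and MongoDB are up (assumes default ports)
import pymongo
import redis

redis.Redis(host="localhost", port=6379).ping()                  # raises if Redis is down
pymongo.MongoClient("mongodb://localhost:27017").server_info()   # raises if MongoDB is down
print("Redis and MongoDB are both reachable")
```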
5. Run a spider
```bash
cd millions-crawler
scrapy crawl $spider_name  # where $spider_name is tweh, w8h, or wiki
```
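Note that a scrapy-redis spider typically sits idle until its start-URL list in Redis is seeded. If the spiders follow the common `<spider_name>:start_urls` key convention (an assumption, not verified against this repo), seeding the tweh spider looks like:

```python
# seed_tweh.py - push a start URL into the spider's Redis queue
# (the key name follows the usual scrapy-redis convention; an assumption here)
import redis

r = redis.Redis(host="localhost", port=6379)
r.lpush("tweh:start_urls", "https://sp1.hso.mohw.gov.tw/doctor/Index1.php")
```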
# Requirements
```bash
pip install -r requirements.txt
```
# References
1. [GitHub | fake-useragent](https://github.com/fake-useragent/fake-useragent)
2. [GitHub | scrapy](https://github.com/scrapy/scrapy)
3. [【Day 20】Anti-anti-crawling (反反爬蟲)](https://ithelp.ithome.com.tw/articles/10224979)
4. [Documentation of Scrapy](https://docs.scrapy.org/en/latest/index.html)
5. [Fixing the Redis error: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist o...](https://www.jianshu.com/p/3aaf21dd34d6)
6. [Ubuntu Linux Redis database installation and configuration tutorial with examples](https://officeguide.cc/ubuntu-linux-redis-database-installation-configuration-tutorial-examples/)
7. [How to connect to a remote Linux + MongoDB server?](https://magiclen.org/mongodb-remote)
8. [Scrapy-redis: the final chapter](https://www.twblogs.net/a/5ef9b649952deac88f79c670)