https://github.com/secretdeveloperisme/comic-crawler
comic crawler
- Host: GitHub
- URL: https://github.com/secretdeveloperisme/comic-crawler
- Owner: secretdeveloperisme
- Created: 2024-12-29T06:58:03.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-01T14:02:57.000Z (over 1 year ago)
- Last Synced: 2025-05-07T11:33:10.005Z (about 1 year ago)
- Language: Java
- Size: 15.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Comic Website Crawler
## Motivation
When I read comics on a website, I run into numerous advertisements, both visible and hidden.
In addition, each chapter contains only a few images, so reading means constantly switching pages. To remove these inconveniences, I wrote a small program that crawls the content of multiple chapters and combines the output into a single HTML file.
## How to use the program
### 1. Start the image proxy server
The target server rejects image requests from untrusted hosts by validating the `Referer` header. However, browsers do not allow a page to change the `Referer` header before a request is sent.
Therefore, I use an HTTP proxy server that sets the header and forwards the request to the target server.
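To illustrate the idea, here is a minimal sketch of such a proxy built only on the JDK. It is not the code shipped in `imageproxy-1.0.0.jar`. The class name `RefererProxySketch`, the `/image` path, the `url` query parameter, port `8080`, and the placeholder `Referer` value are all assumptions made for the example.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: forwards GET /image?url=<encoded target> with a fixed Referer.
class RefererProxySketch {

    /** Starts the proxy on the loopback interface; port 0 picks a free port. */
    public static HttpServer start(int port, String referer) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/image", exchange -> handle(exchange, referer));
        server.start();
        return server;
    }

    /** Extracts and decodes the (assumed) "url" query parameter, or returns null. */
    public static String targetUrl(String rawQuery) {
        if (rawQuery == null) {
            return null;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.startsWith("url=")) {
                return URLDecoder.decode(pair.substring(4), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    private static void handle(HttpExchange exchange, String referer) throws IOException {
        try {
            String target = targetUrl(exchange.getRequestURI().getRawQuery());
            URL url;
            try {
                url = target == null ? null : URI.create(target).toURL();
            } catch (IllegalArgumentException | MalformedURLException e) {
                url = null;
            }
            if (url == null) {
                exchange.sendResponseHeaders(400, -1); // missing or invalid url parameter
                return;
            }
            HttpURLConnection upstream = (HttpURLConnection) url.openConnection();
            // The header a browser page cannot set itself.
            upstream.setRequestProperty("Referer", referer);
            int status = upstream.getResponseCode();
            String contentType = upstream.getContentType();
            if (contentType != null) {
                exchange.getResponseHeaders().set("Content-Type", contentType);
            }
            InputStream body = status >= 400 ? upstream.getErrorStream() : upstream.getInputStream();
            exchange.sendResponseHeaders(status, 0); // 0 = stream the body in chunks
            try (OutputStream out = exchange.getResponseBody()) {
                if (body != null) {
                    try (InputStream in = body) {
                        in.transferTo(out);
                    }
                }
            }
        } finally {
            exchange.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder Referer; the real value depends on the target site.
        String referer = args.length > 0 ? args[0] : "https://example.com/";
        HttpServer server = start(8080, referer);
        System.out.println("Proxy listening on " + server.getAddress());
    }
}
```

With a proxy like this running, the generated HTML could point each image at `http://localhost:8080/image?url=...` instead of at the target server directly. The sketch binds to the loopback interface only, because a proxy that forwards arbitrary URLs should not be reachable from other machines.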
#### Run the proxy server
```bash
java -jar ./imageproxy-1.0.0.jar
```
### 2. Crawl the chapters
The target server uses rate limiting to prevent large-scale crawling of comic content.
Therefore, the crawler limits the number of requests it sends and retries a request when the server reports too many requests.
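The retry behaviour described above can be sketched roughly as follows. This is not the crawler's actual code. It assumes the server signals rate limiting with HTTP status `429`, and the class `RetrySketch`, the method `requestWithRetry`, and the doubling delay are illustrative choices.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: retry a request with a growing delay while the server
// answers "too many requests".
class RetrySketch {

    // Assumption: the server reports rate limiting with HTTP 429.
    static final int TOO_MANY_REQUESTS = 429;

    /**
     * Runs the request until it returns a status other than 429 or until
     * maxAttempts calls have been made. Returns the last status seen.
     */
    public static int requestWithRetry(IntSupplier request, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        int status = request.getAsInt();
        for (int attempt = 1; status == TOO_MANY_REQUESTS && attempt < maxAttempts; attempt++) {
            sleep(delayMs);   // back off before trying again
            delayMs *= 2;     // wait twice as long after each rejection
            status = request.getAsInt();
        }
        return status;
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        // Simulated server: rejects the first two calls, then accepts.
        int[] calls = {0};
        IntSupplier fakeRequest = () -> ++calls[0] <= 2 ? TOO_MANY_REQUESTS : 200;
        int status = requestWithRetry(fakeRequest, 5, 100);
        System.out.println("status=" + status + " after " + calls[0] + " calls");
    }
}
```

In a real crawler the `IntSupplier` would wrap the HTTP call for one chapter page, and a short fixed pause between chapters would keep the overall request rate down.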
#### Start the crawler
```bash
java -jar ./comic-crawler-1.0-jar-with-dependencies.jar --start 1 --end 10
```
**How to use the crawler program**
```text
Usage: ComicCrawler
 -e,--end     end chapter number
 -s,--start   start chapter number
```