https://github.com/techut30/multithreadwebcrawler

This is a multi-threaded web crawler program coded in Java. Kindly change the webpages you want to crawl in the Main class of the program.
https://github.com/techut30/multithreadwebcrawler

Last synced: over 1 year ago
JSON representation

This is a multi-threaded web crawler program coded in Java. Kindly change the webpages you want to crawl in the Main class of the program.

Host: GitHub
URL: https://github.com/techut30/multithreadwebcrawler
Owner: techut30
Created: 2023-10-01T12:29:24.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-09-16T21:53:04.000Z (almost 2 years ago)
Last Synced: 2025-01-20T21:47:20.496Z (over 1 year ago)
Language: Java
Size: 405 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Java WebCrawler

This is a multithreaded web crawler written in Java that uses the JSoup library to scrape websites. The program takes a URL as input and crawls through web pages up to a maximum depth of 5, printing out the title of each web page it visits. Multiple web crawlers run in parallel using threads to speed up the process.

## Features

- **Multithreaded**: Each web crawler runs on its own thread, allowing for parallel web scraping.
- **Depth-limited crawling**: The crawler stops after reaching a specified depth of 5 to avoid infinite loops.
- **Page Title Extraction**: The crawler prints the title of each web page it visits.
- **Unique URL Visits**: The program ensures that no URL is visited more than once in a given session.

## Project Structure

The project contains two main classes:

1. **`WebCrawl`**: Implements the crawling logic. Each instance runs on its own thread and follows links up to a depth of 5.
2. **`Main`**: The entry point of the program. It creates multiple web crawlers and manages their execution.

## Installation

### Prerequisites

- Java Development Kit (JDK) version 8 or higher
- Maven or Gradle (optional, for dependency management)
- [JSoup Library](https://jsoup.org/), version 1.13.1 or higher

### Steps

1. **Clone the repository or copy the source code**:
```bash
git clone https://github.com/yourusername/JavaWebCrawler.git
```
2. **Download JSoup**:
If using Maven, add the following dependency in your `pom.xml` file:
```xml

org.jsoup
jsoup
1.13.1

```

If using Gradle, add this line to your `build.gradle`:
```groovy
implementation 'org.jsoup:jsoup:1.13.1'
```

Alternatively, download the JSoup JAR manually and add it to your project’s classpath.

3. **Compile and run the program**:
```bash
javac -cp jsoup-1.13.1.jar WebCrawler/*.java
java -cp .:jsoup-1.13.1.jar WebCrawler.Main
```

## How It Works

### WebCrawl Class

- **Constructor**:
- Takes the starting URL and a unique identifier for the crawler.
- Starts a new thread to run the crawling process.

- **crawl(int level, String url)**:
- Recursively crawls through the links found on the given URL until the maximum depth (5) is reached.

- **request(String url)**:
- Fetches the document from the given URL using JSoup.
- Prints the status code, title, and adds the URL to the visited list if the page was successfully fetched.

### Main Class

- **Main Method**:
- Initializes multiple web crawlers with different starting URLs.
- Manages the execution of each web crawler by calling `join()` to ensure the main thread waits for each crawler to finish before exiting.

## Example

In the `Main` class, three instances of `WebCrawl` are created, each starting at different websites:
- `https://www.wikipedia.org/`
- `https://timesofindia.indiatimes.com/`
- `https://www.cricbuzz.com/`

Each crawler will explore up to a depth of 5 and print the titles of the web pages it visits.

### Output Example:
```text
WebCrawler Created Successfully
WebCrawler Created Successfully
WebCrawler Created Successfully

Bot ID: 1 Recieved webpage link at : https://www.wikipedia.org/
Wikipedia

Bot ID: 1 Recieved webpage link at : https://www.wikibooks.org/
Wikibooks

Bot ID: 2 Recieved webpage link at : https://timesofindia.indiatimes.com/
Times of India

Bot ID: 3 Recieved webpage link at : https://www.cricbuzz.com/
Cricbuzz

```

## Customization

To customize the starting URLs or add more web crawlers:

1. Open the `Main.java` file.
2. Add more instances of `WebCrawl` with the desired URLs:
```java
bot.add(new WebCrawl("https://example.com/", 4));
```

## Limitations

- The current implementation does not handle loops or duplicate links across different web crawlers.
- The program may run into issues with sites that block web crawlers (like CAPTCHA or IP rate-limiting).
- External links (URLs outside the base domain) are not filtered.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [JSoup Library](https://jsoup.org/) for easy HTML parsing and web scraping in Java.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/techut30/multithreadwebcrawler

Awesome Lists containing this project

README