https://github.com/shaharashe/url-crawler
- Host: GitHub
- URL: https://github.com/shaharashe/url-crawler
- Owner: ShaharAshe
- Created: 2024-09-08T15:24:02.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-09-08T15:26:04.000Z (4 months ago)
- Last Synced: 2024-11-10T00:09:16.471Z (2 months ago)
- Topics: crawler, design-patterns, http-requests, java
- Language: Java
- Homepage:
- Size: 156 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
---
# 🕸️ Web Crawler Project
## Table of Contents
- [About Me](#about-me)
- [Overview](#overview)
- [JavaDoc](#javadoc)
- [Project Structure](#project-structure)
- [Usage](#usage)
- [Design Patterns](#-design-patterns)

---
# About Me
- Name: Shahar Asher
- 📫 Email: [[email protected]](mailto:[email protected])

---
## Overview
This project is a Java-based multi-threaded web crawler. It reads URLs from a specified file, downloads their content, and outputs the results in different formats. The output formats and the thread pool size are configurable, and the code relies on design patterns to stay flexible and easy to extend.
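As a rough illustration of that flow, the sketch below runs one task per URL on a fixed-size thread pool and collects the results in input order. It is not the project's code: `CrawlSketch`, `crawlAll`, and the `fetch` parameter are hypothetical names, and the download step is passed in as a function so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch; the real Controller may be organized differently.
class CrawlSketch {

    // Runs `fetch` for every URL on a pool of `poolSize` threads and
    // returns the results in the same order as the input list.
    static List<String> crawlAll(List<String> urls, int poolSize,
                                 Function<String, String> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (String url : urls) {
                // Each URL becomes an independent task on the pool.
                pending.add(CompletableFuture.supplyAsync(() -> fetch.apply(url), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> task : pending) {
                results.add(task.join()); // waits for that task to finish
            }
            return results;
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real download: reports the URL and its length.
        Function<String, String> fakeFetch = url -> url + " " + url.length();
        List<String> urls = List.of("http://a.example", "http://bb.example");
        crawlAll(urls, 4, fakeFetch).forEach(System.out::println);
    }
}
```

Collecting the futures first and joining them afterwards keeps the tasks running in parallel while still producing output in a predictable order.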
---
## JavaDoc
To view the generated JavaDoc:

1. Open the [doc](./doc) directory.
2. Locate the [index.html](./doc/index.html) file.
3. Open it with a web browser to explore the documentation.

---
## Project Structure

The project contains several Java classes with the following roles:
- **Main.java**: The main entry point for running the program. It handles command-line arguments and initializes the web crawler.
- **Controller.java**: Manages the crawling process, including reading URLs, setting up a thread pool, and handling output formats.
- **Downloader.java**: An abstract base class for downloading content from a given URL. It defines common behavior for handling HTTP requests and output processing.
- **ImageDownloader.java**: A concrete subclass of `Downloader` specialized for downloading image content.
- **FormatFactory.java**: A factory class responsible for creating instances of output formats based on provided types.
- **OutFormat.java**: An interface that defines a contract for various output formats.
- **SizeFormat**, **UrlFormat**, **TimeFormat**, **ImagTypeFormat**: Implementations of `OutFormat` that represent different output formats for the web crawler.
- **FileReaderComp.java**: A class that reads URLs from a specified file.

---
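The relationship between the abstract `Downloader` and its `ImageDownloader` subclass listed above can be pictured roughly as follows. This is a minimal sketch under assumptions, since the README does not show the real method signatures: `DownloaderSketch`, `PngDownloaderSketch`, and the methods `process`, `fetch`, `accepts`, and `describe` are all hypothetical, and the subclass returns canned bytes instead of making an HTTP request.

```java
// Hypothetical stand-ins for Downloader and ImageDownloader.
abstract class DownloaderSketch {

    // Fixed sequence of steps shared by every downloader. It is final,
    // so subclasses can only fill in the steps, not reorder them.
    final String process(String url) {
        byte[] body = fetch(url);
        if (!accepts(body)) {
            return url + " skipped";
        }
        return url + " " + describe(body);
    }

    // Step: obtain the bytes (an HTTP request in a real crawler).
    protected abstract byte[] fetch(String url);

    // Step with a default: which content this downloader handles.
    protected boolean accepts(byte[] body) {
        return body.length > 0;
    }

    // Step: what to report about the content.
    protected abstract String describe(byte[] body);
}

// Accepts only content whose first four bytes match the PNG signature.
class PngDownloaderSketch extends DownloaderSketch {
    private final byte[] canned;

    PngDownloaderSketch(byte[] canned) {
        this.canned = canned;
    }

    @Override
    protected byte[] fetch(String url) {
        return canned; // no network access in this sketch
    }

    @Override
    protected boolean accepts(byte[] body) {
        return body.length >= 4 && (body[0] & 0xFF) == 0x89
                && body[1] == 'P' && body[2] == 'N' && body[3] == 'G';
    }

    @Override
    protected String describe(byte[] body) {
        return "png " + body.length + " bytes";
    }
}

class TemplateDemo {
    public static void main(String[] args) {
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        byte[] other = {'h', 'i'};
        System.out.println(new PngDownloaderSketch(png).process("http://a.example/x.png"));
        System.out.println(new PngDownloaderSketch(other).process("http://a.example/x.txt"));
    }
}
```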
## Usage
To run the web crawler, follow these steps:
1. **Compile the Code**:
Compile all Java files to ensure the project is built properly.

```bash
javac ex2/*.java
```

2. **Run the Application**:
Execute the program with the appropriate command-line arguments: the output format, thread pool size, and the file name containing the URLs to be crawled.

```bash
java ex2.Main
```

- **Output Format**: A string indicating the desired output formats, such as 's' for size, 'u' for URL, 't' for time, and 'm' for image type.
- **Pool Size**: An integer representing the number of threads for multi-threaded execution.
- **File Name**: The name of the file that contains the list of URLs to be crawled.

Example command:
```bash
java ex2.Main sutm 4 urls.txt
```

---
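One way the three arguments from the Usage section could be validated before any crawling starts is sketched below. `CrawlerArgs` and `parse` are hypothetical names, and the set of accepted letters is assumed to be exactly the four mentioned above; the project's `Main` class may handle its arguments differently.

```java
// Hypothetical holder for the three command-line arguments.
final class CrawlerArgs {
    final String formats;   // e.g. "sutm"
    final int poolSize;     // number of worker threads
    final String fileName;  // file listing the URLs

    private CrawlerArgs(String formats, int poolSize, String fileName) {
        this.formats = formats;
        this.poolSize = poolSize;
        this.fileName = fileName;
    }

    // Checks the argument count, the format letters, and the pool size.
    static CrawlerArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "expected 3 arguments: output format, pool size, file name");
        }
        String formats = args[0];
        if (formats.isEmpty()) {
            throw new IllegalArgumentException("output format must not be empty");
        }
        for (char c : formats.toCharArray()) {
            if ("sutm".indexOf(c) < 0) { // assumed set of format letters
                throw new IllegalArgumentException("unknown output format: " + c);
            }
        }
        int poolSize;
        try {
            poolSize = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("pool size must be an integer: " + args[1]);
        }
        if (poolSize < 1) {
            throw new IllegalArgumentException("pool size must be at least 1");
        }
        return new CrawlerArgs(formats, poolSize, args[2]);
    }

    public static void main(String[] args) {
        CrawlerArgs parsed = parse(new String[] {"sutm", "4", "urls.txt"});
        System.out.println(parsed.formats + " " + parsed.poolSize + " " + parsed.fileName);
    }
}
```

Rejecting bad input up front means a typo in the format string or pool size is reported before any thread is started.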
## 💡 Design Patterns
This project uses several design patterns to enhance maintainability and flexibility:
- **Factory Pattern**: Implemented in `FormatFactory`, allowing dynamic creation of output formats. This pattern solves the problem of creating format-specific instances without changing the core logic.
- **Template Method Pattern**: Used in the `Downloader` class, which provides a common structure for downloading content, allowing subclasses like `ImageDownloader` to define specific behaviors. This pattern addresses the problem of code duplication by centralizing common behavior and providing a template for extensions.
- **Strategy Pattern**: Implemented with the `OutFormat` interface and its different implementations. It provides flexibility in choosing output formats at runtime, solving the problem of hard-coding specific behaviors.

Together, these patterns keep the code maintainable and extensible, so new output formats and download behaviors can be added without touching the core logic.
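To make the interplay of the Factory and Strategy patterns concrete, here is a minimal sketch. It assumes names that are not taken from the project (`PageInfo`, `OutFormatSketch`, `FormatFactorySketch`, `render`) and covers only the 's', 'u', and 't' letters; the real `OutFormat` and `FormatFactory` may have different signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of one finished download.
final class PageInfo {
    final String url;
    final long sizeBytes;
    final long millis;

    PageInfo(String url, long sizeBytes, long millis) {
        this.url = url;
        this.sizeBytes = sizeBytes;
        this.millis = millis;
    }
}

// Strategy: each implementation renders one field of the result.
interface OutFormatSketch {
    String format(PageInfo page);
}

// Factory: maps a format letter to a strategy. Supporting a new letter
// means adding one case here; the rendering loop stays unchanged.
final class FormatFactorySketch {
    static OutFormatSketch create(char type) {
        switch (type) {
            case 's': return page -> Long.toString(page.sizeBytes);
            case 'u': return page -> page.url;
            case 't': return page -> Long.toString(page.millis);
            default: throw new IllegalArgumentException("unknown format: " + type);
        }
    }

    // Builds the strategies in the order given by the format string.
    static List<OutFormatSketch> createAll(String types) {
        List<OutFormatSketch> formats = new ArrayList<>();
        for (char c : types.toCharArray()) {
            formats.add(create(c));
        }
        return formats;
    }
}

class FormatDemo {
    // Applies every selected strategy and joins the pieces with spaces.
    static String render(String types, PageInfo page) {
        StringBuilder line = new StringBuilder();
        for (OutFormatSketch f : FormatFactorySketch.createAll(types)) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(f.format(page));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        PageInfo page = new PageInfo("http://a.example/logo.png", 2048, 37);
        System.out.println(render("us", page));  // URL, then size
        System.out.println(render("tsu", page)); // time, size, URL
    }
}
```

Because `render` only depends on the interface, the choice and order of output fields is decided entirely by the format string at runtime.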
---