https://github.com/hkattt/gopher-web-crawler
Rust Gopher web crawler
- Host: GitHub
- URL: https://github.com/hkattt/gopher-web-crawler
- Owner: hkattt
- Created: 2024-04-11T07:08:23.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-06-10T07:56:52.000Z (7 months ago)
- Last Synced: 2024-06-10T10:52:02.148Z (7 months ago)
- Language: Rust
- Homepage:
- Size: 266 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Gopher Web Crawler
Gopher web crawler implemented in Rust for COMP3310 Computer Networks at the ANU.
## Requirements
See the [Installation section](https://doc.rust-lang.org/book/ch01-01-installation.html) of The Rust Programming Language book for installation steps.

The project uses the following crates:
* `chrono`: For date-time functionality.
* `debug_print`: For print functions which only trigger in debug mode.

All networking functionality was done using standard library imports.
The Gopher crawler has been successfully tested on Linux and Windows.
## Usage
The usage for the program is:
```
gopher [-n <server_name>] [-p <server_port>] [-d]
```
Where
* `-n` specifies the name of the server to crawl
* `-p` specifies the port of the server to crawl
* `-d` indicates that the output directory `out` should **not** be deleted

The default values are `server_name=comp3310.ddns.net` and `server_port=70`.
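As an illustration of the flag handling described above, here is a minimal sketch of an argument parser built only on the standard library. The `Config` struct, the `keep_output` field, and the `parse_args` function are hypothetical names chosen for this example; the crawler's real parsing code may be organised differently.

```rust
/// Crawl settings taken from the command line (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Config {
    server_name: String,
    server_port: u16,
    /// True when `-d` is given, i.e. the `out` directory is kept.
    keep_output: bool,
}

/// Parses `[-n <server_name>] [-p <server_port>] [-d]`, starting from the
/// default values stated in this README.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Config, String> {
    let mut config = Config {
        server_name: String::from("comp3310.ddns.net"),
        server_port: 70,
        keep_output: false,
    };
    let mut args = args.into_iter();
    while let Some(flag) = args.next() {
        match flag.as_str() {
            "-n" => {
                config.server_name = args.next().ok_or("-n requires a server name")?;
            }
            "-p" => {
                let raw = args.next().ok_or("-p requires a port number")?;
                config.server_port = raw
                    .parse()
                    .map_err(|_| format!("invalid port: {}", raw))?;
            }
            "-d" => config.keep_output = true,
            other => return Err(format!("unknown argument: {}", other)),
        }
    }
    Ok(config)
}
```

A caller would typically pass `std::env::args().skip(1)` so that the program name is not treated as a flag.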
To run the program in debug mode use
```
cargo run -- [-n <server_name>] [-p <server_port>] [-d]
```
in the root directory. In debug mode, the program will print additional information and error messages. To run the program in release mode use
```
cargo run --release -- [-n <server_name>] [-p <server_port>] [-d]
```
This will only print request information and the final crawl report.
## Project Structure
```
├── Cargo.lock
├── Cargo.toml
├── imgs
│   └── wireshark-convo.png
├── README.md
└── src
    ├── crawler.rs
    ├── gopher
    │   ├── request.rs
    │   └── response.rs
    ├── gopher.rs
    └── main.rs
```
## External Servers
An external server is any referenced server that is on a different host or port to the default server. `comp3310.ddns.net:70` references two external servers. Further details can be found in the crawler report.
## Invalid References
Files only contribute to the file count and file statistics if the Gopher transaction completed successfully. Responses that timed out are not deemed successful transactions.

As per RFC 1436, text file and directory item types should be terminated with the last line `'.'CR-LF`. If the last line is missing, the transaction is not counted as successful.
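The terminator rule can be sketched as a small check on the raw response bytes. This is only an illustration: `strip_terminator` is a hypothetical helper, and treating a response that consists solely of the terminator as an empty payload is an assumption, not something the README states about the crawler.

```rust
/// Returns the payload without the final `.` CR-LF line, or `None` when the
/// terminator is missing (hypothetical helper).
fn strip_terminator(response: &[u8]) -> Option<&[u8]> {
    const LAST_LINE: &[u8] = b".\r\n";
    let payload = response.strip_suffix(LAST_LINE)?;
    // The dot must sit on a line of its own: either the response is only the
    // terminator, or the line before it ended with CR-LF.
    if payload.is_empty() || payload.ends_with(b"\r\n") {
        Some(payload)
    } else {
        None
    }
}
```

Under this sketch, a response that ends without the `.` line yields `None`, so the reference would be reported as invalid rather than counted as a file.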
The crawler identified 5 problematic internal references which had to be dealt with explicitly. The full details can be found in the crawler report.
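One way to guard against references that never finish or never stop sending data is to bound every read. The sketch below is an assumption about how such a guard could look, not the crawler's actual code: `FetchError`, `read_capped`, and `max_bytes` are hypothetical names, and the README does not state the size limit or timeout the crawler uses.

```rust
use std::io::{ErrorKind, Read};

/// Reasons a transfer is abandoned (hypothetical type).
#[derive(Debug, PartialEq)]
enum FetchError {
    TimedOut,
    TooLong,
    Io(ErrorKind),
}

/// Reads until the server closes the connection, giving up once more than
/// `max_bytes` have arrived or a read times out.
fn read_capped<R: Read>(mut reader: R, max_bytes: usize) -> Result<Vec<u8>, FetchError> {
    let mut body = Vec::new();
    let mut chunk = [0u8; 4096];
    loop {
        match reader.read(&mut chunk) {
            // Zero bytes means end of stream.
            Ok(0) => return Ok(body),
            Ok(n) => {
                body.extend_from_slice(&chunk[..n]);
                if body.len() > max_bytes {
                    return Err(FetchError::TooLong);
                }
            }
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            // A socket read timeout typically surfaces as WouldBlock on Unix
            // and may surface as TimedOut on Windows.
            Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
                return Err(FetchError::TimedOut)
            }
            Err(e) => return Err(FetchError::Io(e.kind())),
        }
    }
}
```

With a `std::net::TcpStream`, calling `set_read_timeout(Some(duration))` before handing the stream to `read_capped` would turn a stalled server into a `TimedOut` error instead of a hang.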
## Crawler Report
The final crawler report for `comp3310.ddns.net:70` is shown below.
```
START CRAWLER REPORT

Number of Gopher directories: 41
comp3310.ddns.net:70:
comp3310.ddns.net:70: /acme
comp3310.ddns.net:70: /acme/products
comp3310.ddns.net:70: /acme/products/traps
comp3310.ddns.net:70: /maze/17
comp3310.ddns.net:70: /maze/18
comp3310.ddns.net:70: /maze/19
comp3310.ddns.net:70: /maze/20
comp3310.ddns.net:70: /maze/21
comp3310.ddns.net:70: /maze/22
comp3310.ddns.net:70: /maze/23
comp3310.ddns.net:70: /misc
comp3310.ddns.net:70: /misc/empty
comp3310.ddns.net:70: /misc/malformed1
comp3310.ddns.net:70: /misc/more
comp3310.ddns.net:70: /misc/nesta
comp3310.ddns.net:70: /misc/nestb
comp3310.ddns.net:70: /misc/nestc
comp3310.ddns.net:70: /misc/nestd
comp3310.ddns.net:70: /misc/neste
comp3310.ddns.net:70: /misc/nestf
comp3310.ddns.net:70: /misc/nestg
comp3310.ddns.net:70: /misc/nesth
comp3310.ddns.net:70: /misc/nesti
comp3310.ddns.net:70: /misc/nestj
comp3310.ddns.net:70: /misc/nestk
comp3310.ddns.net:70: /misc/nestl
comp3310.ddns.net:70: /misc/nestm
comp3310.ddns.net:70: /misc/nestn
comp3310.ddns.net:70: /misc/nesto
comp3310.ddns.net:70: /misc/nestp
comp3310.ddns.net:70: /misc/nestq
comp3310.ddns.net:70: /misc/nestr
comp3310.ddns.net:70: /misc/nests
comp3310.ddns.net:70: /misc/nestt
comp3310.ddns.net:70: /misc/nestu
comp3310.ddns.net:70: /misc/nestv
comp3310.ddns.net:70: /misc/nestw
comp3310.ddns.net:70: /misc/nestx
comp3310.ddns.net:70: /misc/nesty
comp3310.ddns.net:70: /misc/nonexistent

Number of simple text files: 11
comp3310.ddns.net:70: /acme/about
comp3310.ddns.net:70: /acme/contact
comp3310.ddns.net:70: /acme/products/anvils
comp3310.ddns.net:70: /acme/products/paint
comp3310.ddns.net:70: /acme/products/pianos
comp3310.ddns.net:70: /maze/floppy
comp3310.ddns.net:70: /maze/statuette
comp3310.ddns.net:70: /misc/empty.txt
comp3310.ddns.net:70: /misc/loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong
comp3310.ddns.net:70: /misc/nestz
comp3310.ddns.net:70: /rfc1436.txt

Number of binary files: 2
comp3310.ddns.net:70: /misc/binary
comp3310.ddns.net:70: /misc/encabulator.jpeg

Smallest text file: comp3310.ddns.net:70: /misc/empty.txt
Size: 0 bytes
Contents:

Size of the largest text file: 37393 bytes
comp3310.ddns.net:70: /rfc1436.txt

Size of the smallest binary file: 253 bytes
comp3310.ddns.net:70: /misc/binary

Size of the largest binary file: 45584 bytes
comp3310.ddns.net:70: /misc/encabulator.jpeg

The number of unique invalid references (error types): 2
List of external servers:
comp3310.ddns.net:71 did not connect
gopher.floodgap.com:70 connected successfully

References that have issues/errors:
Connection timed out comp3310.ddns.net:70 /misc/godot
Connection timed out comp3310.ddns.net:70 /misc/tarpit
File too long comp3310.ddns.net:70 /misc/firehose
Malformed response line 1Some menu - but on what host??? /misc/malformed1/file
Missing end-line comp3310.ddns.net:70 /misc/malformed2

END CRAWLER REPORT
```
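The external-server list and the malformed-line entry in the report both depend on how directory listings are parsed. The following sketch shows one possible parser for RFC 1436 menu lines (item type and display string, then selector, host, and port, separated by tabs) together with the external-server test described earlier. `MenuItem`, `parse_menu_line`, and `is_external` are hypothetical names, and comparing hosts case-insensitively is an assumption made for this example.

```rust
/// One entry of a Gopher directory listing (hypothetical struct).
#[derive(Debug, PartialEq)]
struct MenuItem {
    item_type: char,
    display: String,
    selector: String,
    host: String,
    port: u16,
}

/// Parses `<type><display>TAB<selector>TAB<host>TAB<port>`. Returns `None`
/// for lines with missing fields, an empty host, or a non-numeric port.
fn parse_menu_line(line: &str) -> Option<MenuItem> {
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let mut chars = line.chars();
    let item_type = chars.next()?;
    // Everything after the type character is tab-separated.
    let mut fields = chars.as_str().split('\t');
    let display = fields.next()?.to_string();
    let selector = fields.next()?.to_string();
    let host = fields.next()?.to_string();
    let port = fields.next()?.trim().parse::<u16>().ok()?;
    if host.is_empty() {
        return None;
    }
    Some(MenuItem { item_type, display, selector, host, port })
}

/// A reference is external when its host or port differs from the server
/// being crawled.
fn is_external(item: &MenuItem, home_host: &str, home_port: u16) -> bool {
    !item.host.eq_ignore_ascii_case(home_host) || item.port != home_port
}
```

With `comp3310.ddns.net:70` as the home server, items pointing at `comp3310.ddns.net:71` or `gopher.floodgap.com:70` would both be classified as external, while a line that carries a display string and selector but no host or port would be rejected as malformed.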