https://github.com/reecejohnson/web-crawler

A command-line application to crawl all internal links for a specified URL. Built with: Java, Spring, JUnit, Mockito 🌍🕸.

# Web Crawler 🌎🕸
A command-line application that crawls all internal links for a specified URL and prints each visited URL to the console, together with the list of links found on that page.

![Example output file](src/main/resources/example.png)
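The crawl described above can be pictured as a breadth-first walk that records, for each visited page, the links found on it. The following is a minimal sketch under stated assumptions, not the project's actual implementation: `CrawlSketch`, `crawl`, and `linkFinder` are hypothetical names, and `linkFinder` stands in for "download the page and extract its internal links" so the example runs offline.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a breadth-first crawl; not taken from the repository. */
final class CrawlSketch {

    private CrawlSketch() {}

    /**
     * Visits every page reachable from startUrl and returns each visited URL
     * mapped to the links found on that page, in visiting order.
     * linkFinder is an assumed stand-in for fetching and parsing a page.
     */
    static Map<String, List<String>> crawl(String startUrl,
                                           Function<String, List<String>> linkFinder) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();

        seen.add(startUrl);
        queue.add(startUrl);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            List<String> links = linkFinder.apply(url);
            results.put(url, links);
            for (String link : links) {
                // Set.add returns false for URLs already queued, so no page is visited twice
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a small site (made-up paths)
        Map<String, List<String>> site = Map.of(
                "/", List.of("/about", "/blog"),
                "/about", List.of("/"),
                "/blog", List.of("/about", "/blog/post-1"),
                "/blog/post-1", List.of("/blog"));

        crawl("/", url -> site.getOrDefault(url, List.of()))
                .forEach((url, links) -> System.out.println(url + " -> " + links));
    }
}
```

The `seen` set is what keeps the walk finite on sites whose pages link back to each other.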

### Rules
- The crawler follows internal links only; external links are not followed
- Pre-built web-scraping frameworks may not be used
- Smaller libraries are permitted (e.g. for HTML parsing)
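One way to read the first rule is "same host as the base URL". The sketch below assumes that definition, which this README does not spell out; `InternalLinkFilter` and `isInternal` are hypothetical names, and only `java.net.URI` from the JDK is used.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: decides whether a link stays on the site being crawled. */
final class InternalLinkFilter {

    private InternalLinkFilter() {}

    /**
     * Resolves href against baseUrl and reports whether the result is an
     * http(s) URL on the same host. Assumption: "internal" means "same host",
     * so subdomains count as external here.
     */
    static boolean isInternal(String baseUrl, String href) {
        try {
            URI base = new URI(baseUrl);
            URI resolved = base.resolve(new URI(href.trim()));

            String scheme = resolved.getScheme();
            boolean isWeb = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            if (!isWeb) {
                return false; // mailto:, tel:, javascript: and similar are never crawled
            }

            String baseHost = base.getHost();
            return baseHost != null && baseHost.equalsIgnoreCase(resolved.getHost());
        } catch (URISyntaxException | IllegalArgumentException e) {
            return false; // malformed links are treated as not crawlable
        }
    }

    public static void main(String[] args) {
        String base = "https://www.url-to-crawl.com";
        System.out.println(isInternal(base, "/about"));
        System.out.println(isInternal(base, "https://example.org/"));
    }
}
```

A stricter or looser rule (for example, treating subdomains as internal) would only change the host comparison at the end of `isInternal`.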

## Run Tests
`./gradlew test`

## Run Application
Run the application with two arguments, in this order:
- The base URL to crawl
- The number of threads to run concurrently

`./gradlew run --args='https://www.url-to-crawl.com 4'`
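To illustrate how the two arguments might be consumed, here is a minimal sketch that validates them and sizes a fixed thread pool from the second one. It is an assumption about the general shape, not the repository's code: `CrawlerArgs`, `parse`, and the error messages are hypothetical.

```java
import java.net.URI;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical holder for the two command-line arguments. */
final class CrawlerArgs {
    final URI baseUrl;
    final int threadCount;

    private CrawlerArgs(URI baseUrl, int threadCount) {
        this.baseUrl = baseUrl;
        this.threadCount = threadCount;
    }

    /** Expects exactly: base URL, then thread count. Throws IllegalArgumentException otherwise. */
    static CrawlerArgs parse(String... args) {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: <base-url> <thread-count>");
        }
        URI baseUrl = URI.create(args[0]); // throws IllegalArgumentException on bad syntax
        if (baseUrl.getHost() == null) {
            throw new IllegalArgumentException("Base URL must include a host: " + args[0]);
        }
        int threadCount;
        try {
            threadCount = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Thread count must be a whole number: " + args[1], e);
        }
        if (threadCount < 1) {
            throw new IllegalArgumentException("Thread count must be at least 1");
        }
        return new CrawlerArgs(baseUrl, threadCount);
    }

    public static void main(String[] args) {
        CrawlerArgs parsed = parse(args);
        // A fixed pool caps the number of pages processed at the same time
        ExecutorService pool = Executors.newFixedThreadPool(parsed.threadCount);
        try {
            pool.execute(() -> System.out.println(
                    "Would crawl " + parsed.baseUrl + " using " + parsed.threadCount + " threads"));
        } finally {
            pool.shutdown(); // already-submitted tasks still run; no new ones are accepted
        }
    }
}
```

Failing fast on bad input keeps a typo in the thread count from surfacing later as a confusing pool error.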

## Output
Running the application produces an HTML file of the crawl results at `/output/results.html`
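As a rough idea of what producing such a file can involve, the sketch below renders a map of visited URLs and their links as HTML and writes it to a path chosen by the caller. The markup, the `HtmlReport` class, and its method names are all assumptions; the real layout of `results.html` is not described here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

/** Hypothetical report writer; the markup is illustrative only. */
final class HtmlReport {

    private HtmlReport() {}

    /** Escapes characters that would otherwise be read as HTML markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Renders one heading per visited URL, followed by a list of its links. */
    static String render(Map<String, List<String>> results) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">")
            .append("<title>Crawl results</title></head><body>\n");
        for (Map.Entry<String, List<String>> page : results.entrySet()) {
            html.append("<h2>").append(escape(page.getKey())).append("</h2>\n<ul>\n");
            for (String link : page.getValue()) {
                html.append("  <li>").append(escape(link)).append("</li>\n");
            }
            html.append("</ul>\n");
        }
        return html.append("</body></html>\n").toString();
    }

    /** Writes the report as UTF-8, creating parent directories if needed. */
    static void write(Map<String, List<String>> results, Path file) throws IOException {
        Path parent = file.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(file, render(results).getBytes(StandardCharsets.UTF_8));
    }
}
```

Escaping matters because crawled URLs often contain `&` in query strings, which would otherwise produce invalid markup.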

## Questions & Queries
📩 [[email protected]](mailto:[email protected]?subject=Web%20Crawler)