https://github.com/reecejohnson/web-crawler
A command-line application to crawl all internal links for a specified URL. Built with: Java, Spring, JUnit, Mockito 🌍🕸.
Last synced: 14 days ago
- Host: GitHub
- URL: https://github.com/reecejohnson/web-crawler
- Owner: reecejohnson
- Created: 2021-05-07T06:28:07.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-06-23T09:52:31.000Z (over 3 years ago)
- Last Synced: 2024-01-13T03:27:43.999Z (about 1 year ago)
- Language: Java
- Homepage:
- Size: 510 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web Crawler 🌎🕸
A command-line application that crawls all internal links for a specified URL and prints each URL visited, together with the list of links found on that page, to the console.

![Example output file](src/main/resources/example.png)
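As a rough illustration of what "internal" could mean here, the sketch below resolves a link against the page it was found on and compares hosts. `InternalLinks` and `isInternal` are hypothetical names, not taken from this project's source, and treating "same host" as "internal" is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; not part of this repository. */
final class InternalLinks {

    private InternalLinks() {}

    /**
     * Resolves href against the URL of the page it was found on and reports
     * whether the result has the same host as that page (case-insensitive).
     */
    static boolean isInternal(String pageUrl, String href) {
        try {
            URI page = new URI(pageUrl);
            // Relative links such as "/about" become absolute here.
            URI resolved = page.resolve(href.trim());
            String pageHost = page.getHost();
            String linkHost = resolved.getHost();
            // mailto:, javascript: and similar links have no host.
            return pageHost != null
                    && linkHost != null
                    && pageHost.equalsIgnoreCase(linkHost);
        } catch (URISyntaxException | IllegalArgumentException e) {
            // Malformed links are treated as not crawlable.
            return false;
        }
    }
}
```

Under this assumption `www.example.com` and `example.com` count as different hosts; a real crawler may want to normalise that.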
### Rules
- The crawler follows internal links only; external links are not followed
- No pre-built web-scraping frameworks may be used
- Smaller libraries are permitted (e.g. HTML parsing)

## Run Tests
`./gradlew test`

## Run Application
Run the application with two arguments, in this order:
- The base URL to crawl
- The number of threads to run concurrently

`./gradlew run --args='https://www.url-to-crawl.com 4'`
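The two arguments above could be validated and turned into a fixed-size worker pool roughly as follows. This is a minimal sketch: `CrawlerArgs` is a hypothetical class, and both the validation rules and the use of `Executors.newFixedThreadPool` are assumptions rather than the project's actual implementation.

```java
import java.net.URI;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical holder for the two command-line arguments. */
final class CrawlerArgs {
    final URI baseUrl;
    final int threads;

    private CrawlerArgs(URI baseUrl, int threads) {
        this.baseUrl = baseUrl;
        this.threads = threads;
    }

    /** Expects exactly: base URL, then thread count. */
    static CrawlerArgs parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: <base-url> <thread-count>");
        }
        // URI.create throws IllegalArgumentException on malformed input.
        URI baseUrl = URI.create(args[0]);
        if (baseUrl.getHost() == null) {
            throw new IllegalArgumentException("Base URL must be absolute: " + args[0]);
        }
        int threads;
        try {
            threads = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Thread count must be an integer: " + args[1], e);
        }
        if (threads < 1) {
            throw new IllegalArgumentException("Thread count must be at least 1");
        }
        return new CrawlerArgs(baseUrl, threads);
    }

    /** One pool thread per requested concurrent worker; caller must shut it down. */
    ExecutorService newWorkerPool() {
        return Executors.newFixedThreadPool(threads);
    }
}
```

With the example command, `CrawlerArgs.parse(args)` would yield a base URL of `https://www.url-to-crawl.com` and a pool of four threads.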
## Output
Running the application produces an HTML file of the crawl results at `/output/results.html`.
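The layout of the results file is not described here. Purely as an illustration, a map from each visited URL to the links found on it could be rendered to HTML along these lines; `ResultsHtml`, its methods, and the markup are all assumptions, not the project's actual output format.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical renderer; the real results.html may look different. */
final class ResultsHtml {

    private ResultsHtml() {}

    /** Escapes the characters that are unsafe in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Emits one heading per visited page followed by a list of its links. */
    static String render(Map<String, List<String>> linksByPage) {
        StringBuilder html = new StringBuilder("<!DOCTYPE html>\n<html><body>\n");
        for (Map.Entry<String, List<String>> entry : linksByPage.entrySet()) {
            html.append("<h2>").append(escape(entry.getKey())).append("</h2>\n<ul>\n");
            for (String link : entry.getValue()) {
                html.append("  <li>").append(escape(link)).append("</li>\n");
            }
            html.append("</ul>\n");
        }
        return html.append("</body></html>\n").toString();
    }
}
```

The returned string could then be written to disk with `java.nio.file.Files.writeString` (Java 11 or later).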
## Questions & Queries
📩 [[email protected]](mailto:[email protected]?subject=Web%20Crawler)