Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pbojinov/mastaphora
A web crawler for finding the size of chunks of the web
Last synced: 5 days ago
- Host: GitHub
- URL: https://github.com/pbojinov/mastaphora
- Owner: pbojinov
- License: mit
- Created: 2013-07-25T04:24:18.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2013-12-10T17:41:19.000Z (almost 11 years ago)
- Last Synced: 2024-10-12T03:09:14.204Z (about 1 month ago)
- Language: Java
- Size: 371 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Mastaphora Web Crawler
## About
A web crawler for finding the size of chunks of the web.
Mastaphora will find the size of a site and then follow all external links (`a href` targets), up to the link depth provided.
The size of a site is calculated by adding up the file sizes of its HTML, CSS, JS, and images.
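The crawl described above (sum a site's size, then follow external links up to a depth limit) is essentially a breadth-first traversal. The sketch below is a hypothetical illustration of that idea, not Mastaphora's actual implementation: the link graph and page sizes are stubbed in-memory maps so it runs without network access or jsoup.

```java
import java.util.*;

public class CrawlSketch {
    // Sum the sizes of all sites reachable within linkDepth hops of root.
    // sizes: URL -> page size in bytes; links: URL -> external links on that page.
    static long crawl(String root, int linkDepth,
                      Map<String, Long> sizes, Map<String, List<String>> links) {
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(root);
        visited.add(root);
        long total = 0;
        // Process one depth level per iteration, stopping at linkDepth.
        for (int depth = 0; depth <= linkDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                total += sizes.getOrDefault(url, 0L);
                for (String out : links.getOrDefault(url, List.of())) {
                    if (visited.add(out)) next.add(out);  // skip already-seen sites
                }
            }
            frontier = next;
        }
        return total;
    }

    // Small stubbed "web" for demonstration: a -> b -> c.
    static long demo(int linkDepth) {
        Map<String, Long> sizes = Map.of("a", 100L, "b", 200L, "c", 400L);
        Map<String, List<String>> links = Map.of("a", List.of("b"), "b", List.of("c"));
        return crawl("a", linkDepth, sizes, links);
    }

    public static void main(String[] args) {
        System.out.println(demo(0)); // just "a" -> 100
        System.out.println(demo(1)); // "a" + "b" -> 300
        System.out.println(demo(2)); // all three -> 700
    }
}
```

The level-by-level frontier is what makes the depth cutoff cheap to enforce: each outer iteration corresponds to exactly one hop from the root.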
## Compiling
```
javac -cp jsoup.jar Mastaphora.java
```
## Usage
```
java -cp .:jsoup.jar Mastaphora http://google.com 0
```
arg[0] = site, the root site to start your crawl on
arg[1] = linkDepth, the depth to crawl from the root. Passing in 0 will crawl only the current site.
## Example
```
java -cp .:jsoup.jar Mastaphora http://google.com 1
```
Mastaphora will find the size of Google and all external URLs on Google.
## Heads Up
A link depth greater than 2 on most sites will result in a huge amount of content to crawl, as the number of sites grows exponentially.
Eg. It took me 45+ minutes to crawl Facebook with a link depth of 4.
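The exponential growth is easy to estimate: if every page exposes roughly b external links, a crawl of depth d touches on the order of b^d sites. A quick back-of-the-envelope check, using illustrative numbers (not measurements from Mastaphora):

```java
public class GrowthEstimate {
    // Rough upper bound on pages visited: 1 + b + b^2 + ... + b^depth,
    // assuming every page exposes `branching` new external links.
    static long pagesVisited(long branching, int depth) {
        long total = 0, level = 1;
        for (int k = 0; k <= depth; k++) {
            total += level;       // pages at this depth level
            level *= branching;   // next level fans out by the branching factor
        }
        return total;
    }

    public static void main(String[] args) {
        // With ~20 external links per page, depth 2 is already ~421 pages,
        // and depth 4 balloons past 168,000.
        System.out.println(pagesVisited(20, 2)); // 1 + 20 + 400 = 421
        System.out.println(pagesVisited(20, 4)); // 168421
    }
}
```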
## Author
@pbojinov
## License
The MIT License, do as you please :)
[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/pbojinov/mastaphora/trend.png)](https://bitdeli.com/free "Bitdeli Badge")