https://github.com/wagoodman/simplewebcrawler
A simple Java web crawler to visualize relations between web pages.
- Host: GitHub
- URL: https://github.com/wagoodman/simplewebcrawler
- Owner: wagoodman
- Created: 2014-08-30T19:04:09.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2014-08-30T19:04:15.000Z (about 11 years ago)
- Last Synced: 2025-01-23T16:23:18.229Z (9 months ago)
- Language: Java
- Size: 129 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Simple Web Crawler
==================

Given a web page, this finds all links in that page and visits them. Be careful though! It
blindly follows every link it finds and visits as many pages as possible.

```
Compiler Compliance Level: 1.6 (Oracle JDK 1.6)
IDE Used: Eclipse Kepler for Java Developers (Build 20130614-0229)
```

Import the Project
------------------
1) Open Eclipse
2) File > New > Java Project
3) Unselect 'Use Default Location' and enter the path to the SimpleWebCrawler directory
4) Select all projects from the detected project list
5) Click Finish

Usage
-----
1) In Eclipse, right-click "CrawlerApp.java" > Run As > Java Application
2) Input the following items:
- Search Mechanism: BreadthFirst, BestFirst, DepthFirst
- Seed URL
- Threads to use
- Maximum search depth
- Maximum number of URLs to visit
3) Click the button to start

A tree will form in the first tab showing all URL relations; you can drag and zoom in/out.
A table and chart will be shown in the second tab with the details about each URL.
Progress is shown on the bottom bar with the search status in the bottom right corner.
To start another search, close the program and reopen it.
Note: when using multiple threads, the thread group needs time to discover existing nodes. Give the
program about 15 seconds to get all threads in sync.
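
To illustrate how the search inputs listed under Usage interact, here is a minimal sketch of a bounded crawl over an in-memory link graph. It is not taken from the project's source: `CrawlOrderSketch`, `crawl`, `sampleGraph`, and the `Node` class are hypothetical names, the graph stands in for real page fetching, and it runs on a single thread. BestFirst is left out because this README does not say how that mode ranks pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the crawler's actual classes.
final class CrawlOrderSketch {

    /** One frontier entry: a URL plus its link distance from the seed. */
    static final class Node {
        final String url;
        final int depth;
        Node(String url, int depth) { this.url = url; this.depth = depth; }
    }

    /** Tiny stand-in for the web: each page maps to the links found on it. */
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        links.put("c", Arrays.asList("e"));
        return links;
    }

    /**
     * Visits pages starting at the seed. The frontier is a FIFO queue for
     * breadth-first order and a LIFO stack for depth-first order. Links on
     * pages at maxDepth are not followed, and the crawl stops once maxUrls
     * pages have been visited.
     */
    static List<String> crawl(Map<String, List<String>> links, String seed,
                              boolean breadthFirst, int maxDepth, int maxUrls) {
        List<String> visited = new ArrayList<String>();
        Set<String> seen = new HashSet<String>(); // URLs already queued
        Deque<Node> frontier = new ArrayDeque<Node>();
        frontier.addLast(new Node(seed, 0));
        seen.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxUrls) {
            Node current = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            visited.add(current.url);
            if (current.depth >= maxDepth) {
                continue; // depth limit reached: do not expand this page
            }
            List<String> outgoing = links.get(current.url);
            if (outgoing == null) {
                continue; // page has no known links
            }
            for (String next : outgoing) {
                if (seen.add(next)) { // add() returns false for duplicates
                    frontier.addLast(new Node(next, current.depth + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = sampleGraph();
        System.out.println(crawl(links, "a", true, 2, 10));  // [a, b, c, d, e]
        System.out.println(crawl(links, "a", false, 2, 10)); // [a, c, e, b, d]
        System.out.println(crawl(links, "a", true, 1, 10));  // [a, b, c]
    }
}
```

In this sketch the only difference between the two modes is which end of the frontier the next page is taken from. Lowering the maximum depth or the maximum URL count cuts the visit list short, which is the safeguard against the blind link-following described at the top of this README.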