https://github.com/andrejewski/slinky
web crawler just for links
https://github.com/andrejewski/slinky
Last synced: 10 months ago
JSON representation
web crawler just for links
- Host: GitHub
- URL: https://github.com/andrejewski/slinky
- Owner: andrejewski
- License: isc
- Created: 2014-08-03T19:25:35.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-08-07T19:43:58.000Z (over 11 years ago)
- Last Synced: 2025-04-05T00:02:40.344Z (10 months ago)
- Language: JavaScript
- Size: 141 KB
- Stars: 11
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Slinky
======
Slinky is a web crawler, but just for the links between webpages. Slinky is intended to be used to visualize the routes and structure behind a website by collecting hyperlinks.
If you decide to print out the source code and drop it down a flight of stairs, you may not be disappointed either.
## Installation
```bash
npm install slinky
```
## Usage
Slinky is straightforward to use. Give Slinky a URL and it will index the webpages in that domain.
```javascript
var slinky = require('slinky');
slinky.index('http://example.com', function(error, links) {
if(error) throw error;
Array.isArray(links); // true
console.dir(links);
/*
[
"http://example.com/",
"http://example.com/about.html",
...
]
*/
});
```
## Slinky Class
Slinky is a class that accepts optional configuration options.
```javascript
var Slinky = require('slinky').Slinky;
new Slinky({ // `new` is optional
// default options
limit: 100, // limit the number of links returned
depth: 3, // limit recursion of the index
restrict: true, // limit indexing to the domain of the url
concurrency: 5 // how many async.queue workers to use
});
```
### Slinky#index()
- `#index(
url String,
done Callback(error Error, links Array[String]))`
- `#index(
url String,
each Callback(link String),
done Callback(error Error, links Array[String]))`
The `each` callback will receive each scraped link as they are processed. This is a method of streaming the links instead of waiting for the `done` callback.
The `#index()` is the only method that actually does anything. The other methods of the Slinky class are exposed purely for customization of Slinky.
While the source is there to be read, some overridable methods to note are `#scrapeLinks()` if anchor tags are not what you are targeting and `#validResponse()` if webpages do not have to be HTML. Again, everything is configurable.
## Contributing
Contributions are incredibly welcome as long as they are standardly applicable and pass the tests (or break bad ones). Tests are written in Mocha and assertions are done with the Node.js core `assert` module.
```bash
# running tests
npm run test
npm run test-spec # spec reporter
```
Follow me on [Twitter](https://twitter.com/compooter) for updates or just for the lolz and please check out my other [repositories](https://github.com/andrejewski) if I have earned it. I thank you for reading.