https://github.com/tmpfs/wget-parser
Parses the wget spider output
- Host: GitHub
- URL: https://github.com/tmpfs/wget-parser
- Owner: tmpfs
- Created: 2016-02-10T08:22:03.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-02-10T09:43:35.000Z (over 9 years ago)
- Last Synced: 2025-02-05T11:06:50.396Z (5 months ago)
- Language: JavaScript
- Size: 28.3 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Table of Contents
=================

* [Spider parser](#spider-parser)
  * [Usage](#usage)
    * [wget-parser](#wget-parser)
    * [wget-spider](#wget-spider)
  * [Output](#output)
  * [Developer](#developer)
    * [Test](#test)
    * [Cover](#cover)
    * [Lint](#lint)
    * [Clean](#clean)
    * [Readme](#readme)

Spider parser
=============

[Build Status](https://travis-ci.org/tmpfs/wget-parser)
[npm version](https://npmjs.org/package/wget-parser)
[Coverage Status](https://coveralls.io/github/tmpfs/wget-parser?branch=master)

Parses the spider output from [wget](https://www.gnu.org/software/wget) into an object structure of links.
The resulting object can be processed further into a tree of the site's hierarchy, for example to generate a sitemap.

Tested using `wget v1.15` on Linux.

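The tree-building step mentioned above could look roughly like the sketch below. It assumes the result shape shown in the Output section (a `links` array whose items carry `url.host`, `url.pathname` and `link`). The `buildTree` helper and its node shape (`children`, `link`) are hypothetical and not part of wget-parser, and the sample data is hand-written rather than real spider output.

```javascript
// Sketch: group parsed links into a nested tree keyed by host, then by path segment.
// `buildTree` and the node shape ({children, link}) are assumptions for illustration.
function buildTree(result) {
  var root = {children: {}};
  result.links.forEach(function(item) {
    // start at the node for this host, creating it on first use
    var node = root.children[item.url.host]
      || (root.children[item.url.host] = {children: {}});
    // walk each non-empty path segment, creating nodes as needed
    item.url.pathname.split('/').filter(Boolean).forEach(function(segment) {
      node = node.children[segment]
        || (node.children[segment] = {children: {}});
    });
    // remember the original link on the node it resolves to
    node.link = item.link;
  });
  return root;
}

// hand-written sample in the shape shown under Output
var sample = {
  links: [
    {url: {host: 'example.com', pathname: '/'}, link: 'http://example.com/'},
    {url: {host: 'example.com', pathname: '/docs/api'}, link: 'http://example.com/docs/api'}
  ],
  broken: []
};

console.log(JSON.stringify(buildTree(sample), null, 2));
```

A sitemap generator could then walk `root.children` recursively; intermediate nodes that were never crawled directly simply have no `link` property.
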
## Usage
```javascript
var parser = require('wget-parser')
  , buf = Buffer.alloc(0); // buffer should contain the spider output
console.dir(parser(buf));
```

* `parser.Parser`: The parser class.
* `parser.Link`: The class that represents a link.
* `parser.ParseStream`: Parse stream class.

Streams support is available; see the [test spec](https://github.com/tmpfs/wget-parser/blob/master/test/spec/parser.js) for example usage.

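Once a result object is available, it can be inspected directly. The sketch below assumes only the `links` and `broken` arrays shown in the Output section and reuses the rule described for the `wget-parser` program (exit code 1 when any broken link is present). The `summarize` helper and the sample object are hypothetical and not part of the module.

```javascript
// Sketch: summarise a parse result and derive an exit code from it.
// `summarize` is an illustrative helper, not a wget-parser API.
function summarize(result) {
  var hosts = {};
  // count how many links were seen per host
  result.links.forEach(function(item) {
    hosts[item.url.host] = (hosts[item.url.host] || 0) + 1;
  });
  return {
    total: result.links.length,
    hosts: hosts,
    broken: result.broken.length,
    // non-zero when at least one broken link was reported
    exitCode: result.broken.length > 0 ? 1 : 0
  };
}

// hand-written sample; in practice this would be the value returned by parser(buf)
var sample = {
  links: [
    {url: {host: 'google.com'}, link: 'http://google.com/'},
    {url: {host: 'www.google.co.id'}, link: 'http://www.google.co.id/'}
  ],
  broken: []
};

console.log(summarize(sample));
```
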
### wget-parser
A program that reads from `stdin` and prints the result of the parse as JSON. It exits with code 1 if any broken links are found.
```
cat test/fixtures/mock.txt | wget-parser
cat test/fixtures/broken.txt | wget-parser; echo $?;
```

### wget-spider
A program that performs a spider with [wget](https://www.gnu.org/software/wget) and pipes the output to `wget-parser`:
```
wget-spider http://google.com
```

## Output
Example output from the parser:
```json
{
  "links": [
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "google.com",
        "port": null,
        "hostname": "google.com",
        "hash": null,
        "search": null,
        "query": null,
        "pathname": "/",
        "path": "/",
        "href": "http://google.com/"
      },
      "link": "http://google.com/",
      "line": "--2016-02-10 16:11:57-- http://google.com/"
    },
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "www.google.co.id",
        "port": null,
        "hostname": "www.google.co.id",
        "hash": null,
        "search": "?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "query": "gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "pathname": "/",
        "path": "/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "href": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
      },
      "link": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
      "line": "--2016-02-10 16:11:57-- http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
    }
  ],
  "broken": []
}
```

## Developer
### Test
To run the test suite:
```
npm test
```

### Cover

To generate code coverage, run:
```
npm run cover
```

### Lint
Run the source tree through [jshint](http://jshint.com) and [jscs](http://jscs.info):
```
npm run lint
```

### Clean
Remove generated files:
```
npm run clean
```

### Readme
To build the readme file from the partial definitions:
```
npm run readme
```

Generated by [mdp(1)](https://github.com/tmpfs/mdp).

[wget]: https://www.gnu.org/software/wget
[jshint]: http://jshint.com
[jscs]: http://jscs.info