https://github.com/bitliner/generator-html-parser
Generates the basic structure of an HTML parser for scraping purposes.
- Host: GitHub
- URL: https://github.com/bitliner/generator-html-parser
- Owner: bitliner
- License: mit
- Created: 2014-10-16T23:53:01.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2018-08-05T18:07:15.000Z (over 6 years ago)
- Last Synced: 2025-01-19T01:47:03.388Z (5 days ago)
- Language: JavaScript
- Homepage:
- Size: 62.5 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# generator-html-parser
A generator for [Yeoman](http://yeoman.io).
It generates the basic structure of an HTML parser for Node.js.
This is useful as a starting point if you are scraping with Node.js.
## Getting Started
### How to install it
To install generator-html-parser from npm, run:
```
$ npm install -g generator-html-parser
```
### How to use it
1. `mkdir facebook-html-parser && cd $_`
2. `yo html-parser`

That's it!
### How to customize it to parse any html string you need
The main file is `-html-parser.js`.
It contains two methods:
1. `parse(html,url)`: takes the HTML (string) to parse and the page URL (string). The URL is useful if you need to resolve relative URLs with the Node.js module *Url* (already imported).
2. `getNextPages(html,url)`: returns the URLs of the next pages to visit. This is usually useful when you are scraping a list of pages. It takes the same input: the HTML (string) to parse and the URL (string) used to resolve any relative URLs extracted from the HTML.
### Test
The generated code contains code for testing as well.
Have a look at the folder `test/`.
### Details of implementation
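It uses [cheerio](https://www.npmjs.org/package/cheerio) to parse the HTML.
Cheerio offers a jQuery-like API for server-side use. It works on a parsed document rather than a browser DOM, which generally makes it faster than running jQuery on an emulated DOM.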
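```
var cheerio = require('cheerio');
var $ = cheerio.load(html); // html: the string passed to parse()
var result = [];
// Collect the text of every element with class "item"
$('.item').each(function() {
  var el = $(this);
  result.push(el.text());
});
```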
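Pages often contain relative links, so the values that `parse` or `getNextPages` extract usually need to be resolved against the page URL. Below is a minimal sketch of that step, assuming the `URL` class exported by Node's built-in `url` module. The helper name `resolveLinks` and the example URLs are hypothetical and are not part of the generated code.

```javascript
// Hypothetical helper (not part of the generated code): resolve hrefs
// extracted from a page against the URL the page was fetched from.
const { URL } = require('url');

function resolveLinks(hrefs, pageUrl) {
  return hrefs.map(function (href) {
    // new URL(input, base) accepts absolute, root-relative and relative hrefs
    return new URL(href, pageUrl).href;
  });
}

// Example input: hrefs as they might be collected with $('a').attr('href')
const links = resolveLinks(
  ['/list?page=2', 'item/42', 'https://example.org/about'],
  'https://example.com/list/index.html'
);
console.log(links);
```

Absolute hrefs pass through unchanged, so the same helper can be applied to every link found on a page. Note that `new URL` throws on input it cannot parse, so real scraping code may want to wrap the call in a `try`/`catch`.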
## License
[MIT License](http://en.wikipedia.org/wiki/MIT_License)