https://github.com/titarenko/xstruct
Set of tools for structured data extraction from web.
- Host: GitHub
- URL: https://github.com/titarenko/xstruct
- Owner: titarenko
- Created: 2013-06-01T07:07:40.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2015-08-25T21:22:43.000Z (over 9 years ago)
- Last Synced: 2024-11-13T21:45:48.351Z (about 2 months ago)
- Language: JavaScript
- Homepage:
- Size: 579 KB
- Stars: 2
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# xstruct
A set of tools for structured data extraction from the web.
[![Build Status](https://secure.travis-ci.org/titarenko/xstruct.png?branch=master)](https://travis-ci.org/titarenko/xstruct) [![Coverage Status](https://coveralls.io/repos/titarenko/xstruct/badge.png)](https://coveralls.io/r/titarenko/xstruct)
[![NPM](https://nodei.co/npm/xstruct.png?downloads=true&stars=true)](https://nodei.co/npm/xstruct/)
## Installation
```bash
npm i xstruct --save
```
## Example
The following example extracts the comments from a [dou.ua forum](http://dou.ua/forum) topic.
```js
var $ = require('xstruct');

// Download the topic page; the promise resolves with cheerio-wrapped HTML.
$.getHtml('http://dou.ua/forums/topic/14416/')
  .then(function (html) {
    // Collect the raw fields of every comment on the page.
    return html('.b-comment').map(function () {
      var el = $.wrapHtml(this);
      return {
        author: el.find('.avatar').text(),
        time: el.find('.comment-link').text(),
        text: el.find('.text').contents().map(function () {
          return $.wrapHtml(this).text();
        }).get()
      };
    }).toArray();
  })
  // Normalize the text of each extracted comment.
  .map(function (post) {
    return {
      author: $.cleanText(post, 'author'),
      time: $.cleanText(post, 'time'),
      text: $.cleanText(post, 'text', { singleline: true })
    };
  })
  // Print the cleaned comments, or the error if anything failed.
  .done(console.log, console.log);
```
## Description
### getHtml(url[, qs][, encoding])
Returns a promise for the downloaded, cheerio-wrapped HTML. If `encoding` is specified, the document is converted before it is passed to cheerio. If `qs` (a query string object) is specified, it is appended to `url` as a query string.
### getJson(url[, qs])
Returns a promise for the downloaded and parsed JSON. If `qs` (a query string object) is specified, it is appended to `url` as a query string.
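To make the `qs` argument of `getHtml` and `getJson` concrete, here is a minimal sketch of how a query string object is commonly serialized and appended to a URL. `appendQuery` is a hypothetical helper written for this illustration with the standard `URL` API. It is not part of xstruct, and the library's exact encoding may differ.

```js
// Hypothetical helper (not part of xstruct): appends the keys of a
// query string object to a URL, the way a `qs` argument is commonly handled.
function appendQuery(url, qs) {
  var parsed = new URL(url);
  Object.keys(qs || {}).forEach(function (key) {
    // Values are stringified and URL-encoded by searchParams.
    parsed.searchParams.append(key, String(qs[key]));
  });
  return parsed.toString();
}

console.log(appendQuery('http://example.com/search', { q: 'xstruct', page: 2 }));
```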
### postForm(url, form)
Returns a promise for the result of posting the form. Activates cookie persistence.
### request(options)
Promise-returning version of the root function of `request`.
### wrapHtml(cheerioElement)
Calls `cheerio(cheerioElement)` and returns the result synchronously.
### format
Alias for `util.format`.
### cleanText(obj, path[, options])
Takes the text found in `obj` at `path` and cleans it: removes leading and trailing spaces, collapses repeated spaces and repeated periods, converts the text to a single line if `options.singleline` is specified, and removes every character listed in `options.remove` (if specified). Returns `null` if the result is an empty string or if there is no value.
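The cleaning steps described above can be sketched in plain JavaScript. This is a hypothetical re-implementation for illustration, not xstruct's code: the name `cleanTextSketch`, the dot-separated path syntax, and the order of the steps are all assumptions.

```js
// Hypothetical re-implementation for illustration only; xstruct's real
// cleanText may differ in details such as path syntax and edge cases.
function cleanTextSketch(obj, path, options) {
  options = options || {};

  // Assumption: `path` is a dot-separated list of property names.
  var value = path.split('.').reduce(function (current, key) {
    return current == null ? current : current[key];
  }, obj);
  if (value == null) {
    return null;
  }

  var text = String(value);

  // Drop every character listed in options.remove.
  if (options.remove) {
    text = text.split('').filter(function (ch) {
      return options.remove.indexOf(ch) === -1;
    }).join('');
  }

  // Collapse line breaks into spaces when single-line output is requested.
  if (options.singleline) {
    text = text.replace(/\s*[\r\n]+\s*/g, ' ');
  }

  text = text
    .replace(/[ \t]{2,}/g, ' ') // repeated spaces or tabs become one space
    .replace(/\.{2,}/g, '.')    // repeated periods become one period
    .trim();                    // strip leading and trailing whitespace

  return text === '' ? null : text;
}

console.log(cleanTextSketch({ author: '  John   Doe.. ' }, 'author'));
```

Returning `null` instead of an empty string lets callers test for "no value" with a single check, which is also what `cleanObject` builds on.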
### cleanNumber(obj, path)
Acts like `cleanText`, but casts the result to a number at the end. If the result is not a number, returns `null`.
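The final cast can be sketched as follows, assuming the text has already been cleaned. `toNumberOrNull` is a hypothetical helper for illustration, not xstruct's implementation.

```js
// Hypothetical helper, not xstruct's code: casts already-cleaned text
// to a number, mapping "no value" and not-a-number to null.
function toNumberOrNull(text) {
  if (text == null) {
    return null; // Number(null) would be 0, so check for "no value" first
  }
  var n = Number(text);
  return isNaN(n) ? null : n;
}

console.log(toNumberOrNull('42.5'), toNumberOrNull('abc'));
```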
### cleanDateTime(obj, path[, options])
Acts like `cleanText`, but casts the result to a date at the end (using moment.js). If the result is not a valid date, returns `null`. You can optionally specify the date-time format via `options.format`.
### cleanObject(obj)
Returns the object as is, or `null` if none of its properties has a value.
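A minimal sketch of this behavior, assuming that "has a value" means "is neither `null` nor `undefined`". `cleanObjectSketch` is a hypothetical name, and the library may define emptiness differently.

```js
// Hypothetical sketch, not xstruct's code. Assumption: a property
// "has a value" when it is neither null nor undefined.
function cleanObjectSketch(obj) {
  var hasValue = Object.keys(obj).some(function (key) {
    return obj[key] != null;
  });
  return hasValue ? obj : null;
}

console.log(cleanObjectSketch({ author: null, text: 'Hello' }));
console.log(cleanObjectSketch({ author: null, text: undefined }));
```

Combined with the `null` results of the other `clean*` functions, this makes it easy to drop records in which nothing could be extracted.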
### _.*
Exposes all functions from `lodash`.
### limit(requests, period)
Limits the library to at most `requests` HTTP requests per `period` (in milliseconds).
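One way to picture such a limit is a scheduler that delays a request until the request made `requests` calls earlier is at least `period` milliseconds old. The sketch below is a hypothetical illustration of that idea, not xstruct's implementation. `createSchedule` and `nextStart` are invented names, and time is passed in explicitly to keep the example deterministic.

```js
// Hypothetical sketch, not xstruct's implementation: computes when each
// request may start so that a request begins no sooner than `period` ms
// after the request made `requests` calls before it.
function createSchedule(requests, period) {
  var starts = []; // start times of the most recent `requests` requests
  return function nextStart(now) {
    var start = now;
    if (starts.length === requests) {
      // The oldest remembered request must be at least `period` ms old.
      start = Math.max(now, starts.shift() + period);
    }
    starts.push(start);
    return start;
  };
}

// Five requests issued at time 0 with a limit of 2 per 1000 ms.
var nextStart = createSchedule(2, 1000);
console.log([0, 0, 0, 0, 0].map(nextStart));
```

With the library itself, the equivalent setting would be `$.limit(2, 1000)`, following the signature above.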
## Building blocks
This library relies heavily on `request`, `cheerio`, `lodash` and `bluebird`. It also uses `iconv-lite`, `moment` and `util` as additional utilities.
## License
MIT