https://github.com/webmiddle/webmiddle
Node.js framework for modular web scraping and data extraction
- Host: GitHub
- URL: https://github.com/webmiddle/webmiddle
- Owner: webmiddle
- License: mit
- Created: 2016-05-30T11:57:37.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-12-09T17:10:15.000Z (about 2 years ago)
- Last Synced: 2024-10-03T13:59:07.957Z (4 months ago)
- Topics: data-extraction, framework, jsx, jsx-components, modular, nodejs, web-scraping
- Language: JavaScript
- Homepage: https://webmiddle.github.io/
- Size: 2.53 MB
- Stars: 14
- Watchers: 3
- Forks: 2
- Open Issues: 43
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# webmiddle
> Node.js framework for modular web scraping and data extraction
The building block of any webmiddle application is the [JSX](http://facebook.github.io/jsx/) component.
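Each component executes one task or controls the execution of other tasks by composing other components.

```jsx
// Assumption: the JSX tags, their attributes and the `anchors` key below are
// filled in from the core packages (pipe, http-request, cheerio-to-json).
// Pipe, HttpRequest, CheerioToJson and the `$$` helpers are assumed to be imported from them.
const FetchPageLinks = ({ url, query, name }) => (
  <Pipe>
    <HttpRequest name="rawHtml" contentType="text/html" url={url} />
    {rawHtml => (
      <CheerioToJson name={name} from={rawHtml} content={{
        // Keep only the links whose text contains `query` (case-insensitive).
        anchors: $$.within("a", $$.pipe(
          $$.filter(el => el.text().toUpperCase().indexOf(query.toUpperCase()) !== -1),
          $$.map({
            url: $$.attr("href"),
            text: $$.getFirst()
          })
        ))
      }}/>
    )}
  </Pipe>
);
```

The framework provides a set of core components for the most common operations, but there is no difference between a core component and one that you develop yourself.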
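
To make the composition idea concrete without depending on the framework, here is a minimal plain-JavaScript sketch. It is not webmiddle's implementation or API: `pipe`, `fetchPage`, `extractLinks` and `fetchPageLinks` are hypothetical names, and the HTTP fetch is replaced by a stub so the sketch runs offline. Each task is an async function, and `pipe` runs the tasks in order while exposing every earlier result by name to the later ones.

```javascript
// Conceptual sketch only: webmiddle expresses this with JSX components.
// `pipe`, `fetchPage` and `extractLinks` are hypothetical names.

// Runs tasks one after another. Each task receives the results of the
// previous tasks (keyed by task name) and resolves to its own result.
async function pipe(tasks) {
  const resources = {};
  let last;
  for (const { name, run } of tasks) {
    last = await run(resources);
    resources[name] = last;
  }
  return last; // the result of the pipeline is the last task's output
}

// Stand-in for an HTTP request, so the sketch needs no network access.
const fetchPage = async () =>
  '<a href="/docs">Docs</a><a href="/blog">Blog</a>';

// Builds a task that extracts the links whose text contains `query`
// (case-insensitive) from the previously fetched `rawHtml`.
const extractLinks = (query) => async ({ rawHtml }) => {
  const links = [];
  const anchor = /<a href="([^"]*)">([^<]*)<\/a>/g;
  for (const [, url, text] of rawHtml.matchAll(anchor)) {
    if (text.toUpperCase().includes(query.toUpperCase())) {
      links.push({ url, text });
    }
  }
  return links;
};

// A "component" built purely by composing two smaller tasks.
const fetchPageLinks = (query) =>
  pipe([
    { name: "rawHtml", run: fetchPage },
    { name: "links", run: extractLinks(query) },
  ]);
```

Because `fetchPageLinks` is itself just an async function, it could be used as a task inside another `pipe`, which is the same property that lets components nest.
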
Webmiddle applications can be quickly turned into REST APIs, allowing remote access via HTTP or WebSocket.
Use [webmiddle-devtools](https://github.com/webmiddle/webmiddle-devtools) to run and debug your components and to test them remotely.

## Links
- [Getting Started](https://webmiddle.github.io/docs/introduction/getting-started)
- [Try it live](https://repl.it/@Maluen/webmiddle-try-it-out)
- [Starter App repository](https://github.com/webmiddle/webmiddle-starter-app)
- [Devtools repository](https://github.com/webmiddle/webmiddle-devtools)

## Features
Built-in features provided by the core components:
- **[Concurrency](https://webmiddle.github.io/docs/control-flow/parallel)**, for executing multiple asynchronous components at the same time.
- **[HTTP](https://webmiddle.github.io/docs/fetching/httprequest)** requests.
- **[Puppeteer](https://webmiddle.github.io/docs/fetching/browser)** requests, for SPAs and pages using client-side generated content.
- **[Cookie JAR](https://webmiddle.github.io/docs/fetching/managercookie)**, for sharing cookies among different components and webmiddle objects.
- **[Caching](https://webmiddle.github.io/docs/storing/resume)**, for resuming work in case of crash.
- **[Error handling](https://webmiddle.github.io/docs/webmiddle/errorboundary)**, via customizable retries and catch options.
- **Resource transformations**
  - **[HTML/XML to JSON](https://webmiddle.github.io/docs/transforming/cheeriotojson)**
  - **[JSON to JSON](https://webmiddle.github.io/docs/transforming/jsonselecttojson)**

## Core packages
- `webmiddle`
- `webmiddle-manager-cookie`
- `webmiddle-component-pipe`
- `webmiddle-component-parallel`
- `webmiddle-component-resume`
- `webmiddle-component-http-request`
- `webmiddle-component-browser`
- `webmiddle-component-cheerio-to-json`
- `webmiddle-component-jsonselect-to-json`
- `webmiddle-server`
- `webmiddle-client`
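
The concurrency feature covered by `webmiddle-component-parallel` can be pictured, independently of the framework, as running several asynchronous tasks at once while capping how many are in flight. The following plain-JavaScript sketch illustrates that idea only. It is not the package's code, and `runParallel` and its `limit` parameter are hypothetical names.

```javascript
// Conceptual sketch only, not webmiddle's Parallel component.
// `runParallel` and `limit` are hypothetical names.

// Runs async task functions concurrently, with at most `limit` running
// at once, and resolves to their results in the original task order.
async function runParallel(tasks, limit = Infinity) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

// Helper for trying it out: a task that resolves to `value` after `ms`.
const delay = (ms, value) => () =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));
```

For example, `runParallel([delay(30, "a"), delay(10, "b"), delay(20, "c")], 2)` keeps at most two timers pending at a time and still resolves to the results in task order, regardless of which task finishes first.
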
## Open source ecosystem
Create your own components and publish them to npm!
One of the main philosophies of the framework is **reuse**, by creating an ecosystem where components can be published as separate npm modules to be usable in other projects.
**NOTE**: If you think that a component / feature is so common and general that it should be in the core, [open an issue](https://github.com/webmiddle/webmiddle/issues/new) or just do a pull request!
## Contributing
This is a monorepo: all the core components and the main webmiddle package live in this single repository.
It uses [Yarn](https://yarnpkg.com) and [Lerna](https://github.com/lerna/lerna) for managing the monorepo, as you might have guessed from the `lerna.json` file.
Start by installing the root dependencies with:
```bash
yarn
```

Then install the dependencies of all the packages and link the packages together by running:
```bash
yarn run lerna bootstrap
```

Build all the packages by running:
```bash
yarn run build
```

To run the tests for all the packages at once and get coverage info, execute:
```bash
yarn run test
```

> **NOTE**: Make sure to build before running the tests.

> **NOTE**: If you are on Windows, you might need to run the install and bootstrap commands as administrator.

Each [package](https://github.com/webmiddle/webmiddle/tree/master/packages) uses the same build / test system.
Once you are inside a package folder, you can build it by running `yarn run build` or `yarn run build:watch` (for rebuilding on every change).
Tests use [AVA](https://github.com/avajs/ava), so they can be written in modern JavaScript and they run concurrently. You can run the tests with `yarn run test`. To rerun the tests on every change, use `yarn run test:watch`. The latter is highly recommended while developing, as it also produces much more detailed output.
To run the same npm script in all the packages, use `lerna run`, for example:
```bash
yarn run lerna run build
```

To run arbitrary commands, use `lerna exec`, for example:
```bash
yarn run lerna -- exec -- rm -rf ./node_modules
```

See [Lerna commands](https://github.com/lerna/lerna#commands) for more info.