# webmiddle

> Node.js framework for modular web scraping and data extraction

The building block of any webmiddle application is the [JSX](http://facebook.github.io/jsx/) component.
Each component executes one task or controls the execution of other tasks by composing other components.

```jsx
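// The component tags and the $$.within / $$.pipeline / $$.filter wrappers are
// reconstructed from the core package names; check the docs for the exact API.
const FetchPageLinks = ({ url, query, name }) => (
  <Pipe>
    <HttpRequest contentType="text/html" url={url} />

    {rawHtml => (
      <CheerioToJson name={name} from={rawHtml} content={
        $$.within("a", $$.pipeline(
          // Keep only the anchors whose text contains the query (case-insensitive)
          $$.filter(el => el.text().toUpperCase().indexOf(query.toUpperCase()) !== -1),
          // Convert each remaining anchor into a { url, text } record
          $$.map({
            url: $$.attr("href"),
            text: $$.getFirst()
          })
        ))
      }/>
    )}
  </Pipe>
);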
```
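
The filter-then-map step in the example above can be pictured in plain JavaScript. This is only a sketch of the idea, not webmiddle's API: the `links` array and the `filterLinks` function are hypothetical stand-ins for the anchors extracted from a page.

```javascript
// Hypothetical stand-in for the anchors extracted from a page.
const links = [
  { href: "/docs", text: "Documentation" },
  { href: "/blog", text: "Blog" },
  { href: "/docs/api", text: "API docs" }
];

// Keep the links whose text contains `query` (case-insensitive),
// then map each one to a { url, text } record.
function filterLinks(items, query) {
  const needle = query.toUpperCase();
  return items
    .filter(link => link.text.toUpperCase().indexOf(needle) !== -1)
    .map(link => ({ url: link.href, text: link.text }));
}

console.log(filterLinks(links, "doc"));
```

With the sample data, the query `"doc"` keeps the "Documentation" and "API docs" entries and drops "Blog".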

The framework provides a set of core components for the most common operations, but there is no difference between a core component and a component that you may want to develop yourself.

Webmiddle applications can be quickly turned into REST APIs, allowing remote access via HTTP or WebSocket.
Use [webmiddle-devtools](https://github.com/webmiddle/webmiddle-devtools) to run, debug, and remotely test your components.

## Links

- [Getting Started](https://webmiddle.github.io/docs/introduction/getting-started)
- [Try it live](https://repl.it/@Maluen/webmiddle-try-it-out)
- [Starter App repository](https://github.com/webmiddle/webmiddle-starter-app)
- [Devtools repository](https://github.com/webmiddle/webmiddle-devtools)

## Features

Built-in features provided by the core components:

- **[Concurrency](https://webmiddle.github.io/docs/control-flow/parallel)**, for executing multiple asynchronous components at the same time.
- **[HTTP](https://webmiddle.github.io/docs/fetching/httprequest)** requests.
- **[Puppeteer](https://webmiddle.github.io/docs/fetching/browser)** requests, for SPAs and pages using client-side generated content.
- **[Cookie JAR](https://webmiddle.github.io/docs/fetching/managercookie)**, for sharing cookies among different components and webmiddle objects.
- **[Caching](https://webmiddle.github.io/docs/storing/resume)**, for resuming work in case of crash.
- **[Error handling](https://webmiddle.github.io/docs/webmiddle/errorboundary)**, via customizable retries and catch options.
- **Resource transformations**:
  - **[HTML/XML to JSON](https://webmiddle.github.io/docs/transforming/cheeriotojson)**
  - **[JSON to JSON](https://webmiddle.github.io/docs/transforming/jsonselecttojson)**
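
The idea behind retries with a catch fallback, as listed under error handling, can be sketched in plain JavaScript. This does not use webmiddle's API: `withRetries`, `retries`, and `onCatch` are hypothetical names chosen for illustration, and the actual options are described in the linked error handling docs.

```javascript
// Hypothetical helper, not part of webmiddle: runs an async task, retrying
// up to `retries` extra times, then falls back to `onCatch` if every
// attempt fails. Without `onCatch`, the last error is rethrown.
async function withRetries(task, { retries = 2, onCatch } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  if (onCatch) {
    return onCatch(lastError);
  }
  throw lastError;
}
```

For instance, `withRetries(fetchPage, { retries: 3, onCatch: () => null })` would attempt a hypothetical `fetchPage` task up to four times and resolve to `null` if all attempts fail.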

## Core packages



- `webmiddle`
- `webmiddle-manager-cookie`
- `webmiddle-component-pipe`
- `webmiddle-component-parallel`
- `webmiddle-component-resume`
- `webmiddle-component-http-request`
- `webmiddle-component-browser`
- `webmiddle-component-cheerio-to-json`
- `webmiddle-component-jsonselect-to-json`
- `webmiddle-server`
- `webmiddle-client`

## Open source ecosystem

Create your own components and publish them to npm!

One of the framework's core principles is **reuse**: components can be published as separate npm modules and used in other projects.

**NOTE**: If you think that a component or feature is common and general enough to belong in the core, [open an issue](https://github.com/webmiddle/webmiddle/issues/new) or submit a pull request!

## Contributing

This is a monorepo, i.e. the core components and the main webmiddle package all live in this single repository.

It uses [Yarn](https://yarnpkg.com) and [Lerna](https://github.com/lerna/lerna) for managing the monorepo, as you might have guessed from the `lerna.json` file.

Start by installing the root dependencies with:

```bash
yarn
```

Then install the dependencies of all the packages and link the packages together by running:

```bash
yarn run lerna bootstrap
```

Build all the packages by running:

```bash
yarn run build
```

To run the tests for all the packages at once and get coverage info, execute:

```bash
yarn run test
```

> **NOTE**: Make sure to build before running the tests.

> **NOTE**: If you are on Windows, you might need to run the install and bootstrap commands as administrator.

Each [package](https://github.com/webmiddle/webmiddle/tree/master/packages) uses the same build / test system.

Once you are inside a package folder, you can build it by running `yarn run build` or `yarn run build:watch` (for rebuilding on every change).

Tests use [AVA](https://github.com/avajs/ava), so they can be written in modern JavaScript and they run concurrently. Run them with `yarn run test`, or use `yarn run test:watch` to rerun them on every change. The watch option is highly recommended while developing, as it also produces much more detailed output.

To run the same npm script in all the packages, use `lerna run`, for example:

```bash
yarn run lerna run build
```

To run arbitrary commands in all the packages, use `lerna exec`, for example:

```bash
yarn run lerna -- exec -- rm -rf ./node_modules
```

See [Lerna commands](https://github.com/lerna/lerna#commands) for more info.