# import-streams

[![Build Status](https://travis-ci.com/AlpacaTravel/import-streams.svg?branch=master)](https://travis-ci.com/AlpacaTravel/import-streams)[![Coverage Status](https://coveralls.io/repos/github/AlpacaTravel/import-streams/badge.svg?branch=master)](https://coveralls.io/github/AlpacaTravel/import-streams?branch=master)![MIT](https://img.shields.io/npm/l/@alpaca-travel/import-streams)[![npm](https://img.shields.io/npm/v/@alpaca-travel/import-streams)](https://www.npmjs.com/package/@alpaca-travel/import-streams)

This project is currently in an Alpha Preview release.

A simple tool that describes streams, in YAML/JSON, for processing imports and transforming data.

It is built to help transform data between formats and sources, such as reading in data from any source, transforming it to the required format, and writing it to another location.

## Getting Started

Let's start by creating a simple YAML file, `stream.yaml`, which will pipe a file obtained from this repository to the command line.

```yaml
version: 1.0

# Trivial stream
stream:
  # Read a URL source
  - type: fetch-stream
    options:
      url: https://raw.githubusercontent.com/AlpacaTravel/import-streams/master/packages/import-streams/tests/data/file.txt

  # Output the stream contents to the screen
  - process.stdout
```

There are all kinds of stream sources, transforms and ways to write out data, ranging from reading and writing files to working with HTTP APIs, CMSs and AWS S3. You can easily compose complex stream pipelines in one file.

You can run import-streams in a Node.js environment (tested on the latest LTS, 10+). One option is to run it with `npx`, which makes the command available without any installation:

```shell
$ npx @alpaca-travel/import-streams stream.yaml
```

Otherwise, if you like it and run import streams regularly, you can install it globally on your host using npm. The `import-streams` CLI will then be available for you to point at any YAML file.

```shell
$ npm install -g @alpaca-travel/import-streams
$ import-streams stream.yaml
```

## Creating Pipelines

import-streams works with standard Readable, Transform and Writable streams.
Those streams can be composed into a pipeline that follows a read ->
transform -> write flow.

With the `import-streams-compose` functionality, you can combine multiple
read streams, transforms and writes into a single pipeline.
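To illustrate the underlying read -> transform -> write pattern, here is a minimal sketch using only Node.js core streams (no import-streams API involved; `Readable.from` requires Node 10.17+). The values and the uppercasing transform are purely illustrative.

```javascript
const { Readable, Transform, pipeline } = require("stream");

// A trivial read -> transform -> write pipeline using Node.js core streams
pipeline(
  // Read: a source of raw values
  Readable.from(["alpha", "beta", "gamma"]),
  // Transform: uppercase each chunk and add a newline
  new Transform({
    transform(chunk, encoding, callback) {
      callback(null, chunk.toString().toUpperCase() + "\n");
    },
  }),
  // Write: send the result to stdout
  process.stdout,
  (err) => {
    if (err) console.error("Pipeline failed", err);
  }
);
```

import-streams composes this same kind of pipeline for you from a YAML/JSON description.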

## Package Capabilities

There are a lot of ready-made streams available for processing input and sending output. These form a Swiss Army knife of transform functions that makes processing data from various sources easier (a short sketch follows the list below).

- Input/output SQLite statements, read from JSON:API, fetch from HTTP/HTTPS, AWS S3 and local filesystems
- Control flow, combine multiple read/write streams, transform individual values
- Parse/stringify URI, JSON, CSV
- Basic type coercion, sprintf
- map, flatten, join, concat
- Path selectors, filter expressions
- JSON Schema validation
- Expressive math, regex, control, combining, string manipulation, membership, existence, type and equality checks
- Prettier, HTML sanitize
- Cipher, zlib/unzip
- ... and Drupal fields, entity references, etc.
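As a hedged illustration of chaining a few of these capabilities, the sketch below fetches a CSV, parses it and prints the result. The `fetch-stream` and `csv-parse` types appear elsewhere in this README; the URL is a placeholder and the `json-stringify` type name is an assumption made for illustration only (check the docs for the actual stream catalogue).

```yaml
version: 1.0

stream:
  # Fetch a CSV over HTTPS (placeholder URL)
  - type: fetch-stream
    options:
      url: https://www.example.com/records.csv

  # Parse the CSV rows into objects
  - type: csv-parse
    options:
      columns:
        - id
        - title
      delimiter: ","

  # NOTE: "json-stringify" is an assumed type name, used here for illustration
  - json-stringify

  # Write each line to the console
  - process.stdout
```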

## Docs

See [docs](https://github.com/AlpacaTravel/import-streams/tree/master/docs)

## Goals

The following goals are behind import-streams:

- Describe simple imports and processing in a readable format (YAML/JSON)
- Provide a comprehensive toolset with the ability to expand your own
- Make mapping data between formats/sources easy
- Leverage streams and tools to handle processing flow
- MIT

## Implementation Overview

- You can describe imports using YAML, JSON or programmatically using typings and streams directly (a short JSON example follows this list)
- Describe the import using various stages that define read sources, map/lookup values for fields and write output
- Map fields and properties using 'selectors' as well as transforms that can change the data
- Supply the definition to an exposed 'compose' function that creates the implementation and performs the actions
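For example, the trivial stream from Getting Started can equally be expressed as JSON (JSON being a subset of YAML); a minimal sketch of the same definition:

```json
{
  "version": 1.0,
  "stream": [
    {
      "type": "fetch-stream",
      "options": {
        "url": "https://raw.githubusercontent.com/AlpacaTravel/import-streams/master/packages/import-streams/tests/data/file.txt"
      }
    },
    "process.stdout"
  ]
}
```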

## Composeable Pipelines

This library offers a way to compose a pipeline of streams.

More complex pipelines can define combined read/write sources, or branch processing based on the intended sources. To make it easier to build complex pipelines that can be defined in a configuration file and tailored to runtime requirements, the `@alpaca-travel/import-streams-compose` package was created (and is used by the core import-streams).

The library exposes a `compose` function that can read an object defining simple or complex pipelines you wish to process.

A definition can create a set of pipelines "in series" (following the usual read -> (transform) -> write), but it can also branch and combine other pipelines, combining multiple read sources and writing to multiple output sources (a hedged sketch follows below).

Any readable-stream is supported, and you can integrate your own streams into the composition.

This is the basis of creating reusable compositions that could be defined as JSON or YAML.
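As a hedged sketch of what a combined read source might look like in a definition, here are two JSON:API sources feeding the same downstream transform. The exact shape of the `combine:` key is an assumption made for illustration; consult the docs for the supported structure.

```yaml
version: 1.0

streams:
  # Primary read source
  - type: json-api-object
    options:
      url: https://www.site1.com/jsonapi/type
    # NOTE: the structure of "combine" below is illustrative only
    combine:
      - type: json-api-object
        options:
          url: https://www.site2.com/jsonapi/type

  # Shared downstream processing
  - type: map-selector
    options:
      mapping:
        title: attributes.title

  - process.stdout
```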

Features:

- Define stream pipelines
- Combine both read and write streams
- Supports complex branching
- Supports map-reduce patterns as well as other transformations
- Easily provide your own streams for further capabilities

### Importing from a CSV to a collection

```javascript
import compose from "@alpaca-travel/import-streams";

// Using YAML, but can use JSON etc.
const definition = /* YAML */ `
# Example of syncing places from a CSV file
version: 1.0
streams:
  # Obtain the CSV from a URL
  # Place your CSV up somewhere the script can access
  # You can configure options to include fetch options such as headers/auth etc
  - type: fetch-stream
    options:
      url: https://www.example.com/my-place-data.csv

  # Parse the CSV
  # This is a full CSV streaming behaviour, with lots of configurables
  - type: csv-parse
    options:
      columns:
        # Example CSV file structure, use your own
        - id
        - last_updated
        - title
        - coords
        # ... other fields etc
      quote: '"'
      ltrim: true
      rtrim: true
      delimiter: ","

  # Map fields into an item structure
  # We can map basics or more complex fields and field types
  - type: map-selector
    options:
      template:
        # Build out a data structure supporting the Alpaca Schemas
        $schema: https://schemas.alpaca.travel/item-v1.0.0.schema.json
        resource:
          $schema: https://schemas.alpaca.travel/place-v1.0.0.schema.json
      mapping:
        # Map your columns to the desired locations

        # Map the title to the CSV column "title"
        title: title

        # Obtain a lng/lat from the coords column; e.g. "lng,lat"
        position:
          path: coords
          transform:
            - type: to-coordinate
              options:
                # Depending if you use lng/lat or lat/lng
                # flip: true
                # The delimiter, if you use something other than ","
                # delimiter: ";"

        # Syncing fields, this can be used to update only affected records
        modified:
          path: last_updated
          transform:
            - to-date-format

        # Map a "custom://external-ref" to your ID to sync
        custom://external-ref: id

        # Assign the source (recommended to avoid ID conflicts)
        custom://external-source:
          path: .
          transform:
            - type: replace
              options:
                value: https://www.example.com

        # More mapping here....

  # Sync only changed records to a collection, and hide missing
  # Note: You can also combine: to multiple write sources
  - type: sync-external-items
    options:
      apiKey: ...
      collection: alpaca://collection/123
      profile: alpaca://profile/XYZ
`;

const factory = ({ type, options }) => {
  // Return my own streams here to mix them in
  // e.g. parsing your own CSV values into different structures
};

// Compose a stream based on a struct
const stream = compose(definition, { factory }).on("finish", () =>
  console.log("Complete!")
);
```

### Example, sourcing from Drupal using JSON:API core module

The example below shows leveraging the JSON:API core module, which can be enabled in Drupal, in order to read business listings into your Alpaca collection.

```javascript
import compose from "@alpaca-travel/import-streams";

// Example import configuration using import-streams
// You can use YAML, JSON or use exported typescript types
const definition = /* YAML */ `
# Example of syncing records from a Drupal site with the JSON:API core module enabled
# This can support all types of entities and media based on the already available drupal field-types
version: 1.0
streams:
  # Site one source
  # You can combine additional read sources using "combine: ..."
  - type: json-api-object
    options:
      url: https://www.site1.com/jsonapi/type

  # Transform data
  - type: map-selector
    options:
      mapping:
        # Map a json-api data record
        title: attributes.title

        # Parse basic types with transforms
        modified:
          path: attributes.changed
          transform:
            - to-date-format

        # Parse more complex values with pre-established transforms
        position:
          path: attributes.lngLat
          transform:
            - to-coordinate

        # Parse through multiple streams
        description:
          path:
            - attributes.description
            # support additional selectors with fall-over
          transform:
            # Transform individual values
            - html-sanitize
            - html-prettier

        # Support map/reduce on individual fields
        tags:
          path: relationships.field_types
          transform:
            # Leverage pre-existing transforms offered to map data
            - type: drupal.field-types.json-api.entity-reference
              options:
                iterate: true
                mapping:
                  title: attributes.title
            # and reduce...
            - flatten

        # Support alpaca attributes
        custom://external-ref: id

        # With more complex transforms
        custom://external-source:
          path: .
          transform:
            - type: replace
              options:
                value: https://www.site1.com

  # Stream additional phases with transformed docs, etc
  # ...

  # Sync only changed records to a collection, and hide missing
  # Note: You can also combine: to multiple write sources
  - type: alpaca-sync-external-items
    options:
      apiKey: ...
      collection: alpaca://collection/123
      profile: alpaca://profile/XYZ
`;

const factory = ({ type, options }) => {
  // Return my own streams here to mix them in
};

// Compose a stream based on a struct
const stream = compose(definition, { factory }).on("finish", () =>
  console.log("Complete!")
);
```

## Adding your own streams

import-streams uses a type lookup that allows you to create and map in your own
streams. These streams must conform to the `readable-stream` interfaces
(Readable, Transform or Writable).

```javascript
const compose = require("@alpaca-travel/import-streams").default;
const { factory } = require("@alpaca-travel/compatibility-import-streams");
const fs = require("fs");
const path = require("path");

// My custom stream class
const MyStream = require("./my-stream");

// Provide a stream wrapper to instantiate my custom stream
const factoryWrapper = ({ type, options }) => {
  if (type === "my-stream") {
    return new MyStream();
  }

  return factory({ type, options });
};

// Read in my stream pipeline file
const pipeline = fs.readFileSync(
  path.resolve(__dirname, "./stream.yaml"),
  "utf-8"
);

// Create the pipeline with our custom factory
compose(pipeline, {
  factory: factoryWrapper,
})
  .on("finish", () => console.log("Finished"))
  .on("error", console.error);
```
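For illustration, the `./my-stream` module above could be a standard Transform stream built on Node.js core APIs. A minimal hypothetical sketch (the added field is purely illustrative):

```javascript
// my-stream.js - a hypothetical custom Transform stream
const { Transform } = require("stream");

class MyStream extends Transform {
  constructor() {
    // Work with objects rather than raw buffers
    super({ objectMode: true });
  }

  _transform(record, encoding, callback) {
    // Illustrative only: stamp each record with a processedAt time
    callback(null, { ...record, processedAt: new Date().toISOString() });
  }
}

module.exports = MyStream;
```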

## AWS Lambda Runtime

AWS Lambda can provide a friendly runtime environment for hosting your regular, ongoing import processes. You can build your import using the `serverless` framework (a sketch follows the ARN below), or leverage the following layer ARN:

Layer ARN (version 0.0.86-alpha.0)

```
arn:aws:lambda:ap-southeast-2:353721752909:layer:import-streams:4
```
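If you use the `serverless` framework, a minimal sketch of a `serverless.yml` that attaches the shared layer and runs the handler on a daily schedule might look like the following; the function name, region, timeout and schedule are assumptions to adjust for your own import.

```yaml
service: example-import-stream

provider:
  name: aws
  runtime: nodejs10.x
  region: ap-southeast-2
  timeout: 300

functions:
  import:
    handler: index.handler
    layers:
      - arn:aws:lambda:ap-southeast-2:353721752909:layer:import-streams:4
    events:
      # Run the import once a day
      - schedule: rate(1 day)
```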

### Step-by-step guide to creating an AWS Lambda import-streams function

It can be quick and easy to create a Lambda leveraging the shared layer. This allows you to use a simple script and YAML file, without any installation, to run your import on a cron-like schedule.

#### Creating the lambda function

1. Log into Amazon Web Services
2. Go to Lambda
3. Click "Create Function"
4. Select "Author from scratch"
5. Enter the name of your lambda function, such as "example-import-stream"
6. Choose the runtime of "Node.js 10.x"
7. Select "Create Function"
8. Select "Layers" from the Designer
9. Select "Add a layer"
10. Select "Provide a layer version ARN"
11. Enter the Layer ARN (as shown above) into the field
12. Click "Add"

#### Adding the script

In the section titled "Function code", replace the index.js source code with:

```javascript
// This is made available by the layer, without any npm install
const compose = require("@alpaca-travel/import-streams").default;

// This is the default handler
module.exports.handler = async () => {
  // Obtain the stream path
  const resolved = require("path").resolve("./stream.yaml");

  try {
    // Include the source
    const source = require("fs").readFileSync(resolved, "utf-8");

    // Process
    await new Promise((success, fail) => {
      // This runs our compose
      compose(source).on("finish", success).on("error", fail);
    });
  } catch (e) {
    console.error(e);
    throw e;
  }
};
```

Then, create a file beside index.js named "stream.yaml" and paste in your stream YAML contents.

```yaml
version: 1.0

# Trivial stream
stream:
  # Read a URL source
  - type: fetch-stream
    options:
      url: https://raw.githubusercontent.com/AlpacaTravel/import-streams/master/packages/import-streams/tests/data/file.txt

  # Output the stream contents to the screen
  - process.stdout
```

Finally, save your lambda function and test it out. If you have used the above example, you should see the words "Hello import-streams, you are running!" once it is operating.

Your final steps may be: extending the function timeout from 3 seconds to something longer (such as 5 minutes or more), and setting up a trigger (such as a CloudWatch Events schedule); a hedged CLI sketch follows below. You can also consider configuring the network layer bottlenecks using environment variables (see [docs/network.md](https://github.com/AlpacaTravel/import-streams/tree/master/docs/network.md)).
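For reference, a hedged sketch of those two steps using the AWS CLI rather than the console; the function name, timeout and schedule are assumptions matching the example above, and `<your-function-arn>` is a placeholder.

```shell
# Extend the function timeout to 5 minutes
$ aws lambda update-function-configuration \
    --function-name example-import-stream \
    --timeout 300

# Create a daily CloudWatch Events rule and point it at the function
$ aws events put-rule \
    --name example-import-stream-daily \
    --schedule-expression "rate(1 day)"
$ aws lambda add-permission \
    --function-name example-import-stream \
    --statement-id example-import-stream-daily \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com
$ aws events put-targets \
    --rule example-import-stream-daily \
    --targets "Id"="1","Arn"="<your-function-arn>"
```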