Datahub service for running dataflows

## Factory

The service is responsible for running the flows for datasets that are frequently updated and maintained by [Datahub](https://datahub.io/). It uses [Datapackage Pipelines](https://github.com/frictionlessdata/datapackage-pipelines), a framework for declarative stream-processing of tabular data, together with [DataFlows](https://github.com/datahq/dataflows) to run the flows through pipelines and process the datasets.
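
For orientation, here is a minimal DataFlows sketch of the kind of processing the service runs; the URL and output path are placeholders, not part of this repository:

```
from dataflows import Flow, dump_to_path, load

# Minimal sketch: stream a CSV from a placeholder URL and save it
# locally as a datapackage. Both the URL and the output path are
# illustrative only.
Flow(
    load('https://example.com/my-data.csv'),
    dump_to_path('out/my-dataset'),
).process()
```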

## Install

You will need Python 3.x:

```
pip install -r requirements.txt
```
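
To confirm the installation, an optional sanity check (this assumes `requirements.txt` includes both `dataflows` and `datapackage-pipelines`):

```
# Optional sanity check: both core libraries should be importable.
import dataflows
import datapackage_pipelines

print('dataflows and datapackage-pipelines are installed')
```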

## Add dataset to factory

Each "folder" in `datasets` directory is named after publisher's username and each dataset in it is a standalone repository on GitHub and should be submoduled. To add a new datasets you will need to submodule your dataset repo into related directory (or create if not exists).

```
mkdir datasets/example
cd datasets/example
git submodule add https://github.com/example/my-awesome-dataset
```

Each dataset should have its flows written as a Python script and a `pipeline-spec.yaml` pointing to the flow to run:

* `annual-prices.py` - the script responsible for getting, tidying and normalising the data (a slightly fuller sketch follows after the spec below)

```
from dataflows import Flow, dump_to_path, load, add_metadata

def flow(parameters, datapackage, resources, stats):
    return Flow(load(load_source='http://www.example.com/my-data.csv'))
```

* `pipeline-spec.yaml` - metadata about the pipelines. Here you define exactly which flows to run and where the config file is saved

```
example-flow:
  pipeline:
  - flow: annual-prices
  - run: datahub.dump.to_datahub
    parameters:
      config: ~/.config/datahub/config.json.example
```
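
For reference, here is a slightly fuller sketch of the same `flow()` entry point that also sets package metadata before loading; the name and title values are placeholders:

```
from dataflows import Flow, add_metadata, load

def flow(parameters, datapackage, resources, stats):
    return Flow(
        # Placeholder metadata; replace with the dataset's real name and title.
        add_metadata(name='annual-prices', title='Annual prices (example)'),
        load(load_source='http://www.example.com/my-data.csv'),
    )
```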

The Factory server will read `pipeline-spec.yaml` for each dataset and run the flows and processors stated there. In the example above it will:

1. Run the flow (`annual-prices.py`) and load the data from `http://www.example.com/my-data.csv`
2. Run the custom [`datahub.dump.to_datahub`](https://github.com/datahq/datapackage-pipelines-datahub) processor and push the files to [datahub.io](https://datahub.io/)
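
If you want to smoke-test a flow locally before the Factory picks it up, one possible approach (a sketch, not part of the Factory itself) is to load the script and run it with DataFlows directly; the `None` arguments and the output path are placeholders:

```
# Hypothetical local test for annual-prices.py. importlib is used
# because the file name contains a hyphen and cannot be imported directly.
import importlib.util

from dataflows import Flow, dump_to_path

spec = importlib.util.spec_from_file_location('annual_prices', 'annual-prices.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

Flow(
    module.flow(None, None, None, None),  # dpp normally supplies these arguments
    dump_to_path('out/annual-prices'),    # placeholder output directory
).process()
```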

### Config files

To publish a dataset on Datahub, each user has their own config file. We need this config file for each user who is subscribed to the Factory in order to push datasets under the appropriate username.

Config files for Datahub are usually saved in `~/.config/datahub/config.json`. If you can't find one, you will probably need to log in with your Datahub account. Log in and copy your config file into the `secrets` directory:

```
data login
cp ~/.config/datahub/config.json secrets/config.json.example
```
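
Optionally, you can check that the copied file is valid JSON before going further (the key names depend on your account, so none are assumed here):

```
# Optional check: the copied config file should parse as JSON.
import json

with open('secrets/config.json.example') as f:
    config = json.load(f)

print('config file is valid JSON with keys:', sorted(config.keys()))
```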

To add a new config file to the list, you will have to add `config.json.example` to `secrets/secrets.tar`, which is encrypted. Please contact us if you are not a member of the Datahub developers team; otherwise:

* Download and decrypt `secrets.tar.enc` from the private GitLab repository
* Extract `secrets.tar`
* Add `config.json.example` to the directory
* Archive `secrets.tar` again
* Encrypt it with `travis encrypt-file` and push it back to GitHub
* In parallel, encrypt it with a password (used when decrypting) and push the file back to the private GitLab repo

```
# Extract
tar xvf secrets.tar
# Add new Config
cp ~/.config/datahub/config.json secrets/config.json.example
# Archive again
tar cvf secrets.tar secrets/
# Encrypt
travis encrypt-file secrets.tar
# Commit and push
git add secrets.tar.enc
git commit -m"example user's config"
git push
```

## Developers

When working locally, you will need to update all submodules:

```
git submodule init && git submodule update
```