https://github.com/dathere/datapump
Pump time-series data into the CKAN datastore using a simple filesystem-based queuing system.
- Host: GitHub
- URL: https://github.com/dathere/datapump
- Owner: dathere
- License: mit
- Created: 2021-04-20T16:30:19.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2021-05-25T13:45:58.000Z (almost 5 years ago)
- Last Synced: 2023-03-08T21:10:39.343Z (about 3 years ago)
- Topics: ckan, iot, time-series
- Language: Python
- Homepage:
- Size: 133 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# datapump
Pump time-series data into the CKAN datastore using a simple filesystem-based queueing system.
Requires: Python 3.8
Installation:
=============
Linux, Mac, Windows PowerShell:
```
python3 -m venv datapumpenv
. datapumpenv/bin/activate
cd datapumpenv
git clone https://github.com/dathere/datapump.git
cd datapump
pip install -r requirements.txt
```
Windows CMD:
```
python3 -m venv datapumpenv
datapumpenv\Scripts\activate
cd datapumpenv
git clone https://github.com/dathere/datapump.git
cd datapump
pip install -r requirements.txt
```
Usage:
======
Linux, Mac, Windows PowerShell:
```
. datapumpenv/bin/activate
cd datapumpenv/datapump
python datapump.py --config datapump.ini
```
Windows CMD:
```
datapumpenv\Scripts\activate
cd datapumpenv\datapump
python datapump.py --config datapump.ini
```
Command line parameters:
------------------------
```
python datapump.py --help
Usage: datapump.py [OPTIONS]

  Pumps time-series data into CKAN using a simple filesystem-based queueing system.

Options:
  --inputdir PATH      The directory where the job files are located. [default: ./input]
  --processeddir PATH  The directory where successfully processed inputfiles are moved. [default: ./processed]
  --problemsdir PATH   The directory where unsuccessful inputfiles are moved. [default: ./problems]
  --datecolumn TEXT    The name of the datetime column. [default: DateTime]
  --dateformats TEXT   List of dateparser format strings to try one by one. See https://dateparser.readthedocs.io
                       [default: %y-%m-%d %H:%M:%S, %y/%m/%d %H:%M:%S, %Y-%m-%d %H:%M:%S, %Y/%m/%d %H:%M:%S]
  --host TEXT          CKAN host. [required]
  --apikey TEXT        CKAN api key to use. [required]
  --verbose            Show more information while processing.
  --debug              Show debugging messages.
  --logfile PATH       The full path of the main log file. [default: ./datapump.log]
  --config FILE        Read configuration from FILE.
  --version            Show the version and exit.
  --help               Show this message and exit.
```
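The `--dateformats` option describes trying a list of format strings one by one. A minimal sketch of that idea follows. It uses the standard library's `datetime.strptime` as a stand-in for whatever parser datapump uses internally, and `parse_datetime` is a hypothetical helper name, not part of datapump:

```python
from datetime import datetime

# Default formats copied from the --dateformats help text above.
DEFAULT_FORMATS = [
    "%y-%m-%d %H:%M:%S",
    "%y/%m/%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
]


def parse_datetime(value, formats=DEFAULT_FORMATS):
    """Return the first successful parse of `value`, or None if no format matches.

    Hypothetical helper: illustrates "try each format in order" only.
    """
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None
```

Because the formats are tried in order, a value such as `2021-04-20 16:30:19` fails the two-digit-year formats first and is then parsed by `%Y-%m-%d %H:%M:%S`.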
Note that parameters can be supplied through environment variables, a config file, or the command line interface, and are parsed and processed in that priority order.
Environment variables should be all caps and prefixed with `DATAPUMP_`, for example:
```
export DATAPUMP_APIKEY="MYCKANAPIKEY"
export DATAPUMP_HOST="https://ckan.example.com"
```
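The naming convention can be illustrated with a short sketch that maps `DATAPUMP_`-prefixed variables back to option names. This only shows the convention. `options_from_env` is a hypothetical helper, and how datapump itself reads these variables is not shown in this README:

```python
import os

PREFIX = "DATAPUMP_"


def options_from_env(environ=None):
    """Collect DATAPUMP_-prefixed variables as lowercase option names.

    Hypothetical helper: DATAPUMP_APIKEY becomes the "apikey" option.
    """
    environ = os.environ if environ is None else environ
    return {
        name[len(PREFIX):].lower(): value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }
```

With the two exports above in place, `options_from_env()` would return a dictionary with `apikey` and `host` entries; variables without the prefix are ignored.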
Job JSON
--------
The input directory is scanned for `*-job.json` files in descending date order, and each job is executed according to its JSON configuration.
For example:
```
{
"InputFile": "./samples/zone1_airquality_*.csv",
"TargetOrg": "etl-test",
"TargetPackage": "iot-test",
"TargetResource": "air-quality",
"PrimaryKey": "DateTime,Sensor_id",
"Dedupe": "last",
"Truncate": false
}
```
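The scan described above can be sketched as follows. This is an illustration, not datapump's actual code: `find_jobs` is a hypothetical name, and the sketch assumes that "date" means each file's modification time, which the README does not state.

```python
import json
from pathlib import Path


def find_jobs(inputdir="./input"):
    """Return (path, job) pairs for *-job.json files, newest first.

    Assumption: files are ordered by modification time.
    """
    paths = sorted(
        Path(inputdir).glob("*-job.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # descending: most recently modified first
    )
    jobs = []
    for path in paths:
        with path.open(encoding="utf-8") as fh:
            jobs.append((path, json.load(fh)))
    return jobs
```

Each returned `job` is a plain dictionary, so attributes such as `job["TargetResource"]` or `job["PrimaryKey"]` can be read directly.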
The `Dedupe` attribute specifies whether datapump should automatically handle duplicate rows, using the `PrimaryKey` attribute to determine duplication.
It can be set to `first`, `last` or `""` (an empty string):
- `first`: Drop duplicates except for the first occurrence.
- `last`: Drop duplicates except for the last occurrence.
- `""`: Do not drop duplicates.
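To make the three `Dedupe` settings concrete, here is a pure-Python sketch of the behaviour described above. It is an assumption-laden illustration rather than datapump's implementation: `dedupe_rows` is a hypothetical helper, rows are modelled as dictionaries, and the `pm25` column in the usage note is invented for the example.

```python
def dedupe_rows(rows, primary_key, dedupe="last"):
    """Drop rows whose primary-key values repeat, keeping the first or last one.

    Hypothetical helper. `primary_key` is a comma-separated column list,
    as in the job file's "PrimaryKey" attribute.
    """
    if dedupe == "":
        return list(rows)  # empty string: keep duplicates
    if dedupe not in ("first", "last"):
        raise ValueError(f"unsupported Dedupe value: {dedupe!r}")
    key_columns = [name.strip() for name in primary_key.split(",")]
    kept = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if dedupe == "first":
            if key in kept:
                continue  # an earlier row with this key already won
        else:
            kept.pop(key, None)  # re-insert so the later row keeps its position
        kept[key] = row
    return list(kept.values())
```

For example, given two readings with the same `DateTime` and `Sensor_id` but different `pm25` values, `dedupe_rows(rows, "DateTime,Sensor_id", "first")` keeps the earlier reading, while `"last"` keeps the later one.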