Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

icij/datashare-tarentula
CLI toolbelt for Datashare.
- Host: GitHub
- URL: https://github.com/icij/datashare-tarentula
- Owner: ICIJ
- License: agpl-3.0
- Created: 2019-08-21T14:02:54.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-06-13T15:54:52.000Z (5 months ago)
- Last Synced: 2024-06-14T13:22:47.460Z (5 months ago)
- Topics: cli, csv, datashare, elasticsearch, python
- Language: Python
- Homepage: https://datashare.icij.org
- Size: 805 KB
- Stars: 6
- Watchers: 8
- Forks: 2
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
# Datashare Tarentula [![CircleCI](https://circleci.com/gh/ICIJ/datashare-tarentula.svg?style=svg)](https://circleci.com/gh/ICIJ/datashare-tarentula)
Cli toolbelt for [Datashare](https://datashare.icij.org).
```
/ \
\ \ ,, / /
'-.`\()/`.-'
.--_'( )'_--.
/ /` /`""`\ `\ \
| | >< | |
\ \ / /
    '.__.'

Usage: tarentula [OPTIONS] COMMAND [ARGS]...
Options:
--syslog-address TEXT localhost Syslog address
--syslog-port INTEGER 514 Syslog port
--syslog-facility TEXT local7 Syslog facility
--stdout-loglevel TEXT ERROR Change the default log level for stdout error handler
--help Show this message and exit
  --version                Show the installed version of Tarentula

Commands:
aggregate
count
clean-tags-by-query
download
export-by-query
list-metadata
tagging
tagging-by-query
```

---
- [Installation](#installation)
- [Usage](#usage)
- [Cookbook 👩🍳](#cookbook-)
- [Count](#count)
- [Clean Tags by Query](#clean-tags-by-query)
- [Download](#download)
- [Export by Query](#export-by-query)
- [Tagging](#tagging)
- [CSV formats](#csv-formats)
- [Tagging by Query](#tagging-by-query)
- [Aggregate](#aggregate)
- [Following your changes](#following-your-changes)
- [Configuration File](#configuration-file)
- [Testing](#testing)
- [Releasing](#releasing)
- [1. Create a new release](#1-create-a-new-release)
- [2. Upload distributions on pypi](#2-upload-distributions-on-pypi)
- [3. Build and publish the Docker image](#3-build-and-publish-the-docker-image)
- [4. Push your changes on Github](#4-push-your-changes-on-github)

---
## Installation
You can install Datashare Tarentula with your favorite package manager:
```
pip3 install --user tarentula
```

Or alternatively, with Docker:
```
docker run icij/datashare-tarentula
```

## Usage
Datashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focused on bulk actions, it provides both a CLI and a Python API.
### Cookbook 👩🍳
To learn more about how to use Datashare Tarentula with a list of examples, please refer to the Cookbook.
### Count
A command to just count the number of files matching a query.
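For example, assuming a Datashare instance at its default local URL and a project named `local-datashare` (adjust both to your setup), counting documents matching a query string might look like:

```shell
# Count documents matching "banana" in a local Datashare project.
# URL and project name are example values.
tarentula count \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  --query "banana"
```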
```
Usage: tarentula count [OPTIONS]

Options:
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT      You can additionally pass the Elasticsearch
                                URL in order to use scrolling capabilities of
                                Elasticsearch (useful when dealing with a
                                lot of results)
  --query TEXT                  The query string to filter documents
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
  --apikey TEXT                 Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit
```

### Clean Tags by Query
A command that uses Elasticsearch `update-by-query` feature to batch untag documents directly in the index.
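As a sketch, untagging every document matching a JSON query stored in a file (the Elasticsearch URL and project name below are example values for a local setup):

```shell
# Remove tags from all documents matching the query in query.json.
# The @ prefix loads the query from a file, as described in --query's help.
tarentula clean-tags-by-query \
  --datashare-project local-datashare \
  --elasticsearch-url http://localhost:9200 \
  --query @query.json
```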
```
Usage: tarentula clean-tags-by-query [OPTIONS]

Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
  --apikey TEXT                 Datashare authentication apikey
  --traceback / --no-traceback  Display a traceback in case of error
  --wait-for-completion / --no-wait-for-completion
                                Create an Elasticsearch task to perform the
                                update asynchronously
  --query TEXT                  Give a JSON query to filter documents that
                                will have their tags cleaned. It can be
                                a file with @path/to/file. Defaults to all.
--help Show this message and exit
```

### Download
A command to download all files matching a query.
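A hypothetical invocation, downloading matching documents into a local directory while throttling requests (all values are examples):

```shell
# Download documents matching "contract" into ./downloads,
# waiting 100 ms between requests and stopping after 500 documents.
tarentula download \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  --query "contract" \
  --destination-directory ./downloads \
  --throttle 100 \
  --limit 500
```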
```
Usage: tarentula download [OPTIONS]

Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --destination-directory TEXT    Directory where documents will be downloaded
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can separate
                                  several with semicolons: key1=val1;key2=val2;...
  --path-format TEXT              Downloaded document path template
  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the downloaded document from the index
  -f, --from INTEGER              Passed to the search, it will bypass the
                                  first n documents
-l, --limit INTEGER Limit the total results to return
--sort-by TEXT Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--once / --not-once Download file only once
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--raw-file / --no-raw-file Download raw file from Datashare
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit.
```

### Export by Query
A command to export all files matching a query.
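For instance, exporting a couple of fields for every matching document to a CSV file (the field names passed to `--source`, like the URL and project, are illustrative):

```shell
# Export the path and creation date of every matching document to export.csv.
tarentula export-by-query \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  --query "*" \
  --source path,creationDate \
  --output-file export.csv
```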
```
Usage: tarentula export-by-query [OPTIONS]

Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --output-file TEXT              Path to the CSV file
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can separate
                                  several with semicolons: key1=val1;key2=val2;...
  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the export
  --sort-by TEXT                  Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--type [Document|NamedEntity] Type of indexed documents to download
-f, --from INTEGER Passed to the search it will bypass the
first n documents
-l, --limit INTEGER Limit the total results to return
  --size INTEGER                  Size of the scroll request that powers the
                                  operation.
  --query-field / --no-query-field
Add the query to the export CSV
--help Show this message and exit.
```

### Tagging
A command to batch tag documents with a CSV file.
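A minimal sketch, assuming a local Datashare instance and a `tags.csv` file in one of the layouts documented under "CSV formats":

```shell
# Apply the tags listed in tags.csv to the referenced documents.
tarentula tagging \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  tags.csv
```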
```
Usage: tarentula tagging [OPTIONS] CSV_PATH

Options:
--datashare-url TEXT http://localhost:8080 Datashare URL
--datashare-project TEXT local-datashare Datashare project
--throttle INTEGER 0 Request throttling (in ms)
--cookies TEXT _Empty string_ Key/value pair to add a cookie to each request to the API. You can separate several with semicolons: key1=val1;key2=val2;...
--apikey TEXT None Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--help Show this message and exit
```

#### CSV formats
Tagging with a `documentId` and `routing`:
```csv
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
```

Tagging with a `documentUrl`:
```csv
tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
```

### Tagging by Query
A command that uses Elasticsearch `update-by-query` feature to batch tag documents directly in the index.
To see an example of input file, refer to [this JSON](tests/fixtures/tags-by-content-type.json).
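A sketch of such a run against an assumed local Elasticsearch, using a JSON file named after the test fixture above:

```shell
# Tag documents directly in the index, without waiting for the
# Elasticsearch update-by-query task to finish.
tarentula tagging-by-query \
  --datashare-project local-datashare \
  --elasticsearch-url http://localhost:9200 \
  --no-wait-for-completion \
  tags-by-content-type.json
```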
```
Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH

Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--throttle INTEGER Request throttling (in ms)
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
  --wait-for-completion / --no-wait-for-completion
                                Create an Elasticsearch task to perform the
                                update asynchronously
--help Show this message and exit
```

### List Metadata
You can list the metadata from the mapping, optionally counting the number of occurrences of each field in the index, with the `--count` parameter. Counting the fields is disabled by default.
It includes a `--filter_by` parameter to narrow the retrieved metadata to properties of specific sets of documents. For instance, it can be used to get just email-related properties with: `--filter_by "contentType=message/rfc822"`
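Putting both options together, a hypothetical run (project name is an example) could look like:

```shell
# List the metadata fields of email documents and count how many
# documents carry each field.
tarentula list-metadata \
  --datashare-project local-datashare \
  --filter_by "contentType=message/rfc822" \
  --count
```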
```
$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]

Options:
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT       You can additionally pass the Elasticsearch
                                 URL in order to use scrolling capabilities of
                                 Elasticsearch (useful when dealing with a lot
                                 of results)
--type [Document|NamedEntity] Type of indexed documents to get metadata
  --filter_by TEXT               Filter documents by comma-separated pairs of
                                 field names and values separated by =.
                                 Example: "contentType=message/rfc822,
                                 contentType=message/rfc822"
  --count / --no-count           Count or not the number of docs for each
                                 property found
  --help                         Show this message and exit.
```
### Aggregate
You can run aggregations on the data; the Elasticsearch aggregations API is partially exposed through this command.
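For instance, a count aggregation grouped by a field might be run as follows (the project name is an example; `contentType` is a field Datashare indexes for documents):

```shell
# Count documents grouped by their content type.
tarentula aggregate \
  --datashare-project local-datashare \
  --group_by contentType \
  --run count
```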
The possible operations are:
- count: groups documents by the distinct values of a given field and counts the number of docs in each group.
- nunique: returns the number of unique values of a given field.
- date_histogram: returns counts of monthly or yearly grouped values for a given date field.
- sum: returns the sum of values of a numeric field.
- min: returns the min of values of a numeric field.
- max: returns the max of values of a numeric field.
- avg: returns the average of values of a numeric field.
- stats: returns a set of statistics for a given numeric field.
- string_stats: returns a set of string statistics for a given string field.

```
$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]

Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT      You can additionally pass the Elasticsearch
                                URL in order to use scrolling capabilities of
                                Elasticsearch (useful when dealing with a
                                lot of results)
--query TEXT The query string to filter documents
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to download
--group_by TEXT Field to use to aggregate results
--operation_field TEXT Field to run the operation on
--run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
Operation to run
--calendar_interval [year|month]
Calendar interval for date histogram
aggregation
--help Show this message and exit.
```

### Following your changes
When running Elasticsearch changes on big datasets, the task can take a very long time. Since we kept curling Elasticsearch to check that the task was still running well, we added a small utility to follow the changes. It draws a live graph of a provided Elasticsearch indicator with a specified filter.
It uses [matplotlib](https://matplotlib.org/) and python3-tk.
If you see the following message:
```
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
```

Then you have to install [tkinter](https://docs.python.org/3/library/tkinter.html), i.e. python3-tk for Debian/Ubuntu.
The command has the options below:
```
$ graph_es --help
Usage: graph_es [OPTIONS]

Options:
  --query TEXT                Give a JSON query to filter documents. It can be
                              a file with @path/to/file. Defaults to all.
--index TEXT Elasticsearch index (default local-datashare)
--refresh-interval INTEGER Graph refresh interval in seconds (default 5s)
--field TEXT Field value to display over time (default "hits.total")
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query (default http://elasticsearch:9200)
```

## Configuration File
Tarentula supports several sources for configuring its behavior, including an INI file and command-line options.
The configuration file is searched for in the following order (the first file found is used, all others are ignored):
* `TARENTULA_CONFIG` (environment variable if set)
* `tarentula.ini` (in the current directory)
* `~/.tarentula.ini` (in the home directory)
* `/etc/tarentula/tarentula.ini`

It should follow this format (all values below are optional):
```
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare

[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
```

## Testing
To test this tool, you must have Datashare and Elasticsearch running on your development machine.
After you have [installed Datashare](https://datashare.icij.org/), just run it with a test project/user:
```
datashare -p test-datashare -u test
```

In a separate terminal, install the development dependencies:
```
make install
```

Finally, run the tests:
```
make test
```

## Releasing
The releasing process uses [bumpversion](https://pypi.org/project/bumpversion/) to manage versions of this package, [pypi](https://pypi.org/project/tarentula/) to publish the Python package and [Docker Hub](https://hub.docker.com/) for the Docker image.
### 1. Create a new release
```
make [patch|minor|major]
```

### 2. Upload distributions on pypi
_To be able to do this, you will need to be a maintainer of the [pypi](https://pypi.org/project/tarentula/) project._
```
make distribute
```

### 3. Build and publish the Docker image
To build and upload a new image to the [docker repository](https://hub.docker.com/repository/docker/icij/datashare-tarentula):
_To be able to do this, you will need to be part of the ICIJ organization on Docker Hub._
```
make docker-publish
```

**Note**: Datashare Tarentula is a multi-platform build. You might need to set up your environment for
multi-platform builds using the `make docker-setup-multiarch` command. Read more
[in the Docker documentation](https://docs.docker.com/build/building/multi-platform/).

### 4. Push your changes on Github
Git push the release and tag:
```
git push origin master --tags
```