Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

icij/datashare-tarentula
CLI toolbelt for Datashare.
- Host: GitHub
- URL: https://github.com/icij/datashare-tarentula
- Owner: ICIJ
- License: agpl-3.0
- Created: 2019-08-21T14:02:54.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-06-13T15:54:52.000Z (5 months ago)
- Last Synced: 2024-06-14T13:22:47.460Z (5 months ago)
- Topics: cli, csv, datashare, elasticsearch, python
- Language: Python
- Homepage: https://datashare.icij.org
- Size: 805 KB
- Stars: 6
- Watchers: 8
- Forks: 2
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
# Datashare Tarentula [![CircleCI](https://circleci.com/gh/ICIJ/datashare-tarentula.svg?style=svg)](https://circleci.com/gh/ICIJ/datashare-tarentula)
Cli toolbelt for [Datashare](https://datashare.icij.org).
```
/ \
\ \ ,, / /
'-.`\()/`.-'
.--_'( )'_--.
/ /` /`""`\ `\ \
| | >< | |
\ \ / /
    '.__.'

Usage: tarentula [OPTIONS] COMMAND [ARGS]...
Options:
--syslog-address TEXT localhost Syslog address
--syslog-port INTEGER 514 Syslog port
--syslog-facility TEXT local7 Syslog facility
--stdout-loglevel TEXT ERROR Change the default log level for stdout error handler
--help Show this message and exit
  --version                Show the installed version of Tarentula

Commands:
aggregate
count
clean-tags-by-query
download
export-by-query
list-metadata
tagging
tagging-by-query
```

---
- [Installation](#installation)
- [Usage](#usage)
- [Cookbook 👩🍳](#cookbook-)
- [Count](#count)
- [Clean Tags by Query](#clean-tags-by-query)
- [Download](#download)
- [Export by Query](#export-by-query)
- [Tagging](#tagging)
- [CSV formats](#csv-formats)
- [Tagging by Query](#tagging-by-query)
- [Aggregate](#aggregate)
- [Following your changes](#following-your-changes)
- [Configuration File](#configuration-file)
- [Testing](#testing)
- [Releasing](#releasing)
- [1. Create a new release](#1-create-a-new-release)
- [2. Upload distributions on pypi](#2-upload-distributions-on-pypi)
- [3. Build and publish the Docker image](#3-build-and-publish-the-docker-image)
- [4. Push your changes on Github](#4-push-your-changes-on-github)

---
## Installation
You can install Datashare Tarentula with your favorite package manager:
```
pip3 install --user tarentula
```

Or alternatively, with Docker:
```
docker run icij/datashare-tarentula
```

## Usage
Datashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focused on bulk actions, it provides both a CLI and a Python API.
### Cookbook 👩🍳
To learn more about how to use Datashare Tarentula with a list of examples, please refer to the Cookbook.
### Count
A command to just count the number of files matching a query.
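For example, assuming a Datashare instance at its default local URL and a project named `local-datashare` (adjust both to your setup), counting documents matching a query string might look like:

```shell
# Count documents matching "banana" in a local Datashare project.
# URL and project name are example values.
tarentula count \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  --query "banana"
```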
```
Usage: tarentula count [OPTIONS]

Options:
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT      You can additionally pass the Elasticsearch
                                URL in order to use scrolling capabilities of
                                Elasticsearch (useful when dealing with a
                                lot of results)
  --query TEXT                  The query string to filter documents
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
  --apikey TEXT                 Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit
```

### Clean Tags by Query
A command that uses Elasticsearch `update-by-query` feature to batch untag documents directly in the index.
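As a sketch, untagging every document matching a JSON query stored in a file (the Elasticsearch URL and project name below are example values for a local setup):

```shell
# Remove tags from all documents matching the query in query.json.
# The @ prefix loads the query from a file, as described in --query's help.
tarentula clean-tags-by-query \
  --datashare-project local-datashare \
  --elasticsearch-url http://localhost:9200 \
  --query @query.json
```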
```
Usage: tarentula clean-tags-by-query [OPTIONS]

Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
  --apikey TEXT                 Datashare authentication apikey
  --traceback / --no-traceback  Display a traceback in case of error
  --wait-for-completion / --no-wait-for-completion
                                Create an Elasticsearch task to perform the
                                update asynchronously
  --query TEXT                  Give a JSON query to filter documents that
                                will have their tags cleaned. It can be
                                a file with @path/to/file. Defaults to all.
--help Show this message and exit
```

### Download
A command to download all files matching a query.
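A hypothetical invocation, downloading matching documents into a local directory while throttling requests (all values are examples):

```shell
# Download documents matching "contract" into ./downloads,
# waiting 100 ms between requests and stopping after 500 documents.
tarentula download \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  --query "contract" \
  --destination-directory ./downloads \
  --throttle 100 \
  --limit 500
```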
```
Usage: tarentula download [OPTIONS]

Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --destination-directory TEXT    Directory where documents will be downloaded
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can separate
                                  several with semicolons: key1=val1;key2=val2;...
  --path-format TEXT              Downloaded document path template
  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the downloaded document from the index
  -f, --from INTEGER              Passed to the search, it will bypass the
                                  first n documents
-l, --limit INTEGER Limit the total results to return
--sort-by TEXT Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--once / --not-once Download file only once
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--raw-file / --no-raw-file Download raw file from Datashare
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit.
```

### Export by Query
A command to export all files matching a query.
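For instance, exporting a couple of fields for every matching document to a CSV file (the field names passed to `--source`, like the URL and project, are illustrative):

```shell
# Export the path and creation date of every matching document to export.csv.
tarentula export-by-query \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  --query "*" \
  --source path,creationDate \
  --output-file export.csv
```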
```
Usage: tarentula export-by-query [OPTIONS]

Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --output-file TEXT              Path to the CSV file
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can separate
                                  several with semicolons: key1=val1;key2=val2;...
  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the export
  --sort-by TEXT                  Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--type [Document|NamedEntity] Type of indexed documents to download
-f, --from INTEGER Passed to the search it will bypass the
first n documents
-l, --limit INTEGER Limit the total results to return
  --size INTEGER                  Size of the scroll request that powers the
                                  operation.
  --query-field / --no-query-field
Add the query to the export CSV
--help Show this message and exit.
```

### Tagging
A command to batch tag documents with a CSV file.
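A minimal sketch, assuming a local Datashare instance and a `tags.csv` file in one of the layouts documented under "CSV formats":

```shell
# Apply the tags listed in tags.csv to the referenced documents.
tarentula tagging \
  --datashare-url http://localhost:8080 \
  --datashare-project local-datashare \
  tags.csv
```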
```
Usage: tarentula tagging [OPTIONS] CSV_PATH

Options:
--datashare-url TEXT http://localhost:8080 Datashare URL
--datashare-project TEXT local-datashare Datashare project
--throttle INTEGER 0 Request throttling (in ms)
--cookies TEXT _Empty string_ Key/value pair to add a cookie to each request to the API. You can separate several with semicolons: key1=val1;key2=val2;...
--apikey TEXT None Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--help Show this message and exit
```

#### CSV formats
Tagging with a `documentId` and `routing`:
```csv
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
```

Tagging with a `documentUrl`:
```csv
tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
```

### Tagging by Query
A command that uses Elasticsearch `update-by-query` feature to batch tag documents directly in the index.
To see an example of input file, refer to [this JSON](tests/fixtures/tags-by-content-type.json).
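A sketch of such a run against an assumed local Elasticsearch, using a JSON file named after the test fixture above:

```shell
# Tag documents directly in the index, without waiting for the
# Elasticsearch update-by-query task to finish.
tarentula tagging-by-query \
  --datashare-project local-datashare \
  --elasticsearch-url http://localhost:9200 \
  --no-wait-for-completion \
  tags-by-content-type.json
```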
```
Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH

Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--throttle INTEGER Request throttling (in ms)
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
  --wait-for-completion / --no-wait-for-completion
                                Create an Elasticsearch task to perform the
                                update asynchronously
--help Show this message and exit
```

### List Metadata
You can list the metadata from the mapping, optionally counting the number of occurrences of each field in the index, with the `--count` parameter. Counting the fields is disabled by default.
It includes a `--filter_by` parameter to narrow the retrieved metadata to properties of specific sets of documents. For instance, it can be used to get just email-related properties with: `--filter_by "contentType=message/rfc822"`
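Putting both options together, a hypothetical run (project name is an example) could look like:

```shell
# List the metadata fields of email documents and count how many
# documents carry each field.
tarentula list-metadata \
  --datashare-project local-datashare \
  --filter_by "contentType=message/rfc822" \
  --count
```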
```
$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]

Options:
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT       You can additionally pass the Elasticsearch
                                 URL in order to use scrolling capabilities of
                                 Elasticsearch (useful when dealing with a lot
                                 of results)
--type [Document|NamedEntity] Type of indexed documents to get metadata
  --filter_by TEXT               Filter documents by comma-separated pairs of
                                 field names and values separated by =.
                                 Example: "contentType=message/rfc822,
                                 contentType=message/rfc822"
  --count / --no-count           Count or not the number of docs for each
                                 property found
  --help                         Show this message and exit.
```
### Aggregate
You can run aggregations on the data; the Elasticsearch aggregations API is partially exposed through this command.
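For instance, a count aggregation grouped by a field might be run as follows (the project name is an example; `contentType` is a field Datashare indexes for documents):

```shell
# Count documents grouped by their content type.
tarentula aggregate \
  --datashare-project local-datashare \
  --group_by contentType \
  --run count
```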
The possible operations are:
- count: groups documents by the distinct values of a given field and counts the number of docs in each group.
- nunique: returns the number of unique values of a given field.
- date_histogram: returns counts of monthly or yearly grouped values for a given date field.
- sum: returns the sum of values of a numeric field.
- min: returns the min of values of a numeric field.
- max: returns the max of values of a numeric field.
- avg: returns the average of values of a numeric field.
- stats: returns a set of statistics for a given numeric field.
- string_stats: returns a set of string statistics for a given string field.

```
$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]

Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
  --elasticsearch-url TEXT      You can additionally pass the Elasticsearch
                                URL in order to use scrolling capabilities of
                                Elasticsearch (useful when dealing with a
                                lot of results)
--query TEXT The query string to filter documents
  --cookies TEXT                Key/value pair to add a cookie to each
                                request to the API. You can separate
                                several with semicolons: key1=val1;key2=val2;...
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to download
--group_by TEXT Field to use to aggregate results
--operation_field TEXT Field to run the operation on
--run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
Operation to run
--calendar_interval [year|month]
Calendar interval for date histogram
aggregation
--help Show this message and exit.
```

### Following your changes
When running Elasticsearch changes on big datasets, the task can take a very long time. Since we kept curling Elasticsearch to check that the task was still running well, we added a small utility to follow the changes. It draws a live graph of a provided Elasticsearch indicator with a specified filter.
It uses [matplotlib](https://matplotlib.org/) and python3-tk.
If you see the following message:
```
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
```

Then you have to install [tkinter](https://docs.python.org/3/library/tkinter.html), i.e. python3-tk for Debian/Ubuntu.
The command has the options below:
```
$ graph_es --help
Usage: graph_es [OPTIONS]

Options:
  --query TEXT                Give a JSON query to filter documents. It can be
                              a file with @path/to/file. Defaults to all.
--index TEXT Elasticsearch index (default local-datashare)
--refresh-interval INTEGER Graph refresh interval in seconds (default 5s)
--field TEXT Field value to display over time (default "hits.total")
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query (default http://elasticsearch:9200)
```

## Configuration File
Tarentula supports several sources for configuring its behavior, including an INI file and command-line options.
The configuration file is searched for in the following order (the first file found is used, all others are ignored):
* `TARENTULA_CONFIG` (environment variable if set)
* `tarentula.ini` (in the current directory)
* `~/.tarentula.ini` (in the home directory)
* `/etc/tarentula/tarentula.ini`

It should follow this format (all values below are optional):
```
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare

[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
```

## Testing
To test this tool, you must have Datashare and Elasticsearch running on your development machine.
After you have [installed Datashare](https://datashare.icij.org/), just run it with a test project/user:
```
datashare -p test-datashare -u test
```

In a separate terminal, install the development dependencies:
```
make install
```

Finally, run the tests:
```
make test
```

## Releasing
The releasing process uses [bumpversion](https://pypi.org/project/bumpversion/) to manage versions of this package, [pypi](https://pypi.org/project/tarentula/) to publish the Python package and [Docker Hub](https://hub.docker.com/) for the Docker image.
### 1. Create a new release
```
make [patch|minor|major]
```

### 2. Upload distributions on pypi
_To be able to do this, you will need to be a maintainer of the [pypi](https://pypi.org/project/tarentula/) project._
```
make distribute
```

### 3. Build and publish the Docker image
To build and upload a new image to the [docker repository](https://hub.docker.com/repository/docker/icij/datashare-tarentula):
_To be able to do this, you will need to be part of the ICIJ organization on Docker Hub._
```
make docker-publish
```

**Note**: Datashare Tarentula is a multi-platform build. You might need to set up your environment for
multi-platform builds using the `make docker-setup-multiarch` command. Read more
[in the Docker documentation](https://docs.docker.com/build/building/multi-platform/).

### 4. Push your changes on Github
Git push the release and tag:
```
git push origin master --tags
```