# feed - ArangoDB random data and load generator

This project is about putting data and load into and on an ArangoDB
instance. Its purpose is to cover:

- large data generation
- quick data import
- highly parallel load of write and read operations
- multiple different collection types and graphs (smart, hybrid, satellite)
- different data scenarios (large/small documents, few/many indexes,
search yes/no)
- different write scenarios (insert/replace/update/remove/truncate)
- different read scenarios (bulk read/random read/index read/search read)
- graph traversals
- a domain-specific language to describe parallel and sequential loads
- automatic measurement and reporting of throughput and latency for all
operations

## Example

```
[
  normal create database=xyz collection=c numberOfShards=3 replicationFactor=3 drop=true
  normal insert database=xyz collection=c parallelism=10
]
```

This first creates a normal collection and then inserts some data into
it with 10 parallel go-routines (threads).
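
Assuming the program above is stored in a file `doit.feed` (the default
value of the `--execute` option), a sketch of running it against a local
server could look like this (endpoint and credentials are placeholders):

```
./feed --endpoints http://localhost:8529 --username root --password "" --execute doit.feed
```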

## Command line options

Here is the usage page of the tool. Endpoints can be of type `http` or
`https`; they can be separated by commas, or the option can be given
multiple times, or both. The endpoints should be different coordinator
endpoints of the same cluster. An example invocation is shown after the
usage page.

```
The 'feed' tool feeds ArangoDB with generated data, quickly.

Usage:
[flags]

Flags:
      --endpoints strings       Endpoint of server where data should be written. (default [http://localhost:8529])
      --execute string          Filename of program to execute. (default "doit.feed")
  -h, --help                    help for this command
      --jsonOutputFile string   File name of JSON report on performance which is written.
      --jwt string              JWT token for database access
      --metricsPort int         Metrics port (0 for no metrics) (default 8888)
      --password string         Password for database access
      --protocol string         Protocol (http1, http2, vst) (default "vst")
      --username string         User name for database access. (default "root")
  -v, --verbose                 Verbose output
```
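
As an illustration of the endpoint syntax described above, a sketch of an
invocation against a cluster with three coordinators could look like this
(host names, credentials and file names are placeholders; note that
`--endpoints` accepts comma-separated values and can also be repeated):

```
./feed \
  --endpoints https://coord1:8529,https://coord2:8529 \
  --endpoints https://coord3:8529 \
  --protocol http2 \
  --username root --password secret \
  --execute myprog.feed \
  --jsonOutputFile report.json
```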

## Misc cases

- `wait`: A no-op waiting program.

## Data cases

This is only an overview; see below for the lists of subcommands and,
later on, detailed instructions for each.

- `normal`: This is a normal (vertex) collection.
- `graph`: This is for graphs.
- `replayAQL`: This replays a previously recorded list of AQL queries
against a new deployment.

### Operation cases for `normal`

This is only an overview, see below for detailed instructions for each
subcommand.

- `create`: create collection or graph
- `drop`: drop a collection
- `truncate`: truncate a collection
- `insert`: bulk insert data
- `createIdx`: create an index
- `dropIdx`: drop an index
- `randomRead`: read documents randomly in parallel
- `randomUpdate`: perform updates in documents randomly in parallel
- `randomReplace`: perform replacements of documents in parallel
- `dropDatabase`: drop a database
- `queryOnIdx`: run an AQL query using an index (including primary index)
- `createView`: create a view (and potentially some analyzers)
- `dropView`: drop a view
- `nastyViewData`: create data in a collection for a particularly
nasty search view

### Operation cases for `graph`

This is only an overview, see below for detailed instructions for each
subcommand.

- `insertvertices`: for graph cases, insert vertex data
- `insertedges`: for graph cases, insert edge data
- `randomTraversal`: for graph cases, run a traversal from a random
starting point

## `feedlang` reference

Rules:

- Input is line based.
- Curly braces allow grouping for parallel execution.
- Square brackets allow grouping for sequential execution.
- Braces and brackets need to be on a line by themselves.
- White space at the beginning and end of a line is ignored.
- Lines in which the first non-white space character is `#` are comments.
- Empty lines are ignored.

Example:

```
# Comment

[
  {
    wait 1
    wait 2
    wait 3
  }
  {
    wait 4
    wait 3
  }
]
```

This executes the first three `wait` statements concurrently, and when
all three are done (i.e. after 3 seconds), moves to the last two, which
are also executed concurrently (i.e. they end after another 4 seconds).

## Subcommand `create` (for `normal`)

Creates a collection.

Example of usage:

```
normal create database=xyz collection=c numberOfShards=3 replicationFactor=2
```

Possible parameters for usage:

- `database`: name of the database where the collection will be created
(default: `_system`)
- `collection`: name of the collection to be created
(default: `batchimport`)
- `numberOfShards`: number of shards of the collection (default: `3`)
- `replicationFactor`: replication factor (number of replicas for each
shard) (default: `3`)
- `waitForSync`: boolean flag, if set to true, all write operations on this
collection will wait until all data is persisted to disk (default: `false`)
- `drop`: boolean flag, if set to true, drops collection if it exists
before recreating it (default: `false`)
- `replicationVersion`: set replication Version 1 (default) or replication
Version 2 (when creating the database)
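
As a sketch combining the parameters above, the following line drops a
potentially existing collection and recreates it with synchronous writes:

```
normal create database=xyz collection=c numberOfShards=3 replicationFactor=2 waitForSync=true drop=true
```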

## Subcommand `drop` (for `normal`)

Drops a collection.

Example of usage:

```
normal drop database=xyz collection=c
```

Possible parameters for usage:

- `database`: name of the database where the collection will be dropped
(default: `_system`)
- `collection`: name of the collection which will be dropped
(default: `batchimport`)

## Subcommand `truncate` (for `normal`)

Truncates a collection.

Example of usage:

```
normal truncate database=xyz collection=c
```

Possible parameters for usage:

- `database`: name of the database where the collection is truncated
(default: `_system`)
- `collection`: name of the collection which is truncated
(default: `batchimport`)

## Subcommand `dropDatabase` (for `normal`)

Drops a database.

Example of usage:

```
normal dropDatabase database=xyz
```

Possible parameters for usage:

- `database`: name of the database which will be dropped

## Subcommand `createIdx` (for `normal`)

Creates an index.

Example of usage:

```
normal createIdx database=xyz collection=c withGeo=false numberFields=4 idxName="myIdx"
```

Possible parameters for usage:

- `database`: name of the database where the index will be created
(default: `_system`)
- `collection`: name of the collection where the index will be created
(default: `batchimport`)
- `withGeo`: whether or not it's a geo index (default: `false`)
- `numberFields`: number of fields the index will cover (default: `1`)
- `idxName`: user-defined name for the index. Leading and trailing
quotes are ignored. (default: `idx` followed by a random number, e.g. `idx123`)
- `replicationVersion`: set replication Version 1 (default) or replication
Version 2 (when creating the database)

## Subcommand `dropIdx` (for `normal`)

Drops an index.

Example of usage:

`normal dropIdx database=xyz collection=c idxName=myIdx`

Possible parameters for usage:

- `database`: name of the database where the index will be dropped (default: `_system`)
- `collection`: name of the collection where the index will be dropped (default: `batchimport`)
- `idxName`: user-defined name for the index. Leading and trailing quotes are ignored.

## Subcommand `insert` (for `normal`)

Inserts data into a collection with batches.

Example of usage:

```
normal insert database=xyz collection=c parallelism=10 size=5G documentSize=300 withGeo=false withWords=5 numberFields=5 batchSize=5000
```

Possible parameters for usage:

- `database`: name of the database into which the data is inserted
(default: `_system`)
- `collection`: name of the collection into which the data is inserted
(default: `batchimport`)
- `parallelism`: number of threads to execute the inserts concurrently
(default: `16`)
- `batchSize`: size of the batches (default: `1000`)
- `startDelay`: delay in milliseconds between starts of different
threads, this is used to stagger the startup of the different threads
(go routines) (default: `5`)
- `timeout`: timeout in seconds for each batch insert, this is by
default very high, but can be lowered for special experiments
(default: `3600`)
- `retries`: number of retries after an error, errors with successful
retries are counted as errors, but do not lead to an abortion of the
operation, by default, no retries are done (default: `0`)
- `size`: total size of data to be inserted, one can use suffixes `G`
for gigabytes (1024^3), `T` for terabytes (1024^4), `M` for megabytes
(1024^2) and `K` for kilobytes (1024), (default: `16G`)
- `documentSize`: size in bytes of an individual document, this is an
approximation, but should be relatively OK, if you use `withGeo`
and/or `withWords`, the size will be slightly larger (default: `128`)
- `withGeo`: if set to `true`, a special field called `geo` will be
generated with a random polygon in GeoJSON format (default: `false`)
- `withWords`: if set to `true`, a special field called `words` will
be generated with a list of some random words from a finite pool
(default: `false`)
- `keySize`: size in bytes of the key generated, the key will always be
a SHA256 value of a stringified integer in the range from 0 to N-1
for some value of N-1 determined by the total size of the data,
`keySize` is then used to take a prefix of that SHA256 value (hex
stringification) of the appropriate number of bytes (default: `32`)
- `numberFields`: number of payload fields to generate, the randomly
generated string data is distributed across that many fields called
`payload0`, `payload1` and so on (default: `1`)
- `useAql`: If set to `true`, any CRUD operation (create, replace,
update, delete) will be executed through AQL queries instead of
using the regular document HTTP API. Can only be used if `addFromTo`
is not set to `true`. (default: `false`)
- `oneShard`: If set to `true` the database will be created as a one
shard database (default `false`)
- `replicationVersion`: set replication Version 1 (default) or replication
Version 2 (when creating the database)

The following are for edge collections to produce random `_from` and
`_to` values. Do not use them in actual graph insertion commands:

- `addFromTo`: if set to `true`, add random values for the `_from` and
`_to` attributes to allow insertion into an edge collection (default:
`false`)
- `smart`: if set to `true`, add the necessary attributes for a smart edge
collection; this is only relevant if `addFromTo` is set to `true`; it
changes the format of these attribute values by adding a smart graph
attribute value, and the generated `_key` is adjusted accordingly
(default: `false`)
- `vertexCollName`: set this string value to the name of the vertex
collection, which should be used in `_from` and `_to` attributes,
this is only relevant if `addFromTo` is set to `true` (default: `V`)
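
For example, a sketch of filling an edge collection `E` for a (non-smart)
graph whose vertices live in a collection `V`, using the edge-related
parameters above:

```
normal insert database=xyz collection=E parallelism=8 size=1G addFromTo=true vertexCollName=V
```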

## Subcommand `queryOnIdx` (for `normal`)

Runs queries in parallel which use a specific index, that is, each query
performs an operation on an attribute that is covered by the index, so
that the index is actively used during query execution.

Example of query:

```
FOR doc IN c
  SORT doc.attrCoveredByIdx
  FILTER doc.attrCoveredByIdx >= 0
  LIMIT 1
  RETURN doc
```

Example of usage:

```
normal queryOnIdx database=xyz collection=c limit=1 parallelism=1 idxName=primary
```

Possible parameters for usage:

- `database`: name of the database where the index queries will be
executed (default: `_system`)
- `collection`: name of the collection where the index queries will be
executed (default: `batchimport`)
- `idxName`: index name. If set to `primary`, the primary index will be
used in the query. If it's another string, the queries will be performed
on an attribute of the index with the specified name. If not set, the
query will be performed on the first non-primary index found among the
collection's indexes.
- `parallelism`: number of threads to execute the queries concurrently
(default: `16`)
- `loadPerThread`: number of times each thread executes the query on
the same index (default: `50`)
- `queryLimit`: number of documents returned from the query (default: `1`),
this controls whether it is a single document or a batch request.

## Subcommand `randomRead` (for `normal`)

Reads single documents randomly in parallel.

Example of usage:

`normal randomRead database=xyz collection=c parallelism=10 loadPerThread=50`

Possible parameters for usage:

- `database`: name of the database where random reads will be executed
(default: `_system`)
- `collection`: name of the collection where the random reads will be
executed (default: `batchimport`)
- `parallelism`: number of threads to execute the reads concurrently
(default: `16`)
- `loadPerThread`: number of times each thread executes random reads
(default: `50`)

## Subcommand `randomUpdate` (for `normal`)

Executes updates on random documents.

Example of usage:

```
normal randomUpdate database=xyz collection=c parallelism=10 loadPerThread=50
```

Possible parameters for usage:

- `database`: name of the database where random updates will be
executed (default: `_system`)
- `collection`: name of the collection where the random updates will be
executed (default: `batchimport`)
- `parallelism`: number of threads to execute the updates concurrently
(default: `16`)
- `loadPerThread`: number of times each thread executes random updates
(default: `50`)
- `batchSize`: size of the batches (default: `1000`)

For the documents written by the updates, the same parameters as in the
`insert` case above can be used (see the sketch below).
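
For example, a sketch of an update workload which writes larger documents
with a geo field, combining the parameters above with document parameters
from `insert`:

```
normal randomUpdate database=xyz collection=c parallelism=10 loadPerThread=50 documentSize=300 withGeo=true
```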

## Subcommand `randomReplace` (for `normal`)

Executes replacements of random documents.

Example of usage:

```
normal randomReplace database=xyz collection=c parallelism=10 loadPerThread=50
```

Possible parameters for usage:

- `database`: name of the database where random replaces will be
executed (default: `_system`)
- `collection`: name of the collection where the random replaces will
be executed (default: `batchimport`)
- `parallelism`: number of threads to execute the replaces concurrently
(default: `16`)
- `loadPerThread`: number of times each thread executes random replaces
(default: `50`)
- `batchSize`: size of the batches (default: `1000`)

For the documents written as replacements, the same parameters as in the
`insert` case above can be used.

## Subcommand `createView` (for `normal`)

Creates a view (with proper links) and potentially analyzers.

Example of usage:

```
normal createView database=xyz viewDefFile=view.json analyzersDefFile=analyzers.json view=v drop=true
```

Possible parameters for usage:

- `database`: name of the database where the view will be created
(default: `_system`)
- `replicationVersion`: set replication Version 1 (default) or replication
Version 2 (when creating the database)
- `view`: name of view to create (default: `v`)
- `drop`: boolean flag, if set to true, drops view if it exists
before recreating it (default: `false`)
- `viewDefFile`: name of file with a JSON description of the view, see
below for an example, in particular, this contains the collections
for which links are generated
- `analyzersDefFile`: name of file with a JSON description of the
analyzers, which ought to be created, see below for an example,
analyzers are created before the view, so that they can immediately
be used in the view definition

Here is an example of a JSON file to create some analyzers; note in
particular that it needs to be an array of objects, one for each
analyzer:

```json
[
  {
    "name": "segmentation",
    "type": "segmentation",
    "properties": {
      "case": "lower",
      "break": "alpha"
    },
    "features": [
      "frequency",
      "position",
      "norm"
    ]
  },
  {
    "name": "identity",
    "type": "identity",
    "properties": {},
    "features": [
      "frequency",
      "norm"
    ]
  },
  {
    "name": "text_en",
    "type": "text",
    "properties": {
      "locale": "en",
      "case": "lower",
      "stopwords": [],
      "accent": false,
      "stemming": true
    },
    "features": [
      "frequency",
      "position",
      "norm"
    ]
  }
]
```

Here is an example for a view definition file:

```json
{
  "links": {
    "c": {
      "analyzers": [
        "identity"
      ],
      "collectionName": "transactions",
      "fields": {
        "words": {
          "analyzers": [
            "text_en",
            "identity"
          ],
          "cache": true
        }
      },
      "inBackground": true,
      "includeAllFields": false,
      "name": "idx_1765991063252631552",
      "primaryKeyCache": true,
      "primarySort": [],
      "primarySortCompression": "lz4",
      "storeValues": "none",
      "trackListPositions": false,
      "type": "arangosearch",
      "version": 1
    }
  },
  "storedValues": [
    {
      "fields": [
        "_key",
        "words"
      ],
      "compression": "lz4",
      "cache": true
    }
  ]
}
```

## Subcommand `dropView` (for `normal`)

Drops a view.

Example of usage:

```
normal dropView database=xyz view=v
```

Possible parameters for usage:

- `database`: name of the database where the view will be dropped
(default: `_system`)
- `view`: name of the view to drop; note that associated links are dropped
as well, but analyzers are not, since we do not know which analyzers
have been created alongside this view and which existed before
(default: `v`)

## Subcommand `nastyViewData` (for `normal`)

Inserts data into a collection with batches, producing data which is
particularly nasty for a search view.

Example of usage:

```
normal nastyViewData database=xyz collection=c parallelism=10 size=5G parallelism=16 batchSize=5000
```

Possible parameters for usage:

- `database`: name of the database into which the data is inserted
(default: `_system`)
- `collection`: name of the collection into which the data is inserted
(default: `batchimport`)
- `parallelism`: number of threads to execute the inserts concurrently
(default: `16`)
- `batchSize`: size of the batches (default: `1000`)
- `startDelay`: delay in milliseconds between starts of different
threads, this is used to stagger the startup of the different threads
(go routines) (default: `5`)
- `timeout`: timeout in seconds for each batch insert, this is by
default very high, but can be lowered for special experiments
(default: `3600`)
- `retries`: number of retries after an error, errors with successful
retries are counted as errors, but do not lead to an abortion of the
operation, by default, no retries are done (default: `0`)
- `size`: total size of data to be inserted, one can use suffixes `G`
for gigabytes (1024^3), `T` for terabytes (1024^4), `M` for megabytes
(1024^2) and `K` for kilobytes (1024), (default: `16G`)
- `keySize`: size in bytes of the key generated, the key will always be
a SHA256 value of a stringified integer in the range from 0 to N-1
for some value of N-1 determined by the total size of the data,
`keySize` is then used to take a prefix of that SHA256 value (hex
stringification) of the appropriate number of bytes (default: `32`)
- `numberFields`: number of stored fields to generate ("data0",
"data1", ...) (default: `10`)
- `oneShard`: If set to `true` the database will be created as a one
shard database (default `false`)
- `replicationVersion`: set replication Version 1 (default) or replication
Version 2 (when creating the database)
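
Putting the search-related subcommands together, a sketch of a complete
feedlang program which creates a collection, fills it with nasty view data
and then creates a view over it could look like this (the file names are
placeholders):

```
[
  normal create database=xyz collection=c numberOfShards=3 drop=true
  normal nastyViewData database=xyz collection=c size=1G parallelism=16
  normal createView database=xyz view=v viewDefFile=view.json analyzersDefFile=analyzers.json drop=true
]
```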

## Subcommand `create` (for `graph`)

Creates a graph.

Examples of usage:

```
graph create database=xyz name=G vertexColl=V edgeColl=E type=cyclic graphSize=2000
```

Possible parameters for usage:

- `database`: name of the database where the graph will be created
(default: `_system`)
- `name`: name of the graph to be created (default: `G`)
- `vertexColl`: name of the vertex collection of the graph (default: `V`)
- `edgeColl`: name of the edge collection of the graph (default: `E`)
- `type`: type of graph to be created, currently we have `cyclic` and
`tree` (default: `cyclic`)
- `graphSize`: number of vertices of the graph (default: `2000`)
- `graphDepth`: depth of the created tree (default: `10`)
- `graphBranching`: branching factor of the tree (default: `2`)
- `graphDirection`: direction of edges in the tree, can be `downwards`
or `upwards` or `bidirected` (default: `downwards`)

For the different types of graphs, the following options are relevant:

- `cyclic`: only needs `graphSize`
- `tree`: needs `graphDepth`, `graphBranching` and `graphDirection`.
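
For example, a sketch of creating a tree-shaped graph with the parameters
listed above:

```
graph create database=xyz name=T vertexColl=TV edgeColl=TE type=tree graphDepth=8 graphBranching=3 graphDirection=downwards
```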

## Subcommand `insertvertices` (for `graph`)

Batch inserts the vertices of a graph.

Examples of usage:

```
graph insertvertices database=xyz name=G vertexColl=V type=cyclic graphSize=2000
```

Possible parameters for usage: see `create`, plus:

- `batchSize`: size of the batches for the insert (default: `1000`)
- `parallelism`: number of threads (go-routines) to use client-side
(default: `16`)

For the documents created as vertices, the same parameters as in the
`insert` case above (subcommand for `normal`) can be used.

## Subcommand `insertedges` (for `graph`)

Batch inserts the edges of a graph.

Examples of usage:

```
graph insertedges database=xyz name=G edgeColl=E type=cyclic graphSize=2000
```

Possible parameters for usage: see `create`, plus:

- `batchSize`: size of the batches for the insert (default: `1000`)
- `parallelism`: number of threads (go-routines) to use client-side
(default: `16`)

For the documents created as edges, the same parameters as in the
`insert` case above (subcommand for `normal`) can be used.
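
As a sketch, a complete feedlang program which builds a graph sequentially
(create it, then insert vertices, then insert edges) could look like this:

```
[
  graph create database=xyz name=G vertexColl=V edgeColl=E type=cyclic graphSize=2000
  graph insertvertices database=xyz name=G vertexColl=V type=cyclic graphSize=2000 parallelism=8
  graph insertedges database=xyz name=G edgeColl=E type=cyclic graphSize=2000 parallelism=8
]
```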

## Command `replayAQL`

Replays a previously recorded list of AQL queries against a new
deployment. You have to switch the log topic `requests` to the `TRACE`
level on all your coordinators. You can either do this on the command
line with

```
--log.level=requests=trace
```

or at runtime by sending a body of

```
{"requests":"trace"}
```

to the API `PUT /_admin/log/level` on all coordinators.
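
For example, a sketch of switching the log level at runtime with `curl`
(endpoint and credentials are placeholders; this has to be repeated for
every coordinator):

```
curl -X PUT --user root:secret http://coordinator:8529/_admin/log/level -d '{"requests":"trace"}'
```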

Then use the awk script `scripts/extractQueries.awk` on your log as
follows:

```
awk -f scripts/extractQueries.awk < coordinator.log > queries.jsonl
```

This will produce a file `queries.jsonl` with lines like this:

```
{"t":"2023-03-09T08:31:09Z", "db": "_system", "q":{"query":"FOR d IN c FILTER d.Hallo == @x RETURN d","count":false,"bindVars":{"x":12},"stream":false}}
```

You can then use this file as input file for `feed` for a command like
this:

```
replayAQL input=queries.jsonl parallelism=8 delayByTimestamp=true
```

Possible parameters for usage:

- `parallelism`: number of threads (go-routines) to use client-side
(default: `16`)
- `input`: name of input file in the above format
- `delayByTimestamp`: a boolean parameter, for an explanation see below

This command will take the queries and run them against the current
deployment which is to be tested. It uses the same database names and
query parameters as in the recording. All results of each query are
consumed until the database delivers no more.

The `feed` tool will use as many go routines (threads) as given in the
`parallelism` parameter. Each go routine will grab a query and execute
it against the deployment until it returns no more results, and then
moves to the next query. Note that each query in the input is executed
by exactly one of the go routines.

If the argument `delayByTimestamp` is `false`, then each go routine will
try to execute all queries it can grab as quickly as possible (one by
one). If, however, the argument is `true`, then each go routine will
take the time stamps into account: it will delay the execution of a
query as long as the time that has passed since the command started is
smaller than the difference between the query's time stamp and the very
first time stamp in the file. The effect of this is essentially that
the queries are executed at the original frequency (unless the
parallelism is lower than the number of concurrently running queries in
the recording).

Note that if you combine logs from multiple coordinators, you have to
merge these files and sort the resulting input file by time stamp.
The time stamps must be in RFC 3339 format.
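
Since each line starts with the time stamp field `t` (see the example
above) and RFC 3339 time stamps in the same time zone sort
lexicographically, a simple sketch for merging the extracted logs of
several coordinators could be (file names are placeholders):

```
awk -f scripts/extractQueries.awk < coordinator1.log > queries1.jsonl
awk -f scripts/extractQueries.awk < coordinator2.log > queries2.jsonl
sort queries1.jsonl queries2.jsonl > queries.jsonl
```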