# post
`post` is a program for processing structured data files in bulk.
It was originally intended as an automation tool for generating [LaTeX][latex]
graphs from `functionObject` data generated by [OpenFOAM®][openfoam] simulations,
but has since evolved such that it can be used as a general structured data
processor with optional graph generation support.

Its primary use is processing and formatting data spread over multiple files
and/or archives. The main benefit is that the entire process is defined
through one or more YAML formatted run files, so automating data processing
pipelines is fairly simple and requires no programming.

## Contents
- [Installation](#installation)
- [CLI usage](#cli-usage)
- [Run file structure](#run-file-structure)
- [Input](#input)
- [Processing](#processing)
- [Output](#output)
- [Graphing](#graphing)
- [Templates](#templates)

## Installation
If [Go][golang] is installed locally, the following command will compile and
install the latest version of `post`:

```shell
$ go install github.com/Milover/post@latest
```

Precompiled binaries for Linux, Windows and Mac OS (Apple silicon) are also
available under [releases][post-release].

Finally, `post` can also be built from source, assuming [Go][golang] is
available locally, by running the following commands:

```shell
$ git clone https://github.com/Milover/post
$ cd post
$ go install
```

## CLI usage
Usage:
```
post [run file] [flags]
post [command]
```

Available Commands:
```
completion Generate the autocompletion script for the specified shell
graphfile Generate graph file stub(s)
help Help about any command
runfile Generate a run file stub
```

Flags:
```
--dry-run check runfile syntax and exit
-h, --help help for post
--log-mem log memory usage at the end of each pipeline
--no-graph don't write or generate graphs
--no-graph-generate don't generate graphs
--no-graph-write don't write graph files
--no-output don't output data
--no-process don't process data
--only-graphs only write and generate graphs, skip input, processing and output
--skip strings a list of pipeline IDs to be skipped during processing
-v, --verbose verbose log output
```

## Run file structure
`post` is controlled by a run file in YAML format, supplied as a CLI parameter.
The run file usually consists of a list of pipelines, each defining 4 sections:
`input`, `process`, `output` and `graph`. The `input` section defines input
files and formats from which data is read; the `process` section defines
operations which are applied to the data; the `output` section defines how
the processed data will be output/stored; and the `graph` section defines
how the data will be graphed.

> Note: All file paths within the run file are evaluated using
> the run file's parent directory as the current working directory.

All sections are optional and can be omitted, defined by themselves, or
as part of a pipeline. A special case is the `template` section,
which *cannot* be defined as part of a pipeline.
See [Templates](#templates) for a breakdown of their use.

A single pipeline has the following fields:
```yaml
- id:
  input:
    type:
    fields:
    type_spec:
  process:
    - type:
      type_spec:
  output:
    - type:
      type_spec:
  graph:
    type:
    graphs:
```

- `id`: the pipeline tag, used to reference the pipeline on the CLI; optional
- `input`: the input section
  - `type`: input type; see [Input](#input) for type descriptions
  - `fields`: field (column) names of the input data; optional
  - `type_spec`: input type specific configuration
- `process`: the process section
  - `type`: process type; see [Processing](#processing) for type descriptions
  - `type_spec`: process type specific configuration
- `output`: the output section
  - `type`: output type; see [Output](#output) for type descriptions
  - `type_spec`: output type specific configuration
- `graph`: the graph section
  - `type`: graph type; see [Graphing](#graphing) for type descriptions
  - `graphs`: a list of graph type specific graph configurations

A simple run file example is shown below.
```yaml
- input:
    type: dat
    fields: [x, y]
    type_spec:
      file: 'xy.dat'
  process:
    - type: expression
      type_spec:
        expression: '100*y'
        result: 'result'
  output:
    - type: csv
      type_spec:
        file: 'output/data.csv'
  graph:
    type: tex
    graphs:
      - name: xy.tex
        directory: output
        table_file: 'output/data.csv'
        axes:
          - x:
              min: 0
              max: 1
              label: '$x$'
            y:
              min: 0
              max: 100
              label: '$100 y$'
            tables:
              - x_field: x
                y_field: result
                legend_entry: 'result'
```

The example run file instructs `post` to do the following:
1. read data from a DAT formatted file `xy.dat` and rename the fields (columns)
to `x` and `y`
2. evaluate the expression `100*y` and store the result to a field named `result`
3. output the data, now containing the fields `x`, `y` and `result`, to a
   CSV formatted file `output/data.csv`; if the directory `output` does not
   exist, it will be created
4. generate a graph using TeX in the `output` directory, using `output/data.csv`
as the table (data) file, with `x` as the abscissa and `result` as the ordinate

For more examples see the [examples/](examples) directory.
A generic run file stub, which can be a useful starting point, can be created
by running:

```shell
$ post runfile
```

## Input
The following is a list of available input types and their descriptions
along with their run file configuration stubs:

- [`archive`](#archive)
- [`csv`](#csv)
- [`dat`](#dat)
- [`multiple`](#multiple)
- [`ram`](#ram)
- [`time-series`](#time-series)

---
#### `archive`
`archive` reads input from an archive. The archive format is inferred from
the file name extension. The following archive formats are supported:
`TAR`, `TAR-GZ`, `TAR-BZIP2`, `TAR-XZ`, `ZIP`. Note that `archive` input wraps
one or more input types, i.e., the `archive` configuration only specifies
how to read _some data_ from an archive, the wrapped input type reads the
actual data.

Another important note is that the contents of the archive are stored into
memory the first time it is read, so if the same archive is used multiple
times as an input source, it will be read from disk only once, each subsequent
read will read directly from RAM. Hence it is beneficial to use the `archive`
input type when the data consists of a large amount of input files,
e.g., a large `time-series`.

> Warning: it is currently faster to read regularly from the filesystem than
> using the `archive` on most machines due to a poorly optimized implementation
> of `archive`, so use with caution.

The `clear_after_read` flag can be used to clear *all* `archive` memory
after reading the data.

```yaml
type: archive
type_spec:
  file: # file path of the archive
  clear_after_read: # clear memory after reading; 'false' by default
  format_spec: # input type configuration, e.g., a CSV input type
```
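
For example, a sketch of an input that reads a CSV file out of a gzipped TAR
archive is shown below; the file names are illustrative, and the nested
`format_spec` is assumed to take the same `type`/`type_spec` form as a regular
input section.

```yaml
type: archive
type_spec:
  file: 'results.tar.gz'        # illustrative archive path
  clear_after_read: true
  format_spec:                  # wrapped input type; reads from inside the archive
    type: csv
    type_spec:
      file: 'data/output.csv'   # illustrative path within the archive
```
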
#### `csv`
`csv` reads from a CSV formatted file. If the file contains a header line
the `header` field should be set to `true` and the header column names will
be used as the field names for the data. If no header line is present the
`header` field must be set to `false`.

```yaml
type: csv
type_spec:
  file: # file path of the CSV file
  header: # determines if the CSV file has a header; default 'true'
  comment: # character to denote comments; default '#'
  delimiter: # character to use as the field delimiter; default ','
```

#### `dat`
`dat` reads from a white-space-separated-value file. The type and amount of
white space between columns is irrelevant, as are leading and trailing white
spaces, as long as the number of columns (non-white space fields) is
consistent in each row.

```yaml
type: dat
type_spec:
  file: # file path of the DAT file
```

#### `multiple`
`multiple` is a wrapper for multiple input types. Data is read from
each input type specified and once all inputs have been read, the data from
each input is merged into a single data instance containing all fields
(columns) from all inputs. The number and type of input types specified is
arbitrary, but each input must yield data with the same number of rows.

```yaml
type: multiple
type_spec:
  format_specs: # a list of input type configurations
```
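
As a sketch, merging a CSV file and a DAT file with equal row counts might look
as follows; the file names are illustrative, and each `format_specs` entry is
assumed to follow the same `type`/`type_spec` layout as a regular input section.

```yaml
type: multiple
type_spec:
  format_specs:
    - type: csv
      type_spec:
        file: 'coordinates.csv'   # illustrative
    - type: dat
      type_spec:
        file: 'values.dat'        # illustrative; must have the same number of rows
```
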
#### `ram`
`ram` reads data from an in-memory store. For the data to be read it must
have been stored previously, e.g., a previous `output` section defines a `ram`
output.

The `clear_after_read` flag can be used to clear *all* `ram` memory
after reading the data.

```yaml
type: ram
type_spec:
  name: # key under which the data is stored
  clear_after_read: # clear memory after reading; 'false' by default
```

#### `time-series`
`time-series` reads data from a time-series of structured data files in
the following format:

```
.
├── 0.0
│   ├── data_0.csv
│   ├── data_1.dat
│   └── ...
├── 0.1
│   ├── data_0.csv
│   ├── data_1.dat
│   └── ...
└── ...
```

where each `data_*.*` file contains the data in some format at the moment in
time specified by the directory name.
Each series dataset must be output into a different file, i.e., the
`data_0.csv` files contain one dataset, `data_1.dat` another one, and so on.```yaml
type: time-series
type_spec:
  file: # file name (base only) of the time-series data files
  directory: # path to the root directory of the time-series
  time_name: # the time field name; default is 'time'
  format_spec: # input type configuration, e.g., a CSV input type
```
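
For instance, reading the `data_0.csv` dataset from the tree above could be
sketched as follows; the directory name is illustrative, and the nested
`format_spec` is assumed to describe only how each per-time file is parsed,
with the files themselves located through the outer `file` and `directory`
fields.

```yaml
type: time-series
type_spec:
  file: 'data_0.csv'            # base name of the per-time data files
  directory: 'postProcessing'   # illustrative root of the time directories
  time_name: 'time'
  format_spec:
    type: csv
    type_spec:
      header: true              # per-file parsing options (assumed)
```
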
## Processing
The following is a list of available processor types and their descriptions
along with their run file configuration stubs:

- [`assert-equal`](#assert-equal)
- [`average-cycle`](#average-cycle)
- [`bin`](#bin)
- [`expression`](#expression)
- [`filter`](#filter)
- [`regexp-rename`](#regexp-rename)
- [`rename`](#rename)
- [`resample`](#resample)
- [`select`](#select)
- [`sort`](#sort)

---
#### `assert-equal`
`assert-equal` asserts that all `fields` are equal element-wise,
up to `precision`. All field types must be the same.
If all fields are equal then no error is returned, otherwise
a non-nil error is returned, i.e., the program will stop execution.
The data remains unchanged in either case.

```yaml
type: assert-equal
type_spec:
  fields: # field names for which to assert equality
  precision: # optional; machine precision by default
```
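
A minimal sketch, with illustrative field names, that checks two redundant
probe fields agree to within 1.0e-06:

```yaml
type: assert-equal
type_spec:
  fields: [pressure_probe_0, pressure_probe_1]
  precision: 1.0e-06
```
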
#### `average-cycle`
`average-cycle` mutates the data by computing the ensemble average of a cycle
for all numeric fields. The ensemble average is computed as:

```
Φ(ωt) = 1/N Σ ϕ[ω(t + jT)],  j = 0...N-1
```

where `ϕ` is the slice of values to be averaged, `ω` the angular velocity,
`t` the time and `T` the period.

The resulting data will contain the cycle average of all numeric fields and a
time field (named `time`), containing times for each row of cycle average
data, in the range (0, T]. The time field will be the last field (column),
while the order of the other fields is preserved.

Time matching can be optionally specified, as well as the match precision,
by setting `time_field` and `time_precision` respectively in the configuration.
This checks whether the time (step) is uniform and whether there is a
mismatch between the expected time of the averaged value, as per the number
of cycles defined in the configuration and the supplied data, and the read time.
The read time is the one read from the field named `time_field`.
Note that in this case the output time field will be named after `time_field`,
i.e., the time field name will remain unchanged.

> Warning: It is assumed that data is sorted chronologically, i.e.,
> by ascending time, even if `time_field` is not specified or does not exist.

```yaml
type: average-cycle
type_spec:
  n_cycles: # number of cycles to average over
  time_field: # time field name; optional
  time_precision: # time-matching precision; optional
```
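
For example, averaging over 20 recorded cycles while matching times read from
an (illustrative) `time` field could look like:

```yaml
type: average-cycle
type_spec:
  n_cycles: 20
  time_field: 'time'
  time_precision: 1.0e-06
```
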
#### `bin`
`bin` mutates the data by dividing all numeric fields into `n_bins`
and setting the field values to bin-mean-values.

> Warning: Each bin _must_ contain the same number of field values,
> i.e., `len(field) % n_bins == 0`.
> This might change in the future.

```yaml
type: bin
type_spec:
  n_bins: # number of bins into which the data is divided
```
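
A minimal sketch, dividing each numeric field into 100 bins (the field length
must then be divisible by 100):

```yaml
type: bin
type_spec:
  n_bins: 100
```
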
#### `expression`
`expression` evaluates an arithmetic expression and appends the resulting
field (column) to the data. The expression operands can be scalar values or
fields (columns) present in the data, which are referenced by their names.
Note that at least one of the operands must be a field present in the data.

Each operation involving a field is applied element-wise. The following
arithmetic operations are supported: `+` `-` `*` `/` `**`

```yaml
type: expression
type_spec:
  expression: # an arithmetic expression
  result: # field name of the resulting field
```
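
For instance, converting an (illustrative) temperature field `T` from kelvin
to degrees Celsius, i.e., combining a field operand with a scalar operand:

```yaml
type: expression
type_spec:
  expression: 'T - 273.15'   # 'T' is a field, 273.15 a scalar
  result: 'T_celsius'
```
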
#### `filter`
`filter` mutates the data by applying a set of row filters as defined
in the configuration. The filter behaviour is described by providing
the field name `field` to which the filter is applied, the comparison
operator `op` and a comparison value `value`. Rows satisfying the comparison
are kept, while others are discarded. The following comparison operators
are supported: `==` `!=` `>` `>=` `<` `<=`

All defined filters are applied at the same time. The way in which they
are aggregated is controlled by setting the `aggregation` field in
the configuration, `and` and `or` aggregation modes are available.
The `or` mode is the default if the `aggregation` field is unset.

```yaml
type: filter
type_spec:
  aggregation: # aggregation mode; defaults to 'or'
  filters:
    - field: # field name to which the filter is applied
      op: # filtering operation
      value: # comparison value
```
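
For example, keeping only rows where an (illustrative) `time` field lies
within [0.1, 0.5] by combining two filters with `and` aggregation:

```yaml
type: filter
type_spec:
  aggregation: and
  filters:
    - field: 'time'
      op: '>='
      value: 0.1
    - field: 'time'
      op: '<='
      value: 0.5
```
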
#### `regexp-rename`
`regexp-rename` mutates the data by replacing field names which
match the regular expression `src` with `repl`.
See [https://golang.org/s/re2syntax](https://golang.org/s/re2syntax) for the
regexp syntax description.

```yaml
type: regexp-rename
type_spec:
  src: # regular expression to use in matching
  repl: # replacement string
```
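
A small sketch, collapsing runs of whitespace in field names into single
underscores; the pattern and replacement are illustrative:

```yaml
type: regexp-rename
type_spec:
  src: '\s+'
  repl: '_'
```
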
#### `rename`
`rename` mutates the data by renaming fields (columns).
```yaml
type: rename
type_spec:
  fields: # map of old-to-new name key-value pairs
```
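
For example, renaming two (illustrative) auto-generated field names to
shorter ones:

```yaml
type: rename
type_spec:
  fields:
    'average(p)': 'p_avg'
    'average(U)': 'U_avg'
```
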
#### `resample`
`resample` mutates the data by linearly interpolating all numeric fields,
such that the resulting fields have `n_points` values, at uniformly
distributed values of the field `x_field`.
If `x_field` is not set, a uniform resampling is performed, i.e., as if
the values of each field were given at a uniformly distributed x,
where x ∈ [0,1].
The first and last values of a field are preserved in the resampled field.

```yaml
type: resample
type_spec:
  n_points: # number of resampling points
  x_field: # field name of the independent variable; optional
```
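
For example, resampling all numeric fields onto 1000 uniformly spaced values
of an (illustrative) `time` field:

```yaml
type: resample
type_spec:
  n_points: 1000
  x_field: 'time'
```
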
#### `select`
`select` mutates the data by keeping or removing `fields` (columns).
If `remove` is true, the fields are removed, otherwise only the selected
fields are kept in the order specified.

```yaml
type: select
type_spec:
  fields: # a list of field names
  remove: # remove/keep selected fields; 'false' by default
```
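
For example, keeping only two (illustrative) fields, in the given order:

```yaml
type: select
type_spec:
  fields: [time, pressure]
  remove: false
```
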
#### `sort`
`sort` sorts the data by `field`, in descending order if `descending == true`
and in ascending order otherwise. The processor takes a list of fields and
orderings and applies them in sequence. The order in which the fields are
listed defines the sorting precedence, so constraints listed later may not
be fully satisfied.

```yaml
type: sort
type_spec:
  - field: # field by which to sort
    descending: # sort in descending order; 'false' by default
  - field:
    descending:
```
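
For example, sorting by an (illustrative) `time` field in ascending order,
then by `pressure` in descending order where times are equal:

```yaml
type: sort
type_spec:
  - field: 'time'
  - field: 'pressure'
    descending: true
```
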
## Output
The following is a list of available output types and their descriptions
along with their run file configuration stubs.

#### `csv`
`csv` writes CSV formatted data to a file. If `header` is set to `true`
the file will contain a header line with the field names as the column names.
Note that, if necessary, directories will be created so as to ensure that
`file` specifies a valid path.

```yaml
type: csv
type_spec:
  file: # file path of the CSV file
  header: # determines if the CSV file has a header; default 'true'
  comment: # character to denote comments; default '#'
  delimiter: # character to use as the field delimiter; default ','
```

#### `ram`
`ram` stores data in an in-memory store. Once data is stored, any subsequent
`ram` input type can access the data.

```yaml
type: ram
type_spec:
  name: # key under which the data is stored
```
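
For example, one pipeline can stash its data under a key which a later
pipeline then reads back; the key, file names and pipeline IDs below are
illustrative:

```yaml
- id: stash
  input:
    type: csv
    type_spec:
      file: 'data.csv'
  output:
    - type: ram
      type_spec:
        name: 'cleaned'

- id: reuse
  input:
    type: ram
    type_spec:
      name: 'cleaned'
  output:
    - type: csv
      type_spec:
        file: 'output/reused.csv'
```
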
## Graphing
Only TeX graphing, via `tikz` and `pgfplots`, is supported currently. Hence
for the graph generation to work, TeX needs to be installed along with any
dependent packages.

Graphing consists of two steps: generating TeX graph files from templates, and
generating the graphs from TeX files. To see the default template files run:

```shell
$ post graphfile --outdir=templates
```

The templates can be user supplied by setting `template_directory` and
`template_main` (if necessary) in the run file configuration. The templates
use [Go][golang] template syntax, see the [package documentation][godoc-text-template]
for more information.

A `tex` graph configuration stub is given below; note that several fields
expect raw TeX as input.

```yaml
type: tex
graphs:
  - name: # used as a basename for all graph related files
    directory: # optional; output directory name, created if not present
    table_file: # optional; needed if 'tables.table_file' is undefined
    template_directory: # optional; template directory
    template_main: # optional; root template file name
    template_delims: # optional; go template delimiters; ['__{','}__'] by default
    tex_command: # optional; 'pdflatex' by default
    axes:
      - x:
          min:
          max:
          label: # raw TeX
        y:
          min:
          max:
          label: # raw TeX
        width: # optional; raw TeX, axis width option
        height: # optional; raw TeX, axis height option
        legend_style: # optional; raw TeX, axis legend style option
        raw_options: # optional; raw TeX, if defined all other options are ignored
        tables:
          - x_field:
            y_field:
            legend_entry: # raw TeX
            col_sep: # optional; 'comma' by default
            table_file: # optional; needed if 'graphs.table_file' is undefined
```

## Templates
Templates reduce boilerplate when it is necessary to process different sources
of data but use the same processing pipeline.

For example, consider the case when we would like to extract data at specific
times from some time series. The run file would look something like this:

```yaml
- input: # extract data at t = 0.1
    type: dat
    fields: [time, value]
    type_spec:
      file: 'data.dat'
  process:
    - type: filter
      type_spec:
        filters:
          - field: 'time'
            op: '=='
            value: 0.1
  output:
    - type: csv
      type_spec:
        file: 'output/data_0.1.csv'

- input: # extract data at t = 0.2
  ...

- input: # extract data at t = 0.3
  ...
```

A new pipeline has to be defined for each time we would like to extract since
the `filter` uses a different time value and the extracted data is written
to a different file each time. This is both cumbersome and error prone.
So we use a `template` to simplify this:

```yaml
- template:
    params:
      t: [0.1, 0.2, 0.3]
    src: |
      - input:
          type: dat
          fields: [time, value]
          type_spec:
            file: 'data.dat'
        process:
          - type: filter
            type_spec:
              filters:
                - field: 'time'
                  op: '=='
                  value: {{ .t }}
        output:
          - type: csv
            type_spec:
              file: 'output/data_{{ .t }}.csv'
```

Now we only have to define the pipeline once and, in this case,
parametrize it by time.

A `template` consists of the following fields:
```yaml
- template:
    params: # a map of parameters used in the template
    src: # YAML formatted string for the pipeline to template
```

For a `template` definition to be valid, the following must be true:
- the `template` must be defined as part of a sequence (`!!seq`)
- the definition can contain only one mapping,
which must have the tag `template`, i.e., it cannot be defined as part
of a pipeline

The `params` field is a map of parameters and their values. The values can
be of any type, including a mapping, but must be given as a list, even if
only one value is given. Here are some examples:

```yaml
- template:
    params:
      o: [0] # a single integer, but must be a list
      p: [0, 1, 2] # a list of integers
      q: ['ab', 'cd'] # a list of strings
      r: # a list of maps
        - tag: a
          val: 0
        - tag: b
          val: 1
```

The `src` field is a string containing the pipeline template, using
[Go template syntax][godoc-text-template], i.e., the string within `src` is
expanded directly into the run file using parameter values defined in `params`.

If multiple parameters are defined, the `template` is executed for all
combinations of parameters. For example, the following `template`:

```yaml
- template:
    params:
      typ: [dat, csv]
      ind: [0, 1]
    src: |
      input:
        type: ram
        type_spec:
          name: 'data_{{ .ind }}_{{ .typ }}'
      output:
        - type: {{ .typ }}
          type_spec:
            file: 'data_{{ .ind }}.{{ .typ }}'
```

will generate 4 files: `data_0.dat`, `data_1.dat`, `data_0.csv` and `data_1.csv`,
although not necessarily in that order since the execution order of
multi-parameter templates is undefined, and so shouldn't be relied upon.

> Warning: YAML aliases currently *cannot* be used within the `src` field.
> This might change in the future.

See the [examples/](examples) directory for more usage examples.

[godoc-text-template]: https://pkg.go.dev/text/template
[golang]: https://go.dev
[latex]: https://www.latex-project.org/
[openfoam]: https://www.openfoam.com
[post-release]: https://github.com/Milover/post/releases