https://github.com/ydb-platform/ydb-parallel-processor

YDB Parallel Record Processor
https://github.com/ydb-platform/ydb-parallel-processor

batch-processing ydb

Host: GitHub
URL: https://github.com/ydb-platform/ydb-parallel-processor
Owner: ydb-platform
License: apache-2.0
Created: 2025-05-20T14:05:59.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-09-27T11:57:26.000Z (10 months ago)
Last Synced: 2025-09-27T13:28:20.193Z (10 months ago)
Topics: batch-processing, ydb
Language: Java
Homepage: https://ydb.tech
Size: 121 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# YDB parallel records batch processor

[See the Releases page for downloads](https://github.com/zinal/ydb-parallel-processor/releases).

This tool provides basic capabilities for parallel job processing of records stored in [YDB](https://ydb.tech) tables.
It can be used as a standalone program, or embedded as a library into a user application, which must run on a Java-compatible stack.

The basic workflow implemented by the tool consists of two phases:

1. The tool collects the record keys using the initial query, which applies the basic filtering. Output paging can optionally be used, as [described in the YDB documentation](https://ydb.tech/docs/en/dev/paging). Paging requires an additional "paging" query to be configured.
2. The tool uses the collected record keys to apply additional processing, with the following options:
    * Table joins and additional computations may be applied to fetch and enrich the records, and to output the full records in CSV or JSON format.
    * Data records can be updated, for example by setting some fields in existing records or by inserting new records.

Phase 2 runs on a configurable executor pool, so record processing is parallelized for better performance. As a consequence, the output records generally cannot be sorted, because the output of different parallel jobs arrives in an undefined order.
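The two-phase flow can be illustrated with a minimal, self-contained sketch. It does not use the tool's classes: `TwoPhaseSketch`, `toBatches` and `processInParallel` are hypothetical names, the key list stands in for the main query result, and the `"record-" + key` string stands in for the details query. It only shows why batches handled by a thread pool produce output in no particular order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class TwoPhaseSketch {

    /** Splits the collected keys into batches of at most batchLimit items. */
    static <T> List<List<T>> toBatches(List<T> keys, int batchLimit) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchLimit) {
            int end = Math.min(i + batchLimit, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    /** Phase 2 stand-in: each batch is handled by a pool worker. */
    static List<String> processInParallel(List<Integer> keys, int workerCount, int batchLimit) {
        ConcurrentLinkedQueue<String> output = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (List<Integer> batch : toBatches(keys, batchLimit)) {
            pool.submit(() -> {
                for (Integer key : batch) {
                    output.add("record-" + key); // placeholder for the details query
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return new ArrayList<>(output);
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            keys.add(i); // placeholder for the keys returned by the main query
        }
        List<String> out = processInParallel(keys, 3, 4);
        // All records are present, but their order may differ between runs.
        System.out.println(out.size() + " records: " + out);
    }
}
```

If sorted output is needed, it has to be produced by a separate step after the job completes.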

## Requirements

- Java SDK 17 or later (for running and building)
- Apache Maven (for building)

## Running the tool as a standalone program

```bash
./Run.sh connection.xml jobdef.xml
# ... or, with a substitution variables file:
./Run.sh connection.xml jobdef.xml vars.xml
```

* The first parameter should point to the file with the connection parameters.
* The second parameter should point to the file with the job definition.
* The optional third parameter, if present, should point to the file with the substitution variables.

For information about the substitution variables, see the "Substitution variables" section later in this file (`README.md`), or the corresponding section of `README-ru.md`.

## Embedding the tool into the user program

Embedding the tool into the user program can be performed using the following Maven dependency:

```xml
<dependency>
    <groupId>tech.ydb.app</groupId>
    <artifactId>ydb-parallel-processor</artifactId>
    <version>1.3</version>
</dependency>
```

> [!WARNING]
> Currently the tool artifacts are not published to Maven Central.
> To use the dependency shown above, a local build is necessary.

Using the class `tech.ydb.app.parproc.Tool`, the following can be implemented:

```java
JobDef job = new JobDef();
job.setMainQuery("SELECT ...");
job.setDetailsQuery("SELECT ...");
job.getDetailsInput().add("id");
// ...
Properties propsConn = new Properties();
propsConn.setProperty("ydb.url", "grpcs://ydb01.localdomain:2135/cluster1/testdb");
// ...
// Both the connector and the tool are closed automatically on exit.
try (YdbConnector yc = new YdbConnector(propsConn)) {
    try (Tool app = new Tool(yc, job)) {
        app.run();
    }
}
```

## Connection parameters

Connection parameters are provided either programmatically (as a `java.util.Properties` object) or via an XML properties file.

| **Parameter** | **Description** |
| --- | --- |
| `ydb.url` | YDB connection URL |
| `ydb.cafile` | Path to the custom CA certificate file |
| `ydb.auth.mode` | [Authentication mode](https://ydb.tech/docs/en/reference/ydb-sdk/auth#auth-provider) (NONE, ENV, STATIC, METADATA, SAKEY) |
| `ydb.auth.username` | Username for STATIC authentication |
| `ydb.auth.password` | Password for STATIC authentication |
| `ydb.auth.sakey` | Service account key file name for SAKEY authentication |
| `ydb.preferLocalDc` | Prefer local or nearest datacenter for connections (true or false, default false) |

Example properties file:

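```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="ydb.url">grpcs://ydb01.localdomain:2135/cluster1/testdb</entry>
    <entry key="ydb.cafile">/home/ydbadmin/Work/cluster1/tls/ca.crt</entry>
    <entry key="ydb.auth.mode">STATIC</entry>
    <entry key="ydb.auth.username">root</entry>
    <entry key="ydb.auth.password">???</entry>
    <entry key="...">1000</entry>
</properties>
```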
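As an illustration of how such a file can be read, the sketch below parses connection parameters with the standard `Properties.loadFromXML` method. It assumes the file uses the standard `java.util.Properties` XML layout; the class name `ConnectionPropsSketch` and the inline sample document are hypothetical, and how the tool itself loads the file is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class ConnectionPropsSketch {

    // Hypothetical sample in the standard java.util.Properties XML layout.
    static final String SAMPLE = """
            <?xml version="1.0" encoding="UTF-8"?>
            <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
            <properties>
                <entry key="ydb.url">grpcs://ydb01.localdomain:2135/cluster1/testdb</entry>
                <entry key="ydb.auth.mode">STATIC</entry>
                <entry key="ydb.auth.username">root</entry>
            </properties>
            """;

    /** Parses an XML properties document held in a string. */
    static Properties parse(String xml) {
        Properties props = new Properties();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (ByteArrayInputStream in = new ByteArrayInputStream(bytes)) {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println("URL:  " + props.getProperty("ydb.url"));
        System.out.println("Auth: " + props.getProperty("ydb.auth.mode"));
        // ydb.preferLocalDc is absent from the sample, so the documented default applies.
        System.out.println("Prefer local DC: " + props.getProperty("ydb.preferLocalDc", "false"));
    }
}
```

The resulting `Properties` object has the same shape as the one passed to `YdbConnector` in the embedding example.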

## Processing parameters

Processing parameters are provided either programmatically (as a `tech.ydb.app.parproc.JobDef` object) or via an XML configuration file.

| **Parameter** | **Description** |
| --- | --- |
| `worker-count` | Number of parallel workers |
| `queue-size` | Size of the queue between the main query and the detail query |
| `batch-limit` | Maximum number of keys in a batch for details query |
| `output-format` | Output format (CSV, TSV, JSON, CUSTOM1 - see below for details). Default is CSV if omitted |
| `output-file` | Output file name, with value '-' for STDOUT. Default is '-' if omitted |
| `isolation` | Transaction isolation level (SERIALIZABLE_RW, SNAPSHOT_RO, STALE_RO, ONLINE_RO, ONLINE_INCONSISTENT_RO). Default is SERIALIZABLE_RW if omitted |
| `timeout` | Query timeout, in milliseconds, for `query-main`, `query-page` or `query-details`. Default is -1 (unlimited) |
| `query-main` | Main query (executed first) |
| `query-page` | Paging query (optional). Requires key sorting and row count limit both for itself and for the main query. |
| `query-details` | Detail query. Takes the keys from main and page queries, and applies extra logic. |
| `input-page` | List of input columns for the page query (subset of columns from main and page queries). |
| `input-details` | List of input columns for the detail query (subset of columns from main and page queries), optional |
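The interplay between `query-main` and `query-page` follows the keyset paging pattern: every page is sorted by key and limited, and the last key of one page becomes the input of the next. The sketch below imitates this over an in-memory list. All names (`KeysetPagingSketch`, `Key`, `mainQuery`, `pageQuery`, `collectAll`) are hypothetical, and stopping on the first empty page is an assumption rather than documented tool behavior.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class KeysetPagingSketch {

    /** A composite key, similar to (sys_update_tv, id) in the examples. */
    record Key(long ts, String id) {}

    static final Comparator<Key> ORDER =
            Comparator.comparingLong(Key::ts).thenComparing(Key::id);

    /** Stand-in for query-main: the first page, sorted by key and limited. */
    static List<Key> mainQuery(List<Key> table, int limit) {
        return table.stream().sorted(ORDER).limit(limit).collect(Collectors.toList());
    }

    /** Stand-in for query-page: rows strictly after the last key, sorted and limited. */
    static List<Key> pageQuery(List<Key> table, Key last, int limit) {
        return table.stream()
                .filter(k -> ORDER.compare(k, last) > 0)
                .sorted(ORDER)
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Runs the main query once, then the page query until a page comes back empty. */
    static List<Key> collectAll(List<Key> table, int limit) {
        List<Key> all = new ArrayList<>();
        List<Key> page = mainQuery(table, limit);
        while (!page.isEmpty()) {
            all.addAll(page);
            Key last = page.get(page.size() - 1); // becomes the input of the next page
            page = pageQuery(table, last, limit);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Key> table = List.of(
                new Key(2L, "b"), new Key(1L, "z"), new Key(2L, "a"),
                new Key(3L, "c"), new Key(1L, "a"));
        System.out.println(collectAll(table, 2));
    }
}
```

Without the sort and the limit on both queries, the "last key" of a page would not be well defined, which is why both are mandatory when paging is used.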

In the tags `query-main`, `query-page` and `query-details` an optional `timeout` attribute can also be specified, setting the maximum execution time of the query in milliseconds. If the specified time is exceeded, the query execution is aborted and retried. This timeout mechanism helps protect against performance degradation caused by rare slowdowns of query execution.
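The abort-and-retry behavior can be pictured with a small sketch built on `Future.get` with a timeout. It is not the tool's implementation: `TimeoutRetrySketch` and `callWithTimeout` are hypothetical names, and the `maxAttempts` cap is an assumption added so the example always terminates. A negative timeout is treated as unlimited, mirroring the documented default of -1.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class TimeoutRetrySketch {

    /** Runs the query; on timeout the attempt is cancelled and a new one is started. */
    static <T> T callWithTimeout(Callable<T> query, long timeoutMs, int maxAttempts) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(query);
                try {
                    return timeoutMs < 0
                            ? future.get()
                            : future.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    future.cancel(true); // abort the slow attempt, then retry
                } catch (ExecutionException e) {
                    throw new IllegalStateException("query failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted", e);
                }
            }
            throw new IllegalStateException("no result after " + maxAttempts + " attempts");
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        String result = callWithTimeout(() -> {
            if (calls.incrementAndGet() == 1) {
                Thread.sleep(2_000); // simulate a rare slow execution
            }
            return "ok";
        }, 200, 3);
        System.out.println(result + " after " + calls.get() + " attempt(s)");
    }
}
```

Retrying is only safe when the query can be repeated without side effects accumulating, which is worth keeping in mind for update-style detail queries.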

Supported output formats:
- `CSV` - regular CSV according to RFC4180: comma-delimited, with double quotes and CR-LF line separators
- `TSV` - tab-delimited format, with double quotes and CR-LF line separators
- `JSON` - a separate JSON document per line, CR delimited
- `CUSTOM1` - CSV-style format with 0x19 as field delimiter, LF (0x0A) as record delimiter and minimized (mostly no) quotes.
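For reference, the RFC 4180 quoting rules behind the `CSV` format can be sketched as follows. The class `CsvRowSketch` is hypothetical, and quoting only the fields that need it is an assumption; the tool's own writer may quote more aggressively.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvRowSketch {

    /** Quotes a field only if it contains a comma, a double quote, CR or LF. */
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\r") || field.contains("\n");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are escaped by doubling them.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the fields with commas and terminates the record with CR-LF. */
    static String row(List<String> fields) {
        return fields.stream().map(CsvRowSketch::quote).collect(Collectors.joining(",")) + "\r\n";
    }

    public static void main(String[] args) {
        System.out.print(row(List.of("1", "a,b", "say \"hi\"")));
    }
}
```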

Example configuration file:

```xml

5
100
1000
CSV
example1.csv
SNAPSHOT_RO
= Timestamp('2021-01-01T00:00:00Z')
AND sys_update_tv < Timestamp('2021-01-02T00:00:00Z')
ORDER BY sys_update_tv, id -- Mandatory sorting on primary key or secondary index
LIMIT 1000; -- Mandatory limit on the number of output records
]]>

sys_update_tv
id

;
SELECT sys_update_tv, id
FROM my_documents VIEW ix_sys_update_tv
WHERE (sys_update_tv, id) > ($input.sys_update_tv, $input.id) -- Paging condition
AND sys_update_tv < Timestamp('2021-01-02T00:00:00Z') -- Repeat the filter from the main query
ORDER BY sys_update_tv, id -- Mandatory sorting on primary key or secondary index
LIMIT 1000; -- Mandatory limit on the number of output records
]]>

id

>;
SELECT
documents.*,
d1.attr1 AS d1_attr1,
d2.attr1 AS d2_attr1
FROM AS_TABLE($input) AS input
INNER JOIN my_documents VIEW PRIMARY KEY AS documents
ON input.id=documents.id
LEFT JOIN my_dict1 AS d1
ON d1.key=documents.dict1
LEFT JOIN my_dict2 AS d2
ON d2.key=documents.dict2
WHERE documents.some_state IN ('ONE'u, 'TWO'u, 'THREE'u, 'FOUR'u)
AND documents.input_tv IS NOT NULL;
]]>

```

## Substitution variables

The tool can optionally apply variable substitutions to the job definition XML file.

Substitutions are applied after parsing the XML content (so the file must be well-formed XML before substitutions), but before converting it to the job definition.

The variables can be specified as `${varname}` in attribute values, text values and CDATA section values, but not in the tag names or attribute names.

Example properties file containing some substitution variables:

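```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="file_name">output_with_some_suffix.csv</entry>
    <entry key="worker_count">10</entry>
    <entry key="start_time">2020-01-01T00:00:00Z</entry>
    <entry key="finish_time">2026-01-01T00:00:00Z</entry>
</properties>
```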
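The `${varname}` replacement itself can be sketched with a regular expression. This is an illustration only: `SubstitutionSketch` and `substitute` are hypothetical names, the set of characters allowed in a variable name is an assumption, and leaving unknown variables untouched is a choice made for the sketch, since the tool's behavior in that case is not described here.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubstitutionSketch {

    // Matches ${name}; the allowed name characters are an assumption.
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z0-9_.-]+)\\}");

    /** Replaces every ${name} with its value; unknown names are left as they are. */
    static String substitute(String text, Map<String, String> vars) {
        Matcher m = VAR.matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("start_time", "2020-01-01T00:00:00Z");
        String query = "WHERE sys_update_tv >= Timestamp('${start_time}') AND id = $input.id";
        System.out.println(substitute(query, vars));
    }
}
```

Note that YQL parameters such as `$input` have no braces, so a pattern of this shape does not touch them.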

## Usage examples

The YDB parallel processor tool can be used for various data processing tasks. Below are examples for common use cases.

### 1. Filling a new column with computed data

This example shows how to fill a new column in a table with data computed from other columns using SQL queries.

Suppose that a new column called `new_field` has been added to the table `some_table` using the following statement:

```SQL
ALTER TABLE some_table ADD COLUMN new_field Text;
```

The `new_field` column contains just nulls, and there is a requirement that it should be filled with some data computed from the other fields (and possibly by requesting the data from other tables as well).

**Job definition file (`fill-column.xml`):**
```xml

10
100
1000
TSV
-
SERIALIZABLE_RW

>;
UPSERT INTO some_table
SELECT i.id AS id,
t.old_field || ' 'u || a.some_name AS new_field
FROM AS_TABLE($input) AS i
JOIN some_table AS t
ON t.id = i.id
LEFT JOIN another_table AS a
ON a.id = t.ref_a;
]]>

```

**Execution:**
```bash
./Run.sh connection.xml fill-column.xml
```

### 2. Archiving old records to another table

This example demonstrates archiving older records to an archive table and then deleting them from the original table.

Suppose that documents older than 3 months have to be removed from the original table `documents` every month.
The old documents should be copied into the `documents_archive` table and then deleted from the original `documents` table.

**Job definition file (`archive-records.xml`):**
```xml

5
100
1000
TSV
-
SERIALIZABLE_RW

-- First, insert into archive table
UPSERT INTO documents_archive
SELECT d.*
FROM AS_TABLE($input) AS i
JOIN documents AS d ON d.id = i.id;

-- Then, delete from original table
DELETE FROM documents
ON SELECT * FROM AS_TABLE($input);
]]>

```

**Execution:**
```bash
./Run.sh connection.xml archive-records.xml
```

### 3. Extracting large amounts of data to CSV files

This example shows how to extract large datasets to CSV files with parallel processing and paging for better performance.

Suppose that there is a huge amount of data stored in a normalized table structure, and the BI system needs de-normalized data to fill the data mart. Every day a job is run to extract the last two days' data into a CSV file in the form of a wide table, containing all the attributes required to fill the data mart. The file is then loaded into the BI system using its native tools.

Job parameters are put into a separate substitution variables file, which is regenerated before every job run.

**Job substitution variable file (`export-to-csv_params.xml`):**
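```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="file_name">exported_data.csv</entry>
    <entry key="worker_count">10</entry>
    <entry key="start_time">2020-01-01T00:00:00Z</entry>
    <entry key="finish_time">2026-01-01T00:00:00Z</entry>
</properties>
```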

**Job definition file (`export-to-csv.xml`):**
```xml

${worker_count}
100
1000
CSV
${file_name}
UTF-8
SNAPSHOT_RO

= Timestamp('${start_time}')
AND sys_update_tv < Timestamp('${finish_time}')
ORDER BY sys_update_tv, id
LIMIT 1000;
]]>

sys_update_tv
id

;
SELECT sys_update_tv, id
FROM my_documents VIEW ix_sys_update_tv
WHERE (sys_update_tv, id) > ($input.sys_update_tv, $input.id)
AND sys_update_tv < Timestamp('${finish_time}')
ORDER BY sys_update_tv, id
LIMIT 1000;
]]>

id

>;
SELECT
documents.*,
d1.attr1 AS d1_attr1,
d2.attr1 AS d2_attr1
FROM AS_TABLE($input) AS input
INNER JOIN my_documents VIEW PRIMARY KEY AS documents
ON input.id=documents.id
LEFT JOIN my_dict1 AS d1
ON d1.key=documents.dict1
LEFT JOIN my_dict2 AS d2
ON d2.key=documents.dict2
WHERE documents.some_state IN ('ONE'u, 'TWO'u, 'THREE'u, 'FOUR'u)
AND documents.input_tv IS NOT NULL;
]]>

```

**Execution:**
```bash
./Run.sh connection.xml export-to-csv.xml export-to-csv_params.xml
```
