https://github.com/smithclay/otlp2records
shared library for converting OTLP data into records for parquet, flattened json, avro, clickhouse, etc
- Host: GitHub
- URL: https://github.com/smithclay/otlp2records
- Owner: smithclay
- License: apache-2.0
- Created: 2026-01-11T19:26:53.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-05-27T01:17:44.000Z (21 days ago)
- Last Synced: 2026-05-27T01:20:53.902Z (21 days ago)
- Topics: opentelemetry, parquet
- Language: Rust
- Homepage:
- Size: 919 KB
- Stars: 4
- Watchers: 0
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# otlp2records
[crates.io](https://crates.io/crates/otlp2records)
Transform OTLP telemetry (logs, traces, metrics) into Arrow RecordBatches.
A high-performance, WASM-compatible library for converting OpenTelemetry Protocol (OTLP) data to Apache Arrow format for efficient storage and querying.
Currently consumed by [duckdb-otlp](https://github.com/smithclay/duckdb-otlp), [otlp2parquet](https://github.com/smithclay/otlp2parquet) and [otlp2pipeline](https://github.com/smithclay/otlp2pipeline).
## Design Principles
- **No I/O**: Core never touches network or filesystem
- **No async**: Pure synchronous transforms
- **WASM-first**: All dependencies compile to wasm32
- **Arrow-native**: RecordBatch is the canonical output format
## Features
- Transform OTLP logs, traces, and metrics to Arrow RecordBatches
- Support for both Protobuf and JSON input formats
- Output to NDJSON, Arrow IPC, or Parquet
- Direct OTLP-to-Arrow hot path for high-throughput ingestion
- JSON/JSONL support through OTLP request normalization into the same Arrow builders
## Installation
Add to your `Cargo.toml`:
```toml
[dependencies]
# Use exactly one of the lines below; a dependency key may appear only once per table.
otlp2records = "0.8"
# Optional: enable Parquet output
# otlp2records = { version = "0.8", features = ["parquet"] }
# Optional: enable WASM bindings
# otlp2records = { version = "0.8", features = ["wasm"] }
```
## Usage
### Rust API
#### High-level API (Recommended)
```rust
use otlp2records::{transform_logs, transform_traces, transform_metrics, InputFormat};
// Transform OTLP logs
let bytes: &[u8] = /* OTLP log data */;
let batch = transform_logs(bytes, InputFormat::Protobuf)?;
println!("Transformed {} log records", batch.num_rows());
// Transform OTLP traces
let batch = transform_traces(bytes, InputFormat::Json)?;
println!("Transformed {} spans", batch.num_rows());
// Transform OTLP metrics (returns separate batches by type)
let batches = transform_metrics(bytes, InputFormat::Protobuf)?;
if let Some(gauge) = batches.gauge {
    println!("Transformed {} gauge metrics", gauge.num_rows());
}
if let Some(sum) = batches.sum {
    println!("Transformed {} sum metrics", sum.num_rows());
}
```
#### Output Formats
```rust
use otlp2records::{transform_logs, to_json, to_ipc, InputFormat};
let batch = transform_logs(bytes, InputFormat::Protobuf)?;
// Output as NDJSON
let ndjson: Vec<u8> = to_json(&batch)?;
// Output as Arrow IPC (streaming format)
let ipc: Vec<u8> = to_ipc(&batch)?;
// Output as Parquet (requires "parquet" feature)
#[cfg(feature = "parquet")]
let parquet: Vec<u8> = otlp2records::to_parquet(&batch)?;
```
### WASM Usage
Build with the `wasm` feature for browser/Node.js environments:
```bash
cargo build --target wasm32-unknown-unknown --features wasm
```
```javascript
import init, { transform_logs_wasm } from './otlp2records.js';
await init();
// Transform OTLP logs (Uint8Array) to Arrow IPC
const otlpBytes = new Uint8Array(/* ... */);
const arrowIpc = transform_logs_wasm(otlpBytes, "protobuf");
```
## API Overview
### Input Formats
| Format | Description |
|--------|-------------|
| `InputFormat::Protobuf` | Standard OTLP protobuf encoding |
| `InputFormat::Json` | OTLP JSON encoding (camelCase field names) |
| `InputFormat::Jsonl` | Newline-delimited OTLP JSON envelopes |
| `InputFormat::Auto` | Auto-detect JSON vs protobuf with fallback decoding |
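
To make the idea behind `InputFormat::Auto` concrete, here is a minimal sketch of a JSON-versus-protobuf guess. It is not the crate's detection logic: the `Guess` enum, the `guess_format` function, and the heuristic itself (valid UTF-8 that starts with `{` and ends with `}`) are assumptions made for illustration.

```rust
/// Hypothetical result type for this sketch; the crate's own `InputFormat`
/// enum is not used here.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Guess {
    Json,
    Protobuf,
}

/// Guess whether a payload is OTLP JSON or protobuf.
/// Assumption: a JSON payload is valid UTF-8 whose trimmed text starts with
/// `{` and ends with `}`; everything else is treated as protobuf.
fn guess_format(bytes: &[u8]) -> Guess {
    match std::str::from_utf8(bytes) {
        Ok(text) => {
            let trimmed = text.trim();
            if trimmed.starts_with('{') && trimmed.ends_with('}') {
                Guess::Json
            } else {
                Guess::Protobuf
            }
        }
        // Bytes that are not valid UTF-8 cannot be JSON text.
        Err(_) => Guess::Protobuf,
    }
}
```

A heuristic like this can misclassify unusual payloads, which is presumably why the table mentions fallback decoding: a wrong guess should lead to trying the other decoder rather than to an error.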
### High-level Functions
| Function | Description |
|----------|-------------|
| `transform_logs(bytes, format)` | Transform OTLP logs to Arrow RecordBatch |
| `transform_traces(bytes, format)` | Transform OTLP traces to Arrow RecordBatch |
| `transform_metrics(bytes, format)` | Transform OTLP metrics to MetricBatches |
### Schema Output Selection
The default output is `SchemaOutput::Normalized`, the flattened ClickStack-compatible
schema used by the existing `transform_logs`, `transform_traces`, and
`transform_metrics` APIs. The aliases `"normalized"`, `"clickstack"`,
`"clickstack-mode"`, `""`, and `"default"` all parse to this default.
Rust callers can opt into `SchemaOutput::OtapStar` with explicit APIs:
| Function | Description |
|----------|-------------|
| `transform_logs_with_schema(bytes, format, schema_output)` | Transform logs to `LogsOutput::Normalized` or `LogsOutput::OtapStar` |
| `transform_traces_with_schema(bytes, format, schema_output)` | Transform traces to `TracesOutput::Normalized` or `TracesOutput::OtapStar` |
| `transform_metrics_with_schema(bytes, format, schema_output)` | Transform metrics to `MetricsOutput::Normalized` or `MetricsOutput::OtapStar` |
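
The schema aliases named in this section could be mapped to a schema choice roughly as follows. This is a sketch only: `SchemaChoice` and `parse_schema_choice` are hypothetical stand-ins for the crate's `SchemaOutput` parsing, and exact, case-sensitive matching is an assumption.

```rust
/// Hypothetical mirror of the crate's `SchemaOutput` enum.
#[derive(Debug, PartialEq, Clone, Copy)]
enum SchemaChoice {
    Normalized,
    OtapStar,
}

/// Map a user-supplied alias to a schema choice.
/// Assumption: aliases are matched exactly and case-sensitively.
fn parse_schema_choice(alias: &str) -> Option<SchemaChoice> {
    match alias {
        // Aliases documented as selecting the default flattened schema.
        "" | "default" | "normalized" | "clickstack" | "clickstack-mode" => {
            Some(SchemaChoice::Normalized)
        }
        // Aliases documented as selecting the multi-table output.
        "otap-star" | "otap_star" => Some(SchemaChoice::OtapStar),
        _ => None,
    }
}
```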
`otap-star` / `otap_star` emits multi-table Arrow batches modeled after the
OpenTelemetry otel-arrow data model. Instead of flattened JSON columns such as
`events_json`, `links_json`, `metric_attributes`, or `exemplars_json`, child
entities are emitted as separate tables keyed by deterministic `id` and
`parent_id` columns. Use `iter_named_batches()` on `OtapLogsBatches`,
`OtapTracesBatches`, or `OtapMetricsBatches` to serialize each named table.
The FFI and WASM bindings continue to expose the normalized single-batch shape in
this release. `otap-star` is Rust API only to avoid changing those ABIs.
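
The difference between a flattened `events_json` column and a child table keyed by `id` and `parent_id` can be sketched with plain structs. Everything here is illustrative: the `Span`, `SpanRow`, and `EventRow` types are hypothetical, and using the row position as the `id` is an assumption, not necessarily how the crate derives its deterministic ids.

```rust
/// Hypothetical input: a span that owns its events.
struct Span {
    name: String,
    events: Vec<String>,
}

/// Row of the parent table.
#[derive(Debug, PartialEq)]
struct SpanRow {
    id: u32,
    name: String,
}

/// Row of the child table; `parent_id` points at `SpanRow::id`.
#[derive(Debug, PartialEq)]
struct EventRow {
    parent_id: u32,
    name: String,
}

/// Split nested spans into a parent table and a child table.
/// Assumption: the id is simply the span's position in the input.
fn split_into_tables(spans: &[Span]) -> (Vec<SpanRow>, Vec<EventRow>) {
    let mut span_rows = Vec::new();
    let mut event_rows = Vec::new();
    for (position, span) in spans.iter().enumerate() {
        let id = position as u32;
        span_rows.push(SpanRow { id, name: span.name.clone() });
        for event in &span.events {
            event_rows.push(EventRow { parent_id: id, name: event.clone() });
        }
    }
    (span_rows, event_rows)
}
```

Joining the child table back to its parent is then an equality join on `parent_id = id`, which is what makes this shape convenient for relational query engines.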
### Breaking Changes In 0.8.0
The 0.7 to 0.8 release intentionally changes the default normalized schema. The
existing `transform_logs`, `transform_traces`, and `transform_metrics` APIs still
return flattened batches by default, but downstream code that selects columns by
name or expects specific Arrow physical types must be updated.
Key normalized-schema changes:
- OTLP/OTAP field names replace older ClickStack-style names: for example,
`timestamp` becomes `time_unix_nano` for logs and metrics,
trace `timestamp` becomes `start_time_unix_nano`, `span_name` becomes `name`,
`span_kind` becomes `kind`, and metric `metric_name`/`metric_description`/
`metric_unit` become `name`/`description`/`unit`.
- Timestamps now use Arrow `Timestamp(Nanosecond)` instead of microsecond or
millisecond-scaled integer columns. Span duration is
`duration_time_unix_nano` with Arrow `Duration(Nanosecond)`.
- Trace and span identifiers are Arrow `FixedSizeBinary(16)` and
`FixedSizeBinary(8)` instead of hex strings.
- Metric number values are split into nullable `int_value` and `double_value`
columns instead of a single `Float64` `value` column.
- Histogram bucket columns now use typed Arrow list columns instead of JSON
strings, and dropped counts/flags/count fields use unsigned Arrow integer
types where OTAP does.
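
Downstream code that still expects the 0.7-style values can convert at the edge. The helpers below are a sketch using only the standard library; the function names are hypothetical, and they operate on values already read out of the Arrow columns rather than on Arrow arrays.

```rust
/// Render a binary trace or span ID (16 or 8 bytes) as a lowercase hex string.
fn id_to_hex(id: &[u8]) -> String {
    id.iter().map(|byte| format!("{:02x}", byte)).collect()
}

/// Scale a nanosecond timestamp down to microseconds (rounding toward
/// negative infinity so pre-epoch values stay ordered).
fn nanos_to_micros(nanos: i64) -> i64 {
    nanos.div_euclid(1_000)
}

/// Scale a nanosecond timestamp down to milliseconds.
fn nanos_to_millis(nanos: i64) -> i64 {
    nanos.div_euclid(1_000_000)
}

/// Collapse the nullable `int_value` / `double_value` pair into one float.
/// Assumption: at most one of the two is set; the double wins if both are.
fn metric_value_as_f64(int_value: Option<i64>, double_value: Option<f64>) -> Option<f64> {
    double_value.or(int_value.map(|v| v as f64))
}
```

Note that converting a large `i64` to `f64` can lose precision, which is one reason to keep the two columns separate where possible.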
The flattened JSON convenience columns remain for now: `resource_attributes`,
`scope_attributes`, signal attribute JSON columns, `events_json`, `links_json`,
and `exemplars_json`. The new `otap-star` output is the more relational
multi-table shape for callers that want child tables instead of flattened JSON.
### Transform Observation
Production callers can opt into phase timings and counters without changing output semantics:
| Function | Description |
|----------|-------------|
| `transform_logs_with_observer(bytes, format, observer)` | Transform logs and report decode/build/append/finalize phases |
| `transform_traces_with_observer(bytes, format, observer)` | Transform traces and report decode/build/attribute JSON/append/finalize phases |
| `transform_metrics_with_observer(bytes, format, observer)` | Transform metrics and report decode/capacity/context/append/finalize phases |
Implement `TransformObserver` to receive `TransformPhaseTiming` and `TransformCounterValue`
events. Counters include duplicate resource/scope context hits and misses plus repeated
resource/scope attribute row-copy counts and bytes.
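
An observer that accumulates timings and counters might look like the sketch below. The `PhaseObserver` trait and its method signatures are hypothetical; the real `TransformObserver` trait, `TransformPhaseTiming`, and `TransformCounterValue` types may be shaped differently, so consult the crate documentation before implementing against them.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for the crate's `TransformObserver` trait.
trait PhaseObserver {
    fn on_phase(&mut self, phase: &str, elapsed: Duration);
    fn on_counter(&mut self, name: &str, value: u64);
}

/// Accumulates total time per phase and total value per counter.
#[derive(Default)]
struct Totals {
    phase_time: HashMap<String, Duration>,
    counters: HashMap<String, u64>,
}

impl PhaseObserver for Totals {
    fn on_phase(&mut self, phase: &str, elapsed: Duration) {
        // Missing entries start at Duration::ZERO, then the sample is added.
        *self.phase_time.entry(phase.to_string()).or_default() += elapsed;
    }

    fn on_counter(&mut self, name: &str, value: u64) {
        *self.counters.entry(name.to_string()).or_default() += value;
    }
}
```

Keeping the observer a plain accumulator means it adds little overhead to the transform and leaves reporting (logs, metrics export) to the caller.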
To observe an OTAP star transform, use the `*_with_schema_and_observer` entry
points, which route to either schema and thread the observer through:
| Function | Description |
|----------|-------------|
| `transform_logs_with_schema_and_observer(bytes, format, schema_output, observer)` | Logs transform with both schema selection and observer |
| `transform_traces_with_schema_and_observer(bytes, format, schema_output, observer)` | Traces transform with both schema selection and observer |
| `transform_metrics_with_schema_and_observer(bytes, format, schema_output, observer)` | Metrics transform with both schema selection and observer |
The OTAP path emits the same phase enum (`ProtobufDecode`, `JsonDecode`,
`JsonlDecode`, `BuilderInit`, `ResourceLogsBuild` / `ResourceSpansBuild` /
`ResourceMetricsBuild`, the matching `Scope*Build` and per-record
`LogRecordBuild` / `SpanBuild` / `MetricBuild`, and `ArrowFinalize`) plus the
`OutputRows`, `Resource/ScopeContextDuplicateHit`, and
`Resource/ScopeContextDuplicateMiss` counters. The
`Resource/ScopeAttributesRowCopies*` counters are normalized-only: OTAP
emits attributes as their own child tables, so no row replication happens.
### Output Functions
| Function | Description |
|----------|-------------|
| `to_json(&batch)` | Convert RecordBatch to NDJSON bytes |
| `to_ipc(&batch)` | Convert RecordBatch to Arrow IPC format |
| `to_parquet(&batch)` | Convert RecordBatch to Parquet (requires feature) |
These serializers operate on one `RecordBatch` at a time. For `otap-star`, call
them per table by iterating named batches.
### Schemas
| Function | Description |
|----------|-------------|
| `logs_schema()` | Arrow schema for log records |
| `traces_schema()` | Arrow schema for trace spans |
| `gauge_schema()` | Arrow schema for gauge metrics |
| `sum_schema()` | Arrow schema for sum metrics |
## Architecture
```
                    +-------------------+
                    |    OTLP Input     |
                    | (Protobuf / JSON) |
                    +---------+---------+
                              |
                              v
                    +---------+---------+
                    |  Format Dispatch  |
                    | (protobuf/jsonl)  |
                    +---------+---------+
                              |
                              v
                    +---------+---------+
                    |   OTLP Request    |
                    |  (prost structs)  |
                    +---------+---------+
                              |
                              v
                    +---------+---------+
                    |  Arrow Builders   |
                    | (direct columns)  |
                    +---------+---------+
                              |
                              v
                    +---------+---------+
                    |    RecordBatch    |
                    +---------+---------+
                              |
        +---------------------+---------------------+
        |                     |                     |
        v                     v                     v
+-------+-------+     +-------+-------+     +-------+-------+
|    NDJSON     |     |   Arrow IPC   |     |    Parquet    |
+---------------+     +---------------+     +---------------+
```
### Public Surface
- **transform functions**: Convert OTLP logs, traces, and metrics to Arrow batches
- **schema functions**: Return the Arrow schemas used by the transform functions
- **partition helpers**: Group transformed batches by service
- **output helpers**: Serialize RecordBatches to NDJSON, Arrow IPC, or Parquet
- **wasm**: WASM bindings (optional)
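
Grouping by service, as the partition helpers do, amounts to bucketing rows by their `service_name` value. The sketch below shows the idea on plain string slices; `partition_by_service` is a hypothetical function, not the crate's helper, whose actual name and signature are not documented here.

```rust
use std::collections::BTreeMap;

/// Group row indices by service name.
/// The input stands in for the values of a `service_name` column.
fn partition_by_service(service_names: &[&str]) -> BTreeMap<String, Vec<usize>> {
    let mut groups: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (row, name) in service_names.iter().enumerate() {
        groups.entry((*name).to_string()).or_default().push(row);
    }
    groups
}
```

With row indices per service in hand, a caller could slice one batch into per-service batches, for example to write one Parquet file per service.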
## Output Schemas
`SchemaOutput::Normalized` is the default flattened schema. In 0.8.0 it uses
OTAP-compatible field names and high-value Arrow physical types while keeping
the flattened resource/scope/attribute convenience columns. The `clickstack`
and `clickstack-mode` schema aliases still select this normalized output.
### Logs Schema
| Field | Type | Description |
|-------|------|-------------|
| time_unix_nano | TimestampNanosecond | Log record timestamp |
| observed_time_unix_nano | TimestampNanosecond | When log was observed |
| trace_id | FixedSizeBinary(16) | Trace correlation ID |
| span_id | FixedSizeBinary(8) | Span correlation ID |
| service_name | String | Service name from resource |
| service_namespace | String | Service namespace |
| service_instance_id | String | Service instance ID |
| severity_number | Int32 | Numeric severity (1-24) |
| severity_text | String | Severity string (DEBUG, INFO, etc.) |
| event_name | String | Log event name |
| body | String | Log message body |
| resource_attributes | String | JSON-encoded resource attributes |
| scope_name | String | Instrumentation scope name |
| scope_version | String | Instrumentation scope version |
| scope_attributes | String | JSON-encoded scope attributes |
| log_attributes | String | JSON-encoded log attributes |
| dropped_attributes_count | UInt32 | Dropped log attributes |
| flags | UInt32 | Log flags |
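
The `severity_number` ranges follow the OpenTelemetry logs data model, where each named level spans four consecutive numbers. A small sketch for turning the number into a level name (the function name is hypothetical):

```rust
/// Map an OpenTelemetry severity number (1-24) to its level name.
/// Values outside 1-24, including 0, are reported as unspecified.
fn severity_label(severity_number: i32) -> &'static str {
    match severity_number {
        1..=4 => "TRACE",
        5..=8 => "DEBUG",
        9..=12 => "INFO",
        13..=16 => "WARN",
        17..=20 => "ERROR",
        21..=24 => "FATAL",
        _ => "UNSPECIFIED",
    }
}
```

This is useful when `severity_text` is empty, since producers are not required to populate it.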
### Traces Schema
| Field | Type | Description |
|-------|------|-------------|
| start_time_unix_nano | TimestampNanosecond | Span start time |
| duration_time_unix_nano | DurationNanosecond | Span duration |
| trace_id | FixedSizeBinary(16) | Trace ID |
| span_id | FixedSizeBinary(8) | Span ID |
| parent_span_id | FixedSizeBinary(8) | Parent span ID |
| trace_state | String | W3C trace state |
| name | String | Operation name |
| kind | Int32 | Span kind enum |
| status_code | Int32 | Status code |
| status_status_message | String | Status message |
| service_name | String | Service name from resource |
| service_namespace | String | Service namespace |
| service_instance_id | String | Service instance ID |
| scope_name | String | Instrumentation scope name |
| scope_version | String | Instrumentation scope version |
| scope_attributes | String | JSON-encoded scope attributes |
| span_attributes | String | JSON-encoded span attributes |
| resource_attributes | String | JSON-encoded resource attributes |
| events_json | String | JSON-encoded span events |
| links_json | String | JSON-encoded span links |
| dropped_attributes_count | UInt32 | Dropped attributes count |
| dropped_events_count | UInt32 | Dropped events count |
| dropped_links_count | UInt32 | Dropped links count |
| flags | UInt32 | Span flags |
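
Assuming the `kind` and `status_code` columns carry the raw OTLP enum values (an assumption; the table above only says "enum" and "status code"), they can be decoded with the numbering from the OTLP trace protobuf definitions. The function names are hypothetical.

```rust
/// Decode an OTLP `SpanKind` enum value.
fn span_kind_label(kind: i32) -> &'static str {
    match kind {
        1 => "INTERNAL",
        2 => "SERVER",
        3 => "CLIENT",
        4 => "PRODUCER",
        5 => "CONSUMER",
        // 0 is SPAN_KIND_UNSPECIFIED; unknown values are treated the same way.
        _ => "UNSPECIFIED",
    }
}

/// Decode an OTLP `Status.StatusCode` enum value.
fn status_code_label(code: i32) -> &'static str {
    match code {
        1 => "OK",
        2 => "ERROR",
        // 0 is STATUS_CODE_UNSET; unknown values are treated the same way.
        _ => "UNSET",
    }
}
```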
### Gauge Metrics Schema
| Field | Type | Description |
|-------|------|-------------|
| time_unix_nano | TimestampNanosecond | Data point timestamp |
| start_time_unix_nano | TimestampNanosecond | Start of measurement window |
| name | String | Metric name |
| description | String | Metric description |
| unit | String | Unit of measurement |
| int_value | Int64 | Integer metric value |
| double_value | Float64 | Floating-point metric value |
| service_name | String | Service name from resource |
| service_namespace | String | Service namespace |
| service_instance_id | String | Service instance ID |
| resource_attributes | String | JSON-encoded resource attributes |
| scope_name | String | Instrumentation scope name |
| scope_version | String | Instrumentation scope version |
| scope_attributes | String | JSON-encoded scope attributes |
| metric_attributes | String | JSON-encoded metric attributes |
| flags | UInt32 | Data point flags |
| exemplars_json | String | JSON-encoded exemplars |
### Sum Metrics Schema
Includes all gauge fields plus:
| Field | Type | Description |
|-------|------|-------------|
| aggregation_temporality | Int32 | 1=Delta, 2=Cumulative |
| is_monotonic | Boolean | Whether sum is monotonic |
### Histogram Metrics Schema
Histogram metrics use the common metric context fields above, plus `count`
(`UInt64`), `sum`, `min`, `max`, typed `bucket_counts` (`List`),
typed `explicit_bounds` (`List`), `flags`, `exemplars_json`, and
`aggregation_temporality`.
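
In OTLP, `bucket_counts` has one more entry than `explicit_bounds`, and bucket `i` counts values in the range `(explicit_bounds[i-1], explicit_bounds[i]]`, with the first and last buckets open-ended. The sketch below illustrates that relationship on plain slices; the function names are hypothetical and the code works on values already extracted from the Arrow list columns.

```rust
/// Return the index of the bucket a value falls into.
/// `bounds` must be sorted ascending; upper bounds are inclusive.
fn bucket_index(bounds: &[f64], value: f64) -> usize {
    // Count the bounds strictly below the value.
    bounds.partition_point(|bound| *bound < value)
}

/// Check the structural invariants of one histogram data point:
/// one more bucket than bounds, and bucket counts summing to `count`.
fn is_consistent(count: u64, bucket_counts: &[u64], bounds: &[f64]) -> bool {
    bucket_counts.len() == bounds.len() + 1
        && bucket_counts.iter().sum::<u64>() == count
}
```

For example, with bounds `[0.0, 5.0, 10.0]` there are four buckets, and the value `5.0` lands in bucket 1 because the upper bound is inclusive.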
### Exponential Histogram Metrics Schema
Exponential histograms use the common metric context fields above, plus
`count` (`UInt64`), `sum`, `min`, `max`, `scale`, `zero_count` (`UInt64`),
`zero_threshold`, typed positive/negative bucket-count lists, `flags`,
`exemplars_json`, and `aggregation_temporality`.
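
The `scale` field determines bucket boundaries: in the OTLP data model the base is `2^(2^-scale)`, and the positive bucket with index `i` covers `(base^i, base^(i+1)]`. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the bucket index is the OTLP bucket offset plus the position in the bucket-count list.

```rust
/// Base of an exponential histogram at the given scale: 2^(2^-scale).
fn exp_base(scale: i32) -> f64 {
    2f64.powf(2f64.powi(-scale))
}

/// Lower (exclusive) and upper (inclusive) bounds of a positive bucket.
/// `index` is the OTLP bucket offset plus the position in the count list.
fn exp_bucket_bounds(scale: i32, index: i32) -> (f64, f64) {
    let base = exp_base(scale);
    (base.powi(index), base.powi(index + 1))
}
```

At scale 0 the base is 2, so bucket 3 covers `(8, 16]`; each increment of `scale` splits every bucket in two, doubling the resolution.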
## Cargo Features
| Feature | Description | Default |
|---------|-------------|---------|
| `default` | Core functionality | Yes |
| `parquet` | Enable Parquet output | No |
| `wasm` | Enable WASM bindings | No |
## Performance
- Transforms are plain Rust functions with no interpreter or runtime overhead
- Arc-shared resource/scope values reduce memory allocations
- Arrow columnar format enables efficient compression
- Release builds use LTO and size optimization
## License
Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## Contributing
Contributions welcome! Please ensure:
1. All tests pass: `cargo test`
2. Code is formatted: `cargo fmt`
3. No clippy warnings: `cargo clippy -- -D warnings`