https://github.com/hyparam/codex2parquet

Convert codex logs into a parquet dataset

# codex2parquet

[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
[![dependencies](https://img.shields.io/badge/Dependencies-1-blueviolet)](https://www.npmjs.com/package/codex2parquet?activeTab=dependencies)

A command-line tool to convert Codex session logs to Parquet format for data analysis and AI applications.

## Installation

```bash
npm install -g codex2parquet
```

## Usage

```bash
# Export Codex logs for the current directory to codex_<project>.parquet
codex2parquet

# Export logs from all projects
codex2parquet --all

# Export to custom filename
codex2parquet --output logs.parquet

# Export logs for a specific project directory
codex2parquet --project ~/code/myapp

# Read from a non-default Codex data directory
codex2parquet --codex-dir ~/.codex
```

### Example

```
$ codex2parquet
Exported 231 events from 6 sessions to codex_myapp.parquet

+------------------------------------------+
| Analyze logs with Hyperparam:            |
| npx hyperparam scope codex_myapp.parquet |
+------------------------------------------+
```

## What Gets Exported

Codex stores local data under `~/.codex` by default. This tool reads:

- `~/.codex/sessions/**/*.jsonl`: current Codex rollout logs. Each line is a JSON object with `timestamp`, `type`, and `payload`.
- `~/.codex/sessions/rollout-*.json`: legacy rollout logs. Each file contains a `session` object and an `items` array.
- `~/.codex/state_5.sqlite`: thread metadata, including cwd, title, model, model provider, CLI version, sandbox policy, approval mode, token totals, git metadata, dynamic tools, and subagent parent/child edges.
- `~/.codex/history.jsonl`: prompt history rows with `session_id`, Unix timestamp, and text.
- `~/.codex/logs_2.sqlite`: diagnostic/runtime log rows when the current Node.js runtime includes `node:sqlite`.

The SQLite sources are optional. The exporter reads them through Node's native `node:sqlite` module and does not require a system `sqlite3` command. If the SQLite files are missing or unreadable, the exporter still writes rollout and history rows.
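
To make the JSONL shape concrete, here is a minimal Python sketch (not the exporter's actual code) that reads rollout lines, one JSON object per line, and pulls out the `timestamp`, `type`, and `payload` fields described above. The helper name `read_rollout_events`, the `line_number` key, and the sample records are hypothetical, and skipping blank or malformed lines is an assumption about reasonable behavior rather than something the tool documents.

```python
import json
from typing import Iterable, Iterator


def read_rollout_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSONL line (hypothetical helper)."""
    for line_number, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no event
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # each rollout line is expected to be a JSON object
        yield {
            "line_number": line_number,
            "timestamp": record.get("timestamp"),
            "top_level_type": record.get("type"),
            "payload": record.get("payload"),
        }


# Illustrative input only; real rollout payloads contain more fields.
sample = [
    '{"timestamp": "2025-01-01T00:00:00Z", "type": "session_meta", "payload": {"cwd": "/home/me/myapp"}}',
    "",
    "not json",
    '{"timestamp": "2025-01-01T00:00:05Z", "type": "event_msg", "payload": {"type": "agent_message"}}',
]
events = list(read_rollout_events(sample))
```

Because the helper accepts any iterable of strings, an open file handle (`open(path, encoding="utf-8")`) can be passed in place of the sample list.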

## Output Schema

The generated Parquet file is an event table. It includes one row per rollout event, legacy item, history prompt, or diagnostic log entry.

Important columns:

- `source_kind`: `rollout`, `history`, or `diagnostic_log`
- `project`: Project name derived from `cwd`
- `session_id`: Codex thread/session identifier
- `item_index`: Event index within its source
- `timestamp`: ISO timestamp when available
- `rollout_path`: Source rollout file path
- `top_level_type`: Current JSONL top-level type, such as `session_meta`, `event_msg`, `response_item`, or `turn_context`
- `event_type`: Nested event type for `event_msg` payloads
- `item_type`: Response item type, such as `message`, `reasoning`, `function_call`, or `function_call_output`
- `role`, `name`, `status`, `call_id`, `item_id`, `turn_id`: Common message and tool-call identifiers
- `text`: The primary readable body for messages, user prompts, tool results, agent messages, and diagnostics
- `tool_input_json`, `tool_output`: Tool/function call inputs and decoded outputs
- `model`, `model_provider`, `reasoning_effort`, `cwd`, `title`, `source`, `cli_version`: Thread/session metadata
- `approval_mode`, `sandbox_policy`, `tokens_used`, `git_sha`, `git_branch`, `git_origin_url`: Execution metadata from `state_5.sqlite`
- `input_tokens`, `cached_input_tokens`, `output_tokens`, `reasoning_output_tokens`, `total_tokens`: Token usage when present in event payloads
- `rate_limits_json`, `metadata_json`, `content_json`, `payload_json`, `raw_json`: Metadata and raw JSON preservation columns

All Parquet columns are written as strings to keep the schema stable across Codex log format changes. Rare or source-specific details, such as diagnostic log module paths, dynamic tools, and subagent metadata, are preserved in `metadata_json` instead of becoming mostly-empty top-level columns.
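
Since every column is a string, numeric and JSON columns need to be decoded before analysis. The following Python sketch shows one way to do that for a single row represented as a dict. The helpers `to_int` and `decode_row` and the sample row are hypothetical, and treating both missing values and empty strings as "no value" is an assumption, since the README does not say how absent cells are encoded.

```python
import json
from typing import Optional

# Token columns listed in the schema above.
TOKEN_COLUMNS = (
    "input_tokens",
    "cached_input_tokens",
    "output_tokens",
    "reasoning_output_tokens",
    "total_tokens",
)


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a string cell to int; return None for missing or non-numeric cells."""
    if value is None or value.strip() == "":
        return None
    try:
        return int(value)
    except ValueError:
        return None


def decode_row(row: dict) -> dict:
    """Return a copy of the row with token counts as ints and metadata_json parsed."""
    decoded = dict(row)
    for column in TOKEN_COLUMNS:
        decoded[column] = to_int(row.get(column))
    raw_metadata = row.get("metadata_json")
    try:
        decoded["metadata"] = json.loads(raw_metadata) if raw_metadata else None
    except json.JSONDecodeError:
        decoded["metadata"] = None  # undecodable metadata is left out
    return decoded


# Hypothetical row, shaped like a record read from the Parquet file.
row = {
    "source_kind": "rollout",
    "model": "example-model",
    "input_tokens": "1200",
    "output_tokens": "340",
    "total_tokens": "1540",
    "metadata_json": '{"module_path": "example::module"}',
}
decoded = decode_row(row)
```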

## Options

- `--output <file>`, `-o <file>`: Output Parquet filename (default: `codex_<project>.parquet`, or `codex_logs.parquet` with `--all`)
- `--project <path>`: Filter logs to a specific project directory
- `--all`: Export logs from all Codex projects
- `--since <date>`: Only include rows on or after this date (`YYYY-MM-DD` or ISO timestamp)
- `--until <date>`: Only include rows on or before this date (`YYYY-MM-DD` or ISO timestamp); a bare date includes the full day
- `--codex-dir <path>`: Codex data directory (default: `~/.codex`)
- `--no-history`: Skip prompt history rows
- `--no-diagnostics`: Skip diagnostic log rows
- `--help`, `-h`: Show help message
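
The `--since` and `--until` semantics can be illustrated with a small Python sketch. It is one interpretation of the rule that a bare `--until` date covers the full day, not the tool's implementation. The function names `parse_bound` and `in_range` are hypothetical, and interpreting bare dates and timestamps without an offset as UTC is an assumption; the tool may use local time instead.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_bound(value: str, is_until: bool) -> datetime:
    """Parse `YYYY-MM-DD` or an ISO timestamp into a timezone-aware datetime."""
    if len(value) == 10:  # bare date such as 2025-01-31
        day = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        # A bare --until date covers the whole day: use its last microsecond.
        return day + timedelta(days=1, microseconds=-1) if is_until else day
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are treated as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def in_range(timestamp: str, since: Optional[str] = None, until: Optional[str] = None) -> bool:
    """Return True when the row timestamp falls inside the inclusive range."""
    moment = parse_bound(timestamp, is_until=False)
    if since is not None and moment < parse_bound(since, is_until=False):
        return False
    if until is not None and moment > parse_bound(until, is_until=True):
        return False
    return True


# The last second of January 31 is kept by --until 2025-01-31 ...
kept = in_range("2025-01-31T23:59:59Z", until="2025-01-31")
# ... while the first instant of February 1 is not.
dropped = in_range("2025-02-01T00:00:00Z", until="2025-01-31")
```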

## Requirements

- Node.js 22.5.0 or newer. SQLite enrichment uses native `node:sqlite`; no `sqlite3` CLI is required.
- Codex local data in `~/.codex`

## Use Cases

- Analyzing Codex usage patterns across projects
- Building datasets from human-agent coding sessions
- Auditing tool calls, command outputs, and runtime diagnostics
- Creating dashboards over models, projects, token usage, and git branches
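
As a rough illustration of the first use case, the Python sketch below counts events per project and source kind. It assumes the rows have already been loaded from the Parquet file into a list of dicts keyed by column name (for example with a Parquet reader of your choice); the helper `count_events` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_events(rows: Iterable[dict], keys: Sequence[str] = ("project", "source_kind")) -> Counter:
    """Count rows grouped by the given columns; missing values become ''."""
    counts: Counter = Counter()
    for row in rows:
        counts[tuple(row.get(key) or "" for key in keys)] += 1
    return counts


# Hypothetical rows; real exports carry many more columns.
rows = [
    {"project": "myapp", "source_kind": "rollout"},
    {"project": "myapp", "source_kind": "rollout"},
    {"project": "myapp", "source_kind": "history"},
    {"project": "other", "source_kind": "diagnostic_log"},
]
counts = count_events(rows)
for (project, source_kind), total in sorted(counts.items()):
    print(f"{project:10} {source_kind:15} {total}")
```

Passing a different `keys` tuple, such as `("model",)` or `("git_branch",)`, groups by other schema columns in the same way.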

## Hyperparam

[Hyperparam](https://hyperparam.app) is a tool for exploring and curating AI datasets, such as those produced by codex2parquet.