https://github.com/hyparam/codex2parquet
Convert codex logs into a parquet dataset
Last synced: 18 days ago
- Host: GitHub
- URL: https://github.com/hyparam/codex2parquet
- Owner: hyparam
- License: mit
- Created: 2026-04-14T05:07:17.000Z (2 months ago)
- Default Branch: master
- Last Pushed: 2026-04-28T21:20:39.000Z (about 2 months ago)
- Last Synced: 2026-04-28T23:17:31.683Z (about 2 months ago)
- Language: JavaScript
- Size: 42 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# codex2parquet
[MIT license](https://opensource.org/licenses/MIT)
[dependencies](https://www.npmjs.com/package/codex2parquet?activeTab=dependencies)
A command-line tool to convert Codex session logs to Parquet format for data analysis and AI applications.
## Installation
```bash
npm install -g codex2parquet
```
## Usage
```bash
# Export Codex logs for the current directory to codex_<project>.parquet
codex2parquet
# Export logs from all projects
codex2parquet --all
# Export to custom filename
codex2parquet --output logs.parquet
# Export logs for a specific project directory
codex2parquet --project ~/code/myapp
# Read from a non-default Codex data directory
codex2parquet --codex-dir ~/.codex
```
### Example
```
$ codex2parquet
Exported 231 events from 6 sessions to codex_myapp.parquet
+------------------------------------------+
| Analyze logs with Hyperparam: |
| npx hyperparam scope codex_myapp.parquet |
+------------------------------------------+
```
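The default filename in the example can be illustrated with a minimal sketch. It assumes the project name is the last path segment of the working directory, which matches `codex_myapp.parquet` above but is not confirmed by the tool's source; `defaultOutputName` is a hypothetical helper, not part of the CLI.

```javascript
const path = require('node:path');

// Hypothetical helper: derive the default output filename.
// Assumption: the project name is the last segment of the working directory.
function defaultOutputName(cwd, all = false) {
  // With --all, the documented default is a fixed name.
  if (all) return 'codex_logs.parquet';
  const project = path.basename(path.resolve(cwd));
  return `codex_${project}.parquet`;
}

console.log(defaultOutputName('/home/me/code/myapp')); // codex_myapp.parquet
console.log(defaultOutputName('/home/me/code/myapp', true)); // codex_logs.parquet
```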
## What Gets Exported
Codex stores local data under `~/.codex` by default. This tool reads:
- `~/.codex/sessions/**/*.jsonl`: current Codex rollout logs. Each line is a JSON object with `timestamp`, `type`, and `payload`.
- `~/.codex/sessions/rollout-*.json`: legacy rollout logs. Each file contains a `session` object and an `items` array.
- `~/.codex/state_5.sqlite`: thread metadata, including cwd, title, model, model provider, CLI version, sandbox policy, approval mode, token totals, git metadata, dynamic tools, and subagent parent/child edges.
- `~/.codex/history.jsonl`: prompt history rows with `session_id`, Unix timestamp, and text.
- `~/.codex/logs_2.sqlite`: diagnostic/runtime log rows when the current Node.js runtime includes `node:sqlite`.
The SQLite sources are optional. The exporter reads them through Node's native `node:sqlite` module and does not require a system `sqlite3` command. If the SQLite files are missing or unreadable, the exporter still writes rollout and history rows.
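As a rough illustration of the current rollout format, the sketch below parses the text of one `.jsonl` file into objects with `timestamp`, `type`, and `payload`. The function name `parseRolloutJsonl`, the sample lines, and the choice to skip malformed lines are all assumptions for illustration; the exporter may handle these cases differently.

```javascript
// Parse the text of one rollout .jsonl file into event objects.
// Assumption: blank lines and lines that are not valid JSON are skipped.
function parseRolloutJsonl(text) {
  const events = [];
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      const { timestamp, type, payload } = JSON.parse(trimmed);
      events.push({ timestamp, type, payload });
    } catch {
      // Ignore malformed lines, e.g. a partially written final line.
    }
  }
  return events;
}

// Made-up sample content; the payload fields are illustrative only.
const sample = [
  '{"timestamp":"2026-04-20T10:00:00.000Z","type":"session_meta","payload":{"cwd":"/home/me/code/myapp"}}',
  '',
  '{"timestamp":"2026-04-20T10:00:05.000Z","type":"event_msg","payload":{"type":"agent_message"}}',
  '{"truncated":',
].join('\n');

console.log(parseRolloutJsonl(sample).map((e) => e.type).join(', ')); // session_meta, event_msg
```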
## Output Schema
The generated Parquet file is an event table. It includes one row per rollout event, legacy item, history prompt, or diagnostic log entry.
Important columns:
- `source_kind`: `rollout`, `history`, or `diagnostic_log`
- `project`: Project name derived from `cwd`
- `session_id`: Codex thread/session identifier
- `item_index`: Event index within its source
- `timestamp`: ISO timestamp when available
- `rollout_path`: Source rollout file path
- `top_level_type`: Current JSONL top-level type, such as `session_meta`, `event_msg`, `response_item`, or `turn_context`
- `event_type`: Nested event type for `event_msg` payloads
- `item_type`: Response item type, such as `message`, `reasoning`, `function_call`, or `function_call_output`
- `role`, `name`, `status`, `call_id`, `item_id`, `turn_id`: Common message and tool-call identifiers
- `text`: The primary readable body for messages, user prompts, tool results, agent messages, and diagnostics
- `tool_input_json`, `tool_output`: Tool/function call inputs and decoded outputs
- `model`, `model_provider`, `reasoning_effort`, `cwd`, `title`, `source`, `cli_version`: Thread/session metadata
- `approval_mode`, `sandbox_policy`, `tokens_used`, `git_sha`, `git_branch`, `git_origin_url`: Execution metadata from `state_5.sqlite`
- `input_tokens`, `cached_input_tokens`, `output_tokens`, `reasoning_output_tokens`, `total_tokens`: Token usage when present in event payloads
- `rate_limits_json`, `metadata_json`, `content_json`, `payload_json`, `raw_json`: Metadata and raw JSON preservation columns
All Parquet columns are written as strings to keep the schema stable across Codex log format changes. Rare or source-specific details, such as diagnostic log module paths, dynamic tools, and subagent metadata, are preserved in `metadata_json` instead of becoming mostly-empty top-level columns.
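The all-strings convention could look something like the following sketch, which flattens one parsed rollout line into a row. The helper names, the use of `null` for missing values, and the assumption that the nested type lives at `payload.type` are illustrative guesses rather than the exporter's actual logic.

```javascript
// Convert any value to a string column value.
// Assumption: missing values are stored as null.
function toStringColumn(value) {
  if (value === undefined || value === null) return null;
  return typeof value === 'string' ? value : JSON.stringify(value);
}

// Hypothetical flattening of one parsed rollout line into a row.
// Assumption: the nested type is found at payload.type.
function rolloutEventToRow(event, sessionId, itemIndex) {
  const payload = event.payload ?? {};
  return {
    source_kind: 'rollout',
    session_id: toStringColumn(sessionId),
    item_index: toStringColumn(itemIndex),
    timestamp: toStringColumn(event.timestamp),
    top_level_type: toStringColumn(event.type),
    event_type: event.type === 'event_msg' ? toStringColumn(payload.type) : null,
    item_type: event.type === 'response_item' ? toStringColumn(payload.type) : null,
    payload_json: toStringColumn(payload),
    raw_json: JSON.stringify(event),
  };
}

// Made-up event for illustration.
const row = rolloutEventToRow(
  {
    timestamp: '2026-04-20T10:00:05.000Z',
    type: 'response_item',
    payload: { type: 'function_call', name: 'shell' },
  },
  'session-123',
  3,
);
console.log(row.item_index, row.item_type); // 3 function_call
```

Keeping numbers such as `item_index` as strings means consumers need to convert them before sorting or doing arithmetic.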
## Options
- `--output <file>`, `-o <file>`: Output parquet filename (default: `codex_<project>.parquet`, or `codex_logs.parquet` with `--all`)
- `--project <dir>`: Filter logs to a specific project directory
- `--all`: Export logs from all Codex projects
- `--since <date>`: Only include rows on or after this date (`YYYY-MM-DD` or ISO timestamp)
- `--until <date>`: Only include rows on or before this date (`YYYY-MM-DD` or ISO timestamp); bare dates are inclusive of the full day
- `--codex-dir <dir>`: Codex data directory (default: `~/.codex`)
- `--no-history`: Skip prompt history rows
- `--no-diagnostics`: Skip diagnostic log rows
- `--help`, `-h`: Show help message
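The `--since` and `--until` semantics described above can be sketched as follows. This is only an illustration: `parseBound` and `inRange` are hypothetical names, bare dates are assumed to be interpreted in UTC, and rows without a usable timestamp are assumed to be dropped when a filter is applied. The real tool may differ on all three points.

```javascript
const DATE_ONLY = /^\d{4}-\d{2}-\d{2}$/;
const DAY_MS = 24 * 60 * 60 * 1000;

// Turn a --since/--until value into a millisecond bound.
// Assumption: bare YYYY-MM-DD dates are interpreted in UTC.
function parseBound(value, isUntil) {
  const bare = DATE_ONLY.test(value);
  const ms = Date.parse(bare ? `${value}T00:00:00.000Z` : value);
  if (Number.isNaN(ms)) throw new Error(`Invalid date: ${value}`);
  // For --until, a bare date extends to the last millisecond of that day.
  return bare && isUntil ? ms + DAY_MS - 1 : ms;
}

// Check whether a row timestamp falls inside the optional bounds.
function inRange(timestamp, since, until) {
  const ms = Date.parse(timestamp);
  if (Number.isNaN(ms)) return false; // assumption: drop rows without a usable timestamp
  if (since !== undefined && ms < parseBound(since, false)) return false;
  if (until !== undefined && ms > parseBound(until, true)) return false;
  return true;
}

console.log(inRange('2026-04-20T23:59:59.000Z', '2026-04-20', '2026-04-20')); // true
console.log(inRange('2026-04-21T00:00:00.000Z', undefined, '2026-04-20')); // false
```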
## Requirements
- Node.js 22.5.0 or newer. SQLite enrichment uses native `node:sqlite`; no `sqlite3` CLI is required.
- Codex local data in `~/.codex`
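To check whether the current runtime can provide the SQLite enrichment, something like the sketch below could be used. `loadSqlite` is a hypothetical helper; it only tests whether `node:sqlite` can be loaded and does not reflect how the exporter itself performs the check.

```javascript
// Try to load Node's built-in SQLite module; return null when it is unavailable.
function loadSqlite() {
  try {
    return require('node:sqlite');
  } catch {
    return null;
  }
}

const sqlite = loadSqlite();
if (sqlite) {
  console.log('node:sqlite is available; SQLite sources can be read');
} else {
  console.log(`node:sqlite is not available on Node ${process.versions.node}; SQLite sources would be skipped`);
}
```

On some Node.js releases the module may print an experimental warning when loaded.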
## Use Cases
- Analyzing Codex usage patterns across projects
- Building datasets from human-agent coding sessions
- Auditing tool calls, command outputs, and runtime diagnostics
- Creating dashboards over models, projects, token usage, and git branches
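As a small example of the usage-pattern analysis listed above, the sketch below sums `total_tokens` per `model`. It assumes the rows have already been read from the Parquet file into plain objects (the reader is out of scope here) and uses made-up data. The README does not say whether `total_tokens` is per-event or cumulative, so check that before trusting a simple sum.

```javascript
// Sum total_tokens per model. Columns are strings, so convert before adding.
function totalTokensByModel(rows) {
  const totals = new Map();
  for (const row of rows) {
    const tokens = Number(row.total_tokens);
    // Skip rows with no model, no token value, or a non-numeric token value.
    if (!row.model || !row.total_tokens || !Number.isFinite(tokens)) continue;
    totals.set(row.model, (totals.get(row.model) ?? 0) + tokens);
  }
  return totals;
}

// Made-up rows shaped like the exported table; model names are placeholders.
const rows = [
  { model: 'model-a', total_tokens: '1200' },
  { model: 'model-a', total_tokens: '300' },
  { model: 'model-b', total_tokens: '50' },
  { model: 'model-b', total_tokens: null },
];

const totals = totalTokensByModel(rows);
console.log(totals.get('model-a'), totals.get('model-b')); // 1500 50
```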
## Hyperparam
[Hyperparam](https://hyperparam.app) is a tool for exploring and curating AI datasets, such as those produced by codex2parquet.