{"id":50680280,"url":"https://github.com/hyparam/codex2parquet","last_synced_at":"2026-06-08T18:03:29.186Z","repository":{"id":352788963,"uuid":"1210107585","full_name":"hyparam/codex2parquet","owner":"hyparam","description":"Convert codex logs into a parquet dataset","archived":false,"fork":false,"pushed_at":"2026-04-28T21:20:39.000Z","size":43,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-28T23:17:31.683Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyparam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-14T05:07:17.000Z","updated_at":"2026-04-28T21:20:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hyparam/codex2parquet","commit_stats":null,"previous_names":["hyparam/codex2parquet"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hyparam/codex2parquet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcodex2parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcodex2parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcodex2parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcodex2parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyparam","download_url":"https://codeload.github.com/hyparam/codex2parquet/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcodex2parquet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34073810,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-08T18:03:28.278Z","updated_at":"2026-06-08T18:03:29.179Z","avatar_url":"https://github.com/hyparam.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# codex2parquet\n\n[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)\n[![dependencies](https://img.shields.io/badge/Dependencies-1-blueviolet)](https://www.npmjs.com/package/codex2parquet?activeTab=dependencies)\n\nA command-line tool to convert Codex session logs to Parquet format for data analysis and AI applications.\n\n## Installation\n\n```bash\nnpm install -g codex2parquet\n```\n\n## Usage\n\n```bash\n# Export Codex logs for current directory to codex_\u003cproject\u003e.parquet\ncodex2parquet\n\n# Export logs from all projects\ncodex2parquet --all\n\n# Export to custom filename\ncodex2parquet --output logs.parquet\n\n# Export logs for a specific project directory\ncodex2parquet --project ~/code/myapp\n\n# Read from a non-default Codex data directory\ncodex2parquet --codex-dir ~/.codex\n```\n\n### Example\n\n```\n$ codex2parquet\nExported 231 events from 6 sessions to codex_myapp.parquet\n\n+------------------------------------------+\n| Analyze logs with Hyperparam:            |\n| npx hyperparam scope codex_myapp.parquet |\n+------------------------------------------+\n```\n\n## What Gets Exported\n\nCodex stores local data under `~/.codex` by default. This tool reads:\n\n- `~/.codex/sessions/**/*.jsonl`: current Codex rollout logs. Each line is a JSON object with `timestamp`, `type`, and `payload`.\n- `~/.codex/sessions/rollout-*.json`: legacy rollout logs. Each file contains a `session` object and an `items` array.\n- `~/.codex/state_5.sqlite`: thread metadata, including cwd, title, model, model provider, CLI version, sandbox policy, approval mode, token totals, git metadata, dynamic tools, and subagent parent/child edges.\n- `~/.codex/history.jsonl`: prompt history rows with `session_id`, Unix timestamp, and text.\n- `~/.codex/logs_2.sqlite`: diagnostic/runtime log rows when the current Node.js runtime includes `node:sqlite`.\n\nThe SQLite sources are optional. The exporter reads them through Node's native `node:sqlite` module and does not require a system `sqlite3` command. If the SQLite files are missing or unreadable, the exporter still writes rollout and history rows.\n\n## Output Schema\n\nThe generated Parquet file is an event table. It includes one row per rollout event, legacy item, history prompt, or diagnostic log entry.\n\nImportant columns:\n\n- `source_kind`: `rollout`, `history`, or `diagnostic_log`\n- `project`: Project name derived from `cwd`\n- `session_id`: Codex thread/session identifier\n- `item_index`: Event index within its source\n- `timestamp`: ISO timestamp when available\n- `rollout_path`: Source rollout file path\n- `top_level_type`: Current JSONL top-level type, such as `session_meta`, `event_msg`, `response_item`, or `turn_context`\n- `event_type`: Nested event type for `event_msg` payloads\n- `item_type`: Response item type, such as `message`, `reasoning`, `function_call`, or `function_call_output`\n- `role`, `name`, `status`, `call_id`, `item_id`, `turn_id`: Common message and tool-call identifiers\n- `text`: The primary readable body for messages, user prompts, tool results, agent messages, and diagnostics\n- `tool_input_json`, `tool_output`: Tool/function call inputs and decoded outputs\n- `model`, `model_provider`, `reasoning_effort`, `cwd`, `title`, `source`, `cli_version`: Thread/session metadata\n- `approval_mode`, `sandbox_policy`, `tokens_used`, `git_sha`, `git_branch`, `git_origin_url`: Execution metadata from `state_5.sqlite`\n- `input_tokens`, `cached_input_tokens`, `output_tokens`, `reasoning_output_tokens`, `total_tokens`: Token usage when present in event payloads\n- `rate_limits_json`, `metadata_json`, `content_json`, `payload_json`, `raw_json`: Metadata and raw JSON preservation columns\n\nAll Parquet columns are written as strings to keep the schema stable across Codex log format changes. Rare or source-specific details, such as diagnostic log module paths, dynamic tools, and subagent metadata, are preserved in `metadata_json` instead of becoming mostly-empty top-level columns.\n\n## Options\n\n- `--output \u003cfile\u003e`, `-o \u003cfile\u003e`: Output parquet filename (default: `codex_\u003cproject\u003e.parquet`, or `codex_logs.parquet` with `--all`)\n- `--project \u003cpath\u003e`: Filter logs to a specific project directory\n- `--all`: Export logs from all Codex projects\n- `--since \u003cdate\u003e`: Only include rows on or after this date (`YYYY-MM-DD` or ISO timestamp)\n- `--until \u003cdate\u003e`: Only include rows on or before this date (`YYYY-MM-DD` or ISO timestamp); bare dates are inclusive of the full day\n- `--codex-dir \u003cpath\u003e`: Codex data directory (default: `~/.codex`)\n- `--no-history`: Skip prompt history rows\n- `--no-diagnostics`: Skip diagnostic log rows\n- `--help`, `-h`: Show help message\n\n## Requirements\n\n- Node.js 22.5.0 or newer. SQLite enrichment uses native `node:sqlite`; no `sqlite3` CLI is required.\n- Codex local data in `~/.codex`\n\n## Use Cases\n\n- Analyzing Codex usage patterns across projects\n- Building datasets from human-agent coding sessions\n- Auditing tool calls, command outputs, and runtime diagnostics\n- Creating dashboards over models, projects, token usage, and git branches\n\n## Hyperparam\n\n[Hyperparam](https://hyperparam.app) is a tool for exploring and curating AI datasets, such as those produced by codex2parquet.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fcodex2parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyparam%2Fcodex2parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fcodex2parquet/lists"}