{"id":26863918,"url":"https://github.com/bhupixb/pushshift-go","last_synced_at":"2026-05-18T04:01:38.711Z","repository":{"id":284583234,"uuid":"955393385","full_name":"bhupixb/pushshift-go","owner":"bhupixb","description":"Efficiently read reddit data from pushshift dataset in zst format and convert into Parquet files using DuckDB.","archived":false,"fork":false,"pushed_at":"2025-03-26T15:55:58.000Z","size":26,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-20T09:39:20.905Z","etag":null,"topics":["duckdb","golang","pushshift","reddit"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bhupixb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-26T15:11:06.000Z","updated_at":"2025-03-26T15:56:01.000Z","dependencies_parsed_at":"2025-03-26T16:40:33.960Z","dependency_job_id":"815fa80d-ae9c-4707-a734-6f8e35b504e2","html_url":"https://github.com/bhupixb/pushshift-go","commit_stats":null,"previous_names":["bhupixb/pushshift-go"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bhupixb/pushshift-go","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhupixb%2Fpushshift-go","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhupixb%2Fpushshift-go/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhupixb%2Fpushshift-go/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhu
pixb%2Fpushshift-go/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bhupixb","download_url":"https://codeload.github.com/bhupixb/pushshift-go/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhupixb%2Fpushshift-go/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33164672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T22:39:12.733Z","status":"online","status_checked_at":"2026-05-18T02:00:06.436Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duckdb","golang","pushshift","reddit"],"created_at":"2025-03-31T03:33:12.636Z","updated_at":"2026-05-18T04:01:38.691Z","avatar_url":"https://github.com/bhupixb.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pushshift-Go\n\nEfficiently read large zst files of Reddit data from Pushshift and convert them into Parquet files using DuckDB.\n\nA Go tool for processing large zst-compressed JSON files from Pushshift, splitting them into manageable parts, and converting them to Parquet format.\n\n## Overview\n\nThis tool efficiently processes large zst-compressed JSON files from Pushshift by:\n\n1. Decompressing the zst file on-the-fly.\n2. Writing the decompressed data to a file in JSON format.\n3. Converting the JSON file to Parquet format using DuckDB once it reaches the part size threshold (8GB by default).\n4. 
DuckDB does the heavy lifting: it reads the JSON files, automatically infers the schema, and creates a DuckDB table. The table is then copied to an output file in Parquet format. See the `json_to_parquet_duckdb.sh` script.\n\nThis approach provides:\n- Memory-efficient processing of zst files that are too large for single-pass conversion. Fully decompressing a 50GB zst file to JSON would require more than 1000 GB of storage, because the zst:json compression ratio is about 1:25.\n- Bounded intermediate storage: the data is read in chunks of 8GB, and each chunk is converted to Parquet. The zst:parquet size ratio is only about 1:3.\n- Optimized disk usage by removing intermediate files.\n\n## Prerequisites\n\n- Go 1.19+\n- DuckDB installed and available in PATH\n- `json_to_parquet_duckdb.sh` script in the project root (used for JSONL to Parquet conversion)\n\n### Installing DuckDB\n\nmacOS:\n```bash\nbrew install duckdb\n```\n\nLinux:\n```bash\n# Check your distribution's package manager or download from:\n# https://duckdb.org/docs/installation/\n```\n\n## Build\n\n```bash\ngo build -o pushshift-processor ./cmd/processor\n```\n\n## Usage\n\n```bash\n./pushshift-processor -input=your_data.zst -output=output_prefix\n```\n\nThe tool will:\n1. Process the zst file in chunks\n2. Create output files with the pattern: `output_prefix_part_001.parquet`, `output_prefix_part_002.parquet`, etc.\n\n### Command-line parameters\n\n- `-input`: Path to the input zst file (required)\n- `-output`: Output file prefix (defaults to \"output\")\n\n## Converter Script\n\nThe project includes a converter script, `json_to_parquet_duckdb.sh`, in the project root. It converts JSONL files to Parquet format using DuckDB.\n\nThe script takes the following parameters:\n```bash\n./json_to_parquet_duckdb.sh \u003cjsonl_file\u003e [output_name]\n```\n\n### How it works\n\nWhen you run the processor, it automatically calls this script to convert each part file from JSONL to Parquet format. The script:\n\n1. 
Takes a JSONL part file as input\n2. Uses DuckDB to read the JSON data\n3. Exports the data to Parquet format\n4. Cleans up temporary tables\n\n## Performance Tuning\n\nThe processor uses the following size constants, which can be adjusted in the code for different performance characteristics:\n\n```go\nconst (\n    partSizeThreshold = 8 * 1024 * 1024 * 1024 // 8GB for each part file\n    bufferSize        = 512 * 1024 * 1024      // 512MB buffer for reading\n    scannerBufferSize = 512 * 1024 * 1024      // 512MB buffer for scanner\n)\n```\n\n- Increase `partSizeThreshold` for fewer, larger output files\n- Adjust buffer sizes based on available memory\n\n## Parquet Benefits\n\nThe Parquet output format provides several advantages:\n- **Column-based storage**: More efficient for analytical queries\n- **Built-in compression**: Significantly reduces file size\n- **Predicate pushdown**: Query optimization for faster analytics\n- **Direct integration**: With tools like DuckDB, Apache Spark, Dask, etc.\n\n## Example Output\n\nOn my local Mac M3 (12 cores, 18GB), it takes about 2min 21s to decompress a 1.7GB zst file, which is ~46GB uncompressed (JSON) and ~3GB in Parquet format.\n\n```\n2025/03/26 20:53:07.159027 📊 Statistics:\n  📝 Total lines processed: 16680905\n  ⏱️  Execution time: 2m21.155298333s\n\n📊 Statistics:\n  📝 Total lines processed: 16680905\n  ⏱️  Execution time: 2m21.155298333s\n2025/03/26 20:53:07.159042 ✅ All done!\n```\n\n## Testing the Setup\n\nTo verify your setup is working correctly:\n\n1. Create a small test JSONL file with some valid JSON data:\n   ```bash\n   echo '{\"id\": 1, \"text\": \"test1\"}' \u003e test.jsonl\n   echo '{\"id\": 2, \"text\": \"test2\"}' \u003e\u003e test.jsonl\n   ```\n\n2. Compress it with zstd:\n   ```bash\n   # Install zstd if needed\n   # macOS: brew install zstd\n   # Ubuntu/Debian: apt-get install zstd\n   zstd test.jsonl -o test.jsonl.zst\n   ```\n\n3. 
Run the processor:\n   ```bash\n   ./pushshift-processor -input=test.jsonl.zst -output=test_output\n   ```\n\n4. Check the output:\n   ```bash\n   # Verify the Parquet file was created\n   ls -la test_output_part_001.parquet\n   \n   # You can view the Parquet file contents with DuckDB\n   duckdb -c \"SELECT * FROM read_parquet('test_output_part_001.parquet');\"\n   ```\n\n## Troubleshooting\n\n1. **DuckDB not found**: Ensure DuckDB is installed and available in your PATH\n2. **Converter script errors**: Make sure the script is executable and located in the project root\n3. **Go build errors**: Verify your Go installation and that all dependencies are installed\n\nFor any issues, check the error logs, which are displayed when running the processor.\nFeel free to open an issue.\n\n## Querying Parquet Files with DuckDB\n\nAfter processing your data, you can analyze the resulting Parquet files using DuckDB, which provides excellent performance for analytical queries.\nRefer to https://duckdb.org/docs/stable/data/parquet/overview for more details.\n\n### Basic Querying\n\nTo run basic queries on your output Parquet files (e.g. to inspect the schema):\n\n```bash\n# Start DuckDB CLI\nduckdb\n\n# Create a table from the first part file (default \"output\" prefix)\n\u003e CREATE TABLE my_table AS\n  SELECT * FROM read_parquet('output_part_001.parquet');\n\n# See all columns and their data types\n\u003e SELECT column_name, data_type\n    FROM information_schema.columns\n    WHERE table_name = 'my_table';\n\n# Copy the schema to a file\n\u003e COPY (SELECT column_name, data_type\n    FROM information_schema.columns\n    WHERE table_name = 'my_table') TO 'schema.csv' (FORMAT CSV, HEADER TRUE);\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbhupixb%2Fpushshift-go","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbhupixb%2Fpushshift-go","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbhupixb%2Fpushshift-go/lists"}