{"id":41816790,"url":"https://github.com/jrothbaum/stata_parquet_io","last_synced_at":"2026-01-25T07:21:21.836Z","repository":{"id":292464606,"uuid":"972988285","full_name":"jrothbaum/stata_parquet_io","owner":"jrothbaum","description":"Read and write parquet files with stata (using the Stata c plugin api and rust polars)","archived":false,"fork":false,"pushed_at":"2026-01-19T19:39:10.000Z","size":750,"stargazers_count":4,"open_issues_count":1,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-01-20T01:25:41.116Z","etag":null,"topics":["parquet","stata"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jrothbaum.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-04-26T02:59:03.000Z","updated_at":"2025-11-20T15:35:23.000Z","dependencies_parsed_at":"2025-05-10T06:24:56.974Z","dependency_job_id":"f9e5478b-0cf3-4970-9c1d-08445e6151c6","html_url":"https://github.com/jrothbaum/stata_parquet_io","commit_stats":null,"previous_names":["jrothbaum/stata_parquet_io"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/jrothbaum/stata_parquet_io","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrothbaum%2Fstata_parquet_io","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrothbaum%2Fstata_parquet_io/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j
rothbaum%2Fstata_parquet_io/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrothbaum%2Fstata_parquet_io/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jrothbaum","download_url":"https://codeload.github.com/jrothbaum/stata_parquet_io/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrothbaum%2Fstata_parquet_io/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28747336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T05:12:38.112Z","status":"ssl_error","status_checked_at":"2026-01-25T05:04:50.338Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parquet","stata"],"created_at":"2026-01-25T07:21:21.108Z","updated_at":"2026-01-25T07:21:21.828Z","avatar_url":"https://github.com/jrothbaum.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Read/Write parquet files in stata\n\n`pq` is a Stata package that enables reading and writing Parquet files directly in Stata. 
This plugin bridges the gap between Stata's data analysis capabilities and the increasingly popular Parquet file format, which is optimized for storage and performance with large datasets.\n\n## Features\n\n- **Read** Parquet files into Stata datasets\n- **Write** Stata datasets to Parquet files  \n- **Append** Parquet files to existing data in memory\n- **Merge** Parquet files with existing data using standard Stata merge syntax\n- **Describe** Parquet file structure without loading the data\n- Support for variable selection and filtering during read operations\n- Automatic handling of data types between Stata and Parquet\n- Preserves original Parquet column names via variable labels\n- Performance optimizations including compression and parallel processing\n- Support for partitioned datasets and wildcard file patterns\n\n## Installation\n### Option 1: Install from SSC - https://ideas.repec.org/c/boc/bocode/s459458.html\n```stata\nssc install pq\n```\n\n\n### Option 2: Manual installation\n1. Download the package files for your architecture (Linux, Mac, or Windows) from the [latest release](https://github.com/jrothbaum/stata_parquet_io/releases)\n2. Place the files in your PLUS/p directory (find location with `sysdir`)\n\n### Important Note for Mac ARM Users\n\nYou may encounter an error related to Mac Gatekeeper restrictions on unsigned binaries. To resolve this:\n\n1. Go to **System Preferences/Settings → Privacy \u0026 Security**\n2. Look for a message about the blocked dylib near the bottom\n3. Click **\"Allow Anyway\"** next to the blocked file\n4. Authenticate with your password if prompted\n5. 
Try using the plugin again in Stata\n\n\n## Quick Start\n\n```stata\n* Load a Parquet file\npq use using mydata.parquet, clear\n\n* Save current data as Parquet\npq save using mydata.parquet, replace\n\n* Describe a Parquet file without loading\npq describe using mydata.parquet\n```\n\n## Usage\n\n### Reading Parquet Files\n\n```stata\n* Basic usage - read entire file\npq use using filename.parquet, clear\n\n* Read specific variables\npq use var1 var2 var3 using filename.parquet, clear\n\n* Read with observation filtering (using Stata syntax that will be converted to SQL)\npq use using filename.parquet, clear if(value \u003e 100)\n\n* Read subset of rows\npq use using filename.parquet, clear in(1/1000)\n\n* Use wildcards to select variables\npq use x* using filename.parquet, clear\n\n* Performance optimizations\npq use using large_file.parquet, clear compress compress_string_to_numeric sort(id)\n\n* Random sampling a specific number of rows\npq use using large_file.parquet, clear random_n(1000)\n\n* Random sampling a specific number of rows, with a seed for replicability\npq use using large_file.parquet, clear random_n(1000) random_seed(12345)\n\n* Random sampling a specific share of rows, with a seed for replicability\npq use using large_file.parquet, clear random_share(0.1) random_seed(12345)\n```\n\n### Appending Data\n\n```stata\n* Append Parquet file to existing data\npq append using additional_data.parquet\n\n* Append with filtering\npq append using new_data.parquet, if(year == 2024)\n```\n\n### Merging Data\n\n```stata\n* Standard Stata merge syntax with Parquet files\npq merge 1:1 id using lookup_table.parquet, generate(_merge)\npq merge m:1 category_id using categories.parquet, keep(match) nogenerate\n```\n\n### Writing Parquet Files\n\n```stata\n* Save entire dataset\npq save using filename.parquet, replace\n\n* Save specific variables\npq save var1 var2 var3 using filename.parquet, replace\n\n* Save with filtering\npq save using filename.parquet, replace 
if(value \u003e 100)\n\n* Save with compression options\npq save using compressed.parquet, replace compression(zstd) compression_level(9)\n\n* Create partitioned dataset\npq save using /output/partitioned_data, replace partition_by(year region)\n\n* Preserve original Parquet variable names (default behavior)\npq save using filename.parquet, replace\n* OR disable automatic renaming\npq save using filename.parquet, replace noautorename\n```\n\n### Examining Parquet Files\n\n```stata\n* Basic structure information\npq describe using filename.parquet\n\n* Detailed information including string lengths\npq describe using filename.parquet, detailed\n\n* Quiet mode for programmatic use\npq describe using filename.parquet, quietly\nreturn list  // View returned values\n```\n\n\n## Advanced Features\n\n### Working with Multiple Files\n\n```stata\n* Load multiple files with wildcard patterns\npq use using /data/sales_*.parquet, clear asterisk_to_variable(year)\n\n* Combine files with different schemas\npq use using /data/*.parquet, clear relaxed\n```\n\n### Performance Optimization\n\n```stata\n* Thread management - set environment variable to limit threads\n* (useful on shared systems)\n* Example: set POLARS_MAX_THREADS=4 before starting Stata\n\n* Parallel processing strategies\npq use using large_file.parquet, clear parallelize(columns)  // for wide files\npq use using large_file.parquet, clear parallelize(rows)     // for tall files\n\n* Compression and optimization\npq use using large_file.parquet, clear compress compress_string_to_numeric\n```\n\n### Variable Name Handling\n\nParquet files support more flexible variable names than Stata (spaces, special characters, unlimited length). 
When reading Parquet files:\n\n- Original column names are preserved in variable labels as `{parquet_name:original_name}`\n- Variables are renamed if they contain reserved words or exceed 32 characters\n- When saving, original Parquet names are automatically restored unless `noautorename` is specified\n\n### Partitioned Datasets\n\n```stata\n* Save as partitioned dataset (creates directory structure)\npq save using /output/partitioned_data, replace partition_by(year region)\n\n* Control partition overwrite behavior\npq save using /output/data, replace partition_by(year) nopartitionoverwrite\n```\n\n## Data Type Support\n\n| Parquet Type | Stata Type | Notes |\n|--------------|------------|-------|\n| String | str# or strL | Automatically sized; \u003e2045 chars become strL |\n| Integer | byte/int/long | Automatically sized based on range |\n| Float | float/double | Preserves precision |\n| Boolean | byte | 0/1 values |\n| Date | long | Formatted as %td |\n| DateTime | double | Formatted as %tc |\n| Time | double | Formatted as %tchh:mm:ss |\n| Binary | *dropped* | Not currently supported by Stata for C plugins |\n\n## Technical Details\n\n- Built on the [Polars](https://github.com/pola-rs/polars) library for blazing-fast performance (as all Rust libraries require you note)\n- Requires Stata 16.0 or later\n- Cross-platform support (Windows, Linux, macOS)\n- Efficient memory usage with optional compression\n- Plugin-based architecture for optimal performance\n\n## Limitations\n\n- **Binary data**: Not supported (columns are dropped with warning)\n- **strL performance**: Reading strL columns is slower due to Stata plugin limitations  \n- **SQL vs. 
Stata syntax**: The `if()` condition converts Stata if conditions to SQL-style comparisons where missing values are not treated as greater than any value (unlike Stata)\n\n\n\n## Benchmarks\nThis was run on my computer, with the following specs:\u003cbr\u003e\nCPU: \tAMD Ryzen 7 8845HS w/ Radeon 780M Graphics\u003cbr\u003e\nCores: \t16\u003cbr\u003e\nRAM: \t14Gi\u003cbr\u003e\nOS: \tWindows 11\u003cbr\u003e\nRun:\tJune 2, 2025\u003cbr\u003e\nThis is not intended to be a scientific benchmark; see the code below.\n\nThe benchmark draws random normally distributed float variables (plus an integer index stored as a float and a string variable) for datasets of various sizes (`n_rows`, `n_cols`), saves and loads them as Parquet and dta files, and compares the times. For each run, I report the save and use times; next to each Parquet time is the ratio of Parquet time to dta time.\n\n\n```\n. benchmark_parquet_io_data,      n_cols(10)      ///\n\u003e                                 n_rows(1000)\nNumber of observations (_N) was 0, now 1,000.\n(          1,000,              10)\n    1: Stata:       save:        0.00\n    2: Parquet:     save:        0.00             4.00\n    3: Stata:       use:         0.01\n    4: Parquet:     use:         0.01             0.91\n\n    Loading only 5 variables of 10\n    5: Stata:       use:         0.00\n    6: Parquet:     use:         0.01          15.00\n\n.                                 \n. \n. benchmark_parquet_io_data,      n_cols(10)      ///\n\u003e                                 n_rows(10000)\nNumber of observations (_N) was 0, now 10,000.\n(         10,000,              10)\n    1: Stata:       save:        0.00\n    2: Parquet:     save:        0.01             9.00\n    3: Stata:       use:         0.01\n    4: Parquet:     use:         0.02             2.88\n\n    Loading only 5 variables of 10\n    5: Stata:       use:         0.00\n    6: Parquet:     use:         0.01          10.00\n\n. \n. 
benchmark_parquet_io_data,      n_cols(10)      ///\n\u003e                                 n_rows(100000)\nNumber of observations (_N) was 0, now 100,000.\n(        100,000,              10)\n    1: Stata:       save:        0.01\n    2: Parquet:     save:        0.03             5.50\n    3: Stata:       use:         0.00\n    4: Parquet:     use:         0.07            17.00\n\n    Loading only 5 variables of 10\n    5: Stata:       use:         0.01\n    6: Parquet:     use:         0.04           5.43\n\n.                                 \n.                                 \n. benchmark_parquet_io_data,      n_cols(10)      ///\n\u003e                                 n_rows(1000000)\nNumber of observations (_N) was 0, now 1,000,000.\n(      1,000,000,              10)\n    1: Stata:       save:        0.03\n    2: Parquet:     save:        0.26            10.24\n    3: Stata:       use:         0.02\n    4: Parquet:     use:         0.24            11.80\n\n    Loading only 5 variables of 10\n    5: Stata:       use:         0.04\n    6: Parquet:     use:         0.13           3.47\n\n.                                 \n.                                 \n. benchmark_parquet_io_data,      n_cols(10)      ///\n\u003e                                 n_rows(10000000)\nNumber of observations (_N) was 0, now 10,000,000.\n(     10,000,000,              10)\n    1: Stata:       save:        0.15\n    2: Parquet:     save:        1.56            10.34\n    3: Stata:       use:         0.11\n    4: Parquet:     use:         1.83            16.79\n\n    Loading only 5 variables of 10\n    5: Stata:       use:         0.31\n    6: Parquet:     use:         0.99           3.16\n\n. \n. 
benchmark_parquet_io_data,      n_cols(100)     ///\n\u003e                                 n_rows(1000000)\nNumber of observations (_N) was 0, now 1,000,000.\n(      1,000,000,             100)\n    1: Stata:       save:        0.15\n    2: Parquet:     save:        1.43             9.72\n    3: Stata:       use:         0.10\n    4: Parquet:     use:         2.47            24.95\n\n    Loading only 5 variables of 100\n    5: Stata:       use:         0.14\n    6: Parquet:     use:         0.14           0.99\n\n. \n. benchmark_parquet_io_data,      n_cols(1000)    ///\n\u003e                                 n_rows(100000)\nNumber of observations (_N) was 0, now 100,000.\n(        100,000,           1,000)\n    1: Stata:       save:        0.14\n    2: Parquet:     save:        1.58            11.35\n    3: Stata:       use:         0.10\n    4: Parquet:     use:         1.92            18.31\n\n    Loading only 5 variables of 1000\n    5: Stata:       use:         0.08\n    6: Parquet:     use:         0.06           0.71\n\n```\n\n\n\n\nBenchmark code:\n```\ncapture program drop benchmark_parquet_io_data\nprogram define benchmark_parquet_io_data\n\tversion 16\n\tsyntax\t\t, \tn_cols(integer)\t\t\t///\n\t\t\t\t\tn_rows(integer)\n\t\n\tclear\n\tset obs `n_rows'\n\tlocal cols_created = 0\n\n\tif `n_cols' \u003e `cols_created' {\n\t\tlocal cols_created = `cols_created' + 1\n\t\tquietly gen c_`cols_created' = _n\n\t}\n\n\tif `n_cols' \u003e `cols_created' {\n\t\tlocal cols_created = `cols_created' + 1\n\t\tquietly gen c_`cols_created' = char(65 + floor(runiform()*5))\n\t}\n\t\n\tif `n_cols' \u003e `cols_created' {\n\t\tlocal cols_created = `cols_created' + 1\n\t\tforvalues ci = `cols_created'/`n_cols' {\n\t\t\tquietly gen c_`ci' = rnormal()\n\t\t}\n\t}\n\t\n\tlocal n_to_load = 5\n\tlocal subset_to_load\n\tforvalues i=1/`n_to_load' {\n\t\tlocal subset_to_load `subset_to_load' c_`i'\n\t}\n\t\n\t\n\t\n\ttempfile path_save_root\n\tquietly {\n\t\ttimer clear\n\t\tdi 
\"save stata\"\n\t\ttimer on 1\n\t\tsave \"`path_save_root'.dta\", replace\n\t\ttimer off 1\n\t\t\n\t\tdi \"save parquet\"\n\t\ttimer on 2\n\t\t\n\t\tdi `\"pq save \"`path_save_root'.parquet\", replace\"'\n\t\tpq save \"`path_save_root'.parquet\", replace\n\t\ttimer off 2\n\t\t\n\t\tdi \"use stata\"\n\t\ttimer on 3\n\t\tuse \"`path_save_root'.dta\", clear\n\t\ttimer off 3\n\t\t\n\t\tdi \"use parquet\"\n\t\ttimer on 4\n\t\tdi `\"pq use \"`path_save_root'.parquet\", clear\"'\n\t\tpq use \"`path_save_root'.parquet\", clear\n\t\ttimer off 4\n\t\t\n\t\t\n\t\tdi \"use stata\"\n\t\ttimer on 5\n\t\tuse `subset_to_load' using \"`path_save_root'.dta\", clear\n\t\ttimer off 5\n\t\t\n\t\tdi \"use parquet\"\n\t\ttimer on 6\n\t\tdi `\"pq use \"`path_save_root'.parquet\", clear\"'\n\t\tpq use `subset_to_load' using \"`path_save_root'.parquet\", clear\n\t\ttimer off 6\n\t\t\n\t\ttimer list\n\t\tlocal save_stata = r(t1)\n\t\tlocal save_parquet = r(t2)\n\t\tlocal use_stata = r(t3)\n\t\tlocal use_parquet = r(t4)\n\t\tlocal use_stata_subset = r(t5)\n\t\tlocal use_parquet_subset = r(t6)\n\t\tlocal save_ratio = r(t2)/r(t1)\n\t\tlocal use_ratio = r(t4)/r(t3)\n\t\tlocal use_ratio_subset = r(t6)/r(t5)\n\t\tnoisily di \"(\" %15.0fc `n_rows' \", \" %15.0fc `n_cols' \")\"\n\t\tnoisily di \"\t1: Stata:\tsave:\t\" %9.2f `save_stata'\n\t\tnoisily di \"\t2: Parquet:\tsave:\t\" %9.2f `save_parquet' \"\t\" %9.2f `save_ratio'\n\t\tnoisily di \"\t3: Stata:\tuse:\t\" %9.2f `use_stata'\n\t\tnoisily di \"\t4: Parquet:\tuse:\t\" %9.2f `use_parquet'  \"\t\" %9.2f `use_ratio'\n\t\t\n\t\tnoisily di \"\"\n\t\tnoisily di \"\tLoading only `n_to_load' variables of `n_cols'\"\n\t\tnoisily di \"\t5: Stata:\tuse:\t\" %9.2f `use_stata_subset'\n\t\tnoisily di \"\t6: Parquet:\tuse:\t\" %9.2f `use_parquet_subset'  \"      \" %9.2f `use_ratio_subset'\n\t}\n\t\n\tcapture erase `path_save_root'.parquet\n\tcapture erase `path_save_root'.dta\n\t\nend\n\n\nclear\nset seed 1565225\n\nbenchmark_parquet_io_data, 
\tn_cols(10)\t///\n\t\t\t\tn_rows(1000)\n\t\t\t\t\n\nbenchmark_parquet_io_data, \tn_cols(10)\t///\n\t\t\t\tn_rows(10000)\n\nbenchmark_parquet_io_data, \tn_cols(10)\t///\n\t\t\t\tn_rows(100000)\n\t\t\t\t\n\t\t\t\t\nbenchmark_parquet_io_data, \tn_cols(10)\t///\n\t\t\t\tn_rows(1000000)\n\t\t\t\t\n\t\t\t\t\nbenchmark_parquet_io_data, \tn_cols(10)\t///\n\t\t\t\tn_rows(10000000)\n\nbenchmark_parquet_io_data, \tn_cols(100)\t///\n\t\t\t\tn_rows(1000000)\n\nbenchmark_parquet_io_data, \tn_cols(1000)\t///\n\t\t\t\tn_rows(100000)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrothbaum%2Fstata_parquet_io","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjrothbaum%2Fstata_parquet_io","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrothbaum%2Fstata_parquet_io/lists"}