{"id":16510177,"url":"https://github.com/mcaceresb/stata-gtools","last_synced_at":"2026-02-04T23:33:12.066Z","repository":{"id":43002356,"uuid":"91460490","full_name":"mcaceresb/stata-gtools","owner":"mcaceresb","description":"Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins","archived":false,"fork":false,"pushed_at":"2023-11-09T00:48:06.000Z","size":40784,"stargazers_count":167,"open_issues_count":10,"forks_count":35,"subscribers_count":9,"default_branch":"master","last_synced_at":"2023-11-09T01:31:25.657Z","etag":null,"topics":["collapse","egen","gtools","hash","percentile","reshape","spookyhash","stata","xtile"],"latest_commit_sha":null,"homepage":"https://gtools.readthedocs.io","language":"Stata","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mcaceresb.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2017-05-16T13:17:17.000Z","updated_at":"2024-04-15T15:33:01.965Z","dependencies_parsed_at":"2023-01-25T18:45:38.335Z","dependency_job_id":"355907d5-6189-4d70-8b0e-11f488915578","html_url":"https://github.com/mcaceresb/stata-gtools","commit_stats":null,"previous_names":[],"tags_count":19,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcaceresb%2Fstata-gtools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcaceresb%2Fstata-gtools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcaceresb%2Fstata-gtools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/mcaceresb%2Fstata-gtools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mcaceresb","download_url":"https://codeload.github.com/mcaceresb/stata-gtools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241473949,"owners_count":19968680,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collapse","egen","gtools","hash","percentile","reshape","spookyhash","stata","xtile"],"created_at":"2024-10-11T15:54:18.935Z","updated_at":"2026-02-04T23:33:12.024Z","avatar_url":"https://github.com/mcaceresb.png","language":"Stata","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://raw.githubusercontent.com/mcaceresb/mcaceresb.github.io/master/assets/icons/gtools-icon/gtools-icon-text.png\" alt=\"Gtools\" width=\"500px\"/\u003e\n\n[Overview](#faster-stata-for-big-data)\n| [Installation](#installation)\n| [Examples](#examples)\n| [Remarks](#remarks)\n| [FAQs \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/6/64/Icon_External_Link.png\" width=\"13px\"/\u003e](https://gtools.readthedocs.io/en/latest/faqs/index.html)\n| [Benchmarks \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/6/64/Icon_External_Link.png\" width=\"13px\"/\u003e](https://gtools.readthedocs.io/en/latest/benchmarks/index.html)\n| [Compiling \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/6/64/Icon_External_Link.png\" width=\"13px\"/\u003e](https://gtools.readthedocs.io/en/latest/compiling/index.html)\n\nFaster Stata for big data. 
This package uses C plugins and hashes to\nprovide massive speed improvements to common Stata commands, including:\nreshape, collapse, xtile, tabstat, isid, egen, pctile, winsor, contract,\nlevelsof, duplicates, unique/distinct, and more.\n\n![Beta Version](https://img.shields.io/badge/beta-v1.11.8-blue.svg?longCache=true\u0026style=flat-square)\n![Supported Platforms](https://img.shields.io/badge/platforms-linux--64%20%7C%20osx--64%20%7C%20win--64-blue.svg?longCache=true\u0026style=flat-square)\n[![github linux status](https://github.com/mcaceresb/stata-gtools/actions/workflows/linux.yml/badge.svg?branch=master)](https://github.com/mcaceresb/stata-gtools/actions/workflows/linux.yml)\n[![github osx status](https://github.com/mcaceresb/stata-gtools/actions/workflows/osx.yml/badge.svg?branch=master)](https://github.com/mcaceresb/stata-gtools/actions/workflows/osx.yml)\n[![Appveyor Build status](https://img.shields.io/appveyor/ci/mcaceresb/stata-gtools/master.svg?longCache=true\u0026style=flat-square\u0026label=windows-cygwin)](https://ci.appveyor.com/project/mcaceresb/stata-gtools)\n\nFaster Stata for Big Data\n-------------------------\n\nThis package provides a fast implementation of various Stata commands\nusing hashes and C plugins. The syntax and purpose are largely analogous\nto their Stata counterparts; for example, you can replace `collapse`\nwith `gcollapse`, `reshape` with `greshape`, and so on. 
For a\ncomprehensive list of differences (including some extra features!)\nsee the [remarks](#remarks) below; for details and examples see [the\nofficial project page](https://gtools.readthedocs.io).\n\n__*Quickstart*__\n\n```stata\nssc install gtools\ngtools, upgrade\n```\n\nSome [quick benchmarks](https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/docs/benchmarks/quick.do):\n\n_**NOTE:**_ Stata 17 introduced massive speed improvements to [sort and collapse](https://www.stata.com/new-in-stata/faster-stata-speed-improvements/).\nIn the MP version, in particular with many cores available, the native\n`collapse`  can be up to twice as fast. (YMMV; overall native collapses \ncould still be slower in some use cases.)  `gcollapse` remains faster\nin SE and older Stata versions.\n\n\u003cimg\n    src=\"https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/docs/benchmarks/quick.png#gh-light-mode-only\"\n    alt=\"Gtools quick benchmark\"\n    style=\"display:block;margin-left:auto;margin-right:auto\"\n    width=\"80%\"/\u003e\n\n\u003cimg\n    src=\"https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/docs/benchmarks/quickdark.png#gh-dark-mode-only\"\n    alt=\"Gtools quick benchmark\"\n    style=\"display:block;margin-left:auto;margin-right:auto\"\n    width=\"80%\"/\u003e\n\n__*Gtools commands with a Stata equivalent*__\n\n| Function     | Replaces    | Speedup (IC / MP)              | Unsupported             | Extras                                  |\n| ------------ | ----------- | ------------------------------ | ----------------------- | --------------------------------------- |\n| gcollapse    | collapse    | -0.5 to 2 (Stata 17+); 4 to 100 (Stata 16 and earlier)  || Quantiles, merge, labels, nunique, etc. |\n| greshape     | reshape     |  4 to 20  / 4 to 15            | \"advanced syntax\"       | `fast`, spread/gather (tidyr equiv)     |\n| gegen        | egen        |  9 to 26  / 4 to 9 (+,.)       
| labels                  | Weights, quantiles, nunique, etc.       |\n| gcontract    | contract    |  5 to 7   / 2.5 to 4           |                         |                                         |\n| gisid        | isid        |  8 to 30  / 4 to 14            | `using`, `sort`         | `if`, `in`                              |\n| glevelsof    | levelsof    |  3 to 13  / 2 to 7             |                         | Multiple variables, arbitrary levels    |\n| gduplicates  | duplicates  |  8 to 16 / 3 to 10             |                         |                                         |\n| gquantiles   | xtile       |  10 to 30 / 13 to 25 (-)       |                         | `by()`, various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gquantiles)) |\n|              | pctile      |  13 to 38 / 3 to 5 (-)         |                         | Ibid.                                   |\n|              | \\_pctile    |  25 to 40 / 3 to 5             |                         | Ibid.                                   |\n| gstats tab   | tabstat     |  10 to 50 / 5 to 30 (-)        | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |\n| gstats sum   | sum, detail |  10 to 20 / 5 to 10            | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |\n\n\u003csmall\u003e(+) The upper end of the speed improvements are for quantiles\n(e.g. median, iqr, p90) and few groups. Weights have not been\nbenchmarked.\u003c/small\u003e\n\n\u003csmall\u003e(.) Only gegen group was benchmarked rigorously.\u003c/small\u003e\n\n\u003csmall\u003e(-) Benchmarks computed 10 quantiles. When computing a large\nnumber of quantiles (e.g. 
thousands) `pctile` and `xtile` are prohibitively\nslow due to the way they are written; in that case gquantiles is hundreds\nor thousands of times faster, but this is an edge case.\u003c/small\u003e\n\n__*Extra commands*__\n\n| Function            | Similar (SSC/SJ)         | Speedup (IC / MP)       | Notes                         |\n| ------------------- | ------------------------ | ----------------------- | ----------------------------- |\n| fasterxtile         | fastxtile                | 20 to 30 / 2.5 to 3.5   | Allows `by()`                 |\n|                     | egenmisc (SSC) (-)       | 8 to 25 / 2.5 to 6      |                               |\n|                     | astile (SSC) (-)         | 8 to 12 / 3.5 to 6      |                               |\n| gstats hdfe         |                          | (.)                     | Allows weights, `by()`        |\n| gstats winsor       | winsor2                  | 10 to 40 / 10 to 20     | Allows weights                |\n| gunique             | unique                   | 4 to 26 / 4 to 12       |                               |\n| gdistinct           | distinct                 | 4 to 26 / 4 to 12       | Also saves results in matrix  |\n| gtop (gtoplevelsof) | groups, select()         | (+)                     | See table notes (+)           |\n| gstats range        | rangestat                | 10 to 20 / 10 to 20     | Allows weights; no flex stats |\n| gstats transform    |                          |                         | Various statistical functions |\n\n\u003csmall\u003e(-) `fastxtile` from egenmisc and `astile` were benchmarked against\n`gquantiles, xtile` (`fasterxtile`) using `by()`.\u003c/small\u003e\n\n\u003csmall\u003e(+) While similar to the user command 'groups' with the 'select'\noption, gtoplevelsof does not really have an equivalent. It is several\ndozen times faster than 'groups, select', but that command was not written\nwith the goal of gleaning the most common levels of a varlist. 
Rather, it\nhas a plethora of features and that one is somewhat incidental. As such, the\nbenchmark is not equivalent and `gtoplevelsof` does not attempt to implement\nthe features of 'groups'.\u003c/small\u003e\n\n\u003csmall\u003e(.) Other than the dated 'hdfe' command, I do not know of a Stata\ncommand that residualizes variables from a set of fixed effects. The\n'hdfe' command, as far as I can tell, morphed into the 'reghdfe'\npackage; the latter, however, is a fully-functioning regression command,\nwhile 'gstats hdfe' only residualizes a set of variables.\u003c/small\u003e\n\n__*Regression models*__\n\n_**WARNING:**_ Regression models are in beta and are only intended as utilities\nto compute coefficients and standard errors. I do not recommend their use in\nproduction; various post-estimation commands and statistics are _not_ available.\n(See `gstats hdfe` for residualizing variables net of fixed effects.)\n\n| Function            | Model   | Similar                        |\n| ------------------- | ------- | -----------------------------  |\n| gregress            | OLS     | `regress`, `reghdfe`           |\n| givregress          | 2SLS    | `ivregress 2sls`, `ivreghdfe`  |\n| gglm                | IRLS    | `logit`, `poisson`, `ppmlhdfe` |\n\nAll commands allow the user to optionally add:\n\n- `absorb()` for high-dimensional fixed effects absorptions.\n- `cluster()` for clustering (multiple covariates assume clusters are nested).\n- `by()` for regressions by group.\n- `weights` for weighted versions. Unlike other weights, `fweights` are assumed to refer to the _number_ of observations.\n\nLinear regression is computed via OLS (or WLS), IV regression is\ncomputed via two-stage least squares (2SLS), and GLM (poisson or logit)\nregression is computed via iteratively reweighted least squares (IRLS). 
\nSee the [TODO](#todo) section for planned features, or the\n[Missing Features](https://gtools.readthedocs.io/en/latest/usage/gregress/index.html#missing-features)\nsection in the documentation for what is missing before the first\nnon-beta release.\n\n__*Extra features*__\n\nSeveral commands offer additional features on top of the massive\nspeedup. See the [remarks](#remarks) section below for an overview; for\ndetails and examples, see each command's help page:\n\n- [gcollapse](https://gtools.readthedocs.io/en/latest/usage/gcollapse/index.html#examples)\n- [greshape](https://gtools.readthedocs.io/en/latest/usage/greshape/index.html#examples)\n- [gquantiles](https://gtools.readthedocs.io/en/latest/usage/gquantiles/index.html#examples)\n- [gstats sum/tab](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize/index.html#examples)\n- [gstats transform/range/moving](https://gtools.readthedocs.io/en/latest/usage/gstats_transform/index.html#examples)\n- [glevelsof](https://gtools.readthedocs.io/en/latest/usage/glevelsof/index.html#examples)\n- [gtoplevelsof](https://gtools.readthedocs.io/en/latest/usage/gtoplevelsof/index.html#examples)\n- [gegen](https://gtools.readthedocs.io/en/latest/usage/gegen/index.html#examples)\n- [gdistinct](https://gtools.readthedocs.io/en/latest/usage/gdistinct/index.html#examples)\n- [gregress](https://gtools.readthedocs.io/en/latest/usage/gregress/index.html#examples)\n- [givregress](https://gtools.readthedocs.io/en/latest/usage/givregress/index.html#examples)\n- [gglm](https://gtools.readthedocs.io/en/latest/usage/gglm/index.html#examples) (poisson and logit)\n\nIn addition, several commands take gsort-style input, that is\n\n```stata\n[+|-]varname [[+|-]varname ...]\n```\n\nThis does not affect the results in most cases, just the sort order.\nCommands that take this type of input include:\n\n- gcollapse\n- gcontract\n- gegen\n- glevelsof\n- gtop (gtoplevelsof)\n\n__*Ftools*__\n\nThe commands here are also faster than the 
commands provided by\n`ftools`; further, `gtools` commands take a mix of string and numeric\nvariables, which is a limitation of `ftools`. (Note I could not get\nseveral parts of `ftools` working on the Linux server where I have\naccess to Stata/MP; hence the IC benchmarks.)\n\n| Gtools    | Ftools        | Speedup (IC) |\n| --------- | ------------- | ------------ |\n| gcollapse | fcollapse     | 2-9          |\n| gegen     | fegen         | 2.5-4 (+)    |\n| gisid     | fisid         | 4-14         |\n| glevelsof | flevelsof     | 1.5-13       |\n| hashsort  | fsort         | 2.5-4        |\n\n\u003csmall\u003e(+) Only egen group was benchmarked rigorously.\u003c/small\u003e\n\n__*Limitations*__\n\n- `strL` variables are only partially supported on Stata 14 and above;\n  `gcollapse`, `gcontract`, and `greshape` do not support `strL` variables.\n\n- Due to a Stata bug, gtools cannot support more\n  than `2^31-1` (2.1 billion) observations. See [this\n  issue](https://github.com/mcaceresb/stata-gtools/issues/43).\n\n- Due to limitations in the Stata Plugin Interface, gtools\n  can only handle as many variables as the largest `matsize`\n  in the user's Stata version. For MP this is more than\n  10,000 variables but in IC this is only 800. See [this\n  issue](https://github.com/mcaceresb/stata-gtools/issues/24).\n\n- Gtools uses compiled C code to achieve its massive increases in\n  speed. This has two side-effects users might notice: First, it is sometimes\n  not possible to break the program's execution.  While this is already true\n  for at least some parts of most Stata commands, there are fewer opportunities\n  to break Gtools commands relative to their Stata counterparts.\n\n  Second, the Stata GUI might appear frozen when running Gtools\n  commands.  If the system then runs out of RAM (memory), it could look\n  like Stata has crashed (it may show a \"(Not Responding)\" message on\n  Windows or it may darken on \\*nix systems). 
However, the program has\n  not crashed; it is merely trying to swap memory.  To check this is the\n  case, the user can monitor disk activity or monitor their system's\n  pagefile or swap space directly.\n\nAcknowledgements\n----------------\n\n* The OSX version of gtools was implemented with invaluable help from @fbelotti\n  in [issue 11](https://github.com/mcaceresb/stata-gtools/issues/11).\n\n* Gtools was largely inspired by Sergio Correia's (@sergiocorreia) excellent\n  [ftools](https://github.com/sergiocorreia/ftools) package. Further, several\n  improvements and bug fixes have come from @sergiocorreia's helpful comments.\n\n* With the exception of `greshape`, every gtools command has been\n  written almost entirely from scratch (and even `greshape` is mostly\n  new code). However, gtools commands typically mimic the functionality\n  of existing Stata commands, including community-contributed programs,\n  meaning many of the ideas and options are based on them (see the\n  respective help files for details). `gtools` commands based on\n  community-contributed programs include:\n\n    * [`gstats winsor`](https://gtools.readthedocs.io/en/latest/usage/gstats_winsor/index.html#acknowledgements), based on `winsor2` by Lian (Arlion) Yujun.\n\n    * [`gunique`](https://gtools.readthedocs.io/en/latest/usage/gunique/index.html#acknowledgements), based on `unique` by Michael Hills and Tony Brady.\n\n    * [`gdistinct`](https://gtools.readthedocs.io/en/latest/usage/gdistinct/index.html#acknowledgements), based on `distinct` by Gary Longton and Nicholas J. Cox.\n\nInstallation\n------------\n\nI only have access to Stata 13.1, so that is the minimum supported version.\nYou can install `gtools` from Stata via SSC:\n```stata\nssc install gtools\ngtools, upgrade\n```\n\nBy default this syncs to the master branch, which is stable. 
To install\nthe latest version directly, type:\n```stata\nlocal github \"https://raw.githubusercontent.com\"\nnet install gtools, from(`github'/mcaceresb/stata-gtools/master/build/)\n```\n\n### Examples\n\nThe syntax is generally analogous to the standard commands (see the corresponding\nhelp files for full syntax and options):\n```stata\nsysuse auto, clear\n\n* gstats {hdfe|residualize} varlist [if] [in] [weight], [absorb(varlist) options]\ngstats hdfe hdfe_price = price, absorb(foreign rep78)\ngstats residualize price mpg, absorb(foreign rep78) prefix(res_)\n\n* gstats {sum|tab} varlist [if] [in] [weight], [by(varlist) options]\ngstats sum price [pw = gear_ratio / 4]\ngstats tab price mpg, by(foreign) matasave\n\n* gquantiles [newvarname =] exp [if] [in] [weight], {_pctile|xtile|pctile} [options]\ngquantiles 2 * price, _pctile nq(10)\ngquantiles p10 = 2 * price, pctile nq(10)\ngquantiles x10 = 2 * price, xtile nq(10) by(rep78)\nfasterxtile xx = log(price) [w = weight], cutpoints(p10) by(foreign)\n\n* gstats winsor varlist [if] [in] [weight], [by(varlist) cuts(# #) options]\ngstats winsor price gear_ratio mpg, cuts(5 95) s(_w1)\ngstats winsor price gear_ratio mpg, cuts(5 95) by(foreign) s(_w2)\ndrop *_w?\n\n* hashsort varlist, [options]\nhashsort -make\nhashsort foreign -rep78, benchmark verbose mlast\n\n* gegen target  = stat(source) [if] [in] [weight], by(varlist) [options]\ngegen tag   = tag(foreign)\ngegen group = tag(-price make)\ngegen p2_5  = pctile(price) [w = weight], by(foreign) p(2.5)\n\n* gisid varlist [if] [in], [options]\ngisid make, missok\ngisid price in 1 / 2\n\n* gduplicates varlist [if] [in], [options gtools(gtools_options)]\ngduplicates report foreign\ngduplicates report rep78 if foreign, gtools(bench(3))\n\n* glevelsof varlist [if] [in], [options]\nglevelsof rep78, local(levels) sep(\" | \")\nglevelsof foreign mpg if price \u003c 4000, loc(lvl) sep(\" | \") colsep(\", \")\nglevelsof foreign mpg in 10 / 70, gen(uniq_) nolocal\n\n* gtop varlist 
[if] [in] [weight], [options]\n* gtoplevelsof varlist [if] [in] [weight], [options]\ngtoplevelsof foreign rep78\ngtop foreign rep78 [w = weight], ntop(5) missrow groupmiss pctfmt(%6.4g) colmax(3)\n\n* gregress depvar indepvars [if] [in] [weight], [by(varlist) options]\ngregress price mpg rep78, mata(coefs) prefix(b(_b_) se(_se_))\ngregress price mpg [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)\n\n* givregress depvar (endog = instruments) exog [if] [in] [weight], [by(varlist) options]\ngivregress price (mpg = gear_ratio) rep78, mata(coefs) prefix(b(_b_) se(_se_)) replace\ngivregress price (mpg = gear_ratio) [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)\n\n* gglm depvar indepvars [if] [in] [weight], family(...) [by(varlist) options]\ngglm price mpg rep78, family(poisson) mata(coefs) prefix(b(_b_) se(_se_)) replace\ngglm price mpg [fw = trunk], family(poisson) by(foreign) absorb(rep78 headroom) cluster(rep78)\n\ngglm foreign price rep78 [fw = trunk], family(binomial) absorb(headroom) mata(coefs)\ngglm foreign price if rep78 \u003e 2, family(binomial) by(rep78) prefix(b(_b_) se(_se_)) replace\n\n* gcollapse (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]\ngen h1 = headroom\ngen h2 = headroom\nlocal lbl labelformat(#stat:pretty# #sourcelabel#)\n\ngcollapse (mean) mean = price (median) p50 = gear_ratio, by(make) merge v `lbl'\ndisp \"`:var label mean', `:var label p50'\"\ngcollapse (iqr) irq? = h? (nunique) turn (p97.5) mpg, by(foreign rep78) bench(2) wild\n\n* gcontract varlist [if] [in] [fweight], [options]\ngcontract foreign [fw = turn], freq(f) percent(p)\n\n* greshape wide varlist,    i(i) j(j) [options]\n* greshape long prefixlist, i(i) [j(j) string options]\n*\n* greshape spread varlist, j(j) [options]\n* greshape gather varlist, j(j) value(value) [options]\n\ngen j = _n\ngreshape wide f p, i(foreign) j(j)\ngreshape long f p, i(foreign) j(j)\n\ngreshape spread f p, j(j)\ngreshape gather f? 
p?, j(j) value(fp)\n\n* gstats transform (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]\n* gstats range  (stat) out = src [...] [if] [in] [weight], by(varlist) [options]\n* gstats moving (stat) out = src [...] [if] [in] [weight], by(varlist) [options]\n\nsysuse auto, clear\ngstats transform (normalize) price (demean) price (range mean -sd sd) price, auto\ngstats range  (mean) mean_r = price (sd) sd_r = price, interval(-10 10 mpg)\ngstats moving (mean) mean_m = price (sd) sd_m = price, by(foreign) window(-5 5)\n```\n\nSee the [FAQs](faqs) or the respective documentation for a list of supported\n`gcollapse` and `gegen` functions.\n\nRemarks\n-------\n\n*__Functions available with `gegen`, `gcollapse`, `gstats tab`__*\n\n`gcollapse` supports every `collapse` function, including their\nweighted versions. In addition, weights can be selectively applied via\n`rawstat()`, and several additional statistics are allowed, including\n`nunique`, `select#`, and so on.\n\n`gegen` technically does not support all of `egen`, but whenever a\nfunction that is not supported is requested, `gegen` hashes the data and\ncalls `egen` grouping by the hash, which is often faster (`gegen` only\nsupports weights for internal functions, since `egen` does not normally\nallow weights).\n\nHence both should be able to replicate all of the functionality of their\nStata counterparts. 
Last, `gstats tab` allows every statistic allowed\nby `tabstat` as well as any statistic allowed by `gcollapse`; the\nsyntax for the statistics specified via `statistics()` is the same\nas in `tabstat`.\n\nThe following are implemented internally in C:\n\n| Function     | gcollapse | gegen   | gstats tab |\n| ------------ | --------- | ------- | ---------- |\n| tag          |           |   X     |            |\n| group        |           |   X     |            |\n| total        |           |   X     |            |\n| count        |     X     |   X     |      X     |\n| nunique      |     X     |   X     |      X     |\n| nmissing     |     X     |   X (+) |      X     |\n| sum          |     X     |   X     |      X     |\n| nansum       |     X     |   X     |      X     |\n| rawsum       |     X     |         |      X     |\n| rawnansum    |     X     |         |      X     |\n| mean         |     X     |   X     |      X     |\n| geomean      |     X     |   X     |      X     |\n| median       |     X     |   X     |      X     |\n| percentiles  |     X     |   X     |      X     |\n| iqr          |     X     |   X     |      X     |\n| sd           |     X     |   X     |      X     |\n| variance     |     X     |   X (+) |      X     |\n| cv           |     X     |   X     |      X     |\n| max          |     X     |   X     |      X     |\n| min          |     X     |   X     |      X     |\n| range        |     X     |   X     |      X     |\n| select       |     X     |   X     |      X     |\n| rawselect    |     X     |         |      X     |\n| percent      |     X     |   X     |      X     |\n| first        |     X     |   X (+) |      X     |\n| last         |     X     |   X (+) |      X     |\n| firstnm      |     X     |   X (+) |      X     |\n| lastnm       |     X     |   X (+) |      X     |\n| semean       |     X     |   X (+) |      X     |\n| sebinomial   |     X     |   X     |      X     |\n| sepoisson    |     X     |   X     |      X   
  |\n| skewness     |     X     |   X     |      X     |\n| kurtosis     |     X     |   X     |      X     |\n| gini         |     X     |   X     |      X     |\n| gini dropneg |     X     |   X     |      X     |\n| gini keepneg |     X     |   X     |      X     |\n\n\u003csmall\u003e(+) indicates the function has the same or a very similar\nname to a function in the \"egenmore\" package, but the function was\nindependently implemented and is hence analogous to its gcollapse\ncounterpart, not necessarily the function in egenmore.\u003c/small\u003e\n\nThe percentile syntax mimics that of `collapse` and `egen`, with the addition\nthat quantiles are also supported. That is,\n\n```stata\ngcollapse (p#) target = var [target = var ...] , by(varlist)\ngegen target = pctile(var), by(varlist) p(#)\n```\n\nwhere # is a \"percentile\" with arbitrary decimal places (e.g. 2.5 or 97.5).\n`gtools` also supports selecting the `#`th smallest or largest value:\n```stata\ngcollapse (select#) target = var [(select-#) target = var ...] 
, by(varlist)\ngegen target = select(var), by(varlist) n(#)\ngegen target = select(var), by(varlist) n(-#)\n```\n\nIn addition, the following are allowed in `gegen` as wrappers to other\ngtools functions (`stat` is any stat available to `gcollapse`, except\n`percent`, `nunique`):\n\n| Function     | calls            |\n| ------------ | ---------------- |\n| xtile        | fasterxtile      |\n| standardize  | gstats transform |\n| normalize    | gstats transform |\n| demean       | gstats transform |\n| demedian     | gstats transform |\n| moving\\_stat | gstats transform |\n| range\\_stat  | gstats transform |\n| cumsum       | gstats transform |\n| shift        | gstats transform |\n| rank         | gstats transform |\n| winsor       | gstats winsor    |\n| winsorize    | gstats winsor    |\n\nLast, when `gegen` calls a function that is not implemented internally\nby `gtools`, it will hash the by variables and call `egen` with `by`\nset to an id based on the hash. That is, if `fcn` is not one of the\nfunctions above,\n\n```stata\ngegen outvar = fcn(varlist) [if] [in], by(byvars)\n```\n\nwould be the same as\n```stata\nhashsort byvars, group(id) sortgroup\negen outvar = fcn(varlist) [if] [in], by(id)\n```\n\nbut preserving the original sort order. In case an `egen` option might\nconflict with a gtools option, the user can pass `gtools_capture(fcn_options)`\nto `gegen`.\n\n__*Differences and Extras*__\n\nDifferences from `collapse`\n\n- String variables are not allowed for `first`, `last`, `min`, `max`, etc.\n  (see [issue 25](https://github.com/mcaceresb/stata-gtools/issues/25))\n- New functions: `nunique`, `nmissing`, `cv`, `variance`, `select#`, `select-#`, `range`, `gini`\n- `rawstat` allows selectively applying weights.\n- `rawselect` ignores weights for `select` (analogously to `rawsum`).\n- Option `wild` allows bulk-rename. E.g. 
`gcollapse mean_x* = x*, wild`\n- `gcollapse (nansum)` and `gcollapse (rawnansum)` output a missing\n  value for sums if all inputs are missing (instead of 0).\n- `gcollapse, merge` merges the collapsed data set back into memory. This is\n  much faster than collapsing a dataset, saving, and merging after. However,\n  Stata's `merge ..., update` functionality is not implemented, only replace.\n  (If the targets exist the function will throw an error without `replace`).\n- `gcollapse, labelformat` allows specifying the output label using placeholders.\n- `gcollapse, sumcheck` keeps integer types with `sum` if the sum will not overflow.\n\nDifferences from `reshape`\n\n- Allows an arbitrary number of variables in `i()` and `j()`\n- Several options allow turning off error checks for faster execution,\n  including: `fast` (similar to `fast` in `gcollapse`), `unsorted`\n  (do not sort the output), `nodupcheck` (allow duplicates in `i`),\n  `nomisscheck` (allow missing values and/or leading blanks in `j`), or\n  `nochecks` (all of the above).\n- Subcommands `gather` and `spread` implement the equivalent commands from\n  R's `tidyr` package.\n- At the moment, `j(name [values])` is not supported. All values of `j` are used.\n- \"reshape mode\" is not supported. Reshape variables are not saved as\n  part of the current dataset's characteristics, meaning the user cannot\n  type `reshape wide` and `reshape long` without further arguments to\n  reverse the `reshape`. 
This syntax is very cumbersome and difficult to\n  support; `greshape` re-wrote much of the code base and had to dispense\n  with this functionality.\n- For that same reason, \"advanced\" syntax is not supported, including\n  the subcommands: clear, error, query, i, j, xij, and xi.\n- `@` syntax can be modified via `match()`\n- `dropmiss` allows dropping missing observations when reshaping from\n  wide to long (via `long` or `gather`).\n\nDifferences from regression models\n\n`gregress`, `givregress`, and `gglm` do not aim to replicate\nthe entire table of estimation results, nor the entire suite of\npost-estimation results and tests, that `regress` (`reghdfe`),\n`ivregress 2sls` (`ivreghdfe`), `poisson` (`ppmlhdfe`), or `logit` make\navailable. At the moment, they are considered beta software and only\ncoefficients and standard errors are computed.\n\n- Results are saved either to mata (default) or copied to variables in\n  the dataset in memory.\n- `by()` and `absorb()` are allowed and can be combined.\n- `givregress` does a small sample adjustment (`small`) automatically.\n- `givregress` does not exit with error if covariates are collinear with\n  the dependent variable.\n- If the `givregress` model is not identified, standard errors and\n  coefficients are set to missing instead of exiting with error.\n- `gglm` runs with option `robust` automatically.\n- If there are no non-linear covariates (i.e. 
all observations are\n  numerically zero) then the coefficients and standard errors are\n  _both_ set to missing.\n\nDifferences from `xtile`, `pctile`, and `_pctile`\n\n- Adds support for `by()` (including weights)\n- Does not ignore `altdef` with `xtile` (see [this Statalist thread](https://www.statalist.org/forums/forum/general-stata-discussion/general/1417198-typo-in-xtile-ado-with-option-altdef))\n- Category frequencies can also be requested via `binfreq[()]`.\n- `xtile`, `pctile`, and `_pctile` can be combined via `xtile(newvar)` and\n  `pctile(newvar)`\n- There is no limit to `nquantiles()` for `xtile`\n- Quantiles can be requested via `percentiles()` (or `quantiles()`),\n  `cutquantiles()`, or `quantmatrix()` for `xtile` as well as `pctile`.\n- Cutoffs can be requested via `cutquantiles()`, `cutoffs()`,\n  or `cutmatrix()` for `xtile` as well as `pctile`.\n- The user has control over the behavior of `cutpoints()` and `cutquantiles()`.\n  They obey `if` `in` with option `cutifin`, they can be group-specific with\n  option `cutby`, and they can be de-duplicated via `dedup`.\n- Fixes numerical precision issues with `pctile, altdef` (e.g. 
see [this Statalist thread](https://www.statalist.org/forums/forum/general-stata-discussion/general/1418732-numerical-precision-issues-with-stata-s-pctile-and-altdef-in-ic-and-se), which is a very minor thing so Stata and fellow users maintain it's not an issue, but I think it is because Stata/MP gives what I think is the correct answer whereas IC and SE do not).\n- Fixes a possible issue with the weights implementation in `_pctile`; see [this thread](https://www.statalist.org/forums/forum/general-stata-discussion/general/1454409-weights-in-pctile).\n\nDifferences from `egen`\n\n- `group` label options are not supported\n- Weights are supported for internally implemented functions.\n- New functions: `nunique`, `nmissing`, `cv`, `variance`, `select#`, `select-#`, `range`\n- `gegen` upgrades the type of the target variable if it is not specified by\n  the user. This means that if the sources are `double` then the output will\n  be double. All sums are double. `group` creates a `long` or a `double`. And\n  so on. `egen` will default to the system type, which could cause a loss of\n  precision on some functions.\n- For internally supported functions, you can specify a varlist as the source,\n  not just a single variable. Observations will be pooled by row in that case.\n- While `gegen` is much faster for `tag`, `group`, and summary stats, most\n  egen functions are not implemented internally, meaning for arbitrary `gegen`\n  calls this is a wrapper for hashsort and egen.\n\nDifferences from `tabstat`\n\n- Multiple groups are allowed.\n- Saving the output is done via `mata` instead of `r()`. No matrices\n  are saved in `r()` and option `save` is not allowed. However, option\n  `matasave` saves the output and `by()` info in `GstatsOutput` (the object\n  can be named via `matasave(name)`). 
See `mata GstatsOutput.desc()` after\n  `gstats tab, matasave` for details.\n- `GstatsOutput` provides helpers for extracting rows, columns, and levels.\n- Options `casewise` and `longstub` are not supported.\n- Option `nototal` is on by default; `total` is planned for a future release.\n- Option `pooled` pools the source variables into one.\n\nDifferences from `summarize, detail`\n\n- The behavior of `summarize` and `summarize, meanonly` can be\n  recovered via options `nodetail` and `meanonly`. These two\n  options are mainly for use with `by()`\n- Option `matasave` saves output and `by()` info in `GstatsOutput`,\n  a mata class object (the object can be named via `matasave(name)`).\n  See `mata GstatsOutput.desc()` after `gstats sum, matasave` for details.\n- Option `noprint` saves the results but omits printing output.\n- Option `tab` prints statistics in the style of `tabstat`\n- Option `pooled` pools the source variables and computes summary\n  stats as if they were a single variable.\n- `pweights` are allowed.\n- Largest and smallest observations are weighted.\n- `rolling:`, `statsby:`, and `by:` are not allowed. To use `by` pass\n  the option `by()`\n- `display options` are not supported.\n- Factor and time series variables are not allowed.\n\nDifferences from `levelsof`\n\n- It can take a `varlist` and not just a `varname`; in that case it prints\n  all unique combinations of the varlist. The user can specify column and row\n  separators.\n- It can deduplicate an arbitrary number of levels and store the results in a\n  new variable list or replace the old variable list via `gen(prefix)` and\n  `gen(replace)`, respectively. If the user runs up against the maximum macro\n  variable length, add option `nolocal`.\n\nDifferences from `isid`\n\n- No support for `using`. 
The C plugin API does not allow loading a Stata\n  dataset from disk.\n- Option `sort` is not available.\n- It can also check IDs with `if` and `in` conditions.\n\nDifferences from `gsort`\n\n- `hashsort` behaves as if `mfirst` was passed. To recover the default\n  behavior of `gsort` pass option `mlast`.\n\nDifferences from `duplicates`\n\n- `gduplicates` does not sort `examples` or `list` by default. This massively\n  enhances performance but it might be harder to read. Pass option `sort`\n  (`sorted`) to mimic `duplicates` behavior and sort the list.\n\nDifferences from `rangestat`\n\n- Note that `gstats range` is an alias for `gstats transform` that assumes\n  all the stats requested are range statistics. However, it can be called\n  in conjunction with any other transform via `(range stat ...)`. It was\n  not intended to be a replacement of `rangestat` but it can replicate some\n  of its functionality.\n\n- `flex_stat`s (reg, corr, cov) are not allowed (see `gregress`).\n\n- Intervals are of the form `interval(low high [keyvar])`; if `keyvar`\n  is missing then it is taken to be the source variable.\n\n- Variables are not allowed in place of `low` or `high`. Instead they\n  must be `#[stat]` where `#` is a number and `stat` is an optional\n  summary statistic; e.g. `interval(-sd 0.5sd x)`.\n\n- Separate interval and interval variables can be specified for each\n  target; e.g. `gstats range (mean -3 3) x (mean -2 . time) y ...`.\n\n- All statistics allowed by `gstats tab` are allowed by `gstats range`\n  (except `nunique` or `percent`).\n\n- Options `casewise`, `describe`, and `local` are not allowed.\n\nHashing and Sorting\n-------------------\n\nThere are two key insights to the massive speedups of Gtools:\n\n1. Hashing the data and sorting a hash is a lot faster than sorting\n  the data to then process it by group. Sorting a hash can be achieved\n  in linear O(N) time, whereas the best general-purpose sorts take O(N\n  log(N)) time. 
Sorting the groups would then be achievable in O(J\n  log(J)) time (with J groups). Hence the speed improvements are largest\n  when N / J is largest.\n\n2. Compiled C code is much faster than Stata commands. While it is true\n   that many of Stata's underpinnings are compiled code, several\n   operations are written in `ado` files without much thought given\n   to optimization. If you're working with tens of thousands of\n   observations you might barely notice (and the difference between\n   5 seconds and 0.5 seconds might not be particularly important).\n   However, with tens of millions or hundreds of millions of rows, the\n   difference between half a day and an hour can matter quite a lot.\n\n__*Stata Sorting*__\n\nIt should be noted that Stata's sorting mechanism is hard to improve\nupon because of the overhead involved in sorting. We have implemented a\nhash-based sorting command, `hashsort`, which should be faster than Stata's\n`sort` for groups, but not necessarily otherwise:\n\n| Function  | Replaces | Speedup (IC / MP)    | Unsupported            | Extras               |\n| --------- | -------- | -------------------- | ---------------------- | -------------------- |\n| hashsort  | sort     | 2.5 to 4 / .8 to 1.3 |                        | Group (hash) sorting |\n|           | gsort    | 2 to 18 / 1 to 6     | `mfirst` (see `mlast`) | Sorts are stable     |\n\nThe overhead involves copying the by variables, hashing, sorting the hash,\nsorting the groups, copying a sort index back to Stata, and having Stata do\nthe final swaps. The plugin runs fast, but the copy overhead plus the Stata\nswaps often make the function slower than Stata's native `sort`.\n\nThe other functions are faster because they don't deal with\nall that overhead. By contrast, Stata's `gsort` is not efficient. To sort\ndata, you need to make pair-wise comparisons. For real numbers, this is just\n`a \u003e b`. 
However, a generic comparison function can be written as `compare(a, b) \u003e 0`.\nThis is true if a is greater than b and false otherwise. To invert\nthe sort order, one need only use `compare(b, a) \u003e 0`, which is what gtools\ndoes internally.\n\nHowever, Stata creates a variable that is the inverse of the sort variable.\nThis is equivalent, but the overhead makes it slower than `hashsort`.\n\nTODO\n----\n\nPlanned features:\n\n- [ ] Things to add to gcollapse:\n    - [ ] `prod`\n    - [ ] `geomean pos`: exclude negative numbers _and_ zero.\n    - [ ] `geomean abspos`: ibid but take absolute value first.\n    - [ ] Generally should you add an `abs` option to everything?\n- [ ] Flexible save options for `gregress`\n    - [ ] `predict()`, including `xb` and `e`.\n    - [ ] `absorb(fe1=group1 fe2=group2 ...)` syntax to save the FE.\n    - [ ] Choose which coefs/se to save.\n- [ ] Improve formula documentation for summary statistics (e.g. `gini`)\n- [ ] Internal consistency test for various parts of `gquantiles`. Each\n      function section does cases but they should be consistent!\n\nThese are options/features/improvements I would like to add, but I don't\nhave an ETA for them (i.e. they are a wishlist because I am either not\nsure how to implement them or because writing the code will take a long\ntime). Roughly in order of likelihood:\n\n- [ ] `gregress` missing features\n    - [ ] Non-nested multi-way clustering.\n    - [ ] HDFE collinear categories check.\n    - [ ] HDFE drop singletons.\n    - [ ] Detect separated observations in `gglm, family(poisson)`.\n    - [ ] Guard against possible overflows in `X' X`\n    - [ ] Accelerate HDFE corner cases (e.g. very dense multi-way HDFE)\n    - [ ] Include quick primers on OLS, IV, and IRLS in docs.\n- [ ] Some support for Stata's extended syntax in `gregress`\n- [ ] Update benchmarks for all commands. 
Still on 0.8 benchmarks.\n- [ ] Dropmissing vs dropmissing but not extended missing values.\n- [ ] Allow keeping both variable names and labels in `greshape spread/gather`\n- [ ] Implement `selectoverflow(missing|closest)`\n- [ ] Add totals row for `J \u003e 1` in gstats\n- [ ] Improve debugging info.\n- [ ] Implement `collapse()` option for `greshape`.\n- [ ] Rolling (interval) and moving options for `gregress`.\n- [ ] Add support for binary `strL` variables.\n- [ ] Minimize memory use.\n- [ ] Add memory(greedy|lean) to give user fine-grained control over internals.\n- [ ] Create a Stata C hashing API with thin wrappers around core functions.\n    - [ ] This will be a C library that other users can import.\n    - [ ] Some functionality will be available from Stata via gtools, api()\n    - [ ] Improve code comments when you write the API!\n    - [ ] Have some type of coding standard for the base (coding style)\n- [ ] Implement `gmerge`\n    - [ ] Integration with [ReadStat](https://github.com/WizardMac/ReadStat/tree/master/src)?\n\nAbout\n-----\n\nHi! I'm [Mauricio Caceres](https://mcaceresb.github.io); I made gtools\nafter some of my Stata jobs were taking literally days to run because of repeat\ncalls to `egen`, `collapse`, and similar on data with over 100M rows.  Feedback\nand comments are welcome! I hope you find this package as useful as I do.\n\nAlong those lines, here are some other Stata projects I like:\n\n* [`ftools`](https://github.com/sergiocorreia/ftools): The main inspiration for\n  gtools. 
Not as fast, but it has a rich feature set; its mata API in\n  particular is excellent.\n\n* [`reghdfe`](https://github.com/sergiocorreia/reghdfe): The fastest way to run\n  a regression with multiple fixed effects (as far as I know).\n\n* [`ivreghdfe`](https://github.com/sergiocorreia/ivreghdfe): A combination of\n  [`ivreg2`](https://ideas.repec.org/c/boc/bocode/s425401.html) and `reghdfe`.\n\n* [`stata_kernel`](https://kylebarron.github.io/stata_kernel): A Stata kernel\n  for Jupyter; extremely useful for interacting with Stata.\n\n* [`stata-cowsay`](https://github.com/mdroste/stata-cowsay): Productivity-boosting\n  cowsay functionality in Stata.\n\nLicense\n-------\n\nGtools is [MIT-licensed](https://github.com/mcaceresb/stata-gtools/blob/master/LICENSE).\n`./lib/spookyhash` and `./src/plugin/common/quicksort.c` belong to their respective\nauthors and are BSD-licensed. Also see `gtools, licenses`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmcaceresb%2Fstata-gtools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmcaceresb%2Fstata-gtools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmcaceresb%2Fstata-gtools/lists"}