{"id":32202299,"url":"https://github.com/pachadotdev/eflm","last_synced_at":"2025-10-22T04:03:34.639Z","repository":{"id":50785071,"uuid":"326508783","full_name":"pachadotdev/eflm","owner":"pachadotdev","description":"See pacha.dev/capybara for a much better GLM implementation. Efficient Fitting of Linear and Generalized Linear Models by using just base R.","archived":true,"fork":false,"pushed_at":"2024-07-24T05:43:56.000Z","size":11724,"stargazers_count":19,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-08T16:22:11.537Z","etag":null,"topics":["broom","glm","lm","r","sandwich"],"latest_commit_sha":null,"homepage":"https://pacha.dev/capybara","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pachadotdev.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-01-03T22:00:25.000Z","updated_at":"2024-07-24T05:44:18.000Z","dependencies_parsed_at":"2022-08-24T13:38:15.745Z","dependency_job_id":"6149f4c0-80b6-42a5-9183-2671edfd25da","html_url":"https://github.com/pachadotdev/eflm","commit_stats":{"total_commits":112,"total_committers":4,"mean_commits":28.0,"dds":0.0982142857142857,"last_synced_commit":"6a4974942d79b6c275df0aa43f3b0899846ae853"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pachadotdev/eflm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachadotdev%2Feflm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachadotdev%2Feflm/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/pachadotdev%2Feflm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachadotdev%2Feflm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pachadotdev","download_url":"https://codeload.github.com/pachadotdev/eflm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachadotdev%2Feflm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280376532,"owners_count":26320275,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-22T02:00:06.515Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["broom","glm","lm","r","sandwich"],"created_at":"2025-10-22T04:01:26.682Z","updated_at":"2025-10-22T04:03:34.629Z","avatar_url":"https://github.com/pachadotdev.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r setup, include = FALSE}\nlibrary(knitr)\nlibrary(ggplot2)\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(patchwork)\nlibrary(gravity)\nlibrary(eflm)\n\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n\n# Efficient Fitting of Linear Models\n\n\u003c!-- badges: start --\u003e\n[![Project Status: Active – The project has reached a stable, usable\nstate and is being actively\ndeveloped.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://www.tidyverse.org/lifecycle/#stable)\n[![CRAN\nstatus](https://www.r-pkg.org/badges/version/gravity)](https://cran.r-project.org/package=eflm)\n[![codecov](https://codecov.io/gh/pachadotdev/eflm/branch/main/graph/badge.svg?token=XI59cmGd15)](https://codecov.io/gh/pachadotdev/eflm)\n[![R-CMD-check](https://github.com/pachadotdev/eflm/workflows/R-CMD-check/badge.svg)](https://github.com/pachadotdev/eflm/actions)\n\u003c!-- badges: end --\u003e\n\n## Scope\n\nThe `eflm` package reduces the design matrix from $N \\times P$ to \n$P \\times P$ to reduce fitting time, and provides functions that are drop-in \nreplacements for `glm` and `lm`, such as:\n```{r, eval = FALSE}\n# just prepend an 'e' to glm\neglm(mpg ~ wt, data = mtcars)\n```\n\nThe best computational performance is obtained when R is linked against OpenBLAS,\nIntel MKL or another optimized BLAS library. 
This implementation aims to be\ncompatible with the 'broom' and 'sandwich' packages for summary statistics and\nclustering by providing S3 methods.\n\nThis package takes ideas from the glm2, speedglm, fastglm and fixest \npackages, but the implementations here keep the functions and outputs as \nclose as possible to those of the stats package, which makes the functions \nprovided here compatible with packages such as sandwich for robust estimation, \neven if that means smaller speed gains.\n\nThe greatest strength of this package is testing. With more than 1600 \n(and counting) tests, we try to do exactly the same as lm/glm, even in edge \ncases, but faster.\n\nThe ultimate aim of the project is to produce a package that:\n\n* Does exactly the same as lm and glm in less time\n* Is as numerically stable as lm and glm\n* Depends only on base R, with no Rcpp or other calls\n* Uses R's internal C code such as the `Cdqrls` function that the stats package uses for model fitting\n* Can be used in Shiny dashboards and other contexts where you need fast model fitting\n* Is useful for memory-consuming models\n* Allows model fitting in cases demanding more memory than free RAM (PENDING)\n\n## Installation\n\nYou can install the released version of eflm from CRAN with:\n```r\ninstall.packages(\"eflm\")\n```\n\nAnd the development version with:\n```r\nremotes::install_github(\"pachadotdev/eflm\")\n```\n\n## Progress list\n\n### Stats compatibility\n\n- [x] cooks.distance\n\n### Sandwich compatibility\n\n- [x] estfun\n- [x] bread\n- [x] vcovCL\n- [x] meatCL\n- [x] vcovBS\n- [ ] vcovHC\n- [ ] meatHC\n- [ ] vcovPC\n- [ ] meatPC\n- [ ] vcovPL\n- [ ] meatPL\n\n### Broom compatibility\n\n- [x] augment\n- [x] tidy\n- [x] glance\n\n### Lmtest compatibility\n\n- [x] resettest\n\n## Benchmarking\n\nThe dataset for this benchmark was taken from Yotov et al. 
(2016) and consists \nof a 28,152 x 8 data frame with 6 numeric and 2 categorical columns of the form:\n\n|Year ($t$) |Trade ($X$) |DIST |CNTG |LANG |CLNY |Exp Year ($\\pi$) |Imp Year ($\\chi$) |\n|-----------|------------|-----|-----|-----|-----|-----------------|------------------|\n|1986       |27.8        |12045|0    |0    |0    |ARG1986          |AUS1986           |\n|1986       |3.56        |11751|0    |0    |0    |ARG1986          |AUT1986           |\n|1986       |96.1        |11305|0    |0    |0    |ARG1986          |BEL1986           |\n\nThis data can be found in the `tradepolicy` package.\n\nThe variables are:\n\n* `year`: time of export/import flow\n* `trade`: bilateral trade\n* `dist`: distance (enters the model as `log(dist)`)\n* `cntg`: contiguity (0/1)\n* `lang`: common language (0/1)\n* `clny`: colonial relation (0/1)\n* `exp_year`/`imp_year`: exporter/importer time fixed effects\n\nFor benchmarking I'll fit a PPML model, as it is computationally expensive.\n\n```r\nch1_application1 \u003c- tradepolicy::agtpa_applications %\u003e%\n  select(exporter, importer, pair_id, year, trade, dist, cntg, lang, clny) %\u003e%\n  filter(year %in% seq(1986, 2006, 4)) %\u003e%\n  # build the exporter-year and importer-year fixed effects used in the formula\n  mutate(\n    exp_year = paste0(exporter, year),\n    imp_year = paste0(importer, year)\n  )\n\nformula \u003c- trade ~ log(dist) + cntg + lang + clny + exp_year + imp_year\neglm(formula, quasipoisson, ch1_application1)\n```\n\nTo compare `glm`, the proposed `eglm` and Stata's `ppml`, I conducted a test \nwith 500 repetitions locally, and reported the median of the realizations as\nthe fitting time. The plots on the right report the fitting times and memory \nused when running regressions on cumulative subsets of the data for \n1986, ..., 2006 (e.g. 
regress for 1986, then 1986 and 1990, ..., \nthen 1986 to 2006), with the fitting times and memory allocation \nshown by design matrix dimensions:\n\n```{r echo=FALSE, fig.height=3, fig.width=6, message=FALSE, warning=FALSE, dpi=150, fig.align=\"center\"}\ng \u003c- readRDS(\"benchmarks/06-stata-vs-r.rds\")\ng2 \u003c- readRDS(\"benchmarks/05-benchmark-subsets.rds\")\ng + (g2[[1]] / g2[[2]])\n```\n\nYotov et al. (2016) features complex partial and general equilibrium \nmodels. Some partial equilibrium models, such as the Regional Trade \nAgreements (RTAs) model, are particularly slow to fit because of\nthe allocated memory and the number of fixed effects.\n\nIn the next table, TG means 'Traditional Gravity' (i.e. vanilla PPML), \nwhile DP means 'Distance Puzzle' and GB stands for 'Globalization', which are \nrefinements of the simple PPML model and include dummy variables such as \nspecific country pair fixed effects and lagged RTAs.\n\n```{r echo=FALSE}\nload(\"benchmarks/04-glm-vs-eglm.RData\")\nt1 \u003c- bind_rows(\n  model_matrix_traditional_gravity %\u003e% \n    filter(fun %in% c(\"eglm\"), model_matrix_ncol == max(model_matrix_ncol)),\n  model_matrix_distance_puzzle %\u003e% \n    filter(fun %in% c(\"eglm\"), model_matrix_ncol == max(model_matrix_ncol)),\n  model_matrix_rtas %\u003e% \n    filter(fun %in% c(\"eglm\"), model_matrix_ncol == max(model_matrix_ncol))\n) %\u003e% \n  select(-fun)\ncolnames(t1) \u003c- c(\"Model\", \"Rows in design matrix\", \"Cols in design matrix\")\nkable(t1)\n```\n\n```{r echo=FALSE, fig.height=2, fig.width=6, message=FALSE, warning=FALSE, dpi=150, fig.align=\"center\"}\ninclude \u003c- t1$Model\n\ng2_time \u003c- ggplot(bench_models %\u003e% filter(fun %in% c(\"eglm\", \"glm\"), model %in% include)) + \n  geom_col(aes(x = model, y = median_seconds, fill = `Function`), position = \"dodge2\") + \n  scale_fill_manual(values = c(\"#3B9AB2\", \"#E1AF00\")) + \n  theme_minimal(base_size = 8) + \n  # 
theme(legend.position = \"none\") +\n  labs(x = \"Model\", y = \"Median Time (Seconds)\",\n       title = \"Time Benchmark\")\ng2_memory \u003c- ggplot(bench_models %\u003e% filter(fun %in% c(\"eglm\", \"glm\"), model %in% include)) + \n  geom_col(aes(x = model, y = mem_alloc_mb, fill = `Function`), position = \"dodge2\") + \n  scale_fill_manual(values = c(\"#3B9AB2\", \"#E1AF00\")) + \n  theme_minimal(base_size = 8) + \n  # theme(legend.position = \"none\") +\n  labs(x = \"Model\", y = \"Required Memory (MB)\",\n       title = \"Memory Benchmark\")\ng2_time + g2_memory\n```\n\nThe results for the RTA model show that the speedups scale to larger models: \nfitting time is reduced, while required memory increases.\n\n```{r echo=FALSE}\nt2 \u003c- bench_models %\u003e% \n  filter(fun %in% c(\"eglm\", \"glm\"), model %in% include) %\u003e% \n  select(model, fun, median_seconds) %\u003e% \n  spread(fun, median_seconds) %\u003e% \n  select(model, glm, eglm) %\u003e% \n  mutate(time_gain = paste0(round(100 * (glm - eglm) / glm, 2), \"%\"))\ncolnames(t2) \u003c- c(\"Model\", \"GLM Time (s)\", \"EGLM Time (s)\", \"Time Gain (%)\")\nkable(t2)\n```\n\nIt is important to mention that the increase in memory results in a reduced object\nsize for the stored model.\n\n```{r echo=FALSE}\nt4 \u003c- bind_rows(\n  object_size_traditional_gravity %\u003e% \n    filter(fun %in% c(\"glm\", \"eglm\"), model %in% include),\n  object_size_distance_puzzle %\u003e% \n    filter(fun %in% c(\"glm\", \"eglm\"), model %in% include),\n  object_size_rtas %\u003e% \n    filter(fun %in% c(\"glm\", \"eglm\"), model %in% include)\n) %\u003e% \n  mutate(fun = ifelse(fun == \"glm\", \"GLM\", \"EGLM\"))\nt4 \u003c- t4 %\u003e% \n  select(model, fun, size) %\u003e% \n  mutate(size = round(size, 2)) %\u003e% \n  spread(fun, size) %\u003e% \n  select(model, GLM, EGLM) %\u003e% \n  mutate(object_gain = paste0(round(100 * (GLM - EGLM) / GLM, 2), \"%\"))\ncolnames(t4) \u003c- c(\"Model\", \"GLM Size (MB)\", 
\"EGLM Size (MB)\", \"Memory Savings (%)\")\nkable(t4)\n```\n\nTo conclude my benchmarks, I fitted the PPML model again on DigitalOcean \ndroplets, leading to consistent times across scaled hardware. The results can be\nseen in the next plot:\n```{r echo=FALSE, fig.height=2, fig.width=6, message=FALSE, warning=FALSE, dpi=150, fig.align=\"center\"}\ng_do \u003c- readRDS(\"benchmarks/07-r-digitalocean.rds\")\ng_do\n```\n\n## Edge cases\n\nAn elementary example that breaks `eflm` even with QR decomposition can be \nfound in Golub and Van Loan (2013), and consists of passing an ill-conditioned \nmatrix:\n```{r include=FALSE}\nset.seed(200100); n \u003c- 10000\ndf \u003c- data.frame(x1 = 1:n, x2 = c(0, 2:n))\ndf$y \u003c- with(df, 1.999 + 2.000 * x1 + 2.001 * x2 + rnorm(n))\nreg1 \u003c- lm(y ~ ., df); reg2 \u003c- elm(y ~ ., df)\n```\n\n| Model         | (Intercept) | $x_1$ | $x_2$ |\n|---------------|-------------|-------|-------|\n| REG 1 (`lm`)  | 1.98        | 2.98  | 1.02  |\n| REG 2 (`elm`) | 1.98        | 4.00  | NA    |\n\n# References\n\nGolub, Gene H, and Charles F Van Loan. 2013. *Matrix Computations*. Vol. 3. JHU Press.\n\nYotov, Yoto V, Roberta Piermartini, José-Antonio Monteiro, and Mario Larch. 2016. \n*An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model*. World Trade\nOrganization, Geneva.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpachadotdev%2Feflm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpachadotdev%2Feflm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpachadotdev%2Feflm/lists"}