{"id":13401069,"url":"https://github.com/edwindj/chunked","last_synced_at":"2025-05-13T13:58:34.737Z","repository":{"id":35309624,"uuid":"39571077","full_name":"edwindj/chunked","owner":"edwindj","description":"Chunkwise Text-file Processing for 'dplyr'","archived":false,"fork":false,"pushed_at":"2022-03-02T10:55:57.000Z","size":2254,"stargazers_count":167,"open_issues_count":13,"forks_count":7,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-02T15:21:33.703Z","etag":null,"topics":["chunk","database","dplyr","r"],"latest_commit_sha":null,"homepage":"https://edwindj.github.io/chunked","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edwindj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-23T14:12:10.000Z","updated_at":"2025-01-16T20:42:49.000Z","dependencies_parsed_at":"2022-09-17T06:13:19.241Z","dependency_job_id":null,"html_url":"https://github.com/edwindj/chunked","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edwindj%2Fchunked","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edwindj%2Fchunked/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edwindj%2Fchunked/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edwindj%2Fchunked/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edwindj","download_url":"https://codeload.github.com/edwindj/chunked/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repo
sitories_count":253957297,"owners_count":21990532,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunk","database","dplyr","r"],"created_at":"2024-07-30T19:00:58.368Z","updated_at":"2025-05-13T13:58:34.709Z","avatar_url":"https://github.com/edwindj.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"\n# chunked\n\n[![version](https://cran.r-project.org/package=chunked)](https://cran.r-project.org/package=chunked)\n[![Downloads](https://cranlogs.r-pkg.org/badges/chunked)](https://cran.r-project.org/package=chunked)\n[![R-CMD-check](https://github.com/edwindj/chunked/workflows/R-CMD-check/badge.svg)](https://github.com/edwindj/chunked/actions)\n[![Coverage\nStatus](https://coveralls.io/repos/edwindj/chunked/badge.svg?branch=master\u0026service=github)](https://coveralls.io/github/edwindj/chunked?branch=master)\nR is a great tool, but processing data in large text files is\ncumbersome. `chunked` helps you to process large text files with *dplyr*\nwhile loading only a part of the data in memory. It builds on the\nexcellent R package [*LaF*](https://github.com/djvanderlaan/LaF).\n\nProcessing commands are written in dplyr syntax, and `chunked` (using\n`LaF`) will take care that chunk by chunk is processed, taking far less\nmemory than otherwise. `chunked` is useful for **select**-ing columns,\n**mutate**-ing columns and **filter**-ing rows. It is less helpful in\n**group**-ing and **summarize**-ation of large text files. 
It can be\nused in data pre-processing.\n\n## Install\n\n‘chunked’ can be installed with\n\n``` r\ninstall.packages('chunked')\n```\n\nbeta version with:\n\n``` r\ninstall.packages('chunked', repos=c('https://cran.rstudio.com', 'https://edwindj.github.io/drat'))\n```\n\nand the development version with:\n\n``` r\ndevtools::install_github('edwindj/chunked')\n```\n\nEnjoy! Feedback is welcome…\n\n# Usage\n\n## Text file -\u003e process -\u003e text file\n\nThe most common case is processing a large text file: select or add columns,\nfilter rows, and write the result back to a text file.\n\n``` r\n  read_chunkwise(\"./large_file_in.csv\", chunk_size=5000) %\u003e% \n  select(col1, col2, col5) %\u003e%\n  filter(col1 \u003e 10) %\u003e% \n  mutate(col6 = col1 + col2) %\u003e% \n  write_chunkwise(\"./large_file_out.csv\")\n```\n\n`chunked` will process the above statement in chunks of 5000\nrecords. This differs from, for example, `read.csv`, which reads all\ndata into memory before processing it.\n\n## Text file -\u003e process -\u003e database\n\nAnother option is to use `chunked` as a preprocessing step before adding\nthe data to a database:\n\n``` r\ncon \u003c- DBI::dbConnect(RSQLite::SQLite(), 'test.db', create=TRUE)\ndb \u003c- dbplyr::src_dbi(con)\n\ntbl \u003c- \n  read_chunkwise(\"./large_file_in.csv\", chunk_size=5000) %\u003e% \n  select(col1, col2, col5) %\u003e%\n  filter(col1 \u003e 10) %\u003e% \n  mutate(col6 = col1 + col2) %\u003e% \n  write_chunkwise(db, 'my_large_table')\n  \n# tbl now points to the table in sqlite.\n```\n\n## Db -\u003e process -\u003e Text file\n\nChunked can be used to export chunkwise to a text file. 
Note however\nthat in that case processing takes place in the database and the\nchunkwise restrictions only apply to the writing.\n\n## Lazy processing\n\n`chunked` will not start processing until `collect` or `write_chunkwise`\nis called.\n\n``` r\ndata_chunks \u003c- \n  read_chunkwise(\"./large_file_in.csv\", chunk_size=5000) %\u003e% \n  select(col1, col3)\n  \n# won't start processing until\ncollect(data_chunks)\n# or\nwrite_chunkwise(data_chunks, \"test.csv\")\n# or\nwrite_chunkwise(data_chunks, db, \"test\")\n```\n\nSyntax completion of variables of a chunkwise file in RStudio works like\na charm…\n\n# Dplyr verbs\n\n`chunked` implements the following dplyr verbs:\n\n-   `filter`\n-   `select`\n-   `rename`\n-   `mutate`\n-   `mutate_each`\n-   `transmute`\n-   `do`\n-   `tbl_vars`\n-   `inner_join`\n-   `left_join`\n-   `semi_join`\n-   `anti_join`\n\nSince data is processed in chunks, some dplyr verbs are not implemented:\n\n-   `arrange`\n-   `right_join`\n-   `full_join`\n\n`summarize` and `group_by` are implemented but generate a warning: they\noperate on each chunk and **not** on the whole data set. 
However, this\nmakes it easier to process a large file by repeatedly aggregating\nthe resulting data.\n\n-   `summarize`\n-   `group_by`\n\n``` r\ntmp \u003c- tempfile()\nwrite.csv(iris, tmp, row.names=FALSE, quote=FALSE)\niris_cw \u003c- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example\n\niris_cw %\u003e% \n  group_by(Species) %\u003e%            # group in each chunk\n  summarise( m = mean(Sepal.Width) # and summarize in each chunk\n           , w = n()\n           ) %\u003e% \n  as.data.frame %\u003e%                  # since each Species has 50 records, results will be in multiple chunks\n  group_by(Species) %\u003e%              # group the results from the chunk\n  summarise(m = weighted.mean(m, w)) # and summarize it again\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedwindj%2Fchunked","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedwindj%2Fchunked","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedwindj%2Fchunked/lists"}