{"id":13666013,"url":"https://github.com/hrbrmstr/docxtractr","last_synced_at":"2026-03-05T23:38:33.936Z","repository":{"id":1554490,"uuid":"41317592","full_name":"hrbrmstr/docxtractr","owner":"hrbrmstr","description":":scissors: Extract Tables from Microsoft Word Documents with R","archived":false,"fork":false,"pushed_at":"2021-10-02T22:49:05.000Z","size":584,"stargazers_count":176,"open_issues_count":12,"forks_count":29,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-05-23T11:30:24.592Z","etag":null,"topics":["docx","extract-tables","microsoft-word","r","rstats","table-extraction"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-24T17:37:26.000Z","updated_at":"2025-04-17T00:13:53.000Z","dependencies_parsed_at":"2022-08-06T10:16:38.577Z","dependency_job_id":null,"html_url":"https://github.com/hrbrmstr/docxtractr","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/hrbrmstr/docxtractr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fdocxtractr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fdocxtractr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fdocxtractr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fdocxtractr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.github.com/hrbrm
str/docxtractr/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fdocxtractr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30155727,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T22:39:40.138Z","status":"ssl_error","status_checked_at":"2026-03-05T22:39:24.771Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docx","extract-tables","microsoft-word","r","rstats","table-extraction"],"created_at":"2024-08-02T06:00:55.581Z","updated_at":"2026-03-05T23:38:33.899Z","avatar_url":"https://github.com/hrbrmstr.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r setup, echo=FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"README-\"\n)\n```\n\n[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/docxtractr.svg?branch=master)](https://travis-ci.org/hrbrmstr/docxtractr)\n[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/hrbrmstr/docxtractr?branch=master\u0026svg=true)](https://ci.appveyor.com/project/hrbrmstr/docxtractr)\n[![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/docxtractr/master.svg)](https://codecov.io/github/hrbrmstr/docxtractr?branch=master)\n[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/docxtractr)](http://cran.r-project.org/package=docxtractr)\n\n![](docxtractr-logo.png)\n\n# docxtractr \n\nExtract Data Tables and Comments from 'Microsoft' 'Word' Documents\n\n## Description \n\nAn R package for extracting tables \u0026 comments out of Word documents (docx). Development versions are available here and production versions are [on CRAN](https://cran.rstudio.com/web/packages/docxtractr/index.html).\n\nMicrosoft Word docx files provide an XML structure that is fairly\nstraightforward to navigate, especially when it applies to Word tables. The docxtractr package provides tools to determine table count, table structure and extract tables from Microsoft Word docx documents.\n\nMany tables in Word documents are in twisted formats where there may be labels or other oddities mixed in that make it difficult to work with the underlying data. 
`docxtractr` provides a function\u0026mdash;`assign_colnames`\u0026mdash;that makes it easy to identify a particular row in a scraped (or any, really) `data.frame` as the one containing column names and have it become the column names, removing it and (optionally) all of the rows before it (since that's usually what needs to be done).\n\n## What's in the tin?\n\nThe following functions are implemented:\n\n- `read_docx`:\tRead in a Word document for table extraction\n- `docx_describe_tbls`:\tReturns a description of all the tables in the Word document\n- `docx_describe_cmnts`:\tReturns a description of all the comments in the Word document\n- `docx_extract_tbl`:\tExtract a table from a Word document\n- `docx_extract_cmnts`:\tExtract comments from a Word document\n- `docx_extract_all_tbls`:\tExtract all tables from a Word document (`docx_extract_all` is now deprecated)\n- `docx_tbl_count`:\tGet number of tables in a Word document\n- `docx_cmnt_count`:\tGet number of comments in a Word document\n- `assign_colnames`:\tMake a specific row the column names for the specified data.frame\n- `mcga` : Make column names great again\n- `set_libreoffice_path`:\tPoint to Local soffice.exe File\n\nThe following data files are included:\n\n- `system.file(\"examples/data.docx\", package=\"docxtractr\")`: Word docx with 1 table\n- `system.file(\"examples/data3.docx\", package=\"docxtractr\")`: Word docx with 3 tables\n- `system.file(\"examples/none.docx\", package=\"docxtractr\")`: Word docx with 0 tables\n- `system.file(\"examples/complex.docx\", package=\"docxtractr\")`: Word docx with non-uniform tables\n- `system.file(\"examples/comments.docx\", package=\"docxtractr\")`: Word docx with comments\n- `system.file(\"examples/realworld.docx\", package=\"docxtractr\")`: A \"real world\" Word docx file with tables of all shapes and sizes\n- `system.file(\"examples/trackchanges.docx\", package=\"docxtractr\")`: Word docx with track changes in a table\n\n## Installation\n\n```{r inst, 
eval=FALSE}\n# devtools::install_github(\"hrbrmstr/docxtractr\")\n# OR \ninstall.packages(\"docxtractr\")\n```\n\n```{r opts, echo=FALSE}\noptions(width=120)\n```\n\n## Usage\n\n```{r libs, message=FALSE, warning=FALSE}\nlibrary(docxtractr)\nlibrary(tibble)\nlibrary(dplyr)\n\n# current version\npackageVersion(\"docxtractr\")\n\n```\n\n```{r sample}\n# one table\ndoc \u003c- read_docx(system.file(\"examples/data.docx\", package=\"docxtractr\"))\n\ndocx_tbl_count(doc)\n\ndocx_describe_tbls(doc)\n\ndocx_extract_tbl(doc, 1)\n\ndocx_extract_tbl(doc)\n\ndocx_extract_tbl(doc, header=FALSE)\n\n# url \n\nbudget \u003c- read_docx(\"http://rud.is/dl/1.DOCX\")\n\ndocx_tbl_count(budget)\n\ndocx_describe_tbls(budget)\n\ndocx_extract_tbl(budget, 1)\n\ndocx_extract_tbl(budget, 2) \n\n# three tables\ndoc3 \u003c- read_docx(system.file(\"examples/data3.docx\", package=\"docxtractr\"))\n\ndocx_tbl_count(doc3)\n\ndocx_describe_tbls(doc3)\n\ndocx_extract_tbl(doc3, 3)\n\n# no tables\nnone \u003c- read_docx(system.file(\"examples/none.docx\", package=\"docxtractr\"))\n\ndocx_tbl_count(none)\n\n# wrapping in try since it will return an error\n# use docx_tbl_count before trying to extract in scripts/production\ntry(docx_describe_tbls(none))\ntry(docx_extract_tbl(none, 2))\n\n# 5 tables, with two in sketchy formats\ncomplx \u003c- read_docx(system.file(\"examples/complex.docx\", package=\"docxtractr\"))\n\ndocx_tbl_count(complx)\n\ndocx_describe_tbls(complx)\n\ndocx_extract_tbl(complx, 3, header=TRUE)\n\ndocx_extract_tbl(complx, 4, header=TRUE)\n\ndocx_extract_tbl(complx, 5, header=TRUE)\n\n# a \"real\" Word doc\nreal_world \u003c- read_docx(system.file(\"examples/realworld.docx\", package=\"docxtractr\"))\n\ndocx_tbl_count(real_world)\n\n# get all the tables\ntbls \u003c- docx_extract_all_tbls(real_world)\n\n# see table 1\ntbls[[1]]\n\n# make table 1 better\nassign_colnames(tbls[[1]], 2)\n\n# make table 1's column names great again \nmcga(assign_colnames(tbls[[1]], 2))\n\n# see table 
5\ntbls[[5]]\n\n# make table 5 better\nassign_colnames(tbls[[5]], 2)\n\n# preserve lines\nintracell_whitespace \u003c- read_docx(system.file(\"examples/preserve.docx\", package=\"docxtractr\"))\ndocx_extract_all_tbls(intracell_whitespace, preserve=TRUE)\n\ndocx_extract_all_tbls(intracell_whitespace)\n\n# comments\ncmnts \u003c- read_docx(system.file(\"examples/comments.docx\", package=\"docxtractr\"))\n\nprint(cmnts)\n\nglimpse(docx_extract_all_cmnts(cmnts))\n```\n\n### Track Changes (depends on `pandoc` being available)\n\n```{r track-changes}\n# original\nread_docx(\n  system.file(\"examples/trackchanges.docx\", package=\"docxtractr\")\n) %\u003e% \n  docx_extract_all_tbls(guess_header = FALSE)\n\n# accept\nread_docx(\n  system.file(\"examples/trackchanges.docx\", package=\"docxtractr\"),\n  track_changes = \"accept\"\n) %\u003e% \n  docx_extract_all_tbls(guess_header = FALSE)\n\n# reject\nread_docx(\n  system.file(\"examples/trackchanges.docx\", package=\"docxtractr\"),\n  track_changes = \"reject\"\n) %\u003e% \n  docx_extract_all_tbls(guess_header = FALSE)\n```\n\n## Test Results\n\n```{r test}\nlibrary(docxtractr)\nlibrary(testthat)\n\ndate()\n\ntest_dir(\"tests/\")\n```\n\n### Code of Conduct\n\nPlease note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). \nBy participating in this project you agree to abide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fdocxtractr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fdocxtractr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fdocxtractr/lists"}