{"id":13665924,"url":"https://github.com/quanteda/readtext","last_synced_at":"2025-04-05T05:09:44.326Z","repository":{"id":37405869,"uuid":"71951499","full_name":"quanteda/readtext","owner":"quanteda","description":"an R package for reading text files","archived":false,"fork":false,"pushed_at":"2024-02-27T19:08:19.000Z","size":16224,"stargazers_count":115,"open_issues_count":31,"forks_count":28,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-04-26T07:41:12.087Z","etag":null,"topics":["encoding","quanteda","r","text"],"latest_commit_sha":null,"homepage":"https://readtext.quanteda.io","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quanteda.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-26T00:47:47.000Z","updated_at":"2024-06-19T17:39:59.650Z","dependencies_parsed_at":"2023-01-22T14:00:52.769Z","dependency_job_id":"961f14cb-b08b-4725-8033-6876367e2222","html_url":"https://github.com/quanteda/readtext","commit_stats":{"total_commits":414,"total_committers":11,"mean_commits":37.63636363636363,"dds":0.5169082125603865,"last_synced_commit":"647c46510fbc09b605cb46e38873f68ff157b858"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Freadtext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Freadtext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Freadtext/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/quanteda%2Freadtext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quanteda","download_url":"https://codeload.github.com/quanteda/readtext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247289429,"owners_count":20914464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["encoding","quanteda","r","text"],"created_at":"2024-08-02T06:00:53.941Z","updated_at":"2025-04-05T05:09:44.308Z","avatar_url":"https://github.com/quanteda.png","language":"R","readme":"---\noutput: github_document\n---\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"##\",\n  fig.path = \"images/\"\n)\n```\n```{r echo=FALSE, results=\"hide\", message=FALSE}\nlibrary(\"badger\")\n```\n\n# readtext: Import and handling for plain and formatted text files\n\n\u003c!-- badges: start --\u003e\n[![CRAN Version](https://www.r-pkg.org/badges/version/readtext)](https://CRAN.R-project.org/package=readtext)\n`r badge_devel(\"quanteda/readtext\", \"royalblue\")`\n[![Downloads](https://cranlogs.r-pkg.org/badges/readtext)](https://CRAN.R-project.org/package=readtext)\n[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/readtext?color=orange)](https://CRAN.R-project.org/package=readtext)\n[![R-CMD-check](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml)\n[![Codecov test 
coverage](https://codecov.io/gh/quanteda/readtext/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/readtext?branch=master)\n\u003c!-- badges: end --\u003e\n\n\n[1]: https://codecov.io/gh/quanteda/readtext/branch/master\n\nAn R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.\n\n## Introduction\n\n**readtext** is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call \"docvars\", for document variables.  Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do.  \n\n**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types.  **readtext** is smart enough to process them correctly, returning a data.frame with a primary field \"text\" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.\n\nAs encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings.  (All encoding functions are handled by the **stringi** package.)\n\n## How to Install\n\n\n1.  From CRAN\n\n    ```{r, eval = FALSE}\n    install.packages(\"readtext\")\n    ```\n\n2.  From GitHub, if you want the latest development version.\n\n    ```{r, eval = FALSE}\n    # the remotes package is required to install readtext from GitHub \n    remotes::install_github(\"quanteda/readtext\") \n    ```\n\nLinux note: There are a couple of dependencies that may not be available on Linux systems. 
On Debian/Ubuntu try installing these packages by running these commands at the command line:\n\n```{bash, eval = FALSE}\nsudo apt-get install libpoppler-cpp-dev   # for pdftools\n```\n\n## Demonstration: Reading one or more text files\n\n**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word formatted files and other document formats (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types using, for instance, a \"glob\" expression, as well as files from a URL or an archive file (.zip, .tar, .tar.gz, .tar.bz).\n\nThe file formats are determined automatically by the filename extensions.  If a file has no extension or its extension is unknown, **readtext** will assume that it is plain text.  The following command, for instance, will load in all of the files from the subdirectory `txt/UDHR/`:\n\n```{r}\nlibrary(\"readtext\")\n# get the data directory from readtext\nDATA_DIR \u003c- system.file(\"extdata/\", package = \"readtext\")\n\n# read in all files from a folder\nreadtext(paste0(DATA_DIR, \"/txt/UDHR/*\"))\n```\n\nFor files that contain multiple documents, such as comma-separated-value documents, you will need to specify the column name containing the texts, using the `text_field` argument:\n\n```{r}\n# read in comma-separated values and specify text field\nreadtext(paste0(DATA_DIR, \"/csv/inaugCorpus.csv\"), text_field = \"texts\")\n```\n\nFor a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).\n\n## Inter-operability with other packages\n\n### With **quanteda**\n\n**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. 
Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other metadata.\n\n```{r}\nlibrary(\"quanteda\")\n# read in comma-separated values with readtext\nrt_csv \u003c- readtext(paste0(DATA_DIR, \"/csv/inaugCorpus.csv\"), text_field = \"texts\")\n# create quanteda corpus\ncorpus_csv \u003c- corpus(rt_csv)\nsummary(corpus_csv, 5)\n```\n\n### Text Interchange Format compatibility\n\nBecause **readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that can accept a corpus in data.frame format.  \n\nIf you only want a named `character` object, **readtext** also defines an `as.character()` method that takes its data.frame as input and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquanteda%2Freadtext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquanteda%2Freadtext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquanteda%2Freadtext/lists"}