{"id":25295917,"url":"https://github.com/ropenscilabs/tif","last_synced_at":"2025-10-28T02:31:19.539Z","repository":{"id":78722687,"uuid":"89080883","full_name":"ropenscilabs/tif","owner":"ropenscilabs","description":"Text Interchange Formats","archived":false,"fork":false,"pushed_at":"2023-11-26T21:57:13.000Z","size":45,"stargazers_count":35,"open_issues_count":5,"forks_count":4,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-12-23T00:23:46.223Z","etag":null,"topics":["corpus","natural-language-processing","r","r-package","rstats","term-frequency","text-processing","tokenizer"],"latest_commit_sha":null,"homepage":"https://docs.ropensci.org/tif","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ropenscilabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-04-22T15:45:55.000Z","updated_at":"2024-01-11T08:54:48.000Z","dependencies_parsed_at":"2023-08-20T09:30:54.755Z","dependency_job_id":null,"html_url":"https://github.com/ropenscilabs/tif","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropenscilabs%2Ftif","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropenscilabs%2Ftif/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropenscilabs%2Ftif/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropenscilabs%2Ftif/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropenscilabs","download_url":"https://codeload.github.com/ropenscilabs/tif/ta
r.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238585203,"owners_count":19496435,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","natural-language-processing","r","r-package","rstats","term-frequency","text-processing","tokenizer"],"created_at":"2025-02-13T02:42:11.310Z","updated_at":"2025-10-28T02:31:14.261Z","avatar_url":"https://github.com/ropenscilabs.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"## tif: Text Interchange Formats\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/ropensci/tif/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/tif/actions/workflows/R-CMD-check.yaml)\n\u003c!-- badges: end --\u003e\n\nThis package describes and validates formats for storing\ncommon objects arising in text analysis as native R objects.\nRepresentations of a text corpus, document term matrix, and\ntokenized text are included. The tokenized text format is\nextensible to include other annotations. There are two versions\nof the corpus and tokens objects; packages should accept\nboth and return or coerce to at least one of these.\n\n## Installation\n\nYou can install the development version using devtools:\n\n```{r}\ndevtools::install_github(\"ropensci/tif\")\n```\n\n## Usage\n\nThe package can be used to check that a particular object is in a valid \nformat. 
For example, here we see that the object `corpus` is a valid corpus\ndata frame:\n\n```{r}\nlibrary(tif)\ncorpus \u003c- data.frame(doc_id = c(\"doc1\", \"doc2\", \"doc3\"),\n                     text = c(\"Aujourd'hui, maman est morte.\",\n                      \"It was a pleasure to burn.\",\n                      \"All this happened, more or less.\"),\n                     stringsAsFactors = FALSE)\n\ntif_is_corpus_df(corpus)\n```\n```\nTRUE\n```\n\nThe package also has functions to convert between the list and data frame\nformats for corpus and tokens objects. For example:\n\n```{r}\ntif_as_corpus_character(corpus)\n```\n```\n                              doc1                               doc2 \n   \"Aujourd'hui, maman est morte.\"       \"It was a pleasure to burn.\" \n                              doc3 \n\"All this happened, more or less.\" \n```\n\nNote that extra metadata columns will be lost in the conversion from a data\nframe to a named character vector.\n\n## Details\n\nThis package describes and validates formats for storing\ncommon objects arising in text analysis as native R objects.\nRepresentations of a text corpus, document term matrix, and\ntokenized text are included. The tokenized text format is\nextensible to include other annotations. There are two versions\nof the corpus and tokens objects; packages should accept and return\nat least one of these.\n\n**corpus** (data frame) - A valid corpus data frame object\nis a data frame with at least two columns. The first column\nis called doc_id and is a character vector with UTF-8 encoding. Document\nids must be unique. The second column is called text and\nmust also be a character vector in UTF-8 encoding. Each\nindividual document is represented by a single row in\nthe data frame. Additional document-level metadata columns\nand corpus-level attributes are allowed but not required.\n\n**corpus** (character vector) - A valid character vector corpus\nobject is a character vector with UTF-8 encoding. 
If it has\nnames, these should be unique character strings, also in UTF-8\nencoding. No other attributes should be present.\n\n**dtm** - A valid document term matrix is a sparse matrix with\nrows representing documents and columns representing\nterms. The row names are a character vector giving the\ndocument ids with no duplicated entries. The column\nnames are a character vector giving the terms of the\nmatrix with no duplicated entries. The sparse matrix\nshould inherit from the Matrix class dgCMatrix.\n\n**tokens** (data frame) - A valid data frame tokens\nobject is a data frame with at least two columns. There must be\na column called doc_id that is a character vector\nwith UTF-8 encoding. Document ids must be unique.\nThere must also be a column called token that must also be a\ncharacter vector in UTF-8 encoding.\nEach individual token is represented by a single row in\nthe data frame. Additional token-level metadata columns\nare allowed but not required. \n\n**tokens** (list) - A valid corpus tokens object is a (possibly\nnamed) list of character vectors. The character vectors, as\nwell as names, should be in UTF-8 encoding. No other\nattributes should be present in either the list or any of its\nelements.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fropenscilabs%2Ftif","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fropenscilabs%2Ftif","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fropenscilabs%2Ftif/lists"}