{"id":13401292,"url":"https://github.com/hrbrmstr/jericho","last_synced_at":"2025-10-29T01:30:50.881Z","repository":{"id":141238418,"uuid":"102382246","full_name":"hrbrmstr/jericho","owner":"hrbrmstr","description":":notebook_with_decorative_cover: Extract plain or structured text from HTML content in R","archived":false,"fork":false,"pushed_at":"2019-03-01T17:46:03.000Z","size":245,"stargazers_count":14,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-07-21T07:32:10.616Z","etag":null,"topics":["java","r","r-cyber","rstats"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-09-04T16:42:22.000Z","updated_at":"2022-03-31T18:05:44.000Z","dependencies_parsed_at":"2024-01-18T11:04:13.522Z","dependency_job_id":"40487e1e-8c40-437b-b585-64aa01325ea1","html_url":"https://github.com/hrbrmstr/jericho","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fjericho","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fjericho/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fjericho/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fjericho/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.github.c
om/hrbrmstr/jericho/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219858513,"owners_count":16556043,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","r","r-cyber","rstats"],"created_at":"2024-07-30T19:01:01.038Z","updated_at":"2025-10-29T01:30:45.605Z","avatar_url":"https://github.com/hrbrmstr.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: rmarkdown::github_document\n---\n\n[![Build Status](https://travis-ci.org/hrbrmstr/jericho.svg?branch=master)](https://travis-ci.org/hrbrmstr/jericho)\n[![Build status](https://ci.appveyor.com/api/projects/status/nosmgh0b2wthjjf3/branch/master?svg=true)](https://ci.appveyor.com/project/hrbrmstr/jericho/branch/master)\n[![codecov](https://codecov.io/gh/hrbrmstr/jericho/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/jericho)\n\n`jericho` : Break Down the Walls of 'HTML' Tags into Usable Text\n\nStructured 'HTML' content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain \"just the text\" from a document free from the walls of tags that surround it. Tools are provided that wrap methods in the 'Jericho HTML Parser' Java library by Martin Jericho \u003chttp://jericho.htmlparser.net/docs/index.html\u003e. 
Martin's library is used in many at-scale projects, including 'The Internet Archive'.\n\nAs a result of using a Java library, this package requires `rJava`.\n\nThe following functions are implemented:\n\n- `html_to_text`:\tConvert HTML to Text\n- `render_html_to_text`:\tRender HTML to Text\n\n### Installation\n\nIf you do use `devtools`, then it *should* pick up the `Remotes:` section in `DESCRIPTION`. Until the package is on CRAN, you might also want to invoke the installation of `jerichojars` as shown below:\n\n```{r eval=FALSE}\ninstall.packages(c(\"jerichojars\", \"jericho\"), repos = \"https://cinc.rud.is/\")\n```\n\n```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}\noptions(width = 120)\n```\n\n### Usage\n\nLet's use [this NASA blog post](https://blogs.nasa.gov/spacestation/2017/09/02/touchdown-expedition-52-back-on-earth/) as an example.\n\n```{r message=FALSE, warning=FALSE, error=FALSE}\nlibrary(jericho)\n\n# current version\npackageVersion(\"jericho\")\n\nURL \u003c- \"https://blogs.nasa.gov/spacestation/2017/09/02/touchdown-expedition-52-back-on-earth/\"\n  \ndoc \u003c- paste0(readr::read_lines(URL), collapse = \"\\n\")\n```\n\nThis is pure text extraction:\n\n```{r message=FALSE, warning=FALSE, error=FALSE, eval=FALSE}\nhtml_to_text(doc)\n```\n\nThis provides a human-readable version of the segment content that is modelled on the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails.\n\n```{r message=FALSE, warning=FALSE, error=FALSE, eval=FALSE}\nrender_html_to_text(doc)\n```\n\nYou should run each to see and compare the output (GitHub markdown documents aren't the best viewing medium).\n\n### `jericho` Metrics\n\n```{r cloc, 
echo=FALSE}\ncloc::cloc_pkg_md()\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fjericho","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fjericho","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fjericho/lists"}