{"id":21721055,"url":"https://github.com/ingmarboeschen/jatsdecoder","last_synced_at":"2025-04-12T21:33:48.398Z","repository":{"id":44579188,"uuid":"305655897","full_name":"ingmarboeschen/JATSdecoder","owner":"ingmarboeschen","description":"A text extraction and manipulation toolset for NISO-JATS coded XML files ","archived":false,"fork":false,"pushed_at":"2025-04-11T07:33:42.000Z","size":3087,"stargazers_count":19,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-11T08:42:22.622Z","etag":null,"topics":["cermine","niso-jats","pubmedcentral","r","text-extraction","text-mining","xml-files"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ingmarboeschen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-10-20T09:25:32.000Z","updated_at":"2025-04-11T07:33:45.000Z","dependencies_parsed_at":"2023-02-04T07:30:25.835Z","dependency_job_id":"d8a8caee-fc5a-4fc1-8a9e-063a8f98442e","html_url":"https://github.com/ingmarboeschen/JATSdecoder","commit_stats":{"total_commits":310,"total_committers":1,"mean_commits":310.0,"dds":0.0,"last_synced_commit":"4eb710c9e6e55b232de2453c13d280e1eafe615b"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingmarboeschen%2FJATSdecoder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingmarboeschen%2FJATSdecoder/tags","releases_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingmarboeschen%2FJATSdecoder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingmarboeschen%2FJATSdecoder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ingmarboeschen","download_url":"https://codeload.github.com/ingmarboeschen/JATSdecoder/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248636810,"owners_count":21137527,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cermine","niso-jats","pubmedcentral","r","text-extraction","text-mining","xml-files"],"created_at":"2024-11-26T02:13:30.256Z","updated_at":"2025-04-12T21:33:48.384Z","avatar_url":"https://github.com/ingmarboeschen.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# JATSdecoder\nA metadata and text extraction and text manipulation tool set for the statistical programming language [R](https://www.r-project.org). \n\n**JATSdecoder** facilitates text mining projects on scientific articles by enabling an individual selection of metadata and text parts. \nIts function `JATSdecoder()` extracts metadata, sectioned text and reference list from [NISO-JATS](https://jats.nlm.nih.gov/publishing/tag-library/1.1d2/index.html) coded XML files. \nThe function `study.character()` uses the `JATSdecoder()` result to perform fine-tuned text extraction tasks to identify key study characteristics like statistical methods used, alpha-error, statistical results reported in text and others. 
\n\nNote:  \n- PDF article collections can be converted to NISO-JATS coded XML files with the open source software [CERMINE](https://github.com/CeON/CERMINE).\n- To extract statistical test results reported in simple/unpublished PDF documents with JATSdecoder::get.stats(), the R package [pdftools](https://cran.r-project.org/web/packages/pdftools/) and its function pdf_text() may help to extract textual content (be aware that tabled content may yield corrupted text).  \n\nNote too:  \n- A minimal web app to extract statistical results from textual resources with get.stats() is hosted at:  \n[https://get-stats.app](https://get-stats.app)  \n- An interactive web application to analyze study characteristics of articles stored in the PubMed Central database and perform an individual article selection by study characteristics is hosted at:  \n[https://scianalyzer.com/](https://scianalyzer.com/)\n\n**JATSdecoder** supplies some convenient functions to work with textual input in general. \nIts function `text2sentences()` is especially designed to break floating text with scientific content (references, results) into sentences. \n`text2num()` unifies representations of written numbers and special annotations (percent, fraction, e+10) into digit numbers. \nYou can extract an adjustable number of words around a pattern match in a sentence with `ngram()`. \n`letter.convert()` converts hexadecimal to Unicode characters and, if [CERMINE](https://github.com/CeON/CERMINE) generated CERMXML files are processed, special error correction and special letter uniformization are performed, which is extremely relevant for `get.stats()`'s ability to extract and recompute statistical results in text. \n\nThe contained functions are listed below. 
For a detailed description, see the documentation on [CRAN](https://cran.r-project.org/web/packages/JATSdecoder/index.html).\n\n- **JATSdecoder::JATSdecoder()** uses functions that can be applied stand alone on NISO-JATS coded XML files or text input:\n  - get.title()      # extracts title                \n  - get.author()     # extracts author/s as vector   \n  - get.aff()        # extracts involved affiliation/s as vector\n  - get.journal()    # extracts journal\n  - get.vol()        # extracts journal volume as vector\n  - get.doi()        # extracts Digital Object Identifier\n  - get.history()    # extracts publishing history as vector with available date stamps\n  - get.country()    # extracts country/countries of origin as vector with unique countries\n  - get.type()       # extracts document type\n  - get.subject()    # extracts subject/s as vector\n  - get.keywords()   # extracts keyword/s as vector\n  - get.abstract()   # extracts abstract\n  - get.text()       # extracts sections and text as list\n  - get.references() # extracts reference list as vector\n\n\n- **JATSdecoder::study.character()** applies several functions on specific elements of the `JATSdecoder()` result. 
These functions can be used standalone on any plain textual input:\n  - get.n.studies()   # extracts number of studies from sections or abstract\n  - get.alpha.error()  # extracts alpha error from text \n  - get.method()  # extracts statistical methods from method and result section with `ngram()`\n  - get.stats()  # extracts statistical results reported in text (abstract and full text, method and result section, result section only) and compares extracted with recalculated p-values if possible \n  - get.software()  # extracts software name/s mentioned in method and result section with dictionary search\n  - get.R.package()  # extracts mentioned R package/s in method and result section with dictionary search on all available R packages created with `available.packages()`\n  - get.power()  # extracts power (1-beta-error) if mentioned in text\n  - get.assumption()  # extracts mentioned assumptions from method and result section with dictionary search\n  - get.multiple.comparison()  # extracts correction method for multiple testing from method and result section with dictionary search\n  - get.sig.adjectives()  # extracts common inadequate adjectives used before *significant* and *not significant* \n\n- **JATSdecoder helper functions** are helpful for many text mining projects and straightforward to use on any textual input:\n  - text2sentences() # breaks floating text into sentences\n  - text2num() # converts spelled out numbers, fractions, powers, percentages and numbers denoted with e+num to decimals\n  - ngram() # creates \u0026plusmn;n-gram bag of words around a pattern match in text \n  - strsplit2() # splits text at pattern match with option \"before\" or \"after\" and without removing the pattern match \n  - grep2() # extension of grep(). 
Allows connecting multiple search patterns with logical AND operator\n  - letter.convert() # unifies many and converts most hexadecimal and HTML characters to Unicode and performs CERMINE specific error correction\n  - which.term() # returns hit vector for a set of patterns to search for in text (can be reduced to hits only)\n\n### Built With\n* [R Core 3.6](https://www.r-project.org)\n* [RKWard](https://rkward.kde.org/)\n* [devtools](https://github.com/r-lib/devtools) package\n\n\n### How to cite JATSdecoder\n```\nJATSdecoder: A Metadata and Text Extraction and Manipulation Tool Set. Ingmar Böschen (2023). R package version 1.2.0\n```\n### Resources\n**Articles:**\n\n- Böschen, I. (2021). Software review: The JATSdecoder package—extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed central’s open access database. Scientometrics. https://doi.org/10.1007/s11192-021-04162-z. [link to [repo](https://github.com/ingmarboeschen/JATSdecoderEvaluation/tree/main/Evaluation_PubMedCentral_data)]\n\n- Böschen, I. (2021). Evaluation of JATSdecoder as an automated text extraction tool for statistical results in scientific reports. Scientific Reports 11, 19525. https://doi.org/10.1038/s41598-021-98782-3. [link to [repo](https://github.com/ingmarboeschen/JATSdecoderEvaluation/tree/main/Evaluation_get.stats_data)]\n\n- Böschen, I. (2023). Evaluation of the extraction of methodological study characteristics with JATSdecoder. Scientific Reports 13, 139. https://doi.org/10.1038/s41598-022-27085-y. [link to [repo](https://github.com/ingmarboeschen/JATSdecoderEvaluation/tree/main/Evaluation_study.character_data)]\n\n- Böschen, I. (2023). Changes in methodological study characteristics in psychology between 2010-2021. PLOS ONE 18(5). https://doi.org/10.1371/journal.pone.0283353. [link to [repo](https://github.com/ingmarboeschen/JATSdecoderEvaluation/tree/main/Study_ChangesInMethodologyInPsychology)]\n\n- Böschen, I. (2024). 
statcheck is flawed by design and no valid spell checker for statistical results. [https://arxiv.org/abs/2408.07948](https://arxiv.org/abs/2408.07948). [link to [repo](https://github.com/ingmarboeschen/JATSdecoderEvaluation/tree/main/Check_statcheck)]\n\n**Evaluation data and code:** \n\n[https://github.com/ingmarboeschen/JATSdecoderEvaluation/](https://github.com/ingmarboeschen/JATSdecoderEvaluation/)\n\n**JATSdecoder on CRAN:**\n\n[https://CRAN.R-project.org/package=JATSdecoder/](https://CRAN.R-project.org/package=JATSdecoder/)\n\n\n\u003c!-- GETTING STARTED --\u003e\n## Getting Started\n\nTo install **JATSdecoder** run the following steps:\n\n### Installation\nOption 1: Install **JATSdecoder** from [CRAN](https://cran.r-project.org/)\n``` r\ninstall.packages(\"JATSdecoder\")\n``` \nOption 2: Install **JATSdecoder** from [github](https://github.com/ingmarboeschen/JATSdecoder) with the [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package\n``` r\nif(require(devtools)!=TRUE) install.packages(\"devtools\")\ndevtools::install_github(\"ingmarboeschen/JATSdecoder\")\n```\n\n\n\u003c!-- USAGE EXAMPLES --\u003e\n## Usage for a single XML file\nHere, a simple download of a NISO-JATS coded XML file is performed with `download.file()`:\n``` r\n# load package\nlibrary(JATSdecoder)\n# download example XML file via URL\nURL \u003c- \"https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0114876\u0026type=manuscript\"\ndownload.file(URL,\"file.xml\")\n# convert full article to list with metadata, sectioned text and reference list\nJATSdecoder(\"file.xml\")\n# extract specific content (here: abstract)\nJATSdecoder(\"file.xml\",output=\"abstract\")\nget.abstract(\"file.xml\")\n# extract study characteristics as list\nstudy.character(\"file.xml\")\n# extract specific study characteristic (here: statistical results)\nstudy.character(\"file.xml\",output=c(\"stats\",\"standardStats\")) \n# reduce to checkable results 
only\nstudy.character(\"file.xml\",output=\"standardStats\",stats.mode=\"checkable\")\n# compare with result of statcheck's function checkHTML() (Epskamp \u0026 Nuijten, 2018)\ninstall.packages(\"statcheck\")\nlibrary(statcheck)\ncheckHTML(\"file.xml\")\n\n# extract results with get.stats() from simple/unpublished manuscripts with pdftools::pdf_text()\nx\u003c-pdftools::pdf_text(\"path2file.pdf\")\nx\u003c-unlist(strsplit(x,\"\\\\n\"))\nJATSdecoder::get.stats(x)\n\n```\n\n## Usage for a collection of XML files\nThe [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/) database offers more than 5.4 million documents related to the biology and health sciences. The full repository is bulk downloadable as NISO-JATS coded NXML documents here: [PMC bulk download](https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/). \n\n1. Get XML file names from working directory\n``` r\nsetwd(\"/home/PMC\") # choose a specific folder with NISO-JATS coded articles in XML files on your device\nfiles\u003c-list.files(pattern=\"XML$|xml$\",recursive=TRUE)\n``` \n2. Apply the extraction of article content to all files (replace `lapply()` with `future_lapply()` from the [future.apply](https://github.com/HenrikBengtsson/future.apply) package for multicore processing)\n``` r\nlibrary(JATSdecoder)\n# extract full article content\nJATS\u003c-lapply(files,JATSdecoder)\n# extract single article content (here: abstract)\nabstract\u003c-lapply(files,JATSdecoder,output=\"abstract\")\n# or\nabstract\u003c-lapply(files,get.abstract)\n# extract study characteristics\ncharacter\u003c-lapply(files,study.character)\n```\n3. 
Working with a list of `JATSdecoder()` results\n``` r\n# first article content as list\nJATS[[1]] \ncharacter[[1]] \n# names of all extractable elements\nnames(JATS[[1]])\nnames(character[[1]])\n# extract one element only (here: title, abstract, history)\nlapply(JATS,\"[[\",\"title\") \nlapply(JATS,\"[[\",\"abstract\") \nlapply(JATS,\"[[\",\"history\") \n# extract year of publication from history tag\nunlist(lapply(lapply(JATS,\"[[\",\"history\"),\"[\",\"pubyear\"))\n``` \n4. Examples for converting, unifying and selecting text with helper functions\n``` r\n# extract full text from all documents\ntext\u003c-lapply(JATS,\"[[\",\"text\") \n# convert floating text to sentences\nsentences\u003c-lapply(text,text2sentences)\nsentences\n# only select sentences with pattern and unlist article wise\npattern\u003c-\"significant\"\nhits\u003c-lapply(sentences,function(x) grep(pattern,x,value=TRUE))\nhits\u003c-lapply(hits,unlist)\nhits\n# number of sentences with pattern\nlapply(hits,length)\n# unify written numbers, fractions, percentages, powers and numbers denoted with e+num to digit number\nlapply(text,text2num)\n``` \n\n## Exemplary analysis of some NISO-JATS tags\nNext, some example analyses are performed on the full PMC article collection. As each variable is very memory consuming, you might want to reduce your analysis to a smaller number of articles. \n\n1. Extract JATS for article collection (replace `lapply()` with `future_lapply()` from the [future.apply](https://github.com/HenrikBengtsson/future.apply) package for multicore processing)\n```r\n# load package\nlibrary(JATSdecoder)\n# set working directory\nsetwd(\"/home/foldername\")\n# get XML file names\nfiles\u003c-list.files(pattern=\"xml$|XML$\")\n# extract JATS\nJATS\u003c-lapply(files,JATSdecoder)\n```\n\n2. 
Analyze distribution of publishing year\n```r\n# extract year of publication from history tag and convert to numeric\nyear\u003c-unlist(lapply(lapply(JATS,\"[[\",\"history\") ,\"[\",\"pubyear\"))\nyear\u003c-as.numeric(year)\n# frequency table\ntable(year)\n# display absolute number of published documents per year in barplot\n# with factorized year\nyear\u003c-factor(year,min(year,na.rm=TRUE):max(year,na.rm=TRUE))\nbarplot(table(year),las=1,xlab=\"year\",main=\"absolute number of published PMC documents per year\")\n# display cumulative number of published documents in barplot\nbarplot(cumsum(table(year)),las=1,xlab=\"year\",main=\"cumulative number of published PMC documents\")\n``` \n![](articlesperyear.png)\n\n3. Analyze distribution of document type\n```r\n# extract document type\ntype\u003c-unlist(lapply(JATS ,\"[\",\"type\"))\n# increase left margin of graphics output\npar(mar=c(5,12,4,2)+.1)\n# display in barplot\nbarplot(sort(table(type)),horiz=TRUE,las=1)             \n# set margins back to normal\npar(mar=c(5,4,4,2)+.1)\n``` \n![](type.png)\n\n4. Find most frequent authors\n\nNOTE: author names are not stored fully consistently. Some first and middle names are abbreviated, first names are followed by last names and vice versa!\n\n```r\n# extract author\nauthor\u003c-lapply(JATS ,\"[\",\"author\")\n# 100 most frequent author names \ntab\u003c-sort(table(unlist(author)),decreasing=TRUE)[1:100]\n# frequency table\ntab\n# display in barplot\n# increase left margin of graphics output\npar(mar=c(5,12,4,2)+.1)\nbarplot(tab,horiz=TRUE,las=1)             \n# set margins back to normal\npar(mar=c(5,4,4,2)+.1)\n# display in wordcloud with wordcloud package\nlibrary(wordcloud)\nwordcloud(names(tab),tab)\n``` \n![](author.png)\n\n## References\n\u003cdiv id=\"refs\" class=\"references\"\u003e\n\u003cdiv id=\"CERMINE\"\u003e\n- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). 2014. Journal Publishing Tag Library - NISO JATS Draft Version 1.1d2. 
\n[https://jats.nlm.nih.gov/publishing/tag-library/1.1d2/index.html].\n\u003c/div\u003e\n\n\u003cdiv id=\"JATS\"\u003e\n- Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. \nCERMINE: automatic extraction of structured metadata from scientific literature. \nIn International Journal on Document Analysis and Recognition (IJDAR), 2015, \nvol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8. \n[https://github.com/CeON/CERMINE/].\n\u003c/div\u003e\n\n\n\u003cdiv id=\"JATSdecoder\"\u003e\n- Böschen, I. (2021) Software review: The JATSdecoder package—extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed central’s open access database. Scientometrics. https://doi.org/10.1007/s11192-021-04162-z\n\u003c/div\u003e\n  \n\u003cdiv id=\"get.stats\"\u003e\n- Böschen, I. (2021). Evaluation of JATSdecoder as an automated text extraction tool for statistical results in scientific reports. Scientific Reports. 11, 19525. https://doi.org/10.1038/s41598-021-98782-3\n\u003c/div\u003e\n  \n\u003cdiv id=\"study.character\"\u003e\n- Böschen, I. (2023). Evaluation of the extraction of methodological study characteristics with JATSdecoder. Scientific Reports. 13, 139. https://doi.org/10.1038/s41598-022-27085-y\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n\u003c!-- ACKNOWLEDGEMENTS --\u003e\n## Acknowledgements\nThis software is part of a dissertation project about the evolution of methodological characteristics in psychological research and financed by a grant awarded by the Department of [Research Methods and Statistics](https://www.psy.uni-hamburg.de/arbeitsbereiche/forschungsmethoden-und-statistik.html), [Institute of Psychology](https://www.psy.uni-hamburg.de/), [University Hamburg](https://www.uni-hamburg.de/), Germany.  
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fingmarboeschen%2Fjatsdecoder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fingmarboeschen%2Fjatsdecoder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fingmarboeschen%2Fjatsdecoder/lists"}