{"id":15297488,"url":"https://github.com/mit-lcp/bloatectomy","last_synced_at":"2025-04-13T22:40:36.378Z","repository":{"id":57415654,"uuid":"273322509","full_name":"MIT-LCP/bloatectomy","owner":"MIT-LCP","description":"A python package for removing duplicate text in clinical notes or other documents","archived":false,"fork":false,"pushed_at":"2020-08-06T18:33:03.000Z","size":7844,"stargazers_count":36,"open_issues_count":1,"forks_count":9,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-11T23:16:28.204Z","etag":null,"topics":["fda","mimic","mimic-iii","nlp-resources","plagarism","plagiarism-evaluation","python-3","python3","text-analysis","text-mining","text-processing"],"latest_commit_sha":null,"homepage":"","language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MIT-LCP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-18T19:25:06.000Z","updated_at":"2025-01-13T23:33:18.000Z","dependencies_parsed_at":"2022-09-01T16:22:31.747Z","dependency_job_id":null,"html_url":"https://github.com/MIT-LCP/bloatectomy","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MIT-LCP%2Fbloatectomy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MIT-LCP%2Fbloatectomy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MIT-LCP%2Fbloatectomy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MIT-LCP%2Fbloatectomy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MIT-LC
P","download_url":"https://codeload.github.com/MIT-LCP/bloatectomy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248794103,"owners_count":21162610,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fda","mimic","mimic-iii","nlp-resources","plagarism","plagiarism-evaluation","python-3","python3","text-analysis","text-mining","text-processing"],"created_at":"2024-09-30T19:17:48.833Z","updated_at":"2025-04-13T22:40:36.335Z","avatar_url":"https://github.com/MIT-LCP.png","language":"TeX","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bloatectomy [![DOI](https://zenodo.org/badge/273322509.svg)](https://zenodo.org/badge/latestdoi/273322509)\n\nBloatectomy: a method for the identification and removal of duplicate text in the bloated notes of electronic health records and other documents. Takes in a list of notes or a single file (.docx, .txt, .rtf, etc) or single string to be marked for duplicates which can then be highlighted, bolded, or removed. Marked output and tokens are output.\n\n# Paper\nFor details about how the package works and our reasons for developing it, read the paper here https://github.com/MIT-LCP/bloatectomy/blob/master/bloatectomy_paper.pdf\n\nTo acknowledge use of the software, please cite the DOI provided via Zenodo:\n\nSummer K. Rankin, Roselie Bright, \u0026 Katherine Dowdy. (2020, June 26). Bloatectomy (Version v0.0.12). Zenodo. http://doi.org/10.5281/zenodo.3909030\n\nor\n```\n@software{summer_k_rankin_2020_3909030,\n  author       = {Summer K. Rankin and Roselie A. 
Bright and Kate Dowdy},\n  title        = {Bloatectomy},\n  month        = jun,\n  year         = 2020,\n  publisher    = {Zenodo},\n  version      = {v0.0.12},\n  doi          = {10.5281/zenodo.3909030},\n  url          = {https://doi.org/10.5281/zenodo.3909030}\n}\n```\n# Requirements\n- Python\u003e=3.7.x (in order for the regular expressions to work correctly)\n- re\n- sys\n- pandas (optional, only necessary if using MIMIC III data)\n- docx (optional, only necessary if input or output is a word/docx file)\n\n# Installation\nusing anaconda or miniconda\n```\nconda install -c summerkrankin bloatectomy\n```\n\nusing pip via PyPI  \nmake sure to install it to python3 if your default is python2\n```\npython3 -m pip install bloatectomy\n```\nusing pip via github\n```\npython3 -m pip install git+git://github.com/MIT-LCP/bloatectomy\n```\nmanual install by cloning the repository\n```\ngit clone git://github.com/MIT-LCP/bloatectomy\ncd bloatectomy\npython3 setup.py install\n```\n\n# Input\nThe input for Bloatectomy can be a string, or a text file (txt, rtf), or a word (docx) document. See examples below for implementation of each. \n\n# Output\n\n- **bloatectomized_file.html** = The default output is the input text with highlighted duplicates in html format. The file can be renamed using `filename=`.\n- **[filename]_original_token_numbers.txt** = The numbered tokens of the original input text can be exported as a text file by setting `output_original_tokens=True`. \n- **[filename]_token_numbers.txt** = The numbered tokens of the marked output can be exported as a text file by setting `output_numbered_tokens=True` (the token numbers of the two numbered token files will only differ if the `style='remov'` parameter is set).  \n\n# Examples\n\nTo use with example text or load ipynb examples, download the repository or just the bloatectomy_examples folder. \nThis is the simplest use with default parameters. 
Only the input text is passed; the marking style and output type use their defaults.\n```\nfrom bloatectomy import bloatectomy\n\ntext = '''Assessment and Plan\n61 yo male Hep C cirrhosis\nAbd pain:\n-other labs: PT / PTT / INR:16.6//    1.5, CK / CKMB /\nICU Care\n-other labs: PT / PTT / INR:16.6//    1.5, CK / CKMB /\nAssessment and Plan\n'''\n\nbloatectomy(text)\n```\n\nThis example highlights duplicates, creates an HTML file, displays the result in the console, and specifies the location and name of the output (`filename=`).\n\n```\nbloatectomy(text,\n            style='highlight',\n            display=True,\n            filename='./output/sample_txt_output',\n            output='html')\n```\n\nThis example removes duplicates, creates an HTML file, displays the result in the console, specifies the location and name of the output (`filename=`), and exports the numbered tokens (useful for dissecting how the text is tokenized).\n\n```\nbloatectomy(text,\n            style='remov',\n            display=True,\n            filename='./output/sample_txt_remov_output',\n            output='html',\n            output_numbered_tokens=True,\n            output_original_tokens=True)\n```\n\nThis example takes in a single text file (i.e., sample_text.txt) to be marked for duplicates. The marked output, original numbered tokens, and marked numbered tokens are exported. Note that the tokens in the two numbered token files will have the same token numbers unless the style parameter is set to \"remov\" (`style='remov'`).\n\n```\nbloatectomy('./input/sample_text.txt',\n             filename='./output/sampletxt_output',\n             style='highlight',\n             output='html',\n             output_numbered_tokens=True,\n             output_original_tokens=True)\n```\n\nThis example takes in and exports a Word document and marks duplicates in bold. 
\n```\nbloatectomy('./input/sample_text.docx',\n            style='bold',\n            output='docx',\n            filename='./output/sample_docx_output')\n```\n\nThis example takes in an .rtf file and exports a Word document with duplicates removed.\n```\nbloatectomy('./input/sample_text.rtf',\n            style='remov',\n            output='docx',\n            filename='./output/sample_docx_output')\n```\n\n# Documentation\n\n```\nclass bloatectomy(input_text,\n                  path='',\n                  filename='bloatectomized_file',\n                  display=False,\n                  style='highlight',\n                  output='html',\n                  output_numbered_tokens=False,\n                  output_original_tokens=False,\n                  regex1=r\"(.+?\\.[\\s\\n]+)\",\n                  regex2=r\"(?=\\n\\s*[A-Z1-9#-]+.*)\",\n                  postgres_engine=None,\n                  postgres_table=None)\n```\n## Parameters  \n**input_text**: file, str, list  \nAn input document (.txt, .rtf, .docx), a string of raw text, or a list of hadm_ids for the postgres MIMIC III database.\n\n**style**: str, optional, default=`highlight`  \nHow to denote a duplicate. The following are allowed: `highlight`, `bold`, `remov`.\n\n**output**: str, optional, default=`html`  \nType of marked output file: HTML or a Word document (docx). The following are allowed: `html`, `docx`.\n\n**filename**: str, optional, default=`bloatectomized_file`  \nThe name of the output file for the marked document.\n\n**path**: str, optional, default=`''`  \nThe directory for output files.\n\n**output_numbered_tokens**: bool, optional, default=`False`  \nIf set to `True`, a .txt file with each token enumerated and marked for duplication is output as `[filename]_token_numbers.txt`. 
This is useful when diagnosing your own regular expression for tokenization or testing the `remov` option for **style**.\n\n**output_original_tokens**: bool, optional, default=`False`  \nIf set to `True`, a .txt file with each original token enumerated, but not marked for duplication, is output as `[filename]_original_token_numbers.txt`. This is useful when diagnosing your own regular expression for tokenization or testing the `remov` option for **style**.\n\n**display**: bool, optional, default=`False`  \nIf set to `True`, the bloatectomized text will display in the console on completion.\n\n**regex1**: str, optional, default=`r\"(.+?\\.[\\s\\n]+)\"`  \nThe regular expression for the first tokenization. Splits on a period (.) followed by one or more whitespace characters (space, tab, or line break). This can be replaced with any valid regular expression to change the way tokens are created.\n\n**regex2**: str, optional, default=`r\"(?=\\n\\s*[A-Z1-9#-]+.*)\"`  \nThe regular expression for the second tokenization. Splits at any line feed character followed by optional whitespace and then an uppercase letter, a digit from 1 to 9, a hash (#), or a dash. This can be replaced with any valid regular expression to change how sub-tokens are created.\n\n**postgres_engine**: str, optional  \nThe postgres connection. Only relevant for use with the MIMIC III dataset. When data is pulled from postgres, the hadm_id of the file is appended to the `filename`, if set, or otherwise to the default `bloatectomized_file`. See the Jupyter notebook [mimic_bloatectomy_example](./bloatectomy_examples/mimic_bloatectomy_example.ipynb) for the example code.\n\n**postgres_table**: str, optional  \nThe name of the postgres table containing the concatenated notes. Only relevant for use with the MIMIC III dataset. When data is pulled from postgres, the hadm_id of the file is appended to the `filename`, if set, or otherwise to the default `bloatectomized_file`. 
See the Jupyter notebook [mimic_bloatectomy_example](./bloatectomy_examples/mimic_bloatectomy_example.ipynb) for the example code.\n\n# Contributing\n\nWe encourage you to share any additions or changes to our package. To contribute, please:\n\nFork the repository using the following link: https://github.com/MIT-LCP/bloatectomy/fork. For background on GitHub forks, see: https://help.github.com/articles/fork-a-repo/\n\nCommit your changes to the forked repository.\n\nSubmit a pull request to the Bloatectomy repository, using the method described at: https://help.github.com/articles/using-pull-requests/\n\n## License\n\nBy committing your code to the Bloatectomy repository, you agree to release the code under the [GNU General Public License v3.0](LICENSE.txt) in this repository.\n\n## Issues or Bugs\n\nPlease feel free to create an issue for any questions, bugs, or suggestions you may have about our package or even the documentation (e.g., additional examples). We appreciate any feedback. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-lcp%2Fbloatectomy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmit-lcp%2Fbloatectomy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-lcp%2Fbloatectomy/lists"}