{"id":21740543,"url":"https://github.com/writecrow/text_processing","last_synced_at":"2025-10-06T05:53:45.271Z","repository":{"id":42569068,"uuid":"108482073","full_name":"writecrow/text_processing","owner":"writecrow","description":"A repository for text_processing tools used by crow","archived":false,"fork":false,"pushed_at":"2025-03-21T23:15:29.000Z","size":450,"stargazers_count":12,"open_issues_count":4,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-26T20:51:27.749Z","etag":null,"topics":["natural-language-processing","python-script"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/writecrow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-27T00:59:42.000Z","updated_at":"2025-03-21T23:15:33.000Z","dependencies_parsed_at":"2023-01-26T04:45:38.247Z","dependency_job_id":"946859ba-736f-4cb5-bc52-b2587f9c1c95","html_url":"https://github.com/writecrow/text_processing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/writecrow%2Ftext_processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/writecrow%2Ftext_processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/writecrow%2Ftext_processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/writecrow%2Ftext_processing/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/writecrow","download_url":"https://codeload.github.com/writecrow/text_processing/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248660884,"owners_count":21141367,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","python-script"],"created_at":"2024-11-26T06:13:56.213Z","updated_at":"2025-10-06T05:53:40.236Z","avatar_url":"https://github.com/writecrow.png","language":"Python","readme":"# text_processing\nA repository for text_processing tools used by crow\n\nIncluded below is a description of the tools:\n\n#### conversion tools\n\n* convert_to_utf8.py is a command-line script for semi-intelligently converting files from known encoding formats into the UTF-8 charset. It will convert entire directories or individual files, and can either overwrite the files or put them in a parallel `output` directory.\n\n* cp1251_to_utf8.py is a Python script that converts files encoded in Windows-1251 into UTF-8.\n\n* Mass_directory_unzipper.py is a Python script that loops through all folders in the current directory and looks inside them for zip files.  If zip files exist, this script will unzip them and place the contents into a regular folder with the same name as the zip file.\n\n* Aextractzip2txt.py is a Python script that takes a directory of zip files (the directory name is assumed to end in 'zips' or 'zips/'), unzips them, and converts all the docx files into txt files.  It then puts the new txt files into a new directory with 'zips' stripped off the directory name.  
This script requires the dabfunctions.py script to run.\n\n* Fcheckdraftandfinal.py is a Python script that checks filenames for 'draft' and 'final' and then swaps them: a draft becomes the final, and a final becomes a draft.  This script requires the dabfunctions.py script to run.\n\n* dabfunctions.py is a helper script that contains many useful functions (some are commented out).  Fcheckdraftandfinal.py and Aextractzip2txt.py both rely on this script.\n\n#### de-identification\n\n#### normalization\n\n* textnormalization.py is a text cleaning script. It normalizes punctuation such as smart quotes and ellipses, replaces dashes with a regular hyphen, and replaces other non-English characters.\n\n* FS_general_formatter.py is a text cleaning script used to process data from FLLOC and SPLLOC files.  It replaces lines that begin with @, encasing them in \u003c\u003e brackets.  It also encases the interviewer of the transcriptions in \u003c\u003e brackets. It should be run with an argument specifying the kind of file and the directory depth, such as `**/**/*.cha` or `**/*.cex`, depending on your directory structure.  The script reads the files and, after modifying their text, writes them as txt files to a directory called \"recoded\" that mirrors the initial file structure within the folder.  \n\n#### tagging\n\n#### text-retrieval\n\n* FLLOC_Scraper.py is a script for downloading all zip files from the FLLOC corpora website.  It is a slow script; when last tested, it took up to 17 minutes to complete all the downloads.  
If this script fails with an error mentioning a blocked port, simply delete the data, wait a bit, and restart the script.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwritecrow%2Ftext_processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwritecrow%2Ftext_processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwritecrow%2Ftext_processing/lists"}