{"id":28158409,"url":"https://github.com/hardchor/ai-text-cleaner","last_synced_at":"2025-05-15T09:19:57.685Z","repository":{"id":291545878,"uuid":"977952502","full_name":"hardchor/ai-text-cleaner","owner":"hardchor","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-05T09:12:23.000Z","size":34,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-05T09:50:55.491Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hardchor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-05T08:36:39.000Z","updated_at":"2025-05-05T09:12:26.000Z","dependencies_parsed_at":"2025-05-05T09:52:09.710Z","dependency_job_id":null,"html_url":"https://github.com/hardchor/ai-text-cleaner","commit_stats":null,"previous_names":["hardchor/ai-text-cleaner"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardchor%2Fai-text-cleaner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardchor%2Fai-text-cleaner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardchor%2Fai-text-cleaner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardchor%2Fai-text-cleaner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hardchor","download_url":"https://codeload.github.com/hardchor/ai-text-cleaner/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254310506,"owners_count":22049471,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-15T09:19:13.769Z","updated_at":"2025-05-15T09:19:57.670Z","avatar_url":"https://github.com/hardchor.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AI Text Cleaner\n\nA simple UI, CLI and Python script to normalize text, often useful as a preprocessing step for AI models or text analysis tasks. It cleans up common typographic inconsistencies.\n\n![Web UI](web_ui.png)\n\n## Features\n\nThe script performs the following normalization steps:\n\n1. **Replace Dashes:** Converts em-dash (—, U+2014), en-dash (–, U+2013), non-breaking hyphen (-, U+2011), and mathematical minus (−, U+2212) to standard hyphens (-).\n2. **Normalize Quotes:** Converts curly double quotes (“ ”, U+201C/U+201D) and angle double quotes (« » U+00AB/U+00BB) to straight double quotes (\"). Converts curly single quotes (‘ ’, U+2018/U+2019) and angle single quotes (‹ › U+2039/U+203A) to straight single quotes (').\n3. **Strip Special Spaces:** Removes non-breaking spaces (U+00A0), narrow non-breaking spaces (U+202F), and zero-width spaces (U+200B).\n4. **Replace Ellipsis:** Converts single-character ellipsis (…, U+2026) to three periods (...).\n5. **Replace Ligatures:** Replaces fi (ﬁ, U+FB01) → 'fi', fl (ﬂ, U+FB02) → 'fl', ff (ﬀ, U+FB00) → 'ff', ffi (ﬃ, U+FB03) → 'ffi', ffl (ﬄ, U+FB04) → 'ffl'.\n6. **Replace Soft Hyphens:** Removes soft hyphens (U+00AD).\n7. **Replace Bullets:** Converts bullets (•, U+2022) to hyphens (-).\n8. **(Optional)** Collapses multiple whitespace characters into a single space (currently commented out in the code).\n\n## Requirements\n\n- Python 3\n\nNo external libraries are required.\n\n## Usage\n\n### Web UI\n\nUse Docker Compose to build and run services:\n\n```bash\ndocker-compose up --build\n```\n\nThen, access the app at `http://localhost:8501`.\n\n### Command Line Interface\n\nThe script can read text from a file specified as a command-line argument or from standard input. The normalized text is always written to standard output.\n\n**1. From a file:**\n\n```bash\npython main.py input.txt \u003e output.txt\n```\n\nReplace `input.txt` with the path to your text file. The normalized output will be saved to `output.txt`.\n\n**2. From standard input (e.g., piping):**\n\n```bash\ncat input.txt | python main.py \u003e output.txt\n```\n\nOr type directly into the terminal (press Ctrl+D to signal end-of-input):\n\n```bash\npython main.py\n\u003cPaste or type your text here\u003e\n^D\n\u003cNormalized text will be printed here\u003e\n```\n\n## Building an Executable\n\nYou can generate a standalone executable using PyInstaller via the `uvx` wrapper. This places the built binary in your local bin directory:\n\n```bash\nuvx pyinstaller normalize.py \\\n  --onefile \\\n  --name normalize \\\n  --distpath ~/.local/bin\n```\n\nAfter building, ensure `~/.local/bin` is in your `PATH` to run `normalize` directly from the command line.\n\n## License\n\nThis project does not currently have a license. Consider adding one if distributing.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhardchor%2Fai-text-cleaner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhardchor%2Fai-text-cleaner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhardchor%2Fai-text-cleaner/lists"}