https://github.com/catseye/t-rext
A command-line tool that attempts to rectify punctuation and spacing in (generated) text files
filtering sanitization text-processing text-sanitization
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/catseye/t-rext
- Owner: catseye
- License: unlicense
- Created: 2015-10-12T09:43:35.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2022-03-27T12:54:19.000Z (almost 3 years ago)
- Last Synced: 2023-04-01T10:21:10.375Z (almost 2 years ago)
- Topics: filtering, sanitization, text-processing, text-sanitization
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
T-Rext
======

T-Rext is a command-line filter that attempts to clean up spacing,
punctuation, and capitalization in a text file. Its purpose is so that,
when you are writing a text generator, such as a Markov processor, you
need not worry too much about its output format; just toss its output
through T-Rext when you're done to make it more presentable.

The current version of T-Rext is 0.3, which runs under either Python 2.7
or Python 3.x. Docker images based on appropriate versions of CPython
for each version are [available on Docker Hub][].

Usage
-----

### Usage from the Command Line

    bin/t-rext raw_output.txt > cleaned_output.txt

This will take lines that look like this:

    " Well , " said the king , , " no . "

and reformat them to look like this:

    “Well,” said the king, “no.”
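
To give a feel for the kind of rewriting involved, the example above can be approximated in a few lines of Python 3. This is only a minimal sketch and not T-Rext's actual implementation; the function name `tidy` is made up for illustration, and it only handles a single line containing straight double quotes.

```python
import re


def tidy(line):
    """Hypothetical helper: approximate a few T-Rext-style fixes on one line."""
    # Drop whitespace that comes before a comma or a period.
    line = re.sub(r"\s+([,.])", r"\1", line)
    # Collapse a run of commas into a single comma.
    line = re.sub(r",{2,}", ",", line)

    out = []
    inside = False       # True while between an opening and a closing quote
    skip_spaces = False  # True right after an opening quote
    for ch in line:
        if ch == '"':
            if inside:
                # Closing quote: discard the spaces just before it.
                while out and out[-1] == " ":
                    out.pop()
                out.append("\u201d")
            else:
                # Opening quote: spaces just after it will be skipped.
                out.append("\u201c")
            skip_spaces = not inside
            inside = not inside
            continue
        if skip_spaces and ch == " ":
            continue
        skip_spaces = False
        out.append(ch)
    return "".join(out)


print(tidy('" Well , " said the king , , " no . "'))
```

Run on the raw line from the example, this should print the same cleaned line shown above. It knows nothing about paragraphs, apostrophes, or capitalization, all of which the real tool also deals with.
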
To use T-Rext from any working directory, add the `bin` directory in this
repository to your `PATH`. For example, you might add this line to your
`.bashrc`:

    export PATH=/path/to/this/repo/bin:$PATH

An easy way to accomplish the above is to install [shelf][], then
dock T-Rext using

    shelf_dockgh catseye/T-Rext

### Usage from Python
T-Rext is built on an over-engineered library of pipeline processors, which
you can use directly. Note that its interface is not stable and is liable to change.
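
For readers unfamiliar with the idea, a pipeline of processors can be pictured roughly as follows. This is a generic sketch of the pattern only: the class names `Processor`, `StripTrailingWhitespace`, and `CollapseCommas` and the function `run_pipeline` are hypothetical and are not T-Rext's real classes or signatures.

```python
class Processor(object):
    """Hypothetical base class: wraps an iterable of lines, yields new lines."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield self.process(line)

    def process(self, line):
        # Default behaviour: pass the line through unchanged.
        return line


class StripTrailingWhitespace(Processor):
    def process(self, line):
        return line.rstrip()


class CollapseCommas(Processor):
    def process(self, line):
        # Repeat until no doubled commas remain.
        while ",," in line:
            line = line.replace(",,", ",")
        return line


def run_pipeline(lines, stages):
    """Chain the stages so each one consumes the previous one's output."""
    for stage in stages:
        lines = stage(lines)
    return list(lines)


raw = ["Well,, yes.   ", "No,,, thanks.\t"]
cleaned = run_pipeline(raw, [StripTrailingWhitespace, CollapseCommas])
print(cleaned)  # ['Well, yes.', 'No, thanks.']
```

Because each stage only iterates over the stage before it, lines flow through the whole chain one at a time, and stages can be added or reordered without touching the others.
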

To use the T-Rext Python modules in other Python programs, make sure the
`src` directory of this repository is on your `PYTHONPATH`. For example,
you might add this line to your `.bashrc`:

    export PYTHONPATH=/path/to/this/repo/src:$PYTHONPATH

Then you can add imports like this to the top of your script:

    from t_rext.processors import TrailingWhitespaceProcessor
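
If editing `.bashrc` is not convenient, much the same effect can be had from inside a script by extending `sys.path` before the import. The following is a sketch under that assumption; `/path/to/this/repo` is the same placeholder as above and must be replaced with the real location of your checkout, and the name `REPO_SRC` is made up for illustration.

```python
import os
import sys

# Placeholder location of a T-Rext checkout; adjust for your machine.
REPO_SRC = os.path.join("/path/to/this/repo", "src")

# Put the `src` directory at the front of the module search path.
if REPO_SRC not in sys.path:
    sys.path.insert(0, REPO_SRC)

try:
    from t_rext.processors import TrailingWhitespaceProcessor
except ImportError:
    # T-Rext was not found under REPO_SRC; fall back to a marker value.
    TrailingWhitespaceProcessor = None
```

This only affects the running script, so it is handy for one-off experiments, while the `PYTHONPATH` approach suits regular use.
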

Tests
-----

This is a test suite, written in [Falderal][] format, for the `t-rext`
utility. It also serves as documentation for said utility.

    -> Tests for functionality "Clean up punctuation and spaces"

Spaces before commas and periods are elided.

    | Well , that is good .
    = Well, that is good.

Multiple commas are collapsed into a single comma.

    | Well , , that is good .
    = Well, that is good.

Multiple periods are not collapsed into a single period.

    | Well . . . that is good.
    = Well... that is good.

Quotes are oriented.

    | "Yes," he said.
    = “Yes,” he said.

Single spaces after opening quotes and before closing quotes are elided.

    | " Yes , " he said.
    = “Yes,” he said.

But not the other way 'round.

    | Muttering "Yes," he turned around.
    = Muttering “Yes,” he turned around.

Multiple spaces after opening quotes and before closing quotes are elided.


    | "   Yes ,   " he said.
    = “Yes,” he said.

But not the other way 'round.

    | Muttering "Yes," he turned around.
    = Muttering “Yes,” he turned around.

Quotes do not match across paragraphs.


    | Turbid "Waters" that "leak.
    |
    | You "don't" have a clue.
    = Turbid “Waters” that “leak.
    =
    = You “don't” have a clue.

Single spaces before apostrophes are elided in some situations.

    | It wasn 't Arthur 's car.
    = It wasn't Arthur's car.

Punctuation at the beginning of a line is elided in some cases.

    | , where he said so.
    = Where he said so.

Capitalization is applied at the beginning of a line, and the
beginning of a sentence.

    | , where. he said so.
    = Where. He said so.

    | Really? that was... so
    = Really? That was... so

Two full stops becomes an ellipsis. Full stop then comma becomes
just a comma.

    | It was.. the nice., thing.
    = It was... the nice, thing.

[Falderal]: https://catseye.tc/node/Falderal
[shelf]: https://catseye.tc/node/shelf
[available on Docker Hub]: https://hub.docker.com/r/catseye/t-rext
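
As a rough illustration of the last few rules documented in the tests (leading punctuation, capitalization, two full stops, and a full stop followed by a comma), here is a minimal Python 3 sketch. It is not how T-Rext itself is written; the function names `fix_stops` and `capitalize_sentences` are hypothetical, and the regular expressions are just one plausible way to get the documented results for these particular inputs.

```python
import re


def fix_stops(line):
    """Hypothetical helper for the two full-stop rules."""
    # Exactly two full stops (not part of a longer run) become an ellipsis.
    line = re.sub(r"(?<!\.)\.\.(?!\.)", "...", line)
    # A full stop directly followed by a comma becomes just the comma.
    return line.replace(".,", ",")


def capitalize_sentences(line):
    """Hypothetical helper for leading punctuation and capitalization."""
    # Drop commas and whitespace at the very start of the line.
    line = re.sub(r"^[,\s]+", "", line)
    # Capitalize the first character of the line.
    line = line[:1].upper() + line[1:]
    # Capitalize a lowercase letter after ". ", "? " or "! ",
    # but not after an ellipsis (the lookbehind rejects a preceding dot).
    return re.sub(
        r"((?<!\.)[.?!]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        line,
    )


for raw in [", where. he said so.",
            "Really? that was... so",
            "It was.. the nice., thing."]:
    print(capitalize_sentences(fix_stops(raw)))
```

For the three inputs in the loop, the printed lines should match the expected outputs given in the corresponding tests above.
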