https://github.com/jkseppan/shyster
Add soft hyphens to HTML documents
- Host: GitHub
- URL: https://github.com/jkseppan/shyster
- Owner: jkseppan
- License: gpl-3.0
- Created: 2022-09-24T14:00:39.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-09-30T17:10:15.000Z (over 3 years ago)
- Last Synced: 2025-04-02T00:30:11.309Z (about 1 year ago)
- Topics: hyphenation
- Language: Jupyter Notebook
- Homepage: https://jkseppan.github.io/shyster/
- Size: 361 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
shyster
================
This package addresses a gap in browser hyphenation: although I can set
`hyphens: auto;` in CSS, many browsers do a poor job of hyphenating
Finnish. Even when they have Finnish hyphenation patterns, they often fail
to recognise compound words, which should be hyphenated at compound
boundaries (saippua-kauppias, not saip-pua-kaup-pias). One solution is
to set `hyphens: manual;` and add soft hyphens at the acceptable
hyphenation points.
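To make the idea concrete, here is a minimal sketch (not part of shyster) of what "adding a soft hyphen" means at the character level. The soft hyphen is the Unicode character U+00AD, written `&shy;` in HTML; it stays invisible unless the browser breaks the line there. The helper name `join_with_soft_hyphens` is made up for this illustration.

``` python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, the character behind the HTML entity &shy;


def join_with_soft_hyphens(parts):
    """Join the parts of a word with invisible soft hyphens (hypothetical helper)."""
    return SOFT_HYPHEN.join(parts)


# Mark only the compound boundary of "saippuakauppias" as a break point.
word = join_with_soft_hyphens(["saippua", "kauppias"])

# The string is one character longer than the plain word.
print(len("saippuakauppias"), len(word))

# Make the otherwise invisible break point visible for inspection.
print(word.replace(SOFT_HYPHEN, "&shy;"))
```

With `hyphens: manual;` in effect, a browser may break this word only at the marked compound boundary.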
## Install
``` sh
pip install shyster
```
## How to use
One top-level function does it all:
``` python
import shyster
shyster.hyphenate_html_file('input.html', 'output.html', 'patterns/hyphen.tex')
```
If more control is needed:
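``` python
# Assumption: `hyphenator` is exported at the package top level,
# like `hyphenate_html_file` above.
from shyster import hyphenator

hyph_fi = hyphenator('patterns/hyph-fi.tex', righthyphenmin=2)
text = ('Jukolan talo, eteläisessä Hämeessä, seisoo erään mäen '
        'pohjaisella rinteellä, liki Toukolan kylää')
[hyph_fi(word) for word in text.replace(',', '').split()]
```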
    ['Ju-ko-lan',
     'ta-lo',
     'ete-läi-ses-sä',
     'Hä-mees-sä',
     'sei-soo',
     'erään',
     'mäen',
     'poh-jai-sel-la',
     'rin-teel-lä',
     'li-ki',
     'Tou-ko-lan',
     'ky-lää']
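The files in `patterns/` are TeX hyphenation patterns, which are normally applied with Liang's pattern-matching algorithm. The sketch below illustrates that technique in a few lines; it is not shyster's code, and shyster's actual implementation (which builds a trie) may differ in detail. The names `parse_pattern` and `hyphenate` and the four toy patterns are invented for the example and are not taken from the real Finnish pattern file.

``` python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            values[position] = int(char)  # value for the gap before the next letter
        else:
            position += 1
    return letters, values


def hyphenate(word, patterns, lefthyphenmin=2, righthyphenmin=2, hyphen="-"):
    """Hyphenate one word with Liang-style patterns (illustrative sketch)."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    levels = [0] * (len(padded) + 1)   # levels[i] is the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values is not None:
                # Every matching pattern votes; the highest value per gap wins.
                for offset, value in enumerate(values):
                    levels[start + offset] = max(levels[start + offset], value)
    pieces, last = [], 0
    # A break before word[k] is allowed when its level is odd and enough
    # letters remain on both sides.
    for k in range(lefthyphenmin, len(word) - righthyphenmin + 1):
        if levels[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return hyphen.join(pieces)


TOY_PATTERNS = ["1ko", "1la", "1lo", "1ki"]  # made-up patterns for illustration
print(hyphenate("Jukolan", TOY_PATTERNS))
print(hyphenate("talo", TOY_PATTERNS, righthyphenmin=3))
print(hyphenate("talo", TOY_PATTERNS + ["a2l"]))  # an even value inhibits the break
```

Odd values permit a break and even values forbid one, so a more specific pattern with a higher number can override a general rule. The `lefthyphenmin` and `righthyphenmin` arguments keep very short fragments from being split off either end of a word.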
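``` python
from bs4 import BeautifulSoup
from shyster import hyphenate_soup  # assumption: exported at the package top level

html = """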
Seitsemän veljestä
var veljekset = 7;
Jukolan talo, eteläisessä Hämeessä, seisoo erään mäen pohjaisella
rinteellä, liki Toukolan kylää. Sen läheisin ympäristö on kivinen
tanner, mutta alempana alkaa pellot, joissa, ennenkuin talo oli häviöön
mennyt, aaltoili teräinen vilja.
"""
soup = BeautifulSoup(html, 'lxml')
hyphenate_soup(soup, hyph_fi)
print(str(soup))
```
    Seit-se-män vel-jes-tä
    var veljekset = 7;
    Ju-ko-lan ta-lo, ete-läi-ses-sä Hä-mees-sä, sei-soo erään mäen poh-jai-sel-la
    rin-teel-lä, li-ki Tou-ko-lan ky-lää. Sen lä-hei-sin ym-pä-ris-tö on ki-vi-nen
    tan-ner, mut-ta alem-pa-na al-kaa pel-lot, jois-sa, en-nen-kuin ta-lo oli hä-vi-öön
    men-nyt, aal-toi-li te-räi-nen vil-ja.
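Note that the script line (`var veljekset = 7;`) comes through unchanged: only running text is hyphenated. The sketch below shows one way such a filter could work, using only the standard library's `html.parser` rather than BeautifulSoup. It is an illustration, not how `hyphenate_soup` is implemented; the names `TextNodeHyphenator` and `hyphenate_html` are hypothetical, and the sketch is simplified in that it drops comments and doctype declarations.

``` python
import re
from html import escape
from html.parser import HTMLParser


class TextNodeHyphenator(HTMLParser):
    """Re-emit HTML, hyphenating words in text nodes but not in skipped tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, hyphenate_word):
        super().__init__(convert_charrefs=True)
        self.hyphenate_word = hyphenate_word
        self.output = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        self.output.append(self.get_starttag_text())  # keep the tag verbatim

    def handle_startendtag(self, tag, attrs):
        self.output.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth > 0:
            self.output.append(data)  # script/style content is left untouched
        else:
            hyphenated = re.sub(
                r"\w+", lambda m: self.hyphenate_word(m.group(0)), data)
            # Character references were decoded by the parser; re-escape them.
            self.output.append(escape(hyphenated, quote=False))


def hyphenate_html(html, hyphenate_word):
    """Apply hyphenate_word to every word outside the skipped tags."""
    parser = TextNodeHyphenator(hyphenate_word)
    parser.feed(html)
    parser.close()
    return "".join(parser.output)


# A stand-in for a real hyphenator: a lookup table with two known words.
toy_hyphenations = {"Jukolan": "Ju-ko-lan", "talo": "ta-lo"}
sample = "<p>Jukolan talo</p><script>var talo = 7;</script>"
print(hyphenate_html(sample, lambda w: toy_hyphenations.get(w, w)))
```

Any callable that maps a word to its hyphenated form can be passed as `hyphenate_word`, so a hyphenator object like `hyph_fi` above would fit the same slot.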
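``` python
import re
import textwrap

# Assumption: these helpers are exported at the package top level.
from shyster import convert_exceptions, convert_patterns, hyphenator, read_patterns

# Read the TeX pattern file, closing the file handle when done
with open('patterns/hyphen.tex') as f:
    pat, ex = read_patterns(f.readlines())
trie = convert_patterns(pat)
ex = convert_exceptions(ex)
del ex['present'] # remove an exception
ex['shyster'] = ('shy', 'ster') # add or alter an exception
ex['lawyer'] = ('l', 'a', 'w', 'y', 'e', 'r') # exceptions even override {left,right}hyphenmin

# Create a hyphenator without a pattern file, then attach the trie and exceptions
hyph_en = hyphenator(None, hyphen='•')
hyph_en.trie = trie
hyph_en.exceptions = ex

text = '''
shyster: noun; 1. someone, possibly a lawyer, who behaves in an unscrupulous way;
2. the present Python library
'''
# Hyphenate each run of word characters, then wrap the result into lines
textwrap.wrap(' '.join(hyph_en(match.group(0))
                       for match in re.finditer(r'\w+', text)))
```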
    ['shy•ster noun 1 some•one pos•si•bly a l•a•w•y•e•r who be•haves in an',
     'un•scrupu•lous way 2 the pre•sent Python li•brary']
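The `l•a•w•y•e•r` result shows that an exception entry bypasses the minimum-length limits. One plausible reading, sketched below under that assumption, is that exceptions are looked up before any pattern matching happens, so the pattern path and its `lefthyphenmin`/`righthyphenmin` checks are never reached. The functions `parse_exception` and `hyphenate_with_exceptions` are hypothetical and not part of shyster; the hyphen-separated entry format follows TeX's `\hyphenation` lists, and the `(key, parts)` shape mirrors the dictionary entries used above.

``` python
def parse_exception(entry):
    """Turn a TeX-style entry such as 'shy-ster' into (key, parts)."""
    parts = tuple(entry.split("-"))
    return "".join(parts), parts


def hyphenate_with_exceptions(word, exceptions, fallback, hyphen="•"):
    """Use the exception entry if there is one, otherwise defer to fallback."""
    parts = exceptions.get(word.lower())
    if parts is not None:
        return hyphen.join(parts)  # no minimum-length checks on this path
    return fallback(word)


exceptions = dict(parse_exception(e) for e in ["shy-ster", "l-a-w-y-e-r"])


def no_breaks(word):
    """Stand-in for a pattern-based hyphenator that finds no break points."""
    return word


print(hyphenate_with_exceptions("lawyer", exceptions, no_breaks))
print(hyphenate_with_exceptions("library", exceptions, no_breaks))
```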
## Copying
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see <https://www.gnu.org/licenses/>.

The above does not apply to the files in `patterns/`, which are
distributed with this program as example input files. The Finnish
patterns are covered by the terms “Patterns may be freely distributed”
and the English ones by “Unlimited copying and redistribution of this
file are permitted as long as this file is not modified.”