https://github.com/suhlig/httpspell

Spellchecker that recursively fetches HTML pages, converts them to plain text, and spellchecks them.
academic http spellcheck spider

Host: GitHub
URL: https://github.com/suhlig/httpspell
Owner: suhlig
Created: 2018-05-30T12:49:35.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2024-11-01T09:55:10.000Z (over 1 year ago)
Last Synced: 2025-01-28T23:38:13.953Z (about 1 year ago)
Topics: academic, http, spellcheck, spider
Language: Ruby
Size: 211 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.markdown

# `httpspell`

This is a spellchecker that recursively fetches HTML pages, converts them to plain text (using [pandoc](http://pandoc.org/)), and spellchecks them with [hunspell](https://hunspell.github.io/). Unknown words will be printed to `stdout`, which makes the tool a good candidate for CI pipelines where you might want to take action when a spelling error is found on a web page.

Words that are not in the dictionary for the given language (inferred from the `lang` attribute of the HTML document's root element) can be added to a personal dictionary, which will mark the word as correctly spelled.

# Usage

* The following command will retrieve the HTML document at https://example.com, spellcheck it, and not print anything because there are no errors:

```bash
$ httpspell https://example.com
```

The exit code is `0`.

* The following command will spellcheck the README of this project as rendered by GitHub and print a list of unknown words. Note that the language is set to `en_US` because GitHub declares `en` as the document language, whereas the installed dictionaries usually refer to a specific language variant such as `en_US`:

```bash
$ httpspell https://github.com/suhlig/httpspell/blob/master/README.markdown --language en_US
suhlig
Permalink
httpspell
sloc
pandoc
hunspell
...
```

The exit code is `1`.
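
Because unknown words go to `stdout` and the exit code separates a clean page from one with findings, a CI step can gate on the result. The following is a minimal sketch, not part of `httpspell`: the `check_page` function name and its report format are made up for illustration, and it assumes `httpspell` is on the `PATH`.

```bash
# Hypothetical helper (not part of httpspell): spellcheck one URL and
# print a short report. It relies only on the behaviour described above:
# unknown words are written to stdout and the exit code is non-zero
# when any are found.
check_page() {
  local url="$1"
  local words
  # Assign separately from `local` so the exit code of httpspell is kept.
  if words="$(httpspell "$url")"; then
    echo "OK: $url"
  else
    echo "Unknown words on $url:"
    # De-duplicate and indent the word list for readability.
    printf '%s\n' "$words" | sort -u | sed 's/^/  /'
    return 1
  fi
}
```

In a pipeline, a line such as `check_page https://example.com || exit 1` would then fail the job whenever the page contains unknown words.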

# What is *not* checked

* When spidering a site, `httpspell` skips all responses with a `content-type` header other than `text/html` (unless it is pointed at a file, in which case it accepts anything).
* Before converting, `httpspell` removes the following nodes from the HTML DOM because they are not good targets for spellchecking:
  - `code`
  - `pre`
  - Elements with `spellcheck='false'` (this is how HTML5 allows marking whether an element should be spellchecked)
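
The content-type rule above can be illustrated with a small shell predicate, for example to pre-filter a list of URLs before handing them to the tool. This is only a sketch: `is_html` is a hypothetical name, and it assumes that header parameters such as `; charset=utf-8` are ignored and that the media type is compared case-insensitively. How `httpspell` itself compares the header is not described here.

```bash
# Hypothetical helper: succeed only if a Content-Type header value
# denotes HTML. Assumption: parameters after ';' are ignored and the
# comparison is case-insensitive.
is_html() {
  local media_type
  # Drop parameters after the first ';', remove whitespace, lowercase.
  media_type="$(printf '%s' "${1%%;*}" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')"
  [ "$media_type" = "text/html" ]
}
```

For example, `is_html "text/html; charset=utf-8"` succeeds, while `is_html "application/pdf"` fails.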

# Misc

If you produce content with kramdown (e.g. using Jekyll), an [Inline Attribute List](https://kramdown.gettalong.org/syntax.html#inline-attribute-lists) can be used to set `spellcheck='false'` for an element by adding this line *after* the element (e.g. a heading):

```
{: spellcheck="false"}
```

# Dictionaries

Hunspell uses the system dictionary paths; on the Mac this is `~/Library/Spelling/`. Get some dictionaries as explained in the [hunspell](https://github.com/hunspell/hunspell) project:

```command
$ wget -O ~/Library/Spelling/en_US.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff
$ wget -O ~/Library/Spelling/en_US.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic
```

German:

```command
$ wget -O ~/Library/Spelling/de_DE.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/de_DE_frami.dic
$ wget -O ~/Library/Spelling/de_DE.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/de_DE_frami.aff
```

Italian (for integration tests):

```command
$ wget -O ~/Library/Spelling/it_IT.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/it_IT/it_IT.dic
$ wget -O ~/Library/Spelling/it_IT.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/it_IT/it_IT.aff
```
