https://github.com/derlin/swisstext
SwissText: a suite of automated tools for the creation of Swiss German corpora
- Host: GitHub
- URL: https://github.com/derlin/swisstext
- Owner: derlin
- License: other
- Created: 2018-08-07T12:40:37.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2020-07-07T15:09:09.000Z (almost 6 years ago)
- Last Synced: 2025-03-09T11:01:55.371Z (about 1 year ago)
- Language: Python
- Homepage: https://derlin.github.io/swisstext
- Size: 23.6 MB
- Stars: 5
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SwissText
A suite of automated tools for the creation of Swiss German corpora by semi-automated web crawling.
✻ Read the docs https://derlin.github.io/swisstext ✻
✻ Overview (Google Slides) http://bit.ly/swisstext-slides ✻
-------------------------------------------------
**Paper**: (LREC, 2020) \
[Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German](https://arxiv.org/abs/1912.00159)
**Citation**:
```bibtex
@article{linder2019automatic,
title={Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German},
author={Linder, Lucy and Jungo, Michael and Hennebert, Jean and Musat, Claudiu and Fischer, Andreas},
journal={arXiv preprint arXiv:1912.00159},
year={2019}
}
```
→ See the **branch [lrec](https://github.com/derlin/swisstext/tree/lrec)** and
the **repository [swisstext-lrec](https://github.com/derlin/swisstext-lrec)** for the exact version used in the paper.
-------------------------------------------------
## Installation and usage
Install the tools:
```bash
# clone
git clone git@github.com:derlin/swisstext.git
# enter the cloned repository
cd swisstext
# install (note: pass the option -d for editable mode)
./install.sh
```
This will make the following commands available:
* `st_search`: make queries to a search engine and retrieve potentially interesting URLs;
* `st_scrape`: visit URLs in order to find new sentences;
* `st_frontend`: launch a webapp to manage the database, validate and label sentences, propose new seeds and much more.
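As a quick sanity check after installation, the sketch below looks up the three commands listed above on the current `PATH`. It relies only on Python's standard library; the helper name `find_commands` is hypothetical and not part of SwissText.

```python
import shutil

# Command names taken from the list above.
SWISSTEXT_COMMANDS = ("st_search", "st_scrape", "st_frontend")


def find_commands(names=SWISSTEXT_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH.

    Hypothetical helper for illustration; it is not shipped with SwissText.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a command is reported as missing, the environment in which `./install.sh` was run is probably not the active one.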
Find more in the [documentation](https://derlin.github.io/swisstext) or the [overview slides](http://bit.ly/swisstext-slides).
-------------------------------------------------
SwissText Crawler (c) by Lucy Linder
The SwissText Crawler is licensed under a Creative Commons Attribution-NonCommercial 4.0 Unported License.
You should have received a copy of the license along with this work.
If not, see <https://creativecommons.org/licenses/by-nc/4.0/>.