https://github.com/spraakbanken/svt-crawler
Programme for crawling SVT's API for news articles and converting the data to XML.
- Host: GitHub
- URL: https://github.com/spraakbanken/svt-crawler
- Owner: spraakbanken
- License: MIT
- Created: 2022-03-08T13:52:13.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-08T16:51:55.000Z (almost 2 years ago)
- Last Synced: 2023-02-27T06:32:55.344Z (almost 2 years ago)
- Topics: corpus, crawler
- Language: Python
- Homepage:
- Size: 37.1 KB
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SVT crawler
Programme for crawling SVT's API for news articles and converting the data to XML.
## How to run
Set up a virtual environment and install the dependencies from `requirements.txt`.
With the virtual environment activated, run:
```
python crawler.py
```
Follow the instructions given by the command line interface.
The crawling process will stop automatically when encountering too many articles that have been downloaded already.
**Note:** Due to caching issues in the SVT API, some articles may not be downloaded on the first attempt.
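The automatic stop described above can be pictured with a minimal sketch. This is not the code in `crawler.py`: the function name `crawl_until_known`, the `max_known` threshold, and the choice to count known articles in total rather than consecutively are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Set


def crawl_until_known(
    article_ids: Iterable[str],
    already_downloaded: Set[str],
    download: Callable[[str], None],
    max_known: int = 3,
) -> int:
    """Download unseen articles and stop after `max_known` known ones.

    Hypothetical helper: returns the number of newly downloaded articles.
    """
    known_seen = 0
    new_count = 0
    for article_id in article_ids:
        if article_id in already_downloaded:
            known_seen += 1
            if known_seen >= max_known:
                # Assumption: this many known articles means we have caught up.
                break
            continue
        download(article_id)
        already_downloaded.add(article_id)
        new_count += 1
    return new_count
```

A threshold above one, rather than stopping at the first known article, lets a crawl continue past a few already-stored items, which may help when a listing mixes new and old articles.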
The XML conversion step will not overwrite any previously created files unless the `--override` option is specified.
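As a rough sketch of this behaviour, the snippet below skips existing files unless an override flag is set. Only the `--override` option name comes from this README. The helper names `write_xml` and `build_parser`, and the use of `argparse`, are assumptions and may not match how `crawler.py` is actually written.

```python
import argparse
from pathlib import Path


def write_xml(path: Path, xml_text: str, override: bool = False) -> bool:
    """Write `xml_text` to `path`, skipping existing files unless `override`.

    Hypothetical helper: returns True if written, False if skipped.
    """
    if path.exists() and not override:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(xml_text, encoding="utf-8")
    return True


def build_parser() -> argparse.ArgumentParser:
    # Assumed wiring: a boolean flag that defaults to False when omitted.
    parser = argparse.ArgumentParser(description="Hypothetical XML conversion step")
    parser.add_argument(
        "--override",
        action="store_true",
        help="replace XML files created by an earlier run",
    )
    return parser
```

With this shape, a second conversion run leaves earlier output untouched by default, and passing `--override` is an explicit opt-in to replacing it.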