Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler that collects and extracts content of daily published news articles
- Host: GitHub
- URL: https://github.com/thesoenke/news-crawler
- Owner: theSoenke
- License: mit
- Created: 2017-04-06T18:36:21.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-02-18T16:46:01.000Z (over 1 year ago)
- Last Synced: 2024-06-20T15:57:26.572Z (5 months ago)
- Topics: crawler, news
- Language: Go
- Homepage:
- Size: 2.64 MB
- Stars: 11
- Watchers: 5
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# news-crawler
This project crawls, extracts, indexes and processes the content of daily published news articles. The extracted content is indexed in Elasticsearch for further processing. The project also provides tooling to extract and preprocess the content for the [NoD](https://github.com/uhh-lt/NoDCore) project.
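As a rough illustration of the indexing step, the sketch below sends one article document to Elasticsearch over its REST API. It is not the project's actual code; the `Article` fields and the `news-german` index name are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Article is a hypothetical shape for an extracted news article;
// the real index mapping used by news-crawler may differ.
type Article struct {
	URL       string    `json:"url"`
	Title     string    `json:"title"`
	Content   string    `json:"content"`
	Published time.Time `json:"published"`
}

func main() {
	doc := Article{
		URL:       "https://example.com/story",
		Title:     "Example story",
		Content:   "Extracted article text",
		Published: time.Now(),
	}
	body, err := json.Marshal(doc)
	if err != nil {
		panic(err)
	}

	// Index the document via Elasticsearch's document API (index name is an assumption).
	resp, err := http.Post("http://localhost:9200/news-german/_doc", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("indexed, status:", resp.Status)
}
```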
## Docker Compose
1. `git clone https://github.com/thesoenke/news-crawler`
2. `cd news-crawler`
3. `docker-compose up --build`

This will start the crawler, Elasticsearch and Kibana. In case Elasticsearch is crashing you might need to increase the virtual memory limit for Elasticsearch:

    sysctl -w vm.max_map_count=262144

When everything works, three containers should be running.
It is also possible to generate the NoDCore input in Docker by running `make nod-docker`. To generate only the data for the previous day, run `make nod-yesterday`. If you want to run `make nod-yesterday` from a cron job, you have to provide the full path to docker-compose in the Makefile, e.g. `/usr/local/bin/docker-compose`. The output will be available in `out/nod/german`. This only works when the docker-compose setup is already running and the scraper has run at least once.
## Docker Compose NoD
Starts the crawler and [NoDWeb](https://github.com/uhh-lt/NoDWeb) and automatically runs [NoDCore](https://github.com/uhh-lt/NoDCore) when new data is available.
1. `git clone https://github.com/thesoenke/news-crawler`
2. `cd news-crawler`
3. `docker-compose -f docker-compose.nod.yml up --build`
4. Open [localhost:9000](http://localhost:9000)

## Local setup
### Install
1. Make sure [go](https://golang.org) is installed
2. `go get github.com/thesoenke/news-crawler`
3. `$GOPATH/bin` should be in your `PATH`, or run it directly with `$GOPATH/bin/news-crawler`

### Run
#### Feedreader
Download articles from a list of feeds:

    news-crawler feeds data/feeds_de.txt --lang german
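As a hedged sketch of what this step does conceptually, the snippet below reads feed URLs (assumed to be one per line, as in `data/feeds_de.txt`) and lists the article links found in each RSS feed. It is illustrative only, not the project's implementation; the minimal RSS struct covers only RSS 2.0.

```go
package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"net/http"
	"os"
)

// rss is a minimal RSS 2.0 shape, just enough to list item links;
// the real feedreader likely handles more formats.
type rss struct {
	Items []struct {
		Title string `xml:"title"`
		Link  string `xml:"link"`
	} `xml:"channel>item"`
}

func main() {
	// Assumed format: one feed URL per line.
	f, err := os.Open("data/feeds_de.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		feedURL := scanner.Text()
		if feedURL == "" {
			continue
		}
		resp, err := http.Get(feedURL)
		if err != nil {
			fmt.Println("failed to fetch:", feedURL, err)
			continue
		}
		var feed rss
		if err := xml.NewDecoder(resp.Body).Decode(&feed); err == nil {
			for _, item := range feed.Items {
				// Article URLs that would later be handed to the scraper.
				fmt.Println(item.Title, item.Link)
			}
		}
		resp.Body.Close()
	}
}
```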
#### Scraper
The scraper downloads articles found by the feedreader and indexes them in Elasticsearch:

    news-crawler scrape out/feeds/german/26-6-2017.json --lang german
An Elasticsearch instance needs to be running. Credentials can be set with the environment variables `ELASTIC_URL`, `ELASTIC_USER` and `ELASTIC_PASSWORD`. The defaults are URL `http://localhost:9200`, user `elastic` and password `changeme`.
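The sketch below shows one way these variables and their documented defaults can be picked up in Go and used for a basic-auth request against Elasticsearch. It mirrors the defaults listed above but is not the project's own code.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// getenv returns the value of an environment variable or a fallback default.
func getenv(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// Same variables and defaults the README documents.
	url := getenv("ELASTIC_URL", "http://localhost:9200")
	user := getenv("ELASTIC_USER", "elastic")
	password := getenv("ELASTIC_PASSWORD", "changeme")

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth(user, password)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("Elasticsearch reachable, status:", resp.Status)
}
```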
#### Create NoDCore input
    news-crawler nod --lang german
Generated files can be found in `out/nod/` by default. This command will only work when the scraper has run at least once to insert data into Elasticsearch.
It is also possible to only output the content from yesterday with the following command:
    news-crawler nod --from yesterday --lang german
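One plausible reading of `--from yesterday` is the 24-hour window covering the previous calendar day; the sketch below computes such a window in Go. This is an assumption about the flag's semantics, not the project's implementation.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Start of yesterday (local time) and start of today bound the export window.
	now := time.Now()
	today := time.Date(now.Year(), now.Month(), now.Day(), 0, 0, 0, 0, now.Location())
	start := today.AddDate(0, 0, -1)
	fmt.Println("export articles published from", start, "to", today)
}
```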
## Logs
- Successful runs of the feedreader, scraper and nod exporter are logged in `out/events.log`
- The feedreader writes a log of feeds that could not be fetched to `out/feeds//failures.log`
- Articles that could not be fetched are logged in the Elasticsearch index `failures-`

## Archive
All fetched webpages are by default stored on disk in `out/content///.html`
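As a hedged sketch of this archiving step, the snippet below writes fetched HTML under `out/content/`. The exact directory layout is not fully spelled out above, so the language/date/filename segments here are assumptions (the date format follows the `26-6-2017` style seen in the scraper example).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// archivePage writes fetched HTML to disk under a hypothetical
// out/content/<lang>/<date>/<name>.html layout.
func archivePage(lang, name string, html []byte) (string, error) {
	day := time.Now().Format("2-1-2006")
	dir := filepath.Join("out", "content", lang, day)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, name+".html")
	return path, os.WriteFile(path, html, 0o644)
}

func main() {
	path, err := archivePage("german", "example-article", []byte("<html></html>"))
	if err != nil {
		panic(err)
	}
	fmt.Println("archived to", path)
}
```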
## Using Kibana
When using the docker-compose setup, open [localhost:5601](http://localhost:5601) and add the index mapping. The index `news-*` will contain all languages. A language-specific index mapping can be created by using `news-`. The index `failures-` logs all failures. \
Warning: The scraper has to be run at least once to create the indices.

## Adding a new language
Before a new language can be added, a list of feeds is required. After that, the three environment variables below need to be changed in `docker-compose.yml` (see the sketch after this list):
- `CRAWLER_FEEDS_FILE` Path to a file with a list of feeds
- `CRAWLER_LANGUAGE` Language of the parser
- `CRAWLER_TIMEZONE` Timezone that should be used when storing the article publish dates
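The sketch below illustrates what these three variables mean by reading them in Go and validating the timezone; how the crawler consumes them internally is an assumption, and the example values in the comments are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// The three variables documented above; values shown in comments are hypothetical.
	feedsFile := os.Getenv("CRAWLER_FEEDS_FILE") // e.g. data/feeds_de.txt
	language := os.Getenv("CRAWLER_LANGUAGE")    // e.g. german
	timezone := os.Getenv("CRAWLER_TIMEZONE")    // e.g. Europe/Berlin

	// Article publish dates are stored relative to this timezone.
	loc, err := time.LoadLocation(timezone)
	if err != nil {
		panic(fmt.Errorf("invalid CRAWLER_TIMEZONE %q: %w", timezone, err))
	}

	fmt.Printf("feeds: %s, language: %s, publish dates stored in %s\n",
		feedsFile, language, loc)
}
```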