https://github.com/dasantonym/node-cesspoll
:poop: Turd Miner Node Module
crawler news poopetry potty-humour
- Host: GitHub
- URL: https://github.com/dasantonym/node-cesspoll
- Owner: dasantonym
- Created: 2014-10-24T17:29:59.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-10-29T16:45:20.000Z (over 11 years ago)
- Last Synced: 2024-03-15T16:11:15.248Z (over 2 years ago)
- Topics: crawler, news, poopetry, potty-humour
- Language: JavaScript
- Homepage:
- Size: 230 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Cesspoll #
## About ##
Node module that retrieves and saves reader comments from major German news sources.
It uses jsdom to extract posts from the websites' home pages, then stores each news article together with its comments.
The resulting MongoDB entries can then be further indexed and analysed.
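
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the CSS selectors, the function names `extractComments` and `buildArticleRecord`, and the record fields are all hypothetical. The functions accept any DOM `document`, for example the one exposed by jsdom as `new JSDOM(html).window.document`.

```javascript
// Hypothetical selectors; the real ones depend on each news site's markup.
const SELECTORS = { comment: '.comment', author: '.author', body: '.body' };

// Collect reader comments from a DOM document (for example one built by jsdom).
function extractComments(document, selectors = SELECTORS) {
  const nodes = Array.from(document.querySelectorAll(selectors.comment));
  return nodes
    .map((node) => {
      const author = node.querySelector(selectors.author);
      const body = node.querySelector(selectors.body);
      return {
        author: author ? author.textContent.trim() : null,
        text: body ? body.textContent.trim() : '',
      };
    })
    // Drop comments whose body is empty after trimming.
    .filter((comment) => comment.text.length > 0);
}

// Build the kind of record that could be stored: the article plus its comments.
// The field names are assumptions, not the module's actual schema.
function buildArticleRecord(url, title, document) {
  return {
    url,
    title,
    comments: extractComments(document),
    fetchedAt: new Date(),
  };
}
```

Keeping the selectors in one object per source would make it easier to add further news sites, since only the selectors differ between them.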
## Sources ##
Current news sources are:
* [Spiegel Online](http://www.spiegel.de/)
* [taz](http://www.taz.de/)
## Install ##
You need Node.js, Redis and MongoDB.
Install with
```
npm install git+https://github.com/dasantonym/node-cesspoll.git
```
To run it, go to ``example/``, copy ``config.default.js`` to ``config.js`` and run
```
node app.js
```
## Analysis ##
As an optional, basic form of analysis, the comments are broken up into fragments and whitespace is removed. The example program from the [Hyphen](http://sourceforge.net/projects/hunspell/files/Hyphen/) library, together with a [hyphenation dictionary](https://www.openoffice.org/lingucomponent/download_dictionary.html), is then used to extract syllables (see the config file). The results are stored in MongoDB, and the analysis runs continuously while the index is updated.
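
The fragment and syllable steps could look roughly like the sketch below. It does not call the Hyphen library. It assumes the hyphenation tool has already returned each word with a separator at the break points, and the `=` separator as well as the function names `toFragments`, `toSyllables` and `countSyllables` are assumptions made for illustration.

```javascript
// Split a comment into word fragments, dropping whitespace and punctuation.
function toFragments(comment) {
  return comment
    .split(/\s+/)
    .map((word) => word.replace(/[^\p{L}\p{N}-]/gu, ''))
    .filter((word) => word.length > 0);
}

// Split an already hyphenated word such as "Zei=tung" into syllables.
// The "=" separator is an assumption about the hyphenation tool's output.
function toSyllables(hyphenated, separator = '=') {
  return hyphenated.split(separator).filter((part) => part.length > 0);
}

// Count how often each (lower-cased) syllable occurs in a list of hyphenated words.
function countSyllables(hyphenatedWords) {
  const counts = new Map();
  for (const word of hyphenatedWords) {
    for (const syllable of toSyllables(word.toLowerCase())) {
      counts.set(syllable, (counts.get(syllable) || 0) + 1);
    }
  }
  return counts;
}

// Usage:
// toFragments('Das ist  gut, oder?')  -> ['Das', 'ist', 'gut', 'oder']
// toSyllables('Zei=tung')             -> ['Zei', 'tung']
```

A `Map` of syllable counts is only one possible shape for the stored analysis results; the actual schema is defined by the module and its config file.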
## Notes from the author ##
This is a quick and dirty crawler built for a specific art installation, so it is not meant to be a fully optimized, super-fancy news crawler.
It is not very performant: it currently re-downloads pages it has already crawled, and it only pulls new articles from the front page plus comments for articles it has already pulled.