https://github.com/dasantonym/node-cesspoll

:poop: Turd Miner Node Module
crawler news poopetry potty-humour

Last synced: 8 months ago

Host: GitHub
URL: https://github.com/dasantonym/node-cesspoll
Owner: dasantonym
Created: 2014-10-24T17:29:59.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2014-10-29T16:45:20.000Z (over 11 years ago)
Last Synced: 2024-03-15T16:11:15.248Z (over 2 years ago)
Topics: crawler, news, poopetry, potty-humour
Language: JavaScript
Homepage:
Size: 230 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Cesspoll #

## About ##

Node module that retrieves and saves reader comments from major German news sources.

It uses jsdom to extract posts from each website's homepage and then stores each news article together with the comments on that article.

The resulting MongoDB entries can then be further indexed and analysed.
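
To make the stored data more concrete, here is a minimal sketch of what one article-with-comments record might look like before it is written to the database. The helper `buildArticleRecord` and every field name (`source`, `url`, `title`, `fetchedAt`, `comments`, `author`, `text`) are assumptions made for illustration; they are not taken from the module's actual schema.

```javascript
// Hypothetical helper: combines an article and its comments into one
// plain object that could be stored as a single database document.
// None of these field names come from the module itself.
function buildArticleRecord(source, article, comments) {
  return {
    source: source,                // e.g. 'spiegel' or 'taz'
    url: article.url,
    title: article.title.trim(),
    fetchedAt: new Date(),
    comments: comments.map(function (comment) {
      return {
        author: comment.author || null, // anonymous comments get null
        text: comment.text.trim()
      };
    })
  };
}

// Placeholder input; the URL is not a real article.
const record = buildArticleRecord(
  'taz',
  { url: 'http://example.com/article', title: ' Example headline ' },
  [
    { author: 'reader1', text: ' First comment. ' },
    { text: 'Anonymous comment.' }
  ]
);
console.log(JSON.stringify(record, null, 2));
```

Keeping the comments nested inside the article record is only one possible layout; storing them in a separate collection keyed by article URL would work as well.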

## Sources ##

Current news sources are:

* [Spiegel Online](http://www.spiegel.de/)
* [taz](http://www.taz.de/)

## Install ##

You need Node.js, Redis and MongoDB.

Install with

```
npm install git+https://github.com/dasantonym/node-cesspoll.git
```

To run it, go to ``example/``, copy ``config.default.js`` to ``config.js``, and run

```
node app.js
```

## Analysis ##

As an optional, basic form of analysis, the comments are broken up into basic fragments and whitespace is removed. The example program from the [Hyphen](http://sourceforge.net/projects/hunspell/files/Hyphen/) library is then used together with a [hyphenation dictionary](https://www.openoffice.org/lingucomponent/download_dictionary.html) to extract syllables (see the config file). The analysis results are stored in MongoDB, and the analysis runs continuously while the index is being updated.
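
The fragmenting step could be sketched roughly as follows. The README does not say exactly what a "basic fragment" is, so this sketch assumes word-like tokens: the comment is split at anything that is not a letter or digit, which also discards the whitespace. The function name `toFragments` is hypothetical, and the syllable extraction via the Hyphen example program is left out.

```javascript
// Hypothetical sketch of the fragmenting step. Assumes a "fragment" is a
// word-like token; the module itself may split comments differently.
function toFragments(comment) {
  return comment
    // Break at every run of characters that are neither letters nor digits.
    // The `u` flag makes \p{L} match umlauts and other non-ASCII letters.
    .split(/[^\p{L}\p{N}]+/u)
    // Leading or trailing separators produce empty strings; drop them.
    .filter(function (fragment) { return fragment.length > 0; });
}

const fragments = toFragments('Völlig übertrieben!  Wer glaubt das noch?');
console.log(fragments);
```

Each resulting fragment would then be passed to the hyphenation step to obtain its syllables.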

## Notes from the author ##

This is a quick and dirty crawler for a specific art installation, so it is not meant to be a fully optimized, super-fancy news crawler.

It is not very performant: it currently re-downloads pages it has already crawled, and it only pulls new articles from the front page, plus comments for articles it has already pulled.
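
For readers who want to avoid the repeated downloads, one possible approach is to remember which URLs have already been fetched. The sketch below is not part of the module; `createCrawlFilter` and `shouldFetch` are hypothetical names, the URLs are placeholders, and the in-memory `Set` is lost when the process exits. Article pages would still have to be fetched again whenever new comments are wanted, so a filter like this only suits pages whose content does not change.

```javascript
// Hypothetical sketch: track fetched URLs so a later pass can skip them.
function createCrawlFilter() {
  const seen = new Set();
  return function shouldFetch(url) {
    if (seen.has(url)) {
      return false; // already downloaded in this process
    }
    seen.add(url);
    return true;
  };
}

const shouldFetch = createCrawlFilter();
const frontPageLinks = [
  'http://example.com/a',
  'http://example.com/b',
  'http://example.com/a' // duplicate link on the same page
];
const toDownload = frontPageLinks.filter(shouldFetch);
console.log(toDownload);
```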
