https://github.com/clemsos/dumb-crawler

A crawler to extract keywords from all links related to a page

Host: GitHub
URL: https://github.com/clemsos/dumb-crawler
Owner: clemsos
Created: 2012-02-15T12:49:57.000Z (over 13 years ago)
Default Branch: master
Last Pushed: 2012-02-19T17:20:21.000Z (over 13 years ago)
Last Synced: 2025-01-22T13:25:30.620Z (4 months ago)
Language: Python
Homepage:
Size: 97.7 KB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.markdown

README

Dumb crawler
----
A simple crawler to retrieve keywords from all links contained in a webpage

HOW TO USE IT
----
To extract keywords from a file:
python termtopia.py file.html

HOW IT WORKS
---
1. retrieve all URLs from a page (using BeautifulSoup)
2. retrieve the content of each linked page (using Boilerpipe, a Java library called from Python)
3. extract keywords from the content (using topia.termextract)
4. match all keywords (using difflib)
5. output the results as JSON (using simplejson)
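
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `termtopia.py`: `html.parser` stands in for BeautifulSoup in step 1, `json` stands in for simplejson in step 5, and steps 2 and 3 are left out because they need Boilerpipe and topia.termextract. The names `LinkCollector`, `extract_links`, `group_similar` and `to_json`, and the `cutoff` value, are hypothetical. Reading "match all keywords" as fuzzy grouping of near-identical terms is also an assumption.

```python
import difflib
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (stand-in for step 1)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return every link found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def group_similar(keywords, cutoff=0.8):
    """Group keywords whose difflib ratio to a group's first member is >= cutoff.

    Assumed interpretation of step 4; the cutoff is an arbitrary choice.
    """
    groups = []
    for word in keywords:
        for group in groups:
            if difflib.SequenceMatcher(None, word, group[0]).ratio() >= cutoff:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


def to_json(url_keywords):
    """Serialize a {url: [keywords]} mapping (stand-in for step 5)."""
    return json.dumps(url_keywords, indent=2, sort_keys=True)
```

A possible call sequence is `extract_links(page_html, page_url)`, then keyword extraction for each link with the real libraries, then `group_similar(all_keywords)` and `to_json(results)`.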

Installation
---
This script requires JAVA_HOME to be configured and JDK 7 to run Boilerpipe
```bash
# On Ubuntu:

$ sudo apt-get install python-jpype
$ find /usr/lib/jvm/ | grep jni.h
$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.xx
```
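
Before running the script, it can help to confirm that `JAVA_HOME` points at a JDK that ships `jni.h`, which is what the `find` command above looks for. The following is a small sketch of that check in Python; the helper name `find_jni_header` is hypothetical and is not part of this repository or of JPype.

```python
import os


def find_jni_header(java_home):
    """Return the path to jni.h under java_home, or None if it is missing."""
    if not java_home:
        return None
    for root, _dirs, files in os.walk(java_home):
        if "jni.h" in files:
            return os.path.join(root, "jni.h")
    return None


if __name__ == "__main__":
    java_home = os.environ.get("JAVA_HOME")
    header = find_jni_header(java_home)
    if header is None:
        print("JAVA_HOME is unset or does not contain jni.h:", java_home)
    else:
        print("Found JNI header at", header)
```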
More details here: http://blog.notmyidea.org/using-jpype-to-bridge-python-and-java.html

Thanks
---
Python-Boilerpipe: https://github.com/misja/python-boilerpipe
Onnno for topia use: http://goo.gl/0BMHr
