Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/clemsos/dumb-crawler
A crawler to extract keywords from all links related to a page
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/clemsos/dumb-crawler
- Owner: clemsos
- Created: 2012-02-15T12:49:57.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2012-02-19T17:20:21.000Z (almost 13 years ago)
- Last Synced: 2024-10-14T01:23:31.794Z (2 months ago)
- Language: Python
- Homepage:
- Size: 97.7 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.markdown
README
Dumb crawler
----
A simple crawler to retrieve keywords from all links contained in a webpage.

HOW TO USE IT
----
To extract keywords from a file:

    python termtopia.py file.html

HOW IT WORKS
---
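
As a rough illustration of steps 1, 4 and 5 listed in this section, here is a minimal Python 3 sketch that uses only the standard library. It is not the code from `termtopia.py`: `html.parser` stands in for Beautiful Soup, the keyword lists are hand-written instead of coming from Boilerpipe and topia.termextract, and the names `LinkCollector`, `extract_links` and `match_keywords`, as well as the `cutoff` value, are hypothetical. How the script itself calls difflib is not documented here; `difflib.get_close_matches` is one plausible way to do fuzzy matching.

```python
import difflib
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (stdlib stand-in for Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Step 1: return every link target found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def match_keywords(keywords_a, keywords_b, cutoff=0.8):
    """Step 4: map each keyword in keywords_a to its closest match in keywords_b."""
    matches = {}
    for keyword in keywords_a:
        close = difflib.get_close_matches(keyword, keywords_b, n=1, cutoff=cutoff)
        if close:
            matches[keyword] = close[0]
    return matches


if __name__ == "__main__":
    page = '<p><a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a></p>'
    print(extract_links(page))
    # Hand-written keyword lists; in the real script they would come from steps 2 and 3.
    matched = match_keywords(["web crawler", "keyword"], ["web crawlers", "keywords", "java"])
    # Step 5: serialize the matches as JSON.
    print(json.dumps(matched, indent=2))
```

A lower `cutoff` accepts looser matches, at the cost of pairing unrelated keywords.
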
1. retrieve all URLs from a page (using Beautiful Soup)
2. retrieve content from each page (using boilerpipe java-python)
3. extract keywords from content (using topia.termextract)
4. match all keywords (using difflib)
5. output the result as JSON (using simplejson)

Installation
---
This script requires JAVA_HOME to be configured and JDK 7 to run Boilerpipe.

On Ubuntu:

```bash
$ sudo apt-get install python-jpype
$ find /usr/lib/jvm/ | grep jni.h
$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.xx
```
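
Before running the script, it can help to check that JAVA_HOME points at a JDK that ships `jni.h`. The following Python 3 sketch is one way to do that. It is not part of this repository: `find_jni_header` and `check_java_home` are hypothetical helpers, and the check assumes the usual JDK layout in which the header lives at `$JAVA_HOME/include/jni.h`.

```python
import os
from pathlib import Path


def find_jni_header(java_home):
    """Return the path to jni.h under java_home, or None if it is missing."""
    # Assumption: standard JDK layout with headers in <JAVA_HOME>/include.
    candidate = Path(java_home) / "include" / "jni.h"
    return candidate if candidate.is_file() else None


def check_java_home(environ=os.environ):
    """Return (ok, message) describing whether JAVA_HOME looks usable."""
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if find_jni_header(java_home) is None:
        return False, "jni.h not found under %s/include" % java_home
    return True, "JAVA_HOME looks usable: %s" % java_home


if __name__ == "__main__":
    ok, message = check_java_home()
    print(message)
```

If the check fails, the `find` command above shows where `jni.h` actually lives, and JAVA_HOME should point at the directory that contains that `include` folder.
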
More here: http://blog.notmyidea.org/using-jpype-to-bridge-python-and-java.html

Thanks
---
- Python-Boilerpipe: https://github.com/misja/python-boilerpipe
- Onnno for topia use: http://goo.gl/0BMHr