Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gdamdam/sumo
Tool that extracts the text from web article URLs and returns word frequencies, entity recognition, an automatic summary, and more
- Host: GitHub
- URL: https://github.com/gdamdam/sumo
- Owner: gdamdam
- License: mit
- Created: 2014-11-04T06:38:37.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2019-01-15T15:46:39.000Z (about 6 years ago)
- Last Synced: 2023-02-26T18:05:46.288Z (almost 2 years ago)
- Topics: automatic-summarization, content-extraction, entity-recognition, nlp, nltk, semantic-analysis, sentence-extraction
- Language: Python
- Homepage:
- Size: 34.2 KB
- Stars: 18
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING
README
# Sumo 0.1
Sumo is a tool for the semantic analysis of web articles.
It extracts the content from an article web page, analyzes it, and returns
word frequencies, recognized entities, and an automatic summary.
It also returns the related articles that were previously analyzed, using the term vector distance.

## Main requirements
- MongoDB >= 2.6.5
- Python >= 2.7.5

For Debian and Ubuntu:

    apt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc

## Using Docker
We provide a Dockerfile to run a dockerized Sumo server.

    docker build -t sumoserver .
    docker run -p 5000:5000 sumoserver

## Basic Installation

    git clone https://github.com/gdamdam/sumo.git
    cd sumo
    virtualenv ./venv
    source venv/bin/activate
    pip install -r requirements.txt
    python requirements_nltk.py

## Start
Just launch the server:

    sudo service mongodb start
    python ./sumo_server.py -s IP

For help and all the available options:

    python ./sumo_server.py --help

The server provides a REST resource to analyze a web document and store the analysis data.

## API Usage
The following command returns the list of all the stored documents:

    curl http://host:5000/sumo

The stored documents are labeled with an ID_DOC, where each / character in the URL
is replaced with \_\_ (double underscore), e.g.:

    TARGET_URL: www.google.com/test
    ID_DOC: www.google.com__test

To analyze a document and store it in the database:

    curl http://host:5000/sumo -X POST -d 'url=TARGET_URL'

HTTP status codes returned:

- 201: Created - the document at TARGET_URL was successfully analyzed and stored
- 409: Conflict - the TARGET_URL already exists in the storage
- 415: Unsupported - the TARGET_URL is malformed

To retrieve a stored document analysis:

    curl http://host:5000/sumo/ID_DOC

HTTP status codes returned:

- 200: OK
- 404: Not Found - the document does not exist

To delete a stored document:

    curl http://host:5000/sumo/ID_DOC -X DELETE

HTTP status codes returned:

- 204: No Content - the document was deleted
- 404: Not Found - the document does not exist

The cluster of similar documents can be retrieved through the cluster resource:

    curl http://host:5000/sumo/cluster/ID_DOC

HTTP status codes returned:

- 200: OK
- 404: Not Found - the document does not exist

## Web Interface
The running server also provides a very minimal JavaScript web interface to interact with the API.
The interface is reachable at:

    http://host:5000

Tips:

- Single-click an ID_DOC in the index to fill the form, then click analyze to retrieve the analysis.
- Double-click an ID_DOC in the index to delete it.
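
The calls described in the API Usage section can also be scripted. The following is a minimal client-side sketch, not part of Sumo itself: it uses only the Python 3 standard library, and the names `BASE_URL`, `doc_id`, `analyze_request`, `document_request` and `cluster_request` are hypothetical helpers introduced here for illustration. It assumes the server is reachable at `localhost:5000` and that an ID_DOC is obtained exactly as in the example above, by replacing every / with a double underscore.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a Sumo server is running locally on port 5000.
BASE_URL = "http://localhost:5000/sumo"


def doc_id(target_url):
    """Map a TARGET_URL to its ID_DOC by replacing each '/' with '__'."""
    return target_url.replace("/", "__")


def analyze_request(target_url, base_url=BASE_URL):
    """Build the POST request that asks the server to analyze and store a URL.

    The body is form-encoded, like `curl -d 'url=TARGET_URL'`, except that
    urlencode also percent-encodes reserved characters such as '/'.
    """
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(base_url, data=body, method="POST")


def document_request(target_url, method="GET", base_url=BASE_URL):
    """Build a GET (retrieve) or DELETE request for a stored document."""
    return Request("{}/{}".format(base_url, doc_id(target_url)), method=method)


def cluster_request(target_url, base_url=BASE_URL):
    """Build the GET request for the cluster of documents similar to a stored one."""
    return Request("{}/cluster/{}".format(base_url, doc_id(target_url)), method="GET")
```

Passing any of these requests to `urllib.request.urlopen` sends it to the server. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx responses, so the 404, 409 and 415 cases listed above surface as exceptions whose `code` attribute holds the status.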