https://github.com/gdamdam/sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more.
automatic-summarization content-extraction entity-recognition nlp nltk semantic-analysis sentence-extraction

# Sumo 0.1
Sumo is a tool for the semantic analysis of web articles.
It extracts the content from an article web page, analyzes it, and returns
word frequencies, entity recognition, and an automatic summary.
It also returns related articles that were previously analyzed, using the term vector distance.
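
As a rough illustration of how a term vector distance can rank related articles, here is a minimal sketch. It assumes plain word counts as the term vector and cosine distance as the metric; the README does not say which weighting or metric Sumo actually uses, and the names `term_vector` and `cosine_distance` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Return a word -> count mapping built from lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term vectors.

    Assumption: cosine distance stands in for Sumo's unspecified metric.
    """
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document shares nothing with any other document.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


doc = term_vector("MongoDB stores the analysis of each article")
other = term_vector("Each article analysis is stored in MongoDB")
unrelated = term_vector("Bananas are yellow")

# A smaller distance means the two articles share more vocabulary.
print(cosine_distance(doc, other) < cosine_distance(doc, unrelated))
```

With a metric like this, the related articles for a document would be the stored documents with the smallest distance to it.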

## Main requirements

- MongoDB >= 2.6.5
- Python >= 2.7.5

For Debian and Ubuntu:


apt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc

## Using Docker

We provide a Dockerfile to run a dockerized Sumo server.


docker build -t sumoserver .
docker run -p 5000:5000 sumoserver

## Basic Installation


git clone https://github.com/gdamdam/sumo.git
cd sumo
virtualenv ./venv
source venv/bin/activate
pip install -r requirements.txt
python requirements_nltk.py

## Start

Just launch the server:


sudo service mongodb start
python ./sumo_server.py -s IP

For help and the full list of options, run:


python ./sumo_server.py --help

The server provides a REST resource to analyze web documents and store the analysis data.

## API Usage

The following command returns the list of all stored documents:


curl http://host:5000/sumo

Each stored document is labeled with an ID_DOC, in which every / character of the URL
is replaced with \_\_ (double underscore).

e.g.:


TARGET_URL: www.google.com/test
ID_DOC: www.google.com__test
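
The mapping above can be sketched in a couple of lines of Python. The function name `url_to_doc_id` is hypothetical and only mirrors the rule described here; how Sumo treats a scheme such as `http://` or a URL that already contains `__` is not covered by this README.

```python
def url_to_doc_id(target_url):
    """Replace every '/' in the URL with a double underscore."""
    return target_url.replace("/", "__")


print(url_to_doc_id("www.google.com/test"))  # www.google.com__test
```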

To analyze a document and store it in the database:


curl http://host:5000/sumo -X POST -d 'url=TARGET_URL'

HTTP Status returned:

201: Created - the document at TARGET_URL was successfully analyzed and stored
409: Conflict - the TARGET_URL already exists in the storage
415: Unsupported - the TARGET_URL is malformed
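
For scripting, the same POST can be prepared from Python with the standard library. This is a minimal sketch, not part of Sumo: the helper names and the `http://localhost:5000` default are assumptions, and the status descriptions are copied from the list above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Status codes documented above for POST /sumo.
POST_STATUS = {
    201: "Created - document analyzed and stored",
    409: "Conflict - TARGET_URL already exists in the storage",
    415: "Unsupported - TARGET_URL is malformed",
}


def build_analyze_request(target_url, base_url="http://localhost:5000"):
    """Build (but do not send) the form-encoded POST that curl -d would send."""
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(base_url + "/sumo", data=body, method="POST")


def describe_post_status(code):
    """Map an HTTP status code to the description documented above."""
    return POST_STATUS.get(code, "Unexpected status %d" % code)


request = build_analyze_request("www.google.com/test")
print(request.get_method(), request.full_url, request.data)
```

To actually send it, pass `request` to `urllib.request.urlopen`. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx responses, so the 409 and 415 cases arrive as exceptions whose `code` attribute holds the status.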

To retrieve a stored document analysis:


curl http://host:5000/sumo/ID_DOC

HTTP Status returned:

200: OK
404: Not Found - the document does not exist

To delete a stored document:


curl http://host:5000/sumo/ID_DOC -X DELETE

HTTP Status returned:

204: No Content - document deleted
404: Not Found - the document does not exist

You can retrieve the cluster of similar documents using the cluster resource:


curl http://host:5000/sumo/cluster/ID_DOC

HTTP Status returned:

200: OK
404: Not Found - the document does not exist
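
The retrieve, delete, and cluster calls differ only in path and HTTP method, which the following sketch makes explicit. The helper names and the `http://localhost:5000` default are assumptions for illustration; the ID_DOC is inserted into the path unchanged, as in the curl commands.

```python
from urllib.request import Request


def document_request(doc_id, base_url="http://localhost:5000", delete=False):
    """Build a GET (or DELETE) request for /sumo/ID_DOC without sending it."""
    method = "DELETE" if delete else "GET"
    return Request("%s/sumo/%s" % (base_url, doc_id), method=method)


def cluster_request(doc_id, base_url="http://localhost:5000"):
    """Build a GET request for /sumo/cluster/ID_DOC without sending it."""
    return Request("%s/sumo/cluster/%s" % (base_url, doc_id), method="GET")


doc_id = "www.google.com__test"
for req in (document_request(doc_id),
            document_request(doc_id, delete=True),
            cluster_request(doc_id)):
    print(req.get_method(), req.full_url)
```

Each request can be sent with `urllib.request.urlopen`; a 404 for a missing document surfaces as `urllib.error.HTTPError`.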

## Web Interface

The running server also provides a very minimal JavaScript web interface to interact with the API.
The interface is reachable at:


http://host:5000

Tips:
- Single-click an ID_DOC in the index to fill the form, then click analyze to retrieve the analysis.
- Double-click an ID_DOC in the index to delete it.