https://github.com/gdamdam/sumo

Tool to extract the text from web article URLs and get word frequencies, entity recognition, automatic summaries and more
automatic-summarization content-extraction entity-recognition nlp nltk semantic-analysis sentence-extraction

Host: GitHub
URL: https://github.com/gdamdam/sumo
Owner: gdamdam
License: mit
Created: 2014-11-04T06:38:37.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2019-01-15T15:46:39.000Z (about 6 years ago)
Last Synced: 2023-02-26T18:05:46.288Z (almost 2 years ago)
Topics: automatic-summarization, content-extraction, entity-recognition, nlp, nltk, semantic-analysis, sentence-extraction
Language: Python
Homepage:
Size: 34.2 KB
Stars: 18
Watchers: 2
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING

README

# Sumo 0.1
Sumo is a tool for the semantic analysis of web articles.
It extracts the content from an article web page, analyzes it, and returns
word frequencies, entity recognition, and an automatic summary.
It also returns the related articles that were previously analyzed, using the term vector distance.
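
To illustrate what word frequencies and a term vector comparison can look like, here is a minimal sketch using only the Python standard library. It is not Sumo's own code: the README does not say which distance metric Sumo uses, so cosine similarity is an assumption, and the names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = term_vector("MongoDB stores the analysis of each article")
doc_b = term_vector("Each article analysis is stored in MongoDB")
doc_c = term_vector("Completely unrelated words here")

print(term_vector("the cat and the hat").most_common(1))  # [('the', 2)]
# Documents sharing more terms score higher than unrelated ones.
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # True
```

A real pipeline would typically also remove stop words and weight terms before comparing documents; those steps are left out here to keep the sketch short.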

## Main requirements

- MongoDB >= 2.6.5
- Python >= 2.7.5

For Debian and Ubuntu:


apt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc

## Using Docker

We provide a Dockerfile to run a dockerized Sumo server.


docker build -t sumoserver .

docker run -p 5000:5000 sumoserver

## Basic Installation


git clone https://github.com/gdamdam/sumo.git

cd sumo

virtualenv ./venv

source venv/bin/activate

pip install -r requirements.txt

python requirements_nltk.py

## Start

Just launch the server:


sudo service mongodb start

python ./sumo_server.py -s IP

For help and the full list of options:


python ./sumo_server.py --help

The server provides a REST resource to analyze a web document and store the resulting analysis data.

## API Usage

The following command returns the list of all stored documents:


curl http://host:5000/sumo

Stored documents are labeled with an ID_DOC, in which every / character in the URL
is replaced with \_\_ (double underscore).

e.g.:


 TARGET_URL: www.google.com/test

     ID_DOC: www.google.com__test
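
The mapping from TARGET_URL to ID_DOC can be sketched in one line of Python. The helper name `url_to_doc_id` is hypothetical and only mirrors the rule stated above; the README does not say how Sumo treats a scheme prefix such as `http://`, so this sketch leaves that case aside.

```python
def url_to_doc_id(target_url):
    """Replace every '/' in the URL with a double underscore."""
    return target_url.replace("/", "__")


print(url_to_doc_id("www.google.com/test"))  # www.google.com__test
```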

To analyze a document and store it in the database:


curl http://host:5000/sumo -X POST -d 'url=TARGET_URL'

HTTP Status returned:


	201:	Created		- the document at TARGET_URL was successfully analyzed and stored

	409:	Conflict	- the TARGET_URL already exists in the storage

	415:	Unsupported	- the TARGET_URL is malformed
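
For scripted use, the same POST can be prepared from Python with the standard library. This is a sketch under assumptions: `localhost` stands in for `host`, and `build_analyze_request` and `POST_STATUS` are hypothetical names that are not part of Sumo. The request is only built here, not sent, so the snippet runs without a server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Meaning of the status codes listed above for POST /sumo.
POST_STATUS = {
    201: "created: document analyzed and stored",
    409: "conflict: TARGET_URL already stored",
    415: "unsupported: TARGET_URL is malformed",
}


def build_analyze_request(base_url, target_url):
    """Build (but do not send) the POST request matching the curl command."""
    # Form-encode the single 'url' field, as curl -d 'url=...' does.
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(base_url.rstrip("/") + "/sumo", data=body, method="POST")


req = build_analyze_request("http://localhost:5000", "www.google.com/test")
print(req.get_method(), req.full_url, req.data)
```

To send it, pass `req` to `urllib.request.urlopen`. A 409 or 415 response raises `urllib.error.HTTPError`, whose `code` attribute can be looked up in `POST_STATUS`.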

To retrieve a stored document analysis:


curl http://host:5000/sumo/ID_DOC

HTTP Status returned:


	200:	OK			

	404:	Not Found 	- the document does not exist

To delete a stored document:


curl http://host:5000/sumo/ID_DOC -X DELETE

HTTP Status returned:


	204:	No Content	- document deleted 

	404:	Not Found 	- the document does not exist

It is possible to retrieve the cluster of similar documents using the cluster resource:


curl http://host:5000/sumo/cluster/ID_DOC

HTTP Status returned:


	200:	OK

	404:	Not Found 	- the document does not exist
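
The retrieve, delete, and cluster resources differ only in path and HTTP method, which the following sketch makes explicit. It assumes the server listens on port 5000 (the port used in the Docker mapping) and uses `localhost` in place of `host`; `document_request` and `cluster_request` are hypothetical helper names. The requests are built but not sent.

```python
from urllib.request import Request


def document_request(base_url, doc_id, method="GET"):
    """Request for /sumo/ID_DOC; pass method='DELETE' to remove the document."""
    url = "{}/sumo/{}".format(base_url.rstrip("/"), doc_id)
    return Request(url, method=method)


def cluster_request(base_url, doc_id):
    """GET request for the cluster of documents similar to ID_DOC."""
    return Request("{}/sumo/cluster/{}".format(base_url.rstrip("/"), doc_id))


base = "http://localhost:5000"
for req in (
    document_request(base, "www.google.com__test"),
    document_request(base, "www.google.com__test", method="DELETE"),
    cluster_request(base, "www.google.com__test"),
):
    print(req.get_method(), req.full_url)
```

Each request can be sent with `urllib.request.urlopen`; a 404 response raises `urllib.error.HTTPError`.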

## Web Interface

The running server also provides a very minimal JavaScript web interface to interact with the API.
The interface is reachable at:


http://host:5000

Tips:
- Single-click an ID_DOC in the index to fill the form, then click analyze to retrieve the analysis.
- Double-click an ID_DOC in the index to delete it.