{"id":20317377,"url":"https://github.com/gdamdam/sumo","last_synced_at":"2025-10-08T05:33:18.459Z","repository":{"id":22809605,"uuid":"26156314","full_name":"gdamdam/sumo","owner":"gdamdam","description":"Tool that extracts the text from web article URLs and returns word frequencies, entity recognition, an automatic summary, and more","archived":false,"fork":false,"pushed_at":"2019-01-15T15:46:39.000Z","size":35,"stargazers_count":20,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-25T13:46:12.397Z","etag":null,"topics":["automatic-summarization","content-extraction","entity-recognition","nlp","nltk","semantic-analysis","sentence-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gdamdam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-04T06:38:37.000Z","updated_at":"2024-05-30T03:07:02.000Z","dependencies_parsed_at":"2022-07-17T09:46:11.505Z","dependency_job_id":null,"html_url":"https://github.com/gdamdam/sumo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gdamdam%2Fsumo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gdamdam%2Fsumo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gdamdam%2Fsumo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gdamdam%2Fsumo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gdamdam","download_url":"https://codeload
.github.com/gdamdam/sumo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248455277,"owners_count":21106590,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic-summarization","content-extraction","entity-recognition","nlp","nltk","semantic-analysis","sentence-extraction"],"created_at":"2024-11-14T18:31:35.659Z","updated_at":"2025-10-08T05:33:13.420Z","avatar_url":"https://github.com/gdamdam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sumo 0.1 \nSumo is a tool for the semantic analysis of web articles.\nIt extracts the content from an article web page, analyzes it, and returns:\nword frequencies, entity recognition, and an automatic summary.\nIt also returns related articles that were previously analyzed, using the term vector distance.\n\n## Main requirements\n\nMongoDB \u003e=2.6.5  Python \u003e=2.7.5\n\nFor Debian and Ubuntu:\n\u003cpre\u003e\napt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc\n\u003c/pre\u003e\n\n\n## Using Docker\n\nWe provide a Dockerfile to run a dockerized Sumo server.\n\n\u003cpre\u003e\ndocker build -t sumoserver .\ndocker run -p 5000:5000 sumoserver\n\u003c/pre\u003e\n\n\n## Basic Installation\n\n\u003cpre\u003e\ngit clone https://github.com/gdamdam/sumo.git\ncd sumo\nvirtualenv ./venv\nsource venv/bin/activate\npip install -r requirements.txt\npython requirements_nltk.py\n\u003c/pre\u003e\n\n## Start\n\nJust launch the server:\n\n\u003cpre\u003e\nsudo service mongodb 
start\npython ./sumo_server.py -s IP\n\u003c/pre\u003e\n\nFor help and the full list of options, use:\n\u003cpre\u003e\npython ./sumo_server.py --help\n\u003c/pre\u003e\n\nThe server provides a REST resource to analyze a web document and store the analysis data.\n\n## API Usage\n\nThe following command returns the \u003cb\u003elist of all stored documents\u003c/b\u003e:\n\u003cpre\u003e\ncurl http://host:5000/sumo\n\u003c/pre\u003e\n\nEach stored document is labeled with an ID_DOC, in which every \u003ci\u003e/\u003c/i\u003e character in the URL\nis replaced with \u003ci\u003e\\_\\_\u003c/i\u003e (double underscore).\n\ne.g.: \n\u003cpre\u003e\n TARGET_URL: www.google.com/test\n     ID_DOC: www.google.com__test\n\u003c/pre\u003e\n\n\u003cb\u003eTo analyze a document\u003c/b\u003e and store it in the database:\n\u003cpre\u003e\ncurl http://host:5000/sumo -X POST -d 'url=TARGET_URL'\n\u003c/pre\u003e\nHTTP Status returned:\n\u003cpre\u003e\n\t201:\tCreated\t\t- the document at TARGET_URL was successfully analyzed and stored\n\t409:\tConflict\t- the TARGET_URL already exists in the storage\n\t415:\tUnsupported\t- the TARGET_URL is malformed\n\u003c/pre\u003e\n\n\u003cb\u003eTo retrieve a stored document\u003c/b\u003e analysis:\n\u003cpre\u003e\ncurl http://host:5000/sumo/ID_DOC\n\u003c/pre\u003e\nHTTP Status returned:\n\u003cpre\u003e\n\t200:\tOK\t\t\t\n\t404:\tNot Found \t- the document does not exist\n\u003c/pre\u003e\n\n\u003cb\u003eTo delete a stored document\u003c/b\u003e:\n\u003cpre\u003e\ncurl http://host:5000/sumo/ID_DOC -X DELETE\n\u003c/pre\u003e\nHTTP Status returned:\n\u003cpre\u003e\n\t204:\tNo Content\t- document deleted \n\t404:\tNot Found \t- the document does not exist\n\u003c/pre\u003e\n\nIt is possible to \u003cb\u003eretrieve the cluster of similar documents\u003c/b\u003e using the cluster resource:\n\u003cpre\u003e\ncurl http://host:5000/sumo/cluster/ID_DOC\n\u003c/pre\u003e\nHTTP Status returned:\n\u003cpre\u003e\n\t200:\tOK\n\t404:\tNot Found \t- the 
document does not exist\n\u003c/pre\u003e\n\n\n## Web Interface\n\nThe running server also provides a very minimal JavaScript web interface to interact with the API.\nThe interface is reachable at:\n\u003cpre\u003e\nhttp://host:5000\n\u003c/pre\u003e\n\nTips:\n- Single-click an ID_DOC in the index to fill the form, then click analyze to retrieve the analysis.\n- Double-click an ID_DOC in the index to delete it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgdamdam%2Fsumo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgdamdam%2Fsumo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgdamdam%2Fsumo/lists"}