https://github.com/scrapinghub/webstruct
NER toolkit for HTML data
- Host: GitHub
- URL: https://github.com/scrapinghub/webstruct
- Owner: scrapinghub
- Created: 2013-07-22T10:05:49.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2024-05-03T19:37:19.000Z (6 months ago)
- Last Synced: 2024-11-06T23:12:09.074Z (6 days ago)
- Topics: crfsuite, data-science, ner
- Language: HTML
- Homepage:
- Size: 14.6 MB
- Stars: 256
- Watchers: 135
- Forks: 59
- Open Issues: 23
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
README
Webstruct
=========

.. image:: https://img.shields.io/pypi/v/webstruct.svg
   :target: https://pypi.python.org/pypi/webstruct
   :alt: PyPI Version

.. image:: https://travis-ci.org/scrapinghub/webstruct.svg?branch=master
   :target: https://travis-ci.org/scrapinghub/webstruct
   :alt: Build Status

.. image:: https://codecov.io/gh/scrapinghub/webstruct/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/scrapinghub/webstruct
   :alt: Code Coverage

.. image:: https://readthedocs.org/projects/webstruct/badge/?version=latest
   :target: http://webstruct.readthedocs.io/en/latest/
   :alt: Documentation

Webstruct is a library for creating statistical NER_ systems that work
on HTML data, i.e. a library for building tools that extract named
entities (addresses, organization names, open hours, etc.) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only
on text data. This makes it possible to define features that use HTML
structure, and also to embed annotation results back into HTML.

Read the docs_ for more info.
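
To make the idea of features that use HTML structure more concrete, here is a
minimal sketch built only on Python's standard library. It does not use
webstruct's API: ``StructuralTokenizer`` and ``html_features`` are hypothetical
names invented for this illustration, and the feature set is an assumption
about what such features might look like. Each text token is paired with the
tag that encloses it and a flag saying whether it sits inside a link, which is
information a text-only NER system would not have.

```python
from html.parser import HTMLParser


class StructuralTokenizer(HTMLParser):
    """Collect text tokens with features taken from the HTML structure.

    Hypothetical helper for illustration only; not part of webstruct.
    """

    # Void elements never get a closing tag, so they are not pushed
    # onto the stack of open elements.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, outermost first
        self.tokens = []   # list of (token, features) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Whitespace tokenization keeps the sketch short; a real system
        # would use a proper tokenizer.
        for token in data.split():
            features = {
                "token": token,
                "parent_tag": self.stack[-1] if self.stack else None,
                "inside_link": "a" in self.stack,
                "is_title": token.istitle(),
            }
            self.tokens.append((token, features))


def html_features(html):
    """Return (token, features) pairs for every text token in ``html``."""
    parser = StructuralTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = '<div><b>Scrapinghub</b> office: <a href="/map">Cork</a></div>'
    for token, features in html_features(page):
        print(token, features["parent_tag"], features["inside_link"])
```

In a statistical NER setup, feature dictionaries like these would be fed to a
sequence model together with token labels. The sketch stops at feature
extraction and does not cover training or writing annotations back into HTML.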

License is MIT.

.. _docs: http://webstruct.readthedocs.io/en/latest/
.. _NER: http://en.wikipedia.org/wiki/Named-entity_recognition

Contributing
------------

* Source code: https://github.com/scrapinghub/webstruct
* Bug tracker: https://github.com/scrapinghub/webstruct/issues

To run tests, make sure tox_ is installed, then run
``tox`` from the source root.

.. _tox: https://tox.readthedocs.io/en/latest/