https://github.com/scrapinghub/webstruct
NER toolkit for HTML data
- Host: GitHub
- URL: https://github.com/scrapinghub/webstruct
- Owner: scrapinghub
- Created: 2013-07-22T10:05:49.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2024-05-03T19:37:19.000Z (about 2 years ago)
- Last Synced: 2025-09-25T12:22:12.599Z (9 months ago)
- Topics: crfsuite, data-science, ner
- Language: HTML
- Homepage:
- Size: 14.6 MB
- Stars: 259
- Watchers: 129
- Forks: 59
- Open Issues: 23
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
README
Webstruct
=========
.. image:: https://img.shields.io/pypi/v/webstruct.svg
    :target: https://pypi.python.org/pypi/webstruct
    :alt: PyPI Version

.. image:: https://travis-ci.org/scrapinghub/webstruct.svg?branch=master
    :target: https://travis-ci.org/scrapinghub/webstruct
    :alt: Build Status

.. image:: https://codecov.io/gh/scrapinghub/webstruct/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/scrapinghub/webstruct
    :alt: Code Coverage

.. image:: https://readthedocs.org/projects/webstruct/badge/?version=latest
    :target: http://webstruct.readthedocs.io/en/latest/
    :alt: Documentation

Webstruct is a library for creating statistical NER_ systems that work
on HTML data, i.e. a library for building tools that extract named
entities (addresses, organization names, opening hours, etc.) from web pages.
Unlike most NER systems, webstruct works on HTML data, not only
on text data. This makes it possible to define features that use the HTML
structure, and to embed annotation results back into the HTML.
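
To illustrate what a feature that uses HTML structure can look like, here is
a minimal sketch built only on Python's standard ``html.parser`` module. It
does not use webstruct's API: the names ``TokenFeatureExtractor`` and
``extract_token_features``, and the feature keys, are hypothetical and chosen
for this example only. The sketch assumes whitespace tokenization is good
enough for the illustration.

.. code-block:: python

    from html.parser import HTMLParser


    class TokenFeatureExtractor(HTMLParser):
        """Collect text tokens with features derived from the HTML structure.

        Illustrative only; this is not part of webstruct.
        """

        # Elements that never have a closing tag, so they are not tracked.
        VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

        def __init__(self):
            super().__init__()
            self._open_tags = []  # stack of currently open element names
            self.tokens = []      # list of (token, feature dict) pairs

        def handle_starttag(self, tag, attrs):
            if tag not in self.VOID_TAGS:
                self._open_tags.append(tag)

        def handle_endtag(self, tag):
            # Pop back to the matching open tag; ignore stray closing tags.
            if tag in self._open_tags:
                while self._open_tags.pop() != tag:
                    pass

        def handle_data(self, data):
            parent = self._open_tags[-1] if self._open_tags else None
            for token in data.split():
                features = {
                    # Plain-text features, available to any NER system.
                    "lower": token.lower(),
                    "is_title": token.istitle(),
                    # Structural features, available only with HTML input.
                    "parent_tag": parent,
                    "inside_link": "a" in self._open_tags,
                }
                self.tokens.append((token, features))


    def extract_token_features(html):
        """Return (token, features) pairs for an HTML string."""
        parser = TokenFeatureExtractor()
        parser.feed(html)
        parser.close()
        return parser.tokens


    if __name__ == "__main__":
        sample = '<p>Visit <a href="/x">Acme Corp</a> today</p>'
        for token, features in extract_token_features(sample):
            print(token, features)

Feature dictionaries of this shape could then be passed to a sequence
labelling model; which features are actually useful depends on the data.
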
Read the docs_ for more info.
License is MIT.
.. _docs: http://webstruct.readthedocs.io/en/latest/
.. _NER: http://en.wikipedia.org/wiki/Named-entity_recognition
Contributing
------------
* Source code: https://github.com/scrapinghub/webstruct
* Bug tracker: https://github.com/scrapinghub/webstruct/issues
To run the tests, make sure tox_ is installed, then run
``tox`` from the source root.
.. _tox: https://tox.readthedocs.io/en/latest/