https://github.com/norconex/importer
Norconex Importer is a Java library and command-line application that parses a file and extracts its content as plain text, whatever its format (HTML, PDF, Word, etc.). It also lets you manipulate the extracted text before using it in your own service or application.
- Host: GitHub
- URL: https://github.com/norconex/importer
- Owner: Norconex
- License: apache-2.0
- Created: 2013-09-17T15:24:48.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2024-10-15T01:10:46.000Z (over 1 year ago)
- Last Synced: 2025-05-20T09:06:32.373Z (about 1 year ago)
- Topics: extract, html, java, java-library, manipulation, norconex-importer, parse, pdf
- Language: Java
- Homepage: http://www.norconex.com/collectors/importer/
- Size: 6.4 MB
- Stars: 34
- Watchers: 15
- Forks: 23
- Open Issues: 15
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.xml
- License: LICENSE.txt
README
Importer
==========
Norconex Importer is a Java library and command-line application meant to
"parse" and "extract" content out of a computer file as plain text, whatever
its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any
manipulation on the extracted text before importing/using it in your own
service or application.
Website: https://opensource.norconex.com/importer/