https://github.com/nissl-lab/toxy
.net text extraction & export framework
- Host: GitHub
- URL: https://github.com/nissl-lab/toxy
- Owner: nissl-lab
- License: apache-2.0
- Created: 2013-11-26T23:56:11.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2025-02-23T14:03:11.000Z (9 months ago)
- Last Synced: 2025-05-14T17:05:35.603Z (6 months ago)
- Topics: dataset, export, extraction, fileformats
- Language: C#
- Homepage:
- Size: 58.4 MB
- Stars: 385
- Watchers: 39
- Forks: 109
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- awsome-dotnet - Toxy - .NET text extraction framework supports a few file formats (Office)
- awesome-csharp - Toxy - .NET text extraction framework supports a few file formats (Office)
- fucking-awesome-dotnet - Toxy - .NET text extraction framework supports a few file formats (Office / GUI - other)
README
What's Toxy
============
Toxy is a .NET data/text extraction framework similar to Apache Tika in Java. It supports many popular formats, including docx, xlsx, xls, pdf, csv, txt, epub, and html.

Why Toxy
============
In the past, we had to use IFilter to extract text from documents. Toxy, by contrast, is platform independent: it aims to support Linux as well as Windows. Toxy is also easy to use. You don't need to care much about which file extension you are extracting from, because the framework identifies the file format and extracts the data/text into a unified structure.

Toxy Objects
==================
- ToxyDocument - the data structure extracted for documents
- ToxySpreadsheet - the data structure extracted for spreadsheets
- ToxyEmail - the data structure extracted for emails
- ToxyBusinessCard - the data structure extracted for business cards
- ToxyDom - the data structure extracted for DOM-based documents
- ToxyMetadata - the data structure extracted for other files with metadata
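
The idea described under "Why Toxy" (pick a parser from the file extension, then return one unified structure regardless of format) can be sketched as follows. This is a minimal, self-contained illustration and not Toxy's actual API: `ExtractedText`, `ExtensionDispatch`, and `Extract` are hypothetical names made up for this example, and only the `.txt` and `.csv` cases are handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical unified result type (a stand-in for structures such as ToxyDocument).
public sealed class ExtractedText
{
    public string Format { get; }
    public IReadOnlyList<string> Paragraphs { get; }

    public ExtractedText(string format, IReadOnlyList<string> paragraphs)
    {
        Format = format;
        Paragraphs = paragraphs;
    }
}

// Hypothetical dispatcher: not part of Toxy, only illustrates the concept.
public static class ExtensionDispatch
{
    // Maps a file extension (case-insensitive) to a function that turns raw content into paragraphs.
    private static readonly Dictionary<string, Func<string, IReadOnlyList<string>>> Parsers =
        new Dictionary<string, Func<string, IReadOnlyList<string>>>(StringComparer.OrdinalIgnoreCase)
        {
            [".txt"] = content => SplitLines(content),
            [".csv"] = content => SplitLines(content)
                .Select(line => string.Join(" ", line.Split(',')))
                .ToList(),
        };

    // Splits content into non-empty lines, accepting both Windows and Unix line endings.
    private static List<string> SplitLines(string content) =>
        content.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries).ToList();

    // Chooses a parser from the file name's extension and returns the unified structure.
    public static ExtractedText Extract(string fileName, string content)
    {
        string ext = Path.GetExtension(fileName);
        if (!Parsers.TryGetValue(ext, out var parse))
            throw new NotSupportedException($"No parser registered for '{ext}'.");
        return new ExtractedText(ext.TrimStart('.').ToLowerInvariant(), parse(content));
    }
}
```

Under these assumptions, a caller never branches on the format itself: `ExtensionDispatch.Extract("contacts.CSV", "Ann,ann@example.com\nBob,bob@example.com")` and `ExtensionDispatch.Extract("notes.txt", "first\nsecond")` both return an `ExtractedText`, and an unregistered extension fails with `NotSupportedException`.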