https://github.com/amrsa1/uniparser
- Host: GitHub
- URL: https://github.com/amrsa1/uniparser
- Owner: amrsa1
- License: mit
- Created: 2024-10-16T23:43:39.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-10-19T12:13:45.000Z (3 months ago)
- Last Synced: 2024-10-26T21:35:14.885Z (2 months ago)
- Language: JavaScript
- Size: 223 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# **UniParser**: Universal File Parsing for Node.js
**UniParser** is a lightweight Node.js library that parses multiple file formats, such as **PDF**, **DOCX**, **TXT**, **HTML**, and **Markdown**, and converts them into **plain text**.
**Say goodbye to file format limitations!** UniParser extracts text content from all these formats, providing a consistent text output for your applications.
---
## **Features**
- **PDF Parsing**: Extracts plain text from PDF documents.
- **DOCX Parsing**: Reads and extracts text from Microsoft Word `.docx` files.
- **TXT Parsing**: Handles plain text files with no special formatting.
- **HTML Parsing**: Extracts text from the body of HTML documents.
- **Markdown Parsing**: Converts Markdown files to plain text, stripping out all formatting syntax.
- **Auto-detection**: Automatically detects the file format and parses it using the `autoParse` function.
---
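To give a feel for what "stripping out all formatting syntax" can involve, here is a deliberately naive sketch. It is not UniParser's implementation; the function name `stripMarkdown` is hypothetical, and it handles only headings, images, links, asterisk-style emphasis, and inline code.

```javascript
// Naive Markdown-to-text sketch (an illustration, not UniParser's code).
// It covers a handful of common constructs and will misfire on edge cases,
// for example literal asterisks in arithmetic such as "2 * 3 * 4".
function stripMarkdown(markdown) {
  return markdown
    .replace(/^#{1,6}\s+/gm, '')              // drop heading markers
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, '$1') // images: keep the alt text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')  // links: keep the link text
    .replace(/\*\*(.+?)\*\*/g, '$1')          // bold written with asterisks
    .replace(/\*(.+?)\*/g, '$1')              // italic written with asterisks
    .replace(/`([^`]+)`/g, '$1');             // inline code
}
```

A real converter would normally walk a parsed Markdown tree rather than rely on regular expressions, which is why a library is preferable for anything beyond simple input.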
## **Installation**
To install **UniParser**, simply run:
```bash
npm install uniparser
```
---
## **Usage**
### **CommonJS (CJS) Example**
If you're working in a Node.js environment with CommonJS (CJS), use `require()` to import UniParser:
```javascript
const { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } = require('uniparser');

// CommonJS modules do not support top-level await, so async calls go inside an async function.
(async () => {
  // Example: Automatically detect and parse a file
  const parsedText = await autoParse('./path/to/sample-file.pdf');
  console.log(parsedText);

  // Example: Parse specific file types
  const pdfText = await parsePDF('./path/to/sample-file.pdf');
  const docxText = await parseDOCX('./path/to/sample-file.docx');
  const txtText = parseTXT('./path/to/sample-file.txt');
  const htmlText = parseHTML('./path/to/sample-file.html');
  const markdownText = parseMarkdown('./path/to/sample-file.md');
})();
```
### **ES Modules (ESM) Example**
If you're working in an ES Module environment (modern JavaScript), use `import` to load the functions:
```javascript
import { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } from 'uniparser';

// Example: Automatically detect and parse a file
(async () => {
  const parsedText = await autoParse('./path/to/sample-file.pdf');
  console.log(parsedText);
})();

// Example: Parse specific file types (top-level await is allowed in ES modules)
const pdfText = await parsePDF('./path/to/sample-file.pdf');
const docxText = await parseDOCX('./path/to/sample-file.docx');
const txtText = parseTXT('./path/to/sample-file.txt');
const htmlText = parseHTML('./path/to/sample-file.html');
const markdownText = parseMarkdown('./path/to/sample-file.md');
```
### **Synchronous Usage (for small files)**
`parseTXT` and `parseMarkdown` can be called synchronously. A synchronous call blocks the event loop until it returns, so reserve this for small, lightweight files.
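As a rough guide to what "small" might mean in practice, the sketch below checks a file's size before taking the synchronous path. The helper name `isSmallEnoughForSync` and the 1 MiB default threshold are assumptions made for illustration; neither is part of UniParser.

```javascript
const fs = require('node:fs');

// Hypothetical threshold: treat anything up to 1 MiB as "small".
const DEFAULT_MAX_BYTES = 1024 * 1024;

// Returns true when the file at filePath is at most maxBytes long.
// fs.statSync throws if the file does not exist.
function isSmallEnoughForSync(filePath, maxBytes = DEFAULT_MAX_BYTES) {
  const { size } = fs.statSync(filePath);
  return size <= maxBytes;
}
```

A caller could use `parseTXT` directly when this returns `true` and choose a non-blocking approach otherwise. The right threshold depends on the application's latency budget.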
#### CommonJS (CJS):
```javascript
const { parseTXT, parseMarkdown } = require('uniparser');

// Synchronously read small text files
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);

const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);
```
#### ES Modules (ESM):
```javascript
import { parseTXT, parseMarkdown } from 'uniparser';

// Synchronously read small text files
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);

const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);
```
---
## **Supported File Formats**
- **PDF** (`.pdf`): Converts PDF documents to plain text.
- **DOCX** (`.docx`): Extracts text from Microsoft Word `.docx` files.
- **TXT** (`.txt`): Reads plain text from simple text files.
- **HTML** (`.html`): Strips HTML tags and returns the text content.
- **Markdown** (`.md`): Converts Markdown files to plain text, removing all formatting.
- **Auto-detection**: Detects file types automatically via `autoParse` and processes them accordingly.
---
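One plausible way to auto-detect a format is to dispatch on the file extension. The sketch below maps the extensions listed above to the parser names used in this README. It only illustrates the idea: whether `autoParse` works this way internally is not stated here, and both `PARSER_BY_EXTENSION` and `pickParserName` are hypothetical names.

```javascript
const path = require('node:path');

// Hypothetical lookup table: file extension -> name of the parser to call.
// The extensions come from the list above; the mapping itself is an assumption.
const PARSER_BY_EXTENSION = {
  '.pdf': 'parsePDF',
  '.docx': 'parseDOCX',
  '.txt': 'parseTXT',
  '.html': 'parseHTML',
  '.md': 'parseMarkdown',
};

// Returns the parser name for a path, or null for an unsupported extension.
function pickParserName(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return PARSER_BY_EXTENSION[ext] ?? null;
}
```

Checking the result before calling `autoParse` lets an application reject unsupported files early, for example `pickParserName('archive.zip')` returns `null`.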
## **Example**
Here's a quick example to get you started with DOCX parsing:
### CommonJS (CJS):
```javascript
const { parseDOCX } = require('uniparser');

(async () => {
  const docxText = await parseDOCX('./path/to/sample-file.docx');
  console.log(docxText);
})();
```
### ES Modules (ESM):
```javascript
import { parseDOCX } from 'uniparser';

(async () => {
  const docxText = await parseDOCX('./path/to/sample-file.docx');
  console.log(docxText);
})();
```
---
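Building on the DOCX example above, several files can be parsed concurrently while keeping failures (a missing or corrupt file, say) separate from successes. This is a minimal sketch: `parseAll` is a hypothetical helper, not a UniParser export, and it assumes the parser passed in returns or resolves to the extracted text, as `parseDOCX` appears to in the examples.

```javascript
// Parses every path with the given parser and separates successes from failures.
// `parse` can be any function that returns text or a promise of text,
// such as parseDOCX or autoParse (passed in by the caller).
async function parseAll(filePaths, parse) {
  // The async wrapper turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(filePaths.map(async (p) => parse(p)));
  const parsed = [];
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      parsed.push({ path: filePaths[i], text: result.value });
    } else {
      failed.push({ path: filePaths[i], error: result.reason });
    }
  });
  return { parsed, failed };
}
```

Calling `parseAll(['./a.docx', './b.docx'], parseDOCX)` inside an async function would then yield one object with a `parsed` and a `failed` array, so a single bad file does not abort the whole batch.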
## **License**
This project is licensed under the **MIT License**. See the [LICENSE](./LICENSE) file for more information.
---
## **Contributing**
Contributions are welcome! If you'd like to improve UniParser, feel free to fork the repository and submit a pull request. We appreciate your feedback and contributions!
---
**UniParser** makes it easier than ever to extract content from a wide range of file formats. **Try it now and streamline your file processing tasks!**