Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dead8309/markitdown-ts
Convert various file formats to Markdown for indexing, text analysis, and other applications that benefit from structured text. TS port of the python ibrary.
https://github.com/dead8309/markitdown-ts
atom bing csv docx gif html jpg mp3 pdf png pptx rss typescript typescript-library wav xlsx xml zip
Last synced: about 1 month ago
JSON representation
Convert various file formats to Markdown for indexing, text analysis, and other applications that benefit from structured text. TS port of the python ibrary.
- Host: GitHub
- URL: https://github.com/dead8309/markitdown-ts
- Owner: dead8309
- License: mit
- Created: 2024-12-24T06:56:24.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-25T16:52:28.000Z (about 1 month ago)
- Last Synced: 2024-12-25T17:28:44.308Z (about 1 month ago)
- Topics: atom, bing, csv, docx, gif, html, jpg, mp3, pdf, png, pptx, rss, typescript, typescript-library, wav, xlsx, xml, zip
- Language: HTML
- Homepage: https://www.npmjs.com/package/markitdown-ts
- Size: 5.29 MB
- Stars: 17
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# markitdown-ts
[![CI](https://github.com/dead8309/markitdown-ts/actions/workflows/ci.yml/badge.svg)](https://github.com/dead8309/markitdown/actions/workflows/ci.yml)
`markitdown-ts` is a TypeScript library designed for converting various file formats to Markdown. This makes it suitable for indexing, text analysis, and other applications that benefit from structured text. It is a TypeScript implementation of the original `markitdown` [Python library.](https://github.com/microsoft/markitdown)
It supports:
- [x] PDF
- [x] Word (.docx)
- [x] Excel (.xlsx)
- [x] Images (EXIF metadata extraction and optional LLM-based description)
- [x] Audio (EXIF metadata extraction only)
- [x] HTML
- [x] Text-based formats (plain text, .csv, .xml, .rss, .atom)
- [x] Jupyter Notebooks (.ipynb)
- [x] Bing Search Result Pages (SERP)
- [x] ZIP files (recursively iterates over contents)
- [ ] PowerPoint> [!NOTE]
>
> Speech Recognition for audio converter has not been implemented yet. I'm happy to accept contributions for this feature.## Installation
Install `markitdown-ts` using your preferred package manager:
```bash
pnpm add markitdown-ts
```## Usage
```typescript
import { MarkItDown } from "markitdown-ts";const markitdown = new MarkItDown();
try {
const result = await markitdown.convert("path/to/your/file.pdf");
if (result) {
console.log(result.text_content);
}
} catch (error) {
console.error("Conversion failed:", error);
}
```Pass additional options as needed for specific functionality.
## YouTube Transcript Support
When converting YouTube files, you can pass the `enableYoutubeTranscript` and the `youtubeTranscriptLanguage` option to control the transcript extraction. By default it will use `"en"` if the `youtubeTranscriptLanguage` is not provided.
```typescript
const markitdown = new MarkItDown();
const result = await markitdown.convert("https://www.youtube.com/watch?v=V2qZ_lgxTzg", {
enableYoutubeTranscript: true,
youtubeTranscriptLanguage: "en"
});
```## LLM Image Description Support
To enable LLM functionality, you need to configure a model and client in the `options` for the image converter. You can use the `@ai-sdk/openai` to get an LLM client.
```typescript
import { openai } from "@ai-sdk/openai";const markitdown = new MarkItDown();
const result = await markitdown.convert("test.jpg", {
llmModel: openai("gpt-4o-mini"),
llmPrompt: "Write a detailed description of this image"
});
```## API
The library uses a single function `convert` for all conversions, with the options and the response type defined as such:
```typescript
export interface DocumentConverter {
convert(local_path: string, options: ConverterOptions): Promise;
}export type ConverterResult =
| {
title: string | null;
text_content: string;
}
| null
| undefined;export type ConverterOption = {
file_extension?: string;
url?: string;
fetch?: typeof fetch;
enableYoutubeTranscript?: boolean; // false by default
youtubeTranscriptLanguage?: string; // "en" by default
llmModel: string;
llmPrompt?: string;
styleMap?: string | Array;
_parent_converters?: DocumentConverter[];
cleanup_extracted?: boolean;
};
```## Examples
Check out the [examples](./examples) folder.
## License
MIT License © 2024 [Vaibhav Raj](https://github.com/dead8309)