https://github.com/olibridge01/texocr

Optical Character Recognition (OCR) model for Image-to-LaTeX conversion

deep-learning latex machine-learning optical-character-recognition

Host: GitHub
URL: https://github.com/olibridge01/texocr
Owner: olibridge01
Created: 2024-10-17T15:11:58.000Z (4 months ago)
Default Branch: main
Last Pushed: 2024-12-19T12:22:15.000Z (about 2 months ago)
Last Synced: 2025-01-31T01:51:10.443Z (3 days ago)
Topics: deep-learning, latex, machine-learning, optical-character-recognition
Language: Python
Homepage:
Size: 9.52 MB
Stars: 7
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

[![Python 3.11](https://img.shields.io/badge/Python-3.11-3776AB.svg?logo=python&logoColor=white)](#)
[![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C.svg?logo=pytorch&logoColor=white)](#)
[![FastAPI](https://img.shields.io/badge/FastAPI-009485.svg?logo=fastapi&logoColor=white)](#)
[![Docker](https://img.shields.io/badge/Docker-2496ED.svg?logo=docker&logoColor=white)](#)

## Image to $\LaTeX$: Optical Character Recognition with PyTorch

This repository contains an OCR (Optical Character Recognition) model that converts images of equations into LaTeX code, together with scripts for custom dataset generation, training, and evaluation. The implementation is written in PyTorch.

I have also written a web application for getting fast LaTeX predictions from images. The app uses a FastAPI backend to serve the model; the UI is shown below:

![texocr](https://github.com/user-attachments/assets/7cfbc643-ee16-4295-8adf-897ffd8c04ef)

### Model Overview

The model uses an encoder-decoder architecture, which is common to many current OCR systems. *TeXOCR* is based on the TrOCR model [[1]](#ref1), which utilises a Vision Transformer (ViT) [[2]](#ref2) encoder and a Transformer [[3]](#ref3) decoder. The model architecture is depicted in the figure below:

![TeXOCR_model](https://github.com/user-attachments/assets/de4a23d6-bed2-453f-9743-1b2b647ecbfd)

The vision encoder receives images of LaTeX equations and processes them into a sequence of embeddings $\mathbf{z}^{(i)} \in \mathbb{R}^{d}$, one for each of the $N$ patches. The embeddings are passed into a Transformer decoder along with sequences of tokenized LaTeX code. The decoder generates a probability distribution over the vocabulary of LaTeX tokens, from which the next token in the sequence is sampled. Tokens are generated autoregressively, one at a time, until the full prediction is complete.
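The autoregressive loop described above can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration: `greedy_decode`, `toy_logits`, and the token ids are hypothetical names and values, not part of the repository. The toy scoring function also ignores the image embeddings that the real decoder attends to, and the sketch picks the most likely token (greedy decoding) rather than sampling from the distribution.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_len: int = 32,
) -> List[int]:
    """Generate tokens one at a time, feeding each prediction back in."""
    tokens = [bos_id]
    for _ in range(max_len):
        probs = softmax(next_token_logits(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token appears
            break
    return tokens


# Hypothetical stand-in for the decoder: a vocabulary of 5 ids where
# 0 plays the role of [BOS] and 4 of [EOS]; it always favours "last id + 1".
def toy_logits(prefix: List[int]) -> List[float]:
    target = min(prefix[-1] + 1, 4)
    return [5.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(toy_logits, bos_id=0, eos_id=4))  # [0, 1, 2, 3, 4]
```

Replacing the argmax with a draw from `probs` would give sampling-based decoding instead.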

### Installation

To clone the repository, run the following:

```bash
git clone https://github.com/olibridge01/TeXOCR.git
cd TeXOCR
```

For package management, set up a conda environment and install the required packages as follows:

```bash
conda create -n texocr python=3.11 anaconda
conda activate texocr
pip install -r requirements.txt
```

For dataset rendering, `latex`, `dvipng`, and `imagemagick` are required. To install these dependencies, follow the instructions in the [`data_wrangling/`](data_wrangling/) directory.

### Data

The data used in this project is taken from the [Im2LaTeX-230k](https://www.kaggle.com/datasets/gregoryeritsyan/im2latex-230k) dataset (equations only). For use with a model consisting of a ViT encoder, I created custom scripts to generate the full dataset of image-label pairs, where each image has its dimensions altered to the nearest multiple of the patch size. To generate the dataset, simply execute:

```bash
./generate_dataset.sh
```

This takes the original equation data `data/master_labels.txt`, creates the data splits with `split_data.py` and renders the images with `render_images.py` (located in the `data_wrangling` directory). The rendered images are stored in `data/train`, `data/val`, and `data/test` directories. To create the dataset pickle files used in the training/testing scripts, run:

```bash
./generate_pickles.sh
```
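To make the dimension adjustment mentioned above concrete, here is a small sketch of how an image size could be snapped to a multiple of the patch size so that a ViT encoder can split it into whole patches. It is only an illustration: the function names are hypothetical, the patch size of 16 is an assumed value, and rounding up (padding) is one possible choice, since the rendering scripts may round differently.

```python
from typing import Tuple


def padded_size(width: int, height: int, patch_size: int = 16) -> Tuple[int, int]:
    """Round each dimension up to the next multiple of patch_size."""
    # -(-a // b) is integer ceiling division
    new_w = -(-width // patch_size) * patch_size
    new_h = -(-height // patch_size) * patch_size
    return new_w, new_h


def num_patches(width: int, height: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches in an image of the given size."""
    return (width // patch_size) * (height // patch_size)


w, h = padded_size(203, 45)   # assumed example size of a rendered equation
print(w, h)                   # 208 48
print(num_patches(w, h))      # 13 * 3 = 39
```

The patch count is the $N$ that determines how many embeddings the encoder produces for an image.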

### Tokenizer

This repository contains an implementation of the Byte Pair Encoding (BPE) [4] algorithm for tokenizing LaTeX code. To train the tokenizer on the Im2LaTeX-230k equation data, run:

```bash
./train_tokenizer.sh
```

To train the tokenizer on any text data, run the `tokenizer/tokenizer.py` script directly:

```bash
python tokenizer/tokenizer.py -v [vocab_size] -t -d [data_path] -s [save_path] --special [special_tokens] --verbose
```

where `vocab_size` is the desired vocabulary size, `data_path` is the path to the training data, `save_path` is the path to save the tokenizer (.txt file), and `special_tokens` is the path to a .txt file containing special tokens (e.g. [BOS], [PAD]). The `RegExTokenizer` class can also be used directly in Python:

```python
from TeXOCR.tokenizer import RegExTokenizer

tokenizer = RegExTokenizer()

# Train on the contents of a text file, then save the trained tokenizer
with open('path/to/train.txt') as f:
    text = f.read()
tokenizer.train(text)
tokenizer.save('path/to/tokenizer.txt')

# Tokenize a LaTeX string (raw string, so '\i' is not treated as an escape)
tokens = tokenizer.encode(r'\int _ { 0 } ^ { 1 } x ^ 2 d x')
print(tokens)
```

where `train.txt` is some file containing tokenization training data. The tokenizer can be saved and loaded using the `save()` and `load()` methods.
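For readers unfamiliar with BPE, the core training loop can be sketched in a few lines. This is a generic byte-level illustration, not the repository's `RegExTokenizer`: the function names are hypothetical, the starting vocabulary of 256 byte values is an assumption, and the sketch omits special tokens and any regex pre-splitting of the text.

```python
from collections import Counter
from typing import Dict, List, Tuple


def most_common_pair(ids: List[int]) -> Tuple[int, int]:
    """Return the adjacent pair of ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)


def merge(ids: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> Dict[Tuple[int, int], int]:
    """Learn up to `num_merges` merge rules, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges: Dict[Tuple[int, int], int] = {}
    for step in range(num_merges):
        if len(ids) < 2:  # nothing left to merge
            break
        pair = most_common_pair(ids)
        new_id = 256 + step  # new ids start after the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


merges = train_bpe(r"x ^ 2 + x ^ 2 + x ^ 2", num_merges=3)
print(merges)
```

Each learned merge adds one entry to the vocabulary, so in this sketch the final vocabulary size would be 256 plus the number of merges.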

### References

[1] [Li *et al*. - TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models (2021)](https://arxiv.org/abs/2109.10282)

[2] [Dosovitskiy *et al*. - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)](https://arxiv.org/abs/2010.11929)

[3] [Vaswani *et al*. - Attention is All You Need (2017)](https://arxiv.org/abs/1706.03762)