https://github.com/bazilsuhail/resume-dataset
Resume-Dataset is a Python-based project to generate PDF CVs from a LaTeX template and a CSV dataset, designed for creating a structured dataset for training LayoutLM, a model for document understanding.
- Host: GitHub
- URL: https://github.com/bazilsuhail/resume-dataset
- Owner: BazilSuhail
- License: mit
- Created: 2025-08-09T16:43:59.000Z
- Default Branch: main
- Last Pushed: 2025-08-09T17:21:05.000Z
- Last Synced: 2025-08-30T23:03:06.459Z
- Topics: data-set, latex-document, latex-resume-template, resume-builder, resume-dataset, resume-template
- Language: Jupyter Notebook
- Homepage:
- Size: 906 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Resume-Dataset
## Overview
**Resume-Dataset** is a Python-based project to generate PDF CVs from a LaTeX template and a CSV dataset, designed for creating a structured dataset for training LayoutLM, a model for document understanding. The repository includes a LaTeX template (`cv_template.tex`) and a sample CSV dataset (`cv_data.csv`) to generate CVs, which can be processed for text and layout extraction.
## Repository Contents
- `cv_template.tex`: Injectable LaTeX template with placeholders for CV data.
- `cv_data.csv`: Sample dataset with resume data for two individuals (Sourabh Bajaj, Jane Smith).
## Approach
- Parse CSV data containing resume details (name, email, mobile, website, education, experience, projects, languages, technologies).
- Inject CSV data into LaTeX template placeholders.
- Compile LaTeX files to PDFs using `pdflatex`.
- Clean up auxiliary files (`.aux`, `.log`) and the generated intermediate `.tex` source after compilation.
- Enable dataset creation for LayoutLM by providing PDFs for text and bounding box extraction.
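
The steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's `generate_cvs.py`: the `<<column>>` placeholder syntax, the `name` column used to build file names, and the helper names `escape_latex`, `fill_template`, `build_pdf`, and `generate_cvs` are all assumptions, and the standard-library `csv` module stands in for `pandas` to keep the example dependency-free.

```python
import csv
import re
import subprocess
from pathlib import Path

# Hypothetical placeholder syntax: <<column_name>>. The real template may differ.
PLACEHOLDER = re.compile(r"<<(\w+)>>")

# Characters LaTeX treats specially, mapped to their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape special characters one at a time, so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def fill_template(template, row):
    """Replace each <<key>> placeholder with the escaped value from one CSV row."""
    def substitute(match):
        key = match.group(1)
        if key not in row:
            return match.group(0)  # leave unknown placeholders untouched
        return escape_latex(row[key] or "")
    return PLACEHOLDER.sub(substitute, template)


def build_pdf(tex_source, stem, out_dir="."):
    """Write <stem>.tex, compile it with pdflatex, then remove intermediate files."""
    out = Path(out_dir)
    tex_path = out / (stem + ".tex")
    tex_path.write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", tex_path.name],
        cwd=str(out), check=True, stdout=subprocess.DEVNULL,
    )
    for ext in (".aux", ".log", ".tex"):
        leftover = out / (stem + ext)
        if leftover.exists():
            leftover.unlink()
    return out / (stem + ".pdf")


def generate_cvs(template_path="cv_template.tex", csv_path="cv_data.csv"):
    """Produce one PDF per CSV row; assumes a `name` column exists."""
    template = Path(template_path).read_text(encoding="utf-8")
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            stem = "CV_" + row["name"].replace(" ", "_")
            build_pdf(fill_template(template, row), stem)
```

Calling `generate_cvs()` in a directory that holds both input files would write one PDF per CSV row, such as `CV_Jane_Smith.pdf`. The escaping step is there because an unescaped `&` or `%` in a CSV field would otherwise break compilation; if some columns intentionally contain LaTeX markup, escaping would need to be skipped for those columns.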
## Usage
1. Install the dependencies: a LaTeX distribution that provides `pdflatex` (e.g., `texlive`) and the `pandas` library.
2. Place `cv_template.tex` and `cv_data.csv` in the working directory.
3. Run `generate_cvs.py` to produce PDFs (e.g., `CV_Sourabh_Bajaj.pdf`, `CV_Jane_Smith.pdf`).
4. Use PDFs for LayoutLM dataset preparation (e.g., extract text and bounding boxes with `pdfplumber`).
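
For step 4, one possible way to pull words and bounding boxes out of a generated PDF is sketched below. It assumes `pdfplumber` is installed separately (it is not among the listed dependencies), and the function names `normalize_box` and `extract_words_with_boxes` are hypothetical. The boxes are rescaled to the 0 to 1000 integer grid that LayoutLM expects for its `bbox` inputs.

```python
def normalize_box(x0, top, x1, bottom, page_width, page_height):
    """Scale a box given in PDF points to a 0-1000 integer grid."""
    def scale(value, extent):
        # Clamp so words that slightly overflow the page stay in range.
        return max(0, min(1000, int(1000 * value / extent)))
    return [
        scale(x0, page_width),
        scale(top, page_height),
        scale(x1, page_width),
        scale(bottom, page_height),
    ]


def extract_words_with_boxes(pdf_path):
    """Return one record per word: page index, text, and normalized box."""
    # Third-party dependency, imported here so normalize_box works without it.
    import pdfplumber

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_index, page in enumerate(pdf.pages):
            for word in page.extract_words():
                records.append({
                    "page": page_index,
                    "text": word["text"],
                    "bbox": normalize_box(
                        word["x0"], word["top"], word["x1"], word["bottom"],
                        page.width, page.height,
                    ),
                })
    return records
```

For example, `extract_words_with_boxes("CV_Jane_Smith.pdf")` would return a list of dictionaries that can be paired with labels when assembling training examples. Labeling itself is outside the scope of this sketch.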
## Prerequisites
- Python 3.6+
- LaTeX distribution (TeX Live/MiKTeX)
- `pandas` library
## Installation (Google Colab)
```python
!apt-get update
!apt-get install -y texlive texlive-latex-extra texlive-fonts-extra
!pip install pandas
```
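
After installation, a quick check along these lines can confirm that the LaTeX compiler is reachable before generating any CVs. This is a small sketch; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools=("pdflatex",)):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means `pdflatex` was found; otherwise the returned names indicate what still needs to be installed.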