{"id":28415760,"url":"https://github.com/travvy88/documentgenerator_doge","last_synced_at":"2025-10-04T00:33:11.285Z","repository":{"id":260755339,"uuid":"882254213","full_name":"Travvy88/DocumentGenerator_DoGe","owner":"Travvy88","description":"Synthetic Document Generator for Document AI. Creates document images annotated with text and bounding boxes of each word. Images contain headings, tables, paragraphs with different formatting and fonts. Can be used in OCR, document transformers pretraining, text detection and more other tasks.   ","archived":false,"fork":false,"pushed_at":"2025-06-05T14:38:03.000Z","size":23387,"stargazers_count":21,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-21T23:02:45.335Z","etag":null,"topics":["ai","bounding-boxes","data-generator","dataset","document","document-generation","document-generator","ocr","synthetic-data","synthetic-dataset-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Travvy88.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-02T10:15:44.000Z","updated_at":"2025-06-05T14:38:04.000Z","dependencies_parsed_at":"2024-12-15T10:23:17.314Z","dependency_job_id":"67eea223-936b-4855-a177-f55dd56889bc","html_url":"https://github.com/Travvy88/DocumentGenerator_DoGe","commit_stats":null,"previous_names":["travvy88/doge","travvy88/documentgenerator_doge"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Travvy88/DocumentG
enerator_DoGe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Travvy88%2FDocumentGenerator_DoGe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Travvy88%2FDocumentGenerator_DoGe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Travvy88%2FDocumentGenerator_DoGe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Travvy88%2FDocumentGenerator_DoGe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Travvy88","download_url":"https://codeload.github.com/Travvy88/DocumentGenerator_DoGe/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Travvy88%2FDocumentGenerator_DoGe/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261206103,"owners_count":23124836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","bounding-boxes","data-generator","dataset","document","document-generation","document-generator","ocr","synthetic-data","synthetic-dataset-generation"],"created_at":"2025-06-03T18:34:30.009Z","updated_at":"2025-10-04T00:33:11.280Z","avatar_url":"https://github.com/Travvy88.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DoGe — Synthetic DOcument GEnerator for Document AI\n\nDoGe is designed to synthesize a dataset of realistic document scans. Each document contains meaningful text, headings, \ntables, paragraphs with different formatting and fonts which are parsed from Wikipedia. 
The coordinates \nof the words are extracted using the No-OCR method we invented for faster generation on CPU.\n\n## Document examples\n\n\u003cdiv style=\"display: flex; flex-wrap: wrap;\"\u003e\n    \u003cimg src=\"resources/im_2.png\" width=\"300\" style=\"margin-right: 10px; margin-bottom: 10px;\"\u003e\n    \u003cimg src=\"resources/im_9.png\" width=\"300\" style=\"margin-right: 10px; margin-bottom: 10px;\"\u003e\n    \u003cimg src=\"resources/im_10.png\" width=\"300\" style=\"margin-right: 10px; margin-bottom: 10px;\"\u003e\n    \u003cimg src=\"resources/im_12.png\" width=\"300\" style=\"margin-right: 10px; margin-bottom: 10px;\"\u003e\n\u003c/div\u003e\n\nSee the full-size (1024x1024) images in the [resources](./resources) folder.\n\n## Usage\n\n### Docker installation\n\nYou can use a Docker image with a predefined environment to run DoGe:\n```bash\ngit clone https://github.com/Travvy88/DocumentGenerator_DoGe\ncd DocumentGenerator_DoGe\ndocker build -t doge .\n```\n\nReplace `/path/to/output/folder/on/host` and run commands. Inside the Docker container you can\n[start document generation](#start-data-generation). \n\n### Ubuntu installation\n\nFor faster generation, it is recommended to install all dependencies without Docker. 
\nDoGe is tested on Ubuntu 22.04.\n```bash\nsudo apt-get update \nsudo apt-get install libreoffice libjpeg-dev zlib1g-dev poppler-utils\n/usr/bin/python3 -m pip install --user unoserver  # install unoserver on system python\n\ngit clone https://github.com/Travvy88/DocumentGenerator_DoGe\ncd DocumentGenerator_DoGe\n# create a venv here if needed\npip3 install -r requirements.txt \n```\n\n## Start Data Generation\n\nDocker:\n\n```bash\ndocker run -v /full/path/to/output/folder/on/your/computer:/app/data doge python3 main.py --out_dir data --image_size 244 --max_urls 4 --num_processes 2 --ports 4000 4001 4002 4003\n``` \n\nUbuntu:\n\n```bash\npython3 main.py --out_dir data --image_size 244 --max_urls 4 --num_processes 2 --ports 4000 4001 4002 4003\n```\n\n### Main.py\n\nThe following arguments can be passed to the script:\n\n- `--out_dir`: The output directory for saving results. This argument is required.\n- `--remove_existing_dir`: If set, the output directory will be deleted before creating a new one.\n- `--image_size`: The size of the final images. Default is `244`.\n- `--start_page`: The starting page URL. Default is the main page of English Wikipedia. You can use the main page URL of a Wikipedia in another language.\n- `--languages`: Permitted languages. Pages with other localizations will be ignored. Default is `['en']`.\n- `--max_urls`: The maximum number of URLs to process. Default is `100`.\n- `--num_processes`: The number of processes to use. Default is `1`. Each process starts a `DocumentGenerator`, and a Unoserver instance is started for each generator.\n- `--max_threads`: The maximum number of threads inside a process. Default is `3`.\n- `--ports`: The list of ports to use. Default is `[8145, 8146]`. 
The number of ports should be twice `num_processes` (each Unoserver instance needs 2 ports for proper multicore work).\n- `--debug`: If set, draws bounding boxes and words on each image and saves intermediate images with highlighted words.\n\n\n### Docx_config.json\n\n| Parameter | Description |\n|-------------|-------------|\n| `max_words` | The maximum number of words allowed in the generated documents. |\n| `p_2columns` | The probability that the document will be formatted into two columns. |\n| `font_size_interval` | The font size range from which the size is randomly selected for each document. |\n| `p_line_spacing` | A list of probabilities controlling the line spacing of the document (1.5 or double). |\n| `p_text_alignment` | A list of probabilities controlling the text alignment of the document (center, left, right, justify). |\n| `p_heading_bold` | The probability that headings will be displayed in bold font. |\n| `heading_relative_size_interval` | The range of relative font sizes for headings. The relative font size is chosen randomly. |\n| `p_heading_alignment` | A list of probabilities controlling the alignment of headings (center, left, right, justify). |\n| `table_max_rows` | The maximum number of rows allowed in a table. Tables with more than the specified number of rows are dropped. |\n| `table_max_cols` | The maximum number of columns allowed in a table. Tables with more than the specified number of columns are dropped. |\n\nFor parameters given as probabilities or intervals, the value is sampled randomly for each document.\n\nIn my experience, the generator produces about 14 images per URL on average\nwith the above Docx settings. \n\n### Augmentations \n\nThe augmentation pipeline is applied at the final stage. You can manage the augmentations \nin the `src/augmentations.py` file. Read the [Augraphy Docs](https://augraphy.readthedocs.io/en/latest/) for a detailed explanation. 
\n\n\n## How it works\n![General scheme](resources/DoGe_Scheme.png \"General scheme of DoGe\")\n\nFirst, the `Manager` class creates the `DocumentGenerator` instances in separate processes. For \neach `DocumentGenerator`, a Unoserver instance is started.\n\nThen, the `UrlParser` generates a list of URLs by crawling the web, starting from a given start page \nand following links on each page. It uses `BeautifulSoup` to parse HTML content and extract links, \nthen checks each link's validity and language, adding it to the list if it meets certain conditions. \nThe process continues until a maximum number of URLs is reached, and the method returns the list of \ngenerated URLs, excluding the starting URL. \n\nWhen data generation begins, the list of URLs is divided into chunks, one for each `DocumentGenerator`.\nEach `DocumentGenerator` instance retrieves a Wikipedia HTML page by URL from its chunk.\nHeadings, paragraph formatting, and tables are extracted and placed into a Docx document via the `DocxDocument` class. \nAt this stage, some random parametrization is applied according to `docx_config.json`. \nFor example, font size, text alignment, one or two columns, and other parameters \nare chosen for each document randomly. \n\nAfter that, each word in the Docx is filled with a unique color. As a result, a colored rectangle\nappears in place of each word. The image will be encoded with 24-bit color depth, \nso the maximum number of words per document is 16,777,216. The text of each word is saved to a hashmap of type color_code -\u003e word. \n\nThe next step is Docx to image conversion. DoGe uses Unoserver to convert Docx to Pdf and\npdf2image for image rendering.\n\nThen, all rectangle coordinates are detected via OpenCV on converted images. The word for each bounding box is retrieved from the hashmap. 
\nDoGe saves annotations to JSON files in the following format:\n\n```json\n{\n  \"words\": [\n    \"Hello\", \n    \"World\"\n  ],\n  \"bboxes\": [\n    [0.1, 0.1, 0.03, 0.02],\n    [0.4, 0.3, 0.11, 0.02]\n  ]\n}\n```\n\nThe bboxes are normalized and saved in XYWH format. \n\nThe final step is deleting all color fills from words in the Docx document, rendering images, applying Augraphy augmentations, \nand saving the augmented images to disk. That's it!\n\n## Join us!\nDoGe is a promising method for producing synthetic document datasets. Here are some features that would help many developers:\n- Download and place images into documents\n- Add annotations of headers, tables, paragraphs and images (if added)\n- Add different output formats (Parquet for example)\n- Add additional information via LLMs\n- Performance improvement: the **bottleneck** of generation is transforming Docx -\u003e Pdf -\u003e Png! I am looking for a simpler way to convert Docx to Png.  \n\nIf you have any ideas or want to take part in the development of DoGe, write to me:\n- travvy88@yandex.ru\n- https://t.me/travvy88\n\nOr create a Pull Request to this repo. I will be glad to improve the project with the power of the community.\n\n## Acknowledgments\nHere are some great open-source projects I benefit from:\n- [ISP RAS Dedoc Team](https://github.com/ispras/dedoc) for support and assistance. \n- [Augraphy](https://github.com/sparkfish/augraphy) for the augmentation of final images. \n- [Unoserver](https://github.com/unoconv/unoserver) for the Docx to Pdf converter.\n- [Pdf2image](https://github.com/Belval/pdf2image) for rendering images from Pdf.\n- [Pillow-SIMD](https://github.com/uploadcare/pillow-simd) for faster image processing. 
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftravvy88%2Fdocumentgenerator_doge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftravvy88%2Fdocumentgenerator_doge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftravvy88%2Fdocumentgenerator_doge/lists"}