{"id":20317675,"url":"https://github.com/kuro337/textract","last_synced_at":"2026-04-18T14:05:25.289Z","repository":{"id":218147900,"uuid":"745710993","full_name":"kuro337/textract","owner":"kuro337","description":"Single Header High Performance C++ Image Processing Library to read content from Images and transform Images to text files.","archived":false,"fork":false,"pushed_at":"2024-08-20T23:29:48.000Z","size":335,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-25T01:59:28.277Z","etag":null,"topics":["cpp","cpp20","cryptography","folly","opencv","openmp-parallelization"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kuro337.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-19T23:38:20.000Z","updated_at":"2024-08-20T23:29:51.000Z","dependencies_parsed_at":"2024-05-20T08:48:47.270Z","dependency_job_id":"f2b6f744-d9db-4356-9230-78af5e8aa353","html_url":"https://github.com/kuro337/textract","commit_stats":null,"previous_names":["kuro337/textract"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kuro337/textract","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuro337%2Ftextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuro337%2Ftextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuro337%2Ftextract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/kuro337%2Ftextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kuro337","download_url":"https://codeload.github.com/kuro337/textract/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuro337%2Ftextract/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31971500,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T00:39:45.007Z","status":"online","status_checked_at":"2026-04-18T02:00:07.018Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cpp20","cryptography","folly","opencv","openmp-parallelization"],"created_at":"2024-11-14T18:35:05.872Z","updated_at":"2026-04-18T14:05:20.280Z","avatar_url":"https://github.com/kuro337.png","language":"C++","readme":"\u003cbr/\u003e\n\u003cbr/\u003e\n\n\u003cdiv align=\"left\"\u003e\n  \u003cimg alt=\"textract logo\" height=\"75px\" src=\"assets/logo.png\"\u003e\n\u003c/div\u003e\n\n\u003cbr/\u003e\n\u003cbr/\u003e\n\n# textract\n\n\u003cbr/\u003e\n\u003cbr/\u003e\n\n_Single Header High Performance_ **C++ Image Processing** Library to read content from Images and transform Images to text files.\n\n\u003cbr/\u003e\n\n\u003chr/\u003e\n\n\u003cbr/\u003e\n\nBuild from Source using **CMake**\n\n#### Dependencies\n\n\u003cbr/\u003e\n\n```bash\nbrew install openssl leptonica tesseract libomp\n```\n\n\u003cbr/\u003e\n\n#### 
Build\n\n\u003cbr/\u003e\n\n```bash\n\ncd textract \u0026\u0026 mkdir build \u0026\u0026 cd build\ncmake ..\nmake\n\n# or use the build script to build for all releases and run tests\n./build_all.sh --all --tests\n\n## Alternative Compilers ##\n\n# using LLVM and Clang++ directly\ncmake -DCMAKE_CXX_COMPILER=/path/to/clang++ -DCMAKE_C_COMPILER=/path/to/clang ..\nmake\n\n# getting clang++ and clang paths\necho $(brew --prefix llvm)/bin/clang++\necho $(brew --prefix llvm)/bin/clang\n\n# to build all releases\n./build_all.sh\n\n```\n\n\u003cbr/\u003e\n\n[Linux (AArch64, x86_64)](docs/linux.md)\n\n\u003cbr/\u003e\n\n## Design\n\n\u003cbr/\u003e\n\n#### leptonica and Tesseract\n\nFor Processing images and using _Tesseract OCR_ to extract text from Images.\n\n\u003chr/\u003e\n\n#### OpenSSL\n\nFor generating **SHA256** hashes from Image bytes and metadata.\n\n\u003chr/\u003e\n\n#### OpenMP\n\nTo provide **parallelization** on systems for processing.\n\n\u003chr/\u003e\n\n#### Folly\n\n_textract_ uses Folly's **AtomicUnorderedInsertMap** for a fast Cache implementation to provide wait free parallel access to the Cache\n\n[Folly](https://github.com/facebook/folly)::[AtomicUnorderedInsertMap](https://github.com/facebook/folly/blob/main/folly/AtomicUnorderedMap.h)\n\n\u003chr\u003e\n\n\u003cbr/\u003e\n\n## Usage\n\n\u003cbr/\u003e\n\nProcess Images and get their textual content\n\n\u003cbr/\u003e\n\n```cpp\n#include \u003ctextract/textract\u003e\n\nint main() {\n    imgstr::ImgProcessor app = imgstr::ImgProcessor();\n\n    std::vector\u003cstd::string\u003e results = app.processImages(\"cs101_notes.png\",\"bio.jpeg\");\n\n    app.writeImageTextOut(\"cs101_notes.png\", \"cs_notes.txt\");\n\n\n\n    app.createPDF(\"transcript_img.png\");    /* Create searchable PDF's from images */\n\n    return 0;\n}\n\n```\n\n\u003cbr/\u003e\n\nProcess all valid Image files from a directory and create text files\n\n\u003cbr/\u003e\n\n```cpp\n#include \u003ctextract/textract\u003e\n\nint 
main() {\n\n    /* Process 10000+ images with parallelism */\n\n    imgstr::ImageTranslator app = imgstr::ImageTranslator(10000);\n\n    app.setCores(6);\n\n    app.setMode(ImgMode::document);\n\n    app.processImagesDir(\"/path/to/dir\");\n\n    app.getResults();\n\n    return 0;\n}\n\n```\n\n\u003chr\u003e\n\n\u003cbr/\u003e\n\n### In Memory Cache Performance\n\n\u003cbr/\u003e\n\n```bash\n\nDebug -O0\n\n============================================================================\n/textract/benchmarks/cache_benchmark.cc       relative    time/iter  iters/s\n============================================================================\nUnorderedMapMutexedSingleThreaded                          60.70ns    16.48M\nUnorderedMapMutexedMultiThreaded                          771.14ns     1.30M\nUnorderedMapMutexedMaxThreads                               1.62us   618.92K\nConcurrentHashMapSingleThreaded                           255.11ns     3.92M\nConcurrentHashMapMultiThreaded                            978.99ns     1.02M\nConcurrentHashMapMaxThreads                                 3.92us   254.82K\nConcurrentHashMapComplexSingleThreaded                    394.03ns     2.54M\nConcurrentHashMapComplexMultiThreaded                       1.68us   595.64K\nConcurrentHashMapComplexMaxThreads                         13.05us    76.62K\nAtomicUnorderedMapSingleThreaded                           38.80ns    25.77M\nAtomicUnorderedMapMultiThreaded                            75.43ns    13.26M\nAtomicUnorderedMapMaxThreads                              451.00ns     2.22M\nAtomicUnorderedMapComplexSingleThreaded                   256.42ns     3.90M\nAtomicUnorderedMapComplexMultiThreaded                      1.68us   593.57K\nAtomicUnorderedMapComplexMaxThreads                         8.76us   114.17K\n============================================================================\n\n\nFast Clang 
-Ofast\n\n============================================================================\n[...]/imgapp/benchmarks/cache_benchmark.cc     relative  time/iter   iters/s\n============================================================================\nUnorderedMapMutexedSingleThreaded                          61.15ns    16.35M\nUnorderedMapMutexedMultiThreaded                          781.00ns     1.28M\nUnorderedMapMutexedMaxThreads                               1.62us   616.71K\nConcurrentHashMapSingleThreaded                           215.90ns     4.63M\nConcurrentHashMapMultiThreaded                            864.01ns     1.16M\nConcurrentHashMapMaxThreads                                 3.50us   285.73K\nConcurrentHashMapComplexSingleThreaded                    282.30ns     3.54M\nConcurrentHashMapComplexMultiThreaded                       1.32us   757.35K\nConcurrentHashMapComplexMaxThreads                         14.95us    66.90K\nAtomicUnorderedMapSingleThreaded                           41.16ns    24.30M\nAtomicUnorderedMapMultiThreaded                            81.05ns    12.34M\nAtomicUnorderedMapMaxThreads                              440.18ns     2.27M\nAtomicUnorderedMapComplexSingleThreaded                   127.24ns     7.86M\nAtomicUnorderedMapComplexMultiThreaded                    511.31ns     1.96M\nAtomicUnorderedMapComplexMaxThreads                         6.31us   158.55K\n============================================================================\n\n```\n\n\u003cbr/\u003e\n\nIt can be seen **AtomicUnorderedInsertMap** is over **8x** faster than the Concurrent HashMap.\n\n[AtomicUnorderedInsertMap](https://github.com/facebook/folly/blob/main/folly/AtomicUnorderedMap.h) provides an overview of the tradeoffs.\n\n\u003chr\u003e\n\n\u003cbr/\u003e\n\n**Note**\n\n_These are for reference and will vary depending on:_\n\n- cpu arch\n- release Mode\n- number of cores\n- types of images\n\n\u003cbr/\u003e\n\n**Average Latencies** for reference using **4 
cores** and batches of **10 MB** up to **4 GB**\n\n\u003chr/\u003e\n\nDocument Images to Text Generations: **0.423081 ms**\n\nComplex Images to Text Generations: **0.674621 ms**\n\nText Document to Searchable PDF Conversion: **0.381223 ms**\n\n\u003chr/\u003e\n\n\u003cbr/\u003e\n\n`Latest Test Suite Run`\n\n```bash\n      Start  1: ImageTest.MutexNullIfNoWrite\n 1/37 Test  #1: ImageTest.MutexNullIfNoWrite ..............................   Passed    0.02 sec\n      Start  2: ImageTest.MutexLazyInitialization\n 2/37 Test  #2: ImageTest.MutexLazyInitialization .........................   Passed    0.02 sec\n      Start  3: ImageTest.UpdateWriteMetadata\n 3/37 Test  #3: ImageTest.UpdateWriteMetadata .............................   Passed    0.02 sec\n      Start  4: ImageTest.ReadWriteMetadataBeforeWrite\n 4/37 Test  #4: ImageTest.ReadWriteMetadataBeforeWrite ....................   Passed    0.02 sec\n      Start  5: ImageConcurrentWriteTest.ConcurrentWriteAttempts\n 5/37 Test  #5: ImageConcurrentWriteTest.ConcurrentWriteAttempts ..........   Passed    0.02 sec\n      Start  6: ConstTests.BasicAssertions\n 6/37 Test  #6: ConstTests.BasicAssertions ................................   Passed    0.01 sec\n      Start  7: ConstTests.BasicTest\n 7/37 Test  #7: ConstTests.BasicTest ......................................   Passed    0.01 sec\n      Start  8: DebugTest.BasicAssertions\n 8/37 Test  #8: DebugTest.BasicAssertions .................................   Passed    0.01 sec\n      Start  9: LLVMFsTests.WriteFileTest\n 9/37 Test  #9: LLVMFsTests.WriteFileTest .................................   Passed    0.01 sec\n      Start 10: LLVMFsTests.InvalidWriteThrowTest\n10/37 Test #10: LLVMFsTests.InvalidWriteThrowTest .........................   Passed    0.01 sec\n      Start 11: LLVMFsTests.GetFilePathsEmpty\n11/37 Test #11: LLVMFsTests.GetFilePathsEmpty .............................   
Passed    0.01 sec\n      Start 12: LLVMFsTests.GetFilePaths\n12/37 Test #12: LLVMFsTests.GetFilePaths ..................................   Passed    0.01 sec\n      Start 13: LLVMFsTests.GetFileInfo\n13/37 Test #13: LLVMFsTests.GetFileInfo ...................................   Passed    0.01 sec\n      Start 14: LLVMFsTests.DirectoryCreationTests\n14/37 Test #14: LLVMFsTests.DirectoryCreationTests ........................   Passed    0.01 sec\n      Start 15: LLVMFsTests.CreateQualifiedFilePath\n15/37 Test #15: LLVMFsTests.CreateQualifiedFilePath .......................   Passed    0.01 sec\n      Start 16: LLVMFsTests.CreateQualifiedFilePathExistingDirectory\n16/37 Test #16: LLVMFsTests.CreateQualifiedFilePathExistingDirectory ......   Passed    0.01 sec\n      Start 17: LLVMFsTests.CreateQualifiedFilePathNonExistingDirectory\n17/37 Test #17: LLVMFsTests.CreateQualifiedFilePathNonExistingDirectory ...   Passed    0.01 sec\n      Start 18: LLVMFsTests.CreateQualifiedFilePathExtensionChange\n18/37 Test #18: LLVMFsTests.CreateQualifiedFilePathExtensionChange ........   Passed    0.01 sec\n      Start 19: Imgclasstest.BasicAssertions\n19/37 Test #19: Imgclasstest.BasicAssertions ..............................   Passed    0.07 sec\n      Start 20: ImageProcessingTests.EnvironmentTest\n20/37 Test #20: ImageProcessingTests.EnvironmentTest ......................   Passed    0.02 sec\n      Start 21: ImageProcessingTests.ConvertSingleImageToTextFile\n21/37 Test #21: ImageProcessingTests.ConvertSingleImageToTextFile .........   Passed    0.07 sec\n      Start 22: ImageProcessingTests.ConvertSingleImageToTextFileLeak\n22/37 Test #22: ImageProcessingTests.ConvertSingleImageToTextFileLeak .....   Passed    0.07 sec\n      Start 23: ImageProcessingTests.WriteFileTest\n23/37 Test #23: ImageProcessingTests.WriteFileTest ........................   
Passed    6.05 sec\n      Start 24: ImageProcessingTests.CheckForUniqueTextFile\n24/37 Test #24: ImageProcessingTests.CheckForUniqueTextFile ...............   Passed    1.23 sec\n      Start 25: ImageProcessingTests.BasicAssertions\n25/37 Test #25: ImageProcessingTests.BasicAssertions ......................   Passed    0.27 sec\n      Start 26: ImageProcessingTests.OEMvsLSTMAnalysis\n26/37 Test #26: ImageProcessingTests.OEMvsLSTMAnalysis ....................   Passed    0.21 sec\n      Start 27: PDFSuite.SinglePDF\n27/37 Test #27: PDFSuite.SinglePDF ........................................   Passed    1.27 sec\n      Start 28: SimilaritySuite.SingleString\n28/37 Test #28: SimilaritySuite.SingleString ..............................   Passed    0.01 sec\n      Start 29: SimilaritySuite.ImageSHA256Equal\n29/37 Test #29: SimilaritySuite.ImageSHA256Equal ..........................   Passed    0.02 sec\n      Start 30: SimilaritySuite.ImageSHA256Unequal\n30/37 Test #30: SimilaritySuite.ImageSHA256Unequal ........................   Passed    0.02 sec\n      Start 31: ParallelPerfTesseract.OEMvsLSTMAnalysisDebug\n31/37 Test #31: ParallelPerfTesseract.OEMvsLSTMAnalysisDebug ..............   Passed    7.16 sec\n      Start 32: PublicAPITests.GetTextFromOneImage\n32/37 Test #32: PublicAPITests.GetTextFromOneImage ........................   Passed    0.08 sec\n      Start 33: PublicAPITests.ProcessSimpleDir\n33/37 Test #33: PublicAPITests.ProcessSimpleDir ...........................   Passed    3.33 sec\n      Start 34: PublicAPITests.ProcessFilesFromDir\n34/37 Test #34: PublicAPITests.ProcessFilesFromDir ........................   Passed    6.03 sec\n      Start 35: PublicAPITests.Results\n35/37 Test #35: PublicAPITests.Results ....................................   Passed    0.02 sec\n      Start 36: PublicAPITests.AddImagesThenConvertToTextDocumentMode\n36/37 Test #36: PublicAPITests.AddImagesThenConvertToTextDocumentMode .....   
Passed    6.01 sec\n      Start 37: PublicAPITests.AddImagesThenConvertToTextImageMode\n37/37 Test #37: PublicAPITests.AddImagesThenConvertToTextImageMode ........   Passed    3.50 sec\n\n# Nix Flake \n```\n\n\u003cbr/\u003e\n\nAuthor: [kuro337](https://github.com/kuro337)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuro337%2Ftextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkuro337%2Ftextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuro337%2Ftextract/lists"}