https://github.com/tboenig/gt_corpus_benchmark

This repo provides a collection of ground truth data. The collection was compiled according to two criteria: the complexity of the layouts and the fonts used. Each item is also described by metadata, which follows the labeling scheme of OCR-D/PrimaLab.

corp ground-truth ocr-d pagexml

📚 Corpus


This corpus contains Ground Truth (GT) data compiled according to the following features:



  1. Classification into font groups: Gothic/Blackletter, Antiqua and FontMix (Antiqua and Blackletter)

    distinction of the selected print type or combinations

  2. Classification into simple and complex

    complexity of the layout (columns, footnotes, ...)


The data are also divided according to the time of creation or production.


🖉 Creation


The data were created according to the OCR-D Ground Truth Guideline (https://ocr-d.de/en/gt-guidelines/trans/).


💻 Repositories





Analyzed collection


The GT data has been labeled. The labeling is based on an ontology defined by the Pattern Recognition
and Image Analysis Research Lab (PRImA-Research-Lab) at the University of Salford. The labeling metadata
is created for each available page. The following labeling metadata is available for the different collections.


See: gt-labelling, semantic labelling of OCR ground truth data (https://github.com/OCR-D/gt-labelling)
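The labels listed in the sections below are slash-separated paths, so they can be grouped by their first segment (for example `condition` or `granularity`). The following is a minimal sketch of that grouping, not part of the gt-labelling tooling: the function name `group_by_top_level` is hypothetical, and the label strings are copied from the lists in this README.

```python
from collections import defaultdict

# A few label paths copied from the lists in this README.
labels = [
    "activityDomain/computing/visual/analysisRecognition/ocr",
    "condition/production-related/document-faults/ink-from-facing",
    "condition/wear/medium-damage/stains",
    "data-attributes/document-related/visual/text/font/typeface/antiqua",
    "data-attributes/document-related/visual/text/font/typeface/blackletter",
    "granularity/physical/document-related/page",
]


def group_by_top_level(paths):
    """Group slash-separated label paths by their first segment (hypothetical helper)."""
    groups = defaultdict(list)
    for path in paths:
        # partition() splits at the first "/" only, keeping the rest of the path intact.
        top, _, rest = path.partition("/")
        groups[top].append(rest)
    return dict(groups)


grouped = group_by_top_level(labels)
for top, rest in grouped.items():
    print(f"{top}: {len(rest)} label(s)")
```

Grouping by the first segment is only one possible view; the same approach works for any prefix depth by splitting on more segments.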



FontMix (Antiqua and Blackletter)




simple



  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples:
    Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related:
    "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.






  • activityDomain/computing/visual/analysisRecognition/ocr






  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to a machine-readable format

    Examples:
    OCR
    Mathematical equation recognition

    Related:
    Text processing (separate category)
    Table recognition
    Map reading






  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations






  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low






  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page






  • condition/wear/additions/informative/annotations

    Annotations regarding the content






  • content-encoding/structured

    E.g. XML






  • content-type/corpus


    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples:
    A text corpus,
    An image database






  • contentOfInterest/visual/graphical


    Description coming soon.






  • contentOfInterest/visual/graphical/separator


    Description coming soon.






  • contentOfInterest/visual/text


    Description coming soon.






  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page






  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)






  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used






  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used






  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)






  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur






  • data-attributes/language/mixed

    More than one language used






  • granularity/logical/document-related/paragraph


    Description coming soon.






  • granularity/physical/document-related/page


    Description coming soon.






  • granularity/physical/document-related/region

    Region, zone, block






  • granularity/physical/document-related/text-line


    Description coming soon.






  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example






  • platform/platform-independent


    Description coming soon.









complex



  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples:
    Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related:
    "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.






  • activityDomain/computing/visual/analysisRecognition/ocr






  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to a machine-readable format

    Examples:
    OCR
    Mathematical equation recognition

    Related:
    Text processing (separate category)
    Table recognition
    Map reading






  • condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

    Part of preceding or succeeding object included (e.g. other page)






  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)






  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)






  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations






  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low






  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page






  • content-encoding/structured

    E.g. XML






  • content-type/corpus


    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples:
    A text corpus,
    An image database






  • contentOfInterest/visual/graphical/separator


    Description coming soon.






  • contentOfInterest/visual/text


    Description coming soon.






  • data-attributes/document-related/structural/footnote-continued






  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page






  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page






  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)






  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used






  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used






  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)






  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur






  • data-attributes/language/mixed

    More than one language used






  • granularity/logical/document-related/paragraph


    Description coming soon.






  • granularity/physical/document-related/page


    Description coming soon.






  • granularity/physical/document-related/region

    Region, zone, block






  • granularity/physical/document-related/text-line


    Description coming soon.






  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example






  • platform/platform-independent


    Description coming soon.









Gothic/Blackletter




simple



  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples:
    Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related:
    "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.






  • activityDomain/computing/visual/analysisRecognition/ocr






  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to a machine-readable format

    Examples:
    OCR
    Mathematical equation recognition

    Related:
    Text processing (separate category)
    Table recognition
    Map reading






  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)






  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)






  • condition/ageing/warping

    Arbitrary warping (e.g. due to moisture)






  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page






  • condition/wear/additions/informative/annotations

    Annotations regarding the content






  • condition/wear/medium-damage/stains

    Noticeable stains on medium






  • content-encoding/structured

    E.g. XML






  • content-type/corpus


    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples:
    A text corpus,
    An image database






  • contentOfInterest/visual/graphical


    Description coming soon.






  • contentOfInterest/visual/graphical/separator


    Description coming soon.






  • contentOfInterest/visual/text


    Description coming soon.






  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page






  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)






  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used






  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used






  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)






  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur






  • granularity/logical/document-related/paragraph


    Description coming soon.






  • granularity/physical/document-related/page


    Description coming soon.






  • granularity/physical/document-related/region

    Region, zone, block






  • granularity/physical/document-related/text-line


    Description coming soon.






  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example






  • platform/platform-independent


    Description coming soon.









complex



  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples:
    Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related:
    "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.






  • activityDomain/computing/visual/analysisRecognition/ocr






  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to a machine-readable format

    Examples:
    OCR
    Mathematical equation recognition

    Related:
    Text processing (separate category)
    Table recognition
    Map reading






  • condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

    Part of preceding or succeeding object included (e.g. other page)






  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)






  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)






  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations






  • condition/ageing/warping

    Arbitrary warping (e.g. due to moisture)






  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low






  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page






  • condition/wear/additions/informative/annotations

    Annotations regarding the content






  • condition/wear/additions/informative/stamps

    The medium was stamped






  • condition/wear/medium-damage/stains

    Noticeable stains on medium






  • content-encoding/structured

    E.g. XML






  • content-type/corpus


    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples:
    A text corpus,
    An image database






  • contentOfInterest/visual/composite/music


    Description coming soon.






  • contentOfInterest/visual/graphical


    Description coming soon.






  • contentOfInterest/visual/graphical/separator


    Description coming soon.






  • contentOfInterest/visual/text


    Description coming soon.






  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page






  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page






  • data-attributes/document-related/visual/decorations

    Decorations of some kind






  • data-attributes/document-related/visual/illustrations

    Illustrations in content






  • data-attributes/document-related/visual/illustrations/multi-colour

    Multi-colour illustrations in content






  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)






  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used






  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used






  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)






  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur






  • data-attributes/language/mixed

    More than one language used






  • granularity/logical/document-related/paragraph


    Description coming soon.






  • granularity/physical/document-related/page


    Description coming soon.






  • granularity/physical/document-related/region

    Region, zone, block






  • granularity/physical/document-related/text-line


    Description coming soon.






  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example






  • platform/platform-independent


    Description coming soon.









Antiqua




simple



  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples:
    Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related:
    "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.






  • activityDomain/computing/visual/analysisRecognition/ocr






  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to a machine-readable format

    Examples:
    OCR
    Mathematical equation recognition

    Related:
    Text processing (separate category)
    Table recognition
    Map reading






  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page






  • condition/wear/medium-damage/stains

    Noticeable stains on medium






  • content-encoding/structured

    E.g. XML






  • content-type/corpus


    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples:
    A text corpus,
    An image database






  • contentOfInterest/visual/graphical/separator


    Description coming soon.






  • contentOfInterest/visual/text


    Description coming soon.






  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)






  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used






  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)






  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur






  • granularity/logical/document-related/paragraph


    Description coming soon.






  • granularity/physical/document-related/page


    Description coming soon.






  • granularity/physical/document-related/region

    Region, zone, block






  • granularity/physical/document-related/text-line


    Description coming soon.






  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example






  • platform/platform-independent


    Description coming soon.









complex



  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples:
    Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related:
    "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.






  • activityDomain/computing/visual/analysisRecognition/ocr






  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to a machine-readable format

    Examples:
    OCR
    Mathematical equation recognition

    Related:
    Text processing (separate category)
    Table recognition
    Map reading






  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page






  • condition/wear/additions/informative/annotations

    Annotations regarding the content






  • condition/wear/medium-damage/stains

    Noticeable stains on medium






  • content-encoding/structured

    E.g. XML






  • content-type/corpus


    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples:
    A text corpus,
    An image database






  • contentOfInterest/visual/text


    Description coming soon.






  • data-attributes/document-related/structural/footnote-continued






  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page






  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page






  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)






  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used






  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used






  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)






  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur






  • data-attributes/language/mixed

    More than one language used






  • granularity/logical/document-related/paragraph


    Description coming soon.






  • granularity/physical/document-related/page


    Description coming soon.






  • granularity/physical/document-related/region

    Region, zone, block






  • granularity/physical/document-related/text-line


    Description coming soon.






  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example






  • platform/platform-independent


    Description coming soon.
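
The per-collection label lists above can be compared with plain set operations. The sketch below is an illustration only: it uses just the `condition/...` and `data-attributes/document-related/structural/...` labels listed for the Antiqua simple and Antiqua complex collections in this README, and the variable names are hypothetical.

```python
# "condition" and "structural" labels listed above for the two Antiqua collections.
antiqua_simple = {
    "condition/production-related/document-faults/ink-from-facing",
    "condition/wear/medium-damage/stains",
}
antiqua_complex = {
    "condition/production-related/document-faults/ink-from-facing",
    "condition/wear/additions/informative/annotations",
    "condition/wear/medium-damage/stains",
    "data-attributes/document-related/structural/footnote-continued",
    "data-attributes/document-related/structural/footnotes",
    "data-attributes/document-related/structural/running-titles",
}

# Labels that occur only in the complex collection (set difference).
only_complex = sorted(antiqua_complex - antiqua_simple)
# Labels that both collections share (set intersection).
shared = sorted(antiqua_simple & antiqua_complex)

print("only in complex:")
for label in only_complex:
    print("  " + label)
print("shared:")
for label in shared:
    print("  " + label)
```

For this excerpt, the labels found only in the complex collection are the annotation and footnote-related ones, which matches the corpus description of complex layouts (columns, footnotes, ...).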