{"id":21029376,"url":"https://github.com/explosion/spacy-layout","last_synced_at":"2025-05-14T19:10:07.650Z","repository":{"id":263391323,"uuid":"889850064","full_name":"explosion/spacy-layout","owner":"explosion","description":"📚 Process PDFs, Word documents and more with spaCy","archived":false,"fork":false,"pushed_at":"2025-03-08T06:51:45.000Z","size":2320,"stargazers_count":535,"open_issues_count":18,"forks_count":36,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-14T00:57:17.663Z","etag":null,"topics":["document-layout","document-layout-analysis","docx","generative-ai","natural-language-processing","nlp","pdf","pdf-converter","rag","spacy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-17T12:05:30.000Z","updated_at":"2025-04-12T23:18:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"02d43c65-bda1-410d-8b3f-32c2c046b44c","html_url":"https://github.com/explosion/spacy-layout","commit_stats":null,"previous_names":["explosion/spacy-layout"],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-layout","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-layout/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-layout/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/explosion%2Fspacy-layout/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/spacy-layout/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254209859,"owners_count":22032897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-layout","document-layout-analysis","docx","generative-ai","natural-language-processing","nlp","pdf","pdf-converter","rag","spacy"],"created_at":"2024-11-19T12:12:01.271Z","updated_at":"2025-05-14T19:10:06.512Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003ca href=\"https://explosion.ai\"\u003e\u003cimg src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\u003c/a\u003e\n\n# spaCy Layout: Process PDFs, Word documents and more with spaCy\n\nThis plugin integrates with [Docling](https://ds4sd.github.io/docling/) to bring structured processing of **PDFs**, **Word documents** and other input formats to your [spaCy](https://spacy.io) pipeline. It outputs clean, **structured data** in a text-based format and creates spaCy's familiar [`Doc`](https://spacy.io/api/doc) objects that let you access labelled text spans like sections or headings, and tables with their data converted to a `pandas.DataFrame`.\n\nThis workflow makes it easy to apply powerful **NLP techniques** to your documents, including linguistic analysis, named entity recognition, text classification and more. 
It's also great for implementing **chunking for RAG** pipelines.\n\n\u003e 📖 **Blog post:** [\"From PDFs to AI-ready structured data: a deep dive\"\n](https://explosion.ai/blog/pdfs-nlp-structured-data) – A new modular workflow for converting PDFs and similar documents to structured data, featuring `spacy-layout` and Docling.\n\n[![Test](https://github.com/explosion/spacy-layout/actions/workflows/test.yml/badge.svg)](https://github.com/explosion/spacy-layout/actions/workflows/test.yml)\n[![Current Release Version](https://img.shields.io/github/release/explosion/spacy-layout.svg?style=flat-square\u0026logo=github\u0026include_prereleases)](https://github.com/explosion/spacy-layout/releases)\n[![pypi Version](https://img.shields.io/pypi/v/spacy-layout.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/spacy-layout/)\n[![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg?style=flat-square)](https://spacy.io)\n\n## 📝 Usage\n\n\u003e ⚠️ This package requires **Python 3.10** or above.\n\n```bash\npip install spacy-layout\n```\n\nAfter initializing the `spaCyLayout` preprocessor with an `nlp` object for tokenization, you can call it on a document path to convert it to structured data. 
The resulting `Doc` object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.\n\n```python\nimport spacy\nfrom spacy_layout import spaCyLayout\n\nnlp = spacy.blank(\"en\")\nlayout = spaCyLayout(nlp)\n\n# Process a document and create a spaCy Doc object\ndoc = layout(\"./starcraft.pdf\")\n\n# The text-based contents of the document\nprint(doc.text)\n# Document layout including pages and page sizes\nprint(doc._.layout)\n# Tables in the document and their extracted data\nprint(doc._.tables)\n# Markdown representation of the document\nprint(doc._.markdown)\n\n# Layout spans for different sections\nfor span in doc.spans[\"layout\"]:\n    # Document section and token and character offsets into the text\n    print(span.text, span.start, span.end, span.start_char, span.end_char)\n    # Section type, e.g. \"text\", \"title\", \"section_header\" etc.\n    print(span.label_)\n    # Layout features of the section, including bounding box\n    print(span._.layout)\n    # Closest heading to the span (accuracy depends on document structure)\n    print(span._.heading)\n```\n\nIf you need to process larger volumes of documents at scale, you can use the `spaCyLayout.pipe` method, which takes an iterable of paths or bytes instead and yields `Doc` objects:\n\n```python\npaths = [\"one.pdf\", \"two.pdf\", \"three.pdf\", ...]\nfor doc in layout.pipe(paths):\n    print(doc._.layout)\n```\n\nspaCy also allows you to call the `nlp` object on an already created `Doc`, so you can easily apply a pipeline of components for [linguistic analysis](https://spacy.io/usage/linguistic-features) or [named entity recognition](https://spacy.io/usage/linguistic-features#named-entities), use [rule-based matching](https://spacy.io/usage/rule-based-matching) or anything else you can do with spaCy.\n\n```python\n# Load the transformer-based English pipeline\n# Installation: python -m spacy download en_core_web_trf\nnlp = 
spacy.load(\"en_core_web_trf\")\nlayout = spaCyLayout(nlp)\n\ndoc = layout(\"./starcraft.pdf\")\n# Apply the pipeline to access POS tags, dependencies, entities etc.\ndoc = nlp(doc)\n```\n\n### Tables and tabular data\n\nTables are included in the layout spans with the label `\"table\"` and under the shortcut `Doc._.tables`. They expose a `layout` extension attribute, as well as an attribute `data`, which includes the tabular data converted to a `pandas.DataFrame`.\n\n```python\nfor table in doc._.tables:\n    # Token position and bounding box\n    print(table.start, table.end, table._.layout)\n    # pandas.DataFrame of contents\n    print(table._.data)\n```\n\nBy default, the span text is a placeholder `TABLE`, but you can customize how a table is rendered by providing a `display_table` callback to `spaCyLayout`, which receives the `pandas.DataFrame` of the data. This allows you to include the table figures in the document text and use them later on, e.g. during information extraction with a trained named entity recognizer or text classifier.\n\n```python\nimport pandas as pd\n\ndef display_table(df: pd.DataFrame) -\u003e str:\n    return f\"Table with columns: {', '.join(df.columns.tolist())}\"\n\nlayout = spaCyLayout(nlp, display_table=display_table)\n```\n\n### Serialization\n\nAfter you've processed the documents, you can [serialize](https://spacy.io/usage/saving-loading#docs) the structured `Doc` objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.\n\n```python\nfrom spacy.tokens import DocBin\n\ndocs = layout.pipe([\"one.pdf\", \"two.pdf\", \"three.pdf\"])\ndoc_bin = DocBin(docs=docs, store_user_data=True)\ndoc_bin.to_disk(\"./file.spacy\")\n```\n\n\u003e ⚠️ **Note on deserializing with extension attributes:** The custom extension attributes like `Doc._.layout` are currently registered when `spaCyLayout` is initialized. 
So if you're loading back `Doc` objects with layout information from a binary file, you'll need to initialize it so the custom attributes can be repopulated. We're planning on making this more elegant in an upcoming version.\n\u003e\n\u003e ```diff\n\u003e + layout = spaCyLayout(nlp)\n\u003e doc_bin = DocBin(store_user_data=True).from_disk(\"./file.spacy\")\n\u003e docs = list(doc_bin.get_docs(nlp.vocab))\n\u003e ```\n\n\n## 🎛️ API\n\n### Data and extension attributes\n\n```python\nlayout = spaCyLayout(nlp)\ndoc = layout(\"./starcraft.pdf\")\nprint(doc._.layout)\nfor span in doc.spans[\"layout\"]:\n    print(span.label_, span._.layout)\n```\n\n| Attribute | Type | Description |\n| --- | --- | --- |\n| `Doc._.layout` | `DocLayout` | Layout features of the document. |\n| `Doc._.pages` | `list[tuple[PageLayout, list[Span]]]` | Pages in the document and the spans they contain. |\n| `Doc._.tables` | `list[Span]` | All tables in the document. |\n| `Doc._.markdown` | `str` | Markdown representation of the document. |\n| `Doc.spans[\"layout\"]` | `spacy.tokens.SpanGroup` | The layout spans in the document. |\n| `Span.label_` | `str` | The type of the extracted layout span, e.g. `\"text\"` or `\"section_header\"`. [See here](https://github.com/DS4SD/docling-core/blob/14cad33ae7f8dc011a79dd364361d2647c635466/docling_core/types/doc/labels.py) for options. |\n| `Span.label` | `int` | The integer ID of the span label. |\n| `Span.id` | `int` | Running index of layout span. |\n| `Span._.layout` | `SpanLayout \\| None` | Layout features of a layout span. |\n| `Span._.heading` | `Span \\| None` | Closest heading to a span, if available. |\n| `Span._.data` | `pandas.DataFrame \\| None` | The extracted data for table spans. |\n\n### \u003ckbd\u003edataclass\u003c/kbd\u003e PageLayout\n\n| Attribute | Type | Description |\n| --- | --- | --- |\n| `page_no` | `int` | The page number (1-indexed). |\n| `width` | `float` | Page width in pixels. 
|\n| `height` | `float` | Page height in pixels. |\n\n### \u003ckbd\u003edataclass\u003c/kbd\u003e DocLayout\n\n| Attribute | Type | Description |\n| --- | --- | --- |\n| `pages` | `list[PageLayout]` | The pages in the document. |\n\n### \u003ckbd\u003edataclass\u003c/kbd\u003e SpanLayout\n\n| Attribute | Type | Description |\n| --- | --- | --- |\n| `x` | `float` | Horizontal offset of the bounding box in pixels. |\n| `y` | `float` | Vertical offset of the bounding box in pixels. |\n| `width` | `float` | Width of the bounding box in pixels. |\n| `height` | `float` | Height of the bounding box in pixels. |\n| `page_no` | `int` | Number of page the span is on. |\n\n### \u003ckbd\u003eclass\u003c/kbd\u003e `spaCyLayout`\n\n#### \u003ckbd\u003emethod\u003c/kbd\u003e `spaCyLayout.__init__`\n\nInitialize the document processor.\n\n```python\nnlp = spacy.blank(\"en\")\nlayout = spaCyLayout(nlp)\n```\n\n| Argument | Type | Description |\n| --- | --- | --- |\n| `nlp` | `spacy.language.Language` | The initialized `nlp` object to use for tokenization. |\n| `separator` | `str` | Token used to separate sections in the created `Doc` object. The separator won't be part of the layout span. If `None`, no separator will be added. Defaults to `\"\\n\\n\"`. |\n| `attrs` | `dict[str, str]` | Override the custom spaCy attributes. Can include `\"doc_layout\"`, `\"doc_pages\"`, `\"doc_tables\"`, `\"doc_markdown\"`, `\"span_layout\"`, `\"span_data\"`, `\"span_heading\"` and `\"span_group\"`. |\n| `headings` | `list[str]` | Labels of headings to consider for `Span._.heading` detection. Defaults to `[\"section_header\", \"page_header\", \"title\"]`. |\n| `display_table` | `Callable[[pandas.DataFrame], str] \\| str` | Function to generate the text-based representation of the table in the `Doc.text` or placeholder text. Defaults to `\"TABLE\"`. 
|\n| `docling_options` | `dict[InputFormat, FormatOption]` | [Format options](https://ds4sd.github.io/docling/usage/#advanced-options) passed to Docling's `DocumentConverter`. |\n| **RETURNS** | `spaCyLayout` | The initialized object. |\n\n#### \u003ckbd\u003emethod\u003c/kbd\u003e `spaCyLayout.__call__`\n\nProcess a document and create a spaCy [`Doc`](https://spacy.io/api/doc) object containing the text content and layout spans, available via `Doc.spans[\"layout\"]` by default.\n\n```python\nlayout = spaCyLayout(nlp)\ndoc = layout(\"./starcraft.pdf\")\n```\n\n| Argument | Type | Description |\n| --- | --- | --- |\n| `source` | `str \\| Path \\| bytes \\| DoclingDocument` | Path of document to process, bytes or already created `DoclingDocument`. |\n| **RETURNS** | `Doc` | The processed spaCy `Doc` object. |\n\n#### \u003ckbd\u003emethod\u003c/kbd\u003e `spaCyLayout.pipe`\n\nProcess multiple documents and create spaCy [`Doc`](https://spacy.io/api/doc) objects. You should use this method if you're processing larger volumes of documents at scale. The behavior of `as_tuples` works like it does in spaCy's [`Language.pipe`](https://spacy.io/api/language#pipe).\n\n```python\nlayout = spaCyLayout(nlp)\npaths = [\"one.pdf\", \"two.pdf\", \"three.pdf\", ...]\ndocs = layout.pipe(paths)\n```\n\n```python\nsources = [(\"one.pdf\", {\"id\": 1}), (\"two.pdf\", {\"id\": 2})]\nfor doc, context in layout.pipe(sources, as_tuples=True):\n    ...\n```\n\n| Argument | Type | Description |\n| --- | --- | --- |\n| `sources` | `Iterable[str \\| Path \\| bytes] \\| Iterable[tuple[str \\| Path \\| bytes, Any]]` | Paths of documents to process or bytes, or `(source, context)` tuples if `as_tuples` is set to `True`. |\n| `as_tuples` | `bool` | If set to `True`, inputs should be an iterable of `(source, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. 
|\n| **YIELDS** | `Doc \\| tuple[Doc, Any]` | The processed spaCy `Doc` objects or `(doc, context)` tuples if `as_tuples` is set to `True`. |\n\n## 💡 Examples and code snippets\n\nThis section includes further examples of what you can do with `spacy-layout`. If you have an example that could be a good fit, feel free to submit a [pull request](https://github.com/explosion/spacy-layout/pulls)!\n\n### Visualize a page and bounding boxes with matplotlib\n\n```python\nimport pypdfium2 as pdfium\nimport matplotlib.pyplot as plt\nfrom matplotlib.patches import Rectangle\nimport spacy\nfrom spacy_layout import spaCyLayout\n\nDOCUMENT_PATH = \"./document.pdf\"\n\n# Load and convert the PDF page to an image\npdf = pdfium.PdfDocument(DOCUMENT_PATH)\npage_image = pdf[2].render(scale=1)  # get page 3 (index 2)\nnumpy_array = page_image.to_numpy()\n# Process document with spaCy\nnlp = spacy.blank(\"en\")\nlayout = spaCyLayout(nlp)\ndoc = layout(DOCUMENT_PATH)\n\n# Get page 3 layout and sections\npage = doc._.pages[2]\npage_layout = doc._.layout.pages[2]\n# Create figure and axis with page dimensions\nfig, ax = plt.subplots(figsize=(12, 16))\n# Display the PDF image\nax.imshow(numpy_array)\n# Add rectangles for each section's bounding box\nfor section in page[1]:\n    # Create rectangle patch\n    rect = Rectangle(\n        (section._.layout.x, section._.layout.y),\n        section._.layout.width,\n        section._.layout.height,\n        fill=False,\n        color=\"blue\",\n        linewidth=1,\n        alpha=0.5\n    )\n    ax.add_patch(rect)\n    # Add text label at top of box\n    ax.text(\n        section._.layout.x,\n        section._.layout.y,\n        section.label_,\n        fontsize=8,\n        color=\"red\",\n        verticalalignment=\"bottom\"\n    )\n\nax.axis(\"off\")  # hide 
axes\nplt.show()\n```\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-layout","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fspacy-layout","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-layout/lists"}