{"id":30344309,"url":"https://github.com/Unstructured-IO/unstructured-api","last_synced_at":"2025-08-18T12:03:04.281Z","repository":{"id":148800957,"uuid":"576425334","full_name":"Unstructured-IO/unstructured-api","owner":"Unstructured-IO","description":null,"archived":false,"fork":false,"pushed_at":"2025-07-05T19:32:42.000Z","size":40786,"stargazers_count":764,"open_issues_count":39,"forks_count":166,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-07-05T19:56:36.699Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Unstructured-IO.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-12-09T20:37:36.000Z","updated_at":"2025-07-05T18:37:50.000Z","dependencies_parsed_at":"2023-09-26T03:45:18.303Z","dependency_job_id":"40fce2a2-8a6b-4d1f-a4b5-d7afb2ed646f","html_url":"https://github.com/Unstructured-IO/unstructured-api","commit_stats":{"total_commits":274,"total_committers":36,"mean_commits":7.611111111111111,"dds":0.6532846715328466,"last_synced_commit":"d42a6cfcbfc826c511ab34849c805ae4c2279c11"},"previous_names":[],"tags_count":47,"template":false,"template_full_name":null,"purl":"pkg:github/Unstructured-IO/unstructured-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Funstructured-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Funstructured-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/Unstructured-IO%2Funstructured-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Funstructured-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Unstructured-IO","download_url":"https://codeload.github.com/Unstructured-IO/unstructured-api/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Funstructured-api/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270989147,"owners_count":24680688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-18T12:02:06.733Z","updated_at":"2025-08-18T12:03:04.265Z","avatar_url":"https://github.com/Unstructured-IO.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003ch3 align=\"center\"\u003e\n  \u003cimg src=\"img/unstructured_logo.png\" height=\"200\"\u003e\n\u003c/h3\u003e\n\n\u003ch3 align=\"center\"\u003e\n  \u003cp\u003eAPI Announcement!\u003c/p\u003e\n\u003c/h3\u003e\n\nWe are thrilled to announce our newly launched [Unstructured API](https://unstructured-io.github.io/unstructured/api.html). 
While access to the hosted Unstructured API will remain free, API keys are required to make requests. To prevent disruption, get yours [here](https://www.unstructured.io/#get-api-key) now and start using it today! Check out the [readme](https://github.com/Unstructured-IO/unstructured-api#--) to get started making API calls.\n\n#### :rocket: Beta Feature: Chipper Model\n\nWe are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API request, use the `hi_res` strategy. Please refer to the documentation [here](https://unstructured-io.github.io/unstructured/api.html#strategies).\n\nAs the Chipper model is in beta, we welcome feedback and suggestions. If you are interested in testing the Chipper model, we encourage you to connect with us in our [Slack community](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).\n\n\u003cdiv align=\"center\"\u003e\n\n \u003ca\n   href=\"https://www.phorm.ai/query?projectId=34efc517-2201-4376-af43-40c4b9da3dc5\"\u003e\n\t\u003cimg 
src=\"https://img.shields.io/badge/Phorm-Ask_AI-%23F2777A.svg?\u0026logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNSIgaGVpZ2h0PSI0IiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgogIDxwYXRoIGQ9Ik00LjQzIDEuODgyYTEuNDQgMS40NCAwIDAgMS0uMDk4LjQyNmMtLjA1LjEyMy0uMTE1LjIzLS4xOTIuMzIyLS4wNzUuMDktLjE2LjE2NS0uMjU1LjIyNmExLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxMmMtLjA5OS4wMTItLjE5Mi4wMTQtLjI3OS4wMDZsLTEuNTkzLS4xNHYtLjQwNmgxLjY1OGMuMDkuMDAxLjE3LS4xNjkuMjQ2LS4xOTFhLjYwMy42MDMgMCAwIDAgLjItLjEwNi41MjkuNTI5IDAgMCAwIC4xMzgtLjE3LjY1NC42NTQgMCAwIDAgLjA2NS0uMjRsLjAyOC0uMzJhLjkzLjkzIDAgMCAwLS4wMzYtLjI0OS41NjcuNTY3IDAgMCAwLS4xMDMtLjIuNTAyLjUwMiAwIDAgMC0uMTY4LS4xMzguNjA4LjYwOCAwIDAgMC0uMjQtLjA2N0wyLjQzNy43MjkgMS42MjUuNjcxYS4zMjIuMzIyIDAgMCAwLS4yMzIuMDU4LjM3NS4zNzUgMCAwIDAtLjExNi4yMzJsLS4xMTYgMS40NS0uMDU4LjY5Ny0uMDU4Ljc1NEwuNzA1IDRsLS4zNTctLjA3OUwuNjAyLjkwNkMuNjE3LjcyNi42NjMuNTc0LjczOS40NTRhLjk1OC45NTggMCAwIDEgLjI3NC0uMjg1Ljk3MS45NzEgMCAwIDEgLjMzNy0uMTRjLjExOS0uMDI2LjIyNy0uMDM0LjMyNS0uMDI2TDMuMjMyLjE2Yy4xNTkuMDE0LjMzNi4wMy40NTkuMDgyYTEuMTczIDEuMTczIDAgMCAxIC41NDUuNDQ3Yy4wNi4wOTQuMTA5LjE5Mi4xNDQuMjkzYTEuMzkyIDEuMzkyIDAgMCAxIC4wNzguNThsLS4wMjkuMzJaIiBmaWxsPSIjRjI3NzdBIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuMWExLjE3IDEuMTcgMCAwI
DEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuMWExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+Cjwvc3ZnPgo=\" /\u003e\n   \u003c/a\u003e\n\n\u003c/div\u003e\n\n\n---\n\n\u003ch3 align=\"center\"\u003e\n  \u003cp\u003eGeneral Pre-Processing Pipeline for Documents\u003c/p\u003e\n\u003c/h3\u003e\n\nThis repo implements a pre-processing pipeline for the following documents. Currently, the pipeline is capable of recognizing the file type and choosing the relevant partition function to process the file.\n\n\n| Category  | Document Types                |\n|-----------|-------------------------------|\n| Plaintext | `.txt`, `.eml`, `.msg`, `.xml`, `.html`, `.md`, `.rst`, `.json`, `.rtf` |\n| Images    | `.jpeg`, `.png`               |\n| Documents | `.doc`, `.docx`, `.ppt`, `.pptx`, `.pdf`, `.odt`, `.epub`, `.csv`, `.tsv`, `.xlsx` |\n| Zipped    | `.gz`                         |\n\n\n## :rocket: Unstructured API\n\nTry our hosted API! It's freely available to use with any of the filetypes listed above. 
This is the easiest way to get started. If you'd like to host your own version of the API, jump down to the [Developer Quick Start](#developer-quick-start).\n\n```\n curl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -H 'unstructured-api-key: \u003cYOUR API KEY\u003e' \\\n  -F 'files=@sample-docs/family-day.eml' \\\n  | jq -C . | less -R\n```\n\n### Parameters\n\n#### Strategies\n\nFour strategies are available for processing PDF and image files: `hi_res`, `fast`, `ocr_only` and `auto`. `fast` is the default `strategy` and works well for documents that do not have text embedded in images.\n\nOn the other hand, `hi_res` is the better choice for PDFs that may have text within embedded images, or for achieving greater precision of [element types](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements) in the response JSON. Please be aware that, as of this writing, `hi_res` requests may take 20 times longer to process than `fast` requests. See the example below for making a `hi_res` request.\n\n```\n curl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/layout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  | jq -C . | less -R\n```\n\nThe `ocr_only` strategy runs the document through Tesseract for OCR. Currently, `hi_res` has difficulty ordering elements for documents with multiple columns. If you have a document with multiple columns that do not have extractable text, we recommend using the `ocr_only` strategy. 
Please be aware that `ocr_only` will fall back to another strategy if Tesseract is not available.\n\nFor the best of all worlds, `auto` will determine when a page can be extracted using `fast` or `ocr_only` mode; otherwise it will fall back to `hi_res`.\n\n#### Hi Res model name\n\nThe `hi_res` strategy supports different models, and the default is `detectron2onnx`. You can also specify the `hi_res_model_name` parameter to run the `hi_res` strategy with the Chipper model while using the hosted API:\n\n```\n curl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/layout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  -F 'hi_res_model_name=chipper'  \\\n  | jq -C . | less -R\n```\n\nOther models, for example `yolox`, can be used locally. Please refer to the [Using the API locally](#using-the-api-locally) section for more information on how to use the local API.\n\n#### OCR languages\n\nNote: This kwarg will eventually be deprecated. Please use `languages`.\nYou can also specify what languages to use for OCR with the `ocr_languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document.\n\n```\ncurl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/english-and-korean.png' \\\n  -F 'strategy=ocr_only' \\\n  -F 'ocr_languages=eng'  \\\n  -F 'ocr_languages=kor'  \\\n  | jq -C . | less -R\n```\n\n#### Languages\n\nYou can also specify what languages to use for OCR with the `languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. 
OCR is only applied if the text is not already available in the PDF document.\n\n```\ncurl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/english-and-korean.png' \\\n  -F 'strategy=ocr_only' \\\n  -F 'languages=eng'  \\\n  -F 'languages=kor'  \\\n  | jq -C . | less -R\n```\n\n#### Coordinates\n\nWhen elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response.\n\n```\n curl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/layout-parser-paper.pdf' \\\n  -F 'coordinates=true' \\\n  | jq -C . | less -R\n```\n\n#### Skip Table Extraction\n\nTable extraction can be enabled or disabled per file type. Set the `skip_infer_table_types` parameter to the list of document types for which table extraction should be skipped. By default, table extraction is enabled\nfor all file types (`skip_infer_table_types=[]`). Please note that table extraction only works with the `hi_res` strategy. For example, if you want to skip table extraction for images, you can pass a list with matching image file types:\n\n```\n curl -X 'POST' \\\n  'https://api.unstructured.io/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \\\n  -F 'strategy=hi_res' \\\n  -F 'skip_infer_table_types=[\"jpg\"]' \\\n  | jq -C . | less -R\n```\n\n#### Encoding\n\nYou can specify the encoding to use to decode the text input. 
If no value is provided, utf-8 will be used.\n\n```\ncurl -X 'POST' \\\n 'https://api.unstructured.io/general/v0/general' \\\n -H 'accept: application/json'  \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'files=@sample-docs/fake-power-point.pptx' \\\n -F 'encoding=utf_8' \\\n | jq -C . | less -R\n```\n\n#### Gzipped files\n\nYou can send a gzipped file and the API will un-gzip it.\n\n```\ncurl -X 'POST' \\\n 'https://api.unstructured.io/general/v0/general' \\\n -H 'accept: application/json'  \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'gz_uncompressed_content_type=application/pdf' \\\n -F 'files=@sample-docs/layout-parser-paper.pdf.gz'\n```\n\nIf the `gz_uncompressed_content_type` field is set, the API will use its value as the content type of all files\nafter uncompressing the .gz files that are sent in a single batch. If not set, the API will use\nvarious heuristics to detect the file types after uncompressing from .gz.\n\n#### XML Tags\n\nWhen processing XML documents, set the `xml_keep_tags` parameter to `true` to retain the XML tags in the output. If not specified, the API will simply extract the text from within the tags.\n\n```\ncurl -X 'POST' \\\n 'https://api.unstructured.io/general/v0/general' \\\n -H 'accept: application/json'  \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'files=@sample-docs/fake-xml.xml' \\\n -F 'xml_keep_tags=true' \\\n | jq -C . | less -R\n```\n\n#### Page Breaks\n\nFor supported filetypes, set the `include_page_breaks` parameter to `true` to include `PageBreak` elements in the output.\n\n```\ncurl -X 'POST' \\\n 'https://api.unstructured.io/general/v0/general' \\\n -H 'accept: application/json'  \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'files=@sample-docs/layout-parser-paper-fast.pdf' \\\n -F 'include_page_breaks=true' \\\n | jq -C . | less -R\n```\n\n\n#### Unique element IDs\n\nBy default, the element ID is a SHA-256 hash of the element text. This is to ensure that\nthe ID is deterministic. 
One downside is that the ID is not guaranteed to be unique.\nDifferent elements with the same text will have the same ID, and there could also be hash collisions.\nTo use UUIDs in the output instead, set ``unique_element_ids=true``. Note: this means that the element IDs\nwill be random, so with every partition of the same file, you will get different IDs.\nThis can be helpful if you'd like to use the IDs as a primary key in a database, for example.\n\n```\ncurl -X 'POST' \\\n 'https://api.unstructured.io/general/v0/general' \\\n -H 'accept: application/json'  \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'files=@sample-docs/layout-parser-paper-fast.pdf' \\\n -F 'unique_element_ids=true' \\\n | jq -C . | less -R\n```\n\n\n#### Chunking Elements\n\nUse the `chunking_strategy` form-field to chunk text into larger or smaller elements. Defaults to `None`, which performs no chunking. The available chunking strategies are `basic` and `by_title`.\n\nThe `basic` strategy combines whole consecutive document elements to maximally fill chunks of `max_characters` length. A single element that by itself exceeds `max_characters` is divided into two or more chunks by text-splitting (on a word boundary).\n\nThe `by_title` strategy has the same behaviors except document section boundaries are respected, meaning elements from two different sections never occur in the same chunk. A `Title` (section heading) element introduces a new section, hence the name.\n\n  Additional Parameters (all optional):\n\n    `max_characters`\n      The hard maximum chunk size. No chunk will exceed this length. 
Defaults to 500.\n\n    `new_after_n_chars`\n      A chunk of this length or greater is considered \"full\" and will not receive an additional element, even if it would fit within `max_characters`.\n      This \"soft-maximum\" defaults to `max_characters`.\n\n    `overlap`\n      Specifies the length of a string (\"tail\") to be drawn from each chunk and prefixed to the\n      next chunk as a context-preserving mechanism. By default, this only applies to split-chunks\n      where an oversized element is divided into multiple chunks by text-splitting.\n\n    `overlap_all`\n      Default: `False`. When `True`, apply overlap between \"normal\" chunks formed from whole\n      elements and not subject to text-splitting. Use this with caution as it entails a certain\n      level of \"pollution\" of otherwise clean semantic chunk boundaries.\n\n    `combine_under_n_chars`\n      Combines elements (for example a series of titles) until a section reaches a\n      length of n characters. Defaults to 500. Only operative for the \"by_title\"\n      strategy.\n\n    `multipage_sections`\n      If True, sections can span multiple pages. Defaults to True. Only operative for\n      the \"by_title\" strategy.\n\n\n```\ncurl -X 'POST' \\\n 'https://api.unstructured.io/general/v0/general' \\\n -H 'accept: application/json'  \\\n -H 'Content-Type: multipart/form-data' \\\n -F 'files=@sample-docs/layout-parser-paper-fast.pdf' \\\n -F 'chunking_strategy=by_title' \\\n | jq -C . | less -R\n```\n\n## Developer Quick Start\n\n* Using `pyenv` to manage virtualenvs is recommended\n  * Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.\n    * `brew install pyenv-virtualenv`\n    * `pyenv install 3.12`\n  * Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).\n\n  * Create a virtualenv to work in and activate it, e.g. 
for one named `unstructured-api`:\n\n\t`pyenv virtualenv 3.12 unstructured-api` \u003cbr /\u003e\n\t`pyenv activate unstructured-api`\n\nIf you need to process all file types, see the [Unstructured Quick Start](https://github.com/Unstructured-IO/unstructured#eight_pointed_black_star-quick-start) for the many OS dependencies that are required.\n\n* Run `make install`\n* Start a local Jupyter notebook server with `make run-jupyter` \u003cbr /\u003e\n\t**OR** \u003cbr /\u003e\n\tjust start the FastAPI app locally with `make run-web-app`\n\n#### Using the API locally\n\nAfter running `make run-web-app` (or `make docker-start-api` to run in the container), you can hit the API locally at port 8000. The `sample-docs` directory has a number of example file types that are currently supported.\n\nFor example:\n```\n curl -X 'POST' \\\n  'http://localhost:8000/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/family-day.eml' \\\n  | jq -C . 
| less -R\n```\n\nThe response will be a list of the extracted elements:\n```\n[\n  {\n    \"element_id\": \"db1ca22813f01feda8759ff04a844e56\",\n    \"coordinates\": null,\n    \"text\": \"Hi All,\",\n    \"type\": \"UncategorizedText\",\n    \"metadata\": {\n      \"date\": \"2022-12-21T10:28:53-06:00\",\n      \"sent_from\": [\n        \"Mallori Harrell \u003cmallori@unstructured.io\u003e\"\n      ],\n      \"sent_to\": [\n        \"Mallori Harrell \u003cmallori@unstructured.io\u003e\"\n      ],\n      \"subject\": \"Family Day\",\n      \"filename\": \"family-day.eml\"\n    }\n  },\n...\n...\n```\n\nThe output format can also be set to `text/csv` to get the data in csv format rather than json:\n```\n curl -X 'POST' \\\n  'http://localhost:8000/general/v0/general' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/family-day.eml' \\\n  -F 'output_format=\"text/csv\"'\n```\n\nThe response will be a list of the extracted elements in csv format:\n```\ntype,element_id,text,filename,sent_from,sent_to,subject,languages,filetype\nUncategorizedText,db1ca22813f01feda8759ff04a844e56,\"Hi All,\",family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nNarrativeText,a663c393a5e143c01ef2bb5c98efa2c1,Get excited for our first annual family day! 
,family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nNarrativeText,ce65ca3bef59957d3f1c2bab5725c82f,\"There will be face painting, a petting zoo, funnel cake and more.\",family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nNarrativeText,d7bcf988af9f06042d83e25c531e5744,Make sure to RSVP!,family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nTitle,5550577db69c2c8aabcd90979698120a,Best.,family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nTitle,ca1c571d993b6c1ed8ef56a06c16ba22,Mallori Harrell,family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nTitle,d5b612de8cd918addd9569b0255b65b2,Unstructured Technologies,family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\nTitle,2e0b9e8ee04b9594a9c26d8535b818ff,Data Scientist,family-day.eml,['Mallori Harrell \u003cmallori@unstructured.io\u003e'],['Mallori Harrell \u003cmallori@unstructured.io\u003e'],Family Day,['eng'],message/rfc822\n```\n\n#### Parallel Mode for PDFs\nAs mentioned above, processing a pdf using `hi_res` is currently a slow operation. One workaround is to split the pdf into smaller files, process these asynchronously, and merge the results. 
You can enable parallel processing mode with the following env variables:\n\n* `UNSTRUCTURED_PARALLEL_MODE_ENABLED` - set to `true` to process individual pdf pages remotely, default is `false`.\n* `UNSTRUCTURED_PARALLEL_MODE_URL` - the location to send pdf pages to asynchronously, no default setting at the moment.\n* `UNSTRUCTURED_PARALLEL_MODE_THREADS` - the number of threads making requests at once, default is `3`.\n* `UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE` - the number of pages to be processed in one request, default is `1`.\n* `UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS` - the number of retry attempts on a retryable error, default is `2`. (i.e. 3 attempts are made in total)\n\nDue to the overhead associated with file splitting, parallel processing mode is only recommended for the `hi_res` strategy. Additionally, users of the official [Python client](https://github.com/Unstructured-IO/unstructured-python-client?tab=readme-ov-file#splitting-pdf-by-pages) can enable client-side splitting by setting `split_pdf_page=True`.\n\n#### Security\nYou may also set the optional `UNSTRUCTURED_API_KEY` env variable to enable request validation for your self-hosted instance of Unstructured. If set, only requests including an `unstructured-api-key` header with the same value will be fulfilled. Otherwise, the server will return a 401 indicating that the request is unauthorized.\n\n#### Controlling Server Load\nSome documents will use a lot of memory as they're being processed. To mitigate OOM errors, the server will return a 503 if the host's available memory drops below 2GB. This is configured with the environment variable `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`, which defaults to 2048. You can lower this value to let the server use more memory before it starts returning 503 responses, or set it to 0 to remove this check entirely.\n\n#### Controlling Server Lifetime\nBy default, the server will run indefinitely. 
To change that, set the `MAX_LIFETIME_SECONDS` environment variable. If the server is run with this variable set, it will enter a graceful shutdown period `MAX_LIFETIME_SECONDS` seconds after its initialization. The graceful shutdown period lasts for up to 3600 seconds, and during it:\n- the server denies any new requests; they are met with an empty response,\n- the server continues processing active requests and shuts down (ending the graceful period) once all of them are processed.\n\nIf the server is still running after the graceful period is over, it is shut down forcefully, cancelling all active requests and sending empty responses to each of them.\n\n*Max lifetime requires GNU [timeout](https://www.gnu.org/software/coreutils/manual/html_node/timeout-invocation.html#timeout-invocation) to be installed, which is available by default on most Linux systems. On macOS, it can be installed as `gtimeout` with GNU coreutils.*\n\n## :dizzy: Instructions for using the Docker image\n\nThe following instructions are intended to help you get up and running using Docker to interact with `unstructured-api`.\nSee [here](https://docs.docker.com/get-docker/) if you don't already have Docker installed on your machine.\n\nNOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. `docker pull` should download the corresponding image for your architecture, but you can specify it with `--platform` (e.g. `--platform linux/amd64`) if needed.\n\nWe build Docker images for all pushes to `main`. We tag each image with the corresponding short commit hash (e.g. `fbc7a69`) and the application version (e.g. `0.5.5-dev1`). We also tag the most recent image with `latest`. 
To leverage this, `docker pull` from our image repository.\n\n```bash\ndocker pull downloads.unstructured.io/unstructured-io/unstructured-api:latest\n```\n\nOnce pulled, you can launch the container as a web app on localhost:8000.\n\n```bash\ndocker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest\n```\n\nYou can pass in a PORT variable to run the server on a different port in the container.\n\n```bash\ndocker run -p 9500:9500 -d --rm --name unstructured-api -e PORT=9500 downloads.unstructured.io/unstructured-io/unstructured-api:latest\n```\n\n## Security Policy\n\nSee our [security policy](https://github.com/Unstructured-IO/pipeline-emails/security/policy) for\ninformation on how to report security vulnerabilities.\n\n## Learn more\n\n| Section | Description |\n|-|-|\n| [Unstructured Community GitHub](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects  |\n| [Unstructured GitHub](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |\n| [Company Website](https://unstructured.io) | Unstructured.io product and company info |\n\n## :chart_with_upwards_trend: Analytics\n\nWe’ve partnered with Scarf (https://scarf.sh) to collect anonymized user statistics to understand which features our community is using and how to prioritize product decision-making in the future. To learn more about how we collect and use this data, please read our [Privacy Policy](https://unstructured.io/privacy-policy).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUnstructured-IO%2Funstructured-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FUnstructured-IO%2Funstructured-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUnstructured-IO%2Funstructured-api/lists"}