{"id":18369594,"url":"https://github.com/unstructured-io/pipeline-sec-filings","last_synced_at":"2025-04-06T17:32:15.554Z","repository":{"id":60929272,"uuid":"542247261","full_name":"Unstructured-IO/pipeline-sec-filings","owner":"Unstructured-IO","description":"Preprocessing pipeline notebooks and API supporting text extraction from SEC documents","archived":true,"fork":false,"pushed_at":"2024-01-01T14:22:11.000Z","size":1378,"stargazers_count":143,"open_issues_count":12,"forks_count":31,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-03-01T16:29:36.386Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Unstructured-IO.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-27T19:05:14.000Z","updated_at":"2025-01-30T10:34:17.000Z","dependencies_parsed_at":"2024-11-05T23:32:11.916Z","dependency_job_id":"384ae28d-019f-45cd-81bf-494dbfffb64d","html_url":"https://github.com/Unstructured-IO/pipeline-sec-filings","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-sec-filings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-sec-filings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-sec-filings/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-sec-filings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Unstructured-IO","download_url":"https://codeload.github.com/Unstructured-IO/pipeline-sec-filings/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247522576,"owners_count":20952580,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T23:29:54.585Z","updated_at":"2025-04-06T17:32:12.015Z","avatar_url":"https://github.com/Unstructured-IO.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"﻿\u003ch3 align=\"center\"\u003e\n  \u003cimg src=\"img/unstructured_logo.png\" height=\"200\"\u003e\n\u003c/h3\u003e\n\n\u003ch3 align=\"center\"\u003e\n  \u003cp\u003ePre-Processing Pipeline for SEC Filings\u003c/p\u003e\n\u003c/h3\u003e\n\n\nThis repo implements a document pre-processing pipeline for SEC filings. Currently, the pipeline is capable of extracting narrative text from user-specified sections in 10-K, 10-Q, and S-1 filings.\n\n## Developer Quick Start\n\n* Using `pyenv` to manage virtualenv's is recommended\n\t* Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.\n\t\t* `brew install pyenv-virtualenv`\n\t  * `pyenv install 3.8.15`\n  * Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).\n\n* Create a virtualenv to work in and activate it, e.g. 
for one named `sec-filings`:\n\n\t`pyenv virtualenv 3.8.15 sec-filings` \u003cbr /\u003e\n\t`pyenv activate sec-filings`\n\n* Run `make install`\n* Start a local Jupyter notebook server with `make run-jupyter` \u003cbr /\u003e\n\t**OR** \u003cbr /\u003e\n\tjust start the FastAPI app locally with `make run-web-app`\n\n## Quick Tour\n\nYou can run this [Colab notebook](https://colab.research.google.com/drive/1W9jCOGbIrE43f7fHMUSn3g3xXhOIjx_v) to see how [pipeline-section.ipynb](/pipeline-notebooks/pipeline-section.ipynb) extracts the narrative text sections from an SEC filing and defines an API.\n\n## Extracting Narrative Text from an SEC Filing\n\nTo retrieve narrative text section(s) from an iXBRL S-1, 10-K, or 10-Q document (or an amended version: S-1/A, 10-K/A, or 10-Q/A), post the document to the `/section` API. You can try this out by downloading the sample documents using `make dl-test-artifacts`. Then, from\nthe `sample-docs` folder, run:\n\n```\ncurl -X 'POST' \\\n  'http://localhost:8000/sec-filings/v0.2.1/section' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'text_files=@rgld-10-K-85535-000155837021011343.xbrl' \\\n  -F section=RISK_FACTORS | jq -C . | less -R\n```\n\nNote that additional `-F section` parameters may be included in the curl request to fetch\nmultiple sections at once. Valid sections for [10-Ks](https://www.sec.gov/files/form10-k.pdf),\n[10-Qs](https://www.sec.gov/files/form10-q.pdf), and [S-1s](https://www.sec.gov/files/forms-1.pdf)\nare available on the SEC website. You can also reference\n[this file](https://github.com/Unstructured-IO/pipeline-sec-filings/blob/main/prepline_sec_filings/sections.py)\nfor a list of valid `section` parameters, e.g. `RISK_FACTORS` or `MANAGEMENT_DISCUSSION`.\n\n\nYou'll get back a response that looks like the following. 
Piping through `jq` and `less`\nformats and colorizes the output and lets you scroll through the results.\n\n```\n{\n  \"RISK_FACTORS\": [\n    {\n      \"text\": \"You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD\u0026A.\",\n      \"type\": \"NarrativeText\"\n    },\n    {\n      \"text\": \"Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.\",\n      \"type\": \"NarrativeText\"\n    },\n    {\n      \"text\": \"Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, forward sales by metal producers, and political, trade, economic, or banking conditions.\",\n      \"type\": \"NarrativeText\"\n    },\n    ...\n  ]\n}\n```\n\n\nYou can also pass in custom section regex patterns using the `section_regex` parameter. For\nexample, you can run the following command to request the risk factors section:\n\n```\ncurl -X 'POST' \\\n  'http://localhost:8000/sec-filings/v0.2.1/section' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'text_files=@rgld-10-K-85535-000155837021011343.xbrl' \\\n  -F 'section_regex=risk factors'  | jq -C . | less -R\n```\n\nThe result will be:\n\n```\n{\n  \"REGEX_0\": [\n    {\n      \"text\": \"You should carefully consider the risks described in this section. 
Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD\u0026A.\",\n      \"type\": \"NarrativeText\"\n    },\n    {\n      \"text\": \"Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.\",\n      \"type\": \"NarrativeText\"\n    },\n    {\n      \"text\": \"Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, forward sales by metal producers, and political, trade, economic, or banking conditions.\",\n      \"type\": \"NarrativeText\"\n    },\n    ...\n  ]\n}\n```\n\nAs with the `section` parameter, you can request multiple regexes by passing in multiple values\nfor the `section_regex` parameter. 
The requested pattern will be treated as a raw string.\n\nYou can also use special regex characters in your pattern, as shown in the example below:\n\n```\n curl -X 'POST' \\\n  'http://localhost:8000/sec-filings/v0.2.1/section' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'text_files=@rgld-10-K-85535-000155837021011343.xbrl' \\\n  -F \"section_regex=^(\\S+\\W?)+$\"\n```\n\nYou can always replace the header `-H 'accept: application/json'` with `-H 'accept: text/csv'` depending on the format you want to fetch from the API as follows:\n\n```\n curl -X 'POST' \\\n  'http://localhost:8000/sec-filings/v0.2.1/section' \\\n  -H 'accept: text/csv' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'text_files=@rgld-10-K-85535-000155837021011343.xbrl' \\\n  -F section=RISK_FACTORS | jq -C . | less -R\n```\nThe result will be:\n```\n\"section,element_type,text\\r\\nRISK_FACTORS,NarrativeText,\\\"You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD\u0026A.\\\"\\r\\nRISK_FACTORS,NarrativeText,\\\"Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.\\\"\\r\\nRISK_FACTORS,NarrativeText,\\\"Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. 
These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, forward sales by metal producers, and political, trade, economic, or banking conditions.\\\"\\r\\n\n```\n\nIn addition, you can add the form `-F 'output_schema=labelstudio'` if you want an output to be compatible with [labelstudio](https://labelstud.io) as follows:\n\n```\n curl -X 'POST' \\\n  'http://localhost:8000/sec-filings/v0.2.1/section' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'text_files=@rgld-10-K-85535-000155837021011343.xbrl' \\\n  -F 'output_schema=labelstudio' \\\n  -F section=RISK_FACTORS | jq -C . | less -R\n\n```\nThe result will be:\n```\n{\n  \"RISK_FACTORS\": [\n    {\n      \"data\": {\n        \"text\": \"You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. 
In addition, please see our note about forward-looking statements included in the MD\u0026A.\",\n        \"ref_id\": \"7a912bb639b547404be4ceaf5d9083a9\"\n      }\n    },\n    {\n      \"data\": {\n        \"text\": \"Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.\",\n        \"ref_id\": \"d4cc8e0e0c2b68ef69282c5250b721c9\"\n      }\n    },\n    ...\n    ]\n}\n```\n\n### Helper functions for SEC EDGAR API\n\nYou can use some of the functions provided in `prepline_sec_filings.fetch` to directly view or manipulate the filings available from the SEC's [EDGAR API](https://www.sec.gov/edgar/searchedgar/companysearch.html).\nFor example, `get_filing(cik, accession_number, your_organization_name, your_email)` will return the text of the filing with accession number `accession_number` for the organization with CIK number `cik`.\nThe parameters `your_organization_name` and `your_email` should contain your own information; they are passed along to the EDGAR API to identify the caller and are required by EDGAR.\nAlternatively, the parameters may be omitted if the environment variables `SEC_API_ORGANIZATION` and `SEC_API_EMAIL` are defined.\n\n\nHelper functions are also provided for cases where the CIK and/or accession numbers are not known. For example,\n`get_form_by_ticker('mmm', '10-K', your_organization_name, your_email)` returns the text of the latest 10-K filing from 3M,\nand `open_form_by_ticker('mmm', '10-K', your_organization_name, your_email)` opens the SEC index page for the same filing in a web browser.\n\n### Generating Python files from the pipeline notebooks\n\nThe Python module [section.py](/prepline_sec_filings/api/section.py) contains the FastAPI code needed to serve the API. 
It's created with `make generate-api`, which derives the API from the notebook [pipeline-section.ipynb](/pipeline-notebooks/pipeline-section.ipynb).\n\nYou can generate the FastAPI APIs from all [pipeline-notebooks/](/pipeline-notebooks) by running `make generate-api`.\n\n## Docker\n\nIt is not necessary to run Docker in a local development environment; however, a Dockerfile and\nmake targets of `docker-build`, `docker-start-api`, and `docker-start-jupyter` are provided for convenience.\n\nYou can also launch a Jupyter instance to try out the notebooks with [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Unstructured-IO/pipeline-sec-filings/HEAD).\n\n## Security Policy\n\nSee our [security policy](https://github.com/Unstructured-IO/pipeline-sec-filings/security/policy) for\ninformation on how to report security vulnerabilities.\n\n## Learn more\n\n| Section | Description |\n|-|-|\n| [Company Website](https://unstructured.io) | Unstructured.io product and company info |\n| [EDGAR API](https://www.sec.gov/edgar/searchedgar/companysearch.html) | Documentation for the SEC |\n| [10-K Filings](https://www.sec.gov/files/form10-k.pdf) | Detailed documentation on 10-K filings |\n| [10-Q Filings](https://www.sec.gov/files/form10-q.pdf) | Detailed documentation on 10-Q filings |\n| [S-1 Filings](https://www.sec.gov/files/forms-1.pdf) | Detailed documentation on S-1 filings |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funstructured-io%2Fpipeline-sec-filings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funstructured-io%2Fpipeline-sec-filings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funstructured-io%2Fpipeline-sec-filings/lists"}