{"id":38767841,"url":"https://github.com/blues/discourse-algolia-etl","last_synced_at":"2026-01-17T12:01:49.941Z","repository":{"id":253447895,"uuid":"666402688","full_name":"blues/discourse-algolia-etl","owner":"blues","description":"Extract posts from a Discourse forum and load them into an Algolia search index.","archived":false,"fork":false,"pushed_at":"2024-08-16T18:27:46.000Z","size":25,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-26T17:50:40.356Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blues.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-14T12:31:04.000Z","updated_at":"2024-08-16T18:27:50.000Z","dependencies_parsed_at":"2024-08-16T19:47:21.477Z","dependency_job_id":"5ba61340-75bb-426b-9eae-8a4fc4d256de","html_url":"https://github.com/blues/discourse-algolia-etl","commit_stats":null,"previous_names":["blues/discourse-algolia-etl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/blues/discourse-algolia-etl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blues%2Fdiscourse-algolia-etl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blues%2Fdiscourse-algolia-etl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blues%2Fdiscourse-algolia-etl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/blues%2Fdiscourse-algolia-etl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blues","download_url":"https://codeload.github.com/blues/discourse-algolia-etl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blues%2Fdiscourse-algolia-etl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28508464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T11:50:55.898Z","status":"ssl_error","status_checked_at":"2026-01-17T11:50:55.569Z","response_time":85,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-17T12:01:49.515Z","updated_at":"2026-01-17T12:01:49.768Z","avatar_url":"https://github.com/blues.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Discourse =\u003e Algolia ETL (Extract Transform Load)\n\nThis repo contains tools to extract content from a Discourse forum and load\nit into Algolia in a shape that is compatible with Algolia DocSearch.\n\nTo satisfy DocSearch, the objects created in Algolia are of two types:\n\n1. `content` - Text from a paragraph in a post. Hierarchical information is\n   included in the object lvl0, lvl1, lvl2, lvl3. (see below)\n2. 
`lvl3` - Contextual objects for headers (h2, h3, etc) within the content.\n\nThese types are based on the types created by the open source\n[docsearch-scraper](https://github.com/algolia/docsearch-scraper) which is a\nvery useful tool for scraping static sites.\n\n## Usage\n\n```bash\n./setup # install dependencies\n./main-etl all\n```\n\n## Runtime Environment\n\n### Python\n\nThis tool was developed with python 3.9 or later. It may work with earlier\nversions of python 3 but it has not been tested.\n\n## Configuration\n\nThe configuration is done via environment variables. The following variables\nmust be set:\n\n### Required Config\n\n```bash\n# The Discourse API needs read access to the forum you're trying to index.\nexport DISCOURSE_API_KEY=...\nexport DISCOURSE_URL=...\nexport DISCOURSE_USERNAME=...\n# The Algolia API needs write access to the index you're trying to update.\nexport ALGOLIA_API_KEY=...\nexport ALGOLIA_APP_ID=...\nexport ALGOLIA_INDEX_NAME=...\n```\n\n### Optional Config\n\n#### Hierarchy Levels\n\n```bash\nexport ALGOLIA_LVL0=... # (default: Forum)\n```\n\nThe ALGOLIA_LVL0 is the top level name for when results show up in DocSearch.\nFor example, if you set ALGOLIA_LVL0 to \"Forum\", then all results will show up\nunder the \"Forum\" category.\n\n    ```text\n    Forum \u003e {Category Name} \u003e {Topic Name} \u003e {Section Name, h1, h2, etc.}\n    e.g.\n    Forum \u003e Hardware \u003e What antenna should I use? \u003e Cellular\n    ```\n\n#### Algolia Tags\n\n```bash\nexport ALGOLIA_TAG=...  # (default: community)\n```\n\nThe ALGOLIA_TAG is a tag that will be added to all objects in Algolia. This is\nuseful in the DocSearch UI for filtering or tagging results as being from a\ncertain source.\n\n#### Not Configurable\n\n```\nanswered\n```\n\nAll posts in a Discourse marked 'Answered' will _also_ be tagged \"answered\", in\naddition to the ALGOLIA_TAG. Posts in unanswered topics will not get an extra\ntag. 
This is not yet configurable but a developer could follow the lead of the\nALGOLIA_TAG and add a new environment variable to control this.\n\n## Esoteric details\n\nAlgolia limits objects to 10kb, so if we find a large paragraph, we split it\nin half repeatedly until it is small enough to fit. This is done in the\ntransform step.\n\n## Advanced Usage\n\nTo do a subset of the steps, use one of:\n\n```bash\n./main-etl extract\n./main-etl transform\n./main-etl load\n./main-etl extract transform\n./main-etl transform load\n```\n\n## Debugging\n\n### Extract\n\nThe Extract step creates a file called [`discourse.json`](discourse.json). This\nfile contains the raw json from the Discourse API.\n\n### Transform\n\nThe Transform step creates a file called [`algolia.json`](algolia.json). This\nfile contains the json that will be sent to Algolia.\n\n## Development\n\n### Python\n\nIf you use the vscode devcontainer, you'll get a python environment with the\ncorrect version of python. Otherwise, you'll want to install python 3.9+.\n\n### Setup\n\n```bash\n./setup # install dependencies\n```\n\n### Testing\n\nThe tests were written with the python `unittest` framework. 
The easiest way to\nrun them is from the command line.\n\n```bash\n./setup # install dependencies\npython3 -m unittest discover\n```\n\nIt's also possible to use the VSCode Test Explorer to run the tests.\n\n#### Tip\n\nDebug the tests from top to bottom in the\n[`tests/test_transform.py`](tests/test_transform.py)\n\n## Submodules\n\n### Extract\n\nThe [src/extract_discourse.py](src/extract_discourse.py) file contains the\nextraction logic.\n\n```plaintext\n$ src/extract_discourse.py --help\n Extract posts and categories from Discourse to stdout.\n\nUsage:\n    discourse-extract\n\nEnvironment Variables:\n    DISCOURSE_API_KEY   The API key to use for the Discourse API.\n    DISCOURSE_URL       The URL of the Discourse instance.\n    DISCOURSE_USERNAME  The username to use for the Discourse API.\n```\n\n### Transform\n\nThe\n[src/transform_discourse_to_algolia.py](src/transform_discourse_to_algolia.py)\nfile contains the transformation logic.\n\n```plaintext\n$ src/transform_discourse_to_algolia.py --help\n Transform posts from discourse to algolia-style. Input is expected to be json on\nstdin and output is json on stdout. Allow multiple tags to be specified.\n\nUsage:\n    transform-discourse-to-algolia --discourse-url=\u003cdiscourse-url\u003e --lvl0=\u003clvl0\u003e  --tag=\u003ctag\u003e...\n\nOptions:\n    --discourse-url=\u003cdiscourse-url\u003e  The base url of the discourse forum.\n    --lvl0=\u003clvl0\u003e                    The top level category name to nest all search results under. [default: Forum]\n    --tag=\u003ctag\u003e                      The tags to add to all algolia objects. 
[default: community]\n```\n\n### Load\n\nThe [src/load_algolia.py](src/load_algolia.py) file contains the loading logic.\n\n```plaintext\n$ src/load_algolia.py --help\nLoad objects into Algolia from a file via the Algolia API.\n\nUsage:\n    load-algolia \u003calgolia-json-file\u003e \u003calgolia-index-name\u003e\n\nEnvironment Variables:\n    ALGOLIA_APP_ID\n    ALGOLIA_API_KEY\n```\n\n\u003e Credits\n\u003e\n\u003e Hats off to github copilot for translating my thoughts into python.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblues%2Fdiscourse-algolia-etl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblues%2Fdiscourse-algolia-etl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblues%2Fdiscourse-algolia-etl/lists"}