{"id":16177589,"url":"https://github.com/raphaelm/confluence-scraper","last_synced_at":"2025-04-01T19:31:36.555Z","repository":{"id":136535726,"uuid":"481734585","full_name":"raphaelm/confluence-scraper","owner":"raphaelm","description":"Download all content from Confluence through the API","archived":false,"fork":false,"pushed_at":"2022-04-14T20:07:49.000Z","size":21,"stargazers_count":7,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-15T19:21:58.648Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raphaelm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-14T20:04:38.000Z","updated_at":"2025-03-11T22:20:35.000Z","dependencies_parsed_at":null,"dependency_job_id":"eb69f6e3-732a-41bc-b129-082fccaca49b","html_url":"https://github.com/raphaelm/confluence-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelm%2Fconfluence-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelm%2Fconfluence-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelm%2Fconfluence-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelm%2Fconfluence-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raphaelm","downlo
ad_url":"https://codeload.github.com/raphaelm/confluence-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246651527,"owners_count":20811993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T05:09:37.842Z","updated_at":"2025-04-01T19:31:31.545Z","avatar_url":"https://github.com/raphaelm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Confluence Scraper\n==================\n\nDownloads all pages and attachments from a Confluence Cloud instance for backup purposes.\n\n**I've created this module for internal usage at my company. I'll probably not have the time to maintain it beyond that.**\n\nSetup\n-----\n\n* Create a new OAuth 2.0 app in the [Atlassian Developer Console](https://developer.atlassian.com/console/myapps/)\n\n* Under permissions, request these scopes (not sure if all are really necessary): ``read:template:confluence read:space:confluence read:space-details:confluence read:relation:confluence read:custom-content:confluence read:content.metadata:confluence read:content:confluence read:content-details:confluence read:comment:confluence read:attachment:confluence read:content.property:confluence read:page:confluence read:label:confluence``.\n  Set a redirect URL that does not really need to exist.\n\n* Run ``pip install -Ur requirements.txt``\n\n* Create a new file `conf.py` with content like this:\n\n```\n# From Atlassian Developer console\nCLIENT_ID = \"…\"\nCLIENT_SECRET = \"…\"\n\n# You can use an URL that does not exist! 
We don't run a webserver, we'll just manually copy\n# the callback data to the terminal\nCALLBACK_URL = \"https://confluence-scraper.rami.io\"\n\n# Folder on disk where we store the downloaded content\nDATA_FOLDER = \"data\"\n\n# Exclude download of very large attachments:\nMAX_ATTACHMENT_SIZE = 1024 * 1024 * 1024  # 1 GB\n```\n\n* Run ``python main.py auth``, click the link in the output, and paste the redirect URL from your browser after the authentication is done.\n  This is probably required every 365 days, or more often if the script is not run regularly.\n\n* Run ``python main.py download`` to start the download.\n\nFeatures\n--------\n\n* Downloads all pages in HTML format\n\n* Downloads all attachments\n\n* Correctly fixes relative links between pages and attachments\n\n* Correctly fixes emoji\n\nKnown issues \u0026 limitations\n--------------------------\n\n* The table of contents will be entirely out of order, since the Confluence API does not expose\n  the order of pages.\n  \n* Macros are not rendered, but their content is in some cases. For example, the \"Info\" macro looks fine,\n  while the \"draw.io\" macro does not render anything. However, draw.io diagrams are preserved through\n  a list of attachments.\n  \n* Thumbnails are not preserved and are instead replaced with their original file. This works okayish for\n  images, but not for PDFs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelm%2Fconfluence-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraphaelm%2Fconfluence-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelm%2Fconfluence-scraper/lists"}