{"id":50289236,"url":"https://github.com/maxzz/singlefile-extractor","last_synced_at":"2026-05-28T04:33:30.398Z","repository":{"id":357512600,"uuid":"1237289138","full_name":"maxzz/singlefile-extractor","owner":"maxzz","description":"Extract an element + inline styles from SingleFile-saved HTML into a standalone HTML.","archived":false,"fork":false,"pushed_at":"2026-05-13T03:57:02.000Z","size":1618,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-13T05:37:25.440Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maxzz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T03:39:01.000Z","updated_at":"2026-05-13T03:57:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/maxzz/singlefile-extractor","commit_stats":null,"previous_names":["maxzz/singlefile-extractor"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/maxzz/singlefile-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxzz%2Fsinglefile-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxzz%2Fsinglefile-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxzz%2Fsinglefile-extractor/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/maxzz%2Fsinglefile-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maxzz","download_url":"https://codeload.github.com/maxzz/singlefile-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxzz%2Fsinglefile-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33594851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-28T04:33:29.584Z","updated_at":"2026-05-28T04:33:30.374Z","avatar_url":"https://github.com/maxzz.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# singlefile-extractor utilities\n\nSmall, standard-library-only Python scripts for extracting and post-processing content from **SingleFile-saved HTML** (often nested via `iframe[srcdoc]`).\n\nAll scripts live under `scripts/`.\n\n## Table of contents\n- [`singlefile_extractor.py`](#singlefile_extractorpy)\n- [`moveout-css.py`](#moveout-csspy)\n- [`format-html.py`](#format-htmlpy)\n- [`format-css.py`](#format-csspy)\n- [`extract-data-urls.py`](#extract-data-urlspy)\n\n## `singlefile_extractor.py`\n\n### What it does\nExtracts one `\u003cform\u003e` element (by id) from a 
**SingleFile-saved HTML** and writes it into a **standalone HTML file** that preserves the form’s **visual styling**.\n\nSpecifically it:\n- Walks through nested `iframe[srcdoc]` documents (SingleFile embeds pages this way).\n- Finds candidate embedded documents that contain `\u003cform id=\"...\"\u003e`.\n- Extracts from the chosen document:\n  - the opening `\u003cbody ...\u003e` tag (to keep theme/body classes)\n  - all inline `\u003cstyle\u003e...\u003c/style\u003e` blocks\n  - the full `\u003cform ...\u003e...\u003c/form\u003e` for the requested id\n- Writes a new HTML file containing only those pieces.\n\n### How to run (Windows / PowerShell)\nFrom this repo folder:\n\n```powershell\npython .\\scripts\\singlefile_extractor.py\n```\n\nBy default, it reads `tests/Opcenter Execution (4_28_2026 3：06：53 PM).html` and writes `tests/esignature-form.html`.\n\nYou can also run via npm:\n\n```powershell\nnpm run extract\nnpm run extract:help\n```\n\n### Options\n- `-i, --input`: Path to the SingleFile-saved HTML file.\n- `-o, --output`: Where to write the extracted standalone HTML.\n- `--form-id`: The id of the `\u003cform\u003e` element to extract (default: `aspnetForm`).\n- `--contains`: Optional substring filter to disambiguate when multiple matches exist (example: `ESigCaptureVP.aspx`).\n- `--max-depth`: Max depth to recurse through nested `iframe[srcdoc]` (default: `10`).\n\nTo see the full CLI help:\n\n```powershell\npython .\\scripts\\singlefile_extractor.py --help\n```\n\n### Examples\n\n```powershell\npython .\\scripts\\singlefile_extractor.py --input \"Another SingleFile Page.html\" --output \"out.html\"\npython .\\scripts\\singlefile_extractor.py --input \"Some Page.html\" --output \"some-form.html\" --form-id \"myFormId\"\npython .\\scripts\\singlefile_extractor.py --input \"Some Page.html\" --output \"out.html\" --form-id \"aspnetForm\" --contains \"ESigCaptureVP.aspx\"\n```\n\nBatch example (run on all `.html` files in a 
folder):\n\n```powershell\nGet-ChildItem -Filter *.html | ForEach-Object {\n  $out = Join-Path $_.DirectoryName ($_.BaseName + \"-extracted.html\")\n  python .\\scripts\\singlefile_extractor.py --input $_.FullName --output $out --form-id \"aspnetForm\"\n}\n```\n\n### Notes / limitations\n- It does **not** guarantee the extracted form is fully functional (some pages rely on external scripts/services).\n- It does **not** download external resources; it only keeps what is already embedded in the SingleFile HTML.\n\n## `moveout-css.py`\n\n### What it does\nMoves all inline `\u003cstyle\u003e...\u003c/style\u003e` blocks from an HTML file into a separate `.css` file, removes the `\u003cstyle\u003e` blocks from the HTML, and inserts a `\u003clink rel=\"stylesheet\" href=\"...\"\u003e` back into the HTML `\u003chead\u003e`.\n\n### How to run (Windows / PowerShell)\nSafe (write to new files):\n\n```powershell\npython .\\scripts\\moveout-css.py --input \"tests\\esignature-form.html\" --output \"tests-local\\esignature-form.external-css.html\" --css-output \"tests-local\\esignature-form.external-css.css\"\n```\n\nIn-place (overwrites `--input`):\n\n```powershell\npython .\\scripts\\moveout-css.py --input \"tests\\esignature-form.html\"\n```\n\n### Options\n- `-i, --input`: Path to the HTML file to process.\n- `-o, --output`: Where to write the updated HTML (default: overwrite `--input`).\n- `--css-output`: Where to write extracted CSS (default: `\u003coutput\u003e.css`).\n- `--href`: Optional `href` to use in the inserted `\u003clink\u003e` (default: relative path to `--css-output`).\n\nFull CLI help:\n\n```powershell\npython .\\scripts\\moveout-css.py --help\n```\n\n## `format-html.py`\n\n### What it does\nBest-effort HTML formatter (pretty-printer). 
It tokenizes the HTML and writes it back with newlines + indentation.\n\nBy default it also runs a **CSS pipeline**:\n- extracts inline `\u003cstyle\u003e...\u003c/style\u003e` blocks into a separate CSS file (and inserts a `\u003clink rel=\"stylesheet\" ...\u003e` into the formatted HTML)\n- runs `extract-data-urls.py` on that CSS so `url(data:...)` values are moved into a vars file and referenced via `var(--...)`\n- the resulting CSS is linked from the formatted HTML (and the CSS imports the vars file)\n\n### How to run (Windows / PowerShell)\nIf `--output` is omitted, it writes `\u003cinput_stem\u003e_formatted.html` next to the input file.\n\n```powershell\npython .\\scripts\\format-html.py --input \"tests\\esignature-form.html\"\n```\n\nExample with explicit output + indent:\n\n```powershell\npython .\\scripts\\format-html.py --input \"tests\\esignature-form.html\" --output \"tests-local\\out_formatted.html\" --indent 2\n```\n\n### Options\n- `-i, --input`: Path to the HTML file to format.\n- `-o, --output`: Where to write the formatted HTML (default: `\u003cinput\u003e_formatted.html`).\n- `--indent`: Spaces per indent level (default: `2`).\n- `--no-css-pipeline`: Disable the CSS pipeline (format HTML only).\n- `--css-output`: Where to write extracted CSS when `\u003cstyle\u003e` blocks exist (default: `\u003coutput_stem\u003e.css`).\n- `--css-href`: Override the href used in the inserted `\u003clink\u003e` tag.\n- `--data-urls-min-var-url-length`: Threshold for moving existing `:root` vars (default: `500`).\n\nFull CLI help:\n\n```powershell\npython .\\scripts\\format-html.py --help\n```\n\n### Notes / limitations\n- This formatter is **not a lossless HTML parser**; it may normalize whitespace in text nodes.\n- It’s intended for making “tag soup” HTML easier to read, not for producing strictly-valid HTML.\n\n## `format-css.py`\n\n### What it does\nBest-effort CSS formatter (pretty-printer). 
It inserts newlines + indentation around `{`, `}`, and declaration `;` while respecting strings/comments and avoiding breaking tokens inside parentheses (like `url(...)`).\n\nBy default it also runs **Data URL extraction** (same logic as `extract-data-urls.py`):\n- finds `url(data:...)` values\n- moves them into a separate `:root { --... }` vars file\n- rewrites the formatted CSS to reference them via `var(--...)` and adds an `@import` for the vars file\n\n### How to run (Windows / PowerShell)\nIf `--output` is omitted, it writes `\u003cinput_stem\u003e_formatted.css` next to the input file.\n\n```powershell\npython .\\scripts\\format-css.py --input \"tests-local\\esig.smoke.css\"\n```\n\nExample with explicit output + indent:\n\n```powershell\npython .\\scripts\\format-css.py --input \"tests-local\\esig.smoke.css\" --output \"tests-local\\esig.smoke_formatted.css\" --indent 2\n```\n\n### Options\n- `-i, --input`: Path to the CSS file to format.\n- `-o, --output`: Where to write the formatted CSS (default: `\u003cinput\u003e_formatted.css`).\n- `--indent`: Spaces per indent level (default: `2`).\n- `--no-extract-data-urls`: Disable Data URL extraction (formatting only).\n- `--data-urls-vars-output`: Where to write extracted vars CSS (default: `\u003coutput_stem\u003e_dataurls-vars.css`).\n- `--data-urls-min-var-url-length`: Threshold for moving existing `:root` vars (default: `500`).\n\nFull CLI help:\n\n```powershell\npython .\\scripts\\format-css.py --help\n```\n\n### Notes / limitations\n- This formatter is **not a full CSS parser**; it may normalize whitespace and is intended for readability.\n\n## `extract-data-urls.py`\n\n### What it does\nScans a CSS file for `url(data:...)` usages, **moves those data URLs into a separate CSS file as custom properties**, and rewrites the main CSS to reference them via `var(--...)`.\n\nIt can also move existing `:root` custom properties (like `--sf-img-*`) into the vars file when their `data:` URL exceeds a configurable 
length threshold.\n\n### How to run (Windows / PowerShell)\n\n```powershell\npython .\\scripts\\extract-data-urls.py --input \"tests-local\\esig.smoke_formatted.css\" --output \"tests-local\\esig.smoke_no-dataurls.css\" --vars-output \"tests-local\\esig.smoke_dataurls-vars.css\"\n```\n\nBy default, it also inserts an `@import` at the top of the rewritten CSS so the vars file is loaded automatically.\n\nYou can also run via npm:\n\n```powershell\nnpm run extract:data-urls\nnpm run extract:data-urls:help\n```\n\n### Options\n- `-i, --input`: Path to the CSS file to process.\n- `-o, --output`: Where to write the rewritten CSS (default: `\u003cinput\u003e_dataurls_extracted.css`).\n- `--vars-output`: Where to write extracted CSS custom properties (default: `\u003coutput\u003e_vars.css`).\n- `--min-var-url-length`: Only move existing `:root` custom properties into the vars file if the `data:` URL length is \u003e= this value (default: `500`).\n- `--var-prefix`: Prefix used for generated custom properties (default: `data-url` → names like `--data-url-...`).\n- `--no-import`: Do not insert an `@import` into the rewritten CSS.\n- `--import-href`: Override the href used in the inserted `@import`.\n\nFull CLI help:\n\n```powershell\npython .\\scripts\\extract-data-urls.py --help\n```\n\n### Notes / limitations\n- Best-effort parsing (like the other formatters). Works well for typical “minified + embedded assets” CSS, but it’s not a full CSS AST parser.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxzz%2Fsinglefile-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxzz%2Fsinglefile-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxzz%2Fsinglefile-extractor/lists"}