{"id":13446828,"url":"https://github.com/go-xmlfmt/htmlextract","last_synced_at":"2026-01-11T23:50:51.437Z","repository":{"id":57611462,"uuid":"115941647","full_name":"go-xmlfmt/htmlextract","owner":"go-xmlfmt","description":"HTML Extraction Tool","archived":false,"fork":false,"pushed_at":"2018-02-17T19:07:59.000Z","size":173,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-28T11:43:37.530Z","etag":null,"topics":["command-line-tool","commandline","extract","go","golang","html","outline","scrape","structure"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/go-xmlfmt.png","metadata":{"files":{"readme":"README.e.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-01-01T19:24:31.000Z","updated_at":"2019-09-25T02:46:36.000Z","dependencies_parsed_at":"2022-09-16T00:51:08.935Z","dependency_job_id":null,"html_url":"https://github.com/go-xmlfmt/htmlextract","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/go-xmlfmt%2Fhtmlextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/go-xmlfmt%2Fhtmlextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/go-xmlfmt%2Fhtmlextract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/go-xmlfmt%2Fhtmlextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/go-xmlfmt","download_url":"https://codeload.github.com/go-xmlfmt/htmlextract/tar.gz/refs/heads/master",
"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244829628,"owners_count":20517347,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line-tool","commandline","extract","go","golang","html","outline","scrape","structure"],"created_at":"2024-07-31T05:01:00.916Z","updated_at":"2026-01-11T23:50:51.431Z","avatar_url":"https://github.com/go-xmlfmt.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"\n# {{.Name}}\n\n{{render \"license/shields\" . \"License\" \"MIT\"}}\n{{template \"badge/godoc\" .}}\n{{template \"badge/goreport\" .}}\n{{template \"badge/travis\" .}}\n[![PoweredBy WireFrame](https://github.com/go-easygen/wireframe/blob/master/PoweredBy-WireFrame-R.svg)](http://godoc.org/github.com/go-easygen/wireframe)\n\n## {{toc 5}}\n\n## {{.Name}} - HTML Extraction Tool\n\nThe `htmlextract` makes it easy to look at the HTML files from different aspects. \n\n- **`htmlextract outline`** will extract HTML structure as outline so as to focus more easily on the structure, not the details.\n- **`htmlextract clean`** will clean up HTML tags \u0026 attributes as much as possible, so as to go back to the plain text version as easy as possible. 
\n- **`htmlextract h2md`** will convert HTML to a .md file on top of the above cleanup.\n\n# Usage\n\n### $ {{exec \"htmlextract\" | color \"sh\"}}\n\n### $ {{shell \"htmlextract outline\" | color \"sh\"}}\n\n### $ {{shell \"htmlextract clean\" | color \"sh\"}}\n\n### $ {{shell \"htmlextract h2md\" | color \"sh\"}}\n\n\n# Examples\n\n## Outline\n\n### $ {{shell \"htmlextract outline -i test/sample0.html -o\" | color \"json\"}}\n\n### Advantages\n\n- By extracting the HTML structure as an outline, `htmlextract outline` makes it easier to analyze the file structure by getting all the gory details out of the way, which is most often needed when doing web scraping or developing WebDriver code.\n- The JSON output format was chosen deliberately, so that you can take advantage of the dynamic folding feature that text editors provide. You can also use [jsonformatter.org](https://jsonformatter.org/) online, even without a text editor.\n\nHere is a screenshot of viewing the result of `htmlextract outline -i test/sample0.html`:\n\n![sample.png](sample.png \"Sample screenshot\")\n\n### Usage\n\n#### Specifying more attributes\n\nIf the predefined attribute selection is not enough, it is easy to add your own with the `-a, --attributes` switch. 
Note that you can use the switch as many times as you wish, to provide as many attributes as you need:\n\n```sh\n$ htmlextract outline -a dojotype -a style -i test/sample0.html -o | grep -1 dojotype | head -3 \n  \"div\": {\n    \"=\": \"id=pluginList dojotype=PluginTable style=float:right; \",\n    \"_\": {}},\n```\n\n#### Work with URL directly\n\nStarting with version `0.2.0`, `htmlextract` can extract from a URL directly:\n\n```sh\n$ htmlextract outline -i http://demoaut.katalon.com/profile.php -o | head -35\n...\n\"body\": {\n  \"=\": \"\",\n  \"_\": {\n  \"a\": {\n    \"=\": \"id=menu-toggle css=.btn.btn-dark.btn-lg.toggle \",\n    \"_\": {\n    \"i\": {\n      \"=\": \"css=.fa.fa-bars \",\n      \"_\": {}},\n}},\n  \"nav\": {\n    \"=\": \"id=sidebar-wrapper \",\n    \"_\": {\n    \"ul\": {\n      \"=\": \"css=.sidebar-nav \",\n      \"_\": {\n      \"a\": {\n        \"=\": \"id=menu-close css=.btn.btn-light.btn-lg.pull-right.toggle \",\n        \"_\": {\n        \"i\": {\n          \"=\": \"css=.fa.fa-times \",\n          \"_\": {}},\n...\n```\n\n\n# Download binaries\n\n- The latest binary executables are available under  \nhttps://bintray.com/antoniosun/bin/{{.Name}}/latest, or directly under  \nhttps://bintray.com/version/files/antoniosun/bin/{{.Name}}/latest  \nas the result of the Continuous-Integration process.\n- I.e., they are built during every git push, automatically by [travis-ci](https://travis-ci.org/), right from the source code, truly WYSIWYG.\n- Pick \u0026 choose the binary executable that suits your OS and its architecture. E.g., for Linux, it would most probably be the `{{.Name}}-linux-amd64` file. If your OS and architecture are not available in the download list, please let me know and I'll add them.\n- You may want to rename it to a shorter name instead, e.g., `{{.Name}}`, after downloading it. 
\n\n\n# Debian package\n\nAvailable at https://bintray.com/antoniosun/deb/{{.Name}},  \nor directly at  https://dl.bintray.com/antoniosun/deb:\n\n```\necho \"deb [trusted=yes] https://dl.bintray.com/antoniosun/deb all main\" | sudo tee /etc/apt/sources.list.d/antoniosun-debs.list\nsudo apt-get update\n\nsudo chmod 644 /etc/apt/sources.list.d/antoniosun-debs.list\napt-cache policy {{.Name}}\n\nsudo apt-get install -y {{.Name}}\n```\n\n\n\n# Install Source\n\nTo install the source code instead:\n\n```\ngo get github.com/go-xmlfmt/htmlextract\n```\n\n\n## Author(s) \u0026 Contributor(s)\n\n- [Antonio SUN](https://github.com/AntonioSun)\n\n_Powered by_ [**WireFrame**](https://github.com/go-easygen/wireframe),  [![PoweredBy WireFrame](https://github.com/go-easygen/wireframe/blob/master/PoweredBy-WireFrame-Y.svg)](http://godoc.org/github.com/go-easygen/wireframe), the _one-stop wire-framing solution_ for Go cli based projects, from start to deploy.\n\nAll patches welcome. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgo-xmlfmt%2Fhtmlextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgo-xmlfmt%2Fhtmlextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgo-xmlfmt%2Fhtmlextract/lists"}