{"id":47705050,"url":"https://github.com/darhnoel/markql","last_synced_at":"2026-04-02T17:52:48.475Z","repository":{"id":331113075,"uuid":"1124113086","full_name":"darhnoel/markql","owner":"darhnoel","description":"A SQL-style query engine for HTML that makes extraction and filtering predictable, fast, and maintainable.","archived":false,"fork":false,"pushed_at":"2026-03-29T02:12:22.000Z","size":12051,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-29T05:15:30.586Z","etag":null,"topics":["cli","cplusplus","cpp20","csv","data-engineering","data-extraction","dom","etl","html","json","markql","parquet","query-language","repl","sql","sql-like","web-scraping"],"latest_commit_sha":null,"homepage":"https://darhnoel.github.io/markql/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/darhnoel.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-28T11:11:29.000Z","updated_at":"2026-03-29T02:12:26.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/darhnoel/markql","commit_stats":null,"previous_names":["darhnoel/xsql","darhnoel/markql"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/darhnoel/markql","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darhnoel%2Fmarkql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/darhnoel%2Fmarkql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darhnoel%2Fmarkql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darhnoel%2Fmarkql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/darhnoel","download_url":"https://codeload.github.com/darhnoel/markql/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darhnoel%2Fmarkql/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31312744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","cplusplus","cpp20","csv","data-engineering","data-extraction","dom","etl","html","json","markql","parquet","query-language","repl","sql","sql-like","web-scraping"],"created_at":"2026-04-02T17:52:47.826Z","updated_at":"2026-04-02T17:52:48.469Z","avatar_url":"https://github.com/darhnoel.png","language":"C++","readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"docs/assets/logo/markql_logo_dark.svg\"\u003e\n    \u003csource media=\"(prefers-color-scheme: 
light)\" srcset=\"docs/assets/logo/markql_logo_light.svg\"\u003e\n    \u003cimg src=\"docs/assets/logo/markql_logo_light.svg\" alt=\"MarkQL logo\" width=\"220\"\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eMarkQL\u003c/h1\u003e\n\u003cp align=\"center\"\u003eSQL-style query engine for HTML\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/darhnoel/markql/actions/workflows/python-wheels.yml\"\u003e\u003cimg src=\"https://github.com/darhnoel/markql/actions/workflows/python-wheels.yml/badge.svg\" alt=\"Build wheels\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/darhnoel/markql/actions/workflows/release-binaries.yml\"\u003e\u003cimg src=\"https://github.com/darhnoel/markql/actions/workflows/release-binaries.yml/badge.svg\" alt=\"Release Desktop\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/status-beta-orange\" alt=\"Status: Beta\"\u003e\n  \u003ca href=\"https://github.com/darhnoel/markql/tags\"\u003e\u003cimg src=\"https://img.shields.io/github/v/tag/darhnoel/markql\" alt=\"GitHub tag\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/pyxsql/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/pyxsql\" alt=\"PyPI version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/pyxsql/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/dm/pyxsql\" alt=\"PyPI downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache%202.0-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nMarkQL is a **SQL-style query engine for HTML** that lets you **select precisely what you need**, **filter to the relevant parts of a page**, and **extract structured fields** using the familiar `SELECT ... FROM ... 
WHERE ...` flow, rather than relying on brittle, ad-hoc scraping logic.\n\n## Demo\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/demo/quick_tutorial.gif\" alt=\"MarkQL quick tutorial GIF\" width=\"640\"\u003e\n\u003c/p\u003e\n\n## Quick Start\n\nPrerequisites:\n- CMake 3.16+\n- A C++20 compiler\n- Boost (multiprecision); set `-DMARKQL_ENABLE_KHMER_NUMBER=OFF` to skip Boost\n- Optional dependencies: `libxml2`, `curl`, `nlohmann_json`, `arrow/parquet`\n\nUbuntu/Debian/WSL (minimal packages):\n\n```bash\nsudo apt update\nsudo apt install -y \\\n  git ca-certificates pkg-config \\\n  build-essential cmake ninja-build \\\n  libboost-dev\n```\n\nOptional feature packages:\n\n```bash\nsudo apt install -y libxml2-dev libcurl4-openssl-dev nlohmann-json3-dev\n```\n\nArrow/Parquet packages (often missing on older distros):\n\n```bash\nsudo apt install -y libarrow-dev libparquet-dev\n```\n\nmacOS (Homebrew):\n\n```bash\nxcode-select --install\nbrew install cmake ninja pkg-config boost\n```\n\nOptional feature packages:\n\n```bash\nbrew install libxml2 curl nlohmann-json\n```\n\nArrow/Parquet:\n\n```bash\nbrew install apache-arrow\n```\n\nBuild (project default):\n\n```bash\n./scripts/build/build.sh\n```\n\nMinimal build when optional dependencies are unavailable:\n\n```bash\ncmake -S . 
-B build \\\n  -DMARKQL_WITH_LIBXML2=OFF \\\n  -DMARKQL_WITH_CURL=OFF \\\n  -DMARKQL_WITH_ARROW=OFF \\\n  -DMARKQL_WITH_NLOHMANN_JSON=OFF \\\n  -DMARKQL_BUILD_AGENT=ON \\\n  -DMARKQL_AGENT_FETCH_DEPS=ON\ncmake --build build\n```\n\nTo build without Boost, add `-DMARKQL_ENABLE_KHMER_NUMBER=OFF`.\n\nRun one query:\n\n```bash\n./build/markql --query \"SELECT div FROM doc LIMIT 5;\" --input ./data/index.html\n```\n\nRun interactive REPL:\n\n```bash\n./build/markql --interactive --input ./data/index.html\n```\n\n## Install MarkQL Desktop\n\nCurrent desktop releases ship three user-facing assets:\n\n- `MarkQL-Desktop-\u003cversion\u003e-linux-x86_64.AppImage`\n- `MarkQL-Desktop-\u003cversion\u003e-windows-x86_64.msi`\n- `markql-extension.zip`\n\nPython package releases continue to use `v*` tags. Desktop installer releases use `desktop-v*` tags.\n\nInstall flow today:\n\n1. Download and install MarkQL Desktop from the latest GitHub Release.\n2. Download `markql-extension.zip` from the same release and extract it.\n3. Open `chrome://extensions`.\n4. Enable `Developer mode`.\n5. Click `Load unpacked`.\n6. Select the extracted `markql-extension` folder.\n7. Launch MarkQL Desktop.\n8. Click `Copy Token`.\n9. Paste the token into the extension.\n10. 
Open a page and run queries.\n\nLinux AppImage note:\n\n- If the AppImage is not executable after download, run `chmod +x MarkQL-Desktop-\u003cversion\u003e-linux-x86_64.AppImage`.\n\nWindows note:\n\n- The MSI is unsigned in the MVP, so Windows may show an \"unknown publisher\" warning.\n\n## Browser Plugin MVP\n\nBuild and run `markql-agent` (localhost `127.0.0.1:7337`):\n\n```bash\n./scripts/build/build.sh\n./scripts/agent/start-agent.sh\n```\n\nNotes:\n- `MARKQL_AGENT_TOKEN` is the primary agent token variable.\n- `scripts/agent/start-agent.sh` sets a default token if none is provided.\n- A legacy agent token variable still works during the migration window.\n- You can set your own token:\n\n```bash\nMARKQL_AGENT_TOKEN=your-secret-token ./scripts/agent/start-agent.sh\n```\n\nLoad the Chrome extension:\n1. Open `chrome://extensions`\n2. Enable `Developer mode`\n3. Click `Load unpacked`\n4. Select `browser_plugin/extension`\n\nExtension host permission:\n- `http://127.0.0.1:7337/*`\n\n## CLI Notes\n\n- The primary CLI binary is `./build/markql`.\n- A legacy compatibility binary is also still generated.\n- `doc` and `document` are both valid sources in `FROM`.\n- If `--input` is omitted, the CLI reads HTML from `stdin`.\n- URL sources (`FROM 'https://...'`) require `MARKQL_WITH_CURL=ON`.\n- `TO PARQUET(...)` requires `MARKQL_WITH_ARROW=ON`.\n- `INNER_HTML(...)` returns minified HTML by default. 
Use `RAW_INNER_HTML(...)` for unmodified raw output.\n- `TO TABLE(...)` supports explicit trimming/sparse options: `TRIM_EMPTY_ROWS`, `TRIM_EMPTY_COLS`, `EMPTY_IS`, `STOP_AFTER_EMPTY_ROWS`, `FORMAT`, `SPARSE_SHAPE`, and `HEADER_NORMALIZE`.\n\n## Testing\n\nC++ tests:\n\n```bash\ncmake --build build --target markql_tests\nctest --test-dir build --output-on-failure\n```\n\nPython package/tests (optional):\n\n```bash\n./scripts/python/install.sh\n./scripts/python/test.sh\n```\n\nBrowser plugin UI tests (optional):\n\n```bash\nnpm install\nnpx playwright install chromium\nnpm run test:browser-plugin\n```\n\n## Documentation\n\n- Book (chapter path + verified examples): [docs/book/SUMMARY.md](docs/book/SUMMARY.md)\n- Canonical tutorial: [docs/markql-tutorial.md](docs/markql-tutorial.md)\n- CLI guide: [docs/markql-cli-guide.md](docs/markql-cli-guide.md)\n- Editor support plan: [docs/editor-support-plan.md](docs/editor-support-plan.md)\n- VS Code extension: [docs/vscode-extension.md](docs/vscode-extension.md)\n- Vim plugin: [docs/vim-plugin.md](docs/vim-plugin.md)\n- Docs index: [docs/README.md](docs/README.md)\n- Script layout: [scripts/README.md](scripts/README.md)\n- Changelog: [CHANGELOG.md](CHANGELOG.md)\n\n## License\n\nApache License 2.0. See [LICENSE](LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarhnoel%2Fmarkql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdarhnoel%2Fmarkql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarhnoel%2Fmarkql/lists"}