{"id":44129467,"url":"https://github.com/happyhackingspace/dit","last_synced_at":"2026-03-04T17:01:01.161Z","repository":{"id":336862030,"uuid":"1151333888","full_name":"HappyHackingSpace/dit","owner":"HappyHackingSpace","description":"HTML page, form and field type classifier using ML (LogReg + CRF)","archived":false,"fork":false,"pushed_at":"2026-02-27T13:18:25.000Z","size":530,"stargazers_count":12,"open_issues_count":14,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-27T18:33:51.102Z","etag":null,"topics":["classification","cli","crf","forms","go","html","logistic-regression","machine-learning","nlp","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HappyHackingSpace.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-06T10:36:41.000Z","updated_at":"2026-02-27T13:16:25.000Z","dependencies_parsed_at":"2026-02-11T00:02:09.026Z","dependency_job_id":null,"html_url":"https://github.com/HappyHackingSpace/dit","commit_stats":null,"previous_names":["happyhackingspace/dit"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/HappyHackingSpace/dit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HappyHackingSpace%2Fdit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HappyHackingSpace%2Fdit/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/HappyHackingSpace%2Fdit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HappyHackingSpace%2Fdit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HappyHackingSpace","download_url":"https://codeload.github.com/HappyHackingSpace/dit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HappyHackingSpace%2Fdit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30086512,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T15:40:14.053Z","status":"ssl_error","status_checked_at":"2026-03-04T15:40:13.655Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","cli","crf","forms","go","html","logistic-regression","machine-learning","nlp","web-scraping"],"created_at":"2026-02-08T22:00:32.761Z","updated_at":"2026-03-04T17:01:01.059Z","avatar_url":"https://github.com/HappyHackingSpace.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dît\n![Banner](banner.png)\n\n\n**dît** (means *found* in Kurdish) tells you the type of an HTML page, form, and fields using machine learning.\n\nIt classifies pages (login, error, landing, blog, etc.), detects whether a form is a login, search, registration, password recovery, contact, mailing list, order form, or 
something else, and classifies each field (username, password, email, search query, etc.). Zero external ML dependencies.\n\n## Install\n\n```bash\ngo get github.com/happyhackingspace/dit\n```\n\n## Usage\n\n### As a Library\n\n```go\nimport \"github.com/happyhackingspace/dit\"\n\n// Load classifier (finds model.json automatically)\nc, _ := dit.New()\n\n// Classify page type\npage, _ := c.ExtractPageType(htmlString)\nfmt.Println(page.Type)  // \"login\"\nfmt.Println(page.Forms) // form classifications included\n\n// Classify forms in HTML\nresults, _ := c.ExtractForms(htmlString)\nfor _, r := range results {\n    fmt.Println(r.Type)   // \"login\"\n    fmt.Println(r.Fields) // {\"username\": \"username or email\", \"password\": \"password\"}\n}\n\n// With probabilities\npageProba, _ := c.ExtractPageTypeProba(htmlString, 0.05)\nformProba, _ := c.ExtractFormsProba(htmlString, 0.05)\n\n// Train a new model\ntrained, _ := dit.Train(\"data/\", \u0026dit.TrainConfig{Verbose: true})\ntrained.Save(\"model.json\")\n\n// Evaluate via cross-validation\nresult, _ := dit.Evaluate(\"data/\", \u0026dit.EvalConfig{Folds: 10})\nfmt.Printf(\"Form accuracy: %.1f%%\\n\", result.FormAccuracy*100)\nfmt.Printf(\"Page accuracy: %.1f%%\\n\", result.PageAccuracy*100)\n```\n\n### As a CLI\n\n```bash\n# Classify page type and forms on a URL\ndit run https://github.com/login\n\n# Classify forms in a local file\ndit run login.html\n\n# With probabilities\ndit run https://github.com/login --proba\n\n# Download training data and model from Hugging Face\ndit data download\n\n# Train a model\ndit train model.json --data-folder data\n\n# Evaluate model accuracy\ndit evaluate --data-folder data\n\n# Upload training data and model to Hugging Face\ndit data upload\n```\n\n## Page Types\n\n| Type | Description |\n|------|-------------|\n| `login` | Login page |\n| `registration` | Registration / signup page |\n| `search` | Search results page |\n| `checkout` | Checkout / payment page |\n| `contact` | Contact page 
|\n| `password_reset` | Password reset page |\n| `landing` | Landing / home page |\n| `product` | Product page |\n| `blog` | Blog / article page |\n| `settings` | Settings / account page |\n| `soft_404` | Soft 404 (HTTP 200 but \"not found\" content) |\n| `error` | Error page (404, 403, 500, etc.) |\n| `captcha` | CAPTCHA / bot detection page |\n| `parked` | Domain parking page |\n| `coming_soon` | Under construction / maintenance page |\n| `admin` | Admin panel / dashboard |\n| `directory_listing` | Open directory index |\n| `default_page` | Unconfigured server default |\n| `waf_block` | WAF block page |\n| `other` | Other page type |\n\n## Form Types\n\n| Type | Description |\n|------|-------------|\n| `login` | Login form |\n| `search` | Search form |\n| `registration` | Registration / signup form |\n| `password/login recovery` | Password reset / recovery form |\n| `contact/comment` | Contact or comment form |\n| `join mailing list` | Newsletter / mailing list signup |\n| `order/add to cart` | Order or add-to-cart form |\n| `other` | Other form type |\n\n## Field Types\n\n| Category | Types |\n|----------|-------|\n| **Authentication** | username, password, password confirmation, email, email confirmation, username or email |\n| **Names** | first name, last name, middle name, full name, organization name, gender |\n| **Address** | country, city, state, address, postal code |\n| **Contact** | phone, fax, url |\n| **Search** | search query, search category |\n| **Content** | comment text, comment title, about me text |\n| **Buttons** | submit button, cancel button, reset button |\n| **Verification** | captcha, honeypot, TOS confirmation, remember me checkbox, receive emails confirmation |\n| **Security** | security question, security answer |\n| **Time** | full date, day, month, year, timezone |\n| **Product** | product quantity, sorting option, style select |\n| **Other** | other number, other read-only, other |\n\nFull list of 79 field type codes in 
`data/config.json` (run `dit data download` to get the data).\n\n## Accuracy\n\nCross-validation results (10-fold, grouped by domain):\n\n| Metric | Score |\n|--------|-------|\n| Form type accuracy | 82.9% (1135/1369) |\n| Field type accuracy | 86.6% (4518/5218) |\n| Sequence accuracy | 78.7% (1025/1302) |\n| Page type accuracy | 53.4% (403/754) |\n| Page macro F1 | 40.2% |\n| Page weighted F1 | 53.6% |\n\nTrained on 1000+ annotated web forms and 754 annotated web pages.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## Credits\n\nGo port of [Formasaurus](https://github.com/scrapinghub/Formasaurus).\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhappyhackingspace%2Fdit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhappyhackingspace%2Fdit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhappyhackingspace%2Fdit/lists"}