{"id":20512051,"url":"https://github.com/chxmbley/sitex","last_synced_at":"2026-05-28T01:31:37.762Z","repository":{"id":144046926,"uuid":"226946095","full_name":"chxmbley/sitex","owner":"chxmbley","description":"Read text content from websites ignoring styling, behavior, and structure","archived":false,"fork":false,"pushed_at":"2019-12-09T19:17:21.000Z","size":5,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-01-14T18:25:03.565Z","etag":null,"topics":["go","golang","parser","reader","text","text-analysis","web","webscraper","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chxmbley.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-12-09T19:03:50.000Z","updated_at":"2019-12-09T19:17:23.000Z","dependencies_parsed_at":"2023-06-18T23:26:37.153Z","dependency_job_id":null,"html_url":"https://github.com/chxmbley/sitex","commit_stats":null,"previous_names":["jdchum/sitex"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/chxmbley/sitex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chxmbley%2Fsitex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chxmbley%2Fsitex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chxmbley%2Fsitex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chxmbley%2Fsitex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chxmbley","download_url":"https://codeload.gith
ub.com/chxmbley/sitex/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chxmbley%2Fsitex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33590884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-27T02:00:06.184Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","golang","parser","reader","text","text-analysis","web","webscraper","webscraping"],"created_at":"2024-11-15T20:39:33.954Z","updated_at":"2026-05-28T01:31:37.748Z","avatar_url":"https://github.com/chxmbley.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# jdchum/sitex\n\nPackage `jdchum/sitex` reads the text content from websites ignoring styling, behavior, and structure. 
This package can be used to search site text for key words and phrases as well as to monitor text for changes.\n\n## Install\n\n```sh\ngo get -u github.com/jdchum/sitex\n```\n\n## Example\n\n```go\npackage main\n\nimport (\n    \"io/ioutil\"\n\n    \"github.com/jdchum/sitex\"\n)\n\nconst url = \"https://en.wikipedia.org/wiki/Go_(programming_language)\"\n\nfunc main() {\n    // Get the site's text\n    text, err := sitex.GetSiteText(url, \" \")\n    if err != nil {\n        panic(err)\n    }\n\n    // Output the text to disk\n    err = ioutil.WriteFile(\"out.txt\", []byte(text), 0644)\n    if err != nil {\n        panic(err)\n    }\n}\n\n```\n\n## API\n\n### `sitex.GetSiteText(url, sep string) (text string, err error)`\n\n\u003e Attempts to parse all human-readable text from a webpage. \"Invisible\" text such as HTML tags, JavaScript, and CSS is ignored.\n\n* `url` - URL of the webpage to fetch and parse\n* `sep` - Separator to place between chunks of parsed text\n\nReturns the text parsed from the webpage or an error if one occurred.\n\n## Limitations\n\nText is parsed as-is from the initial content returned by the server. This means that content requiring additional network requests or user interactions is not available to the parser.\n\n## Roadmap\n\n* [ ] Unicode support\n* [ ] Parse visible text from attributes\n* [ ] Follow server redirects\n* [x] Parse embedded iframes\n* [ ] Parse embedded PDF text\n\n## License\n\nMIT licensed. Copyright (c) 2019-2020 Joshua Chumbley. See the LICENSE file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchxmbley%2Fsitex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchxmbley%2Fsitex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchxmbley%2Fsitex/lists"}