{"id":16974936,"url":"https://github.com/luciopaiva/dicio-crawler","last_synced_at":"2026-05-02T03:31:52.580Z","repository":{"id":47429097,"uuid":"152919796","full_name":"luciopaiva/dicio-crawler","owner":"luciopaiva","description":"Node.js crawler for dicio.com.br.","archived":false,"fork":false,"pushed_at":"2022-12-08T02:42:09.000Z","size":89,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-11T13:52:06.020Z","etag":null,"topics":["crawler","nodejs","scraper"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luciopaiva.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-13T22:29:42.000Z","updated_at":"2020-12-13T12:01:52.000Z","dependencies_parsed_at":"2023-01-24T10:01:07.456Z","dependency_job_id":null,"html_url":"https://github.com/luciopaiva/dicio-crawler","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luciopaiva%2Fdicio-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luciopaiva%2Fdicio-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luciopaiva%2Fdicio-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luciopaiva%2Fdicio-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luciopaiva","download_url":"https://codeload.github.com/luciopaiva/dicio-crawler/tar.gz/refs/heads/master","host":{"name":"Git
Hub","url":"https://github.com","kind":"github","repositories_count":247369931,"owners_count":20927927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","nodejs","scraper"],"created_at":"2024-10-14T01:08:45.382Z","updated_at":"2026-05-02T03:31:47.548Z","avatar_url":"https://github.com/luciopaiva.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# dicio.com.br crawler\n\nAn experimental crawler for dicio.com.br. It makes throttled, concurrent requests and saves the results to a local SQLite database.\n\nHint: if you just want to grab the dictionary for offline use, simply download Dicio's [mobile app](https://play.google.com/store/apps/details?id=com.setegraus.dicio).\n\n## License\n\nYou should use this only for educational purposes. Please see `LICENSE.md`. 
I am not involved with `dicio.com.br` in any way, so run it at your own risk.\n\n## How to install and run the crawler\n\n    nvm install\n    npm install\n\n## How the crawler works\n\nStarting with some seed words (see `crawler.js::getOpenUrls()`), it crawls `dicio.com.br`, fetching each word's page, which in turn contains links to other words.\n\nFor each word, the crawler persists that word's definition and uses the linked words found in the page to continue crawling the website (I call these \"open URLs\").\n\nWhenever the crawler stops (whether because it was told to via `crawler.js::NUMBER_OF_REQUESTS_TO_MAKE` or because the user hit Ctrl+C to stop the application), it persists the current list of open URLs to the database before terminating.\n\nThe size of the list of open URLs is constrained by two limits: one in memory and the other in the database. This prevents memory from being exhausted and also avoids increasingly long delays when saving data to the database. There's no need to keep huge numbers of open URLs anyway.\n\n## Future improvements\n\nLater I found out that dicio's mobile app uses a JSON API to fetch word definitions:\n\n    https://www.dicio.com.br/api/indexv2.php?p=cadeira\n\nAnd another one for synonyms:\n\n    https://www.sinonimos.com.br/api/?method=getSinonimos\u0026palavra=cadeira\n\nAn improvement would be to start using them instead of parsing HTML. One drawback is that we lose the links to other words. Synonyms could be used instead, but we'd end up trapped in a subset of known words (e.g., starting with \"cadeira\", you would probably never find \"mesa\" through its synonyms, their synonyms, and so on).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluciopaiva%2Fdicio-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluciopaiva%2Fdicio-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluciopaiva%2Fdicio-crawler/lists"}