{"id":17932751,"url":"https://github.com/lapetitesouris/jarvis_scraper","last_synced_at":"2025-04-03T11:16:46.721Z","repository":{"id":96859002,"uuid":"92596936","full_name":"LaPetiteSouris/jarvis_scraper","owner":"LaPetiteSouris","description":"Focused crawling","archived":false,"fork":false,"pushed_at":"2017-05-27T20:00:55.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-09T00:38:56.110Z","etag":null,"topics":["crawling","focused"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LaPetiteSouris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-27T12:19:35.000Z","updated_at":"2017-05-27T19:58:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"02f77354-3734-4e78-b39c-2d43ef02ef0e","html_url":"https://github.com/LaPetiteSouris/jarvis_scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fjarvis_scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fjarvis_scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fjarvis_scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fjarvis_scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LaPetiteSouris","do
wnload_url":"https://codeload.github.com/LaPetiteSouris/jarvis_scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246989754,"owners_count":20865331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","focused"],"created_at":"2024-10-28T21:30:16.316Z","updated_at":"2025-04-03T11:16:46.703Z","avatar_url":"https://github.com/LaPetiteSouris.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Focused crawler\n\nThis is a simple version of a focused crawler. Instead of indiscriminately crawling every URL of a given domain, it first parses all sub-pages, then for each sub-page computes the cosine similarity between the page's content and the content of a given page model.\n\nIt then returns a list of the sub-pages (URLs) across all given domains whose contents best match the given page model.\n\nThis technique is called **focused crawling** and is common in large-scale data collection. It is used to avoid crawling unrelated pages.\n\n# How does it work?\n\nIn this simple version, the procedure is:\n1. Declare all domains of interest in `jarvis_scraper/sipders/jarvis_scraper`, given as a list:\n```\nstart_urls = ['http://www.musee-armee.fr']\n```\n\n2. The spider parses all sub-pages of each domain declared in `start_urls`.\n3. The crawler extracts the raw content of each sub-page.\n4. The crawler uses the RAKE algorithm to extract a list of keywords for each page.\n5. It then calculates the cosine similarity between this page and the declared keywords. 
The keywords are simply stored as a list in `jarvis_scraper/nlp/lib`:\n```\nstandard_keywords = ['découvrir', 'conférences', 'musée',\n                     'expositions', 'agenda', 'objets', 'visitez',\n                     'enfants', 'ligne', 'visites', 'recherche',\n                     'missions', \"Conditions d'accès\", 'visite',\n                     'rechercher', 'jusqu', 'billetterie',\n                     'exposition']\n```\n6. For each domain, the crawler returns the 5 URLs with the highest cosine scores.\n\n# How to try?\n\n1. Install the dependencies in `requirements.txt` with Python 3 (3.4 is the tested environment).\n2. Navigate to the project root directory and run the following command:\n```\nscrapy crawl jarvis_scraper -o result.csv\n```\n\nThe 5 URLs whose contents are most related to the given model are written to the `result.csv` file.\n\nYou can tweak the page model (declared above) and change the domains of interest to see how the results change.\n\nNote that this version works best with French-language pages.\n\n# Libraries used\n\nThe implementation of the RAKE algorithm and the list of French stopwords are taken from [here](https://github.com/zelandiya/RAKE-tutorial).\n\nThanks to the author for her implementation and the French stopwords.\n# License\nThe source code is released under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flapetitesouris%2Fjarvis_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flapetitesouris%2Fjarvis_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flapetitesouris%2Fjarvis_scraper/lists"}