{"id":15173872,"url":"https://github.com/schbenedikt/web-crawler","last_synced_at":"2025-04-14T13:21:52.244Z","repository":{"id":243466724,"uuid":"812516981","full_name":"SchBenedikt/web-crawler","owner":"SchBenedikt","description":"A simple web crawler using Python that stores the metadata of each web page in a database.","archived":false,"fork":false,"pushed_at":"2025-02-15T07:22:52.000Z","size":43,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T13:21:48.478Z","etag":null,"topics":["crawler","database","mariadb","mysql","python","python-crawler","web"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SchBenedikt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-09T05:49:27.000Z","updated_at":"2025-04-13T11:21:05.000Z","dependencies_parsed_at":"2024-09-23T04:00:36.742Z","dependency_job_id":"1d4e5983-a58e-476b-b8b6-c244cd5971ef","html_url":"https://github.com/SchBenedikt/web-crawler","commit_stats":{"total_commits":6,"total_committers":1,"mean_commits":6.0,"dds":0.0,"last_synced_commit":"9718cbfdfc234de3af814b1d85064ac7dcb4036e"},"previous_names":["schbenedikt/web-crawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fweb-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fweb-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/SchBenedikt%2Fweb-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fweb-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SchBenedikt","download_url":"https://codeload.github.com/SchBenedikt/web-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248886344,"owners_count":21177647,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","database","mariadb","mysql","python","python-crawler","web"],"created_at":"2024-09-27T11:03:59.421Z","updated_at":"2025-04-14T13:21:52.220Z","avatar_url":"https://github.com/SchBenedikt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# web-crawler\n\nA simple web crawler using Python that stores the metadata and main content of each web page in a database.\n\n## Purpose and Functionality\n\nThe web crawler is designed to crawl web pages starting from a base URL, extract metadata such as title, description, image, locale, type, and main content, and store this information in a MongoDB database. The crawler can handle multiple levels of depth and respects the `robots.txt` rules of the websites it visits.\n\n## Dependencies\n\nThe project requires the following dependencies:\n\n- `requests`\n- `beautifulsoup4`\n- `pymongo`\n\nYou can install the dependencies using the following command:\n\n```bash\npip install -r requirements.txt\n```\n\n## Setting Up and Running the Web Crawler\n\n1. 
Clone the repository:\n\n```bash\ngit clone https://github.com/schBenedikt/web-crawler.git\ncd web-crawler\n```\n\n2. Install the dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n3. Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at `localhost:27017` and uses a database named `search_engine`.\n\n4. Run the web crawler:\n\n```bash\npython crawler.py\n```\n\n## Installing MongoDB\n\nTo install MongoDB on your local machine, follow the instructions for your operating system:\n\n### Windows\n\n1. Download the MongoDB installer from the official MongoDB website: [MongoDB Download Center](https://www.mongodb.com/try/download/community)\n2. Run the installer and follow the installation steps.\n3. After installation, start the MongoDB service by running the following command in the Command Prompt:\n\n```bash\nnet start MongoDB\n```\n\n### macOS\n\n1. Install Homebrew if you haven't already: [Homebrew Installation](https://brew.sh/)\n2. Use Homebrew to install MongoDB by running the following commands in the Terminal:\n\n```bash\nbrew tap mongodb/brew\nbrew install mongodb-community@4.4\n```\n\n3. Start the MongoDB service by running the following command:\n\n```bash\nbrew services start mongodb/brew/mongodb-community@4.4\n```\n\n### Linux\n\n1. Follow the official MongoDB installation guide for your specific Linux distribution: [MongoDB Installation Guides](https://docs.mongodb.com/manual/installation/)\n2. After installation, start the MongoDB service by running the following command:\n\n```bash\nsudo systemctl start mongod\n```\n\n## Creating the Database and Collection\n\nTo create the `search_engine` database and the `meta_data` collection in MongoDB, follow these steps:\n\n1. Open the MongoDB shell by running the following command in your terminal:\n\n```bash\nmongo\n```\n\n2. Create the `search_engine` database and switch to it:\n\n```javascript\nuse search_engine\n```\n\n3. 
Create the `meta_data` collection:\n\n```javascript\ndb.createCollection(\"meta_data\")\n```\n\n## Example Usage\n\nThe web crawler starts from the base URL `https://github.com/schBenedikt` and extracts metadata and main content from each page it visits. The metadata and main content are then stored in the `meta_data` collection of the `search_engine` database in MongoDB.\n\nHere is an example of how the metadata and main content are stored in the database:\n\n```json\n{\n  \"url\": \"https://github.com/schBenedikt\",\n  \"title\": \"schBenedikt - GitHub\",\n  \"description\": \"GitHub profile of schBenedikt\",\n  \"image\": \"https://avatars.githubusercontent.com/u/12345678?v=4\",\n  \"locale\": \"en_US\",\n  \"type\": \"profile\",\n  \"main_content\": \"This is the main content of the page.\"\n}\n```\n\nThe web crawler will print the metadata and main content of each page it visits to the console and save it to the database. If a page is not reachable, the corresponding entry will be deleted from the database.\n\n## Notes\n\n- The web crawler respects the `robots.txt` rules of the websites it visits.\n- The web crawler can handle multiple levels of depth, which can be configured in the `get_meta_data_from_url` function.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschbenedikt%2Fweb-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschbenedikt%2Fweb-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschbenedikt%2Fweb-crawler/lists"}