https://github.com/schbenedikt/web-crawler
A simple web crawler using Python that stores the metadata of each web page in a database.
- Host: GitHub
- URL: https://github.com/schbenedikt/web-crawler
- Owner: SchBenedikt
- Created: 2024-06-09T05:49:27.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-15T07:22:52.000Z (about 1 year ago)
- Last Synced: 2025-04-14T13:21:48.478Z (about 1 year ago)
- Topics: crawler, database, mariadb, mysql, python, python-crawler, web
- Language: Python
- Homepage:
- Size: 42 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# web-crawler
A simple web crawler using Python that stores the metadata and main content of each web page in a database.
## Purpose and Functionality
The web crawler is designed to crawl web pages starting from a base URL, extract metadata such as title, description, image, locale, and type along with the main content, and store this information in a MongoDB database. It can crawl to a configurable depth and respects the `robots.txt` rules of the websites it visits.
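As a rough illustration of the extraction step, here is a minimal sketch (not the repository's exact code) using `requests` and `beautifulsoup4`; reading the fields from Open Graph `<meta>` tags is an assumption:
```python
import requests
from bs4 import BeautifulSoup

def extract_meta_data(url):
    """Fetch a page and pull out the fields the crawler stores (sketch)."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    def og(prop):
        # Open Graph tags such as <meta property="og:image" content="...">
        tag = soup.find("meta", property=f"og:{prop}")
        return tag["content"] if tag and tag.has_attr("content") else None

    description = soup.find("meta", attrs={"name": "description"})
    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else None,
        "description": description["content"] if description else og("description"),
        "image": og("image"),
        "locale": og("locale"),
        "type": og("type"),
        "main_content": soup.get_text(separator=" ", strip=True),
    }
```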
## Dependencies
The project requires the following dependencies:
- `requests`
- `beautifulsoup4`
- `pymongo`
You can install the dependencies using the following command:
```bash
pip install -r requirements.txt
```
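The `requirements.txt` in the repository is presumably equivalent to the list above; a minimal, unpinned version would look like this:
```text
requests
beautifulsoup4
pymongo
```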
## Setting Up and Running the Web Crawler
1. Clone the repository:
```bash
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
```
2. Install the dependencies:
```bash
pip install -r requirements.txt
```
3. Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at `localhost:27017` and uses a database named `search_engine` (see the connection sketch after these steps).
4. Run the web crawler:
```bash
python crawler.py
```
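For reference, connecting to that instance from Python with `pymongo` looks roughly like this (a sketch; the `meta_data` collection name matches the section further below):
```python
from pymongo import MongoClient

# Connect to the local MongoDB instance the crawler expects.
client = MongoClient("mongodb://localhost:27017/")
db = client["search_engine"]          # database used by the crawler
meta_data = db["meta_data"]           # one document per crawled page
print(meta_data.count_documents({}))  # quick sanity check
```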
## Installing MongoDB
To install MongoDB on your local machine, follow the instructions for your operating system:
### Windows
1. Download the MongoDB installer from the official MongoDB website: [MongoDB Download Center](https://www.mongodb.com/try/download/community)
2. Run the installer and follow the installation steps.
3. After installation, start the MongoDB service by running the following command in the Command Prompt:
```bash
net start MongoDB
```
### macOS
1. Install Homebrew if you haven't already: [Homebrew Installation](https://brew.sh/)
2. Use Homebrew to install MongoDB by running the following command in the Terminal:
```bash
brew tap mongodb/brew
brew install mongodb-community@4.4
```
3. Start the MongoDB service by running the following command:
```bash
brew services start mongodb/brew/mongodb-community@4.4
```
### Linux
1. Follow the official MongoDB installation guide for your specific Linux distribution: [MongoDB Installation Guides](https://docs.mongodb.com/manual/installation/)
2. After installation, start the MongoDB service by running the following command:
```bash
sudo systemctl start mongod
```
## Creating the Database and Collection
To create the `search_engine` database and the `meta_data` collection in MongoDB, follow these steps:
1. Open the MongoDB shell by running the following command in your terminal (on newer MongoDB versions the shell is called `mongosh`):
```bash
mongo
```
2. Switch to the `search_engine` database (MongoDB creates it when data is first written):
```javascript
use search_engine
```
3. Create the `meta_data` collection:
```javascript
db.createCollection("meta_data")
```
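Since MongoDB creates databases and collections lazily on first write, this step is optional; the equivalent from Python would be:
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["search_engine"]       # created implicitly on first use
db.create_collection("meta_data")  # raises CollectionInvalid if it already exists
```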
## Example Usage
The web crawler starts from the base URL `https://github.com/schBenedikt` and extracts metadata and main content from each page it visits. The metadata and main content are then stored in the `meta_data` collection of the `search_engine` database in MongoDB.
Here is an example of how the metadata and main content are stored in the database:
```json
{
  "url": "https://github.com/schBenedikt",
  "title": "schBenedikt - GitHub",
  "description": "GitHub profile of schBenedikt",
  "image": "https://avatars.githubusercontent.com/u/12345678?v=4",
  "locale": "en_US",
  "type": "profile",
  "main_content": "This is the main content of the page."
}
```
The web crawler prints the metadata and main content of each page it visits to the console and saves them to the database. If a page is not reachable, its entry is deleted from the database, as sketched below.
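A minimal sketch of that store-or-delete behaviour (illustrative only; it assumes the `meta_data` collection handle and the `extract_meta_data` helper sketched earlier):
```python
import requests

def crawl_page(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Page is unreachable: drop any stale entry for this URL.
        meta_data.delete_one({"url": url})
        return
    data = extract_meta_data(url)
    print(data)
    # Upsert so that re-crawling a page updates its entry instead of duplicating it.
    meta_data.update_one({"url": url}, {"$set": data}, upsert=True)
```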
## Notes
- The web crawler respects the `robots.txt` rules of the websites it visits; a sketch of such a check follows below.
- The crawl depth is configurable in the `get_meta_data_from_url` function.
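One way to perform such a `robots.txt` check with the standard library (the repository's actual implementation may differ):
```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    """Check a site's robots.txt before fetching a URL."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)
```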