https://github.com/alaouimehdi1995/simplified-search-engine

Multithreaded Web Crawler, Scraper, Indexer

Host: GitHub
URL: https://github.com/alaouimehdi1995/simplified-search-engine
Owner: alaouimehdi1995
License: gpl-3.0
Created: 2017-02-06T04:34:21.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2022-12-08T03:52:10.000Z (almost 3 years ago)
Last Synced: 2023-03-10T07:59:02.046Z (over 2 years ago)
Topics: container, crawl, crawler, crawling, database, docker, docker-compose, engine, index, indexer, indexing, mongodb, python, python-3, scraper, scraping, search-algorithm, search-engine, searching
Language: Python
Homepage:
Size: 47.9 KB
Stars: 8
Watchers: 2
Forks: 1
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

README

Simplified Searching Engine

[![Build Status](https://travis-ci.org/alaouimehdi1995/simplified-search-engine.png?branch=master)](https://travis-ci.org/alaouimehdi1995/simplified-search-engine)
[![codecov](https://codecov.io/gh/alaouimehdi1995/simplified-search-engine/branch/master/graph/badge.svg)](https://codecov.io/gh/alaouimehdi1995/simplified-search-engine)

A simplified search engine that crawls, scrapes, and indexes data, then stores it in a database.

The program is written in Python. It uses regular expressions to parse HTML and multithreading to speed up crawling.
MongoDB provides the database layer.
The project contains four files:

PersonnalParser.py:

- Contains the PersonnalParser class, which fetches HTML content, parses it, stores it, and starts a new PersonnalParser thread for each link found in the page content.
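
The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `LINK_RE`, `extract_links`, `crawl`, and the injected `fetch` callable are hypothetical names, and the regex is assumed to handle only quoted, absolute `href` values.

```python
import re
import threading

# Hypothetical pattern: captures absolute http(s) URLs from <a href="..."> tags.
LINK_RE = re.compile(r'<a\s[^>]*?href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the absolute links found in the HTML, in order, without duplicates."""
    seen = []
    for url in LINK_RE.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


def crawl(url, fetch, depth, visited, lock, pages):
    """Fetch one page, store it, then start one thread per outgoing link."""
    with lock:
        if url in visited:
            return  # another thread already handled this URL
        visited.add(url)
    html = fetch(url)
    with lock:
        pages[url] = html
    if depth <= 0:
        return  # depth limit reached: store the page but do not follow links
    threads = [
        threading.Thread(target=crawl, args=(link, fetch, depth - 1, visited, lock, pages))
        for link in extract_links(html)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library. A real crawler would supply a function that downloads the page and returns its text.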

DBManager.py:

- Contains the DBManager class, which handles the connection to the database and the insert and find operations.
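
A wrapper of this kind might look like the sketch below. The class name `PageStore`, its methods, and the document layout (`url` plus a `words` list) are assumptions made for illustration. The README does not describe the real DBManager interface or schema.

```python
class PageStore:
    """Thin wrapper over a MongoDB-style collection (hypothetical name)."""

    def __init__(self, collection):
        # `collection` is assumed to expose insert_one() and find(),
        # as a pymongo Collection does.
        self.collection = collection

    def save_page(self, url, words):
        """Insert one crawled page as a document."""
        self.collection.insert_one({"url": url, "words": words})

    def find_pages(self, word):
        """Return the documents whose `words` list contains `word`."""
        # In MongoDB, an equality filter on an array field matches
        # documents where any array element equals the value.
        return list(self.collection.find({"words": word}))
```

With pymongo installed and a server running, the collection could be obtained with something like `MongoClient("mongodb://localhost:27017")["search"]["pages"]`, where the database and collection names are placeholders.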

fill_database.py:

- Contains the general settings, such as the start URL, proxy settings, and search depth. The first crawl thread starts here.
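
As a rough sketch, the settings listed above could be grouped as shown here. `CrawlSettings`, `make_opener`, the field names, and the default values are all hypothetical. The README names the settings but not how they are stored.

```python
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlSettings:
    """Hypothetical container for the crawl settings."""
    start_url: str = "http://example.com/"  # placeholder start URL
    max_depth: int = 2                      # how many link levels to follow
    proxy: Optional[str] = None             # e.g. "http://127.0.0.1:8080"


def make_opener(settings):
    """Return a urllib opener that routes through the proxy when one is set."""
    if settings.proxy:
        handler = urllib.request.ProxyHandler(
            {"http": settings.proxy, "https": settings.proxy}
        )
        return urllib.request.build_opener(handler)
    return urllib.request.build_opener()
```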

main.py:

- Contains the code that reads the user's search query, retrieves the database content, and sorts the results by relevance.
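
The README does not say how relevance is computed. The sketch below assumes a simple term-frequency score, which may differ from what `main.py` does. `rank_pages` and the URL-to-text mapping are hypothetical.

```python
def rank_pages(query, pages):
    """Sort page URLs by how often the query terms occur in their text.

    `pages` maps URL -> page text. Pages matching no term are dropped.
    """
    terms = query.lower().split()
    scores = {}
    for url, text in pages.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scores[url] = score
    # Highest score first; ties are broken by URL to keep the order stable.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

For example, given pages containing "python crawler python", "mongodb database", and "python database", the query "python database" ranks the first and third pages (two matches each) ahead of the second (one match).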
