
https://github.com/alaouimehdi1995/simplified-search-engine

Multithreaded Web Crawler, Scraper, Indexer

container crawl crawler crawling database docker docker-compose engine index indexer indexing mongodb python python-3 scraper scraping search-algorithm search-engine searching



Simplified Searching Engine

[![Build Status](https://travis-ci.org/alaouimehdi1995/simplified-search-engine.png?branch=master)](https://travis-ci.org/alaouimehdi1995/simplified-search-engine)
[![codecov](https://codecov.io/gh/alaouimehdi1995/simplified-search-engine/branch/master/graph/badge.svg)](https://codecov.io/gh/alaouimehdi1995/simplified-search-engine)

A simplified search engine that crawls, scrapes, and indexes data, then stores it in a database.


The program is written in Python. It uses regular expressions to parse HTML and multithreading to speed up crawling.
MongoDB provides the database layer.
The project contains 4 files:

PersonnalParser.py:


- Contains the PersonnalParser class, which fetches HTML content, parses it, stores it, and starts a new PersonnalParser thread for each link found in the page content.
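
The crawl pattern described above (regex link extraction, one thread per discovered link) can be sketched as follows. This is a minimal illustration, not the project's actual code: `CrawlThread`, `extract_links`, `HREF_RE`, and the injected `fetch` callable are hypothetical names, and the depth limit and shared `seen` set are assumptions added so the sketch terminates.

```python
import re
import threading
from urllib.parse import urljoin

# Naive pattern for href attributes of <a> tags. Regex-based HTML parsing
# is fragile; this only illustrates the approach named in the README.
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\'#]+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute URLs for every href found in the HTML string."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


class CrawlThread(threading.Thread):
    """Fetch one page, store it, then spawn one thread per link (hypothetical)."""

    def __init__(self, url, depth, fetch, store, seen, lock):
        super().__init__()
        self.url = url
        self.depth = depth      # remaining link hops allowed from this page
        self.fetch = fetch      # callable: url -> HTML string, or None on failure
        self.store = store      # shared dict: url -> HTML
        self.seen = seen        # shared set of URLs already claimed by a thread
        self.lock = lock        # guards both store and seen

    def run(self):
        # Claim the URL so that no other thread crawls it again.
        with self.lock:
            if self.url in self.seen:
                return
            self.seen.add(self.url)

        html = self.fetch(self.url)
        if html is None:
            return
        with self.lock:
            self.store[self.url] = html

        if self.depth <= 0:
            return
        children = [
            CrawlThread(link, self.depth - 1, self.fetch,
                        self.store, self.seen, self.lock)
            for link in extract_links(html, self.url)
        ]
        for child in children:
            child.start()
        for child in children:
            child.join()
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library; a real crawler would supply a function that downloads the page, possibly through a proxy.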

DBManager.py:


- Contains the DBManager class, which handles the connection to the database and the insert and find operations.
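
A thin wrapper of this kind might look like the sketch below. It is an assumption-laden illustration rather than the project's DBManager: the class name `PageStore`, its method names, the `search_engine` database, the `pages` collection, and the `url`/`content` fields are all hypothetical. The only pymongo calls used are `MongoClient`, `Collection.update_one`, and `Collection.find`.

```python
class PageStore:
    """Wraps a MongoDB collection of crawled pages (hypothetical sketch)."""

    def __init__(self, collection):
        # Any object exposing update_one() and find() works here,
        # which makes the class easy to exercise without a live server.
        self.collection = collection

    @classmethod
    def connect(cls, host="localhost", port=27017,
                db_name="search_engine", collection_name="pages"):
        # Imported lazily so the class can be used without pymongo installed.
        # The database and collection names are assumptions for illustration.
        from pymongo import MongoClient
        client = MongoClient(host, port)
        return cls(client[db_name][collection_name])

    def save_page(self, url, content):
        # Upsert keyed on the URL so that re-crawling a page overwrites it.
        self.collection.update_one(
            {"url": url},
            {"$set": {"content": content}},
            upsert=True,
        )

    def find_pages(self, query=None):
        # An empty filter returns every stored page.
        return list(self.collection.find(query or {}))
```

With a MongoDB server running, `PageStore.connect()` would return a store backed by a real collection.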

fill_database.py:


- Contains the general settings, such as the start URL, proxy settings, and search depth. The first crawl thread starts here.

main.py:


- Contains the code that reads the user's search query, fetches the database content, and sorts the results by relevance.
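
The README does not say how relevance is computed. As one plausible reading, the sketch below scores each page by how often the query terms occur in its text and sorts by that score. The term-frequency scoring and the names `tokenize` and `rank` are assumptions made for illustration, not the project's documented algorithm.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Rank documents (url -> text) by total occurrences of the query terms.

    Returns (url, score) pairs, highest score first; pages that contain
    none of the query terms are omitted.
    """
    terms = tokenize(query)
    scored = []
    for url, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((url, score))
    # Sort by descending score, breaking ties alphabetically by URL.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

Raw term counts favor long pages; a weighting such as TF-IDF would be a natural refinement if that matters.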