https://github.com/alaouimehdi1995/simplified-search-engine
Multithreaded Web Crawler, Scraper, Indexer
- Host: GitHub
- URL: https://github.com/alaouimehdi1995/simplified-search-engine
- Owner: alaouimehdi1995
- License: gpl-3.0
- Created: 2017-02-06T04:34:21.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T03:52:10.000Z (almost 3 years ago)
- Last Synced: 2023-03-10T07:59:02.046Z (over 2 years ago)
- Topics: container, crawl, crawler, crawling, database, docker, docker-compose, engine, index, indexer, indexing, mongodb, python, python-3, scraper, scraping, search-algorithm, search-engine, searching
- Language: Python
- Homepage:
- Size: 47.9 KB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
Simplified Searching Engine
[Travis CI build](https://travis-ci.org/alaouimehdi1995/simplified-search-engine)
[Codecov coverage](https://codecov.io/gh/alaouimehdi1995/simplified-search-engine)
A simplified search engine that crawls, scrapes, and indexes data, then stores it in a database.
The program is written in Python, uses regular expressions to parse HTML, and uses multithreading to speed up crawling.
Data storage is handled by MongoDB.
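As a rough illustration of the regex-plus-threads approach described above, here is a minimal sketch. It is not the project's actual code: `HREF_RE`, `extract_links`, `crawl`, and the `fetch` callable are hypothetical names, and the pattern only handles simple quoted `href` attributes.

```python
import re
import threading
from urllib.parse import urljoin

# Hypothetical pattern (an assumption, not the project's regex):
# captures the value of a quoted href attribute inside an <a> tag.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, in order, without duplicates."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


def crawl(url, depth, fetch, visited, lock):
    """Visit url, then crawl each of its links in its own thread while depth allows.

    fetch is any callable mapping a URL to HTML text (or None on failure);
    it is injected so the sketch can run without network access.
    """
    with lock:  # the visited set is shared by every crawl thread
        if url in visited:
            return
        visited.add(url)
    html = fetch(url)
    if html is None or depth == 0:
        return
    threads = [
        threading.Thread(target=crawl, args=(link, depth - 1, fetch, visited, lock))
        for link in extract_links(html, url)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Passing a dictionary's `get` method as `fetch` is enough to try the sketch on a fake site; a real crawler would pass a function that downloads the page. Starting one thread per link mirrors the README's description but can create many threads on link-heavy pages, so a bounded thread pool may be a safer choice in practice.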
The project contains 4 files:
PersonnalParser.py:
- Contains the PersonnalParser class, which gets a page's HTML content, parses it, stores it, and starts a new PersonnalParser thread for each link found in the page.
DBManager.py:
- Contains the DBManager class, which manages the connection to the database and the insert and find operations.
fill_database.py:
- Contains the general settings, such as the start URL, proxy settings, and search depth. The first crawl thread starts here.
main.py:
- Contains the code that reads the user's search query, fetches the database content, and sorts the results by relevance.
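The README does not say how relevance is computed. As one possible reading of the sorting step in main.py, the sketch below ranks pages by how often the query terms occur in their text. The `score` and `rank` functions and the `url`/`text` document fields are assumptions made for illustration, not the project's actual schema or algorithm.

```python
import re


def score(query, text):
    """Count occurrences of the query terms in text (case-insensitive)."""
    terms = set(re.findall(r"\w+", query.lower()))
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in terms)


def rank(query, pages):
    """Return the URLs of pages matching the query, most relevant first.

    pages is a list of dicts with hypothetical "url" and "text" keys,
    standing in for documents read from the database.
    """
    scored = [(score(query, page["text"]), page["url"]) for page in pages]
    matching = [(s, url) for s, url in scored if s > 0]
    # Highest score first; ties are broken alphabetically by URL.
    matching.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in matching]
```

Raw term counts favor long pages, so a real ranking would probably normalize by document length or weight rare terms more heavily.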