Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Search Engine for Sri Lankan MPs
https://github.com/thamindur/ir-project
- Host: GitHub
- URL: https://github.com/thamindur/ir-project
- Owner: ThaminduR
- Created: 2021-10-10T12:55:34.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2021-11-25T05:59:01.000Z (about 3 years ago)
- Last Synced: 2024-10-30T01:02:32.258Z (2 months ago)
- Topics: crawler, elasticsearch, python, scraping, search-engine
- Language: Python
- Homepage:
- Size: 897 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Search Engine for Sri Lankan MPs.
A project carried out under the Data Mining and Information Retrieval Module.
This project contains four parts:
1. Data Scraping
2. Transliterate data into Sinhala
3. Building an index using ElasticSearch
4. Flask Application

## Data Scraping
- Data Source: https://www.parliament.lk/en/members-of-parliament/directory-of-members
- Missing data values were replaced by `N/A`.
- A single missing value in the date of birth field was filled manually.
- Data files are located in the `data/` directory, along with statistics about the missing values.
- Scraping scripts are located in the scrapy directory.
- Scraped data contains the following fields (a minimal spider sketch follows this list):
1. Name
2. Date of birth
3. Civil status
4. Religion
5. Party
6. Electoral district
7. Email
8. Served committees
9. Career
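
For illustration, here is a minimal sketch of what such a spider could look like. The spider name `pm` is taken from the crawl command in the setup section below; the CSS selectors and page structure are placeholders rather than the repo's actual ones.

```python
# Minimal sketch of a Scrapy spider for the MP directory. The spider name "pm"
# matches the crawl command in the setup section; the CSS selectors and page
# structure below are placeholders, not the repo's actual ones.
import scrapy


class MPSpider(scrapy.Spider):
    name = "pm"
    start_urls = [
        "https://www.parliament.lk/en/members-of-parliament/directory-of-members"
    ]

    def parse(self, response):
        # Follow each member's profile page (selector is a placeholder).
        for href in response.css("a.member-profile::attr(href)").getall():
            yield response.follow(href, callback=self.parse_member)

    def parse_member(self, response):
        def field(css):
            # Missing values fall back to "N/A", as described above.
            return response.css(css).get(default="N/A")

        yield {
            "name": field("h1.member-name::text"),
            "date_of_birth": field("li.dob::text"),
            "civil_status": field("li.civil-status::text"),
            "religion": field("li.religion::text"),
            "party": field("li.party::text"),
            "electoral_district": field("li.district::text"),
            "email": field("li.email::text"),
            "served_committees": field("li.committees::text"),
            "career": field("li.career::text"),
        }
```
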
## Translate data into Sinhala

- Scraped data was transliterated into Sinhala using the `mtranslate` pip package (`pip install mtranslate`); a short sketch follows this list.
- `N/A` values were replaced by `දත්ත නොමැත` ("No Data" in Sinhala).
- Values in the email field were kept as-is.
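
A short sketch of this step, assuming `mtranslate`'s `translate()` helper, is shown below; the function name and field handling follow the description above rather than the repo's actual `translate.py`.

```python
# Sketch of the transliteration step, assuming mtranslate's translate() helper.
# The field handling follows the description above; the repo's translate.py
# may differ.
from mtranslate import translate

NO_DATA_SI = "දත්ත නොමැත"  # "No Data" in Sinhala


def to_sinhala(record: dict) -> dict:
    translated = {}
    for key, value in record.items():
        if key == "email":
            # Email values are kept as-is.
            translated[key] = value
        elif value == "N/A":
            # Missing values are replaced by the Sinhala "No Data" marker.
            translated[key] = NO_DATA_SI
        else:
            translated[key] = translate(value, to_language="si", from_language="en")
    return translated
```
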
## Indexing using ElasticSearch

- The settings and mappings for the index are located in the `elasticsearch/mapping.json` file.
- Custom analyzers were introduced for both Sinhala and English languages.
- The ICU tokenizer is used for the Sinhala text, and the standard tokenizer with lowercasing is used for the English text.
- Several character mappings were also introduced at both index time and query time (the characters `.`, `'`, `"`, and `@` are mapped to a whitespace character).
- An `edge_ngram_filter` was also used in both the Sinhala and English analyzers.
- Indexing was done with all the fields in Sinhala, and with `name` and `electoral` additionally in English. A sketch of the analyzer settings is given below.
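
The block below is a simplified sketch of settings matching the description above; the index name `mps`, the field names `name_si`/`name_en`, and the n-gram sizes are illustrative assumptions, and the repo's actual `mapping.json` may differ.

```python
# Simplified sketch of the analyzer settings described above, not the repo's
# exact mapping.json. Assumes the analysis-icu plugin (for icu_tokenizer) and
# an elasticsearch-py 7.x client; index and field names are illustrative.
from elasticsearch import Elasticsearch

INDEX_BODY = {
    "settings": {
        "analysis": {
            "char_filter": {
                "punct_to_space": {
                    "type": "mapping",
                    # . ' " @ are mapped to a whitespace character
                    "mappings": [
                        ". => \\u0020",
                        "' => \\u0020",
                        "\" => \\u0020",
                        "@ => \\u0020",
                    ],
                }
            },
            "filter": {
                # n-gram sizes are assumed, not taken from the repo
                "edge_ngram_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15}
            },
            "analyzer": {
                "sinhala_analyzer": {
                    "type": "custom",
                    "char_filter": ["punct_to_space"],
                    "tokenizer": "icu_tokenizer",
                    "filter": ["edge_ngram_filter"],
                },
                "english_analyzer": {
                    "type": "custom",
                    "char_filter": ["punct_to_space"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_ngram_filter"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "name_si": {"type": "text", "analyzer": "sinhala_analyzer"},
            "name_en": {"type": "text", "analyzer": "english_analyzer"},
            # ...remaining Sinhala fields would use sinhala_analyzer
        }
    },
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="mps", body=INDEX_BODY)
```
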
## Flask Application

- A simple Flask application was created for searching. Retrieved data is displayed in a table.
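
A minimal sketch of such a search route is shown below; the index name `mps`, the `search.html` template, and the plain `multi_match` query are assumptions, and the repo's `app.py` additionally applies the query preprocessing described under Features.

```python
# Minimal sketch of the search route; the index name "mps", the search.html
# template, and the plain multi_match query are assumptions. The actual app.py
# also applies the query preprocessing described under Features.
from elasticsearch import Elasticsearch
from flask import Flask, render_template, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")


@app.route("/")
def search():
    query = request.args.get("q", "")
    hits = []
    if query:
        # 7.x-style call; newer clients accept query= instead of body=.
        resp = es.search(
            index="mps",
            body={"query": {"multi_match": {"query": query, "fields": ["*"]}}},
        )
        hits = [hit["_source"] for hit in resp["hits"]["hits"]]
    # The template renders the retrieved documents as an HTML table.
    return render_template("search.html", query=query, results=hits)


if __name__ == "__main__":
    app.run(debug=True)
```
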
# Features
- Supports searching by `name`, `date of birth`, `civil status`, `religion`, `party`, `electoral district`, `email`, `served committees`, and `career`.
- Supports query boosting: fields relevant to the query are identified using synonyms, and the identified fields are boosted. The [sinling tokenizer](https://github.com/ysenarath/sinling) is used for tokenizing and word splitting, and a set of predefined lists is maintained to identify the context of the query (see the sketch under Query Preprocessing below).
- Supports bilingual search for the `name` and `electoral` fields. Code-mixed queries are also supported.

# Query Preprocessing
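
A minimal sketch of the idea described under Features — tokenize the query with [sinling](https://github.com/ysenarath/sinling), match the tokens against predefined synonym lists, and boost the matched fields — is given below. The lists, field names, and boost factor are illustrative placeholders; the full logic lives in `flask/app.py`.

```python
# Sketch of synonym-based field boosting; the synonym lists, field names, and
# boost factor are placeholders, not the repo's predefined lists.
from sinling import SinhalaTokenizer

# Example lists mapping query words to index fields (illustrative only).
FIELD_SYNONYMS = {
    "religion": {"religion", "ආගම"},
    "party": {"party", "පක්ෂය"},
    "electoral": {"district", "දිස්ත්‍රික්කය"},
}

tokenizer = SinhalaTokenizer()


def boosted_fields(query: str) -> list[str]:
    """Return a multi_match field list, boosting fields whose synonyms
    appear among the query tokens."""
    tokens = set(tokenizer.tokenize(query))
    fields = []
    for field, synonyms in FIELD_SYNONYMS.items():
        if tokens & synonyms:
            fields.append(f"{field}^3")  # boost the matched field (factor assumed)
        else:
            fields.append(field)
    return fields
```

The returned list can then be passed as the `fields` parameter of a `multi_match` query in the Flask app.
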
# Project Structure
- `elasticsearch` - Contains the settings and mapping JSON for index creation and a Python script for updating the index with the data.
- `flask` - Contains the code for the Flask app; `app.py` contains the query processing logic.
- `images` - Images used in README.md.
- `irpScrape` - Contains the scrapy scripts and spiders, along with the scraped and translated data. The `stats.josn` file in the data folder contains information about missing values in the data.
# Setting Up and Running the Project

- Install the required packages using `requirements.txt`.

1. Data Scraping - In the irpScrape folder, run `scrapy crawl pm` to crawl the data from the parliament website.
2. Translation - Run the `translate.py` script in the translate-scripts folder.
3. Creating Index - Start Elasticsearch and create an index using the `mapping.json` given in the elasticsearch folder.
4. Add Data to Index - Run the `index_dat.py` script to add data into the index.
5. Start the Flask App - Run `python run.py` inside the flask folder to start the Flask app.

Note: The data crawled from the parliament website is used for educational purposes only.