{"id":15956725,"url":"https://github.com/thamindur/ir-project","last_synced_at":"2026-04-19T17:05:22.502Z","repository":{"id":122468332,"uuid":"415585815","full_name":"ThaminduR/ir-project","owner":"ThaminduR","description":"Search Engine for Sri Lankan MPs","archived":false,"fork":false,"pushed_at":"2021-11-25T05:59:01.000Z","size":919,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-09T20:15:51.155Z","etag":null,"topics":["crawler","elasticsearch","python","scraping","search-engine"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ThaminduR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-10T12:55:34.000Z","updated_at":"2021-11-25T05:59:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"d6322224-31e8-4135-9c5b-d807d8f4cf2d","html_url":"https://github.com/ThaminduR/ir-project","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThaminduR%2Fir-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThaminduR%2Fir-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThaminduR%2Fir-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThaminduR%2Fir-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ThaminduR","download_url":"https:/
/codeload.github.com/ThaminduR/ir-project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247152078,"owners_count":20892435,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","elasticsearch","python","scraping","search-engine"],"created_at":"2024-10-07T13:35:23.852Z","updated_at":"2026-04-19T17:05:22.468Z","avatar_url":"https://github.com/ThaminduR.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Search Engine for Sri Lankan MPs.\n\nA project carried out under the Data Mining and Information Retrieval Module.\n\nThis project contains four parts:\n\n1. Data Scraping\n2. Transliterate data into Sinhala\n3. Building an index using ElasticSearch\n4. Flask Application\n\n## Data Scraping\n- Data Source: https://www.parliament.lk/en/members-of-parliament/directory-of-members\n- Missing data values were replaced by `N/A`.\n- A single missing value in the date of birth field was filled manually.\n- Data files are located in the data/ directory, along with the stats on missing information.\n- Scraping scripts are located in the scrapy directory.\n- Scraped data contains the following fields:\n    1. Name\n    2. Date of birth\n    3. Civil status\n    4. Religion\n    5. Party\n    6. Electoral district\n    7. Email\n    8. Served committees\n    9. 
Career\n\n## Translate data into Sinhala\n\n- Scraped data was transliterated into Sinhala using the `mtranslate` pip package (`pip install mtranslate`).\n- `N/A` values were replaced by `දත්ත නොමැත` (\"No Data\" in Sinhala).\n- Values in the email section were kept as they are.\n\n## Indexing using ElasticSearch\n\n- The settings and mapping for the created index are located in the elasticsearch/mapping.json file.\n- Custom analyzers were introduced for both the Sinhala and English languages.\n- The icu tokenizer is used for the Sinhala text and the standard, lowercase tokenizer is used for the English text.\n- Several character mappings were also introduced at both indexing time and query time. (The . ' \" @ characters were mapped to a whitespace character.)\n- The edge_ngram_filter was also used in both the Sinhala and English analyzers.\n- Indexing was done with all the fields in Sinhala and `name` and `electoral` in English.\n\n## Flask Application\n\n- A simple Flask application was created for searching. Retrieved data is displayed in a table.\n\n\u003cimg src=\"images\\ui.png\"\u003e\n\n# Features\n\n- Supports searching by `name`, `date of birth`, `civil status`, `religion`, `party`, `electoral district`, `email`, `served committees`, `career`.\n- Supports query boosting by identifying the specific fields related to a query using synonyms and applying boosting to the identified fields. This uses the [sinling tokenizer](https://github.com/ysenarath/sinling) for tokenizing and word splitting. A set of predefined lists is maintained to identify the context of the query.\n- Supports bilingual search for the `name` and `electoral` fields. 
Code-mixed queries are also supported.\n\n# Query Preprocessing\n\n\u003cimg src=\"images\\flow.png\" width=\"500\"\u003e\n\n\n# Project Structure\n\n- elasticsearch - Contains the settings and mapping JSON for index creation and a Python script for updating the index with the data.\n- flask - Contains the code for the Flask app; app.py contains the query processing logic.\n- images - Images used in README.md.\n- irpScrape - Contains Scrapy scripts, spiders, scraped data, and translated data. The `stats.josn` file in the data folder contains information about missing values in the data.\n\n# Setting Up and Running the Project\n\n- Install the required packages using requirements.txt.\n\n1. Data Scraping - In the irpScrape folder, run `scrapy crawl pm` to crawl the data from the parliament website.\n2. Translation - Run the `translate.py` script in the translate-scripts folder.\n3. Creating Index - Start Elasticsearch and create an index using the `mapping.json` given in the elasticsearch folder.\n4. Add Data to Index - Run the `index_dat.py` script to add data into the index.\n5. Start the Flask App - Run `python run.py` inside the flask folder to start the Flask app.\n\n\nNote: The data crawled from the parliament website is used for educational purposes only.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthamindur%2Fir-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthamindur%2Fir-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthamindur%2Fir-project/lists"}