https://github.com/pmadruga/ds-jobindex
Machine learning techniques (NLP) applied to the jobindex.dk dataset
bert deep-learning machine-learning natural-language-processing nlp python pytorch tfidf-vectorizer transformers
- Host: GitHub
- URL: https://github.com/pmadruga/ds-jobindex
- Owner: pmadruga
- License: mit
- Created: 2020-07-20T10:54:08.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-02-16T02:05:13.000Z (almost 2 years ago)
- Last Synced: 2024-11-01T03:11:46.778Z (about 2 months ago)
- Topics: bert, deep-learning, machine-learning, natural-language-processing, nlp, python, pytorch, tfidf-vectorizer, transformers
- Language: Jupyter Notebook
- Homepage:
- Size: 1.54 MB
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Deep Learning based search engine - Jobindex.DK case
## Abstract
After scraping 4.2 million job postings from jobindex.dk, Denmark's biggest job portal, a set of Natural Language Processing techniques and Machine Learning models was applied to the data. Specifically, [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [BERT](https://github.com/google-research/bert) models were applied on top of a direct search (no AI). The goal was to improve search results on a non-English data set, which was achieved, especially with [BERT](https://github.com/google-research/bert).
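
To make the TF-IDF part of the approach concrete, here is a minimal sketch of ranking job postings against a query by TF-IDF cosine similarity. It is not the repository's implementation: the toy postings, the function names (`tokenize`, `build_tfidf`, `vectorize`, `cosine`, `search`) and the smoothed IDF formula are all assumptions chosen for illustration, and it uses only the standard library rather than the vectorizer the project may rely on.

```python
import math
import re
from collections import Counter

# Toy Danish postings, invented for illustration (not taken from the dataset).
DOCS = [
    "Erfaren softwareudvikler søges til backend i Python",
    "Sygeplejerske til akutafdelingen på hospital",
    "Python udvikler med interesse for machine learning",
]


def tokenize(text):
    # Lowercase and split on word characters; \w also matches æ, ø and å.
    return re.findall(r"\w+", text.lower())


def vectorize(tokens, idf):
    # Term frequency times IDF, ignoring terms unseen in the corpus.
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def build_tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many postings each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF (an assumed variant; other formulas are equally valid).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors):
    # Return (score, document index) pairs, best match first.
    q = vectorize(tokenize(query), idf)
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)


IDF, VECTORS = build_tfidf(DOCS)
RANKED = search("python udvikler", IDF, VECTORS)
for score, idx in RANKED:
    print(f"{score:.3f}  {DOCS[idx]}")
```

In this toy corpus the third posting ranks first because it contains both query terms, while the nursing posting shares no terms with the query and scores zero. A BERT-based search would follow the same ranking idea but compare dense sentence embeddings instead of sparse term weights.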
## Structure
The search engine is structured into two parts:
1. Python script: preprocessing of text, generation of embeddings, distance calculation and file exporting.
2. Jupyter notebook: Determine BERT distances and present results.

## Run
To generate a preprocessed dataset:
```
python src --process '/data/interim/jobindex_cropped_bigger.csv'
```

To get recommendations (a preprocessed dataset has to be generated beforehand), run the `search_engine_results.ipynb` notebook.
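
Invoking `python src` makes Python execute the `__main__.py` file inside the `src` directory. As a rough sketch of how such an entry point could accept the `--process` flag used in the command above, the following uses `argparse`. The function names (`build_parser`, `main`) and the return value are hypothetical and are not taken from the repository; the actual preprocessing step is left as a comment.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the --process flag comes from the README.
    parser = argparse.ArgumentParser(
        prog="src", description="Preprocess a jobindex CSV export."
    )
    parser.add_argument(
        "--process",
        metavar="CSV_PATH",
        help="path to the raw CSV file to preprocess",
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.process is None:
        # Nothing requested; a real script might print the help text here.
        return None
    # Assumption: the real project would run text preprocessing, embedding
    # generation and file export at this point. This sketch only echoes the path.
    return args.process


if __name__ == "__main__":
    result = main()
    if result is not None:
        print(f"Would preprocess: {result}")
```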
## Scripts
To lint code, run:
```
./scripts/lint-code.sh
```

To start notebooks, run:
```
./scripts/start_notebooks.sh
```