https://github.com/pmadruga/ds-jobindex
Machine learning techniques (NLP) applied to the jobindex.dk dataset
bert deep-learning machine-learning natural-language-processing nlp python pytorch tfidf-vectorizer transformers
- Host: GitHub
- URL: https://github.com/pmadruga/ds-jobindex
- Owner: pmadruga
- License: mit
- Created: 2020-07-20T10:54:08.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2023-02-16T02:05:13.000Z (over 3 years ago)
- Last Synced: 2025-10-20T04:13:05.319Z (8 months ago)
- Topics: bert, deep-learning, machine-learning, natural-language-processing, nlp, python, pytorch, tfidf-vectorizer, transformers
- Language: Jupyter Notebook
- Homepage:
- Size: 1.54 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE
# Deep Learning based search engine - Jobindex.DK case
## Abstract
A set of Natural Language Processing techniques and Machine Learning models was applied to 4.2 million job postings scraped from jobindex.dk, Denmark's biggest job portal. Specifically, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [BERT](https://github.com/google-research/bert) models were applied on top of a direct search (no AI).
The goal was to improve search results on a non-English data set, which was achieved, especially with [BERT](https://github.com/google-research/bert).
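To illustrate the difference between a direct search and a TF-IDF ranking, here is a minimal sketch in plain Python. It is not the code from this repository: the toy job titles in `DOCS`, the function names, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy Danish job titles, invented for illustration (not taken from the dataset).
DOCS = [
    "erfaren python udvikler til data team",
    "sygeplejerske til akutafdeling",
    "data scientist med python erfaring",
    "udvikler til java platform",
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def direct_search(query, docs):
    """Direct search (no AI): indices of documents containing the query as a substring."""
    q = query.lower()
    return [i for i, doc in enumerate(docs) if q in doc.lower()]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch): always positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_search(query, docs, top_k=3):
    """Rank documents by cosine similarity to the query; drop non-matching documents."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(tokenize(query))
    # Query terms that never occur in the corpus are ignored.
    query_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [i for _, i in ranked[:top_k]]


if __name__ == "__main__":
    # Word order is swapped relative to the first document.
    print(direct_search("udvikler python", DOCS))
    print(tfidf_search("udvikler python", DOCS))
```

With the query words in a different order than in the document, the substring match finds nothing, while the TF-IDF ranking still places the first title on top because it scores on shared terms rather than on an exact phrase.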
## Structure
The search engine is structured into two parts:
1. Python script: preprocesses the text, generates embeddings, calculates distances and exports the results to file.
2. Jupyter notebook: determines BERT distances and presents the results.
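The steps listed above (preprocessing, distance calculation, export) can be sketched roughly as follows. This is an illustrative outline only, not the repository's implementation: the function names are hypothetical, the cleaning rules are an assumption, and the three-dimensional vectors are hand-written stand-ins for real BERT embeddings, which would come from a model and have hundreds of dimensions.

```python
import csv
import io
import math
import re

# Hand-written toy vectors standing in for BERT embeddings (assumption of this sketch).
QUERY_VEC = [1.0, 0.0, 0.0]
DOC_VECS = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]


def preprocess(text):
    """Lowercase, keep letters (including Danish æ, ø, å), digits and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9æøå\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def rank_by_distance(query_vec, doc_vecs):
    """Return (doc_index, distance) pairs, closest document first."""
    pairs = [(i, cosine_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(pairs, key=lambda p: p[1])


def export_csv(rows, header):
    """Serialize rows to CSV text; a real script would write to a file instead."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(preprocess("Senior Python-udvikler (København)!"))
    ranked = rank_by_distance(QUERY_VEC, DOC_VECS)
    print(export_csv(ranked, ["doc_index", "distance"]))
```

Cosine distance is used here because it compares the direction of two embedding vectors and ignores their length; whether the repository uses this or another metric is not stated in this README.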
## Run
To generate a preprocessed dataset:
```
python src --process '/data/interim/jobindex_cropped_bigger.csv'
```
To get recommendations (a preprocessed dataset has to be generated beforehand), run the `search_engine_results.ipynb` notebook.
## Scripts
To lint code, run:
```
./scripts/lint-code.sh
```
To start notebooks, run:
```
./scripts/start_notebooks.sh
```