Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/pmadruga/ds-jobindex

Machine learning techniques (NLP) applied to the jobindex.dk dataset

Host: GitHub
URL: https://github.com/pmadruga/ds-jobindex
Owner: pmadruga
License: mit
Created: 2020-07-20T10:54:08.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2023-02-16T02:05:13.000Z (almost 2 years ago)
Last Synced: 2024-11-01T03:11:46.778Z (about 2 months ago)
Topics: bert, deep-learning, machine-learning, natural-language-processing, nlp, python, pytorch, tfidf-vectorizer, transformers
Language: Jupyter Notebook
Homepage:
Size: 1.54 MB
Stars: 1
Watchers: 0
Forks: 1
Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Deep Learning based search engine - Jobindex.DK case

## Abstract
After scraping 4.2 million job postings from jobindex.dk, Denmark's biggest job portal, a set of Natural Language Processing techniques and Machine Learning models was applied to the data. Specifically, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [BERT](https://github.com/google-research/bert) models were applied on top of a direct (non-AI) search.

The goal was to improve search results on a non-English dataset. This was achieved, especially with [BERT](https://github.com/google-research/bert).
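
To illustrate the difference between a direct search and a TF-IDF ranking, here is a minimal, standard-library-only sketch. It is not the code used in this repository: the function names (`direct_search`, `tfidf_search`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace (assumption: no punctuation handling).
    return text.lower().split()


def direct_search(query, docs):
    # Baseline without AI: keep documents that contain every query term verbatim.
    terms = tokenize(query)
    return [i for i, doc in enumerate(docs)
            if all(term in tokenize(doc) for term in terms)]


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * idf[term]
                        for term, count in counts.items()})
    return vectors, idf


def cosine_similarity(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_search(query, docs):
    # Rank documents by similarity to the query; drop documents with no overlap.
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(tokenize(query))
    total = sum(counts.values())
    query_vec = {term: (count / total) * idf[term]
                 for term, count in counts.items() if term in idf}
    scored = [(cosine_similarity(query_vec, vec), i)
              for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]
```

With a query whose terms never appear together in one posting, `direct_search` returns nothing, while `tfidf_search` still ranks the partial matches. That gap is what motivates ranking-based approaches over exact matching.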

## Structure

The search engine is structured into two parts:

1. Python script: preprocesses the text, generates embeddings, calculates distances, and exports the results to file.
2. Jupyter notebook: determines BERT distances and presents the results.
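
The stages listed above (preprocessing, embedding, distance calculation, export) can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: `embed` is a hypothetical stand-in that uses a hashed bag-of-words vector instead of a real BERT encoder, and the function names, the `DIM` constant, and the CSV layout are invented for the example.

```python
import csv
import math
import re
import zlib

DIM = 64  # size of the toy embedding (assumption; BERT vectors are much larger)


def preprocess(text):
    # Lowercase, then replace anything that is not a digit, a letter
    # (including Danish æ, ø, å), or a space; finally collapse whitespace.
    text = re.sub(r"[^0-9a-zæøå ]+", " ", text.lower())
    return " ".join(text.split())


def embed(text):
    # Hypothetical stand-in for a BERT encoder: hashed bag-of-words,
    # normalised to unit length so that a dot product equals cosine similarity.
    vec = [0.0] * DIM
    for token in text.split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine_distance(a, b):
    # 1 - cosine similarity; inputs are unit length (or all zeros).
    return 1.0 - sum(x * y for x, y in zip(a, b))


def export_distances(query, docs, out):
    # Write one CSV row per document to a file-like object, closest first.
    query_vec = embed(preprocess(query))
    rows = sorted((cosine_distance(query_vec, embed(preprocess(doc))), doc)
                  for doc in docs)
    writer = csv.writer(out)
    writer.writerow(["distance", "text"])
    for distance, doc in rows:
        writer.writerow([f"{distance:.4f}", doc])
    return rows
```

Swapping `embed` for a real sentence encoder would leave the other stages unchanged, which is one reason to keep embedding generation separate from the distance calculation.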

## Run

To generate a preprocessed dataset:

```
python src --process '/data/interim/jobindex_cropped_bigger.csv'
```

To get recommendations (a preprocessed dataset has to be generated beforehand), run the `search_engine_results.ipynb` notebook.

## Scripts

To lint the code, run:

```
./scripts/lint-code.sh
```

To start the notebooks, run:

```
./scripts/start_notebooks.sh
```