# Electronic Arts (EA) Assignment for: NLP Associate Data Scientist
__The goal of the test is to work with a multi-language dataset, in order to demonstrate your Natural Language Processing and Machine Translation abilities.__ The Core Data Scientist and Storytelling attributes will also be evaluated during your resolution of the case.
`About the Data`:
The dataset you will be using is a multilingual, multi-context set of documents, which is part of the one described in the following paper: _Ferrero, Jérémy & Agnès, Frédéric & Besacier, Laurent & Schwab, Didier. (2016). A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection._

Please note the dataset is divided into contexts/categories (Conference_papers, Wikipedia, ...) and into languages, in the same way the folders are structured.

* `Objective 1`: Create a document categorization classifier for the different contexts of the documents. You will be addressing this objective at context level, regardless of the language the documents are written in.

  Tasks/Requirements:
  * EDA: Exploratory data analysis of the Dataset.
  * Reproducibility/Methodology: The analysis you provide must be reproducible. Your analysis should follow the Data Science methodology.
  * Classification model: The deliverable will include a model which will receive a document as input and will output its class, which will be the context of that document.

* `Objective 2`: Perform a topic model analysis on the provided documents. You will discover the hidden topics and describe them.

  Tasks:
  * Profile the different documents and topics.
  * Provide a visualization of the profiles.

---
## Table of Contents
* [Roadmap](#roadmap)
* [Repository Content](#repository-content)
* [Exploratory Data Analysis](#exploratory-data-analysis)
* [Text Preprocessing](#text-preprocessing)
* [Text Classification](#text-classification)
* [Topic Modelling](#topic-modelling)
* [Conclusions](#conclusions)
* [Future Work](#future-work)
* [References](#references)
* [Personal Opinion](#personal-opinion)
* [EA's Expected Way to Tackle](#eas-expected-way-to-tackle)

---
## Roadmap
Before proceeding with the explanation and conclusions of every NLP task researched/developed for the project, we will start by specifying the roadmap, from the start date on Friday, July 31 until the end date on Tuesday, August 4.

!["NLP Roadmap"](imgs/roadmap.png)
---
## Repository Content
```
.
├── documents_challenge/   # Dataset of Multilingual Multi-Context documents
├── research/              # Jupyter Notebooks and Reports of the project's research
├── slides/                # Jupyter Slides for presenting the project
├── imgs/                  # Contains some image resources
├── 202007TestADS.pdf      # Electronic Arts (EA) Associate Data Scientist Assignment PDF file
├── LICENSE                # MIT License so as to release the code open-source
├── README.md              # Detailed README.md so as to explain the project
└── requirements.txt       # Requirements to reproduce the Jupyter Notebooks
```

A description of the dataset and of how it was built can be found in the following paper:
[_A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab. In the 10th edition of the Language Resources and Evaluation Conference (LREC 2016)_](https://www.researchgate.net/publication/301861882_A_Multilingual_Multi-Style_and_Multi-Granularity_Dataset_for_Cross-Language_Textual_Similarity_Detection)
---
## Exploratory Data Analysis
Before starting any NLP project, we first need to explore and understand the data we have, so as to decide how we are going to tackle the problem we are facing.

We can see the dataset statistics in the GitHub repository [FerreroJeremy/Cross-Language-Dataset](https://github.com/FerreroJeremy/Cross-Language-Dataset):
Sub-corpus | Alignment | Authors | Translations | Translators | Alteration | NE (%)
--- | --- | --- | --- | --- | --- | ---
__Wikipedia__ | Comparable | Anyone | - | - | Noise | 8.37
__PAN-PC-11__ | Parallel | Professional authors | Human | Professional | Yes | 3.24
__APR (Amazon Product Reviews)__ | Parallel | Anyone | Machine | Google Translate | No | 6.04
__Conference papers__ | Comparable | Computer scientists | Human | Computer scientists | Noise | 9.36

During the EDA it is common to plot diverse features so as to get some insights into how the data is structured across the documents, in order to find the proper way to tackle the problem and the upcoming NLP steps. Some visualizations are provided below, with some interesting data that will be explained later:

!["EDA Plots"](imgs/eda-plots.png)
In this case, we plotted the distribution of the documents per context and language, as well as the median length of each single document per context. It shows that Wikipedia is the most populated context and French the most represented language. Also, both the APR and the Conference papers documents are the ones with the fewest characters, while the PAN11 texts fall between the Wikipedia texts and the other contexts.
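For reference, a minimal sketch of how these counts and median lengths could be computed with pandas, assuming the `documents_challenge/` folder is organised as `<context>/<language>/<file>` (an assumption based on the dataset description; the notebooks contain the actual loading code):

```python
from pathlib import Path

import pandas as pd

# Assumption: documents_challenge/ mirrors the <context>/<language>/<file> layout
# described above; adjust the globbing if the actual folder structure differs.
records = []
for path in Path("documents_challenge").glob("*/*/*"):
    if path.is_file():
        context, language = path.parts[1], path.parts[2]
        records.append({"context": context, "language": language, "text": path.read_text(errors="ignore")})
df = pd.DataFrame(records)

# Number of documents per context and language, and median document length (in characters) per context
print(df.groupby(["context", "language"]).size().unstack(fill_value=0))
print(df.assign(length=df["text"].str.len()).groupby("context")["length"].median())
```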
__Reference__: [Data Exploration](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/01%20-%20Data%20Exploration.pdf)
---
## Text Preprocessing
__When it comes to NLP, data preprocessing is one of, if not the most, important tasks__, since it is where we add value to the raw data.

For this project, since we are facing a Multi-Lingual Multi-Context dataset, we need to develop a custom preprocessor which preprocesses the texts no matter the language (English, French or Spanish), and which also includes some more specific preprocessing related to the different contexts.

The steps towards a proper preprocessing are defined as follows:
1. __Clean Tabs and Line Breaks__: line breaks and tabs are common in text, so we just replace them with a space, making sure that removing them does not make different words run together.
2. __Convert to Unidecode__: to unify all the data, convert every str to unidecode, which replaces accented vowels with their regular unaccented form, etc.
3. __Substitute Regular Expressions__: from a given collection of regular expressions, every match between a regular expression and any group in the text is replaced by a space and thus removed.
4. __Lower Case__: unify all the str to lower case, so that the same words with different capitalizations are identified as the same word, since all the characters will match.
5. __Split by Apostrophes__: since both English and French use the apostrophe to abbreviate text, words are split at the apostrophe, if found, so as to obtain two separate words from the joined one.
6. __Remove Small Words__: a threshold has been set so as to remove words with fewer than 3 characters, since those words do not provide any useful information to the models we need to train.
7. __Remove Stopwords__: stopwords from a list of default stopwords for every language should be removed; some additional stopwords manually identified per language and context have also been included, so as to provide a complete, specific stopword removal.
8. __Remove Extra Spaces__: since every regular expression match and unknown character has been replaced by a space, multiple spaces are now substituted by a single space, so as to return a str which is indeed a space-separated list of tokens.

To this end, we have created a `CustomPreProcessor`, which is indeed a Python interface to preprocess all the raw data.

```python
class CustomPreProcessor(object):
    """
    Custom PreProcessor

    Preprocesses the introduced raw text to transform it into clean text.
    This preprocessing pipe is regex based.

    >>> from apinlp.nlp.preprocessing import CustomPreProcessor
    >>> preprocessor = CustomPreProcessor()
    >>> print(preprocessor._preprocess("Visit us at https://www.ea.com/"))
    "visit"
    """

    def __init__(self, strip_accents=True):
        self.strip_accents = strip_accents
        # BASE_PATTERNS, SPACES_PATTERN and STOPWORDS are module-level constants
        # (see the illustrative sketch below the class for what they could look like)
        self.patterns = BASE_PATTERNS
        self.additional_patterns = (SPACES_PATTERN,)
        self.stopwords = STOPWORDS

    def _preprocess(self, text):
        """Cleans and applies a preprocessing layer to raw text"""
        # 1. Clean tabs and line breaks
        text = text.replace('\t', ' ').replace('\n', ' ')
        # 2. Convert to unidecode (strip accents)
        if self.strip_accents:
            text = unidecode(text)
        # 3. Substitute regular expressions with spaces
        for pattern in self.patterns:
            text = pattern.sub(' ', text)
        # 4. Lower case
        text = text.strip().lower()
        # 5. Split by apostrophes
        text = text.replace("'", " ")
        # 6. Remove words with fewer than 3 characters
        text = [word for word in text.split(' ') if len(word) > 2]
        # 7. Remove stopwords
        for word in self.stopwords:
            text = list(filter((word.lower()).__ne__, text))
        text = ' '.join(text)
        # 8. Remove extra spaces
        for pattern in self.additional_patterns:
            text = pattern.sub(' ', text)
        return text
```
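Note that `unidecode`, `BASE_PATTERNS`, `SPACES_PATTERN` and `STOPWORDS` are imported/defined at module level in `apinlp.nlp.preprocessing` and are not shown in the snippet above. Purely as an illustration of what they could look like (the actual patterns and stopword lists in the repository may differ):

```python
import re

from unidecode import unidecode  # strips accents, e.g. "café" -> "cafe"

# Hypothetical examples of the module-level constants used by CustomPreProcessor;
# the real definitions live in apinlp.nlp.preprocessing and may differ.
BASE_PATTERNS = (
    re.compile(r"https?://\S+"),  # URLs
    re.compile(r"\d+"),           # digits
    re.compile(r"[^\w\s']"),      # punctuation and other symbols (apostrophes are handled later)
)
SPACES_PATTERN = re.compile(r"\s{2,}")  # two or more consecutive whitespaces

# Default stopwords for the three languages plus some manually identified extras
STOPWORDS = ["the", "and", "for", "les", "des", "une", "los", "las", "por"]
```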
Finally, we can see an example of how the WordClouds improved with the preprocessed data compared to the raw one.

!["WordClouds Comparison"](imgs/wordcloud-comparison.png)
__Reference__: [Data Preprocessing](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/02%20-%20Data%20Preprocessing.pdf)
---
## Text Classification
We are facing an NLP Text Classification problem, which consists of classifying multilingual data into its context regardless of the language in which the text is written.

First of all, we need to define a vectorizer to transform the input text (already preprocessed) into a vector, and then train a model fitted on those vectors. In this case we will be using the TF-IDF Vectorizer, since it is the most suitable for this problem: it weights the number of occurrences of each word inside a document against the number of occurrences of that word among all the other documents, so as to identify how relevant a word appearing in a document is for later predicting the context in which that concrete piece of text should be classified.

Once the vectorization is completed, we just need to decide which classification model we are going to use, depending on both the scope and the model's requirements/limitations. In this case, since we decided to test some different classification models, we simply tested them over random stratified folds to see which of them performed better, as sketched below.

!["Text Classification Models"](imgs/text-classification-models.png)
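For reference, a minimal sketch (not the exact notebook code) of how candidate models could be compared over random stratified folds, assuming `texts` holds the preprocessed documents and `contexts` their labels; the candidate models below are just illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Illustrative set of candidate classifiers to compare
candidates = {
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
}

# Random stratified folds so that every fold keeps the context distribution
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in candidates.items():
    pipeline = Pipeline([("vect", TfidfVectorizer(min_df=5)), ("clf", clf)])
    scores = cross_val_score(pipeline, texts, contexts, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```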
After training the different classification models over random stratified data shuffling folds, we decided to proceed with the `LinearSVC` model, since it seemed to be the most consistent one in both time and accuracy. The resulting Pipeline looks as follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5)),
    ('clf', LinearSVC())
])
```

__References__:
* [Text Classification Model](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/03%20-%20Text%20Classification%20Model.pdf)
* [Text Classification Model Testing](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/04%20-%20Text%20Classification%20Model%20Testing.pdf)

---
## Topic Modelling
In this concrete case, we will be using the preprocessed data to fit a Topic Modelling algorithm, in order to discover the inner insights of the data and detect the hidden topics, so as to get a deeper understanding of what the data is about and into which topics it can be separated.

NLP Topic Modelling is a relevant part of the analysis, since it allows us to gain more insights about the dataset we have; but since it is unsupervised, it requires us to tune the parameters until we can point out useful conclusions which make sense for the given dataset.

So, we used the LDA (Latent Dirichlet Allocation) algorithm to identify the hidden topics in the dataset. As a use case, we started the Topic Modelling with just the English texts from Wikipedia, so as to test whether it worked as expected and also to evaluate the results on one of the most populated contexts.

!["Topic Modelling"](imgs/topic-modelling.png)
As we can see above, after a lot of tuning, five topics were clearly identified, so we tried to establish a relationship between the hidden topics and real topics, such as Sports, from the top terms present in those topics. The identified topics in the image above are (in ascending order by topic ID): Politics/History, Music/Movies/Entertainment, Industry/Research/Chemistry, Sports/Games and Technology/Software.
Topic Modelling has been applied and analysed for every possible combination of context and language, and it has been deeply analysed in the Jupyter Notebooks.

__References__:
* [Topic Modelling](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/05%20-%20Topic%20Modelling.html)
* [Topic Modelling Analysis](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/06%20-%20Topic%20Modelling%20Analysis.html)

---
## Conclusions
Both objectives have been successfully completed and their respective reports have been generated, tackling the problem as a Data Scientist should, including detailed Story Telling on each research part developed. In addition to the defined objectives, a detailed data exploration analysis and text preprocessing have been researched/developed too, since that is probably the most relevant part of a NLP Data Scientist's work while tackling a NLP problem, as it is what adds value to the raw data.

* `Objective 1`: the created model has been fit with 80% of the documents from every context and language and tested with the remaining 20% of the data, with balanced contexts and languages too, achieving an accuracy of up to 98% on the validation set. This model has also been dumped into a JOBLIB file so that it can be tested over unseen data (see the sketch after this list).
* `Objective 2`: the topic modelling problem has been broken down into a topic modelling per context and language, so as to get more insights and analyse the hidden topics that can be found in each collection of documents, with pretty satisfactory results evaluated in a supervised way.
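A minimal sketch of that `Objective 1` workflow, assuming `texts` and `contexts` hold the preprocessed documents and their labels (only contexts are stratified here for simplicity; the notebooks also balance languages):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stratified 80/20 split so that the contexts stay balanced across both sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, contexts, test_size=0.2, stratify=contexts, random_state=42
)

pipeline = Pipeline([("vect", TfidfVectorizer(min_df=5)), ("clf", LinearSVC())])
pipeline.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))

# Dump the fitted pipeline so it can later be tested over unseen data
joblib.dump(pipeline, "text-classification-pipeline.joblib")
```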
To sum up, even though the project tasks have been achieved and some extra points have been made, there is still much work ahead, so the Future Work is defined later in this document.

__Reference__: [Conclusions & Future Work](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/07%20-%20Conclusions%20%26%20Future%20Work.pdf)
---
## Future Work
As Future Work, the main line of research should focus on developing a consistent Machine Translation model in order to translate text from French and Spanish into English, which would indeed improve the results, even though they are already pretty accurate.

Since in the first EA interview Francisco Martínez (EA Talent Coordinator) spoke about EA's project related to Machine Translation, it would make sense to proceed with the project by designing a consistent Machine Translation model, so as to test its efficiency on this problem.

Another Future Work line of research should be the design of Deep Learning models, maybe in TensorFlow or PyTorch (usually more suitable for NLP), since we are presenting a simple use case along this project, but reality is a bit more complex; so tackling the problem using Deep Learning models should improve the model's performance when the input data is bigger and more contexts and languages are provided.

Finally, multilingual word embeddings should be used to improve the model's performance whatever the input data is: we could use the word embeddings to "translate" (get the closest word embedding of) every word in Spanish or French into English, so that the problem is still tackled as a Multi-Lingual input one, but for the model it would just be a single language. Also, when deploying the model into a production environment, a reliable layer of language detection should be applied, so as to either apply the word embeddings if the text is written in French or Spanish, or discard the text if it is neither English, Spanish nor French.
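A minimal sketch of such a language-detection layer, here using the `langdetect` package purely as an example (the choice of library for a production environment is an open question):

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make language detection deterministic

SUPPORTED_LANGUAGES = {"en", "es", "fr"}


def route_document(text):
    """Return the detected language if it is supported, otherwise None (discard the text)."""
    language = detect(text)
    return language if language in SUPPORTED_LANGUAGES else None
```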
__Reference__: [Conclusions & Future Work](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/07%20-%20Conclusions%20%26%20Future%20Work.pdf)
---
## References
1. [_A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab. In the 10th edition of the Language Resources and Evaluation Conference (LREC 2016)_](https://www.researchgate.net/publication/301861882_A_Multilingual_Multi-Style_and_Multi-Granularity_Dataset_for_Cross-Language_Textual_Similarity_Detection)
2. [_Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. In the ICLR 2018 Computation and Language (cs.CL)_](https://arxiv.org/pdf/1710.04087.pdf)
3. [_Exploiting similarities among languages for machine translation. Tomas Mikolov, Quoc V. Le and Ilya Sutskever. In the Computation and Language (cs.CL)_](https://arxiv.org/abs/1309.4168)
4. [_Language-specific models in multilingual topic tracking. Leah S. Larkey, Fangfang Feng, Margaret Connell and Victor Lavrenko, In the SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval_](https://dl.acm.org/doi/abs/10.1145/1008992.1009061)
---
## Personal Opinion
This assignment was both rewarding and demanding: the time was very limited and the multi-lingual problem was not my area of expertise, so I had to do some extra research, which was indeed profitable as I gained new knowledge on that topic. Anyway, both the HR and the Technical Team were very nice during the whole process, and the feedback on the assignment was far from bad, as there is a lot of work in here.
__So feel free to use this repository as a sample NLP assignment template, since this is the format that the companies expect from a Data Scientist, Machine Learning Engineer, etc.__
P.S.: I left the hiring process since I received a job opportunity that was a better fit for me, so I stepped out before the last interview.
---
## EA's Expected Way to Tackle
The approach I developed was nice and fully covered the scope of the assignment, since the Story Telling part was really relevant for EA and it is one of the strongest points of my assignment. That said, EA's Location Team based in Madrid (Spain) and Cologne (Germany) expected the usage of [Helsinki NLP](https://huggingface.co/Helsinki-NLP) to translate all the texts into Spanish, so as to tackle the Multi-Lingual Multi-Context problem just as a Multi-Context problem.
They proposed the usage of [huggingface/transformers](https://github.com/huggingface/transformers) in order to translate the texts, as shown in this example: https://huggingface.co/transformers/model_doc/marian.html#multilingual-models, which uses [MarianMT](https://huggingface.co/transformers/model_doc/marian.html) to load the Helsinki Machine Translation Models.
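For illustration, a minimal sketch of how one of the Helsinki-NLP MarianMT checkpoints can be loaded through `transformers` to translate a document; the English-to-Spanish checkpoint below is just one of the available `opus-mt-*` models and may not be the exact one EA had in mind:

```python
from transformers import MarianMTModel, MarianTokenizer

# One of the Helsinki-NLP OPUS-MT checkpoints; following EA's suggestion, every
# document would be translated into Spanish before training a single-language model.
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

documents = ["The dataset contains conference papers, Wikipedia articles and product reviews."]
batch = tokenizer(documents, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```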