{"id":17030089,"url":"https://github.com/alvarobartt/ea-associate-ds","last_synced_at":"2025-04-12T12:12:15.194Z","repository":{"id":37629062,"uuid":"283303504","full_name":"alvarobartt/ea-associate-ds","owner":"alvarobartt","description":"Electronic Arts (EA) NLP Assignment for: Associate Data Scientist","archived":false,"fork":false,"pushed_at":"2024-08-20T16:20:02.000Z","size":104621,"stargazers_count":12,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-12T12:11:58.120Z","etag":null,"topics":["data-science","electronic-arts","nlp","recruitment-task"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alvarobartt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-28T19:06:40.000Z","updated_at":"2023-04-13T04:25:58.000Z","dependencies_parsed_at":"2023-01-17T15:46:57.468Z","dependency_job_id":null,"html_url":"https://github.com/alvarobartt/ea-associate-ds","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvarobartt%2Fea-associate-ds","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvarobartt%2Fea-associate-ds/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvarobartt%2Fea-associate-ds/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvarobartt%2Fea-associate-ds/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alvarobartt","download_url":"https:/
/codeload.github.com/alvarobartt/ea-associate-ds/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248565078,"owners_count":21125417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","electronic-arts","nlp","recruitment-task"],"created_at":"2024-10-14T08:04:09.895Z","updated_at":"2025-04-12T12:12:10.185Z","avatar_url":"https://github.com/alvarobartt.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Electronic Arts (EA) Assignment for: NLP Associate Data Scientist\n\n\u003cimg src=\"https://media-exp1.licdn.com/dms/image/C561BAQFjp6F5hjzDhg/company-background_10000/0?e=2159024400\u0026v=beta\u0026t=OfpXJFCHCqdhcTu7Ud-lediwihm0cANad1Kc_8JcMpA\"\u003e\n\n__The goal of the test is working with a multi-language dataset, in order to demonstrate your Natural Language \nProcessing and Machine Translation abilities.__\n\nThe Core Data Scientist and Storytelling attributes will also be evaluated during your resolution of the case.\n\n`About the Data`:\n\nThe dataset you will be using is a multilingual, multi-context set of documents, which are a part of the one \ndescribed on the following paper: _Ferrero, Jérémy \u0026 Agnès, Frédéric \u0026 Besacier, Laurent \u0026 Schwab, Didier. \n(2016). A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection._\n\nPlease note the dataset is divided on contexts/categories (Conference_papers, Wikipedia, ... 
) and on languages, \nin the same way the folders are structured.\n\n* `Objective 1`: Create a document categorization classifier for the different contexts of the documents. You \nwill be addressing this objective at context level, regardless of the language the documents are written in.\n\n    Tasks/Requirements:\n\n    * EDA: Exploratory data analysis of the Dataset.\n    * Reproducibility/Methodology: The analysis you provide must be reproducible. Your analysis will fulfill \n    the Data Science methodology.\n    * Classification model: The deliverable will include a model which will receive a document as input and will \n    output its class, which will be the context of that document.\n\n* `Objective 2`: Perform a topic model analysis on the provided documents. You will discover the hidden topics \nand describe them.\n\n    Tasks:\n\n    * Profile the different documents and topics.\n    * Provide a visualization of the profiles.\n\n---\n\n## Table of Contents\n\n* [Roadmap](#roadmap)\n* [Repository Content](#repository-content)\n* [Exploratory Data Analysis](#exploratory-data-analysis)\n* [Text Preprocessing](#text-preprocessing)\n* [Text Classification](#text-classification)\n* [Topic Modelling](#topic-modelling)\n* [Conclusions](#conclusions)\n* [Future Work](#future-work)\n* [References](#references)\n* [Personal Opinion](#personal-opinion)\n* [EA's Expected Way to Tackle](#eas-expected-way-to-tackle)\n\n\n---\n\n## Roadmap\n\nBefore proceeding with the explanation and conclusions of every NLP task researched/developed for the project,\nwe will start by specifying the roadmap from the start date, Friday, July 31, to the end date of \nthe project, Tuesday, August 4.\n\n![\"NLP Roadmap\"](imgs/roadmap.png)\n\n---\n\n## Repository Content\n\n    .\n    ├── documents_challenge/          # Dataset of Multilingual Multi-Context documents\n    ├── research/                     # Jupyter Notebooks and Reports of the project's 
research\n    ├── slides/                       # Jupyter Slides for presenting the project\n    ├── imgs/                         # Contains some image resources\n    ├── 202007TestADS.pdf             # Electronic Arts (EA) Associate Data Scientist Assignment PDF File\n    ├── LICENSE                       # MIT License so as to release the code open-source\n    ├── README.md                     # Detailed README.md so as to explain the project\n    └── requirements.txt              # Requirements to reproduce the Jupyter Notebooks\n\nThe dataset and its construction are described in the following paper:\n\n[_A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab. In the 10th edition of the Language Resources and Evaluation Conference (LREC 2016)_](https://www.researchgate.net/publication/301861882_A_Multilingual_Multi-Style_and_Multi-Granularity_Dataset_for_Cross-Language_Textual_Similarity_Detection)\n\n---\n\n## Exploratory Data Analysis\n\nBefore starting any NLP project, we first need to explore and understand the data we have \nso as to decide how we are going to tackle the problem we are facing.\n\nWe can see the dataset statistics from the GitHub repository [FerreroJeremy/Cross-Language-Dataset](https://github.com/FerreroJeremy/Cross-Language-Dataset):\n\nSub-corpus | Alignment | Authors | Translations | Translators | Alteration | NE (%)\n--- | --- | --- | --- | --- | --- | ---\n__Wikipedia__ | Comparable | Anyone | - | - | Noise | 8.37\n__PAN-PC-11__ | Parallel | Professional authors | Human | Professional | Yes | 3.24\n__APR (Amazon Product Reviews)__ | Parallel | Anyone | Machine | Google Translate | No | 6.04\n__Conference papers__ | Comparable | Computer scientists | Human | Computer scientists | Noise | 9.36\n\nDuring the EDA, it is common to plot diverse features so as to get some insight into how the data 
is\nstructured across the documents, in order to find the proper way to tackle the problem and plan the upcoming NLP\nsteps. Some visualizations are provided below, with some interesting data that is explained afterwards:\n\n![\"EDA Plots\"](imgs/eda-plots.png)\n\nIn this case, we plotted the distribution of the documents per context and language, and the median length of\nthe documents per context. The plots show that Wikipedia is the most populated context and that French is the language \nwith the most documents. Also, the APR and the Conference papers documents have the fewest characters, while the PAN11\ntexts fall between Wikipedia and the other contexts.\n\n__Reference__: [Data Exploration](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/01%20-%20Data%20Exploration.pdf)\n\n---\n\n## Text Preprocessing\n\n__When it comes to NLP, data preprocessing is one of the most important tasks, if not the most important__, since it\nadds value to the raw data.\n\nFor this project, since we are facing a Multi-Lingual Multi-Context dataset, we need to develop\na custom preprocessor which preprocesses the texts regardless of the language (English, French or Spanish)\nand which also includes some more specific preprocessing related to the different contexts.\n\nThe preprocessing steps are defined as follows:\n\n1. __Clean Tabs and Line Breaks__: line breaks and tabs are common in text, so we replace them \nwith a space so that removing them does not join separate words together.\n2. __Convert to Unidecode__: to unify all the data, convert every str with unidecode, which replaces\naccented characters with their unaccented ASCII forms, etc.\n3. __Substitute Regular Expressions__: from a given collection of regular expressions, every match of \na regular expression in the text is replaced by a space and, therefore, removed.\n4. 
__Lower Case__: convert every str to lower case, so that the same word with different capitalizations \nis identified as the same word, since all the characters will match. \n5. __Split by Apostrophes__: since both English and French use the apostrophe to contract words, words are \nsplit at the apostrophe, if found, so as to obtain two separate words from the apostrophe-joined word.\n6. __Remove Small Words__: a threshold has been set so as to remove the words with fewer than 3 characters, \nsince those words do not provide any useful information to the models we need to train.\n7. __Remove Stopwords__: stopwords from a list of default stopwords for every language are removed, \nand some additional stopwords manually identified per language and context have also been included so as \nto provide a complete, specific stopword removal.\n8. __Remove Extra Spaces__: as every regular expression match and unknown character has been replaced by a space, \nmultiple spaces are now substituted by a single space so as to return a str which is a \nspace-separated list of tokens.\n\nAccordingly, we created a `CustomPreProcessor`, a Python class that preprocesses \nall the raw data.\n\n```python\nclass CustomPreProcessor(object):\n    \"\"\"\n    Custom PreProcessor\n\n    Preprocesses the introduced raw text to transform it into clean text. 
This\n    preprocessing pipe is regex based.\n\n        \u003e\u003e\u003e from apinlp.nlp.preprocessing import CustomPreProcessor\n        \u003e\u003e\u003e preprocessor = CustomPreProcessor()\n        \u003e\u003e\u003e print(preprocessor._preprocess(\"Visit us at https://www.ea.com/\"))\n        \"visit\"\n    \"\"\"\n    \n    def __init__(self, strip_accents=True):\n        self.strip_accents = strip_accents\n        \n        self.patterns = BASE_PATTERNS\n        self.additional_patterns = (SPACES_PATTERN,)\n\n        self.stopwords = STOPWORDS\n\n    def _preprocess(self, text):\n        \"\"\"Cleans and applies a preprocessing layer to raw text\"\"\"\n        text = text.replace('\\t', ' ').replace('\\n', ' ')\n        \n        if self.strip_accents:\n            text = unidecode(text)\n\n        for pattern in self.patterns:\n            text = pattern.sub(' ', text)\n\n        text = text.strip().lower()\n        text = text.replace(\"'\", \" \")\n        \n        text = [word for word in text.split(' ') if len(word) \u003e 2]\n\n        for word in self.stopwords:\n            text = list(filter((word.lower()).__ne__, text))\n\n        text = ' '.join(text)\n            \n        for pattern in self.additional_patterns:\n            text = pattern.sub(' ', text)\n    \n        return text\n```\n\nFinally, the example below shows how the WordClouds improved with the preprocessed \ndata compared to the raw data.\n\n![\"WordClouds Comparison\"](imgs/wordcloud-comparison.png)\n\n__Reference__: [Data Preprocessing](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/02%20-%20Data%20Preprocessing.pdf)\n\n---\n\n## Text Classification\n\nWe are facing an NLP Text Classification problem, which consists of classifying multilingual data into its context\nregardless of the language in which the text is written.\n\nFirst of all, we need to define a vectorizer so as to transform the input text (already preprocessed) into a vector \nand then 
train a model fitted on those vectors. In this case we use the TF-IDF Vectorizer, \nwhich suits this problem well because it weights the number of occurrences of each \nword inside a document against the number of documents in which that word appears, so as to identify \nthe relevance of a word in a document and later predict the context into which that piece \nof text should be classified. \n\nOnce the vectorization is complete, we need to decide which classification model to use depending \non both the scope and the model's requirements/limitations. In this case, we tested several \nclassification models over random stratified folds to see which of them performed best.\n\n![\"Text Classification Models\"](imgs/text-classification-models.png)\n\nAfter training several classification models over random stratified shuffled folds, we\ndecided to proceed with the `LinearSVC` model since it seemed to be the most consistent one in both time and\naccuracy. 
The resulting Pipeline looks as follows:\n\n```python\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.svm import LinearSVC\n\npipeline = Pipeline([\n    ('vect', TfidfVectorizer(min_df=5)),\n    ('clf', LinearSVC())\n])\n```\n\n__References__:\n\n* [Text Classification Model](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/03%20-%20Text%20Classification%20Model.pdf)\n* [Text Classification Model Testing](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/04%20-%20Text%20Classification%20Model%20Testing.pdf)\n\n---\n\n## Topic Modelling\n\nIn this case, we use the preprocessed data to fit a Topic Modelling algorithm in order\nto discover the inner insights of the data and detect the hidden topics, so as to gain a deeper understanding \nof what the data is about and into which topics it is separated.\n\nNLP Topic Modelling is a relevant part of the analysis, since it allows us to gain more insights about the\ndataset we have, but since it is unsupervised, it requires us to tune the parameters until we can draw useful\nconclusions which make sense for the given dataset.\n\nWe used the LDA (Latent Dirichlet Allocation) algorithm to identify the hidden topics in the dataset. As a first \nuse case, we started the Topic Modelling with just the English texts from Wikipedia, so as to test whether it worked as expected\nand also to evaluate the results on one of the most populated contexts.\n\n![\"Topic Modelling\"](imgs/topic-modelling.png)\n\nAs we can see above, after a lot of tuning five topics were clearly identified, so we tried to establish a \nrelationship between the hidden topics and real topics, such as Sports, from the top terms that were\npresent in those topics. 
The identified topics in the image above are (in ascending order by topic ID): \nPolitics/History, Music/Movies/Entertainment, Industry/Research/Chemistry, Sports/Games and Technology/Software.\n\nTopic Modelling has been applied for every possible combination of context and language, and the results are \nanalysed in depth in the Jupyter Notebooks.\n\n__References__:\n\n* [Topic Modelling](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/05%20-%20Topic%20Modelling.html)\n* [Topic Modelling Analysis](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/06%20-%20Topic%20Modelling%20Analysis.html)\n\n---\n\n## Conclusions\n\nBoth objectives have been successfully completed and their respective reports have been generated, tackling the \nproblem as a Data Scientist should, including detailed Story Telling for each part of the research. \nIn addition to the defined objectives, a detailed data exploration analysis and text preprocessing have \nalso been researched/developed, since this is probably the most relevant part of an NLP Data Scientist's work when \ntackling an NLP problem, as it adds value to the raw data.\n\n* `Objective 1`: the created model has been fit with 80% of the documents from every context and language \nand tested with the remaining 20% of the data, with balanced contexts and languages too, achieving an \naccuracy of up to 98% on the validation set. 
This model has also been dumped into a JOBLIB file so that \nit can be tested over unseen data.\n\n* `Objective 2`: the topic modelling problem has been broken down into a topic modelling per context and \nlanguage, so as to get more insights and analyse the hidden topics that can be found in each collection \nof documents, also with fairly satisfactory results, evaluated in a supervised way.\n\nTo sum up, even though the project tasks have been achieved and some extra points have been \nmade, there is still much work ahead, so the Future Work is defined in the next section.\n\n__Reference__: [Conclusions \u0026 Future Work](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/07%20-%20Conclusions%20%26%20Future%20Work.pdf)\n\n---\n\n## Future Work\n\nAs Future Work, the main line of research should be focused on developing a consistent Machine Translation \nmodel in order to translate text from French and Spanish into English, which should improve the \nresults even though they are already fairly accurate.\n\nSince in the first EA Interview Francisco Martínez (EA Talent Coordinator) spoke about EA's \nproject related to Machine Translation, it would make sense to continue the project by designing a \nconsistent Machine Translation model so as to test its efficiency on this problem.\n\nAnother line of Future Work should be the design of Deep Learning models, maybe in TensorFlow \nor PyTorch (usually more suitable for NLP), since we are presenting a simple use case in this project, \nbut reality is a bit more complex, so tackling the problem using Deep Learning models should improve the \nmodel's performance when the input data is bigger and more contexts and languages are provided.\n\nFinally, multilingual word embeddings should be used so as to improve the model's performance whatever \nthe input data is, so we should be using the word embeddings so as to \"translate\" (get the closest word 
\nembedding) every word in Spanish or French to English, so that the problem is still a Multi-Lingual \ninput one, but for the model it is just a single language. Also, when deploying the model into a \nproduction environment, a reliable language detection layer should be applied so as to either apply \nthe word embeddings if the text is written in French or Spanish, or discard the text if it is neither \nEnglish, Spanish nor French.\n\n__Reference__: [Conclusions \u0026 Future Work](https://github.com/alvarobartt/ea-associate-ds/blob/master/research/07%20-%20Conclusions%20%26%20Future%20Work.pdf)\n\n---\n\n## References\n\n1. [_A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab. In the 10th edition of the Language Resources and Evaluation Conference (LREC 2016)_](https://www.researchgate.net/publication/301861882_A_Multilingual_Multi-Style_and_Multi-Granularity_Dataset_for_Cross-Language_Textual_Similarity_Detection)\n\n2. [_Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. In the ICLR 2018 Computation and Language (cs.CL)_](https://arxiv.org/pdf/1710.04087.pdf)\n\n3. [_Exploiting similarities among languages for machine translation. Tomas Mikolov, Quoc V. Le and Ilya Sutskever. In the Computation and Language (cs.CL)_](https://arxiv.org/abs/1309.4168)\n\n4. [_Language-specific models in multilingual topic tracking. Leah S. 
Larkey, Fangfang Feng, Margaret Connell and Victor Lavrenko. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval_](https://dl.acm.org/doi/abs/10.1145/1008992.1009061)\n\n---\n\n## Personal Opinion\n\nThis assignment was both rewarding and demanding, since the time was very limited and the multi-lingual problem was not my area of expertise, so I had to do some extra research, which was indeed worthwhile as I gained new knowledge on that topic. Anyway, both the HR and the Technical Team were very nice during the whole process and the feedback on the assignment was far from good, as there is a lot of work in here.\n\n__So feel free to use this repository as a sample NLP assignment template, since this is the format that companies expect from a Data Scientist, Machine Learning Engineer, etc.__\n\nP.S.: I had to quit the hiring process since I received a job opportunity that was a better fit for me, so I quit before proceeding with the last interview.\n\n---\n\n## EA's Expected Way to Tackle\n\nThe approach I developed was nice and fully covered the scope of the assignment, since the Story Telling part was really relevant for EA and it is one of the strongest points of my assignment. Anyway, the EA's Location Team based in Madrid (Spain) and Cologne (Germany) expected the use of [Helsinki NLP](https://huggingface.co/Helsinki-NLP) to translate all the texts into Spanish, so as to tackle the Multi-Lingual Multi-Context problem just as a Multi-Context problem.\n\nThey proposed the use of [huggingface/transformers](https://github.com/huggingface/transformers) to translate the texts, as shown in this example: https://huggingface.co/transformers/model_doc/marian.html#multilingual-models, which uses [MarianMT](https://huggingface.co/transformers/model_doc/marian.html) to load the Helsinki Machine Translation Models. 
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falvarobartt%2Fea-associate-ds","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falvarobartt%2Fea-associate-ds","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falvarobartt%2Fea-associate-ds/lists"}