{"id":14956859,"url":"https://github.com/jofaval/tfm-iabd","last_synced_at":"2025-10-24T10:31:17.408Z","repository":{"id":39130647,"uuid":"472971776","full_name":"jofaval/tfm-iabd","owner":"jofaval","description":"Master's Final Degree Project on Artificial Intelligence and Big Data","archived":false,"fork":false,"pushed_at":"2022-06-01T16:13:14.000Z","size":17895,"stargazers_count":5,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-29T10:22:35.372Z","etag":null,"topics":["ai-engineering","big-data","big-data-analytics","data-analysis","data-architecture","data-engineering","data-science","data-science-project","fastapi","kafka","mongo-db","mongodb","nlp","node-red","nodered","python","sentiment-analysis","spark","spark-streaming","transformers"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jofaval.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null}},"created_at":"2022-03-22T23:33:37.000Z","updated_at":"2024-08-09T14:42:09.000Z","dependencies_parsed_at":"2022-08-31T04:11:15.675Z","dependency_job_id":null,"html_url":"https://github.com/jofaval/tfm-iabd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Ftfm-iabd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Ftfm-iabd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Ftfm-iabd/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/jofaval%2Ftfm-iabd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jofaval","download_url":"https://codeload.github.com/jofaval/tfm-iabd/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237950914,"owners_count":19392667,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-engineering","big-data","big-data-analytics","data-analysis","data-architecture","data-engineering","data-science","data-science-project","fastapi","kafka","mongo-db","mongodb","nlp","node-red","nodered","python","sentiment-analysis","spark","spark-streaming","transformers"],"created_at":"2024-09-24T13:13:38.551Z","updated_at":"2025-10-24T10:31:12.346Z","avatar_url":"https://github.com/jofaval.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Master's Final Degree Project #\n\nArtificial Intelligence and Big Data\n\nThe motivation behind the project is to work as a team and bring together everything we've seen; in other words:\n\nBeing able to design, research, develop and deploy a Data Science idea, designing a Big Data Architecture from which to train a model with a conclusion in mind, while being ethical and not breaking any EU laws.\n\nFor reference about the changes, please check out our [CHANGELOG](./CHANGELOG.md).\n\n### \u003cp align=\"right\"\u003eGrade\u003c/p\u003e\n\u003cp align=\"right\"\u003e\nTo be graded\n\u003c/p\u003e\n\n## Table of Contents\n\n1. [Title](#title)\n1. [Description](#description)\n1. 
[Objectives](#objectives)\n1. [Ethics](#ethics)\n1. [Design](#design)\n    1. [Flow of the Data](#flow-of-the-data)\n    1. [Data Structure](#data-structure)\n    1. [Data Sources](#data-sources)\n1. [Product](#product)\n    1. [Product Roadmap](#product-roadmap)\n    1. [How is the Product managed?](#how-is-the-product-managed)\n1. [Methodology](#methodology)\n    1. [Product Owner](#product-owner)\n    1. [Scrum Muster](#scrum-muster)\n    1. [Software](#software)\n1. [Tech Stack](#tech-stack)\n    1. [Programming Language](#programming-language)\n    1. [ETL](#etl)\n    1. [Database](#database)\n    1. [Cloud computing](#cloud-computing)\n    1. [Infrastructure](#infrastructure)\n1. [Usage](#usage)\n    1. [Requirements](#requirements)\n    1. [Install the project](#install-the-project)\n    1. [How to boot it](#how-to-boot-it)\n    1. [Stop the execution](#stop-the-execution)\n    1. [Deployment](#deployment)\n1. [Team](#team)\n    1. [Infrastructure (Big Data Architecture)](#infrastructure-big-data-architecture)\n    1. [Data Extraction/Mining](#data-extractionmining)\n    1. [Data Normalization](#data-normalization)\n    1. [Data Storage/Loading](#data-storageloading)\n    1. [Data Cleansing](#data-cleansing)\n    1. [Data Science/Modeling (AI Engineering, sort of)](#data-sciencemodeling-ai-engineering-sort-of)\n    1. [Data Visualization](#data-visualization)\n    1. [Deploy (CI/CD integration)](#deploy-cicd-integration)\n1. [License](#license)\n1. [Legal Notice](#legal-notice)\n1. [Credits](#credits)\n1. 
[Gratitude](#gratitude)\n\n## Title\n[↑ Back to top](#table-of-contents)\n\n\"Hype\" is all you need\n\n## Description\n[↑ Back to top](#table-of-contents)\n\nThis is research into what defines the success of films, and whether success can be predicted (proportionally) based on the hype (expectation) generated around a film. The research is meant to be extensible to series, anime, video games or any other type of multimedia content.\n\nAs possible definitions of a film's success, we intend to predict:\n\n- The profit generated by a film based on its initial investment and how well it will be received\n- The acceptance/acclaim of a film with respect to the initial \"hype\"\n- The rating on IMDB a week after release; what applies to IMDB can apply to other platforms (Rotten Tomatoes, Metacritic)\n- The film's success (as previously defined) one week after its release\n\nFor this, various data sources will be used, such as Twitter, Reddit, YouTube, IMDB, and any others that we discover as the investigation progresses. One of the central components of the application is sentiment analysis, which would become the main focus of the prediction.\n\n## Documentation\n[↑ Back to top](#table-of-contents)\n\nFor the official documentation, visit the [/docs](/docs/README.md) folder.\n\n## Objectives\n[↑ Back to top](#table-of-contents)\n\n_Not in a specific order._\n\n- Work as a team of Data Scientists with (almost) no experience in the data field.\n- Use knowledge from every subject seen in the degree.\n- Develop all the required components and integrate them.\n- Design a Data Infrastructure.\n- Research a movie's hype, its success, and its total box office.\n- Manage and develop an E2E (end-to-end) Big Data project, from idea to analysis/visualizations.\n- Apply AI Engineering techniques to deliver a product that showcases our conclusion.\n- Develop the (A.I. 
and machine learning) models required for the desired outcome.\n- Use Cloud Computing Services where needed and learn to work with them.\n- Fulfill a Data Science Project's requirements with a Data Team.\n- Try to understand and predict the box office of (mainly) blockbuster movies, whether independent or from a franchise.\n\n## Ethics\n[↑ Back to top](#table-of-contents)\n\nOur idea is to have an unbiased model that is not swayed by people's opinions; rather, one that can tell the difference between the general sentiment and how well it will reflect the movie's success.\n\nRegarding the ethics, our goal wouldn't be to force-feed certain movies, nor to dictate what people should do/watch; it'd be to offer just another tool to decide what you may want to see.\n\n## Design\n[↑ Back to top](#table-of-contents)\n\n### Flow of the Data\n[↑ To the section](#design)\n\n1. Node-RED sniffs the data and sends it to\n1. Kafka, which in turn distributes it to\n1. Spark, where it is transformed and stored in\n1. MongoDB, to be later retrieved with\n1. Google Colab/Python\n1. and used for training with Spark, saving the predictions in\n1. MongoDB so they can be accessed from\n1. PowerBI/Tableau, which displays them in\n1. an Azure Web Service with a simple front end and an even simpler interaction\n\n### Data Structure\n[↑ To the section](#design)\n\nAll the data will have an origin tag/field to better identify its properties.\n\n#### Data Lake\n\nInstead of following the classic ETL paradigm (first extract the data, then transform it BEFORE loading it), Data Lakes strive for ELT: extract the data, load it FIRST, then transform it when you need to use it.\n\nWe'll use it to store all the (raw) data that we collect over the span of the project. We'll have Diogenes syndrome towards the data: we'd rather delete data later than not have enough.\n\n#### Data Warehouse\n\nFrom this point forward we should have quality data, data that is \"clean\". 
Following the aforementioned ELT paradigm, a Data Warehouse is where the information will be loaded ONCE transformed.\n\nIt will serve as the main storage for our models. All the data that reaches this point must be clean, standardized, normalized and regularized. It should be as ready as possible for the model.\n\n### Data Sources\n[↑ To the section](#design)\n\n- IMDB\n- Twitter\n- YouTube\n- Reddit\n- Google Trends\n\n## Product\n[↑ Back to top](#table-of-contents)\n\nWe're not going to sell anything, but our Product idea is to have a model that retrains with different sources of information and to display the outcome on the web, with some storytelling around the conclusion.\n\n### Product Roadmap\n[↑ To the section](#product)\n\n#### Original estimation\n[↑ To the section](#product)\n\nThis is the initial estimation; it should be updated with the real roadmap at the end.\n\n![Roadmap](/pages/screenshots/home/Product%20Roadmap%2017-04-2022.png)\n\u003cp align=\"center\"\u003eInitial product roadmap\u003c/p\u003e\n\n#### Real\n[↑ To the section](#product)\n\n**_The project has not yet been finished_**\n\n### How is the Product managed?\n[↑ To the section](#product)\n\nWe've split the product into different phases. 
We kept the traditional Product phases and expanded the Data Science development ones:\n\n#### Traditional\n[↑ To the section](#product)\n\n- Product Identification\n- Product Planning\n- Product Development\n- Product Control\n- Product Closure\n\n#### Product Development\n[↑ To the section](#product)\n\n- Infrastructure\n- Data Extraction\n- Data Normalization\n- Data Storage/Loading\n- Data Cleansing\n- Data Science/Modeling\n- Data Visualization\n- Deploy\n- Documentation Draft\n- Validation\n\n## Methodology\n[↑ Back to top](#table-of-contents)\n\nSCRUM\n\n- Kanban Board\n- Planning Poker\n\n### Product Owner\n[↑ To the section](#methodology)\n\nPepe\n\n### Tech/Team Lead\n[↑ To the section](#methodology)\n\nPepe\n\n### Scrum Muster\n[↑ To the section](#methodology)\n\nOur teachers\n\n### Software\n[↑ To the section](#methodology)\n\n- Trello\n\n## Tech Stack\n[↑ Back to top](#table-of-contents)\n\n### Programming Language\n[↑ To the section](#tech-stack)\n\n- **Python**\\\nAn easy-to-learn language, chosen mainly because it's what the team is most comfortable with for Big Data and A.I. technologies and their usage. There were alternatives such as Scala, C++ or Java.\n\n### ETL\n[↑ To the section](#tech-stack)\n\n1. **Node-RED**\\\nA lightweight, graph/node-based npm package for flow development to connect services such as APIs and IoT.\n1. **Kafka**\\\nA data broker, one of the most used ones, if not the most used, meant to be used with Java or Scala, but it can be interacted with through plugins, add-ons, and shell scripts.\n1. **Spark**\\\nA highly efficient cluster computing and parallelization engine. Its API supports Python (PySpark), Java, Scala, R and SQL, which makes it a perfect fit for our team. 
It is in high demand nowadays.\n\n### Database\n[↑ To the section](#tech-stack)\n\n- **MongoDB**\\\nAn open-source, document-based NoSQL database; it has a great community and multiple implementations and integrations.\n\n### Cloud computing\n[↑ To the section](#tech-stack)\n\n- **AWS or Azure**\\\nBoth are great cloud computing platforms that offer similar services, each with its own pros and cons, and both are top-notch in the world of cloud computing, data science and DaaS (Data as a Service).\n- **Terraform (and maybe AWS CloudFormation)**\\\nIaC (Infrastructure as Code) is the way to go. CloudFormation restricts us to one provider, but it is important that, however we develop and deploy our cloud infrastructure (if ever), it is cloud-agnostic if possible, easily replicable and highly reliable: it should always produce the same output, the same outcome, with as little human error as possible.\n\n### Infrastructure\n[↑ To the section](#tech-stack)\n\n- **Docker**\\\nAn open-source software container service that adds an extra layer of abstraction for packaging software solutions.\n- **Compose**\\\nA cloud-agnostic standard for container orchestration maintained by Docker that is supported by: Docker Swarm, AWS ECS, Azure Container Instances, and many more.\n\n## Usage\n[↑ Back to top](#table-of-contents)\n\n### Requirements\n[↑ To the section](#usage)\n\n- Docker\n  - \u003e Engine Version 20.10\n  - \u003e Compose Version 1.29.2\n- Python\n  - \u003e \\\u003e= 3.6.x\n- Node\n  - \u003e \\\u003e= v15.14.0\n\n_All image versions will be pinned to an exact version in each Dockerfile; we avoid the `latest` tag for security reasons, and upgrades will be manual._\n\n### Install the project\n[↑ To the section](#usage)\n\nExecute the following commands in the folder you want to store the project in\n\n```bash\ngit clone https://github.com/jofaval/tfm-iabd.git\ncd tfm-iabd\n```\n\nAnd now configure the project's branches with Git flow\n\nFor Windows\n```bash\ncd 
tools/windows/git/\ngit-flow.bat\n```\n\nFor Linux\n```bash\ncd tools/linux/git/\n./git-flow.sh\n```\n\n### How to boot it\n[↑ To the section](#usage)\n\nExecute the `tools/windows/infra/stop.bat` or the `tools/linux/infra/stop.sh` file\n\nor execute the following commands in the shell\n\n```bash\ncd app/infra\ndocker-compose up -d\n```\n\n### Stop the execution\n[↑ To the section](#usage)\n\nExecute the `tools/windows/infra/stop.bat` or the `tools/linux/infra/stop.sh` file\n\nor execute the following commands in the shell\n\n```bash\ncd app/infra\ndocker-compose down\n```\n\n### Deployment\n[↑ To the section](#usage)\n\nHandled by the GitHub Actions workflow\n\n## Team\n[↑ Back to top](#table-of-contents)\n\n|                          Name                          |                       Role                      |\n|:------------------------------------------------------:|:-----------------------------------------------:|\n| [Diego del Caño](https://github.com/ddelcanonavarrete) |          Data Scientist / Data Analyst          |\n|  [Juan Crespin Valero](https://github.com/juancrespin) |             Data Analyst / SysAdmin             |\n|      [Nerea Gluskova](https://github.com/Rubirea)      |             Data Engineer / SysAdmin            |\n|    [Pepe Fabra Valverde](https://github.com/jofaval)   | Data Architect / Data Engineer / Data Scientist |\n\n_Table generated with: [https://www.tablesgenerator.com/markdown_tables](https://www.tablesgenerator.com/markdown_tables)_\n\nI (Pepe) will be supervising each task, but we're all here to help each other.\n\n### Infrastructure (Big Data Architecture)\n[↑ To the section](#team)\n\n#### Description\n\nDefined as preparing the Docker images, ready and interconnected to support the architecture.\n\n#### Software\n\nDocker (Docker Compose), Linux and, if cloud computing were required, AWS, Azure or Google Cloud\n\n#### Elements\n\nThe information regarding the infrastructure is in the 
[**Infrastructure**](#tech-stack) section.\n\n#### Assignees\n\n- Nerea\n- Juan\n- Pepe (only if cloud computing is required)\n\n### Data Extraction/Mining\n[↑ To the section](#team)\n\n#### Description\n\nDefined as retrieving all the data necessary for the project's work (JUST retrieving data).\n\n#### Software\n\nNode-RED\n\n#### Assignees\n\n- Nerea\n- Pepe\n- Everyone to search for Data Sources\n\n#### Data Sources\n\n- Twitter Developer API\n- IMDB API\n- YouTube API\n- Reddit API\n- Google Trends\n\n### Data Normalization\n[↑ To the section](#team)\n\n#### Description\n\nDefined as creating, after the data has been retrieved, a middle ground with the common data that may be needed, so that all sources end up with the same Data Model; in other words, standardizing the sources.\n\n#### Software\n\nNode-RED\n\n#### Assignees\n\n- Diego\n- Nerea\n- Juan\n- Pepe\n\n### Data Storage/Loading\n[↑ To the section](#team)\n\n#### Description\n\nDefined as storing the normalized data into the NoSQL DB (MongoDB most likely).\n\n#### Software\n\nNode-RED\n\n#### Assignees\n\n- Nerea\n\n### Data Cleansing\n[↑ To the section](#team)\n\n#### Description\n\nDefined as cleaning the data, which at this point has been normalized but not cleaned, so that it is ready for the Model to train with.\n\n#### Software\n\nPython (Google Colab?)\n\n#### Assignees\n\n- Diego\n- Juan\n- Pepe\n\n### Data Science/Modeling (AI Engineering, sort of)\n[↑ To the section](#team)\n\n#### Description\n\nDefined as developing and implementing the required model(s) for the desired performance and outcome.\n\nArtificial Intelligence and/or Machine Learning.\n\n#### Software\n\nPython (Google Colab?)\n\n#### Assignees\n\n- Diego\n- Pepe\n\n### Data Visualization\n[↑ To the section](#team)\n\n#### Description\n\nDefined as designing and developing the story (StoryTelling) and all the required/desired visualizations for whatever outcome(s) we want.\n\n#### Software\n\nPowerBI or Tableau, up to taste.\n\n#### Assignees\n\n- 
Juan\n- Nerea\n- Diego\n\n### Deploy (CI/CD integration)\n[↑ To the section](#team)\n\n#### Description\n\nDefined as preparing the connections and the proper usage of the model via endpoints and utilities.\n\n#### Software\n\nCloud Platform (if used), Git (GitHub)\n\n#### Assignees\n\n- Diego\n- Pepe\n\n## License\n[↑ Back to top](#table-of-contents)\n\nThe license used (MIT License) can be seen [here](./LICENSE), or you can read it locally by downloading the LICENSE file.\n\n## Legal Notice\n[↑ Back to top](#table-of-contents)\n\nAll the data is used and stored in accordance with the European Union's current legislation, more precisely with Spain's laws, which comply with the E.U.'s [GDPR (General Data Protection Regulation)](https://gdpr-info.eu/), and following the standards described in the [Charter of European Digital Rights (EDRi, EDR initiative)](https://edri.org/) regarding the use of A.I. for sentiment analysis and, overall, the possible bias it may pass on to the user. The aim is to be ethical and to prepare the model for the coming years.\n\nFor more information about the ethics of our model, please refer to the [Ethics section](#ethics).\n\n## Use of the Data\n[↑ Back to top](#table-of-contents)\n\nWe plan to use the extracted data and the data provided with it to better analyze the sentiment of users all around the world about the hype generated by a movie, whether from its announcement, a trailer, or some celebrity talking about it.\n\nBy analyzing the general feeling, whether positive, negative, or neutral, we could determine, one user at a time, whether they had a good or bad experience and whether they were hyped or not,\nso that we can later steer our model towards the idea people have/had of the movie.\n\nWe'll collect the raw text data (if it's a thread, we'll collect even more information), so we can tokenize, lemmatize, preprocess and prepare the text.\nOur methodology is to preprocess and clean the data, tokenize it into word embeddings, and use Transformers, maybe Siamese Neural Networks, 
but surely HuggingFace's mT5/BERT to draw a logical consequence with NLI so that we can “classify the data”.\n\nWe may even use reviews or the general feeling; in the case of adaptations we'd have even more information.\n\nTo display the conclusions obtained thanks to the insight from the extracted data, we’ll use personal websites, GitHub of course, and a Medium article. We’d like to research and write a paper so that we can more clearly present, document and explain the results obtained and their conclusions.\n\nAs for the tools, we’ll use Tableau, but maybe we could get PowerBI through a student license; it’s unclear at the moment.\n\n## Credits\n[↑ Back to top](#table-of-contents)\n\n- Ismael, for the idea\n\n## Gratitude\n[↑ Back to top](#table-of-contents)\n\nTODO","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjofaval%2Ftfm-iabd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjofaval%2Ftfm-iabd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjofaval%2Ftfm-iabd/lists"}