{"id":22895888,"url":"https://github.com/daniel-lima-lopez/collaborative-filtering-in-recomender-system","last_synced_at":"2026-07-10T21:31:51.620Z","repository":{"id":252335203,"uuid":"840113245","full_name":"daniel-lima-lopez/Collaborative-Filtering-in-Recomender-System","owner":"daniel-lima-lopez","description":"A kNN-based collaborative filtering applied to a movie recommender system","archived":false,"fork":false,"pushed_at":"2024-11-04T03:42:29.000Z","size":1782,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-16T05:28:02.349Z","etag":null,"topics":["collaborative-filtering","knn","knn-algorithm","machine-learning","recomendation-algorithm","recomendation-system","recommender-system"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daniel-lima-lopez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-09T02:11:44.000Z","updated_at":"2024-11-04T03:42:32.000Z","dependencies_parsed_at":"2024-11-04T04:21:55.546Z","dependency_job_id":"836ad9cf-05ba-4113-ad74-8a7e3fb73085","html_url":"https://github.com/daniel-lima-lopez/Collaborative-Filtering-in-Recomender-System","commit_stats":null,"previous_names":["daniel-lima-lopez/collaborative-filtering-in-recomender-system"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/daniel-lima-lopez/Collaborative-Filtering-in-Recomender-System","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dani
el-lima-lopez%2FCollaborative-Filtering-in-Recomender-System","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FCollaborative-Filtering-in-Recomender-System/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FCollaborative-Filtering-in-Recomender-System/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FCollaborative-Filtering-in-Recomender-System/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daniel-lima-lopez","download_url":"https://codeload.github.com/daniel-lima-lopez/Collaborative-Filtering-in-Recomender-System/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FCollaborative-Filtering-in-Recomender-System/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35344523,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-10T02:00:06.465Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collaborative-filtering","knn","knn-algorithm","machine-learning","recomendation-algorithm","recomendation-system","recommender-system"],"created_at":"2024-12-13T23:32:36.372Z","updated_at":"2026-07-10T21:31:51.603Z","avatar_url":"https://github.com/daniel-lima-lopez.png","language
":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Movie Recomendation System\nRecommendation systems offers several advantages in various industries, including personalized product or content suggestions, enhancing customer experience and user engagement. This technique enable businesses to target marketing effors more efectively, by analizing customer behavior and preferences. Furtheremore, sectors like retal, entertainment and finance leverage this approach for gaining competitive advantage and optimizing user interactions.\n\nThis repository presents the implementation of a movie recommendation system that leverages K-means clustering for movie classification and Collaborative Filtering for personalized user recommendations. This project is aimed at enhancing user engagement and driving insights into consumer preferences. The data used can be found at [Movie Lens Dataset](https://www.kaggle.com/datasets/aigamer/movie-lens-dataset?select=tags.csv).\n\nThe objectives of this project are:\n- Gain valuable insights into user preferences.\n- Generate a data set with the most valuable information to classify movies according to their characteristics.\n- Make personalized movie recommendations based on the interests of each user.\n\nThe repository is organized as follows:\n1. Exploratory Data Analysis (EDA).\n2. Model implementation.\n3. Recomendation examples.\n\n## 1. 
Exploratory Data Analysis (EDA)\nFirst, we start by loading the required libraries and reading the movies data:\n\n\n```python\n# load libraries\nimport pandas as pd\nimport numpy as np\nfrom pandasql import sqldf\nimport matplotlib.pyplot as plt\nimport seaborn as sb\n\n# data from movies\nmovies_data = pd.read_csv('Data/movies.csv')\nmovies_data.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emovieId\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003egenres\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003eToy Story (1995)\u003c/td\u003e\n      \u003ctd\u003eAdventure|Animation|Children|Comedy|Fantasy\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003eJumanji (1995)\u003c/td\u003e\n      \u003ctd\u003eAdventure|Children|Fantasy\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003eGrumpier Old Men (1995)\u003c/td\u003e\n      \u003ctd\u003eComedy|Romance\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003eWaiting to Exhale (1995)\u003c/td\u003e\n      \u003ctd\u003eComedy|Drama|Romance\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      
\u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e5\u003c/td\u003e\n      \u003ctd\u003eFather of the Bride Part II (1995)\u003c/td\u003e\n      \u003ctd\u003eComedy\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\nSubsequently, the information is reorganized, separating the name and year of each film into different columns, as well as the genres in which it is classified:\n\n\n```python\n# extraction of year on each movie\ntitles = []\nyears = []\nfor ti in movies_data['title'].values:\n    ti = ti.strip()\n    auxy = ti[-5:-1]\n    # extraction of year on each movie\n    if np.char.isnumeric(auxy):\n        titles.append(ti[:-7])\n        years.append(ti[-5:-1])\n    else:\n        titles.append(ti)\n        years.append('-')\n\nmovies_data['title'] = titles\nmovies_data['year'] = years\n\n# identification of unique genres\ngens = movies_data['genres'].values\nauxg = []\nfor gi in gens:\n    auxg += gi.split('|')\nauxg = np.unique(auxg)\nauxg = auxg[1:] # drop 'no genre listed'\n\n# identification of genres on each movie\naux_dic = {}\nfor gi in auxg:\n    aux_dic[gi] = [0]*len(gens)\n\nfor i, gis in enumerate(gens):\n    split = gis.split('|')\n    if split !=['(no genres listed)']:\n        for si in split:\n            aux_dic[si][i] = 1\n        \n# add features to dataframe\nfor ki in aux_dic.keys():\n    movies_data[ki] = aux_dic[ki]\n\n# drop previous genre feature\nmovies_data = movies_data.drop(['genres'], axis=1)\nmovies_data.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      
\u003cth\u003emovieId\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003eyear\u003c/th\u003e\n      \u003cth\u003eAction\u003c/th\u003e\n      \u003cth\u003eAdventure\u003c/th\u003e\n      \u003cth\u003eAnimation\u003c/th\u003e\n      \u003cth\u003eChildren\u003c/th\u003e\n      \u003cth\u003eComedy\u003c/th\u003e\n      \u003cth\u003eCrime\u003c/th\u003e\n      \u003cth\u003eDocumentary\u003c/th\u003e\n      \u003cth\u003e...\u003c/th\u003e\n      \u003cth\u003eFilm-Noir\u003c/th\u003e\n      \u003cth\u003eHorror\u003c/th\u003e\n      \u003cth\u003eIMAX\u003c/th\u003e\n      \u003cth\u003eMusical\u003c/th\u003e\n      \u003cth\u003eMystery\u003c/th\u003e\n      \u003cth\u003eRomance\u003c/th\u003e\n      \u003cth\u003eSci-Fi\u003c/th\u003e\n      \u003cth\u003eThriller\u003c/th\u003e\n      \u003cth\u003eWar\u003c/th\u003e\n      \u003cth\u003eWestern\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003eToy Story\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      
\u003ctd\u003eJumanji\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003eGrumpier Old Men\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003eWaiting to Exhale\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      
\u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e5\u003c/td\u003e\n      \u003ctd\u003eFather of the Bride Part II\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e5 rows × 22 columns\u003c/p\u003e\n\u003c/div\u003e\n\n\n\nThe rating data, which contains 100,000 movie reviews, is read. 
This information is leveraged to improve the movie data by adding two new columns: the number of ratings for each movie and its average rating score (minimum 0.0 and maximum 5.0).\n\n\n```python\n# data reading\nratings_data = pd.read_csv('Data/ratings.csv')\n\n# count and average calculation\nmean_ratings = sqldf(''' \n    SELECT movieID, COUNT(userID) as ratings, AVG(rating) as avg_rating \n    FROM ratings_data GROUP BY movieID\n''')\n\n# joining data with movies dataframe\nmovies_data = sqldf(''' \n    WITH T AS \n        (SELECT * from movies_data LEFT JOIN\n        mean_ratings on movies_data.movieId=mean_ratings.movieId)\n    SELECT movieId, title, year, ratings, avg_rating, Action, Adventure, Animation,\n       Children, Comedy, Crime, Documentary, Drama, Fantasy,\n       [Film-Noir], Horror, IMAX, Musical, Mystery, Romance,\n       [Sci-Fi], Thriller, War, Western FROM T\n''')\n# write final movies dataframe\nmovies_data.to_csv('Data/movies_full_data.csv', index=False)\nmovies_data.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emovieId\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003eyear\u003c/th\u003e\n      \u003cth\u003eratings\u003c/th\u003e\n      \u003cth\u003eavg_rating\u003c/th\u003e\n      \u003cth\u003eAction\u003c/th\u003e\n      \u003cth\u003eAdventure\u003c/th\u003e\n      \u003cth\u003eAnimation\u003c/th\u003e\n      \u003cth\u003eChildren\u003c/th\u003e\n      \u003cth\u003eComedy\u003c/th\u003e\n      \u003cth\u003e...\u003c/th\u003e\n      \u003cth\u003eFilm-Noir\u003c/th\u003e\n      
\u003cth\u003eHorror\u003c/th\u003e\n      \u003cth\u003eIMAX\u003c/th\u003e\n      \u003cth\u003eMusical\u003c/th\u003e\n      \u003cth\u003eMystery\u003c/th\u003e\n      \u003cth\u003eRomance\u003c/th\u003e\n      \u003cth\u003eSci-Fi\u003c/th\u003e\n      \u003cth\u003eThriller\u003c/th\u003e\n      \u003cth\u003eWar\u003c/th\u003e\n      \u003cth\u003eWestern\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003eToy Story\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e215.0\u003c/td\u003e\n      \u003ctd\u003e3.920930\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003eJumanji\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e110.0\u003c/td\u003e\n      \u003ctd\u003e3.431818\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n   
   \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003eGrumpier Old Men\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e52.0\u003c/td\u003e\n      \u003ctd\u003e3.259615\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003eWaiting to Exhale\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e7.0\u003c/td\u003e\n      \u003ctd\u003e2.357143\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      
\u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e5\u003c/td\u003e\n      \u003ctd\u003eFather of the Bride Part II\u003c/td\u003e\n      \u003ctd\u003e1995\u003c/td\u003e\n      \u003ctd\u003e49.0\u003c/td\u003e\n      \u003ctd\u003e3.071429\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e5 rows × 24 columns\u003c/p\u003e\n\u003c/div\u003e\n\n\n\nWith the information extracted we can study the general content of the data. For example, the distribution of films released per year shows that most of the data belongs to films released from 1990 onwards. 
However, older films are also present, including records from 1902.\n\n\n```python\n# number of movies released per year\nyear_data = sqldf(''' \n    SELECT * FROM movies_data ORDER BY year\n''')\n\n# movies distribution per year\nmovie_year = sqldf('''  \n    SELECT year, COUNT(year) AS freq FROM movies_data\n    GROUP BY year\n''')\n\n# distribution plot\nfig1, ax1 = plt.subplots()\nfig1.set_size_inches(20, 3.5)\nyears = movie_year['year'].values\ncounts = movie_year['freq'].values\nax1.bar(years, counts)\nplt.xticks(years, rotation = 70)\nplt.show()\n```\n\n\n    \n![png](README_files/README_8_0.png)\n    \n\n\nRegarding the genres of each film, the following figure represents the distribution of genres across all films. Note that Drama and Comedy are the most common genres.\n\n\n```python\nimport squarify\n# movies distribution per genre\nfreqs = [] # frequency\nfor gi in auxg: # iterate over genres\n    auxF = sqldf(f''' \n        SELECT * FROM movies_data WHERE [{gi}]=1\n    ''')\n    freqs.append(len(auxF))\n\n# Treemap\nplt.axis(\"off\")\nsquarify.plot(sizes=freqs, label=auxg,\n                text_kwargs = {'fontsize': 8, 'color': 'white'},\n                pad=0.2, ec= 'black',\n                color = sb.color_palette(\"flare\", len(freqs)))\n```\n\n\n\n\n    \u003cAxes: \u003e\n\n\n\n\n    \n![png](README_files/README_10_1.png)\n    \n\n\nBelow are the highest-rated drama films released since 1990 (with at least 15 ratings), including films such as The Shawshank Redemption, Fight Club and Goodfellas:\n\n\n```python\n# top drama movies released since 1990, considering movies with at least 15 reviews\ntop_drama = sqldf('''  \n    SELECT movieId, title, year, avg_rating FROM movies_data\n    WHERE ratings \u003e= 15 AND Drama=1 AND year\u003e=1990\n    ORDER BY avg_rating DESC\n''')\ntop_drama.head(10)\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th 
{\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emovieId\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003eyear\u003c/th\u003e\n      \u003cth\u003eavg_rating\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e318\u003c/td\u003e\n      \u003ctd\u003eShawshank Redemption, The\u003c/td\u003e\n      \u003ctd\u003e1994\u003c/td\u003e\n      \u003ctd\u003e4.429022\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e475\u003c/td\u003e\n      \u003ctd\u003eIn the Name of the Father\u003c/td\u003e\n      \u003ctd\u003e1993\u003c/td\u003e\n      \u003ctd\u003e4.300000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e2959\u003c/td\u003e\n      \u003ctd\u003eFight Club\u003c/td\u003e\n      \u003ctd\u003e1999\u003c/td\u003e\n      \u003ctd\u003e4.272936\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e48516\u003c/td\u003e\n      \u003ctd\u003eDeparted, The\u003c/td\u003e\n      \u003ctd\u003e2006\u003c/td\u003e\n      \u003ctd\u003e4.252336\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e1213\u003c/td\u003e\n      \u003ctd\u003eGoodfellas\u003c/td\u003e\n      \u003ctd\u003e1990\u003c/td\u003e\n      \u003ctd\u003e4.250000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5\u003c/th\u003e\n      \u003ctd\u003e1719\u003c/td\u003e\n      \u003ctd\u003eSweet Hereafter, The\u003c/td\u003e\n      
\u003ctd\u003e1997\u003c/td\u003e\n      \u003ctd\u003e4.250000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e6\u003c/th\u003e\n      \u003ctd\u003e58559\u003c/td\u003e\n      \u003ctd\u003eDark Knight, The\u003c/td\u003e\n      \u003ctd\u003e2008\u003c/td\u003e\n      \u003ctd\u003e4.238255\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e7\u003c/th\u003e\n      \u003ctd\u003e527\u003c/td\u003e\n      \u003ctd\u003eSchindler's List\u003c/td\u003e\n      \u003ctd\u003e1993\u003c/td\u003e\n      \u003ctd\u003e4.225000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e8\u003c/th\u003e\n      \u003ctd\u003e1245\u003c/td\u003e\n      \u003ctd\u003eMiller's Crossing\u003c/td\u003e\n      \u003ctd\u003e1990\u003c/td\u003e\n      \u003ctd\u003e4.225000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e9\u003c/th\u003e\n      \u003ctd\u003e3275\u003c/td\u003e\n      \u003ctd\u003eBoondock Saints, The\u003c/td\u003e\n      \u003ctd\u003e2000\u003c/td\u003e\n      \u003ctd\u003e4.220930\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\nTaking the top-rated Drama movie (The Shawshank Redemption, with movieId 318) as an example, we can study the genres most watched by users who have watched this movie. 
The following figure shows the distribution of genres watched by these users. Note that Drama remains the most watched genre, which suggests that users tend to watch movies of similar genres.\n\n\n```python\n# distribution of genres watched by users who rated The Shawshank Redemption (movieId 318)\n# viewers of selected movie\nviewers_data = sqldf(''' \n    WITH T2 AS\n        (WITH T AS\n            (SELECT userId FROM ratings_data\n            WHERE  movieId=318)\n        SELECT * FROM T LEFT JOIN ratings_data\n            on T.userId=ratings_data.userId)\n    SELECT userId, movieId, rating from T2 WHERE movieId!=318\n''')\n\n# information of the movies that these users rated 5.0\nbests = sqldf(''' \n    WITH T AS (\n        SELECT DISTINCT movieId FROM viewers_data\n        WHERE rating\u003e=5.0\n        ORDER BY movieId\n              )\n    SELECT * FROM T LEFT JOIN movies_data\n    ON T.movieId=movies_data.movieId\n''')\nbests = bests.drop(['movieId'], axis=1)\n\n# genres count\nfreqs = []\nfor gi in auxg:\n    auxF = sqldf(f''' \n        SELECT * FROM bests WHERE [{gi}]=1\n    ''')\n    freqs.append(len(auxF))\n\n# Treemap\nplt.axis(\"off\")\nsquarify.plot(sizes=freqs, label=auxg,\n                text_kwargs = {'fontsize': 8, 'color': 'white'},\n                pad=0.2, ec= 'black',\n                color = sb.color_palette(\"flare\", len(freqs)))\n```\n\n\n\n\n    \u003cAxes: \u003e\n\n\n\n\n    \n![png](README_files/README_14_1.png)\n    \n\n\n## 2. Model implementation.\nThe implementation of the model requires training the k-Means algorithm on the characteristics of the films and calculating the collaborative filtering matrices of each cluster. 
This process is performed once and both the trained model and the set of matrices are stored for later use to make personalized recommendations.\n\nRegarding the training of the k-Means algorithm, a training dataset is first constructed with the features of each movie, including the year of publication, number of ratings, average rating, and genres. Next, a pipeline is implemented that includes a Column Transformer and the k-Means algorithm. The column transformer standardizes the numeric columns (year, number of ratings, average rating) and passes the binary genre columns through unchanged. Finally, the k-Means algorithm is trained on the movie features and the model is written to a pickle file.\n\n\n```python\n# loading libraries\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler, FunctionTransformer\nfrom sklearn.cluster import KMeans\nimport pickle\n\n\n# generation of training data\ntrain_data = sqldf(''' \n    SELECT movieId, year, ratings, avg_rating, Action, Adventure, Animation,\n       Children, Comedy, Crime, Documentary, Drama, Fantasy,\n       [Film-Noir], Horror, IMAX, Musical, Mystery, Romance,\n       [Sci-Fi], Thriller, War, Western FROM movies_data    \n''')\n\n# data processing\ntrain_data = train_data.dropna()\nXs = train_data.drop(['movieId'], axis=1)\n\n# processing of numerical features (year, ratings, avg_rating)\nnumeric_features = ['year', 'ratings', 'avg_rating']\nnumeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])\n\n# processing of categorical features (no transformation)\ncategorical_features = ['Action', 'Adventure', 'Animation',\n       'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',\n       'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance',\n       'Sci-Fi', 'Thriller', 'War', 'Western']\ncategorical_transformer = FunctionTransformer() # identity transformation\n\n# preprocessor for both feature types\npreprocessor = ColumnTransformer(\n    
transformers=[\n        ('num', numeric_transformer, numeric_features),\n        ('cat', categorical_transformer, categorical_features)])\n\n# k-means algorithm\nkmeans = KMeans(n_clusters=5, random_state=1, n_init=\"auto\")\n\nmodel = Pipeline(steps=[('preprocessor', preprocessor),\n                      ('k-means', kmeans)])\nmodel.fit(Xs)\n\n# saving model\nwith open('model_k_means.pkl','wb') as f:\n    pickle.dump(model,f)\n```\n\nThe following table presents the 10 top-rated movies (with at least 20 ratings) in each cluster after training the k-Means algorithm with 5 clusters:\n\n\n```python\n# prediction of the class of each movie\npreds = model.predict(Xs)\n\n# table with the classification of each movie\nmovies_class = pd.DataFrame({'movieId':train_data['movieId'].values,\n                             'class': preds})\n\n# information join\nauxC = sqldf(''' \n    SELECT movies_class.movieId, title, year, ratings, avg_rating, class FROM movies_class LEFT JOIN movies_data\n    ON movies_class.movieId=movies_data.movieId\n''')\n\n# top 10 movies in each cluster, considering at least 20 reviews\naux_dic = {'ranking': [i for i in range(1,11)]}\nfor ci in range(5): # we have 5 clusters\n    aux_top = sqldf(f''' \n        SELECT * FROM auxC\n        WHERE class={ci} AND ratings\u003e=20\n        ORDER BY avg_rating DESC\n    ''')\n    aux_dic[f'class_{ci}'] = list(aux_top['title'].values[0:10])\ndata_tops = pd.DataFrame(aux_dic)\ndata_tops\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eranking\u003c/th\u003e\n      \u003cth\u003eclass_0\u003c/th\u003e\n      
\u003cth\u003eclass_1\u003c/th\u003e\n      \u003cth\u003eclass_2\u003c/th\u003e\n      \u003cth\u003eclass_3\u003c/th\u003e\n      \u003cth\u003eclass_4\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003eIn the Name of the Father\u003c/td\u003e\n      \u003ctd\u003eStreetcar Named Desire, A\u003c/td\u003e\n      \u003ctd\u003eShawshank Redemption, The\u003c/td\u003e\n      \u003ctd\u003eOld Boy\u003c/td\u003e\n      \u003ctd\u003eBuffy the Vampire Slayer\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003eHoop Dreams\u003c/td\u003e\n      \u003ctd\u003eSunset Blvd. (a.k.a. Sunset Boulevard)\u003c/td\u003e\n      \u003ctd\u003eGodfather, The\u003c/td\u003e\n      \u003ctd\u003eGrand Day Out with Wallace and Gromit, A\u003c/td\u003e\n      \u003ctd\u003eJoe Dirt\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003eLogan\u003c/td\u003e\n      \u003ctd\u003ePhiladelphia Story, The\u003c/td\u003e\n      \u003ctd\u003eFight Club\u003c/td\u003e\n      \u003ctd\u003eHowl's Moving Castle (Hauru no ugoku shiro)\u003c/td\u003e\n      \u003ctd\u003eToys\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003eMiller's Crossing\u003c/td\u003e\n      \u003ctd\u003eLawrence of Arabia\u003c/td\u003e\n      \u003ctd\u003eDr. 
Strangelove or: How I Learned to Stop Worr...\u003c/td\u003e\n      \u003ctd\u003eFemme Nikita, La (Nikita)\u003c/td\u003e\n      \u003ctd\u003eAngels in the Outfield\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e5\u003c/td\u003e\n      \u003ctd\u003eBoondock Saints, The\u003c/td\u003e\n      \u003ctd\u003eHarold and Maude\u003c/td\u003e\n      \u003ctd\u003eRear Window\u003c/td\u003e\n      \u003ctd\u003eKiss Kiss Bang Bang\u003c/td\u003e\n      \u003ctd\u003eThe Scorpion King\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5\u003c/th\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003eBoot, Das (Boat, The)\u003c/td\u003e\n      \u003ctd\u003eCool Hand Luke\u003c/td\u003e\n      \u003ctd\u003eGodfather: Part II, The\u003c/td\u003e\n      \u003ctd\u003eLaputa: Castle in the Sky (Tenkû no shiro Rapy...\u003c/td\u003e\n      \u003ctd\u003eScary Movie 3\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e6\u003c/th\u003e\n      \u003ctd\u003e7\u003c/td\u003e\n      \u003ctd\u003eRaging Bull\u003c/td\u003e\n      \u003ctd\u003eNotorious\u003c/td\u003e\n      \u003ctd\u003eDeparted, The\u003c/td\u003e\n      \u003ctd\u003eEvil Dead II (Dead by Dawn)\u003c/td\u003e\n      \u003ctd\u003eSuperman III\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e7\u003c/th\u003e\n      \u003ctd\u003e8\u003c/td\u003e\n      \u003ctd\u003eGlory\u003c/td\u003e\n      \u003ctd\u003eManchurian Candidate, The\u003c/td\u003e\n      \u003ctd\u003eGoodfellas\u003c/td\u003e\n      \u003ctd\u003eArmy of Darkness\u003c/td\u003e\n      \u003ctd\u003eRichie Rich\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e8\u003c/th\u003e\n      \u003ctd\u003e9\u003c/td\u003e\n      \u003ctd\u003eCinema Paradiso (Nuovo cinema Paradiso)\u003c/td\u003e\n      \u003ctd\u003eAll About Eve\u003c/td\u003e\n      
\u003ctd\u003eCasablanca\u003c/td\u003e\n      \u003ctd\u003eRoad Warrior, The (Mad Max 2)\u003c/td\u003e\n      \u003ctd\u003eI Know What You Did Last Summer\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e9\u003c/th\u003e\n      \u003ctd\u003e10\u003c/td\u003e\n      \u003ctd\u003eIn Bruges\u003c/td\u003e\n      \u003ctd\u003eThird Man, The\u003c/td\u003e\n      \u003ctd\u003eDark Knight, The\u003c/td\u003e\n      \u003ctd\u003eCabin in the Woods, The\u003c/td\u003e\n      \u003ctd\u003eInspector Gadget\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\nFor the Collaborative Filtering matrices, a user-movie matrix is calculated for each cluster, considering only the movies in that cluster and the users who have rated them. Unlike the conventional approach, in which a single user-movie matrix is calculated for the entire dataset, this reduces the complexity of the process: the resulting matrices are significantly smaller, which makes the Recommendation System easier to implement.\n\n\n```python\n# calculation of the user-movie rating matrix of each movie cluster\nframes = []\nfor ci in [0,1,2,3,4]: # iterate over each cluster\n    # movies info of current cluster\n    auxJ = sqldf(f''' \n        WITH T1 AS (\n            SELECT movieId FROM movies_class WHERE class={ci}\n        )\n        SELECT userId, T1.movieId, rating FROM T1 LEFT JOIN ratings_data\n        ON T1.movieId=ratings_data.movieId ORDER BY userId, T1.movieId\n    ''')\n\n    # dataframe creation (rows: users, columns: movies, 0.0 means no rating)\n    u_ids = np.unique(auxJ['userId'].values) # unique user ids\n    m_ids = np.unique(auxJ['movieId'].values) # unique movie ids\n    aux_dic = {}\n    for mi in m_ids:\n        aux_dic[f'{mi}'] = [0.0]*len(u_ids)\n    data_ci = pd.DataFrame(index=u_ids, data=aux_dic)\n\n    # write each user's ratings\n    for ui in u_ids:\n        data_ui = sqldf(f''' \n            SELECT 
movieId, rating FROM auxJ\n            WHERE userId={ui}            \n        ''')\n        for mi, ri in zip(data_ui['movieId'].values, data_ui['rating'].values):\n            data_ci.at[ui, f'{mi}'] = ri\n    \n    # dataframe writing\n    data_ci.to_csv(f'cf_matrices/cf_matrix_{ci}.csv', index=True)\n```\n\nThe recommendation system combines the trained k-Means model with the previously calculated collaborative filtering matrices. Its implementation, shown below, works as follows:\n1. The recommendation system loads the pre-trained model and the collaborative filtering matrices, so serving recommendations never requires retraining the k-Means algorithm or recalculating the user-movie matrices, which would be inefficient.\n2. Predictions are implemented in the `recomend` method:\n    - The method receives a dictionary with a new user's ratings and calculates the centroid of the feature vectors of their highest-rated movies (those rated 4.0 or higher).\n    - The k-Means model assigns this centroid to a cluster.\n    - The user-movie matrix of that cluster is used to apply Collaborative Filtering: the k most similar users are found based on their movie ratings, and a rating vector is predicted from these users' ratings.\n\n\n\n```python\n# loading libraries\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler, FunctionTransformer\nfrom sklearn.cluster import KMeans\nimport pickle\nimport numpy as np\nimport pandas as pd\nfrom pandasql import sqldf\nfrom sklearn.neighbors import NearestNeighbors\n\nclass RecomenderSystem:\n    def __init__(self, k=5):\n        # load model\n        with open('model_k_means.pkl', 'rb') as f:\n            self.model = pickle.load(f)\n        \n        # load cf matrices\n        self.cf_matrices = []\n        for i in 
range(5):\n            self.cf_matrices.append(pd.read_csv(f'cf_matrices/cf_matrix_{i}.csv', index_col=0))\n\n        # k value for k Nearest Neighbors\n        self.k = k\n\n    \n    def recomend(self, ratings):\n        # read movies dataframe\n        movies_data = pd.read_csv('Data/movies_full_data.csv')\n        \n        # identify top movies and their ids\n        movies = []\n        ids = []\n        for ki in list(ratings.keys()):        \n            if ratings[ki]\u003e=4.0: # at least 4.0 rating\n                movies.append(ki)\n                aux_ids = sqldf(f'''\n                    SELECT movieId FROM movies_data\n                    WHERE title=\"{ki}\"\n                ''')\n                ids.append(aux_ids.values[0,0])\n\n        # extract vector of each movie\n        vectors = movies_data[movies_data['movieId'].isin(ids)]\n        vectors = vectors.drop(['movieId', 'title'], axis=1)\n\n        # centroid calculation\n        cent = np.mean(vectors.values, axis=0)\n        Fcent = pd.DataFrame(data=[cent], columns=('year','ratings','avg_rating','Action','Adventure','Animation','Children','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','IMAX','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western'))\n        \n        # k-means classification\n        ci = self.model.predict(Fcent)[0]\n        \n        # identification of the movies that actually belong to this cluster\n        cluster_movies = list(self.cf_matrices[ci].columns)\n        auxb = [f'{mi}' in cluster_movies for mi in ids]\n        movies = np.array(movies)[auxb]\n        ids = np.array(ids)[auxb]\n        \n        # user-movie vector (0.0 means no rating)\n        user_vector = pd.DataFrame(data=[[0.0]*len(cluster_movies)], columns=tuple(cluster_movies))\n        rs = [ratings[mi] for mi in movies]\n        for id, ri in zip(ids, rs):\n            user_vector.at[0, f'{id}'] = ri\n        \n        # get the k nearest users in Collaborative Filtering 
matrix\n        Xs = self.cf_matrices[ci].values # users vectors\n        nn = NearestNeighbors(n_neighbors=self.k).fit(Xs)\n        _, indices = nn.kneighbors(user_vector.values)\n        \n        # calculate the mean ratings considering only non zero values\n        aux_p = Xs[indices[0]]\n        pred_ratings = []\n        for pi in range(aux_p.shape[1]):\n            aux_r = aux_p[:,pi]\n            non_zeros = np.sum(aux_r!=0.0)\n            mean = np.sum(aux_r)/np.max([non_zeros, 1]) # prevent division by zero\n            pred_ratings.append(mean)\n        \n        # identify the top recommendations, considering non-zero ratings\n        auxF = pd.DataFrame(data = [pred_ratings], columns=tuple(cluster_movies))\n        top_movieIds = []\n        top_ratings = []\n        for id in cluster_movies: # iterate over movies in cluster\n            ri = auxF[id].values[0]\n            if ri != 0: # non zero ratings\n                top_movieIds.append(id)\n                top_ratings.append(ri)\n        pred_movies_ratings = pd.DataFrame({'movieId': top_movieIds,\n                                            'rating': top_ratings})\n        preds = sqldf(''' \n            SELECT pred_movies_ratings.movieId, movies_data.title, pred_movies_ratings.rating FROM pred_movies_ratings\n            LEFT JOIN movies_data on pred_movies_ratings.movieId=movies_data.movieId ORDER BY pred_movies_ratings.rating DESC\n        ''')\n        print(preds)\n```\n\n## 3. Recommendation examples.\nBelow are some examples of the recommendation system in use. It receives as input a dictionary with a new user's movie ratings. 
The system returns a set of recommendations together with a predicted rating for each one, so an informed decision can be made about which movies to recommend.\n\n\n```python\ntest = RecomenderSystem()\nratings = {'Godfather, The': 4.5, 'The Island':4.5, 'Rocky II':4.0, 'Batman: Gotham Knight':4.5}\ntest.recomend(ratings)\n```\n\n      movieId                title  rating\n    0    3671      Blazing Saddles    4.00\n    1    6502        28 Days Later    4.00\n    2   33794        Batman Begins    3.50\n    3    2005         Goonies, The    2.00\n    4    4226              Memento    2.00\n    5    1302      Field of Dreams    1.25\n    6     527     Schindler's List    0.50\n    7    3949  Requiem for a Dream    0.50\n    8   68954                   Up    0.50\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaniel-lima-lopez%2Fcollaborative-filtering-in-recomender-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaniel-lima-lopez%2Fcollaborative-filtering-in-recomender-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaniel-lima-lopez%2Fcollaborative-filtering-in-recomender-system/lists"}