{"id":20027632,"url":"https://github.com/geedium/sphinx-polar","last_synced_at":"2026-04-12T15:05:09.557Z","repository":{"id":241658414,"uuid":"806864388","full_name":"Geedium/sphinx-polar","owner":"Geedium","description":"Spotify Data Transformation and Analysis","archived":false,"fork":false,"pushed_at":"2024-05-29T01:23:09.000Z","size":107,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-01-12T17:24:52.609Z","etag":null,"topics":["analysis","cli","data-transformation","interview-task","kaggle","nodejs","python","utilities"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Geedium.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-28T04:00:27.000Z","updated_at":"2024-05-29T00:52:05.000Z","dependencies_parsed_at":"2024-05-29T14:49:16.172Z","dependency_job_id":"fd0cf982-f767-4ddc-8a4d-f02acbca46c1","html_url":"https://github.com/Geedium/sphinx-polar","commit_stats":null,"previous_names":["geedium/sphinx-polar"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Geedium%2Fsphinx-polar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Geedium%2Fsphinx-polar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Geedium%2Fsphinx-polar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Geedium%2Fsphinx-polar/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Geedium","download_url":"https://codeload.github.com/Geedium/sphinx-polar/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241460046,"owners_count":19966516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","cli","data-transformation","interview-task","kaggle","nodejs","python","utilities"],"created_at":"2024-11-13T09:11:10.236Z","updated_at":"2026-04-12T15:05:09.521Z","avatar_url":"https://github.com/Geedium.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kaggle Big Data Dataset Tool\nThis tool filters and optimizes CSV files (**artists** and **tracks**) to generate claims more easily and saves records after filtering to a SQL database.\n\nImportant: Ensure you have the **~/.kaggle/kaggle.json** configuration file set up before running the tool. You can obtain this configuration from Kaggle's Public API. Additionally, Python must be installed on your system to run the CLI and successfully extract CSVs.\n\n## Prerequisites\n - Python\n - Node.js\n - PostgreSQL\n\n## Python\nEnsure Python is installed on your system. You can download it from the [official Python website](https://www.python.org/downloads/).\n\n## Node.js\nEnsure Node.js is installed on your system. You can download it from the [official Node.js website](https://nodejs.org/).\n\n## PostgreSQL\nEnsure PostgreSQL is installed and running on your system. 
You can download it from the [official PostgreSQL website](https://www.postgresql.org/download/).\n\n# Installing Node.js Dependencies\nInstall the necessary dependencies using your preferred package manager:\n```bash\nyarn install\n# or\nnpm install\n```\nThere are also alternatives like **pnpm**.\n\n# Environment Variables\nCreate a **.env** file in the root directory of your project and set the required environment variables listed in the **.env.example** file. These variables include database connection details, AWS S3 bucket information, and any other necessary configurations.\n\n# Running Phase\nFirst, build the project:\n```bash\nyarn build\n```\nThis command compiles the TypeScript source files to JavaScript. After running this command, there should be a **dist** folder containing the distribution JavaScript files.\n\n# 1. Ingesting Data from the Data Source\nTo download the **artists** and **tracks** CSV files, run:\n```bash\nyarn bake\n```\nThis command uses Python and the kaggle CLI to download the required datasets from Kaggle. Refer to the **installKaggle.mjs** script for details on how it works.\n\n# 2. Data Transformation\nTransform the data and upload it to S3:\n```bash\nyarn start -t\n```\nThis command reads the downloaded CSV files, filters and transforms the data according to specified criteria, and then uploads the processed data to an AWS S3 bucket. The filtering criteria include:\n - Ignoring tracks with no name.\n - Ignoring tracks shorter than 1 minute.\n - Loading only artists that have tracks after filtering.\n\nThe data transformation splits the track release date into separate columns (**year**, **month**, **day**) and maps the float danceability value of each track to a string label (**Low**, **Medium**, **High**).\n\n# 3. 
Pulling Data Directly from S3 (Optional)\nTo download the transformed data files from your S3 bucket (the bucket is configured in the source entry file; rebuild after changing it), run:\n```bash\nyarn start -f\n```\nThis command downloads the processed data files from your specified S3 bucket. It requires a network connection.\n\n# 4. Connecting to PostgreSQL\nTo load the transformed local CSV files into PostgreSQL as new records, run:\n```bash\nyarn start -c\n```\nThis command will:\n - Create the **artists** and **tracks** tables in your PostgreSQL database.\n - Insert the data into the respective tables.\n\n# 5. Data Processing\nTo create the SQL views (**track_info**, **most_energizing_tracks**, **tracks_with_artist_followers**), run:\n```bash\nyarn start -v\n```\nThis command sets up SQL views that perform the following tasks:\n - **track_info**: Selects track information including **id**, **name**, **popularity**, **energy**, **danceability**, and the number of artist followers.\n - **tracks_with_artist_followers**: Filters tracks to only include those where artists have followers.\n - **most_energizing_tracks**: Picks the most energizing track for each release year.\n\n# Handling Cases\n## Deleting Tracks and Artists\nTo drop records and tables, run:\n```bash\nyarn start -d\n```\nThis command will delete all records from the **artists** and **tracks** tables and drop the tables from your PostgreSQL database. Use this command with caution, as it will remove all data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeedium%2Fsphinx-polar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeedium%2Fsphinx-polar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeedium%2Fsphinx-polar/lists"}