{"id":22502691,"url":"https://github.com/ejfox/criterion-embedding-viz","last_synced_at":"2025-07-14T15:33:36.648Z","repository":{"id":265982902,"uuid":"897026522","full_name":"ejfox/criterion-embedding-viz","owner":"ejfox","description":"Generate embeddings from every Criterion movie","archived":false,"fork":false,"pushed_at":"2025-06-20T20:13:23.000Z","size":1797,"stargazers_count":1,"open_issues_count":7,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-04T16:31:28.343Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ejfox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-01T21:50:43.000Z","updated_at":"2025-06-20T18:55:53.000Z","dependencies_parsed_at":"2024-12-01T23:34:43.604Z","dependency_job_id":null,"html_url":"https://github.com/ejfox/criterion-embedding-viz","commit_stats":null,"previous_names":["ejfox/criterion-embedding-viz"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ejfox/criterion-embedding-viz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejfox%2Fcriterion-embedding-viz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejfox%2Fcriterion-embedding-viz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejfox%2Fcriterion-embedding-viz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejfox%2Fcriterion-embedding-viz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ejfox","download_url":"https://codeload.github.com/ejfox/criterion-embedding-viz/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejfox%2Fcriterion-embedding-viz/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265312118,"owners_count":23745178,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-06T23:21:08.770Z","updated_at":"2025-07-14T15:33:36.634Z","avatar_url":"https://github.com/ejfox.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Criterion Embedding Visualization\r\n\r\n\u003cimg width=\"1012\" alt=\"Screenshot 2024-12-01 at 4 54 29 PM\" src=\"https://github.com/user-attachments/assets/59b4f762-3dd5-4c46-bd45-98bdff8f0535\"\u003e\r\n\r\nThis project is designed to create vector embeddings for Criterion movie titles and descriptions using the Nomic Embedding API. The embeddings can be used for advanced data analysis, clustering, and visualization, enabling deeper exploration of the Criterion Channel's catalog.\r\n\r\n## Note About Embeddings File\r\n\r\nThe `criterion_embeddings.json` file (~156MB) is stored on R2 and not in the Git repository due to its large size. Use the provided download script to fetch it.\r\n\r\n## Objectives\r\n\r\nThe primary objective of this project is to leverage natural language processing (NLP) techniques to generate meaningful embeddings for textual data in the Criterion movie dataset. These embeddings encode semantic relationships between movie titles and descriptions, which can be used in tasks such as similarity analysis, clustering, and visualization.\r\n\r\n## Provenance\r\n\r\nThe dataset utilized in this project originates from a publicly available spreadsheet shared on Reddit by [u/morbusiff](https://www.reddit.com/user/morbusiff). The spreadsheet contains detailed information on movies available on the Criterion Channel as of 2019. \r\n\r\n- **Source Spreadsheet**: [Criterion Channel Videos Spreadsheet](https://docs.google.com/spreadsheets/d/1-ctl5IGVUqfkCH48DFUbLx0iQai9r6BLG9NStMwxPSw/edit?gid=740795620#gid=740795620)\r\n- **Original Reddit Post**: [4,176 Criterion Channel Videos in a Spreadsheet](https://www.reddit.com/r/criterion/comments/bba5go/4176_criterion_channel_videos_in_a_spreadsheet/)\r\n\r\nWe acknowledge and thank [u/morbusiff](https://www.reddit.com/user/morbusiff) for compiling and sharing this valuable dataset.\r\n\r\n## Methodology\r\n\r\n1. **Data Input**:\r\n   - The dataset is provided in CSV format (`criterion_movies.csv`) and contains information such as titles, descriptions, directors, years, and links.\r\n\r\n2. **Embedding Generation**:\r\n   - The `index.js` script processes the dataset to generate embeddings using the [Nomic Embedding API](https://docs.nomic.ai/).\r\n   - Separate embeddings are created for both the **title** and **description** of each movie to capture different semantic representations.\r\n\r\n3. **Batch Processing**:\r\n   - The script processes data in batches to optimize API usage.\r\n   - Rate-limiting is implemented via the `bottleneck` library to respect API constraints.\r\n\r\n4. **Output**:\r\n   - Embeddings are saved in JSON format (`criterion_embeddings.json`), maintaining a structured representation of the data alongside the generated embeddings.\r\n\r\n## Features\r\n\r\n- **Efficient Batch Processing**: Groups multiple embeddings in a single API call to reduce overhead.\r\n- **Title and Description Embeddings**: Provides separate embeddings for both fields to allow fine-grained analysis.\r\n- **Progress Saving and Resumption**: Automatically resumes processing from the last completed batch after interruptions.\r\n- **Rate Limiting**: Ensures compliance with API constraints using `bottleneck`.\r\n\r\n## Requirements\r\n\r\n- Node.js (v14 or higher)\r\n- A valid Nomic API key\r\n\r\n## Installation\r\n\r\n1. Clone the repository:\r\n   ```bash\r\n   git clone https://github.com/ejfox/criterion-embedding-viz.git\r\n   cd criterion-embedding-viz\r\n   ```\r\n\r\n2. Install dependencies:\r\n   ```bash\r\n   npm install\r\n   ```\r\n\r\n3. Configure environment variables by creating a `.env` file:\r\n   ```bash\r\n   echo \"NOMIC_API_KEY=your_nomic_api_key\" \u003e .env\r\n   ```\r\n   \r\n   ### Advanced Configuration Options\r\n   ```bash\r\n   # Output format\r\n   OUTPUT_FORMAT=ndjson        # \"json\" or \"ndjson\" (newline-delimited JSON)\r\n   OUTPUT_FILE=criterion_embeddings.ndjson\r\n   \r\n   # Embedding configuration\r\n   TASK_TYPE=search_document   # \"search_document\", \"search_query\", \"clustering\", \"classification\"\r\n   DIMENSIONALITY=768          # 768 or 256 for Nomic\r\n   ```\r\n\r\n4. Place your dataset in the root directory as `criterion_movies.csv`.\r\n\r\n5. Download the embeddings file:\r\n   ```bash\r\n   ./download-embeddings.sh\r\n   ```\r\n   This downloads the pre-generated embeddings file (~156MB) from Cloudflare R2.\r\n\r\n## Execution\r\n\r\nRun the script to generate embeddings:\r\n```bash\r\nnode index.js\r\n```\r\n\r\n## Data Output\r\n\r\nThe script generates embeddings in `criterion_embeddings.json`. Each entry includes:\r\n- Metadata from the CSV dataset.\r\n- Separate embeddings for the movie title and description.\r\n\r\n### Sample JSON Output\r\n```json\r\n[\r\n  {\r\n    \"Title (Data retrieved 2019-06-21)\": \"Mulholland Dr.\",\r\n    \"Description\": \"Directed by David Lynch...\",\r\n    \"title_embedding\": [0.0256958, 0.00015819073, ...],\r\n    \"description_embedding\": [0.03456134, -0.0124586, ...]\r\n  },\r\n  ...\r\n]\r\n```\r\n\r\n## Applications\r\n\r\nThe generated embeddings can be used for:\r\n- Semantic similarity analysis between movies.\r\n- Clustering based on descriptive content.\r\n- Visualization of relationships within the dataset using dimensionality reduction techniques (e.g., PCA, t-SNE, UMAP).\r\n\r\n## Limitations\r\n\r\n- The embeddings are limited to the semantic information provided in titles and descriptions. Additional metadata (e.g., genre, director) could enhance future analyses.\r\n- Generated embeddings are dependent on the Nomic API's embedding model as of the time of execution.\r\n\r\n## Ethical Considerations\r\n\r\n- **Data Provenance**: The dataset was shared publicly and is used for analytical purposes. Attribution is provided to the original compiler.\r\n- **Intellectual Property**: Ensure proper use of Criterion Channel data in compliance with its terms of service and copyright regulations.\r\n\r\n## Future Enhancements / TODO\r\n\r\n### Wikipedia Enrichment\r\nConcept: Automatically find and embed Wikipedia articles for each movie to create richer embeddings:\r\n- Use Wikipedia API to search for each movie title + year + director\r\n- Implement human spot-checking interface to verify correct matches\r\n- Extract and chunk Wikipedia content by logical sections (plot, cast, production, reception, etc.)\r\n- Generate embeddings for each section separately\r\n- Could enable deeper semantic search like \"films about existentialism\" or \"movies with troubled productions\"\r\n- Store Wikipedia URLs and section embeddings alongside movie data\r\n\r\n### Multi-Provider Embedding Support\r\nMake it easy to swap between different embedding services:\r\n- **OpenRouter** (priority) - Access to multiple models through one API\r\n- OpenAI embeddings (text-embedding-3-small/large)\r\n- Cohere embeddings\r\n- Local embeddings (sentence-transformers)\r\n\r\nChallenges to solve:\r\n- Different providers use different dimensions (OpenAI: 1536/3072, Nomic: 768/256, etc.)\r\n- Need abstraction layer to handle different API formats\r\n- Store provider metadata with embeddings for compatibility\r\n- Consider dimension reduction techniques for cross-provider compatibility\r\n\r\n## Acknowledgments\r\n\r\nSpecial thanks to [u/morbusiff](https://www.reddit.com/user/morbusiff) for compiling and sharing the original dataset on Reddit.\r\n\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fejfox%2Fcriterion-embedding-viz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fejfox%2Fcriterion-embedding-viz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fejfox%2Fcriterion-embedding-viz/lists"}