{"id":17982015,"url":"https://github.com/jaykef/min-patchnizer","last_synced_at":"2025-03-25T19:30:42.488Z","repository":{"id":225477924,"uuid":"764033514","full_name":"Jaykef/min-patchnizer","owner":"Jaykef","description":"Minimal, clean code for video/image \"patchnization\" - a process commonly used in tokenizing visual data for use in a Transformer encoder. ","archived":false,"fork":false,"pushed_at":"2024-05-16T23:02:28.000Z","size":4609,"stargazers_count":8,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-05-17T00:21:45.473Z","etag":null,"topics":["computer-vision","nlp","patchnization","tokenization","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Jaykef.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-27T11:10:12.000Z","updated_at":"2024-05-16T23:02:31.000Z","dependencies_parsed_at":"2024-04-04T01:47:00.974Z","dependency_job_id":null,"html_url":"https://github.com/Jaykef/min-patchnizer","commit_stats":null,"previous_names":["jaykef/min-patchnizer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jaykef%2Fmin-patchnizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jaykef%2Fmin-patchnizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jaykef%2Fmin-patchnizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jaykef%2Fmin-patchnizer/manifests","owner
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Jaykef","download_url":"https://codeload.github.com/Jaykef/min-patchnizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222090825,"owners_count":16929472,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","nlp","patchnization","tokenization","transformer"],"created_at":"2024-10-29T18:12:41.795Z","updated_at":"2024-10-29T18:12:42.417Z","avatar_url":"https://github.com/Jaykef.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# min-patchnizer\n\nMinimal, clean code for video/image \"patchnization\" - a process commonly used in tokenizing visual data for use in a Transformer encoder. The code here first extracts still images (frames) from a video, splits the frames into smaller fixed-size patches, linearly embeds each of them, adds position embeddings, and then saves the resulting sequence of vectors for use in a Vision Transformer encoder. I tried training the resulting sequence vectors with Karpathy's minbpe and it took 2173.45 seconds per frame to tokenize. 
The whole \"patchnization\" took ~77.40s for a 20s video on my M2 Air.\n\n![IMG_5672](https://github.com/Jaykef/min-patchnizer/assets/11355002/de2eb521-58d5-4308-b061-19a32217cbb2)\n\u003cbr\u003e\u003cbr\u003e\n\nThe files in this repo work as follows:\n\n\u003cul\u003e\n  \u003cli\u003e\u003ca href=\"https://github.com/Jaykef/min-patchnizer/blob/main/patchnizer.py\"\u003epatchnizer.py\u003c/a\u003e: Holds code for a simple implementation of the three stages involved (extract_image_frames from video, reduce image_frames_to_patches of a fixed size of 16x16 pixels, then linearly_embed_patches into a 1D vector sequence with additional position embeddings).\u003c/li\u003e\n  \n  \u003cli\u003e\u003ca href=\"https://github.com/Jaykef/min-patchnizer/blob/main/patchnize.py\"\u003epatchnize.py\u003c/a\u003e: Performs the whole process with custom configs (patch_size, created dirs, video - I am using the \"dogs playing in snow\" video by Sora).\u003c/li\u003e\n\n  \u003cli\u003e\u003ca href=\"https://github.com/Jaykef/min-patchnizer/blob/main/patchnize.py\"\u003etrain.py\u003c/a\u003e: Trains the resulting one-dimensional vector sequence (linear_patch_embeddings + position_embeddings) on Karpathy's minbpe (a minimal implementation of the byte-pair encoding algorithm).\u003c/li\u003e\n\n  \u003cli\u003e\u003ca href=\"https://github.com/Jaykef/min-patchnizer/blob/main/patchnize.py\"\u003echeck.py\u003c/a\u003e: Checks whether the patch embeddings match the original image patches and then reconstructs the original image frames - this basically just does the reverse of the linear embedding.\u003c/li\u003e\n\u003c/ul\u003e\n\n\nThe whole process builds on the approach introduced in the Vision Transformer paper: \u003ca href=\"https://arxiv.org/abs/2010.11929\"\u003e\"An image is worth 16x16 words: Transformers for image recognition at scale.\"\u003c/a\u003e\n\nYouTube video: \u003ca href=\"https://youtu.be/eT1mJE4J38o?si=9uTeLo6eFoNmbJLt\"\u003eWatch Demo\u003c/a\u003e\n\n## 
Usage\n\nFirst patchnize:\n```\npython patchnize.py\n```\n  \nNext check: \n```\npython check.py\n``` \n\nThen train: \n```\npython train.py\n```\n\n## References\n\u003cul\u003e\n  \u003cli\u003e\u003ca href=\"https://openai.com/research/video-generation-models-as-world-simulators\"\u003eSORA Technical Report\u003c/a\u003e\u003c/li\u003e\n  \n  \u003cli\u003e\u003ca href=\"https://arxiv.org/abs/2010.11929\"\u003e\"An image is worth 16x16 words: Transformers for image recognition at scale\", Alexey Dosovitskiy et al.\u003c/a\u003e\u003c/li\u003e\n\n  \u003cli\u003e\u003ca href=\"https://github.com/karpathy/minbpe#:~:text=/-,minbpe,-Type\"\u003eminbpe by karpathy\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\n## License\nMIT\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaykef%2Fmin-patchnizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaykef%2Fmin-patchnizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaykef%2Fmin-patchnizer/lists"}