{"id":13644099,"url":"https://github.com/kyegomez/gemini","last_synced_at":"2025-04-04T10:04:19.542Z","repository":{"id":191245512,"uuid":"684341554","full_name":"kyegomez/Gemini","owner":"kyegomez","description":"The open source implementation of Gemini, the model that will \"eclipse ChatGPT\" by Google","archived":false,"fork":false,"pushed_at":"2024-04-08T12:07:27.000Z","size":695,"stargazers_count":354,"open_issues_count":4,"forks_count":40,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-04-14T03:24:12.752Z","etag":null,"topics":["ai","artificial-intelligence","gemini","gpt4","machine-learning","ml","multi-modality","multimodla"],"latest_commit_sha":null,"homepage":"https://discord.gg/GYbXvDGevY","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kyegomez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null},"funding":{"github":["kyegomez"]}},"created_at":"2023-08-29T00:10:58.000Z","updated_at":"2024-04-22T13:34:44.636Z","dependencies_parsed_at":"2023-08-29T01:20:53.065Z","dependency_job_id":"6de3d9c2-c882-45a4-8aa8-17076293c5b1","html_url":"https://github.com/kyegomez/Gemini","commit_stats":null,"previous_names":["kyegomez/gemini"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FGemini","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FGemini/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FGemini/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FGemini/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kyegomez","download_url":"https://codeload.github.com/kyegomez/Gemini/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247154503,"owners_count":20892867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","gemini","gpt4","machine-learning","ml","multi-modality","multimodla"],"created_at":"2024-08-02T01:01:57.543Z","updated_at":"2025-04-04T10:04:19.517Z","avatar_url":"https://github.com/kyegomez.png","language":"Python","funding_links":["https://github.com/sponsors/kyegomez"],"categories":["CLIs"],"sub_categories":[],"readme":"[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)\n\n# Gemini\n\n![gemini](gemini.png)\n\nThe open source implementation of Gemini, the model that will \"eclipse ChatGPT\", it seems to work by directly taking in all modalities all at once into a transformer with special decoders for text or img generation!\n\n[Join the Agora discord channel to help with the implementation!](https://discord.gg/CMDpRxCV8g) and [Here is the project board:](https://github.com/users/kyegomez/projects/11/views/1)\n\nThe input sequences for Gemini consist of texts, audio, images, and videos. These inputs are transformed into tokens, which are then processed by a transformer. Subsequently, conditional decoding takes place to generate image outputs. Interestingly, the architecture of Gemini bears resemblance to Fuyu's architecture but is expanded to encompass multiple modalities. Instead of utilizing a visual transformer (vit) encoder, Gemini simply feeds image embeddings directly into the transformer. For Gemini, the token inputs will likely be indicated by special modality tokens such as [IMG], \u003cimg\u003e, [AUDIO], or \u003caudio\u003e. Codi, a component of Gemini, also employs conditional generation and makes use of the tokenized outputs. To implement this model effectively, I intend to initially focus on the image embeddings to ensure their smooth integration. Subsequently, I will proceed with incorporating audio embeddings and then video embeddings.\n\n# Install\n`pip3 install gemini-torch`\n\n\n## Usage\n\n### Gemini Transformer Usage\n- Base transformer\n- Multi Grouped Query Attn / flash attn\n- rope\n- alibi\n- xpos\n- qk norm\n- no pos embeds\n- kv cache\n\n```python\nimport torch\n\nfrom gemini_torch.model import Gemini\n\n# Initialize model with smaller dimensions\nmodel = Gemini(\n    num_tokens=50432,\n    max_seq_len=4096,  # Reduced from 8192\n    dim=1280,  # Reduced from 2560\n    depth=16,  # Reduced from 32\n    dim_head=64,  # Reduced from 128\n    heads=12,  # Reduced from 24\n    use_abs_pos_emb=False,\n    attn_flash=True,\n    attn_kv_heads=2,\n    qk_norm=True,\n    attn_qk_norm=True,\n    attn_qk_norm_dim_scale=True,\n)\n\n# Text shape: [batch, seq_len, dim]\ntext = torch.randint(0, 50432, (1, 4096))  # Reduced seq_len from 8192\n\n# Apply model to text\ny = model(\n    text,\n)\n\n# Output shape: [batch, seq_len, dim]\nprint(y)\n```\n--------\n\n### Full Multi-Modal Gemini \n- Processes images and audio through a series of reshapes\n- Ready to train for production grade usage\n- Hyper optimized with flash attention, qk norm, and other methods\n\n```python\nimport torch\n\nfrom gemini_torch.model import Gemini\n\n# Initialize model with smaller dimensions\nmodel = Gemini(\n    num_tokens=10000,  # Reduced from 50432\n    max_seq_len=1024,  # Reduced from 4096\n    dim=320,  # Reduced from 1280\n    depth=8,  # Reduced from 16\n    dim_head=32,  # Reduced from 64\n    heads=6,  # Reduced from 12\n    use_abs_pos_emb=False,\n    attn_flash=True,\n    attn_kv_heads=2,\n    qk_norm=True,\n    attn_qk_norm=True,\n    attn_qk_norm_dim_scale=True,\n    post_fusion_norm=True,\n    post_modal_transform_norm=True,\n)\n\n# Text shape: [batch, seq_len, dim]\ntext = torch.randint(0, 10000, (1, 1024))  # Reduced seq_len from 4096\n\n# Img shape: [batch, channels, height, width]\nimg = torch.randn(1, 3, 64, 64)  # Reduced height and width from 128\n\n# Audio shape: [batch, audio_seq_len, dim]\naudio = torch.randn(1, 32)  # Reduced audio_seq_len from 64\n\n# Apply model to text and img\ny, _ = model(text=text, img=img, audio=audio)\n\n# Output shape: [batch, seq_len, dim]\nprint(y)\nprint(y.shape)\n\n\n# After much training\nmodel.eval()\n\ntext = tokenize(texts)\nlogits = model(text)\ntext = detokenize(logits)\n```\n------\n\n\n## LongGemini\nAn implementation of Gemini with Ring Attention, no multi-modality processing yet.\n\n```python\nimport torch\nfrom gemini_torch import LongGemini\n\n# Text tokens\nx = torch.randint(0, 10000, (1, 1024))\n\n# Create an instance of the LongGemini model\nmodel = LongGemini(\n    dim=512,  # Dimension of the input tensor\n    depth=32,  # Number of transformer blocks\n    dim_head=128,  # Dimension of the query, key, and value vectors\n    long_gemini_depth=9,  # Number of long gemini transformer blocks\n    heads=24,  # Number of attention heads\n    qk_norm=True,  # Whether to apply layer normalization to query and key vectors\n    ring_seq_size=512,  # The size of the ring sequence\n)\n\n# Apply the model to the input tensor\nout = model(x)\n\n# Print the output tensor\nprint(out)\n\n```\n\n\n## Tokenizer\n- Sentencepiece, tokenizer\n- We're using the same tokenizer as LLAMA with special tokens denoting the beginning and end of the multi modality tokens.\n- Does not fully process img, audio, or videos now we need help on that\n\n```python\nfrom gemini_torch.tokenizer import MultimodalSentencePieceTokenizer\n\n# Example usage\ntokenizer_name = \"hf-internal-testing/llama-tokenizer\"\ntokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)\n\n# Encoding and decoding examples\nencoded_audio = tokenizer.encode(\"Audio description\", modality=\"audio\")\ndecoded_audio = tokenizer.decode(encoded_audio)\n\nprint(\"Encoded audio:\", encoded_audio)\nprint(\"Decoded audio:\", decoded_audio)\n```\n\n\n\n# References\n* Combine Reinforcment learning with modular pretrained transformer, multi-modal capabilities, image, audio, \n* self improving mechanisms like robocat\n* PPO? or MPO\n* get good at backtracking and exploring alternative paths\n* speculative decoding\n* Algorithm of Thoughts\n* RLHF\n* [Gemini Report](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf)\n* [Gemini Landing Page](https://deepmind.google/technologies/gemini/#introduction)\n\n\n# Todo\n\n- [ ] [Check out the project board for more todos](https://github.com/users/kyegomez/projects/11/views/1)\n\n\n- [x] Implement the img feature embedder and align imgs with text and pass into transformer: ```Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce\ntext and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own\nfoundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al.,\n2022), with the important distinction that the models are multimodal from the beginning and can\nnatively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).```\n\n- [x] Implement the audio processing using USM by Google:```In addition, Gemini can directly ingest audio signals at\n16kHz from Universal Speech Model (USM) (Zhang et al., 2023) features. This enables the model to\ncapture nuances that are typically lost when the audio is naively mapped to a text input (for example,\nsee audio understanding demo on the website).```\n\n\n- [ ] Video Processing Technique: \"\nVideo understanding is accomplished by encoding the video as a sequence of frames in the large\ncontext window. Video frames or images can be interleaved naturally with text or audio as part of the\nmodel input\"\n\n- [ ] Prompting Technique: ``` We find Gemini Ultra achieves highest\naccuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022)\nthat accounts for model uncertainty. The model produces a chain of thought with k samples, for\nexample 8 or 32. If there is a consensus above a preset threshold (selected based on the validation\nsplit), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood\nchoice without chain of thought. We refer the reader to appendix for a detailed breakdown of how\nthis approach compares with only chain-of-thought prompting or only greedy sampling.```\n\n\n\n- [ ] Train a 1.8B + 3.25 Model: ```Nano-1 and Nano-2 model sizes are only 1.8B and 3.25B\nparameters respectively. Despite their size, they show exceptionally strong performance on factuality,\ni.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyegomez%2Fgemini","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkyegomez%2Fgemini","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyegomez%2Fgemini/lists"}