{"id":29126260,"url":"https://github.com/gitgoap/llm-multilingual-check","last_synced_at":"2025-07-10T01:06:52.894Z","repository":{"id":299836527,"uuid":"1004364668","full_name":"gitgoap/LLM-Multilingual-check","owner":"gitgoap","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-18T15:24:16.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-18T15:35:10.921Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gitgoap.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-18T14:07:51.000Z","updated_at":"2025-06-18T15:24:19.000Z","dependencies_parsed_at":"2025-06-18T15:48:52.897Z","dependency_job_id":null,"html_url":"https://github.com/gitgoap/LLM-Multilingual-check","commit_stats":null,"previous_names":["gitgoap/llm-multilingual-check"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gitgoap/LLM-Multilingual-check","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitgoap%2FLLM-Multilingual-check","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitgoap%2FLLM-Multilingual-check/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitgoap%2FLLM-Multilingual-check/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitgoap%2FLLM-Multilingual-check/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/gitgoap","download_url":"https://codeload.github.com/gitgoap/LLM-Multilingual-check/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitgoap%2FLLM-Multilingual-check/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262680554,"owners_count":23347597,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-29T23:03:54.002Z","updated_at":"2025-07-10T01:06:52.887Z","avatar_url":"https://github.com/gitgoap.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM-Multilingual-Check\n\n## Project: *\"How Does a Multilingual Language Model Handle Multiple Languages?\"*\n\n### Objective\nUnderstand how multilingual models (specifically **BLOOM-1.7B**) represent and transfer knowledge across languages.\n\n---\n\n## Task 1: Embedding Similarity\n\n- Build a parallel word list (500–5000 entries) in **English**, **French**, and **Portuguese**, all languages supported by BLOOM.\n- **Compute cosine similarity** between word embeddings across languages to check whether “meaning” aligns.\n\n**Percentage distribution** of the three languages in the BLOOM-1.7B training data ([source](https://huggingface.co/bigscience/bloom-1b7)):\n- English: **31.3%**\n- French: **13.5%**\n- Portuguese: **5.2%**\n\n### Task 1 Result\n**Average Cosine Similarity:**\n- English–French: **0.9365**\n- French–Portuguese: **0.9349**\n- Portuguese–English: **0.9153**\n\nGPU used for the task: `NVIDIA T4` on `Google Colab`\n\n#### Verdict: High average 
cosine similarity across all three pairs suggests that BLOOM-1.7B represents words with the same meaning similarly across languages. Portuguese, with only a 5.2% share of the BLOOM training data, still yields cosine similarities close to that of the English–French pair.\n\n---\n\n## Task 2: Cross-Lingual Transfer (Zero-Shot Learning)\n\n- **Fine-tune** BLOOM on a downstream task (e.g., **sentiment classification**) using only **English** data (a high-resource language).\n- **Test** the model on the same task in a **low-resource language** (e.g., **Hindi** or **Swahili**) *without any additional training*.\n\n### Goal\nCheck whether BLOOM can generalize and transfer task knowledge across languages, a key indicator of its multilingual capabilities and usefulness in **low-resource language settings**.\n\nBLOOM was chosen for its **multilingual pretraining**, followed by fine-tuning.\n\n---\n\n### Future Note\nTask 1 can be scaled to 15-20 languages and used to compare multiple LLMs.\n\nFeel free to contribute or explore further improvements!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitgoap%2Fllm-multilingual-check","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgitgoap%2Fllm-multilingual-check","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitgoap%2Fllm-multilingual-check/lists"}