{"id":15014176,"url":"https://github.com/megagonlabs/ginza-transformers","last_synced_at":"2025-04-12T06:05:10.035Z","repository":{"id":43403808,"uuid":"379536535","full_name":"megagonlabs/ginza-transformers","owner":"megagonlabs","description":"Use custom tokenizers in spacy-transformers","archived":false,"fork":false,"pushed_at":"2022-08-09T09:21:51.000Z","size":33,"stargazers_count":16,"open_issues_count":2,"forks_count":5,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-26T01:24:32.893Z","etag":null,"topics":["ginza","natural-language-processing","nlp","spacy","spacy-transformers","sudachitra","tokenizers","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/megagonlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-23T08:42:11.000Z","updated_at":"2024-07-01T00:43:15.000Z","dependencies_parsed_at":"2022-09-26T18:41:24.766Z","dependency_job_id":null,"html_url":"https://github.com/megagonlabs/ginza-transformers","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fginza-transformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fginza-transformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fginza-transformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fginza-transformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/megagonlabs","download_url":"https://codeload.github.com/megagonlabs/ginza-transformers/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248142907,"owners_count":21054672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ginza","natural-language-processing","nlp","spacy","spacy-transformers","sudachitra","tokenizers","transformers"],"created_at":"2024-09-24T19:45:17.649Z","updated_at":"2025-04-12T06:05:10.011Z","avatar_url":"https://github.com/megagonlabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ginza-transformers: Use custom tokenizers in spacy-transformers\n\n`ginza-transformers` is a simple extension of [spacy-transformers](https://github.com/explosion/spacy-transformers) that lets you use custom tokenizers (defined outside of [huggingface/transformers](https://huggingface.co/transformers/)) in the `transformer` pipeline component of [spaCy v3](https://spacy.io/usage/v3). 
`ginza-transformers` also provides the ability to download models from the [Hugging Face Hub](https://huggingface.co/models) automatically at run time.\n\n## Fallback mechanisms\n`ginza-transformers` has two fallback mechanisms.\n\n### Custom tokenizer fallback\nA custom tokenizer specified in the `components.transformer.model.tokenizer_config.tokenizer_class` attribute of the `config.cfg` of a spaCy language model package is loaded as follows.\n- `ginza-transformers` first tries to import the tokenizer class in the standard manner of `huggingface/transformers` (via `AutoTokenizer.from_pretrained()`)\n- If `AutoTokenizer.from_pretrained()` raises a `ValueError`, the fallback logic of `ginza-transformers` tries to import the class via `importlib.import_module` using the `tokenizer_class` value\n\n### Model loading at run time\nModel files published on the Hugging Face Hub are downloaded at run time as follows.\n- `ginza-transformers` first tries to load the local model directory (i.e. 
`/${local_spacy_model_dir}/transformer/model/`)\n- If an `OSError` is raised, the first fallback passes the model name specified in the `components.transformer.model.name` attribute of `config.cfg` to `AutoModel.from_pretrained()` with the `local_files_only=True` option, so it looks only in the local cache and does not contact the Hugging Face Hub at this point\n- If the first fallback also raises an `OSError`, the second fallback calls `AutoModel.from_pretrained()` without the `local_files_only` option, so it searches for the specified model name on the Hugging Face Hub\n\n## How to use\nBefore running the `spacy train` command, make sure that [spaCy is working with CUDA support](https://spacy.io/usage#gpu), and then install this package:\n```console\npip install -U ginza-transformers\n```\n\nThe `config.cfg` used for analysis needs different settings from the one used for `spacy train`.\n\n### Setting for training phase\n[Here is an example](https://github.com/megagonlabs/ginza/blob/develop/config/ja_ginza_electra.cfg) of spaCy's `config.cfg` for the training phase.\nWith this config, `ginza-transformers` employs [`SudachiTra`](https://github.com/WorksApplications/SudachiTra) as the transformer tokenizer and uses [`megagonlabs/transformers-ud-japanese-electra-base-discriminator`](https://huggingface.co/megagonlabs/transformers-ud-japanese-electra-base-discriminator) as the pretrained transformer model.\nThe training-phase attributes that differ from the defaults of the spacy-transformers model are as follows:\n```\n[components.transformer.model]\n@architectures = \"ginza-transformers.TransformerModel.v1\"\nname = \"megagonlabs/transformers-ud-japanese-electra-base-discriminator\"\n\n[components.transformer.model.tokenizer_config]\nuse_fast = false\ntokenizer_class = \"sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer\"\ndo_lower_case = false\ndo_word_tokenize = 
true\ndo_subword_tokenize = true\nword_tokenizer_type = \"sudachipy\"\nsubword_tokenizer_type = \"wordpiece\"\nword_form_type = \"dictionary_and_surface\"\n\n[components.transformer.model.tokenizer_config.sudachipy_kwargs]\nsplit_mode = \"A\"\ndict_type = \"core\"\n```\n\n### Setting for analysis phase\n[Here is an example](https://github.com/megagonlabs/ginza/blob/develop/config/ja_ginza_electra.analysis.cfg) of `config.cfg` for the analysis phase.\nThis config references [`megagonlabs/transformers-ud-japanese-electra-base-ginza`](https://huggingface.co/megagonlabs/transformers-ud-japanese-electra-base-ginza). The transformer model specified in `components.transformer.model.name` is downloaded from the Hugging Face Hub at run time.\nThe analysis-phase attributes that differ from the training phase are as follows:\n```\n[components.transformer]\nfactory = \"transformer_custom\"\n\n[components.transformer.model]\nname = \"megagonlabs/transformers-ud-japanese-electra-base-ginza\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fginza-transformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmegagonlabs%2Fginza-transformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fginza-transformers/lists"}