{"id":31562728,"url":"https://github.com/ark2016/word-segmentation","last_synced_at":"2026-04-17T13:33:10.694Z","repository":{"id":316130723,"uuid":"1061284995","full_name":"ark2016/word-segmentation","owner":"ark2016","description":"Task: restoring missing spaces in text using NLP / DL / an algorithm.","archived":false,"fork":false,"pushed_at":"2025-09-22T20:22:42.000Z","size":30,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-05T04:51:21.789Z","etag":null,"topics":["ai","llm","ml","ollama","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ark2016.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-21T16:02:52.000Z","updated_at":"2025-09-22T20:23:40.000Z","dependencies_parsed_at":"2025-09-22T22:11:58.946Z","dependency_job_id":"20b91111-2686-404a-8b3c-709f483592fb","html_url":"https://github.com/ark2016/word-segmentation","commit_stats":null,"previous_names":["ark2016/word-segmentation"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ark2016/word-segmentation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ark2016%2Fword-segmentation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ark2016%2Fword-segmentation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/ark2016%2Fword-segmentation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ark2016%2Fword-segmentation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ark2016","download_url":"https://codeload.github.com/ark2016/word-segmentation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ark2016%2Fword-segmentation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31931464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-17T12:37:54.787Z","status":"ssl_error","status_checked_at":"2026-04-17T12:37:25.095Z","response_time":62,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llm","ml","ollama","word-segmentation"],"created_at":"2025-10-05T04:50:46.613Z","updated_at":"2026-04-17T13:33:10.665Z","avatar_url":"https://github.com/ark2016.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Word Segmentation Solution\n\nA solution for restoring missing spaces in Russian text using an LLM served through Ollama.\n\n## Model choice\n\nThe solution uses **QVikhr-3-1.7B-Instruction-noreasoning**, a very lightweight model (1.7B parameters) that is SOTA among Russian-language models in its class. It offers a good balance of quality and speed for NLP tasks.\n\n## Quick start\n\n### 1. 
Installing Ollama\n```bash\ncurl -fsSL https://ollama.com/install.sh | sh\n```\n\n### 2. Running the model\n```bash\nollama run hf.co/Vikhrmodels/QVikhr-3-1.7B-Instruction-noreasoning-GGUF:Q4_K_M\n```\n\n### 3. Setting up the environment\n```bash\npython3 -m venv venv\nsource venv/bin/activate\npip install pandas numpy requests\n```\n\n### 4. Running the processing\n```bash\nexport OLLAMA_API_URL=http://localhost:11434\npython space_restoration_solution.py\n```\n\n## Result\n\nThe `submission.csv` file will contain:\n- `id` - the record identifier\n- `predicted_positions` - the list of positions at which to insert spaces\n\n**To submit**: rename `submission_fixed_with_quotes.csv` so that it has a `.txt` extension before uploading it to the system.\n\n## Requirements\n\n- GPU with 4GB+ VRAM\n- CUDA-compatible graphics card\n- Python 3.8+\n\n## Testing\n\n```bash\npython test_solution.py\n```\n\n## Approach\n\n- **LLM-only solution**: only a language model is used, with no additional algorithms\n- **Zero-shot prompting**: the model works from the examples in the prompt, without fine-tuning\n- **Prompt optimization**: a specially tuned prompt that minimizes reasoning and yields exact positions\n\n## Performance\n\n- **Speed**: ~2-5 texts/sec on a Tesla T4\n- **Memory**: ~4GB VRAM\n- **Accuracy**: works well on short classified-ad texts\n\n## Troubleshooting\n\n```bash\n# Check that Ollama is running\nollama ps\n\n# Re-download the model\nollama pull hf.co/Vikhrmodels/QVikhr-3-1.7B-Instruction-noreasoning-GGUF:Q4_K_M\n```\n\n## Project structure\n\n```\nword-segmentation/\n├── space_restoration_solution.py  # Main solution\n├── test_solution.py              # Tests and checks\n├── fix_submission.py             # Formatting utility\n├── requirements.txt              # Dependencies\n├── dataset_1937770_3.txt         # Input data\n└── README.md                     # Documentation\n```\n\nP.S. 
It would be good to use DVC for the data, as well as for the model, but passing secrets to DVC is inconvenient\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fark2016%2Fword-segmentation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fark2016%2Fword-segmentation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fark2016%2Fword-segmentation/lists"}