{"id":23236991,"url":"https://github.com/nicolay-r/arekit-ss","last_synced_at":"2025-08-19T23:31:10.641Z","repository":{"id":74011628,"uuid":"575174134","full_name":"nicolay-r/arekit-ss","owner":"nicolay-r","description":"Low Resource Context Relation Sampler for contexts with relations for fact-checking and fine-tuning your LLM models, powered by AREkit","archived":false,"fork":false,"pushed_at":"2024-12-10T10:42:40.000Z","size":2098,"stargazers_count":3,"open_issues_count":7,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-10T11:35:50.063Z","etag":null,"topics":["dataset","datasets","datasets-preparation","factchecking","googletrans","googletranslate","ml","nlp","python","relations-extraction"],"latest_commit_sha":null,"homepage":"https://github.com/nicolay-r/AREkit","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nicolay-r.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-06T23:17:31.000Z","updated_at":"2024-12-10T10:42:45.000Z","dependencies_parsed_at":"2024-11-10T13:33:07.288Z","dependency_job_id":"27330b68-6589-401e-9807-18fcb741bf4f","html_url":"https://github.com/nicolay-r/arekit-ss","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r%2Farekit-ss","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r%2Farekit-ss/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r
%2Farekit-ss/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r%2Farekit-ss/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nicolay-r","download_url":"https://codeload.github.com/nicolay-r/arekit-ss/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230374271,"owners_count":18216044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","datasets","datasets-preparation","factchecking","googletrans","googletranslate","ml","nlp","python","relations-extraction"],"created_at":"2024-12-19T04:13:23.968Z","updated_at":"2024-12-19T04:13:24.554Z","avatar_url":"https://github.com/nicolay-r.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## arekit-ss 0.25.0\n\n![](https://img.shields.io/badge/Python-3.9-brightgreen.svg)\n![](https://img.shields.io/badge/AREkit-0.25.0-orange.svg)\n[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nicolay-r/arekit-ss/blob/master/arekit_ss.ipynb)\n[![PyPI downloads](https://img.shields.io/pypi/dm/arekit-ss.svg)](https://pypistats.org/packages/arekit-ss)\n\n\n### [📜 List of binded sources](https://github.com/nicolay-r/AREkit/wiki/Binded-Sources)\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"logo.png\"/\u003e\n\u003c/p\u003e\n\n`arekit-ss` [AREkit double \"s\"] -- is an object-pair context sampler \nfor [datasources](https://github.com/nicolay-r/AREkit/wiki/Binded-Sources), \npowered by 
[AREkit](https://github.com/nicolay-r/AREkit)\n\n\u003e **NOTE:** For custom text sampling, please follow the [ARElight](https://github.com/nicolay-r/ARElight) project.\n\n## Installation\n\nInstall dependencies:\n```bash\npip install git+https://github.com/nicolay-r/arekit-ss.git@0.25.0\n```\n\nDownload resources:\n```bash\npython -m arekit_ss.download_data\n```\n\n## Usage\n[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nicolay-r/arekit-ss/blob/master/arekit_ss.ipynb)\n\nExample of composing prompts:\n```bash\npython -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \\\n  --prompt \"For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'\" \\\n  --dest_lang en --docs_limit 1\n```\n\n\u003e **Mind the case (issue [#18](https://github.com/nicolay-r/arekit-ss/issues/18)):**\n\u003e switching to another language may affect the amount of extracted data, because the `terms_per_context`\n\u003e parameter crops each context to a fixed, predefined number of words.\n\n\u003cdetails\u003e\n\u003csummary\u003e\n\n## Parameters\n\u003c/summary\u003e\n\n* `source` -- source name from the list of the [supported sources](https://github.com/nicolay-r/arekit-ss/blob/master/arekit_ss/sources/src_list.py).\n    * `terms_per_context` -- number of words (terms) between the SOURCE and TARGET objects.\n    * `object-source-types` -- filter specific source object types\n    * `object-target-types` -- filter specific target object types\n    * `relation_types` -- list of relation types, with items separated by the `|` character; all types by default\n    * `splits` -- manual selection of the data-type splits to use for the sampling process; \n      types should be separated by the ':' sign, for example: 'train:test'\n* `sampler` -- List of the supported samplers:\n    * `nn` -- for CNN/LSTM architectures, including frame annotations from [RuSentiFrames](https://github.com/nicolay-r/RuSentiFrames).\n        * `no-vectorize` -- flag applicable only to `nn`; indicates that embeddings for features do not need to be generated\n    * `bert` -- BERT-based, single-input sequence.\n    * `prompt` -- prompt-based sampler for LLM systems [[prompt engineering guide]](https://github.com/dair-ai/Prompt-Engineering-Guide)\n        * `prompt` -- text of the prompt, which includes the following parameters:\n          * `{text}` is the original text of the sample\n          * `{s_val}` and `{t_val}` are the values of the source and target of the pair, respectively\n          * `{label_val}` is the value of the label\n* `writer` -- the output format of samples:\n    * `csv` -- for [AREnets](https://github.com/nicolay-r/AREnets) framework;\n    * `jsonl` -- for [OpenNRE](https://github.com/thunlp/OpenNRE) framework.\n    * `sqlite` -- SQLite-3.0 database.\n* `mask_entities` -- entity masking mode.\n* Text translation parameters:\n    * `src_lang` -- original language of the text.\n    * `dest_lang` -- target language of the text.\n* `output_dir` -- target directory for storing samples\n* Limiting the number of documents from the source:\n    * `docs_limit` -- number of documents to be considered for sampling from the whole source.\n    * `doc_ids` -- list of the document IDs.\n\u003c/details\u003e\n\n![output_prompts](https://github.com/nicolay-r/arekit-ss/assets/14871187/d1499f24-b2df-410b-98cc-8d4018de8c65)\n\n## Powered by\n\n* [AREkit framework](https://github.com/nicolay-r/AREkit)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicolay-r%2Farekit-ss","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnicolay-r%2Farekit-ss","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicolay-r%2Farekit-ss/lists"}