{"id":14997632,"url":"https://github.com/code-kern-ai/refinery-python-sdk","last_synced_at":"2025-07-16T03:45:52.875Z","repository":{"id":39674751,"uuid":"374332384","full_name":"code-kern-ai/refinery-python-sdk","owner":"code-kern-ai","description":"Official Python SDK for Kern AI refinery.","archived":false,"fork":false,"pushed_at":"2024-11-14T10:00:21.000Z","size":175,"stargazers_count":19,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"dev","last_synced_at":"2025-06-16T21:18:11.837Z","etag":null,"topics":["active-learning","data-centric-ai","deep-learning","labeling","labeling-tool","machine-learning","natural-language-processing","neural-search","nlp","python","sdk","spacy","supervised-learning","text-annotation","text-classification","transformer"],"latest_commit_sha":null,"homepage":"https://www.kern.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/code-kern-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-06T10:36:37.000Z","updated_at":"2025-02-25T00:51:26.000Z","dependencies_parsed_at":"2024-05-14T13:43:15.120Z","dependency_job_id":"ac5d71b8-f6a5-442d-a4ae-1c3c4b3e9062","html_url":"https://github.com/code-kern-ai/refinery-python-sdk","commit_stats":null,"previous_names":["code-kern-ai/refinery-python"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/code-kern-ai/refinery-python-sdk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-kern-ai%2Frefinery-python-sdk","tags_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/code-kern-ai%2Frefinery-python-sdk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-kern-ai%2Frefinery-python-sdk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-kern-ai%2Frefinery-python-sdk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/code-kern-ai","download_url":"https://codeload.github.com/code-kern-ai/refinery-python-sdk/tar.gz/refs/heads/dev","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-kern-ai%2Frefinery-python-sdk/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265479841,"owners_count":23773625,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["active-learning","data-centric-ai","deep-learning","labeling","labeling-tool","machine-learning","natural-language-processing","neural-search","nlp","python","sdk","spacy","supervised-learning","text-annotation","text-classification","transformer"],"created_at":"2024-09-24T17:05:17.338Z","updated_at":"2025-07-16T03:45:52.836Z","avatar_url":"https://github.com/code-kern-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![refinery repository](https://uploads-ssl.webflow.com/61e47fafb12bd56b40022a49/62cf1c3cb8272b1e9c01127e_refinery%20sdk%20banner.png)](https://github.com/code-kern-ai/refinery)\n[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)\n[![pypi 
1.4.0](https://img.shields.io/badge/pypi-1.4.0-yellow.svg)](https://pypi.org/project/refinery-python-sdk/1.4.0/)\n\nThis is the official Python SDK for [*refinery*](https://github.com/code-kern-ai/refinery), the **open-source** data-centric IDE for NLP.\n\n**Table of Contents**\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Creating a `Client` object](#creating-a-client-object)\n  - [Fetching labeled data](#fetching-labeled-data)\n  - [Fetching lookup lists](#fetching-lookup-lists)\n  - [Upload files](#upload-files)\n  - [Adapters](#adapters)\n    - [Sklearn](#sklearn-adapter)\n    - [PyTorch](#pytorch-adapter)\n    - [HuggingFace](#hugging-face-adapter)\n    - [Rasa](#rasa-adapter)\n  - [Callbacks](#callbacks)\n    - [Sklearn](#sklearn-callback)\n    - [PyTorch](#pytorch-callback)\n    - [HuggingFace](#huggingface-callback)\n    - [Generic](#generic-callback)\n- [Contributing](#contributing)\n- [License](#license)\n- [Contact](#contact)\n\nIf you like what we're working on, please leave a ⭐!\n\n## Installation\n\nYou can set up this SDK either by running `$ pip install refinery-python-sdk`, or by cloning this repository and running `$ pip install -r requirements.txt`.\n\n## Usage\n\n### Creating a `Client` object\nOnce you have installed the package, you can create a `Client` object from any Python terminal as follows:\n\n```python\nfrom refinery import Client\n\nuser_name = \"your-username\" # this is the email you log in with\npassword = \"your-password\"\nproject_id = \"your-project-id\" # can be found in the URL of the web application\n\nclient = Client(user_name, password, project_id)\n# if you run the application locally, please use the following instead:\n# client = Client(user_name, password, project_id, uri=\"http://localhost:4455\")\n```\n\nThe `project_id` can be found in your browser, e.g. 
if you run the app on your localhost: `http://localhost:4455/app/projects/{project_id}/overview`\n\nAlternatively, you can provide a `secrets.json` file in the directory where you want to run the SDK, structured as follows:\n```json\n{\n    \"user_name\": \"your-username\",\n    \"password\": \"your-password\",\n    \"project_id\": \"your-project-id\"\n}\n```\n\nAgain, if you run on your localhost, you should also provide `\"uri\": \"http://localhost:4455\"`. Afterwards, you can access the client like this:\n\n```python\nclient = Client.from_secrets_file(\"secrets.json\")\n```\n\nWith the `Client`, you can easily integrate your data into any kind of system, be it a custom implementation, an AutoML system or a plain data analytics framework 🚀\n\n### Fetching labeled data\n\nNow, you can easily fetch the data from your project:\n```python\ndf = client.get_record_export(tokenize=False)\n# if you set tokenize=True (default), the project-specific \n# spaCy tokenizer will process your textual data\n```\n\nAlternatively, you can also just run `rsdk pull` in your CLI, given that you have provided the `secrets.json` file in the same directory.\n\nThe `df` contains both your originally uploaded data (e.g. `headline` and `running_id` if you uploaded records like `{\"headline\": \"some text\", \"running_id\": 1234}`), and a triplet for each labeling task you create. This triplet consists of the manual labels, the weakly supervised labels, and their confidence. For extraction tasks, this data is at token level.\n\nAn example export file looks like this:\n```json\n[\n  {\n    \"running_id\": \"0\",\n    \"Headline\": \"T. 
Rowe Price (TROW) Dips More Than Broader Markets\",\n    \"Date\": \"Jun-30-22 06:00PM\\u00a0\\u00a0\",\n    \"Headline__Sentiment Label__MANUAL\": null,\n    \"Headline__Sentiment Label__WEAK_SUPERVISION\": \"Negative\",\n    \"Headline__Sentiment Label__WEAK_SUPERVISION__confidence\": \"0.6220\"\n  }\n]\n```\n\nIn this example, there is no manual label, but a weakly supervised label `\"Negative\"` has been set with 62.2% confidence.\n\n### Fetching lookup lists\nIn your project, you can create lookup lists to implement distant supervision heuristics. To fetch your lookup list(s), you can either get all or fetch one by its list id.\n```python\nlist_id = \"your-list-id\"\nlookup_list = client.get_lookup_list(list_id)\n```\n\nThe list id can be found in your browser URL when you're on the details page of a lookup list, e.g. when you run on localhost: `http://localhost:4455/app/projects/{project_id}/knowledge-base/{list_id}`.\n\nAlternatively, you can pull all lookup lists:\n```python\nlookup_lists = client.get_lookup_lists()\n```\n\n### Upload files\nYou can import files directly from your machine to your application:\n\n```python\nfile_path = \"my/file/path/data.json\"\nupload_was_successful = client.post_file_import(file_path)\n```\n\nWe use Pandas to process the data you upload, so you can also provide `import_file_options` for the file type you use. Currently, you need to provide them as a `\\n`-separated string (e.g. `\"quoting=1\\nsep=';'\"`). 
We'll adapt this in the future to work with dictionaries instead.\n\nAlternatively, you can run `rsdk push \u003cpath-to-your-file\u003e` via the CLI, given that you have provided the `secrets.json` file in the same directory.\n\n**Make sure that you've selected the correct project beforehand, and that your file fits the data schema of existing records in your project!**\n\n### Adapters\n\n#### Sklearn Adapter\nYou can use *refinery* to directly pull data into a format you can use to build [sklearn](https://github.com/scikit-learn/scikit-learn) models. This can look as follows:\n\n```python\nfrom refinery.adapter.sklearn import build_classification_dataset\nfrom sklearn.tree import DecisionTreeClassifier\n\ndata = build_classification_dataset(client, \"headline\", \"__clickbait\", \"distilbert-base-uncased\")\n\nclf = DecisionTreeClassifier()\nclf.fit(data[\"train\"][\"inputs\"], data[\"train\"][\"labels\"])\n\npred_test = clf.predict(data[\"test\"][\"inputs\"])\naccuracy = (pred_test == data[\"test\"][\"labels\"]).mean()\n```\n\nBy the way, we highly recommend combining this with [Truss](https://github.com/basetenlabs/truss) for easy model serving!\n\n#### PyTorch Adapter\nIf you want to build a [PyTorch](https://github.com/pytorch/pytorch) network, you can build the `train_loader` and `test_loader` as follows:\n\n```python\nfrom refinery.adapter.torch import build_classification_dataset\ntrain_loader, test_loader, encoder, index = build_classification_dataset(\n    client, \"headline\", \"__clickbait\", \"distilbert-base-uncased\"\n)\n```\n\n#### Hugging Face Adapter\nTransformers are great, but you often want to fine-tune them for your downstream task. 
With *refinery*, you can do so easily by letting the SDK build a dataset that you can use as a plug-and-play base for your training:\n\n```python\nfrom refinery.adapter import transformers\ndataset, mapping = transformers.build_dataset(client, \"headline\", \"__clickbait\")\n```\n\nFrom here, you can follow the [finetuning example](https://huggingface.co/docs/transformers/training) provided in the official Hugging Face documentation. A next step could look as follows:\n\n```python\nfrom transformers import (\n  AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n)\nimport numpy as np\nfrom datasets import load_metric\n\ntokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n\ndef tokenize_function(examples):\n    return tokenizer(examples[\"headline\"], padding=\"max_length\", truncation=True)\n\n# tokenize the train and test splits in batches\ntokenized_datasets = dataset.map(tokenize_function, batched=True)\nmodel = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-uncased\", num_labels=2)\nmetric = load_metric(\"accuracy\")\n\ndef compute_metrics(eval_pred):\n    logits, labels = eval_pred\n    predictions = np.argmax(logits, axis=-1)\n    return metric.compute(predictions=predictions, references=labels)\n\ntraining_args = TrainingArguments(output_dir=\"test_trainer\", evaluation_strategy=\"epoch\")\n\n# use small shuffled subsets to keep the example fast\nsmall_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(1000))\nsmall_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(1000))\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=small_train_dataset,\n    eval_dataset=small_eval_dataset,\n    
compute_metrics=compute_metrics,\n)\n\ntrainer.train()\n\ntrainer.save_model(\"path/to/model\")\n```\n\n#### Rasa Adapter\n*refinery* is well suited for building chatbots with [Rasa](https://github.com/RasaHQ/rasa). We've built an adapter with which you can easily create the required Rasa training data directly from *refinery*.\n\nTo do so, run the following:\n\n```python\nfrom refinery.adapter import rasa\n\nrasa.build_intent_yaml(\n  client,\n  \"text\",\n  \"__intent__WEAK_SUPERVISION\"\n)\n```\n\nThis will create a `.yml` file looking as follows:\n\n```yml\nnlu:\n- intent: check_balance\n  examples: |\n    - how much do I have on my savings account\n    - how much money is in my checking account\n    - What's the balance on my credit card account\n```\n\nIf you want to provide a metadata-level label (such as sentiment), you can pass the optional argument `metadata_label_task`:\n\n```python\nfrom refinery.adapter import rasa\n\nrasa.build_intent_yaml(\n  client,\n  \"text\",\n  \"__intent__WEAK_SUPERVISION\",\n  metadata_label_task=\"__sentiment__WEAK_SUPERVISION\"\n)\n```\n\nThis will create a file like this:\n```yml\nnlu:\n- intent: check_balance\n  metadata:\n    sentiment: neutral\n  examples: |\n    - how much do I have on my savings account\n    - how much money is in my checking account\n    - What's the balance on my credit card account\n```\n\nAnd if you have entities in your texts which you'd like to recognize, simply add the `tokenized_label_task` argument:\n\n```python\nfrom refinery.adapter import rasa\n\nrasa.build_intent_yaml(\n  client,\n  \"text\",\n  \"__intent__WEAK_SUPERVISION\",\n  metadata_label_task=\"__sentiment__WEAK_SUPERVISION\",\n  tokenized_label_task=\"text__entities__WEAK_SUPERVISION\"\n)\n```\n\nThis not only injects the label names at token level, but also creates lookup lists for your chatbot:\n\n```yml\nnlu:\n- intent: check_balance\n  metadata:\n    sentiment: neutral\n  examples: |\n    - how much do I have on my 
[savings](account) account\n    - how much money is in my [checking](account) account\n    - What's the balance on my [credit card account](account)\n- lookup: account\n  examples: |\n    - savings\n    - checking\n    - credit card account\n```\n\nPlease make sure to also create the other required files (`domain.yml`, `data/stories.yml` and `data/rules.yml`) if you want to train your Rasa chatbot. For further reference, see the Rasa [documentation](https://rasa.com/docs/rasa).\n\n\n### Callbacks\nIf you want to feed your production model's predictions back into *refinery*, you can do so with any version greater than [1.2.1](https://github.com/code-kern-ai/refinery/releases/tag/v1.2.1).\n\nTo do so, we have a generic interface and framework-specific classes.\n\n#### Sklearn Callback\nIf you want to train a scikit-learn model and feed its outputs back into *refinery*, you can do so easily as follows:\n\n```python\nfrom sklearn.linear_model import LogisticRegression\nclf = LogisticRegression() # we use this as an example, but you can use any model implementing predict_proba\n\nfrom refinery.adapter.sklearn import build_classification_dataset\ndata = build_classification_dataset(client, \"headline\", \"__clickbait\", \"distilbert-base-uncased\")\nclf.fit(data[\"train\"][\"inputs\"], data[\"train\"][\"labels\"])\n\nfrom refinery.callbacks.sklearn import SklearnCallback\ncallback = SklearnCallback(\n    client, \n    clf,\n    \"clickbait\", \n)\n\n# executing this will call the refinery API with batches of size 32, so your data is pushed to the app\ncallback.run(data[\"train\"][\"inputs\"], data[\"train\"][\"index\"])\ncallback.run(data[\"test\"][\"inputs\"], data[\"test\"][\"index\"])\n```\n\n#### PyTorch Callback\nFor PyTorch, the procedure is very similar. 
You can do as follows:\n\n```python\nfrom refinery.adapter.torch import build_classification_dataset\ntrain_loader, test_loader, encoder, index = build_classification_dataset(\n    client, \"headline\", \"__clickbait\", \"distilbert-base-uncased\"\n)\n\n# build your custom model and train it here - example:\nimport torch.nn as nn\nimport numpy as np\nimport torch\n\n# number of features (len of X cols)\ninput_dim = 768\n# number of hidden layers\nhidden_layers = 20\n# number of classes (unique of y)\noutput_dim = 2\nclass Network(nn.Module):\n    def __init__(self):\n        super(Network, self).__init__()\n        self.linear1 = nn.Linear(input_dim, output_dim)\n   \n    def forward(self, x):\n        x = torch.sigmoid(self.linear1(x))\n        return x\n    \nclf = Network()\ncriterion = nn.CrossEntropyLoss()\noptimizer = torch.optim.SGD(clf.parameters(), lr=0.1)\n\nepochs = 2\nfor epoch in range(epochs):\n    running_loss = 0.0\n    for i, data in enumerate(train_loader, 0):\n        inputs, labels = data\n        # set optimizer to zero grad to remove previous epoch gradients\n        optimizer.zero_grad()\n        # forward propagation\n        outputs = clf(inputs)\n        loss = criterion(outputs, labels)\n        # backward propagation\n        loss.backward()\n        # optimize\n        optimizer.step()\n        running_loss += loss.item()\n        # display statistics\n        print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.5f}')\n        running_loss = 0.0\n\n# with this model trained, you can use the callback\nfrom refinery.callbacks.torch import TorchCallback\ncallback = TorchCallback(\n    client, \n    clf,\n    \"clickbait\", \n    encoder\n)\n\n# and just execute this \ncallback.run(train_loader, index[\"train\"])\ncallback.run(test_loader, index[\"test\"])\n```\n\n#### HuggingFace Callback\nCollect the dataset and train your custom transformer model as follows:\n\n```python\nfrom refinery.adapter import transformers\ndataset, 
mapping, index = transformers.build_classification_dataset(client, \"headline\", \"__clickbait\")\n\n# train a model here, we're simplifying this by just using an existing model w/o retraining\nfrom transformers import pipeline\npipe = pipeline(\"text-classification\", model=\"distilbert-base-uncased\")\n\n# if you're interested in what training looks like, see the Hugging Face adapter above\n\n# you can now apply the callback\nfrom refinery.callbacks.transformers import TransformerCallback\ncallback = TransformerCallback(\n    client, \n    pipe,\n    \"clickbait\", \n    mapping\n)\n\ncallback.run(dataset[\"train\"][\"headline\"], index[\"train\"])\ncallback.run(dataset[\"test\"][\"headline\"], index[\"test\"])\n```\n\n#### Generic Callback\nThis one is your fallback if you have a very custom solution; otherwise, we recommend you look into the framework-specific classes.\n\n```python\nfrom refinery.callbacks.inference import ModelCallback\nfrom refinery.adapter.sklearn import build_classification_dataset\nfrom sklearn.linear_model import LogisticRegression\n\ndata = build_classification_dataset(client, \"headline\", \"__clickbait\", \"distilbert-base-uncased\")\nclf = LogisticRegression()\nclf.fit(data[\"train\"][\"inputs\"], data[\"train\"][\"labels\"])\n\n# you can build initialization functions that set states of objects you use in the pipeline\ndef initialize_fn(inputs, labels, **kwargs):\n    return {\"clf\": kwargs[\"clf\"]}\n\n# postprocessing shifts the model outputs into a format accepted by our API\ndef postprocessing_fn(outputs, **kwargs):\n    named_outputs = []\n    for prediction in outputs:\n        pred_index = prediction.argmax()\n        label = kwargs[\"clf\"].classes_[pred_index]\n        confidence = prediction[pred_index]\n        named_outputs.append([label, confidence])\n    return named_outputs\n\ncallback = ModelCallback(\n    client,\n    \"my-custom-regression\",\n    \"clickbait\",\n    
inference_fn=clf.predict_proba,\n    initialize_fn=initialize_fn,\n    postprocessing_fn=postprocessing_fn\n)\n\n# executing this will call the refinery API with batches of size 32\ncallback.initialize_and_run(data[\"train\"][\"inputs\"], data[\"train\"][\"index\"])\ncallback.run(data[\"test\"][\"inputs\"], data[\"test\"][\"index\"])\n```\n\n\n## Contributing\nContributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.\n\nIf you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag \"enhancement\".\n\n1. Fork the Project\n2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the Branch (`git push origin feature/AmazingFeature`)\n5. Open a Pull Request\n\nAnd please don't forget to leave a ⭐ if you like the work! \n\n## License\nDistributed under the Apache License 2.0. See `LICENSE` for more information.\n\n## Contact\nThis library is developed and maintained by [Kern AI](https://github.com/code-kern-ai). If you want to provide us with feedback or have some questions, don't hesitate to contact us. We're super happy to help ✌️\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode-kern-ai%2Frefinery-python-sdk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcode-kern-ai%2Frefinery-python-sdk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode-kern-ai%2Frefinery-python-sdk/lists"}