{"id":30539036,"url":"https://github.com/viktor-shcherb/text-labelling","last_synced_at":"2026-04-15T05:32:31.504Z","repository":{"id":307661074,"uuid":"1030277128","full_name":"viktor-shcherb/text-labelling","owner":"viktor-shcherb","description":"Streamlit-based, GitHub-backed collaborative annotation tool – extensible data models, real-time multi-user annotation, and versioned projects.","archived":false,"fork":false,"pushed_at":"2025-09-06T14:26:33.000Z","size":198,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-06T16:13:51.162Z","etag":null,"topics":["annotation-tool","collaboration","data-annotation","git","github-integration","machine-learning","oauth","pydantic","python","real-time","streamlit","text-annotation","yaml"],"latest_commit_sha":null,"homepage":"https://text-labelling.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/viktor-shcherb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-01T11:20:11.000Z","updated_at":"2025-09-06T14:26:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"56014edd-e658-42b7-8ca1-6c0dfd3be59b","html_url":"https://github.com/viktor-shcherb/text-labelling","commit_stats":null,"previous_names":["viktor-shcherb/text-labelling"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/viktor-shcherb/text-labelling","repository_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viktor-shcherb%2Ftext-labelling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viktor-shcherb%2Ftext-labelling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viktor-shcherb%2Ftext-labelling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viktor-shcherb%2Ftext-labelling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/viktor-shcherb","download_url":"https://codeload.github.com/viktor-shcherb/text-labelling/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viktor-shcherb%2Ftext-labelling/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31828531,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"online","status_checked_at":"2026-04-15T02:00:06.175Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation-tool","collaboration","data-annotation","git","github-integration","machine-learning","oauth","pydantic","python","real-time","streamlit","text-annotation","yaml"],"created_at":"2025-08-27T21:24:35.927Z","updated_at":"2026-04-15T05:32:31.486Z","avatar_url":"https://github.com/viktor-shcherb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Labelling App\n\n\u003cimg 
width=\"2043\" height=\"738\" alt=\"banner_text_labelling (1)\" src=\"https://github.com/user-attachments/assets/74556a50-e609-4186-b31e-bd149b1e940a\" /\u003e\n\nA collaborative annotation tool designed for easy deployment and customization. It provides out-of-the-box support for real-time multi-user labeling, project versioning, and seamless persistence to GitHub without additional infrastructure.\n\n**Key Features:**\n\n* **Collaborative Annotation:** Multiple contributors can label data concurrently with real‑time updates.\n* **Flexible Data Model:** Built with Pydantic for custom project, item, and annotation schemas.\n* **Versioned Projects:** Track iterations and changes across project versions.\n* **GitHub Integration:** Store source data and annotations directly in your repository for auditability.\n* **Extensible UI:** Plug in your own rendering logic and controls without touching core code.\n\n**Potential Limitations:**\n\n* **Auth0 Authentication Only:** Requires an Auth0 tenant (OAuth2/OpenID Connect).\n* **GitHub Storage Constraints:** Best suited for datasets within GitHub’s size limits (see [GitHub Community discussion](https://github.com/orgs/community/discussions/120943#discussioncomment-9209743)).\n* **Single-Instance Assumption:** Multiple instances of the app running on the same project/version may overwrite each other’s changes and corrupt the repository.\n\n---\n\n# Deployment\n\nDeploying the Text Labelling App is straightforward but requires initial setup for Auth0 and GitHub integrations. You can run the app locally for development or deploy to Streamlit Cloud for production.\n\n**Prerequisites**\n\n* Python 3.8+\n* `pip` package manager\n* Auth0 tenant (free tier available)\n* GitHub account with permission to create Apps\n\n## 1. Auth0 Integration\n\n1. 
**Create an Auth0 Application**\n\n   * In the Auth0 dashboard, add a new \"Web Application\".\n   * Under **Settings → Application URIs**, configure:\n\n     * **Allowed Callback URLs:** `https://\u003cyour-domain\u003e/oauth2callback`\n     * **Allowed Logout URLs:** Your app origin.\n2. **Configure Secrets**\n   In your project’s `secrets.toml`, add:\n\n   ```toml\n   [auth]\n   redirect_uri = \"https://\u003cyour-domain\u003e/oauth2callback\"\n   cookie_secret = \"\u003crandom SHA‑256 or longer string\u003e\"\n\n   [auth.auth0]\n   client_id = \"\u003cAuth0 Client ID\u003e\"\n   client_secret = \"\u003cAuth0 Client Secret\u003e\"\n   server_metadata_url = \"https://\u003cyour-tenant\u003e.\u003cregion\u003e.auth0.com/.well-known/openid-configuration\"\n   ```\n\n## 2. GitHub Integration\n\n1. **Create a GitHub App**\n\n   * Go to **Settings → Developer settings → GitHub Apps** and click **New GitHub App**.\n   * Grant **Permissions**: `Repository contents: Read \u0026 write`.\n2. **Generate Credentials**\n\n   * In your GitHub App settings, generate a **Private Key** (.pem file).\n   * Note the **Client ID**, **Client Secret**, and **App ID**.\n   * Record your App’s **slug** (last segment of the public URL, e.g., `text-labelling-app`).\n   * To find your **commit\\_sign\\_id**, query `https://api.github.com/users/\u003cyour-app-slug\u003e[bot]` and look for the `id` field.\n3. **Configure Secrets**\n   Add your GitHub App credentials to `secrets.toml` under the `[github_app]` section:\n\n   ```toml\n   [github_app]\n   slug = \"\u003cyour-app-slug\u003e\"                # e.g., text-labelling-app\n   client_id = \"\u003cGitHub App Client ID\u003e\"\n   commit_sign_id = \"\u003cGitHub App numeric ID\u003e\"\n   private_key_pem = '''\n   -----BEGIN RSA PRIVATE KEY-----\n   \u003cYOUR PRIVATE KEY\u003e\n   -----END RSA PRIVATE KEY-----\n   '''\n   ```\n\n## 3. 
Running Locally\n\n```bash\npip install -r requirements.txt\nstreamlit run src/label_app/ui/main.py\n```\n\n* Open `http://localhost:8501` in your browser.\n* Ensure `secrets.toml` resides in the `.streamlit` folder.\n\n## 4. Streamlit Cloud Deployment\n\n1. Fork this repository to your GitHub account.\n2. On Streamlit Cloud, select **New app** and connect your fork.\n3. In the app settings, add the same secrets under **Advanced settings**:\n\n   * **auth.redirect\\_uri**, **auth.cookie\\_secret**\n   * **auth.auth0.client\\_id**, **client\\_secret**, **server\\_metadata\\_url**\n   * **github\\_app.slug**, **client\\_id**, **commit\\_sign\\_id**, **private\\_key\\_pem**\n4. Deploy and share the live URL with your team.\n\n---\n\n# Tailoring the App\n\nAdapt the core app to your specific labeling tasks by extending the data model and UI. You won’t need to alter any deployment or infrastructure code.\n\n## 1. Data Model\n\nAll data classes live in `src/label_app/data/models.py`. Three abstract Pydantic bases define your project structure:\n\n* **ProjectBase**: Project-wide settings (instructions, label schema). Serialized as `project.yaml` at each version root.\n* **ItemBase**: Represents one data point (text snippet, image URL, dialog turn, etc.).\n* **AnnotationBase**: Annotation schema for each item.\n\n**Steps to Extend:**\n\n1. **Subclass Templates**\n\n   ```python\n   class YourProject(ProjectBase): ...\n   class YourItem(ItemBase): ...\n   class YourAnnotation(AnnotationBase): ...\n   ```\n2. **Register in Union**\n   Update the discriminated union so Pydantic knows your type:\n\n   ```python\n   Project = Annotated[\n       Union[ChatProject, YourProject],\n       Field(discriminator=\"task_type\"),\n   ]\n   ```\n3. 
**Version Config**\n\n   * Place `project.yaml` at `\u003cproject\u003e/\u003cversion\u003e/project.yaml` following your `YourProject` schema.\n   * Organize source files under `\u003cproject\u003e/\u003cversion\u003e/source/` as `.jsonl` chunks (100–1000 lines each).\n\n## 2. UI Customization\n\nThe renderer for item and annotation views lives in `src/label_app/ui/components/annotation_view.py`. You can hook into this without changing core app logic.\n\n**Steps to Customize:**\n\n1. **Locate Render Method**\n\n   ```python\n   @singledispatch\n   def render(project: Project, annotation: _AnnotationType) -\u003e _AnnotationType:\n       raise TypeError(f\"No renderer registered for {type(project).__name__}\")\n   ```\n2. **Override Renderer**\n\n   ```python\n   @render.register\n   def _your_render(project: YourProject, annotation: YourAnnotation) -\u003e YourAnnotation: ...\n   ```\n3. **Add Custom Controls**\n\n   * Use `st.selectbox`, `st.slider`, `st.checkbox`, etc., to capture annotations.\n   * Leverage Streamlit layout (`st.columns`, `st.expander`) for better UX.\n4. **Validate \u0026 Iterate**\n\n   * Restart the app locally (`streamlit run ...`).\n   * Test with your `.jsonl` dataset and ensure annotations save correctly.\n\n---\n\n# Projects\n\nList your labeling projects in `src/label_app/app_settings.yaml` to make them appear in the UI. You can register multiple projects:\n\n```yaml\nprojects:\n    ner: \"https://github.com/your-org/ner-repo/data\"\n    image: \"https://github.com/your-org/img-classifier/dataset\"\n```\n\n## Project Versions\n\nEach version of a project lives in its own subdirectory. 
For example:\n\n```\nner-project/\n├── v1.0/\n│   ├── project.yaml       # Pydantic config for v1.0\n│   └── source/\n│       ├── data_part1.jsonl\n│       └── data_part2.jsonl\n└── v2.0/\n    ├── project.yaml       # Updated config for v2.0\n    └── source/\n        └── all_data.jsonl\n```\n\n* **Immutable Versions:** Once data files and `project.yaml` are committed, treat them as read-only. Create a new version directory for any changes.\n* **project.yaml:** Must match the schema of your `ProjectBase` subclass exactly.\n* **Source Tree:** The `source/` directory can contain nested subdirectories; only leaf files should be `.jsonl`. Each line in these JSONL files must represent a serialized Pydantic `YourItem` model.\n* **Chunking:** For performance, keep each `.jsonl` file between 100 and 1000 lines.\n\n## Annotations Storage\n\nWhen contributors annotate, their results are saved back into the repository under:\n\n```\n\u003cproject\u003e/\u003cversion\u003e/annotation/\u003ccontributor\u003e/source/.../*.jsonl\n```\n\n* The `annotation/` tree mirrors the `source/` layout, ensuring your commit history remains clear.\n* Multiple contributors can work in parallel without merge conflicts.\n\n---\n\n## Potential Welcomed Contributions\n\n* **New Authentication Backends**\n\n  * Support additional OAuth/OIDC providers (e.g., GitHub OAuth, Google).\n  * Implement a plugin architecture for custom authentication modules.\n\n* **Alternative Storage Layers**\n\n  * Add optional backends for data and annotation storage (e.g., AWS S3, Google Cloud Storage, SQL/NoSQL databases).\n  * Integrate Hugging Face Datasets as a storage and loading option, enabling seamless access to public and private HF repositories.\n\n* **Automated Testing \u0026 CI/CD**\n\n  * Include unit and integration tests with pytest covering core components.\n  * Provide GitHub Actions workflows for linting, testing, and automated deployment of documentation or Docker images.\n\n* **Interactive Analytics Page**\n\n  * Add a 
dedicated page within the app for visualizing annotation metrics such as annotator throughput, label distributions, and inter-annotator agreement.\n  * Use interactive plotting libraries (Plotly, Recharts, or Streamlit charts) for real-time analytics.\n\n* **Built-in Project Types \u0026 Templates**\n\n  * Add additional out-of-the-box project implementations (beyond the current chat labeling).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviktor-shcherb%2Ftext-labelling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fviktor-shcherb%2Ftext-labelling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviktor-shcherb%2Ftext-labelling/lists"}