{"id":44905269,"url":"https://github.com/iscc/iscc-sct","last_synced_at":"2026-02-17T22:29:01.998Z","repository":{"id":246096616,"uuid":"819754795","full_name":"iscc/iscc-sct","owner":"iscc","description":"ISCC - Semantic Code Text","archived":false,"fork":false,"pushed_at":"2025-11-20T15:15:45.000Z","size":4748,"stargazers_count":3,"open_issues_count":10,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-20T16:22:37.528Z","etag":null,"topics":["cross-lingual","cross-lingual-embeddings","cross-lingual-simialrity","generated-text-detection","semantic-similarity","semantic-textual-similarity"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/iscc/iscc-sct","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iscc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"titusz","custom":"https://iscc.foundation/support/"}},"created_at":"2024-06-25T06:26:14.000Z","updated_at":"2025-06-23T13:31:53.000Z","dependencies_parsed_at":"2024-06-25T21:13:45.270Z","dependency_job_id":"7346f939-55d7-46d3-84a4-facee2970602","html_url":"https://github.com/iscc/iscc-sct","commit_stats":null,"previous_names":["iscc/iscc-sct"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/iscc/iscc-sct","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Fiscc-sct","tags_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/iscc%2Fiscc-sct/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Fiscc-sct/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Fiscc-sct/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iscc","download_url":"https://codeload.github.com/iscc/iscc-sct/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Fiscc-sct/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29560567,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T21:50:49.831Z","status":"ssl_error","status_checked_at":"2026-02-17T21:46:15.313Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-lingual","cross-lingual-embeddings","cross-lingual-simialrity","generated-text-detection","semantic-similarity","semantic-textual-similarity"],"created_at":"2026-02-17T22:29:01.417Z","updated_at":"2026-02-17T22:29:01.989Z","avatar_url":"https://github.com/iscc.png","language":"Python","funding_links":["https://github.com/sponsors/titusz","https://iscc.foundation/support/"],"categories":[],"sub_categories":[],"readme":"# ISCC - Semantic 
Text-Code\n\n[![Tests](https://github.com/iscc/iscc-sct/actions/workflows/tests.yml/badge.svg)](https://github.com/iscc/iscc-core/actions/workflows/tests.yml)\n[![Version](https://img.shields.io/pypi/v/iscc-sct.svg)](https://pypi.python.org/pypi/iscc-sct/)\n[![Downloads](https://pepy.tech/badge/iscc-sct)](https://pepy.tech/project/iscc-sct)\n\n\u003e [!CAUTION]\n\u003e **This is a proof of concept.** All releases with version numbers below v1.0.0 may break backward\n\u003e compatibility and produce incompatible Semantic Text-Codes. The algorithms of this `iscc-sct`\n\u003e repository are experimental and not part of the official\n\u003e [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.\n\n`iscc-sct` is a **Semantic-Code Text** implementation for the [ISCC](https://core.iscc.codes)\n(*International Standard Content Code*). The Semantic-Code Text is a new ISCC-UNIT for semantic text\nidentification. The algorithm creates similar (low Hamming distance) codes for semantically similar\ntext inputs across different languages. The SCT ISCC-UNIT is a compact binary code created from a\nbinarized document-vector text embedding.\n\n## Quick Start\n\n```bash\n# Install the package\npip install iscc-sct\n\n# Generate a semantic code\npython -c \"import iscc_sct as sct; print(sct.create('Your text here').iscc)\"\n\n# Or use the CLI\nsct \"path/to/textfile.txt\"\n```\n\n## What is the ISCC\n\nThe ISCC is a combination of various similarity-preserving fingerprints and an identifier for\ndigital media content.\n\nISCCs are generated algorithmically from digital content, just like cryptographic hashes. However,\ninstead of using a single cryptographic hash function to identify data only, the ISCC uses various\nalgorithms to create a composite identifier that exhibits similarity-preserving properties (soft\nhash or Simprint).\n\nThe component-based structure of the ISCC identifies content at multiple levels of abstraction. 
Each\ncomponent is self-describing, modular, and can be used separately or with others to aid in various\ncontent identification tasks. The algorithmic design supports content deduplication, database\nsynchronization, indexing, integrity verification, timestamping, versioning, data provenance,\nsimilarity clustering, anomaly detection, usage tracking, allocation of royalties, fact-checking and\ngeneral digital asset management use-cases.\n\n## Comparison with Standard ISCC Content-Code Text\n\n| Feature       | ISCC Content-Code Text   | ISCC Semantic-Code Text           |\n| ------------- | ------------------------ | --------------------------------- |\n| Focus         | Lexical similarity       | Semantic similarity               |\n| Cross-lingual | No                       | Yes                               |\n| Use case      | Near-duplicate detection | Semantic similarity, translations |\n\n## What is ISCC Semantic Text-Code?\n\nThe ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate\nmatching. The ISCC Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more\nabstract and broader semantic similarity. It is engineered to be robust against a wide range of\nvariations and, most remarkably, translations of text that cannot be matched based on lexical\nsimilarity alone.\n\n### Translation Matching\n\nOne of the most interesting aspects of the Semantic Text-Code is its ability to generate\n**(near)-identical codes for translations or paraphrased versions of the same text**. 
This means\nthat the same content, expressed in different languages, can be identified and linked, opening up\nnew possibilities for cross-lingual content identification and similarity detection.\n\n## Key Features\n\n- **Semantic Similarity**: Utilizes deep learning models to generate codes that reflect the semantic\n  essence of text.\n- **Translation Matching**: Creates nearly identical codes for text translations, enabling\n  cross-lingual content identification.\n- **Bit-Length Flexibility**: Supports generating codes of various bit lengths (up to 256 bits),\n  allowing for adjustable granularity in similarity detection.\n- **ISCC Compatible**: Generates codes fully compatible with the ISCC specification, facilitating\n  seamless integration with existing ISCC-based systems.\n\n## Installation\n\nEnsure you have Python 3.10 or newer installed on your system. Install the library using:\n\n```bash\npip install iscc-sct\n```\n\nFor systems with GPU CUDA support, enhance performance by installing with:\n\n```bash\npip install iscc-sct[gpu]\n```\n\n## Usage\n\nGenerate a Semantic Text-Code using the create function:\n\n```pycon\n\u003e\u003e\u003e import iscc_sct as sct\n\u003e\u003e\u003e text = \"This is some sample text. It can be a longer document or even an entire book.\"\n\u003e\u003e\u003e sct.create(text, bits=256)\n{\n  \"iscc\": \"ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI\",\n  \"characters\": 77\n}\n\n```\n\nFor granular (per chunk) feature outputs:\n\n```pycon\n\u003e\u003e\u003e import iscc_sct as sct\n\u003e\u003e\u003e text = \"This is some sample text. 
It can be a longer document or even an entire book.\"\n\u003e\u003e\u003e sct.create(text, bits=256, granular=True)\n{\n  \"iscc\": \"ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI\",\n  \"characters\": 77,\n  \"features\": [\n    {\n      \"maintype\": \"semantic\",\n      \"subtype\": \"text\",\n      \"version\": 0,\n      \"byte_offsets\": false,\n      \"simprints\": [\n        {\n          \"simprint\": \"XZjeSfdyVi0\",\n          \"offset\": 0,\n          \"size\": 77,\n          \"content\": \"This is some sample text. It can be a longer document or even an entire book.\"\n        }\n      ]\n    }\n  ]\n}\n\n```\n\n\u003e [!TIP]\n\u003e By default, granular features (simprints) report their offsets as character positions. If the\n\u003e `byte_offsets` option is enabled (via the ISCC_SCT_BYTE_OFFSETS environment variable or as an\n\u003e option in code), the offsets will be computed on the UTF-8 representation of the text. This can be\n\u003e useful when you need to retrieve individual text chunks via random access from remote storage.\n\n### Comparing Two Texts\n\n```python\nimport iscc_sct as sct\n\n# Generate codes for two texts\ntext1 = \"\"\"\nAn ISCC applies to a specific digital asset and is a data-descriptor deterministically constructed\nfrom multiple hash digests using the algorithms and rules in this document. This document does not\nprovide information on registration of ISCCs.\n\"\"\"\n\ntext2 = \"\"\"\nEin ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der\ndeterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in diesem\nDokument erstellt wird. 
Dieses Dokument enthält keine Informationen über die Registrierung von ISCCs.\n\"\"\"\n\ncode1 = sct.create(text1)\ncode2 = sct.create(text2)\n\ndistance = sct.iscc_distance(code1.iscc, code2.iscc)\nprint(f\"Hamming distance in bits: {distance}\")\n```\n\nThe installation also provides an `sct` command-line tool:\n\n```shell\nusage: sct [-h] [-b BITS] [-g] [-d] [path]\n\nGenerate Semantic Text-Codes for text files.\n\npositional arguments:\n  path                  Path to text files (supports glob patterns) or 'gui' to launch Gradio demo.\n\noptions:\n  -h, --help            show this help message and exit\n  -b BITS, --bits BITS  Bit-Length of Code (default 256)\n  -g, --granular        Activate granular processing.\n  -d, --debug           Show debugging messages.\n```\n\n## How It Works\n\n```\nText Input → Text Chunking → Embedding Generation → Vector Aggregation → Binarization → ISCC Encoding\n```\n\n`iscc-sct` employs the following process:\n\n1. Splits the text into overlapping chunks (using syntactically sensible breakpoints).\n2. Uses a pre-trained deep learning model for text embedding.\n3. Generates feature vectors capturing essential characteristics of the chunks.\n4. Aggregates these vectors and binarizes them to produce a Semantic Text-Code.\n5. 
Prefixes the binarized vector with the matching ISCC header, encodes it with base32, and adds the\n   \"ISCC:\" prefix.\n\nThis process ensures robustness to variations and translations, enabling cross-lingual matching\nbased on a short Simprint.\n\n## Configuration\n\nISCC-SCT can be configured using environment variables:\n\n| Environment Variable | Description                          | Default |\n| -------------------- | ------------------------------------ | ------- |\n| ISCC_SCT_BITS        | Default bit-length of generated code | 64      |\n| ISCC_SCT_MAX_TOKENS  | Maximum tokens per chunk             | 127     |\n| ISCC_SCT_OVERLAP     | Maximum token overlap between chunks | 48      |\n\nSee iscc_sct/options.py for more configuration settings.\n\n## Performance Considerations\n\n- The embedding model will be downloaded on first execution\n- **CPU vs GPU**: On systems with CUDA-compatible GPUs, install with `pip install iscc-sct[gpu]` for\n  significantly faster processing.\n\n## Development and Contributing\n\nWe welcome contributions to enhance the capabilities and efficiency of this proof of concept. For\ndevelopment, install the project in development mode using [Poetry](https://python-poetry.org):\n\n```shell\ngit clone https://github.com/iscc/iscc-sct.git\ncd iscc-sct\npoetry install\n```\n\nIf you have suggestions for improvements or bug fixes, please open an issue or pull request. For\nmajor changes, please open an issue first to discuss your ideas.\n\n**We particularly welcome recommendations for other multilingual text embedding models trained with\nMatryoshka Representation Learning (MRL) and optimized for binarization. Such contributions could\nsignificantly improve the performance and efficiency of the ISCC Semantic Text-Code generation.**\n\n## Gradio Demo\n\nThis repository also provides an interactive Gradio demo that allows you to explore the capabilities\nof ISCC Semantic Text-Code. 
The demo showcases:\n\n- Generation of ISCC Semantic Text-Codes for input texts\n- Comparison of two texts and their similarity based on the generated codes\n- Visualization of text chunking and granular matches\n- Adjustable parameters like ISCC bit-length and maximum tokens per chunk\n\nYou can access the live version of the Gradio demo at:\n[https://huggingface.co/spaces/iscc/iscc-sct](https://huggingface.co/spaces/iscc/iscc-sct)\n\n### Running the Gradio Demo Locally\n\nTo run the Gradio demo locally, you first need to install the `iscc-sct` package with the optional\n`demo` dependency:\n\n```shell\npip install iscc-sct[demo]\n```\n\nThis will ensure that Gradio and other necessary dependencies for the demo are installed.\n\nAfter installation, you can use the `sct` command-line tool that comes with the package:\n\n```shell\nsct gui\n```\n\nThis command will launch the Gradio interface in your default web browser, allowing you to interact\nwith the demo on your local machine.\n\n## Current Limitations\n\n- The semantic matching works best for texts with at least several sentences.\n- Very short texts (a few words) may not generate reliable semantic codes.\n- Performance may vary across different language pairs.\n- The model size is approximately 450MB, which may impact initial loading time.\n\n## Supported Languages\n\nArabic, Armenian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (China), Chinese (Taiwan),\nCroatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, French (Canada),\nGalician, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian,\nJapanese, Kannada, Korean, Kurdish, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi,\nMongolian, Norwegian Bokmål, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian,\nSerbian, Sinhala, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian,\nUrdu, Vietnamese.\n\n## Citation\n\nIf you use ISCC-SCT in your research, 
please cite:\n\n```bibtex\n@software{iscc_sct,\n  author = {Pan, Titusz},\n  title = {ISCC-SCT: Semantic Text-Code for the International Standard Content Code},\n  url = {https://github.com/iscc/iscc-sct},\n  version = {0.1.4},\n  year = {2025},\n}\n```\n\n## Future Work\n\n### Shift Resistant Semantic Chunking\n\nThe current chunking strategy tries to maximize chunk sizes (up to 127 tokens) while still\nsplitting at lexically sensible boundaries with an overlap of up to 48 tokens. See\n[text-splitter](https://github.com/benbrandt/text-splitter).\n\nCross-document chunk matching via granular Simprints can likely be improved significantly with a\nsemantically aware and shift-resistant chunking strategy. Better shift resistance would improve the\nchances that the boundaries detected for semantically similar text sequences in different documents\nare aligned.\n\n### MRL based Embeddings\n\nA text embedding model trained with\n[Matryoshka Representation Learning](https://arxiv.org/pdf/2205.13147) may yield better results with\nshort 64-bit Semantic Text-Codes.\n\n### Larger Chunk Sizes\n\nA text embedding model with support for a larger `max_token` size (currently 128) may yield\nhigher-order granular simprints based on larger chunks of text.\n\n## Acknowledgements\n\n- Text Chunking: [text-splitter](https://github.com/benbrandt/text-splitter)\n- Text Embeddings:\n  [Sentence-Transformers](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiscc%2Fiscc-sct","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiscc%2Fiscc-sct","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiscc%2Fiscc-sct/lists"}