{"id":50137381,"url":"https://github.com/microsoft/dstoolkit-km-solution-accelerator","last_synced_at":"2026-05-23T23:01:19.484Z","repository":{"id":43077199,"uuid":"402097885","full_name":"microsoft/dstoolkit-km-solution-accelerator","owner":"microsoft","description":"Data Science Toolkit - Knowledge Mining Solution Accelerator","archived":false,"fork":false,"pushed_at":"2026-03-18T17:53:34.000Z","size":403351,"stargazers_count":24,"open_issues_count":4,"forks_count":10,"subscribers_count":25,"default_branch":"main","last_synced_at":"2026-05-11T19:33:31.664Z","etag":null,"topics":["ai","azure-cognitive-search","azure-cognitive-services","computer-vision","dstoolkit","knowledge-mining","search-engine","translation"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-09-01T14:49:18.000Z","updated_at":"2026-03-24T23:40:02.000Z","dependencies_parsed_at":"2023-02-18T08:31:38.314Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/dstoolkit-km-solution-accelerator","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/microsoft/dstoolkit-km-solution-accelerator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fdstoolkit-km-solution-accelerator","tags_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fdstoolkit-km-solution-accelerator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fdstoolkit-km-solution-accelerator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fdstoolkit-km-solution-accelerator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/dstoolkit-km-solution-accelerator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fdstoolkit-km-solution-accelerator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33413619,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T18:09:33.147Z","status":"ssl_error","status_checked_at":"2026-05-23T18:09:31.380Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","azure-cognitive-search","azure-cognitive-services","computer-vision","dstoolkit","knowledge-mining","search-engine","translation"],"created_at":"2026-05-23T23:01:18.860Z","updated_at":"2026-05-23T23:01:19.473Z","avatar_url":"https://github.com/microsoft.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"![banner](docs/media/banner.png)\n\n__Knowledge Mining solution accelerator__\n\nThis repository contains all the code for deploying an end-to-end Knowledge Mining solution based on Azure Cognitive Search.\n\nIt is built on top of standard Azure services such as Functions, Web App Services, Cognitive Services \u0026 Cognitive Search. It provides a deployment pipeline allowing quick and easy setup of CI/CD pipelines for your projects.\n\nFor detailed documentation, please refer to the __docs__ section of the repo, which contains the solution wiki.\n\n# Before you start \n\nTo set up your solution successfully, you will need to have access to, or have provisioned, the following:\n\n- Access to an Azure subscription (required)\n- Access to an Azure DevOps subscription (optional)\n\nAn Owner or Contributor role is assumed on the Azure subscription or the targeted Resource Group. \n\n# Getting Started\n\nPlease refer to the [README](deployment/README.md) to deploy this solution accelerator. 
\n\nThe directions provided in all guides assume you have a fundamental working knowledge of the Azure portal, Azure Functions, Azure Cognitive Search, Storage and Azure Cognitive Services. \n\nFor additional training and support, please see:\n\n* [Knowledge Mining Bootcamp](https://github.com/MicrosoftLearning/LearnAI-KnowledgeMiningBootcamp)\n* [AI in Cognitive Search documentation](https://docs.microsoft.com/azure/search/cognitive-search-resources-documentation)\n\n# Knowledge Mining overview\n\nKnowledge mining (KM) is an emerging discipline in artificial intelligence (AI) that uses a combination of intelligent services to quickly learn from vast amounts of information. It allows organizations to deeply understand and easily explore information, uncover hidden insights, and find relationships and patterns at scale.\n\n[Knowledge Mining in Azure](https://azure.microsoft.com/en-us/solutions/knowledge-mining)\n\n# What is this solution accelerator? \n\nThis KM solution accelerator aims to provide you with a workable end-to-end Knowledge Mining solution composed of: \n- Ingestion\n    - Data ingestion from Azure Data Lake\n- Enrichment\n    - Data enrichment with Azure Applied AI and Cognitive Services\n- Exploration\n    - Keyword and Semantic search\n    - Support for multiple search indexes\n    - Content security model (permissions)\n    - Modular User Interface \n\nWith this cloud-based accelerator you will get an end-to-end solution with the tools to deploy, extend, operate \u0026 monitor.\n\nIn that respect, the solution provides: \n- Azure Web App Authentication support \n- High configurability (JSON)\n- Full Extensibility \n- Operations (PowerShell-based)\n- Azure Pipelines for CI/CD \n- Deployment framework (manual or through CI/CD)\n\n# Why a knowledge mining solution accelerator? 
\n\nThis Knowledge Mining solution accelerator is inspired by another accelerator, the [Knowledge Mining Solution Accelerator](https://github.com/Azure-Samples/azure-search-knowledge-mining). \n\nBased on our field experience, we built features/skills to address common unstructured data challenges, focusing on usability and the data exploration experience. \n\nBelow is a non-exhaustive list of key highlights:\n\n* **Embedded image indexing** \n    - Images embedded in documents are indexed as documents in their own right, not just for keyword search recall.\n    - PDF pages are extracted as images (configurable).\n    - A custom version of Apache Tika is used for image extraction.\n    - Overcomes the limit of [1000 normalized images](https://docs.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios#get-normalized-images)\n\n* **Image normalization**: \n    - handling oversized images for OCR completeness\n    - support for TIFF format\n    - thumbnail creation for UI support\n\n* **Metadata**\n    - Using Apache Tika, we give you access to all metadata present in each document or image. A common scenario is images with geo-location metadata, i.e. EXIF GPS coordinates. \n\n* **HTML Conversion**\n    - Having an HTML representation of a document could ease some NLP work. \n    - A table of contents is a common structure, which we expose in the HTML representation of a PDF. \n\n* **Table extraction**: tabular information is common in unstructured data corpora. The solution will extract, index and project tables to a dedicated knowledge store (optional).   \n\n* **Translation**: there are two translation features in this solution\n    * **Text Translation**: non-native content and titles are normalized to a defined language (default is English)\n    * **Document Translation**: the solution will translate non-native documents. They will follow the same document processing as any other document. 
Translated documents will provide you with translated tables, for instance.\n\n* **Text Analytics**: extract Entities (Named, Linked) from any document and OCR'ed image text.\n\n* **Export to Excel**: a popular request when exploring unstructured data. \n\n* **Configurable UI**: building a UI is time-consuming, so we wanted to provide extensive UI configurability so that you can bring new KM solutions to life in a timely manner.\n\n# Which Knowledge Mining scenarios does this accelerator target?\n\nThis solution accelerator is designed in the spirit of a [Content Research](https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/content-research) KM scenario. \n\n![](docs/architecture/knowledge-mining-content-research.png)\n\nNevertheless, since its architecture is open, you could use it as a foundation for more specialized KM scenarios.\n\n**This solution accelerator is not targeted to any domain, although its extensibility gives you the tools to make it domain-specific.**\n\nSome inspirational use cases: \n- AI-driven Data \u0026 Web Exploration\n- Unstructured data Insights extraction (mine the unseen value)\n- AI-Driven Strategy planning tool\n- Intranet Semantic Search\n- R\u0026D portal for data discovery, patterns extraction \u0026 patents exploration\n- etc.\n\nYou may consider productizing this accelerator for your organization.\n\n# Who is the target audience?\n\nThis solution accelerator targets anyone who needs to:  \n\n- Build a Proof Of Concept to showcase Knowledge Mining to stakeholders \n- Deploy an end-to-end KM solution for immediate Production use\n- Learn how to build a KM solution on Azure\n- Use a playground for evaluating Azure Machine Learning, Cognitive \u0026 Applied AI Services \n\n# Data Science Toolkit Integration\n\nThis solution accelerator also aims to ease the integration of Data Science modules into your knowledge mining solution. \n\nThe Data Science Toolkit team has built accelerators for your data science workload. 
\n\n| Solution | Description |\n|--------------|---|\n|[Verseagility](https://github.com/microsoft/verseagility)|Verseagility is a Python-based toolkit to ramp up your custom natural language processing (NLP) task, allowing you to bring your own data, use your preferred frameworks and bring models into production. It is a central component of the Microsoft Data Science Toolkit.|\n| [MLOps Base](https://github.com/microsoft/dstoolkit-mlops-base) | This repository contains the basic repository structure for machine learning projects based on Azure technologies (Azure ML and Azure DevOps). The folder names and files are chosen based on personal experience. You can find the principles and ideas behind the structure, which we recommend to follow when customizing your own project and MLOps process. Also, we expect users to be familiar with azure machine learning concepts and how to use the technology.| \n|[MLOps for DataBricks](https://github.com/microsoft/dstoolkit-ml-ops-for-databricks)| This repository contains the Databricks development framework for delivering any Data Engineering projects, and machine learning projects based on the Azure Technologies.| \n|[Classification Solution Accelerator](https://github.com/microsoft/dstoolkit-classification-solution-accelerator)| This repository contains the basic repository structure for delivering classification solutions for machine learning (ML) projects based on Azure technologies (Azure ML and Azure DevOps).|\n|[Object Detection Solution Accelerator](https://github.com/microsoft/dstoolkit-objectdetection-tensorflow-azureml)|This repository contains all the code for training TensorFlow object detection models within Azure Machine Learning (AML) with setups for training on Azure compute, experiment monitoring and endpoint deployment as a webservice. 
It is built on the MLOps Accelerator and provides end to end training and deployment pipelines allowing quick and easy setup of CI/CD pipelines for your projects.|\n|||\n\n# Documentation\n\nYou may refer to the solution accelerator documentation as follows: \n\n| Topic  | Description | Documentation Link | \n|----|----|----|\n| Pre-Requisites | What you need to deploy \u0026 operate the solution | [README](docs/pre-reqs/README.md)| \n| Architecture | How the solution is architected|[README](docs/architecture/README.md)| \n| Deployment | How to deploy this solution accelerator |[README](docs/deployment/README.md)| \n| Configuration | All you need to know about the solution accelerator configuration |[README](docs/configuration/README.md)| \n| Data Science | Integration with Data Science |[README](docs/data-science/README.md)| \n| Deployment | How to get started by deploying the solution |[README](docs/deployment/README.md)| \n| Monitoring | How to monitor the solution |[README](docs/monitoring/README.md)| \n| Search | How search is configured and managed |[README](docs/search/README.md)| \n| Search \u0026 Explore (UI) | User Interface to Search \u0026 Explore |[README](docs/ui/README.md)| \n||||\n\n# Repository Structure\n\nThe repository structure of this accelerator is as follows: \n\n--------\n- **azure-pipelines** - Azure DevOps pipelines to set up your CI/CD\n- **[configuration](configuration/README.md)** - solution configuration \n- **data** - sample data to validate the solution deployment.\n    - **documents** : sample documents for your KM solution \n- **[deployment](deployment/README.md)** - Configuration \u0026 scripts for deployment \u0026 operations\n    - **config** : contains the entire solution base configuration\n    - **modules** : PowerShell modules\n    - **scripts** : Deployment scripts\n    - **init_env.ps1** : Environment initialization script\n- **[docs](docs/README.md)** - contains solution documentation wiki in .md format. 
Designed to be imported as an Azure DevOps wiki.\n- **overlay** - Source code\n- **[src](src/README.md)** - Source code\n    - **CognitiveSearch.Skills** Custom skills\n    - **CognitiveSearch.UI** User Interface .NET Core MVC\n    - **Data Science** - placeholder to add your data science modules. \n--------\n\n# How to use this accelerator?\n\nClone or download this repository, then navigate to the Deployment folder and follow the steps outlined in the [deployment](deployment/README.md) guide. \n\nWhen you complete all of the steps, you'll have a working end-to-end knowledge mining solution that combines data source ingestion with data enrichment skills and a web app powered by Azure Cognitive Search.\n\n# Credits\n\nThis solution is inspired by the original work of the \n\n- Contributors of [Knowledge Mining Solution Accelerator](https://github.com/Azure-Samples/azure-search-knowledge-mining/graphs/contributors)\n- Contributors of [Azure Search Power Skills](https://github.com/Azure-Samples/azure-search-power-skills/graphs/contributors)\n\nCore contributors to this solution accelerator are \n- [Nicolas Uthurriague](https://github.com/puthurr)\n- [Edoardo Quasso](https://github.com/EdoQuasso) for the Azure Cognitive Functions (Python)\n- [Harika Nagidi](https://github.com/harikanagidi) for VNET support and deployment improvements.\n\n# Special Thanks \n\nThe data science toolkit sponsorship team\n\n- [Karsten Strøbæk](https://github.com/strobaek)\n- [Willie Ahlers](https://github.com/WillieAhlers1)\n- [Kimberly O'Donoghue]()\n\nFor the great conversation on Knowledge Mining and Unstructured Data\n- [Sreedhar Mallangi](https://github.com/smallangi)\n- [Timm Walz](https://github.com/nonstoptimm)\n\n# Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. 
For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n# Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fdstoolkit-km-solution-accelerator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fdstoolkit-km-solution-accelerator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fdstoolkit-km-solution-accelerator/lists"}