https://github.com/curiosity-ai/catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the box support for training word and document embeddings, and flexible entity recognition models.
- Host: GitHub
- URL: https://github.com/curiosity-ai/catalyst
- Owner: curiosity-ai
- License: mit
- Created: 2019-08-04T09:00:12.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-08-07T10:21:53.000Z (8 months ago)
- Last Synced: 2024-08-08T09:57:35.786Z (8 months ago)
- Topics: ai, artificial-intelligence, csharp, embeddings, machine-learning, natural-language-processing, natural-language-understanding, nlp
- Language: C#
- Homepage:
- Size: 13.4 MB
- Stars: 703
- Watchers: 39
- Forks: 72
- Open Issues: 33
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-dotnet-datascience - Catalyst (Natural Language Processing)
- awesome-dotnet-core - Catalyst
- TensorFlow-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- Deep-Learning-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- CoreML-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- Apache-Airflow-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- PyTorch-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- Apache-Spark-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- MATLAB-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- Apache-Kafka-Guide - Catalyst (NLP Tools, Libraries, and Frameworks)
- fucking-awesome-dotnet-core - Catalyst - Part of the [SciSharp Stack](https://scisharp.github.io/SciSharp/) (Frameworks, Libraries and Tools / Machine Learning and Data Science)
- awesome-dotnet-core - Catalyst - Part of the [SciSharp Stack](https://scisharp.github.io/SciSharp/) (Frameworks, Libraries and Tools / Machine Learning and Data Science)
- awesome-dotnet-machine-learning - Catalyst (Uncategorized / Uncategorized)
README
_**catalyst**_ is a C# Natural Language Processing library built for speed. Inspired by [spaCy's design](https://spacy.io/), it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
## ⚡ Features
- Fast, modern pure-C# NLP library, supporting [.NET Standard 2.0](https://docs.microsoft.com/en-us/dotnet/standard/net-standard)
- Cross-platform, runs anywhere [.NET Core](https://dotnet.microsoft.com/download) is supported - Windows, Linux, macOS and even ARM
- Non-destructive [tokenization](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Base/FastTokenizer.cs), >99.9% [RegEx-free](https://blog.codinghorror.com/regex-performance/), >1M tokens/s on a modern CPU
- Named Entity Recognition ([gazetteer](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/Spotter.cs), [rule-based](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/PatternSpotter.cs) & [perceptron-based](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/AveragePerceptronEntityRecognizer.cs))
- Pre-trained models based on the [Universal Dependencies](https://universaldependencies.org/) project
- Custom models for learning [Abbreviations](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Special/AbbreviationCapturer.cs) & [Senses](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/Spotter.cs#L214)
- Out-of-the-box support for training [FastText](https://fasttext.cc/) and [StarSpace](https://github.com/facebookresearch/StarSpace) embeddings (pre-trained models coming soon)
- Part-of-speech tagging
- Language detection using [FastText](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Special/FastTextLanguageDetector.cs) or [cld3](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Special/LanguageDetector.cs)
- Efficient binary serialization based on [MessagePack](https://github.com/neuecc/MessagePack-CSharp/)
- Pre-built models for [language packages](https://www.nuget.org/packages?q=catalyst.models) ✨
- Lemmatization ✨ (using lookup tables ported from [spaCy](https://github.com/explosion/spacy-lookups-data))

## Language Packages ✨
All language-specific data and models are provided as NuGet packages; you can find all packages [here](https://www.nuget.org/packages?q=catalyst.models).

The new models are trained on the latest release of [Universal Dependencies v2.7](https://universaldependencies.org/).
We've also added the option to store and load models using streams:
`````csharp
// Creates and stores the model
var isApattern = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");
isApattern.NewPattern(
    "Is+Noun",
    mp => mp.Add(
        new PatternUnit(P.Single().WithToken("is").WithPOS(PartOfSpeech.VERB)),
        new PatternUnit(P.Multiple().WithPOS(PartOfSpeech.NOUN, PartOfSpeech.PROPN, PartOfSpeech.AUX, PartOfSpeech.DET, PartOfSpeech.ADJ))
));

using (var f = File.OpenWrite("my-pattern-spotter.bin"))
{
    await isApattern.StoreAsync(f);
}

// Load the model back from disk
var isApattern2 = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");

using (var f = File.OpenRead("my-pattern-spotter.bin"))
{
    await isApattern2.LoadAsync(f);
}
`````

## ✨ Getting Started
Using _**catalyst**_ is as simple as installing its [NuGet Package](https://www.nuget.org/packages/Catalyst) and setting the storage to use our online repository. This way, models are lazily loaded, either from disk or from our online repository. **Also check out the [sample projects](https://github.com/curiosity-ai/catalyst/tree/master/samples)** for more examples of how to use _**catalyst**_.
```csharp
Catalyst.Models.English.Register(); // You need to pre-register each language (and install the respective NuGet packages)

Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);
Console.WriteLine(doc.ToJson());
```

You can also take advantage of C# lazy evaluation and native multi-threading support to process a large number of documents in parallel:
```csharp
var docs = GetDocuments();
var parsed = nlp.Process(docs);
DoSomething(parsed);

IEnumerable<IDocument> GetDocuments()
{
    // Generates a few documents, to demonstrate multi-threading & lazy evaluation
    for (int i = 0; i < 1000; i++)
    {
        yield return new Document("The quick brown fox jumps over the lazy dog", Language.English);
    }
}

void DoSomething(IEnumerable<IDocument> docs)
{
    // Enumerating the sequence is what actually triggers the processing
    foreach (var doc in docs)
    {
        Console.WriteLine(doc.ToJson());
    }
}
```

Training a new [FastText](https://fasttext.cc/) [word2vec](https://en.wikipedia.org/wiki/Word2vec) embedding model is as simple as this:
```csharp
var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs()));
await ft.StoreAsync(); // Await the store so the model is fully written before continuing
```

For fast embedding search, we have also released a C# version of the ["Hierarchical Navigable Small World" (HNSW)](https://arxiv.org/abs/1603.09320) algorithm on [NuGet](https://www.nuget.org/packages/HNSW/), based on our fork of Microsoft's [HNSW.Net](https://github.com/curiosity-ai/hnsw.net).

We have also released a C# version of the "Uniform Manifold Approximation and Projection" ([UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)) algorithm for dimensionality reduction on [GitHub](https://github.com/curiosity-ai/umap-csharp) and on [NuGet](https://www.nuget.org/packages/UMAP/).
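
To make the embedding search mentioned above more concrete, here is a minimal sketch of exact (brute-force) cosine-similarity search, the baseline that approximate indexes such as HNSW are designed to speed up. It does not use the HNSW package or any _**catalyst**_ API: the `ExactSearch` class, its method names, and the toy vectors are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy 3-dimensional "embeddings" (assumption: real vectors would come from a trained model).
var vectors = new List<float[]>
{
    new[] { 1f, 0f, 0f },
    new[] { 0.9f, 0.1f, 0f },
    new[] { 0f, 1f, 0f },
};

var query = new[] { 1f, 0.05f, 0f };

// Print the indices and scores of the two vectors most similar to the query.
foreach (var (index, score) in ExactSearch.NearestNeighbors(query, vectors, k: 2))
{
    Console.WriteLine($"{index}: {score:F3}");
}

// Hypothetical helper, not part of catalyst or the HNSW package.
static class ExactSearch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for zero-length vectors.
    public static float CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0f || normB == 0f)
        {
            return 0f;
        }

        return (float)(dot / Math.Sqrt((double)normA * normB));
    }

    // Scores every stored vector against the query (O(n) per query) and keeps the top k.
    public static List<(int Index, float Score)> NearestNeighbors(
        float[] query, IReadOnlyList<float[]> vectors, int k)
    {
        return vectors
            .Select((v, i) => (Index: i, Score: CosineSimilarity(query, v)))
            .OrderByDescending(x => x.Score)
            .Take(k)
            .ToList();
    }
}
```

Because every query scans all stored vectors, this approach only scales to small collections; a graph-based index such as HNSW trades exactness for much faster lookups on large ones.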
## 📖 Links
| Documentation | |
| ----------------- | --------------------------------------------------------- |
| [Contribute] | How to contribute to _**catalyst**_ codebase. |
| [Samples] | Sample projects demonstrating _**catalyst**_ capabilities |
| [Gitter](https://gitter.im/curiosityai/catalyst?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) | Join our Gitter channel |

[Contribute]: https://github.com/curiosity-ai/catalyst/blob/master/CONTRIBUTING.md
[Samples]: https://github.com/curiosity-ai/catalyst/tree/master/samples