https://github.com/curiosity-ai/catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.

ai artificial-intelligence csharp embeddings machine-learning natural-language-processing natural-language-understanding nlp


Host: GitHub
URL: https://github.com/curiosity-ai/catalyst
Owner: curiosity-ai
License: mit
Created: 2019-08-04T09:00:12.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2024-08-07T10:21:53.000Z (11 months ago)
Last Synced: 2024-08-08T09:57:35.786Z (11 months ago)
Topics: ai, artificial-intelligence, csharp, embeddings, machine-learning, natural-language-processing, natural-language-understanding, nlp
Language: C#
Homepage:
Size: 13.4 MB
Stars: 703
Watchers: 39
Forks: 72
Open Issues: 33
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

awesome-dotnet-datascience - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (Natural Language Processing)
awesome-dotnet-core - Catalyst
TensorFlow-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
Deep-Learning-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
CoreML-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
Apache-Airflow-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
PyTorch-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
Apache-Spark-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
MATLAB-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
Apache-Kafka-Guide - Catalyst - Pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (NLP Tools, Libraries, and Frameworks)
fucking-awesome-dotnet-core - Catalyst - Cross-platform Natural Language Processing (NLP) library inspired by spaCy, with pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. Part of the [SciSharp Stack](https://scisharp.github.io/SciSharp/) (Frameworks, Libraries and Tools / Machine Learning and Data Science)
awesome-dotnet-core - Catalyst - Cross-platform Natural Language Processing (NLP) library inspired by spaCy, with pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. Part of the [SciSharp Stack](https://scisharp.github.io/SciSharp/) (Frameworks, Libraries and Tools / Machine Learning and Data Science)
awesome-dotnet-machine-learning - Catalyst - 🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models. (Uncategorized / Uncategorized)

README

        
[![Nuget](https://img.shields.io/nuget/v/Catalyst.svg?maxAge=0&colorB=brightgreen)](https://www.nuget.org/packages/Catalyst/) [![Build Status](https://dev.azure.com/curiosity-ai/mosaik/_apis/build/status/catalyst?branchName=master)](https://dev.azure.com/curiosity-ai/mosaik/_build/latest?definitionId=10&branchName=master)





_**catalyst**_ is a C# Natural Language Processing library built for speed. Inspired by [spaCy's design](https://spacy.io/), it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.

[![Gitter](https://badges.gitter.im/curiosityai/catalyst.svg)](https://gitter.im/curiosityai/catalyst?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

## ⚡ Features

- Fast, modern pure-C# NLP library, supporting [.NET Standard 2.0](https://docs.microsoft.com/en-us/dotnet/standard/net-standard)

- Cross-platform, runs anywhere [.NET Core](https://dotnet.microsoft.com/download) is supported: Windows, Linux, macOS and even ARM

- Non-destructive [tokenization](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Base/FastTokenizer.cs), >99.9% [RegEx-free](https://blog.codinghorror.com/regex-performance/), >1M tokens/s on a modern CPU

- Named Entity Recognition ([gazetteer](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/Spotter.cs), [rule-based](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/PatternSpotter.cs) & [perceptron-based](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/AveragePerceptronEntityRecognizer.cs))

- Pre-trained models based on the [Universal Dependencies](https://universaldependencies.org/) project

- Custom models for learning [Abbreviations](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Special/AbbreviationCapturer.cs) & [Senses](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/EntityRecognition/Spotter.cs#L214)

- Out-of-the-box support for training [FastText](https://fasttext.cc/) and [StarSpace](https://github.com/facebookresearch/StarSpace) embeddings (pre-trained models coming soon)

- Part-of-speech tagging

- Language detection using [FastText](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Special/FastTextLanguageDetector.cs) or [cld3](https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Special/LanguageDetector.cs)

- Efficient binary serialization based on [MessagePack](https://github.com/neuecc/MessagePack-CSharp/)

- Pre-built models for [language packages](https://www.nuget.org/packages?q=catalyst.models) ✨

- Lemmatization ✨ (using lookup tables ported from [spaCy](https://github.com/explosion/spacy-lookups-data))

## Language Packages ✨

All language-specific data and models are provided as NuGet packages. You can find all packages [here](https://www.nuget.org/packages?q=catalyst.models).

The new models are trained on the latest release of [Universal Dependencies v2.7](https://universaldependencies.org/).

We've also added the option to store and load models using streams:

`````csharp
// Creates and stores the model
var isApattern = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");
isApattern.NewPattern(
    "Is+Noun",
    mp => mp.Add(
        new PatternUnit(P.Single().WithToken("is").WithPOS(PartOfSpeech.VERB)),
        new PatternUnit(P.Multiple().WithPOS(PartOfSpeech.NOUN, PartOfSpeech.PROPN, PartOfSpeech.AUX, PartOfSpeech.DET, PartOfSpeech.ADJ))
));

using(var f = File.OpenWrite("my-pattern-spotter.bin"))
{
    await isApattern.StoreAsync(f);
}

// Load the model back from disk
var isApattern2 = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");

using(var f = File.OpenRead("my-pattern-spotter.bin"))
{
    await isApattern2.LoadAsync(f);
}
`````

## ✨ Getting Started

Using _**catalyst**_ is as simple as installing its [NuGet Package](https://www.nuget.org/packages/Catalyst) and configuring the storage. Models are then lazy-loaded, either from disk or downloaded from our online repository. **Also check out some of the [sample projects](https://github.com/curiosity-ai/catalyst/tree/master/samples)** for more examples of how to use _**catalyst**_.

```csharp
Catalyst.Models.English.Register(); // You need to pre-register each language (and install the respective NuGet packages)
Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);
Console.WriteLine(doc.ToJson());
```

You can also take advantage of C# lazy evaluation and native multi-threading support to process a large number of documents in parallel:

```csharp
var docs = GetDocuments();
var parsed = nlp.Process(docs);
DoSomething(parsed);

IEnumerable<IDocument> GetDocuments()
{
    // Generates a few documents, to demonstrate multi-threading & lazy evaluation
    for(int i = 0; i < 1000; i++)
    {
        yield return new Document("The quick brown fox jumps over the lazy dog", Language.English);
    }
}

void DoSomething(IEnumerable<IDocument> docs)
{
    foreach(var doc in docs)
    {
        Console.WriteLine(doc.ToJson());
    }
}
```
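The lazy-evaluation half of this can be seen in isolation with a minimal sketch that uses only the standard library. Nothing here is catalyst API: `GetTexts`, the `produced` counter, and the upper-casing step are hypothetical stand-ins for a document source and a processing step, chosen only to make the deferred execution observable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Counts how many inputs the generator has actually produced.
int produced = 0;

// Hypothetical document source: yields items one at a time, on demand.
IEnumerable<string> GetTexts()
{
    for (int i = 0; i < 1000; i++)
    {
        produced++;
        yield return $"document {i}";
    }
}

// Building the pipeline runs nothing: Select is deferred.
IEnumerable<string> processed = GetTexts().Select(t => t.ToUpperInvariant());
Console.WriteLine(produced); // 0

// Consuming three results pulls exactly three inputs through the pipeline.
var firstThree = new List<string>();
foreach (var text in processed)
{
    firstThree.Add(text);
    if (firstThree.Count == 3) break;
}

Console.WriteLine(produced);      // 3
Console.WriteLine(firstThree[2]); // DOCUMENT 2
```

Because items are only generated when the consumer asks for them, a source of this shape never needs to hold all 1000 inputs in memory at once.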

Training a new [FastText](https://fasttext.cc/) [word2vec](https://en.wikipedia.org/wiki/Word2vec) embedding model is as simple as this:

```csharp
var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs())); // GetDocs() is a placeholder for your own document source
await ft.StoreAsync();            // Await the store so it completes before the program moves on
```

For fast embedding search, we have also released a C# version of the ["Hierarchical Navigable Small World" (HNSW)](https://arxiv.org/abs/1603.09320) algorithm on [NuGet](https://www.nuget.org/packages/HNSW/), based on our fork of Microsoft's [HNSW.Net](https://github.com/curiosity-ai/hnsw.net). We have also released a C# version of the "Uniform Manifold Approximation and Projection" ([UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)) algorithm for dimensionality reduction on [GitHub](https://github.com/curiosity-ai/umap-csharp) and on [NuGet](https://www.nuget.org/packages/UMAP/).
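HNSW is an approximate nearest-neighbor index. For orientation, the exact search it approximates can be sketched as a brute-force cosine-similarity scan. The sketch below uses only the standard library and does not touch the HNSW package API; the `Cosine` helper and the toy three-dimensional vectors are hypothetical and were not produced by any model.

```csharp
using System;
using System.Linq;

// Cosine similarity between two vectors of equal length.
static float Cosine(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// Hypothetical toy embeddings, for illustration only.
var vectors = new (string Word, float[] Vec)[]
{
    ("fox",  new[] { 0.9f, 0.1f, 0.0f }),
    ("dog",  new[] { 0.8f, 0.3f, 0.1f }),
    ("jump", new[] { 0.0f, 0.2f, 0.9f }),
};

var query = new[] { 1.0f, 0.0f, 0.0f };

// Exact search: score every stored vector, so each query costs O(n).
var nearest = vectors.OrderByDescending(v => Cosine(query, v.Vec)).First();
Console.WriteLine(nearest.Word); // fox
```

A linear scan like this is fine for a few thousand vectors. An index such as HNSW trades a small amount of recall for much faster queries on larger collections.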

## 📖 Links

| Documentation     |                                                           |
| ----------------- | --------------------------------------------------------- |
| [Contribute]      | How to contribute to the _**catalyst**_ codebase.         |
| [Samples]         | Sample projects demonstrating _**catalyst**_ capabilities |
| [![Gitter](https://badges.gitter.im/curiosityai/catalyst.svg)](https://gitter.im/curiosityai/catalyst?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)  | Join our Gitter channel                                    |

[Contribute]: https://github.com/curiosity-ai/catalyst/blob/master/CONTRIBUTING.md
[Samples]: https://github.com/curiosity-ai/catalyst/tree/master/samples
