https://github.com/jeremylikness/sparkmldoccategorization
Example of automatic categorization using .NET for Spark and ML.NET
- Host: GitHub
- URL: https://github.com/jeremylikness/sparkmldoccategorization
- Owner: JeremyLikness
- License: mit
- Created: 2020-08-13T01:20:44.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-17T21:44:22.000Z (over 4 years ago)
- Last Synced: 2025-03-24T20:21:50.881Z (about 2 months ago)
- Language: C#
- Size: 47.9 KB
- Stars: 6
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SparkMLDocCategorization
Example of automatic categorization using .NET for Spark and ML.NET.
This project will parse a set of markdown documents, produce a file with titles and words, then
process the file using .NET for Spark to summarize word counts. It then passes the data to
ML.NET to auto-categorize similar documents.

## Prerequisites
For the .NET for Spark portion, follow [this tutorial](https://docs.microsoft.com/dotnet/spark/tutorials/get-started).
You should also have [.NET Core 3.1 installed](https://dotnet.microsoft.com/download/dotnet-core).
## Getting Started
Each run is identified with a unique session tag. For example, `1` might point to one
set of documents while `2` points to a different repo. You can specify a file location, but it
defaults to your user's local app data directory; each job prints the path to its files.

The `runall.cmd` script in the root will step through all phases:

`runall 1 c:\source\repo`
### Build the Spark Data Source
Navigate to the `DocRepoParser` project first.
Type `dotnet run 1 "c:\source\repo"` (replace the last path with the path to your repo).
You'll see a notice that the file has been processed. There is no need to remember the full path.
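To give a feel for what this phase produces, here is a minimal, hypothetical sketch of parsing a markdown document into a title plus its words. This is illustrative only and not the actual `DocRepoParser` code; the `Parse` helper and its shape are assumptions:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class DocParserSketch
{
    // Hypothetical helper: takes the first "# " heading as the title and
    // returns all lower-cased words in the document. The real DocRepoParser
    // project has its own implementation.
    public static (string Title, string[] Words) Parse(string markdown)
    {
        var lines = markdown.Split('\n');
        var title = lines.FirstOrDefault(l => l.StartsWith("# "))
                         ?.Substring(2).Trim() ?? "(untitled)";
        var words = Regex.Matches(markdown, @"[A-Za-z]+")
                         .Select(m => m.Value.ToLowerInvariant())
                         .ToArray();
        return (title, words);
    }

    static void Main()
    {
        var (title, words) = Parse("# Sample Doc\nSpark and ML.NET example.");
        Console.WriteLine(title);                     // Sample Doc
        Console.WriteLine(string.Join(",", words));   // sample,doc,spark,and,ml,net,example
    }
}
```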
### Process the Word Counts
Next, navigate to the `SparkWordsProcessor` directory. Build the project:
`dotnet build`
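Conceptually, this phase summarizes word counts per document. The real work runs as a .NET for Spark job over the file produced in the previous step; as a rough illustration only, the same aggregation could be sketched in plain C# LINQ like this (the data shape here is an assumption, not the project's actual schema):

```csharp
using System;
using System.Linq;

class WordCountSketch
{
    static void Main()
    {
        // Hypothetical input: documents identified by id with their words.
        var docs = new[]
        {
            (Id: 1, Words: new[] { "spark", "ml", "spark" }),
            (Id: 2, Words: new[] { "ml", "net" })
        };

        // Count occurrences of each word per document.
        var counts = docs.SelectMany(d => d.Words.Select(w => (d.Id, Word: w)))
                         .GroupBy(x => (x.Id, x.Word))
                         .Select(g => (g.Key.Id, g.Key.Word, Count: g.Count()));

        foreach (var (id, word, count) in counts)
            Console.WriteLine($"{id}\t{word}\t{count}");
    }
}
```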
Navigate to the output directory (`bin/Debug/netcoreapp3.1`). You have two options:
1. Debug: run `debugspark.cmd`, then right-click the project, open its properties, enter "1" in the arguments field under Debug, and press F5.
2. Alternative: submit the job directly by running `runjob 1` (where `1` is the session tag).

### Train and Apply the Machine Learning Model
Navigate to the `DocMLCategorization` project.
To train _and_ use the model, type:
`dotnet run 1`
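In the real project, ML.NET handles featurization and clustering. As a loose intuition for why similar documents end up in the same category, here is a small cosine-similarity sketch over word-count vectors; this is an assumption-laden illustration, not the project's actual training code:

```csharp
using System;
using System.Linq;

class SimilaritySketch
{
    // Cosine similarity between two word-count vectors. Documents with a
    // high similarity score tend to be grouped into the same category.
    public static double Cosine(double[] a, double[] b)
    {
        double dot = a.Zip(b, (x, y) => x * y).Sum();
        double na = Math.Sqrt(a.Sum(x => x * x));
        double nb = Math.Sqrt(b.Sum(x => x * x));
        return dot / (na * nb);
    }

    static void Main()
    {
        // Hypothetical counts for the vocabulary ("spark", "ml", "net").
        var doc1 = new double[] { 2, 1, 0 };
        var doc2 = new double[] { 2, 1, 0 };
        var doc3 = new double[] { 0, 0, 3 };
        Console.WriteLine(Cosine(doc1, doc2)); // 1 (identical word distribution)
        Console.WriteLine(Cosine(doc1, doc3)); // 0 (no shared words)
    }
}
```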
Open the generated file and see how well the tool categorized your documents!