https://github.com/go-air/dupi
A tool to find all duplicates in large sets of text documents.
- Host: GitHub
- URL: https://github.com/go-air/dupi
- Owner: go-air
- License: apache-2.0
- Created: 2021-09-15T01:41:56.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2021-09-29T19:20:52.000Z (about 3 years ago)
- Last Synced: 2024-08-01T21:46:42.204Z (3 months ago)
- Topics: analysis, analytics, golang, index, nlp, search
- Language: Go
- Homepage:
- Size: 460 KB
- Stars: 16
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- project-awesome - go-air/dupi - A tool to find all duplicates in large sets of text documents. (Go)
README
# ⊧ dupi
Dupi is an engine for identifying and exploring duplicative text in sets of
documents.
## Status
Dupi is in alpha/early beta development stage. Please feel free to give it a try
(and [file issues](https://github.com/go-air/dupi/issues)). We have run it on
several document sets successfully, but it definitely needs more testing.
## Input
Throw hundreds of thousands of textual documents at it. Or extract text from
other documents and send that to dupi.
## Output
Find and query for repeated chunks of text.
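To make "repeated chunks of text" concrete, here is a minimal sketch of one generic way to spot text shared between documents: index every run of `n` consecutive words (a "shingle") and report the runs that appear in more than one document. This is an illustration only and is not dupi's algorithm or API; the function name `repeatedShingles` and the sample documents are hypothetical. See the design document for how dupi actually works.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// repeatedShingles is a hypothetical helper, not part of dupi.
// It returns every run of n consecutive words that occurs in at least
// two distinct documents, mapped to the sorted indices of those documents.
func repeatedShingles(docs []string, n int) map[string][]int {
	out := make(map[string][]int)
	if n < 1 {
		return out
	}

	// seen maps each shingle to the set of document indices containing it.
	seen := make(map[string]map[int]bool)
	for i, doc := range docs {
		words := strings.Fields(strings.ToLower(doc))
		for j := 0; j+n <= len(words); j++ {
			key := strings.Join(words[j:j+n], " ")
			if seen[key] == nil {
				seen[key] = make(map[int]bool)
			}
			seen[key][i] = true
		}
	}

	// Keep only shingles shared by two or more documents.
	for key, ids := range seen {
		if len(ids) < 2 {
			continue
		}
		for id := range ids {
			out[key] = append(out[key], id)
		}
		sort.Ints(out[key])
	}
	return out
}

func main() {
	docs := []string{
		"the quick brown fox jumps",
		"a quick brown fox sleeps",
		"nothing in common here",
	}
	for shingle, ids := range repeatedShingles(docs, 3) {
		fmt.Println(shingle, ids)
	}
}
```

An in-memory map like this only suits small inputs; handling hundreds of thousands of documents calls for a purpose-built index, which is the problem dupi addresses.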
## Tutorial
[Tutorial](docs/tutorial.md)
## Design
[Design Document](docs/design.md)
## Library Reference
[![Go Reference](https://pkg.go.dev/badge/github.com/go-air/dupi.svg)](https://pkg.go.dev/github.com/go-air/dupi)