https://github.com/go-air/dupi

A tool to find all duplicates in large sets of text documents.

analysis analytics golang index nlp search

# ⊧ dupi

Dupi is an engine for identifying and exploring duplicative text in sets of
documents.

## Status

Dupi is in an alpha/early beta stage of development. Please feel free to give it
a try (and [file issues](https://github.com/go-air/dupi/issues)). We have run it
successfully on several document sets, but it still needs more testing.

## Input

Throw hundreds of thousands of text documents at it, or extract the text from
other kinds of documents and send that to dupi.

## Output

Find and query for repeated chunks of text.
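
To make "repeated chunks of text" concrete, here is a minimal sketch of the
general idea using fixed-size word windows (shingles). It is an illustration
only and does not use dupi's API or claim to reflect its actual algorithm. The
function name `repeatedChunks`, the window size `n`, and the sample documents
are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// repeatedChunks is a hypothetical helper, not part of dupi.
// It slides an n-word window over each document and returns every
// window that occurs in at least two different documents, mapped to
// the sorted names of the documents containing it.
func repeatedChunks(docs map[string]string, n int) map[string][]string {
	// chunk text -> set of document names in which it occurs
	index := make(map[string]map[string]bool)
	for name, body := range docs {
		words := strings.Fields(strings.ToLower(body))
		for i := 0; i+n <= len(words); i++ {
			chunk := strings.Join(words[i:i+n], " ")
			if index[chunk] == nil {
				index[chunk] = make(map[string]bool)
			}
			index[chunk][name] = true
		}
	}

	out := make(map[string][]string)
	for chunk, set := range index {
		if len(set) < 2 {
			continue // seen in one document only, so not a cross-document repeat
		}
		names := make([]string, 0, len(set))
		for d := range set {
			names = append(names, d)
		}
		sort.Strings(names)
		out[chunk] = names
	}
	return out
}

func main() {
	// Made-up sample input for illustration.
	docs := map[string]string{
		"a.txt": "the quick brown fox jumps",
		"b.txt": "a quick brown fox sleeps",
		"c.txt": "nothing shared here",
	}
	for chunk, names := range repeatedChunks(docs, 3) {
		fmt.Printf("%q appears in %v\n", chunk, names)
	}
}
```

Keying the index on the chunk text keeps the sketch short. At the scale of
hundreds of thousands of documents, an engine would more plausibly store
compact hashes of each window instead of the text itself.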

## Tutorial

[Tutorial](docs/tutorial.md)

## Design

[Design Document](docs/design.md)

## Library Reference

[![Go Reference](https://pkg.go.dev/badge/github.com/go-air/dupi.svg)](https://pkg.go.dev/github.com/go-air/dupi)