https://github.com/thijse/plagiarismdetector

A sample showing how vector comparison can be used to detect plagiarism using a simple in-memory vector store.

csharp large-language-models openai plagiarism plagiarism-detection rag vector


Host: GitHub
URL: https://github.com/thijse/plagiarismdetector
Owner: thijse
License: mit
Created: 2023-10-14T22:08:36.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-10-14T23:58:12.000Z (almost 2 years ago)
Last Synced: 2023-10-16T05:35:02.263Z (almost 2 years ago)
Topics: csharp, large-language-models, openai, plagiarism, plagiarism-detection, rag, vector
Language: C#
Homepage:
Size: 1.03 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

# Plagiarism Detector

A sample showing how vector comparison can be used to easily and effectively detect plagiarism using a simple in-memory vector store.

The repository contains three main projects:

- Memory Vector Store project, which focuses on storing vectors in memory;

- Chunk Creator project, which extracts vectors from PDF files;

- Plagiarism project, which demonstrates how to perform similarity searches using the stored vectors and OpenAI.

Each project has its own code and resources, so you can explore and understand the implementation details.

This code is based on the code in [MemoryVectorStore](https://github.com/thijse/MemoryVectorStore).

## Code example

First, we need to split both PDFs into chunks and build the embedding vectors:

```cs
// OpenAI service that we are going to use for embedding
_openAiService = new OpenAIService(new OpenAiOptions() { ApiKey = apiKey });

// Set up a MemoryVector database, to be filled with chunks of documents,
// including an embedding vector of 1536 dimensions.
// Also included is a callback that embeds any text item into a vector
_vectorCollection = new MemoryVectorDB.VectorDB(1536, ChunkEmbedingAsync);

// Get the text from the PDF
_document = PdfTextExtractor.GetText(documentPath);

// Generate sentences of at most 50 words
_chunkGenerator = new ChunkGenerator(50, _document);

// Loop through the chunks
foreach (var chunk in _chunkGenerator.GetChunk())
{
    // Add the source reference to the chunk
    chunk.Source = documentPath;

    // Add the chunk to the vector store
    await _vectorCollection.AddAsync(chunk);

    // We remove the text from the chunk to save memory:
    // we just need the vector, start index, length and source
    // so we can recover the chunk from the original document later
    chunk.Text = null!;
}
```
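The chunking step above keeps only a start index and a length per chunk, so the text can be cut out of the original document again later. A minimal sketch of that idea follows. `TextSpan` and `WordChunker` are hypothetical names used for illustration, not the repository's `ChunkGenerator`, and splitting on whitespace is an assumption about what counts as a word.

```cs
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a chunk: only the offsets into the document are kept.
public record TextSpan(int StartIndex, int Length);

public static class WordChunker
{
    // Splits text into consecutive spans of at most maxWords whitespace-separated words.
    public static IEnumerable<TextSpan> Split(string text, int maxWords)
    {
        if (maxWords <= 0) throw new ArgumentOutOfRangeException(nameof(maxWords));

        int i = 0;
        while (true)
        {
            // Skip whitespace in front of the next chunk
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            if (i >= text.Length) yield break;

            int start = i, end = i, words = 0;
            while (i < text.Length && words < maxWords)
            {
                // Consume one word and remember where it ends
                while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
                end = i;
                words++;

                // Skip the whitespace that separates it from the next word
                while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            }

            yield return new TextSpan(start, end - start);
        }
    }

    // Recovers the chunk text from the original document using only the offsets.
    public static string Recover(string text, TextSpan span) =>
        text.Substring(span.StartIndex, span.Length);
}
```

Because `Recover` needs nothing but the original string and the span, dropping the chunk text after embedding loses no information as long as the source document stays available.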

Now we can find the best-matching sentences between the two documents:

```cs
var vectorObjects1 = _embedding1.VectorCollection.VectorObjects;
var vectorObjects2 = _embedding2.VectorCollection.VectorObjects;

// Find the closest matching vectors between the 2 documents
var bestMatches = FindNearestSorted(vectorObjects1, vectorObjects2, 100);

// And here they are
foreach (var item in bestMatches)
{
    ShowMatch(true, item.Value.Item1, item.Value.Item2);
}
```

Note that `FindNearestSorted` is just a brute-force comparison of the (normalized) dot products between the query vector and all chunk vectors. For larger vector stores, use a database that implements an index for efficient nearest-neighbour searches, for example one built on [hnsw-sharp](https://github.com/curiosity-ai/hnsw-sharp).
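To make the brute-force idea concrete, here is a minimal sketch of an all-pairs comparison using the normalized dot product (cosine similarity). It is not the repository's `FindNearestSorted`: the `BruteForceMatcher` class, its method names, and the use of plain `float[]` vectors are assumptions made for illustration.

```cs
using System;
using System.Collections.Generic;
using System.Linq;

public static class BruteForceMatcher
{
    // Normalized dot product: dot(a, b) / (|a| * |b|).
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same number of dimensions.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot   += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // A zero vector has no direction, so treat it as matching nothing
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every vector of the first document against every vector of the second
    // and returns the maxResults highest-scoring pairs, best first.
    public static List<(int IndexA, int IndexB, double Score)> FindBestPairs(
        IReadOnlyList<float[]> first, IReadOnlyList<float[]> second, int maxResults)
    {
        var pairs = new List<(int IndexA, int IndexB, double Score)>();
        for (int i = 0; i < first.Count; i++)
            for (int j = 0; j < second.Count; j++)
                pairs.Add((i, j, CosineSimilarity(first[i], second[j])));

        return pairs.OrderByDescending(p => p.Score).Take(maxResults).ToList();
    }
}
```

This sketch performs `first.Count * second.Count` comparisons and holds every pair in memory before sorting, which is fine for two documents but is exactly the cost an approximate nearest-neighbour index avoids.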

However, not every pair of closely matching vectors constitutes plagiarism. Luckily, we can ask the LLM to compare the chunk pairs whose vectors have the highest-ranking inner products:

```cs
// Format the query to post to the LLM:
foreach (var item in bestMatches)
{
    var isPlagiarism = await FormulateComparisonAsync(item.Value.Item1, item.Value.Item2);
}
```
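The README does not show what `FormulateComparisonAsync` sends to the model. As a rough illustration only, the sketch below builds one possible comparison prompt and interprets a YES/NO reply. The `ComparisonPrompt` class, the prompt wording, and the reply format are all assumptions, and the actual call to the OpenAI service is deliberately left out.

```cs
using System;

public static class ComparisonPrompt
{
    // Builds a prompt asking the model for a one-word verdict on two passages.
    // The wording is hypothetical, not taken from the repository.
    public static string Build(string passageA, string passageB) =>
        "Compare the two passages below. Answer with a single word, YES or NO: " +
        "is one passage likely copied or closely paraphrased from the other?\n\n" +
        $"Passage A:\n\"\"\"\n{passageA}\n\"\"\"\n\n" +
        $"Passage B:\n\"\"\"\n{passageB}\n\"\"\"";

    // Interprets the model's reply; returns null when no clear verdict is found.
    public static bool? ParseVerdict(string reply)
    {
        var answer = reply.Trim().ToUpperInvariant();
        if (answer.StartsWith("YES", StringComparison.Ordinal)) return true;
        if (answer.StartsWith("NO", StringComparison.Ordinal)) return false;
        return null;
    }
}
```

Returning `null` for an unclear reply lets the caller decide whether to retry or flag the pair for manual review, rather than silently counting it as a match or a non-match.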
