{"id":20827400,"url":"https://github.com/cvcio/go-plagiarism","last_synced_at":"2025-05-07T21:05:08.508Z","repository":{"id":47109323,"uuid":"361423146","full_name":"cvcio/go-plagiarism","owner":"cvcio","description":"Plagiarism detection using stopwords n-grams","archived":false,"fork":false,"pushed_at":"2021-09-13T06:46:22.000Z","size":482,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-07T21:05:02.455Z","etag":null,"topics":["algorithm","golang","n-grams","plagiarism","plagiarism-detection","stopwords"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cvcio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-25T12:31:10.000Z","updated_at":"2024-12-16T13:39:32.000Z","dependencies_parsed_at":"2022-09-03T19:11:26.669Z","dependency_job_id":null,"html_url":"https://github.com/cvcio/go-plagiarism","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvcio%2Fgo-plagiarism","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvcio%2Fgo-plagiarism/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvcio%2Fgo-plagiarism/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvcio%2Fgo-plagiarism/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cvcio","download_url":"https://codeload.github.com/cvcio/go-plagiarism/tar.gz/refs/heads/main","host":{"name
":"GitHub","url":"https://github.com","kind":"github","repositories_count":252954429,"owners_count":21830903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","golang","n-grams","plagiarism","plagiarism-detection","stopwords"],"created_at":"2024-11-17T23:12:01.308Z","updated_at":"2025-05-07T21:05:08.490Z","avatar_url":"https://github.com/cvcio.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n[![Language](https://img.shields.io/badge/Language-Go-blue.svg)](https://golang.org/)\n[![Build Status](https://github.com/cvcio/go-plagiarism/workflows/Go/badge.svg)](https://github.com/cvcio/go-plagiarism/actions)\n[![GoDoc](https://pkg.go.dev/badge/github.com/cvcio/go-plagiarism)](https://pkg.go.dev/github.com/cvcio/go-plagiarism)\n[![Go Report Card](https://goreportcard.com/badge/github.com/cvcio/go-plagiarism)](https://goreportcard.com/report/github.com/cvcio/go-plagiarism)\n\n# Plagiarism detection using stopwords *n*-grams\n\n`go-plagiarism` is the main algorithm that utilizes [MediaWatch](https://mediawatch.io) and is inspired by [Efstathios Stamatatos](https://www3.icsd.aegean.gr/lecturers/stamatatos/) paper [Plagiarism detection using stopwords *n*-grams](http://dx.doi.org/10.1002/asi.21630).\n\nWe only rely on a small list of stopwords, for each [language](#supported-languages), to calculate the plagiarism probability between two texts, in combination with *n*-grams that let us find, not only plagiarism but also paraphrase and patchwork plagiarism. 
The images below illustrate the process.\n\nIn the first step we tokenize both texts and keep only the stopwords (red tokens) of each document, as **SourceStopWords** and **TargetStopWords**.\n\n![Plagiarism Detection Algorithm - Function Words - Tokens](https://github.com/cvcio/go-plagiarism/raw/main/assets/Plagiarism%20Detection%20Algorithm%20-%20Function%20Words%20-%20Tokens.png)\n\nNext we transform the stopwords of each document into *n*-grams, with a default of **N = 8**, and calculate the score for each set of *n*-grams.\n\n![N-Grams](https://github.com/cvcio/go-plagiarism/raw/main/assets/N-Grams.png)\n\nIn our case (cc [MediaWatch](https://mediawatch.io)) we use this algorithm to create relationships between similar articles and map the process, or **the chain of misinformation**. Because our goal is to track propaganda networks in the news ecosystem, the algorithm has only been tested in that context.\n\n![The Chain of Misinformation](https://github.com/cvcio/go-plagiarism/raw/main/assets/The%20Chain%20of%20Misinformation.png)\n\n\u003cp align=\"center\"\u003eThe Chain of Misinformation\u003c/p\u003e\n\n![Similarity Network](https://github.com/cvcio/go-plagiarism/raw/main/assets/Similarity%20Network.png)\n\n\u003cp align=\"center\"\u003eSimilarity Network\u003c/p\u003e\n\n## Usage\n\n```bash\ngo get github.com/cvcio/go-plagiarism\n```\n\nTo use the detector, provide either the source and target texts (with `DetectWithStrings`) or a list of stopwords for each text (with `DetectWithStopWords`). You can pass [options](#options) to the detector to set the [language](#supported-languages), the *n*-gram size, or a custom stopword list. After one of the detection methods runs, the detector exposes the final score (`Score`, float64), the number of similar *n*-grams (`Similar`, int), and the total number of *n*-grams (`Total`, int). 
Although the algorithm is still experimental, you can see it in action, in real time, at [app.mediawatch.io](https://app.mediawatch.io), where we continuously monitor Greek news outlets. Read the complete documentation at [go-plagiarism](https://pkg.go.dev/github.com/cvcio/go-plagiarism).\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n\n    \"github.com/cvcio/go-plagiarism\"\n)\n\n// source is the reference text.\nvar source = `Plagiarism detection using stopwords n-grams. go-plagiarism is the main algorithm \nthat utilizes MediaWatch and is inspired by Efstathios Stamatatos paper. \nWe only rely on a list of stopwords to calculate \nthe plagiarism probability between two texts, in combination with n-gram \nloops that let us find, not only plagiarism but also \nparaphrase and patchwork plagiarism. In our case (cc MediaWatch) we \nuse this algorithm to create relationships between similar articles and \nmap the process, or the chain of misinformation. As our \nscope is to track propaganda networks in the news ecosystem, \nthis algorithm only tested in such context.`\n\n// target is source without its first two sentences.\nvar target = `We only rely on a list of stopwords to calculate \nthe plagiarism probability between two texts, in combination with n-gram \nloops that let us find, not only plagiarism but also \nparaphrase and patchwork plagiarism. In our case (cc MediaWatch) we \nuse this algorithm to create relationships between similar articles and \nmap the process, or the chain of misinformation. 
As our \nscope is to track propaganda networks in the news ecosystem, \nthis algorithm only tested in such context.`\n\nfunc main() {\n    // NewDetector returns an error as its second value; do not discard it.\n    detector, err := plagiarism.NewDetector()\n    if err != nil {\n        panic(err)\n    }\n\n    if err := detector.DetectWithStrings(source, target); err != nil {\n        panic(err)\n    }\n\n    fmt.Printf(\"Probability: %.2f, Similar n-grams %d, Total n-grams %d\\n\", detector.Score, detector.Similar, detector.Total)\n}\n\n// \u003e Probability: 0.91, Similar n-grams 72, Total n-grams 79\n```\n\n## Options\n\nThe detector can be initialized with options: `SetN` sets the *n*-gram size, `SetLang` sets the detector's language and assigns the matching stopwords, and `SetStopWords` assigns a custom list of stopwords. Do not use `SetLang` together with `SetStopWords`, because each overrides the other.\n```go\nplagiarism.SetN(n int) Option // will set the desired n-gram size\nplagiarism.SetLang(lang string) Option // will set the detector's language and assign the default stopwords\nplagiarism.SetStopWords(stopWords []string) Option // will set a custom list of stopwords as the default\n```\n\nTo use the detector with options, simply pass them during initialization.\n```go\n// create a detector with an n-gram size of 12 and set the language to Greek\ndetector, err := plagiarism.NewDetector(plagiarism.SetN(12), plagiarism.SetLang(\"el\"))\n```\n\n```go\n// create a detector with default n-gram size (8) and set a custom stopword list\ndetector, err := plagiarism.NewDetector(plagiarism.SetStopWords([]string{\"ο\", \"του\", \"η\", \"της\", \"αλλά\"}))\n```\n\n## Supported Languages\nYou can find all supported languages in the [stopwords.go](/stopwords.go) file. 
Each supported language is keyed by its ISO 639-1 code (string), with the corresponding stopword list ([]string) as the value.\n\n| ISO 639-1 \t| Language       \t| Tested                  \t| Tests     |\n|-----------\t|----------------\t|-------------------------\t|-------    |\n| bg        \t| Bulgarian      \t| Partially Tested        \t| 1         |\n| de        \t| German         \t| Tested (\u003e10K Articles)  \t| 1         |\n| el        \t| Greek          \t| Tested (\u003e10M Articles)  \t| 5         |\n| en        \t| English        \t| Tested (\u003e1M Articles)   \t| 2         |\n| fi        \t| Finnish        \t| Partially Tested        \t| 1         |\n| fr        \t| French         \t| Partially Tested        \t| 1         |\n| hr        \t| Croatian       \t| Partially Tested        \t| 1         |\n| hu        \t| Hungarian      \t| Partially Tested        \t| 1         |\n| it        \t| Italian        \t| Tested (\u003e10K Articles)  \t| 1         |\n| nl        \t| Dutch, Flemish \t| Partially Tested        \t| 1         |\n| no        \t| Norwegian      \t| Partially Tested        \t| 1         |\n| pl        \t| Polish         \t| Partially Tested        \t| 1         |\n| pt        \t| Portuguese     \t| Partially Tested        \t| 1         |\n| ro        \t| Romanian       \t| Partially Tested        \t| 1         |\n| ru        \t| Russian        \t| Tested (\u003e10K Articles)  \t| 1         |\n| tr        \t| Turkish        \t| Tested (\u003e100K Articles) \t| 1         |\n| uk        \t| Ukrainian      \t| Partially Tested        \t| 1         |\n\n### TODO List\n\n- [ ] Include additional test cases for each language\n- [ ] Include tests with various *n*-gram sizes\n- [ ] Introduce a `GetSimilar` method to retrieve similar passages\n  \n\n## Test Coverage\n```bash\ngo test -v\n```\n## Contributing\n\nIf you're new to contributing to open source on GitHub, [this guide](https://opensource.guide/how-to-contribute/) can help you get 
started. Please check out the contribution guide for more details on how issues and pull requests work. Before contributing, be sure to review the [code of conduct](/CODE_OF_CONDUCT.md).\n\n\u003ca href=\"https://github.com/cvcio/go-plagiarism/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=cvcio/go-plagiarism\" /\u003e\n\u003c/a\u003e\n\n## License\n\nThis library is distributed under the MIT License found in the [LICENSE](/LICENSE) file.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcvcio%2Fgo-plagiarism","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcvcio%2Fgo-plagiarism","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcvcio%2Fgo-plagiarism/lists"}