{"id":36824492,"url":"https://github.com/scanoss/wfp","last_synced_at":"2026-01-12T14:03:23.393Z","repository":{"id":45261984,"uuid":"279117521","full_name":"scanoss/wfp","owner":"scanoss","description":"Winnowing fingerprint extractor","archived":false,"fork":false,"pushed_at":"2025-09-12T18:39:20.000Z","size":1494,"stargazers_count":17,"open_issues_count":0,"forks_count":8,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-09-12T21:00:36.667Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scanoss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSES/GPL-2.0-only.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2020-07-12T17:48:04.000Z","updated_at":"2025-09-12T18:39:24.000Z","dependencies_parsed_at":"2024-07-12T23:41:09.960Z","dependency_job_id":null,"html_url":"https://github.com/scanoss/wfp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scanoss/wfp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scanoss%2Fwfp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scanoss%2Fwfp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scanoss%2Fwfp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scanoss%2Fwfp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scanoss","download_u
rl":"https://codeload.github.com/scanoss/wfp/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scanoss%2Fwfp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28340242,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T12:22:26.515Z","status":"ssl_error","status_checked_at":"2026-01-12T12:22:10.856Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-12T14:03:23.265Z","updated_at":"2026-01-12T14:03:23.342Z","avatar_url":"https://github.com/scanoss.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SOURCE CODE FINGERPRINTING WITH WINNOWING\n\nThe Winnowing algorithm has been used for many years in academic networks to obtain fingerprints from documents and source code. These fingerprints are used to check for plagiarism against known texts and source code. There are several open source implementations of the Winnowing algorithm available today. 
Given the wide adoption of the Winnowing algorithm and the broad availability of open source implementations, SCANOSS has chosen this algorithm to compare and identify known open source code.\n\n## The Winnowing algorithm\n\nThe algorithm converts source code into fingerprints, which takes four steps:\n\n- Normalization\n- Gram fingerprinting\n- Window selection\n- Output formatting\n\n### Normalization\n\nThe normalization process consists of eliminating all non-alphanumeric characters from the input and, as the example below shows, converting letters to lowercase:\n\n#### Original source code\n\n```\n\nfor (uint32_t i = 0; i \u003c src_len; i++)\n{\n\tif (src[i] == '\\n') line++;\n\tuint8_t byte = normalize(src[i]);\n\tif (!byte) continue;\n\tgram[gram_ptr++] = byte;\n\tif (gram_ptr \u003e= GRAM)\n\t{\n\t\twindow[window_ptr++] = calc_crc32c((char *) gram, GRAM);\n\t\tif (window_ptr \u003e= WINDOW)\n\t\t{\n\t\t\thash = smaller_hash(window);\n\t\t\tlast = add_hash(hash, line, hashes, lines, last, \u0026counter);\n\t\t\tif (counter \u003e= limit) break;\n\t\t\tshift_window(window);\n\t\t\twindow_ptr = WINDOW - 1;\n\t\t}\n\t\tshift_gram(gram);\n\t\tgram_ptr = GRAM - 1;\n\t}\n}\n\n```\n\n#### Normalized code\n\n```\n\nforuint32ti0isrcleniifsrcinlineuint8tbytenormalizesrciifbytecontinuegramgramptrbyteifgramptrgramwin\ndowwindowptrcalccrc32cchargramgramifwindowptrwindowhashsmallerhashwindowlastaddhashhashlinehashesli\nneslastcounterifcounterlimitbreakshiftwindowwindowwindowptrwindow1shiftgramgramgramptrgram1\n\n```\n\n### Gram fingerprinting\n\nFrom the normalized code, a series of overlapping data samples is taken and fingerprinted. The number of bytes in each sample is called the _gram_, which is 10 bytes in the present example. 
Given the availability of the CRC32C checksum algorithm embedded in most Intel chipsets, we decided to use a simple CRC32C checksum as a gram fingerprint.\n\n#### Gram fingerprints from the previous example\n\n```\nforuint32t = 1adf644b\noruint32ti = 6f72669d\nruint32ti0 = 88ad5ece\nuint32ti0i = d368b44c\nint32ti0is = 2123892a\nnt32ti0isr = 336cdfdd\nt32ti0isrc = 1c8e832d\n32ti0isrcl = 6b7d73f6\n2ti0isrcle = c02dce5b\nti0isrclen = d31d3b69\ni0isrcleni = d8a27ef1\n0isrclenii = f01878ee\nisrcleniif = f51fa9b6\nsrcleniifs = 1e385339\nrcleniifsr = eafcb14a\n[...]\n```\n\n### Window selection\n\nFrom the series of gram fingerprints, consecutive groups are taken. The number of gram fingerprints in each group is called the _window_, which is 15 grams in the present example. From each window, the smallest gram fingerprint is selected.\n\nThe sorted list of gram fingerprints in the first window of the previous example follows:\n\n```\n1adf644b,1c8e832d,1e385339,2123892a,336cdfdd,6b7d73f6,6f72669d,88ad5ece,\nc02dce5b,d31d3b69,d368b44c,d8a27ef1,eafcb14a,f01878ee,f51fa9b6\n```\n\n#### Window fingerprinting\n\nThe smallest fingerprint is selected as the identifier for each window, which naturally results in a reduced output range of fingerprints, favouring low checksum values. This lack of uniformity would lead to a costly imbalance in database index trees when storing large amounts of data. A simple fix for this is to calculate the checksum of the checksum, which restores the uniformity of the output data. For the previous example, the CRC32C checksum for **1adf644b** results in **688c09fe**, which is the first window hash for the file.\n\n### Output formatting\n\nWinnowing fingerprints should be represented in a simple format that is both machine-readable and human-readable. With this in mind, we defined the .wfp (Winnowing fingerprint) file extension and .wfp file format.\n\nThe .wfp file contains a series of file declarations followed by the code fingerprints. 
Originating line numbers are kept in order to pinpoint the exact lines where occurrences are found.\n\nThe file declaration contains the original file name and the full file hash (MD5 in this example), so that an entire, unmodified file can be compared before comparing subsets.\n\nThe following _.wfp_ file contains the winnowing fingerprints for _test.c_ displayed above, with a configuration of _gram=10_ and _window=15_:\n\n```\nfile=34cff02ed13a3d26e716e473d4e8900d,test.c\n3=688c09fe,fc6d701d,61b2b37c\n5=5f7b1b19,99181ce1,79923cb2,64691599\n6=f218cd1c\n8=7cf9f396,17c3dd99\n10=3a693f60,fb9493ca,54fc128c\n12=6f8dfa99,d3f3a3ca,04a0062b\n13=bccec1a8,1657ceac\n15=4dde1f15,a4c8bf7a\n16=b657086d,39b9f206,bec983db,2978bdfa,787f39f2,8145af5e\n18=1fb6cdda\n20=c18636e3,47091215,7f040b14\n21=d3f3a3ca,08db7055\n23=c2506fa2\n24=e3c50129,95383750\n```\n\n## Study on _gram_ and _window_ value pairs\n\nThe Winnowing algorithm admits configuration of two main parameters: _gram_ and _window_. Selecting the right values has a direct impact on output uniformity and footprint. These values affect performance and quality of results.\n\nIn order to find a suitable configuration, we executed tests with different values for different programming languages and different applications. Some of these results are made available below.\n\n### Gram\n\nThe smaller the value of _gram_, the lower the output uniformity and the higher the possibility of data collision. For example, a _gram_ value of _4_ would lead to the fingerprint for the word _else_ becoming very popular since the word is common in many programming languages. 
The bigger the _gram_ value, however, the less likely it is to find matches in modified code.\n\n### Window\n\nThe bigger the _window_, the lower the output footprint, but also the lower the chances of finding matches in modified code.\n\n### Uniformity and footprint\n\nUniformity and footprint are the two resulting factors evaluated when testing different configurations for _gram_ and _window_.\n\n#### Footprint\n\nTo evaluate footprint, we simply count the number of fingerprints generated in the output. The graphs below illustrate how footprint is affected by different combinations of _gram_ and _window_:\n\n- [C (zlib)](images/W-C.png)\n- [Java (pngtastic)](images/W-JAVA.png)\n- [Javascript (jquery)](images/W-JQuery.png)\n- [Ruby (jqueryrails)](images/W-Ruby.png)\n\n#### Uniformity\n\nIn order to evaluate uniformity, we establish a uniformity index, which is a factor indicating how many times the most common fingerprint repeats vs. the least common one. For example, if the least common fingerprint appears twice while a given fingerprint appears 10 times, then that fingerprint has a uniformity factor of 5 for the exercise. Therefore, the lower the uniformity index, the greater the output uniformity.\n\nThe graphs below illustrate how different combinations of _gram_ and _window_ affect uniformity:\n\n- [C (zlib)](images/H-C.png)\n- [Java (pngtastic)](images/H-JAVA.png)\n- [Javascript (jquery)](images/H-JQuery.png)\n- [Ruby (jqueryrails)](images/H-Ruby.png)\n\n### Conclusion\n\nBased on the different exercises and comparison tests, we concluded that _gram=30_ and _window=64_ provide a good balance between footprint and uniformity, and have proven so far to provide good matching capabilities.\n\n## License\n\nWFP is released under the GPL 2.0 license. Please check the LICENSE file for more information.\n\nCopyright (C) 2018-2020 SCANOSS Ltd. 
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscanoss%2Fwfp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscanoss%2Fwfp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscanoss%2Fwfp/lists"}