https://github.com/cardi/content-reuse-detection
source code for content reuse detection paper
- Host: GitHub
- URL: https://github.com/cardi/content-reuse-detection
- Owner: cardi
- Created: 2018-10-17T03:49:53.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-10-17T07:29:23.000Z (over 6 years ago)
- Last Synced: 2025-02-05T19:12:44.241Z (4 months ago)
- Language: Python
- Size: 2.91 MB
- Stars: 1
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Content Reuse Detection
This repository contains the code and pointers to datasets used in the
paper "Precise Detection of Content Reuse in the Web" by Calvin Ardi and
John Heidemann.

`data/` contains pointers to datasets used in the paper, along with
lists of files for verification. The data can usually be accessed on
Amazon's S3 or downloaded via HTTP or BitTorrent.
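
The file lists can be used to check downloads. As a hypothetical
illustration (the manifest filename and the `<sha1>  <filename>` line
format below are assumptions, not necessarily the repository's actual
format), a download could be verified with a short Python script:

```python
import hashlib

def sha1_of_file(path, bufsize=1 << 20):
    """Compute the SHA-1 digest of a file, reading in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            h.update(buf)
    return h.hexdigest()

# Assumes a manifest of "<sha1>  <filename>" lines (hypothetical format).
with open("MANIFEST") as manifest:
    for line in manifest:
        expected, name = line.split()
        status = "OK" if sha1_of_file(name) == expected else "MISMATCH"
        print(f"{name}: {status}")
```
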
`java/` contains code using Apache Hadoop 1.x MapReduce to generate hashes
of files and their corresponding chunks. For easier processing of files,
it's advisable to convert archives (`.tar`, etc.) to [SequenceFile]s
(`.seq`) using something like [forqlift].

[SequenceFile]: https://wiki.apache.org/hadoop/SequenceFile
[forqlift]: http://www.exmachinatech.net/projects/forqlift/
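
As a rough sketch of the hashing step itself (not the repository's Java
implementation, and with an assumed chunk size), the idea in Python:

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size; the paper's parameters may differ

def hash_file_and_chunks(path):
    """Return (file_hash, chunk_hashes) for one file using SHA-1.

    This sketches what the Java MapReduce code does at scale: one hash
    for the whole file, plus one hash per fixed-size chunk.
    """
    file_hash = hashlib.sha1()
    chunk_hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            file_hash.update(chunk)
            chunk_hashes.append(hashlib.sha1(chunk).hexdigest())
    return file_hash.hexdigest(), chunk_hashes
```
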
`run_scripts/` contains shell scripts used for executing the `.jar`
generated in `java/`. Since most of the Java MapReduce code was run on
Amazon's EMR service, it also contains scripts to create/destroy
instances.
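
The scripts themselves are shell scripts; as a loose sketch of the same
idea in Python with boto3, where the region, AMI version, instance
types, and S3 paths are all placeholders:

```python
import boto3

# Rough sketch: create an EMR cluster and submit the hashing job.
# All names, types, and paths below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="content-reuse-hashing",
    AmiVersion="2.4.11",  # legacy EMR AMIs ship Hadoop 1.x
    Instances={
        "MasterInstanceType": "m1.large",
        "SlaveInstanceType": "m1.large",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when done
    },
    Steps=[{
        "Name": "hash-files-and-chunks",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://my-bucket/content-reuse.jar",  # placeholder
            "Args": ["s3://my-bucket/input.seq", "s3://my-bucket/output/"],
        },
    }],
)
print("Started job flow:", response["JobFlowId"])
```
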
`python/` contains code using Apache Hadoop 1.x Streaming and requires
[mrjob]. Code here handles the processing of the intermediate output
from the Java code to generate the final outputs.

[mrjob]: https://pypi.org/project/mrjob/
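
A minimal mrjob job, purely as a sketch (the job name and the
`<chunk_hash>\t<source>` record format are assumptions about the
intermediate output, not the repository's actual schema):

```python
from mrjob.job import MRJob

class ChunkCount(MRJob):
    """Hypothetical job: count how often each chunk hash appears in the
    intermediate output, so that reused chunks stand out."""

    def mapper(self, _, line):
        # Assumes one "<chunk_hash>\t<source>" record per line.
        chunk_hash = line.split("\t", 1)[0]
        yield chunk_hash, 1

    def reducer(self, chunk_hash, counts):
        yield chunk_hash, sum(counts)

if __name__ == "__main__":
    ChunkCount.run()
```

mrjob can run such a job locally for testing or on EMR with its `-r emr`
runner (e.g., `python chunkcount.py -r emr <input>`).
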
## Other Notes
At the time of writing, we ultimately found that Hadoop MapReduce
performed best on binary and large archives when using Java and
SequenceFiles. The intermediate output could then be efficiently
processed using Hadoop Streaming in the language of choice (in our
case, Python).
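
Hadoop Streaming pipes records through standard input and output, so a
mapper or reducer can be any executable. A minimal Python mapper might
look like this (the tab-separated record format is an assumption):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: read tab-separated records on stdin,
# emit "<chunk_hash>\t1" pairs on stdout for a reducer to sum.
import sys

for line in sys.stdin:
    chunk_hash = line.rstrip("\n").split("\t", 1)[0]
    print(f"{chunk_hash}\t1")
```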