https://github.com/codecliff/fdupesanalyzer
A script that analyzes the output of the fdupes Linux utility to find the level of overlap between directories. Written in R.
- Host: GitHub
- URL: https://github.com/codecliff/fdupesanalyzer
- Owner: codecliff
- License: mit
- Created: 2020-01-31T21:39:29.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-02-09T17:35:10.000Z (almost 6 years ago)
- Last Synced: 2025-01-04T14:43:17.232Z (about 1 year ago)
- Topics: bash, bash-script, directory, duplicates, fdupes, fdupes-linux-utility, files, r, rstudio
- Language: R
- Size: 237 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.html
- License: LICENSE
README
FdupesAnalyzer
A utility that analyzes the output of the fdupes Linux utility to find the level of overlap between directories. Written in R. https://github.com/codecliff/FdupesAnalyzer
Why:
fdupes by Adrián López gives you a file-by-file list of duplicates. It works very well with renamed copies, files exported by image editors, and the like. However, to clean up a large dump of files accumulated over the years by multiple users, I needed directory-level summaries such as “70% of the files in dir A are also in dir B” or “dir A has copies of all the files in dir B”. This utility script creates a CSV file with all this information.
How To Use:
- Run fdupes and redirect the results to a file.
fdupes -Sr rootpath >> fdupes_output.txt
- Edit the R script FDupesParser.R: update the path of the fdupes output file and rootpath.
- Run the R script (preferably in interactive mode, ideally in RStudio)
- Review the CSV file generated by the script
- (Optional) Generate fdupes commands for each directory pair and run as a batch
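To make the parsing step more concrete, here is a minimal sketch in base R of how the output of `fdupes -Sr` could be turned into per-directory-pair match counts. It is not the code in FDupesParser.R: the helpers `parse_fdupes` and `count_pairs` and the sample lines are hypothetical, and the sketch assumes that each duplicate set counts once for every pair of directories it spans.

```r
# Hypothetical helper: turn lines of `fdupes -Sr` output into one row per file.
parse_fdupes <- function(lines) {
  # -S adds a "<n> bytes each:" header before every duplicate set; drop those
  lines <- lines[!grepl("^[0-9]+ bytes? each:$", lines)]
  # A blank line closes each set, so a running count of blanks numbers the sets
  set_id <- cumsum(lines == "") + 1
  keep <- lines != ""
  data.frame(set = set_id[keep],
             dir = dirname(lines[keep]),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: count duplicate sets shared by each pair of directories.
count_pairs <- function(dups) {
  per_set <- lapply(split(dups$dir, dups$set), function(d) {
    d <- sort(unique(d))
    if (length(d) < 2) return(NULL)  # all copies live in a single directory
    p <- t(combn(d, 2))              # every unordered pair of directories
    data.frame(dir1 = p[, 1], dir2 = p[, 2], matchcnt = 1L,
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, per_set)
  if (is.null(pairs)) return(NULL)   # no cross-directory duplicates found
  aggregate(matchcnt ~ dir1 + dir2, data = pairs, FUN = sum)
}

# Made-up sample standing in for readLines("fdupes_output.txt")
sample_lines <- c("10 bytes each:", "./a/x.txt", "./b/x_copy.txt", "",
                  "20 bytes each:", "./a/y.txt", "./b/y.txt", "./c/y.txt", "")
overlap <- count_pairs(parse_fdupes(sample_lines))
print(overlap)
```

On real data, `sample_lines` would be replaced by the lines read from the file produced in the first step.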
Output file formats:
Generated CSV file
- “dir1” : directory 1
- “dir2” : directory 2
- “matchcnt”: number of files matching between dir1 and dir2
- “acnt” : file count in dir1
- “bcnt” : file count in dir2
- “aprct” : percentage of files in dir1 that have a copy in dir2
- “bprct” : percentage of files in dir2 that have a copy in dir1
- “maxprct” : max of above two
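The three percentage columns follow directly from the counts. The sketch below shows one way they could be derived, assuming “matchcnt”, “acnt” and “bcnt” are already known; the helper `add_percentages`, the rounding to one decimal place, and the example row are illustrative and not taken from FDupesParser.R.

```r
# Hypothetical helper: derive the percentage columns from the count columns.
add_percentages <- function(pairs) {
  pairs$aprct <- round(100 * pairs$matchcnt / pairs$acnt, 1)  # share of dir1 also in dir2
  pairs$bprct <- round(100 * pairs$matchcnt / pairs$bcnt, 1)  # share of dir2 also in dir1
  pairs$maxprct <- pmax(pairs$aprct, pairs$bprct)             # row-wise maximum of the two
  pairs
}

# Made-up row: 7 of the 10 files in ./A match all 7 files in ./B
example <- data.frame(dir1 = "./A", dir2 = "./B",
                      matchcnt = 7L, acnt = 10L, bcnt = 7L,
                      stringsAsFactors = FALSE)
result <- add_percentages(example)
print(result)
```

In this example row, 70% of the files in dir1 are also in dir2, while dir1 holds a copy of every file in dir2, so “maxprct” is 100.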
Generated script file
sudo fdupes -dN "./imgs/music" "./imgs/2018-03-oldccombk/stuff/"
sudo fdupes -dN "./ntfs/2017-backup/weds" "./IMAGES/Pictures_2017/.mail_downloads"
sudo fdupes -dN "./IMAGES/Picture/weds" "./IMAGES/Pictures_2017/oldlaptop_hdd"
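Lines like these could be produced from the CSV rows with a few lines of R. The sketch below is an assumption about how that might work, not the script's actual logic: the helper `make_commands` and its `threshold` parameter are hypothetical, and the sample rows are made up. Since `fdupes -dN` keeps the first file of each duplicate set and deletes the rest without prompting, the generated lines are worth reviewing before they are run as a batch.

```r
# Hypothetical helper: build one fdupes command per directory pair whose
# overlap reaches the (assumed) threshold.
make_commands <- function(pairs, threshold = 100) {
  sel <- pairs[pairs$maxprct >= threshold, ]
  sprintf('sudo fdupes -dN "%s" "%s"', sel$dir1, sel$dir2)
}

# Made-up rows in the shape of the generated CSV (only the columns used here)
pairs <- data.frame(dir1 = c("./a", "./c"),
                    dir2 = c("./b", "./d"),
                    maxprct = c(100, 40),
                    stringsAsFactors = FALSE)
cmds <- make_commands(pairs)
writeLines(cmds)  # pass a file path as second argument to save as a script
```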
Prerequisites
- R
- R packages: data.table, tools
- fdupes
Tested on
- Ubuntu 18.04
- R 3.6.2
- RStudio 1.1.463