zsh:finddup(): Replace awk solution with uniq
Replace the custom awk solution with uniq by first swapping the filename and file size columns, so that uniq's `-f` flag can be used to skip the filename (uniq has no inverse option, i.e. "only look at field n"). This noticeably improves performance.
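The column swap described above can be tried in isolation. The following is a minimal sketch, assuming GNU coreutils (`uniq -D` is a GNU extension) and input lines in `du -b` style, i.e. `size<TAB>name`. The function name `same_size_names` is hypothetical and not part of the commit. The sample names contain neither tabs nor spaces, because uniq delimits fields on blanks (spaces as well as tabs).

```shell
# Hypothetical helper (not part of the commit): read "size<TAB>name" lines on
# stdin and print the names of all entries whose size occurs more than once.
same_size_names() {
    tab=$(printf '\t')                      # literal tab, portable to plain sh
    awk -F'\t' '{print $2"\t"$1}' \
        | sort -t "$tab" -k2,2n \
        | uniq -f1 -D \
        | cut -d "$tab" -f1
    # awk:  swap the columns to "name<TAB>size"
    # sort: bring equal sizes next to each other
    # uniq: skip field 1 (the name) and print every line of a repeated size
    # cut:  keep only the name
}

# Usage: a.txt/c.txt share size 10, b.txt/e.txt share size 20, d.txt is unique.
printf '10\ta.txt\n20\tb.txt\n10\tc.txt\n30\td.txt\n20\te.txt\n' | same_size_names
# prints a.txt, c.txt, b.txt, e.txt (one per line)
```

Because `-f1` can only skip leading fields, the size has to come last for uniq to compare on it alone, which is why the columns are swapped before sorting.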
@@ -560,18 +560,25 @@ suffix() {
 # Find duplicate files
 finddup() {
     # find all files, filter the ones out with unique size, calculate md5 and
-    # print duplicates
-    # TODO: Fix duplicate lines output in the awk script that currently `sort
-    # -u` handles
+    # print duplicates. Assumes that no file contains tab characters in their
+    # name.
+    #
     # TODO: Use cksum to calculate faster CRC with custom awk solution to print
     # duplicates, as `uniq -w32` breaks through the different CRC lengths.
+    # TODO: The second sort call could be optimized in some way, since we
+    # already grouped files with the same size. Instead of resorting the
+    # whole thing, we only need to check if the files with the same size
+    # have the same hash. Just removing the sort call does almost the
+    # trick just breaks for groups of files with the same size where same
+    # checksums are not behind each other.
+
     find "$@" -type f -exec du -b '{}' '+' \
-        | sort \
-        | awk '{ if (!_[$1]) { _[$1] = $0 } else { print _[$1]; print $0; } }' \
-        | sort -u \
-        | cut -d$'\t' -f2- \
+        | awk -F'\t' '{print $2"\t"$1}' \
+        | sort --field-separator=$'\t' -nk2 \
+        | uniq -f1 -D \
+        | cut -d$'\t' -f1 \
         | xargs -d'\n' md5sum \
-        | sort \
+        | sort -k1,1 \
         | uniq -w32 --all-repeated=separate \
         | cut -d' ' -f3-
 }
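The two TODOs in the comment block both come down to grouping lines by checksum without relying on `uniq -w32` or on a preceding sort. The following is a minimal sketch of one possible direction, not the author's planned solution. The function name `dup_groups` is hypothetical, and it assumes input lines of the form `<checksum> <rest>` with the checksum as the first whitespace-delimited field. It uses only POSIX awk.

```shell
# Hypothetical helper (not part of the commit): read "<checksum> <rest>" lines
# on stdin and print every line whose checksum occurs more than once, with a
# blank line between groups. Input order does not matter, and the checksum may
# have any length, since the whole first field is used as the key.
dup_groups() {
    awk '
        {
            key = $1
            if (!(key in count)) order[++n] = key   # remember first-seen order
            count[key]++
            lines[key] = lines[key] $0 "\n"         # collect lines per checksum
        }
        END {
            first = 1
            for (i = 1; i <= n; i++) {
                key = order[i]
                if (count[key] < 2) continue        # unique checksum, skip
                if (!first) print ""                # separator between groups
                first = 0
                printf "%s", lines[key]
            }
        }
    '
}

# Usage: "aa" and "bbb" each occur twice and are not adjacent in the input.
printf 'aa  x\nbbb  y\naa  z\ncc  w\nbbb  v\n' | dup_groups
```

The trade-off is that all input lines are held in memory until the end, whereas the `sort | uniq` pipeline streams its output.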