Ever since I started hosting a few of the boards here, I've kept an rsync backup of the webserver going to my home server whenever I remembered to run it...
Fast forward a year or so and the backup archive on my home server is taking up a ludicrous amount of disk space ^_^
16.9GB space, 180k files
I tried running a few duplicate image detectors on it to delete all the repeated images, but the only one I found that seems to work properly (DupDetector) can't be made to delete files automatically... I can't be arsed to click 'OK' individually on several thousand files :P
Any suggestions for programs to try? Preferably command-line if possible; Windows or Linux is fine: the server is Linux and the archive directory is mapped to my XP machine...
HALP :D
For Windows, I've done it with ACDSee.
write a shell script that compares md5sums
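If the archive lives on the Linux box anyway, something along these lines might already be enough (a rough sketch assuming GNU md5sum/sort/uniq; /path/to/archive is a placeholder):

find /path/to/archive -type f -exec md5sum {} + | sort | uniq -w32 -D

That prints every file whose MD5 matches at least one other file, with the matching lines grouped together, so you can sanity-check the groups before deleting anything.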
http://mail.python.org/pipermail/python-list/2005-March/311700.html
This might work as-is, or need only minimal changes.
As per >>3's suggestion... the following will sort a list of files by SHA1sum (like an MD5sum, but a longer hash, so collisions are even less likely).
#!/bin/sh
# Hash every file under the given directory (default: current dir) and sort,
# so that byte-identical files (same SHA1) end up on adjacent lines.
find "${1:-.}" -type f -print0 | xargs -0 sha1sum | sort
All files with the same sha1sum can be assumed to be identical, so the duplicates may be safely deleted. With a little shell magic you could probably have the script do that for you, replacing each duplicate with a symlink pointing back to the original file, so as to prevent breaking existing links.
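For example, something like this (an untested sketch assuming bash and GNU coreutils, and plain filenames without newlines; it keeps the first copy of each hash and turns the rest into symlinks, so try it on a copy of the archive first):

#!/bin/bash
# For each SHA1 hash, remember the first file seen; replace any later file
# with the same hash by a symlink pointing at that first copy.
declare -A seen
while read -r hash file; do
    if [ -n "${seen[$hash]}" ]; then
        ln -sf "${seen[$hash]}" "$file"   # duplicate: point it at the original
    else
        seen[$hash]=$file                 # first occurrence: keep as-is
    fi
done < <(find "$(readlink -f "${1:-.}")" -type f -exec sha1sum {} +)

readlink -f is there so the symlink targets come out absolute and still resolve no matter where the link sits.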
DupDetector for Windows is what I use; I haven't really tried any other programs, but I think it does a decent job.
Here is a bigger problem: quite possibly many files are exactly the same in content, but may cough up different checksums.
>>7 Errm... isn't the whole point of hashing algorithms to generate the same hash for the same piece of data?
You do have a risk of two different blocks of data having the same hash... but that is rare. MD5 produces a 128-bit hash and SHA1 a 160-bit hash: a collision is possible, but highly unlikely. You could also use SHA256, which produces a 256-bit hash, for an even smaller chance of a collision. And if you combine several of these (say CRC32, MD5 and SHA1), you can be pretty sure which files are duplicates.
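If you really want to be that paranoid, a sketch of the combined approach (/path/to/archive is a placeholder; any two of the sums would already be plenty):

# Print "md5:sha1  filename" for every file, sorted so that files matching
# on BOTH hashes end up next to each other.
find /path/to/archive -type f -exec sh -c '
  for f; do
    printf "%s:%s  %s\n" "$(md5sum < "$f" | cut -d" " -f1)" \
                         "$(sha1sum < "$f" | cut -d" " -f1)" "$f"
  done' sh {} + | sort

Two different files colliding on MD5 and SHA1 at the same time is not something you need to lose sleep over.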
It is possible for the same image to hash to different values -- typically this happens when you have the same image in multiple formats. That is, the "duplicates" are in fact different files.
The problem that >>1 describes suggests that we're dealing with identical copies of images in the same format, not different formats.
Differences in metadata, or junk data at the end of the file, will also do it: same image, different bytes, different checksum.
# assuming ImageMagick's compare here: it exits 0 when the images match, so the "!" was backwards
compare -quiet -metric AE "$first_image" "$second_image" /dev/null 2>/dev/null && echo "Duplicate images detected: $first_image $second_image"
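And if anyone wants to throw that at a whole directory, a brute-force sketch (ImageMagick assumed; it checks every pair, so it's only sane for smallish sets):

#!/bin/bash
# Compare every unordered pair of files in the given directory with
# ImageMagick's compare; exit status 0 means the pixels match.
for a in "$1"/*; do
    for b in "$1"/*; do
        [[ $a < $b ]] || continue            # each pair once, skip self
        if compare -quiet -metric AE "$a" "$b" /dev/null 2>/dev/null; then
            echo "Duplicate images detected: $a $b"
        fi
    done
done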
Simple #tma way to avoid binary-identical dupes: name every file after its hash and keep a single copy of it in a "hashstore" directory, with the boards pointing at that.
This is the way we did it for over a year and it works very well. When a new dupe arrives, nothing happens: your script just ignores it and doesn't move it into the hashstore. (The owner of the "4chan archive" had over 500GB of data stored like this; he now uses a custom filesystem to avoid unnecessary overhead, but for a small store of only a few gigabytes it performs well on any filesystem.)
With some scripting you could make all your imageboards use the same hashstore, so you'd avoid storing binary dupes across boards as well. I have all of this implemented in Perl and Ruby; other users used Python or shell scripts calling and parsing bitcollider output.
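In case anyone wants a concrete sketch of the idea (paths are placeholders, sha1sum stands in for bitcollider, and this is not the actual #tma script):

#!/bin/sh
# Hashstore sketch: the canonical copy of every file lives in $HASHSTORE under
# its own hash; the board directories only hold symlinks into it.
HASHSTORE=/srv/hashstore              # placeholder location, must already exist
f=$1
hash=$(sha1sum < "$f" | cut -d' ' -f1)
if [ ! -e "$HASHSTORE/$hash" ]; then
    mv "$f" "$HASHSTORE/$hash"        # first copy ever seen: move it in
else
    rm "$f"                           # binary dupe: the store already has it
fi
ln -s "$HASHSTORE/$hash" "$f"         # either way, the board path points at the store

Point all the boards at the same $HASHSTORE and the cross-board dupes disappear too.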
Now for non-binary-identical dupes, alias nbid: the short answer is to google "Schwartz finding similar images". The long answer... come to #tma @ freenode and feel free to discuss ;)
God, are those #tma nutters still around? Talk about your obsessive-compulsive nerds.