Duplicate image detector (15)

1 Name: Cyrilthefish!ljAhqzG3aU : 2006-12-22 17:11 ID:B7y1ouZA

Ever since I started hosting a few of the boards here, I've kept an rsync backup of the webserver going to my home server, run whenever I remembered to do it...

Fast forward a year or so and the backup archive on my home server is taking up a ludicrous amount of disk space ^_^
16.9GB, 180k files

I tried running a few duplicate image detectors on it to delete all the repeated images, but the only one I found that seems to work properly (DupDetector) can't be made to delete files automatically... I can't be arsed to click 'OK' individually on several thousand files :P

Any suggestions for programs to try? Preferably command-line if possible; Windows or Linux is fine: the server is Linux and the archive directory is mapped to my XP machine...

HALP :D

2 Name: 4n0n4ym0u5 h4xx0r : 2006-12-22 21:52 ID:60aLXkcE

For Windows, I've done it with ACDSee.

3 Name: 4n0n4ym0u5 h4xx0r : 2006-12-23 03:19 ID:xEEM64rw

Write a shell script that compares md5sums.

4 Name: 4n0n4ym0u5 h4xx0r : 2006-12-23 13:43 ID:iAKCNbZV

http://mail.python.org/pipermail/python-list/2005-March/311700.html

This might just work, or require minimal changes to work.

5 Name: Redhatter : 2006-12-24 04:38 ID:Heaven

As per >>3's suggestion... the following will sort a list of files by SHA1sum (like an MD5sum, but with a longer digest, so accidental collisions are even less likely).

#!/bin/sh

# Hash every file under dir; sort puts duplicates on adjacent lines.
# -print0/-0 keeps filenames with spaces intact.
find dir -type f -print0 | xargs -0 sha1sum | sort

All files with the same SHA1sum can be assumed to be identical, so the duplicates may be safely deleted. With a little shell magic, you could tell the script to do that for you, replacing each duplicate with a symlink pointing back to the original file, so as to avoid breaking existing hyperlinks.
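
A minimal sketch of that shell magic, assuming GNU coreutils (for readlink -f) and filenames without embedded newlines; /path/to/archive is a stand-in:

#!/bin/sh
# Walk the sorted hash list; keep the first file seen for each SHA1
# and replace every later copy with a symlink to it.
find /path/to/archive -type f -print0 | xargs -0 sha1sum | sort |
while read -r hash file; do
    if [ "$hash" = "$prev" ]; then
        ln -sf "$keep" "$file"        # duplicate: swap in a symlink
    else
        prev=$hash
        keep=$(readlink -f "$file")   # absolute path, so the link resolves from anywhere
    fi
done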

6 Name: 4n0n4ym0u5 h4xx0r : 2006-12-26 11:04 ID:MdZ1ZPyl

DupDetector for Windows is what I use; I haven't really tried any other programs, but I think it does a decent job.

http://www.photo-freeware.net/dupdetector.php

7 Name: 4n0n4ym0u5 h4xx0r : 2006-12-26 18:59 ID:jhoZ76z3

Here is a bigger problem: quite possibly many files are exactly the same in content, but may cough up different checksums.

8 Name: Redhatter : 2006-12-27 03:34 ID:Heaven

>>7 Errm... isn't the whole point of hashing algorithms to generate the same hash for the same piece of data?

You do have a risk of two different blocks of data having the same hash... but this is rare. MD5 produces a 128-bit hash and SHA1 a 160-bit one; a collision is possible but highly unlikely -- for 180k files and a 128-bit hash, the birthday bound puts the chance of any accidental collision on the order of 10^-28. You could also use SHA256, which produces a 256-bit hash, for even less chance of a collision. By combining several of these algorithms (say, CRC32, MD5 and SHA1), you can be pretty sure which files are duplicates.
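
For instance, a paranoid composite key is just two independent digests side by side; a quick non-recursive sketch (dir is a placeholder):

# Print "md5 sha1 path" per file and sort, so only files agreeing on
# BOTH digests end up on adjacent lines.
for f in dir/*; do
    printf '%s %s %s\n' "$(md5sum < "$f" | cut -d' ' -f1)" \
                        "$(sha1sum < "$f" | cut -d' ' -f1)" "$f"
done | sort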

It is possible for the same image to hash to different values -- typically this happens when you have the same image saved in multiple formats. That is, the "duplicates" are in fact different files.

The problem >>1 describes suggests we're dealing with byte-for-byte identical copies in the same format, not the same image in different formats.

9 Name: 4n0n4ym0u5 h4xx0r : 2006-12-29 11:03 ID:F6lIZv+r

Differences in metadata, or junk data at the end of the file.
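
One way around that is to hash the decoded pixels rather than the raw bytes, so metadata and trailing junk stop mattering. A sketch with ImageMagick's identify, assuming its %# escape (the pixel-data signature) behaves this way in your version; dir is a placeholder:

# Print "pixel-signature path" per image and sort; files that decode
# to identical pixels get identical signatures, whatever their
# metadata or trailing bytes say.
find dir -type f -print0 | xargs -0 identify -quiet -format '%# %i\n' | sort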

10 Name: 4n0n4ym0u5 h4xx0r : 2006-12-29 15:21 ID:+aeagM9B

>>7

ImageMagick's compare exits 0 when the two images are similar (1 when they differ, 2 on error), so:

compare -quiet "$first_image" "$second_image" /dev/null && echo "Duplicate images detected: $first_image $second_image"

11 Name: 4n0n4ym0u5 h4xx0r : 2007-01-17 21:18 ID:r5y65dVX

A simple #tma way to avoid binary-identical dupes:

  • hash each file with SHA1 and name it after its hash in base32 representation
  • store the files in a "hashstore", which is just a simple structure two directories deep (keyed on the first and second characters of the file name); e.g. the file IJ3KLR7SUVJZ7JM342EJATDUM7BJZVGM is stored as /whatever/hashstore/I/J/IJ3KLR7SUVJZ7JM342EJATDUM7BJZVGM
  • if you need the file to appear elsewhere (an image board's public directory, for example), use a symlink pointing to its position in the hashstore

This is the way we've done it for over a year and it works very well: if a new dupe arrives, nothing happens, because your script will just ignore it and won't move it into the hashstore. (The owner of the "4chan archive" had over 500GB of data stored like this; he now uses a custom filesystem to avoid unnecessary overhead, but for a small store of only a few gigabytes this performs well on any filesystem.)

With some scripting you could make all your imageboards use the same hashstore, which would avoid storing binary dupes across boards. I have all of this implemented in Perl and Ruby myself; other users did it with Python, or with shell scripts calling and parsing bitcollider output.
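
A minimal shell sketch of the move-into-hashstore step, assuming GNU coreutils' base32 and openssl for the raw digest (a 160-bit SHA1 is exactly 32 base32 characters, no padding); the paths and the put name are made up:

#!/bin/sh
store=/whatever/hashstore

put() {
    f=$1
    # raw 20-byte SHA1 -> 32-character base32 name, as in the example above
    name=$(openssl dgst -sha1 -binary "$f" | base32)
    dest=$store/$(printf '%s' "$name" | cut -c1)/$(printf '%s' "$name" | cut -c2)/$name
    mkdir -p "${dest%/*}"
    if [ -e "$dest" ]; then
        rm "$f"          # dupe: this content is already in the store
    else
        mv "$f" "$dest"
    fi
    ln -s "$dest" "$f"   # keep the old public path working
}

put /www/board/src/1234.jpg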

Now for non-binary-identical dupes (alias "nbid"): the short answer is to google "Schwartz finding similar images". The long answer: come to #tma @ freenode and feel free to discuss ;)

12 Name: 4n0n4ym0u5 h4xx0r : 2007-01-17 21:27 ID:j5jwiPkn

13 Name: 4n0n4ym0u5 h4xx0r : 2007-01-17 21:47 ID:/EUrSzwP

14 Name: 4n0n4ym0u5 h4xx0r : 2007-01-18 13:16 ID:Heaven

God, are those #tma nutters still around? Talk about your obsessive-compulsive nerds.

15 Name: 4n0n4ym0u5 h4xx0r : 2007-02-05 05:47 ID:UlHxEx3v

>>11 Not very useful if you just want to browse through your porn folder and not see the same picture twice.
