Duplicate image detector (15)

11 Name: 4n0n4ym0u5 h4xx0r : 2007-01-17 21:18 ID:r5y65dVX

Simple #tma way to avoid binary-identical dupes:

  • hash all files using SHA-1 and name each file after its hash in base32 representation
  • store them in a "hashstore", which is just a simple structure two directories deep (first and second letter of the file name), e.g. file IJ3KLR7SUVJZ7JM342EJATDUM7BJZVGM will be stored in /whatever/hashstore/I/J/IJ3KLR7SUVJZ7JM342EJATDUM7BJZVGM
  • if you need to keep the file elsewhere (an image board's public directory, for example), use symlinks that refer to its position in the hashstore
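
For anyone who wants to try it, here is a rough Python sketch of the three steps above. It is not the poster's Perl/Ruby code, just one way it could look. The function names (hashstore_name, store, publish) and all paths are made up for illustration, and it assumes a filesystem where symlinks work.

```python
import base64
import hashlib
import os
import shutil
from pathlib import Path


def hashstore_name(path):
    """Return the SHA-1 of the file's contents as a base32 string (32 chars)."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return base64.b32encode(sha1.digest()).decode("ascii")


def store(path, hashstore):
    """Move `path` into the hashstore unless an identical file is already there.

    Returns (stored_path, is_new). A dupe is left where it was; the caller
    decides what to do with it.
    """
    name = hashstore_name(path)
    # Two directory levels: first letter, then second letter of the name.
    target = Path(hashstore) / name[0] / name[1] / name
    if target.exists():
        return target, False  # binary dupe: the hashstore is not touched
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target))
    return target, True


def publish(stored_path, public_path):
    """Create a symlink at `public_path` pointing at the file in the hashstore."""
    public_path = Path(public_path)
    public_path.parent.mkdir(parents=True, exist_ok=True)
    if not public_path.is_symlink() and not public_path.exists():
        os.symlink(Path(stored_path).resolve(), public_path)
```

Usage would be something like `stored, is_new = store("upload.jpg", "/whatever/hashstore")` followed by `publish(stored, "/some/board/src/upload.jpg")`. Several boards can call publish() against the same stored file.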

This is the way we did it for over a year and it works very well. If a new dupe arrives nothing happens: your script will just ignore it and won't move it into the hashstore. (The owner of "4chan archive" had over 500GB of data stored like this; now he uses a custom filesystem to avoid unnecessary overhead, but for small storage of only a few gigabytes it performs well on any filesystem.)

With some scripting you could code all your imageboards so they use the same hashstore; that way you would avoid storing binary dupes across all boards. I myself have all of this implemented in Perl and Ruby; other users used Python or shell scripts that call bitcollider and parse its output.

Now for non-binary dupes, alias nbid. Short answer: google "Schwartz finding similar images". Long answer: come to #tma @ freenode and feel free to discuss ;)
