Patch #25215
closed
Re-use existing identical disk files for new attachments
Added by Jens Krämer almost 8 years ago.
Updated over 7 years ago.
Description
Uploaded files can already be associated with multiple attachment records through Attachment#copy. This patch adds an after_create hook that changes the disk_filename and disk_directory attributes of a newly created attachment to point to an already existing, identical (same filesize and digest) disk file, if one exists.
Database locks are used to guard against the deletion of the older attachment while the reference of the new attachment is changed.
How much space this feature saves will of course vary a lot depending on external circumstances (one large-scale setup we maintain at Planio saved around 15% / 60GB). Since the possibility of 1:n relationships between disk files and attachments already exists, it seems only logical to make use of it for new attachments as well.
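The deduplication idea described above can be sketched in plain Ruby. This is a hypothetical illustration with made-up helper names, not the patch itself (the actual patch works on the Attachment model in an after_create hook, with database locks around the swap):

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Hypothetical sketch: given a freshly stored upload, look for an existing
# disk file with the same size and digest and, if the contents really match,
# return it so the new attachment record can point at it instead.
def find_identical(new_path, existing_paths)
  size   = File.size(new_path)
  digest = Digest::SHA256.file(new_path).hexdigest
  existing_paths.find do |path|
    File.size(path) == size &&
      Digest::SHA256.file(path).hexdigest == digest &&
      FileUtils.identical?(path, new_path) # byte-by-byte safety net
  end
end

Dir.mktmpdir do |dir|
  a = File.join(dir, "a.bin"); File.write(a, "same content")
  b = File.join(dir, "b.bin"); File.write(b, "same content")
  c = File.join(dir, "c.bin"); File.write(c, "different")
  # b duplicates a, so the new upload b can re-use a's disk file
  puts find_identical(b, [a, c]) == a # prints true
end
```

In the real patch the cheap checks (size, stored digest) run against the attachments table first, so the expensive byte-by-byte comparison only happens for candidate duplicates.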
- Target version set to Candidate for next minor release
- Related to Feature #19289: Exclude attachments from incoming emails based on file content or file hash added
- Status changed from New to Needs feedback
IMO, MD5 is too weak for this purpose and could lead to potential vulnerabilities. The first that comes to mind: an attacker generates a malicious file and a legitimate file with the same MD5, first uploads the malicious file, then sends the legitimate one to a user X who will eventually upload it => people downloading the latter from user X will actually get the malicious file.
We should implement this after upgrading the digest to a safer hash function.
Please let me know what you think.
Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file size. Still not impossible, I guess. Changing the hash function would make creating malicious collisions harder, but still not theoretically impossible.
Maybe adding a real byte-by-byte comparison for the case of matching filesize / MD5 would be the better way? Uploads take a while anyway, so the added computation time might not weigh in too much.
What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.
One problem is that migrating an existing Redmine instance may take a long time if it stores many large files.
Jens Krämer wrote:
Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file size. Still not impossible, I guess.
MD5 collisions with the same input size can be created in seconds on a standard computer. The file size comparison does not make it safer.
Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.
Not just harder, but practically impossible. Unlike MD5, there's no known way to easily generate SHA256 collisions.
Maybe adding a real byte-by-byte comparison for the case of matching filesize / MD5 would be the better way? Uploads take a while anyway, so the added computation time might not weigh in too much.
That would indeed be the safer option. It also guarantees that the original file is not broken/missing, which is a good thing IMO before discarding the uploaded file. Replacing MD5 with a safer hash function makes sense anyway.
Go MAEDA wrote:
What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.
SHA-1? No. Maybe SHA256 or SHA512 instead.
One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.
Yes, this should not be done during a db migration for this reason. We can use a new hash function for new files and provide a task that updates the existing hashes.
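For illustration, hashing new files with SHA256 can be done with Ruby's stdlib alone. This is a hedged sketch, not the code from #25240; it streams the file in chunks, which matters when a maintenance task later re-hashes many large existing attachments:

```ruby
require "digest"
require "tempfile"

# Stream the file in fixed-size chunks so large attachments never have
# to be loaded into memory in one piece.
def sha256_digest(path)
  sha = Digest::SHA256.new
  File.open(path, "rb") do |f|
    while (chunk = f.read(64 * 1024))
      sha.update(chunk)
    end
  end
  sha.hexdigest
end

Tempfile.create("demo") do |f|
  f.write("abc")
  f.flush
  # SHA256("abc") is the well-known test vector beginning with "ba7816bf..."
  puts sha256_digest(f.path)
end
```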
I gave the upgrade to SHA256 a try in #25240. Independently of that, I'll add the bytewise comparison to this feature here next.
- Blocked by Patch #25240: Use SHA256 for attachment digest computation added
Additional patch adding a FileUtils.identical? check to compare actual file contents before removing the duplicate.
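FileUtils.identical? is part of Ruby's stdlib (an alias of FileUtils.compare_file) and compares two files byte by byte. A minimal usage example:

```ruby
require "fileutils"
require "tmpdir"

# FileUtils.identical? reads both files and compares their contents
# byte by byte, independent of any hash.
Dir.mktmpdir do |dir|
  a = File.join(dir, "a.txt"); File.write(a, "payload")
  b = File.join(dir, "b.txt"); File.write(b, "payload")
  c = File.join(dir, "c.txt"); File.write(c, "PAYLOAD")
  puts FileUtils.identical?(a, b) # prints true: same bytes
  puts FileUtils.identical?(a, c) # prints false: contents differ
end
```

This is why it works as a final safety net here: even in the astronomically unlikely event of a size-and-digest collision, differing contents are still detected before the new attachment is pointed at the old disk file.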
- Target version changed from Candidate for next minor release to 3.4.0
- Status changed from Needs feedback to Closed
- Assignee set to Jean-Philippe Lang
Patches are committed, thanks. I think it's safer to delete the duplicate file after the change is committed to the DB (r16460).
- Related to Patch #25590: prevent deadlocks in attachment deduplication added
- Related to deleted (Patch #25590: prevent deadlocks in attachment deduplication)
- Blocked by Patch #25590: prevent deadlocks in attachment deduplication added
- Status changed from Closed to Reopened
A fix for this patch was submitted as #25590.
- Status changed from Reopened to Closed
- Subject changed from re-use existing identical disk files for new attachments to Re-use existing identical disk files for new attachments