Project

General

Profile

Actions

Patch #25215

closed

Re-use existing identical disk files for new attachments

Added by Jens Krämer about 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Category:
Attachments
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Uploaded files can already be associated with multiple attachment records through Attachment#copy. This patch adds an after_create hook to change the disk_filename and disk_directory attributes of a newly created attachment to point to an already existing, identical (filesize and digest) diskfile if one exists.

Database locks are used to guard against the deletion of the older attachment while the reference of the new attachment is changed.

Usefulness of this feature (i.e. how much space is saved) will vary a lot depending on external circumstances of course (one large scale setup we maintain at Planio saved around 15% / 60GB). Since the possiblity to have 1:n relationships of disk files to attachments already exists it just seems logical to make use of it for new attachments as well.


Files


Related issues

Related to Redmine - Feature #19289: Exclude attachments from incoming emails based on file content or file hashNew

Actions
Related to Redmine - Feature #23510: Reuse an exist attachmentClosed

Actions
Has duplicate Redmine - Feature #15257: Attachment deduplicationClosed

Actions
Blocked by Redmine - Patch #25240: Use SHA256 for attachment digest computationClosedJean-Philippe Lang

Actions
Blocked by Redmine - Patch #25590: prevent deadlocks in attachment deduplicationClosedJean-Philippe Lang

Actions
Actions #1

Updated by Jan from Planio www.plan.io about 7 years ago

  • Target version set to Candidate for next minor release
Actions #2

Updated by Go MAEDA about 7 years ago

Actions #3

Updated by Go MAEDA about 7 years ago

  • Related to Feature #19289: Exclude attachments from incoming emails based on file content or file hash added
Actions #4

Updated by Go MAEDA about 7 years ago

Actions #5

Updated by Jean-Philippe Lang about 7 years ago

  • Status changed from New to Needs feedback

IMO, MD5 is too weak for this purpose and this could lead to potential vulnerabilities. The first that comes to my mind: attacker generates a malicious file and a legitimate file with the same MD5, he first uploads the malicious file then send the legitimate one to a user X who will eventually upload it => people downloading the later from user X will actually get the malicious file.
We should implement this after upgrading the digest to a safer hash function.
Please let me know what you think.

Actions #6

Updated by Jens Krämer about 7 years ago

Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file sizes. Still not impossible I guess. Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.

Maybe adding a real byte by byte comparison for the case of matching filesize / md5 would be the better way? Uploads take a while any way so the added computation time might not weigh in too much.

Actions #7

Updated by Go MAEDA about 7 years ago

What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.

One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.

Actions #8

Updated by Jean-Philippe Lang about 7 years ago

Jens Krämer wrote:

Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file sizes. Still not impossible I guess.

MD5 collission with the same input size can be created in seconds with a standard computer. The file size comparaison does not make it safer.

Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.

Not just harder, but practically impossible. Unlike MD5, there's no known way to easily generate SHA256 collisions.

Maybe adding a real byte by byte comparison for the case of matching filesize / md5 would be the better way? Uploads take a while any way so the added computation time might not weigh in too much.

That would be the safer option indeed. It also guarantees that the original file is not broken/missing, which is a good thing IMO before discarding the uploaded file. Replacing the MD5 with a safer hash function does make sense anyway.

Actions #9

Updated by Jean-Philippe Lang about 7 years ago

Go MAEDA wrote:

What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.

SHA-1? No. Maybe SHA256 or SHA512 instead.

One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.

Yes, this should not be done during a db migration for this reason. We can use a new hash function for new files and provide a task that updates the existing hashes.

Actions #10

Updated by Jens Krämer about 7 years ago

I gave the upgrade to SHA256 a try in #25240. Independent of that I'll add the bytewise comparison to this feature here next.

Actions #11

Updated by Go MAEDA about 7 years ago

  • Blocked by Patch #25240: Use SHA256 for attachment digest computation added
Actions #12

Updated by Jens Krämer about 7 years ago

Additional patch adding a FileUtils.identical? check to compare actual file contents before removing the duplicate.

Actions #13

Updated by Jean-Philippe Lang about 7 years ago

  • Target version changed from Candidate for next minor release to 3.4.0
Actions #14

Updated by Hugo García about 7 years ago

OK.

Actions #15

Updated by Jean-Philippe Lang almost 7 years ago

  • Status changed from Needs feedback to Closed
  • Assignee set to Jean-Philippe Lang

Patches are committed, thanks. I think it's safer to delete the duplicate file after the change is committed to the DB (r16460).

Actions #16

Updated by Go MAEDA almost 7 years ago

  • Related to Patch #25590: prevent deadlocks in attachment deduplication added
Actions #17

Updated by Go MAEDA almost 7 years ago

  • Related to deleted (Patch #25590: prevent deadlocks in attachment deduplication)
Actions #18

Updated by Go MAEDA almost 7 years ago

  • Blocked by Patch #25590: prevent deadlocks in attachment deduplication added
Actions #19

Updated by Go MAEDA almost 7 years ago

  • Status changed from Closed to Reopened

A fix for this patch was submitted as #25590.

Actions #20

Updated by Jean-Philippe Lang almost 7 years ago

  • Status changed from Reopened to Closed

Fix committed.

Actions #21

Updated by Jean-Philippe Lang almost 7 years ago

  • Subject changed from re-use existing identical disk files for new attachments to Re-use existing identical disk files for new attachments
Actions

Also available in: Atom PDF