Defect #41464

CSV file encoding auto-detection may fail with multibyte characters

Added by Go MAEDA about 2 months ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee:
Category: Importers
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Resolution: Fixed
Affected version:

Description

When importing a CSV file, Redmine attempts to auto-detect the file's encoding (see #34718). However, this auto-detection may fail when the file contains multibyte characters.

This is because the Redmine::CodesetUtil.guess_encoding method checks only the first 256 bytes of the file, and those bytes are read without regard to character boundaries. As a result, the data may end with an incomplete multibyte character. For example, the Japanese Hiragana character "あ" consists of three bytes in UTF-8, "\xE3\x81\x82", but only "\xE3\x81" may remain at the end of the data. When this occurs, guess_encoding fails to identify the encoding because String#valid_encoding?, which the method calls, returns false for a string that ends in a partial character.
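The failure can be reproduced in plain Ruby without Redmine. The following is a minimal sketch that assumes a file consisting only of the three-byte character "あ"; the variable names are illustrative and the 256-byte cut is simulated with String#byteslice rather than an actual file read.

```ruby
# Simulate reading the first 256 bytes of a UTF-8 file that contains
# only three-byte characters.
text = "あ" * 100                    # 300 bytes in UTF-8
head = text.byteslice(0, 256)        # cut without regard to character boundaries

# 256 = 85 * 3 + 1, so one stray lead byte ("\xE3") is left at the end.
puts head.valid_encoding?                    # false
puts head.byteslice(0, 255).valid_encoding?  # true (85 complete characters)
```

Whether a given file triggers the problem depends on where the 256th byte happens to fall, which is why the failure looks intermittent.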

The attached patch fixes this issue by truncating the data at the last line break, which discards the final line that may contain such an incomplete multibyte character. This ensures that a partial character at the end of the data cannot cause String#valid_encoding? to return false.
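The truncation idea can be sketched as follows. This is not the code from the attached patch; drop_partial_last_line is a hypothetical helper, and the sketch assumes lines end with "\n" (which also covers "\r\n").

```ruby
# Hypothetical helper illustrating the approach; not the committed code.
# Keeps everything up to and including the last "\n" and drops the rest.
def drop_partial_last_line(head)
  # Search a binary copy so that the returned index is a byte offset.
  last_break = head.b.rindex("\n")
  return head if last_break.nil?     # no line break found: leave data as-is
  head.byteslice(0, last_break + 1)
end

text = "あい\n" * 40                 # 7 bytes per line, 280 bytes in total
head = text.byteslice(0, 256)        # 36 full lines plus 4 leftover bytes
trimmed = drop_partial_last_line(head)

puts head.valid_encoding?            # false: the leftover ends mid-character
puts trimmed.valid_encoding?         # true
puts trimmed.bytesize                # 252
```

Cutting at a line break rather than at the last valid character keeps the approach independent of the encoding being guessed, at the cost of discarding up to one line of sample data.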


Actions #1

Updated by Go MAEDA about 1 month ago

I have updated the patch.

This version moves the logic that discards the last line from lib/redmine/codeset_util.rb to app/models/import.rb, because I think data cleanup should be done within the method that reads the data from the file.

Actions #2

Updated by Go MAEDA about 1 month ago

  • Status changed from New to Closed
  • Assignee set to Go MAEDA
  • Target version changed from Candidate for next major release to 6.0.0
  • Resolution set to Fixed

Committed the fix in r23150.
