Defect #41464
Closed
CSV file encoding auto-detection may fail with multibyte characters
0%
Description
When importing a CSV file, Redmine attempts to auto-detect the file's encoding (see #34718). However, this auto-detection may fail when the file contains multibyte characters.
This is because the Redmine::CodesetUtil.guess_encoding method checks the first 256 bytes of the file without considering character boundaries. As a result, an incomplete multibyte character may be left at the end of the sampled data. For example, the Japanese Hiragana character "あ" in UTF-8 encoding consists of three bytes, "\xE3\x81\x82", but only the first two, "\xE3\x81", may appear at the end. When this occurs, the guess_encoding method fails to identify the encoding, because String#valid_encoding? in the method returns false due to the partial character.
The attached patch fixes this issue by truncating the data at the last line break, discarding the final line that may contain such an incomplete multibyte character. This approach ensures that characters causing String#valid_encoding? to return false are excluded.
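The failure and the idea behind the fix can be reproduced in plain Ruby. The following is only a minimal sketch, not the code from the attached patch: the helper name truncate_at_last_line_break and the 12-byte sample size are assumptions chosen to keep the example short (Redmine samples 256 bytes).

```ruby
# Hypothetical helper (not Redmine's actual implementation): drop everything
# after the last line break so a partial trailing character is discarded.
def truncate_at_last_line_break(sample)
  bytes = sample.b                  # binary copy, so indexes are byte offsets
  index = bytes.rindex("\n")
  return sample if index.nil?       # no line break found: leave the data as-is

  bytes[0..index].force_encoding(sample.encoding)
end

csv = "id,name\n1,あ\n2,い\n"

# Cut the sample in the middle of "あ" ("\xE3\x81\x82"): the header takes
# 8 bytes, "1," takes 2, and only the first 2 of the 3 bytes of "あ" remain.
sample = csv.byteslice(0, 12)
sample.valid_encoding?              # => false

trimmed = truncate_at_last_line_break(sample)
trimmed                             # => "id,name\n"
trimmed.valid_encoding?             # => true
```

One consequence of this approach is that a sample containing no line break at all cannot be trimmed this way. The sketch simply returns such data unchanged, which may differ from how the patch handles that case.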
Files