Defect #41464: CSV file encoding auto-detection may fail with multibyte characters - Redmine

Defect #41464

closed

CSV file encoding auto-detection may fail with multibyte characters

Added by Go MAEDA 9 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assignee: Go MAEDA
Category: Importers
Target version: 6.0.0
Start date:
Due date:
% Done:
Estimated time:
Resolution: Fixed
Affected version:

Description

When importing a CSV file, Redmine attempts to auto-detect the file's encoding (see #34718). However, this auto-detection may fail when the file contains multibyte characters.

This happens because the Redmine::CodesetUtil.guess_encoding method examines only the first 256 bytes of the file and cuts them off without regard to character boundaries. As a result, the sample may end with an incomplete multibyte character. For example, the Japanese Hiragana character "あ" is encoded in UTF-8 as three bytes, "\xE3\x81\x82", but only "\xE3\x81" may remain at the end of the sample. When this occurs, guess_encoding fails to identify the encoding, because the String#valid_encoding? check inside the method returns false for data that ends in a partial character.
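The failure mode can be reproduced in plain Ruby without Redmine. The following is a minimal sketch, not the actual guess_encoding code: head_bytes is a hypothetical helper that takes the first 256 bytes of a string with no regard to character boundaries, and the input is constructed so that the cut lands inside an "あ".

```ruby
# Hypothetical helper (not part of Redmine): take the first `limit` bytes
# of a string, ignoring character boundaries.
def head_bytes(text, limit = 256)
  text.byteslice(0, limit)
end

# "ab" followed by many "あ" (U+3042, three bytes each in UTF-8).
# 2 + 84 * 3 = 254, so bytes 255 and 256 are the first two bytes of the
# 85th "あ", and its third byte is cut off.
text = "ab" + "\u3042" * 100
sample = head_bytes(text)

p text.valid_encoding?                                # => true
p sample.bytes.last(2).map { |b| format("%02X", b) }  # => ["E3", "81"]
p sample.valid_encoding?                              # => false
```

The complete string is valid UTF-8, but its 256-byte prefix is not, which is the condition described above.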

The attached patch fixes this issue by truncating the sampled data at its last line break, which discards the final line that may contain such an incomplete multibyte character. This approach ensures that a partial character that would cause String#valid_encoding? to return false is excluded from the check.
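The idea of truncating at the last line break can be sketched as follows. This is an illustration only and is not taken from the attached patch: truncate_at_last_line_break is a hypothetical name, only "\n" is treated as a line break, and returning the data unchanged when no line break is found is an assumption of this sketch.

```ruby
# Hypothetical helper (not the patch itself): drop everything after the
# last "\n" so that a possibly incomplete final line is discarded.
def truncate_at_last_line_break(data)
  # Search on a binary copy so the lookup does not depend on the data
  # being validly encoded; in a binary string, indexes are byte offsets.
  idx = data.b.rindex("\n".b)
  # Assumption: with no line break in the data, return it unchanged.
  return data if idx.nil?

  data.byteslice(0, idx + 1)
end

# A 9-byte header line, then a data line that is cut mid-character
# when only the first 256 bytes are taken.
csv = "id,title\n1," + "\u3042" * 100 + "\n"
head = csv.byteslice(0, 256)
trimmed = truncate_at_last_line_break(head)

p head.valid_encoding?     # => false
p trimmed                  # => "id,title\n"
p trimmed.valid_encoding?  # => true
```

In this example the trimmed sample contains only the header line. A sample with no line break at all is returned as-is and can still end in a partial character.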

Files


fix-guess_encoding.patch (2.38 KB), Go MAEDA, 2024-10-10 09:28
0001-Fix-CSV-file-encoding-auto-detection-failure-with-mu.patch (7.34 KB), Go MAEDA, 2024-10-19 09:03

