- Reproducer: Non-ASCII characters are double-encoded as UTF-8 when a URL is passed to read_html(). Passing the same HTML as a string works correctly.
- Regression: Works in xml2 1.3.6, broken in 1.5.2. Likely introduced in 1.3.7 (switch to Rtools libxml2) or 1.3.8 (libxml2 update to 2.11.5).
- Environment: R 4.5.2, Windows, l10n_info()$`UTF-8` == TRUE, codepage 65001.
# Minimal reproducible example: xml2::read_html(url) double-encodes UTF-8 on Windows
# Environment: R 4.5.2, Windows, l10n_info()$`UTF-8` == TRUE, codepage 65001
# xml2 version: 1.5.2 (works correctly in 1.3.6)
library(xml2)
url <- "https://translate.google.com/m?tl=de&sl=en&q=apples"
# --- Method 1: read_html(url) - BROKEN ---
page1 <- read_html(url)
node1 <- xml_find_first(page1, "//div[@class='result-container']")
result1 <- xml_text(node1)
Encoding(result1)
#> [1] "UTF-8"
charToRaw(result1)
#> [1] c3 83 c2 84 70 66 65 6c
#> Expected: c3 84 70 66 65 6c ("Äpfel")
#> Actual: c3 83 c2 84 70 66 65 6c (double-encoded UTF-8)
result1
#> [1] "Ã\u0084pfel"
# --- Method 2: read_html(string) - WORKS ---
resp <- curl::curl_fetch_memory(url)
html <- rawToChar(resp$content)
Encoding(html) <- "UTF-8"
page2 <- read_html(html)
node2 <- xml_find_first(page2, "//div[@class='result-container']")
result2 <- xml_text(node2)
Encoding(result2)
#> [1] "UTF-8"
charToRaw(result2)
#> [1] c3 84 70 66 65 6c
result2
#> [1] "Äpfel"
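
To check that the bytes from Method 1 really are double-encoded UTF-8 and not some other corruption, the sketch below reproduces them without any network access. It rests on an assumption not confirmed by the output above: that somewhere in the URL code path the UTF-8 response is decoded as ISO-8859-1 and then re-encoded as UTF-8. The variable names (good, bad, fixed) are illustrative only.

```r
# Correct UTF-8 bytes for "Äpfel" (U+00C4 followed by "pfel").
good <- as.raw(c(0xc3, 0x84, 0x70, 0x66, 0x65, 0x6c))

# Simulate the suspected bug (assumption): treat the UTF-8 bytes as
# ISO-8859-1 and convert them to UTF-8 a second time.
bad <- iconv(rawToChar(good), from = "ISO-8859-1", to = "UTF-8")
charToRaw(bad)
#> [1] c3 83 c2 84 70 66 65 6c

# Reverse the extra encoding step to recover the original bytes.
fixed <- iconv(bad, from = "UTF-8", to = "ISO-8859-1")
identical(charToRaw(fixed), good)
#> [1] TRUE
```

The simulated bytes match charToRaw(result1) from Method 1, which is consistent with a Latin-1 misinterpretation of the response body. It does not by itself show where in xml2 or libxml2 that happens.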