Fast character set encoding detection using the International Components for Unicode (ICU) C++ library.
require 'open-uri'
require 'uchardet'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.
encoding = ICU::UCharsetDetector.detect(URI.open('http://google.jp').read)
encoding # => {:language=>"ja", :encoding=>"Shift_JIS", :confidence=>100}
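A common next step is converting the detected text to UTF-8. The sketch below assumes a result hash shaped like the one shown above. The helper name `to_utf8` and the `min_confidence` threshold are hypothetical and not part of the gem. The detection result is hard-coded so the snippet runs without ICU installed.

```ruby
# Hypothetical helper (not part of uchardet): re-encode raw bytes as UTF-8
# based on a detection hash such as
#   {:language=>"ja", :encoding=>"Shift_JIS", :confidence=>100}
# Returns nil when confidence is too low or Ruby does not know the encoding.
def to_utf8(raw, detection, min_confidence: 50)
  name = detection[:encoding]
  return nil if name.nil? || detection[:confidence].to_i < min_confidence

  begin
    source = Encoding.find(name)
  rescue ArgumentError
    # Assumption: some detector names may have no Ruby equivalent.
    return nil
  end

  # Label the bytes with the detected encoding, then transcode to UTF-8.
  raw.dup.force_encoding(source)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# With the gem installed, the hash would come from:
#   detection = ICU::UCharsetDetector.detect(raw)
detection = { language: 'ja', encoding: 'Shift_JIS', confidence: 100 }

# The bytes of a three-character Japanese string encoded as Shift_JIS.
raw = [0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA].pack('C*')

puts to_utf8(raw, detection)
```

Returning nil instead of raising keeps the decision about low-confidence input with the caller, which may prefer to fall back to a default encoding.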
From the command line:
$ uchardet
Usage: uchardet [options] file
-l, --list Display list of detectable character sets.
-s, --strip Strip HTML or XML markup before detection.
-e, --encoding Hint the charset detector about possible encoding.
-a, --all Show all matching encodings.
-h, --help Show this help message.
$ uchardet `which uchardet`
ISO-8859-1 (confidence 60%)
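When the command-line tool is called from a script, its output has to be parsed. The following is a minimal sketch that assumes each result line looks exactly like the sample above (`NAME (confidence NN%)`). The `parse_uchardet_line` helper and the `LINE_FORMAT` pattern are hypothetical, and the output format for `--all` is not covered here.

```ruby
# Assumed line format, based only on the sample output:
#   ISO-8859-1 (confidence 60%)
LINE_FORMAT = /\A(?<encoding>\S+) \(confidence (?<confidence>\d+)%\)\z/

# Hypothetical helper: turn one output line into a hash, or nil if the
# line does not match the assumed format.
def parse_uchardet_line(line)
  match = LINE_FORMAT.match(line.strip)
  return nil unless match

  { encoding: match[:encoding], confidence: match[:confidence].to_i }
end

# With the tool installed, the line could be captured like this:
#   line = IO.popen(['uchardet', path], &:read)
line = "ISO-8859-1 (confidence 60%)\n"

p parse_uchardet_line(line)
```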
Requires ICU (International Components for Unicode):
on Mac OS X:
sudo port install icu
on Debian/Ubuntu:
sudo apt-get install libicu-dev
Install the gem:
sudo gem install uchardet
Copyright © 2009 Dmitri Goutnik, released under the MIT license.