[Feature Request]: BOCU-1 for unicode multibyte chars text compression #2551

@HDDen

Hello! How about adding BOCU-1 compression for multilingual Unicode messages?
BOCU-1 is described in Unicode Technical Note #6: https://www.unicode.org/notes/tn6/

Here are a few examples:

Это было моё первое принятое сообщение... 👩👨🧑👧🧒👨‍🦱👸🤶👮‍♀️🕵️‍♀️
input: 143 UTF-8 bytes
output: 90 BOCU-1 bytes
BOCU-1/UTF-8: 0.629371

Заменил антенну на 5-ягу. Если кто через нее ходит, дайте обратную связь
input: 129 UTF-8 bytes
output: 79 BOCU-1 bytes
BOCU-1/UTF-8: 0.612403

Доброго утра всем! 17,5 ° C и солнце 📡 )))
input: 68 UTF-8 bytes
output: 51 BOCU-1 bytes
BOCU-1/UTF-8: 0.750000

Погодка КАЙФ
input: 23 UTF-8 bytes
output: 13 BOCU-1 bytes
BOCU-1/UTF-8: 0.565217

Первый рабочий день после длинных выходных
input: 79 UTF-8 bytes
output: 43 BOCU-1 bytes
BOCU-1/UTF-8: 0.544304
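
For anyone who wants to check these sizes without the Windows binary, here is a rough Python sketch of a BOCU-1 encoder. It is an illustration only, not code from this project: the names (`bocu1_encode`, `pack_diff`, `next_prev`) are made up for this example, the constants come from my reading of the reference code in the technical note and should be verified against it, and it writes no signature and no reset byte.

```python
# Rough BOCU-1 encoder sketch for size comparisons (assumption: constants
# match the UTN #6 reference code; verify against the note before relying on it).
ASCII_PREV = 0x40          # initial "previous" value
MIDDLE = 0x90              # lead byte meaning "difference 0"
TRAIL_COUNT = 243          # number of usable trail-byte values
TRAIL_CONTROLS = 20        # trail values mapped onto otherwise unused C0 bytes
TRAIL_OFFSET = 13          # 0x21 - TRAIL_CONTROLS
REACH_POS_1, REACH_NEG_1 = 63, -64
REACH_POS_2, REACH_NEG_2 = 10512, -10513
REACH_POS_3, REACH_NEG_3 = 187659, -187660
START_POS_2, START_POS_3, START_POS_4 = 0xD0, 0xFB, 0xFE
START_NEG_2, START_NEG_3, START_NEG_4 = 0x50, 0x25, 0x22
TRAIL_TO_BYTE = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x10, 0x11, 0x12, 0x13,
                 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1C, 0x1D, 0x1E, 0x1F]


def next_prev(c):
    """State after encoding code point c: usually the middle of its 128-block."""
    if 0x3040 <= c <= 0x309F:          # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:          # CJK Unified Ideographs
        return 0x4E00 - REACH_NEG_2
    if 0xAC00 <= c <= 0xD7A3:          # Hangul syllables
        return (0xD7A3 + 0xAC00) // 2
    return (c & ~0x7F) + ASCII_PREV


def pack_diff(diff):
    """Encode a signed code point difference as 1 to 4 bytes."""
    if REACH_NEG_1 <= diff <= REACH_POS_1:
        return bytes([MIDDLE + diff])
    if diff > 0:
        if diff <= REACH_POS_2:
            diff, lead, count = diff - (REACH_POS_1 + 1), START_POS_2, 1
        elif diff <= REACH_POS_3:
            diff, lead, count = diff - (REACH_POS_2 + 1), START_POS_3, 2
        else:
            diff, lead, count = diff - (REACH_POS_3 + 1), START_POS_4, 3
    else:
        if diff >= REACH_NEG_2:
            diff, lead, count = diff - REACH_NEG_1, START_NEG_2, 1
        elif diff >= REACH_NEG_3:
            diff, lead, count = diff - REACH_NEG_2, START_NEG_3, 2
        else:
            diff, lead, count = diff - REACH_NEG_3, START_NEG_4, 3
    trail = []
    for _ in range(count):
        # divmod floors, which is what the negative branch needs
        diff, m = divmod(diff, TRAIL_COUNT)
        trail.append(m + TRAIL_OFFSET if m >= TRAIL_CONTROLS else TRAIL_TO_BYTE[m])
    return bytes([lead + diff] + trail[::-1])


def bocu1_encode(text):
    """Encode a str to BOCU-1 bytes (no signature, no reset byte)."""
    out = bytearray()
    prev = ASCII_PREV
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            # Controls and space pass through; only space keeps the state.
            if c != 0x20:
                prev = ASCII_PREV
            out.append(c)
        else:
            out += pack_diff(c - prev)
            prev = next_prev(c)
    return bytes(out)


for msg in ["Погодка КАЙФ", "Первый рабочий день после длинных выходных"]:
    utf8, bocu = len(msg.encode("utf-8")), len(bocu1_encode(msg))
    print(f"{utf8} UTF-8 bytes, {bocu} BOCU-1 bytes, ratio {bocu / utf8:.6f}")
```

If the constants are right, the two plain Cyrillic messages should come out at the sizes listed above (23 to 13 bytes and 79 to 43 bytes). The saving comes from each Cyrillic letter costing one byte once the state sits in the middle of the Cyrillic block, while a space stays one byte and does not reset the state.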

You can download and test the Win32 console implementation here: https://www.unicode.org/notes/tn6/bocu1.exe

Alternatively, there is the UCF encoding, which also reduces the size overhead that UTF-8 adds for characters outside the a-z range: https://github.com/hyoo-ru/mam_mol/tree/master/charset/ucf
