#codepoint #utf8 #automaton #range

utf8-ranges

Convert ranges of Unicode codepoints to UTF-8 byte ranges

7 releases (3 stable)

1.0.2 Oct 29, 2018
1.0.1 Aug 25, 2018
1.0.0 Dec 30, 2016
0.1.3 Oct 17, 2015

#8 in Algorithms

297,453 downloads per month
Used in 4,367 crates (8 directly)

Unlicense/MIT

22KB
333 lines

utf8-ranges

This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte ranges. This is useful when constructing byte-based automata from Unicode. Stated differently, it lets you embed UTF-8 decoding directly in your automaton.

Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/utf8-ranges

Example

This shows how to convert a range of scalar values (e.g., the Basic Multilingual Plane) to a sequence of byte-based character classes.

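// `extern crate` is only required on the Rust 2015 edition.
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Each item is one sequence of byte ranges covering part of U+0000..=U+FFFF.
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}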

The output:

[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
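
The boundaries between these sequences can be checked with the standard library alone. The sketch below does not use this crate; `utf8_bytes` is a hypothetical helper introduced only for illustration. It encodes the scalar values at the edges of each range and confirms that the gap between `[ED][80-9F][80-BF]` and `[EE-EF][80-BF][80-BF]` corresponds to the surrogate block U+D800..U+DFFF.

```rust
/// Hypothetical helper: encode a scalar value as UTF-8 and return its bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // One-byte and two-byte boundaries: [0-7F] and [C2-DF][80-BF].
    assert_eq!(utf8_bytes('\u{7F}'), vec![0x7F]);
    assert_eq!(utf8_bytes('\u{80}'), vec![0xC2, 0x80]);
    assert_eq!(utf8_bytes('\u{7FF}'), vec![0xDF, 0xBF]);

    // The first three-byte scalar value starts [E0][A0-BF][80-BF].
    assert_eq!(utf8_bytes('\u{800}'), vec![0xE0, 0xA0, 0x80]);

    // Last scalar value before the surrogates: top of [ED][80-9F][80-BF].
    assert_eq!(utf8_bytes('\u{D7FF}'), vec![0xED, 0x9F, 0xBF]);
    // First scalar value after the surrogates: bottom of [EE-EF][80-BF][80-BF].
    assert_eq!(utf8_bytes('\u{E000}'), vec![0xEE, 0x80, 0x80]);

    // Surrogates are not scalar values, so `char` cannot represent them.
    assert!(std::char::from_u32(0xD800).is_none());
}
```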

These ranges can then be used to build an automaton, because they have two properties:

  1. Any arbitrary sequence of bytes matches exactly one of the sequences of ranges, or none of them.
  2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous encodings of surrogate codepoints in UTF-8 cannot match any of the byte ranges above.)
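
Both properties can be illustrated with a minimal sketch that does not depend on this crate. It assumes the six sequences from the output above, copied by hand into a table; `ByteRange`, `BMP_SEQUENCES`, `seq_matches`, `match_count`, and `matching_sequence` are hypothetical names introduced for this example and are not part of the crate's API. A real automaton would share prefixes between sequences rather than test each one in turn.

```rust
/// One inclusive byte range, e.g. (0x80, 0xBF) for [80-BF].
type ByteRange = (u8, u8);

/// The six sequences printed above for U+0000..=U+FFFF, copied by hand.
const BMP_SEQUENCES: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// True if `bytes` has the same length as `seq` and each byte falls in its range.
fn seq_matches(seq: &[ByteRange], bytes: &[u8]) -> bool {
    seq.len() == bytes.len()
        && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
}

/// Number of sequences that match `bytes`; property 1 says this is 0 or 1.
fn match_count(bytes: &[u8]) -> usize {
    BMP_SEQUENCES.iter().filter(|seq| seq_matches(seq, bytes)).count()
}

/// Index of the sequence that matches `bytes`, if any.
fn matching_sequence(bytes: &[u8]) -> Option<usize> {
    BMP_SEQUENCES.iter().position(|seq| seq_matches(seq, bytes))
}

fn main() {
    // Valid encodings match exactly one sequence.
    assert_eq!(matching_sequence(b"a"), Some(0));
    assert_eq!(matching_sequence("\u{E9}".as_bytes()), Some(1)); // C3 A9
    assert_eq!(matching_sequence("\u{20AC}".as_bytes()), Some(3)); // E2 82 AC
    assert_eq!(match_count("\u{20AC}".as_bytes()), 1);

    // An encoded surrogate (U+D800 as ED A0 80) matches nothing.
    assert_eq!(matching_sequence(&[0xED, 0xA0, 0x80]), None);
    // An overlong encoding (C0 AF) matches nothing.
    assert_eq!(matching_sequence(&[0xC0, 0xAF]), None);

    // Property 2: anything that matches is valid UTF-8.
    let matched = [0xEF, 0xBF, 0xBF]; // U+FFFF
    assert!(matching_sequence(&matched).is_some());
    assert!(std::str::from_utf8(&matched).is_ok());
}
```

The table is fixed to the Basic Multilingual Plane here. For other ranges, the sequences would come from iterating `Utf8Sequences` as in the example above.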

No runtime deps