Reach level 1 conformance with UTS #18 #12

Open
2 of 8 tasks
pschiffmann opened this issue Jan 19, 2019 · 0 comments

pschiffmann commented Jan 19, 2019

http://unicode.org/reports/tr18/#Basic_Unicode_Support

  • RL1.1 Hex Notation
    To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.
  • RL1.2 Properties
    To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following: General_Category, Script and Script_Extensions, Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, ANY, ASCII, and ASSIGNED.
  • RL1.2a Compatibility Properties
    To meet this requirement, an implementation shall provide the properties listed in Annex C: Compatibility Properties, with the property values as listed there. Such an implementation shall document whether it is using the Standard Recommendation or POSIX-compatible properties.
  • RL1.3 Subtraction and Intersection
    To meet this requirement, an implementation shall supply mechanisms for union, intersection and set-difference of sets of characters within regular expression character class expressions.
  • RL1.4 Simple Word Boundaries
    To meet this requirement, an implementation shall extend the word boundary mechanism so that:
      1. The class of <word_character> includes all the Alphabetic values from the Unicode character database, from UnicodeData.txt, plus the decimals (General_Category=Decimal_Number, or equivalently Numeric_Type=Decimal), and the U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER (Join_Control=True). See also Annex C: Compatibility Properties.
      2. Nonspacing marks are never divided from their base characters, and otherwise ignored in locating boundaries.
  • RL1.5 Simple Loose Matches
    To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not.
    To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding.
  • RL1.6 Line Boundaries
    To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).
  • RL1.7 Supplementary Code Points
    To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
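
To make a few of these requirements concrete, here is a minimal sketch in Python using only the standard `re` and `unicodedata` modules. Python is used purely for illustration; it is not this project's language or API. The helper names (`in_general_category`, `loosely_equal`, `split_lines`, `count_code_points`) are hypothetical. Where `re` lacks a feature (property escapes, set operations, extended line boundaries), the sketch emulates it, and the emulation is a workaround rather than a conformant implementation.

```python
import re
import unicodedata

# RL1.1 Hex Notation: `re` accepts \uXXXX and \UXXXXXXXX escapes in patterns,
# so any code point from U+0000 to U+10FFFF can be written in hexadecimal.
E_ACUTE = re.compile(r"\u00E9")            # U+00E9 LATIN SMALL LETTER E WITH ACUTE
GRINNING_FACE = re.compile(r"\U0001F600")  # U+1F600, outside the BMP


# RL1.2 Properties: `re` has no \p{...} syntax, so this hypothetical helper
# looks up General_Category through `unicodedata` instead.
def in_general_category(ch: str, prefix: str) -> bool:
    """Return True if the General_Category of ch starts with prefix ("L", "Nd", ...)."""
    return unicodedata.category(ch).startswith(prefix)


# RL1.3 Subtraction and Intersection: `re` has no set-operation syntax inside
# character classes; a lookahead in front of a class emulates it.
CONSONANT = re.compile(r"(?![aeiou])[a-z]")  # [a-z] minus [aeiou]
G_TO_M = re.compile(r"(?=[a-m])[g-z]")       # [a-m] intersected with [g-z]


# RL1.5 Simple Loose Matches: re.IGNORECASE compares single characters, so
# one-to-many foldings (such as U+00DF to "ss") are not applied.
def loosely_equal(literal: str, candidate: str) -> bool:
    """Case-insensitive whole-string comparison of two literal strings."""
    return re.fullmatch(re.escape(literal), candidate, re.IGNORECASE) is not None


# RL1.6 Line Boundaries: recognise CRLF, LF, CR, NEL, LINE SEPARATOR and
# PARAGRAPH SEPARATOR. CRLF is listed first so that it counts as one break.
LINE_BREAK = re.compile(r"\r\n|[\n\r\u0085\u2028\u2029]")


def split_lines(text: str) -> list:
    """Split text at every line boundary listed in RL1.6."""
    return LINE_BREAK.split(text)


# RL1.7 Supplementary Code Points: Python strings are sequences of code
# points, so `.` consumes a supplementary character as a single unit.
def count_code_points(text: str) -> int:
    """Count how many times `.` matches, i.e. the number of code points."""
    return len(re.findall(r".", text, re.DOTALL))
```

For example, `CONSONANT.findall("regex")` keeps only the non-vowels, and `split_lines("a\r\nb\u2028c")` treats both CRLF and U+2028 as a single line break each. An engine that conforms to level 1 would expose these capabilities through its own pattern syntax rather than through helper functions like these.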
@pschiffmann pschiffmann pinned this issue Feb 17, 2019