textUtils package

Classes and utilities to deal with offsets in variable-width encodings, particularly UTF-16.

class textUtils.OffsetConverter(text: str)

Bases: object

decoded: str
abstract property encodedStringLength: int

Returns the length of the string in its subclass-specific encoded representation.

property strLength: int

Returns the length of the string in its pythonic string representation.

abstract strToEncodedOffsets(strStart: int, strEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the str representation of the string the object is initialized with, and converts them to subclass-specific encoded string offsets.

@param strStart: The start offset in the str representation of the string.

@param strEnd: The end offset in the str representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{strLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if strEnd < strStart
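The offset checks described above could be sketched roughly as follows. This is only an illustration, not the package's code: the helper name bound_offsets and the order of the checks are assumptions.

```python
from typing import Optional, Tuple


def bound_offsets(
    start: int,
    end: Optional[int],
    length: int,
    raise_on_error: bool = False,
) -> Tuple[int, Optional[int]]:
    """Hypothetical helper: validate or clamp a (start, end) offset pair."""
    if end is not None and end < start:
        raise ValueError("end offset is lower than start offset")
    checked = []
    for offset in (start, end):
        if offset is None:
            # A missing end offset is passed through unchanged.
            checked.append(None)
        elif 0 <= offset <= length:
            checked.append(offset)
        elif raise_on_error:
            raise IndexError(f"offset {offset} is outside 0..{length}")
        else:
            # Bound the out-of-range offset to the range of the string.
            checked.append(max(0, min(offset, length)))
    return checked[0], checked[1]
```

With raise_on_error left at False, bound_offsets(-3, 99, 5) is clamped to (0, 5) instead of raising.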

abstract encodedToStrOffsets(encodedStart: int, encodedEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the subclass-specific encoded representation of the string the object is initialized with, and converts them to str offsets.

@param encodedStart: The start offset in the encoded representation of the string.

@param encodedEnd: The end offset in the encoded representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{encodedStringLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if encodedEnd < encodedStart

class textUtils.WideStringOffsetConverter(text: str)

Bases: OffsetConverter

Object that holds a string in both its decoded and its UTF-16 encoded form. The object allows for easy conversion between offsets in str type strings, and offsets in wide character (UTF-16) strings (that are aware of surrogate characters). This representation is used by all wide character strings in Windows (i.e. with characters of type L{ctypes.c_wchar}).

In Python 3 strings, every offset in a string corresponds with one Unicode code point. In UTF-16 encoded strings, code points above U+FFFF (such as emoji) are encoded as one high surrogate and one low surrogate. Therefore, they take not one, but two offsets in such a string. This behavior is equivalent to how Python 2 unicode strings behave on narrow builds (such as those for Windows), which are internally encoded as UTF-16.

For example: 😂 takes one offset in a Python 3 string. However, in a Python 2 string or UTF-16 encoded wide string, this character internally consists of two characters: \ud83d and \ude02.
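The difference can be checked with the standard library alone. The snippet below is a small illustration that does not use textUtils itself; the variable names are arbitrary.

```python
text = "😂"  # U+1F602, a code point above U+FFFF

# One code point, so one offset in a Python 3 str.
str_length = len(text)

# Encoded as UTF-16 (little endian, no BOM), each code unit takes two bytes.
encoded = text.encode("utf_16_le")
wide_length = len(encoded) // 2

# The two code units are the high and the low surrogate.
high = int.from_bytes(encoded[0:2], "little")
low = int.from_bytes(encoded[2:4], "little")

print(str_length, wide_length, hex(high), hex(low))  # 1 2 0xd83d 0xde02
```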

_encoding: str = 'utf_16_le'
_bytesPerIndex: int = 2
property encodedStringLength: int

Returns the length of the string in its wide character (UTF-16) representation.

strToEncodedOffsets(strStart: int, strEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the str representation of the string the object is initialized with, and converts them to wide character string offsets.

@param strStart: The start offset in the str representation of the string.

@param strEnd: The end offset in the str representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{strLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if strEnd < strStart
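One way to picture the conversion from str offsets to wide character offsets is to count the UTF-16 code units in the prefix before the offset. The helper below is a hypothetical sketch under that assumption, not the class's implementation.

```python
def str_to_wide_offset(text: str, str_offset: int) -> int:
    """Hypothetical helper: count the UTF-16 code units before str_offset."""
    prefix = text[:str_offset]
    # surrogatepass lets lone surrogates already present in the str be encoded.
    return len(prefix.encode("utf_16_le", errors="surrogatepass")) // 2


text = "a😂b"
# Wide offset for every str offset, including the end of the string.
offsets = [str_to_wide_offset(text, i) for i in range(len(text) + 1)]
print(offsets)  # [0, 1, 3, 4]
```

The jump from 1 to 3 reflects the two code units taken by the emoji.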

encodedToStrOffsets(encodedStart: int, encodedEnd: int, raiseOnError: bool = False) → Tuple[int, int]

This method takes two offsets from the wide character representation of the string the object is initialized with, and converts them to str offsets. encodedEnd is considered an exclusive offset. If either encodedStart or encodedEnd corresponds with an offset in the middle of a surrogate pair, the pair is still counted as one offset in the str representation. For example, when L{decoded} is “😂”, which is one offset in the str representation, this method returns (0, 1) in all of the following cases:

  • encodedStart=0, encodedEnd=1

  • encodedStart=0, encodedEnd=2

  • encodedStart=1, encodedEnd=2

However, encodedStart=1, encodedEnd=1 results in (0, 0).

@param encodedStart: The start offset in the wide character representation of the string.

@param encodedEnd: The end offset in the wide character representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{encodedStringLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if encodedEnd < encodedStart
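The surrogate pair cases listed above can be reproduced with a lookup table from code unit index to str index. This is a sketch under that assumption; the function names are hypothetical and range checking is omitted for brevity.

```python
from typing import List, Tuple


def build_wide_to_str_map(text: str) -> List[int]:
    """Map every UTF-16 code unit index to the str index of its character."""
    mapping: List[int] = []
    for str_index, ch in enumerate(text):
        # Code points above U+FFFF take two code units (a surrogate pair).
        units = 2 if ord(ch) > 0xFFFF else 1
        mapping.extend([str_index] * units)
    mapping.append(len(text))  # offset one past the end of the string
    return mapping


def wide_to_str_offsets(text: str, start: int, end: int) -> Tuple[int, int]:
    """Hypothetical helper: convert wide character offsets to str offsets."""
    if end < start:
        raise ValueError("end offset is lower than start offset")
    mapping = build_wide_to_str_map(text)
    str_start = mapping[start]
    if end == start:
        # An empty range stays empty, even in the middle of a pair.
        return str_start, str_start
    # The end is exclusive: take the character holding the last
    # included code unit and step one past it.
    str_end = mapping[end - 1] + 1
    return str_start, str_end
```

For the string “😂”, this returns (0, 1) for the ranges (0, 1), (0, 2) and (1, 2), and (0, 0) for (1, 1).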

property wideStringLength: int

Returns the length of the string in its wide character (UTF-16) representation.

strToWideOffsets(strStart: int, strEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the str representation of the string the object is initialized with, and converts them to wide character string offsets.

@param strStart: The start offset in the str representation of the string.

@param strEnd: The end offset in the str representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{strLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if strEnd < strStart

wideToStrOffsets(encodedStart: int, encodedEnd: int, raiseOnError: bool = False) → Tuple[int, int]

This method takes two offsets from the wide character representation of the string the object is initialized with, and converts them to str offsets. encodedEnd is considered an exclusive offset. If either encodedStart or encodedEnd corresponds with an offset in the middle of a surrogate pair, the pair is still counted as one offset in the str representation. For example, when L{decoded} is “😂”, which is one offset in the str representation, this method returns (0, 1) in all of the following cases:

  • encodedStart=0, encodedEnd=1

  • encodedStart=0, encodedEnd=2

  • encodedStart=1, encodedEnd=2

However, encodedStart=1, encodedEnd=1 results in (0, 0).

@param encodedStart: The start offset in the wide character representation of the string.

@param encodedEnd: The end offset in the wide character representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{encodedStringLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if encodedEnd < encodedStart

textUtils.getTextFromRawBytes(buf: bytes, numChars: int, encoding: str | None = None, errorsFallback: str = 'replace')

Gets a string from a raw bytes object, decoded using the specified L{encoding}. In most cases, the bytes object is fetched by passing the raw attribute of a ctypes.c_char array to this function. If L{encoding} is C{None}, the bytes object is inspected to determine whether it contains single-byte or multi-byte characters. As a first attempt, the bytes are decoded using the surrogatepass error handler. This handler behaves like strict for all encodings without surrogates, while making sure that surrogates are properly decoded when using UTF-16. If that fails, the exception is logged and the bytes are decoded according to the L{errorsFallback} error handler.
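The two-step decoding strategy described above might look roughly like this. It is a simplified sketch: the function name decode_with_fallback and the logger are assumptions, and the encoding detection for C{None} and the numChars handling are left out.

```python
import logging

log = logging.getLogger(__name__)


def decode_with_fallback(buf: bytes, encoding: str, errors_fallback: str = "replace") -> str:
    """Hypothetical helper: try surrogatepass first, then a fallback handler."""
    try:
        # For UTF-16 this also decodes unpaired surrogates;
        # for encodings without surrogates it acts like "strict".
        return buf.decode(encoding, errors="surrogatepass")
    except UnicodeDecodeError:
        log.debug("Decoding with surrogatepass failed", exc_info=True)
        return buf.decode(encoding, errors=errors_fallback)


# An unpaired high surrogate followed by "a" survives the first attempt.
raw = "\ud83da".encode("utf_16_le", errors="surrogatepass")
decoded = decode_with_fallback(raw, "utf_16_le")

# An invalid UTF-8 byte fails the first attempt and is replaced by U+FFFD.
replaced = decode_with_fallback(b"\xff", "utf-8")
```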

textUtils.isHighSurrogate(ch: str) → bool

Returns whether the given character is a high surrogate UTF-16 character.

textUtils.isLowSurrogate(ch: str) → bool

Returns whether the given character is a low surrogate UTF-16 character.
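Both checks come down to a range test on the code point: high surrogates lie in U+D800..U+DBFF and low surrogates in U+DC00..U+DFFF. The functions below are a hypothetical sketch of that test, not the package's own functions.

```python
def is_high_surrogate(ch: str) -> bool:
    """Hypothetical equivalent: True for code points U+D800..U+DBFF."""
    return "\ud800" <= ch <= "\udbff"


def is_low_surrogate(ch: str) -> bool:
    """Hypothetical equivalent: True for code points U+DC00..U+DFFF."""
    return "\udc00" <= ch <= "\udfff"


# The two code units that make up U+1F602 in UTF-16.
high_ok = is_high_surrogate("\ud83d")
low_ok = is_low_surrogate("\ude02")
```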

class textUtils.UTF8OffsetConverter(text: str)

Bases: OffsetConverter

Object that holds a string in both its decoded and its UTF-8 encoded form. The object allows for easy conversion between offsets in str type strings, and offsets in UTF-8 encoded strings.

A single character in UTF-8 encoding might take 1, 2, 3, or 4 bytes. Examples of applications using UTF-8 encoding are all Scintilla-based text editors, including Notepad++.

_encoding: str = 'utf-8'
property encodedStringLength: int

Returns the length of the string in its UTF-8 representation.

strToEncodedOffsets(strStart: int, strEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the str representation of the string the object is initialized with, and converts them to subclass-specific encoded string offsets.

@param strStart: The start offset in the str representation of the string.

@param strEnd: The end offset in the str representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{strLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if strEnd < strStart

encodedToStrOffsets(encodedStart: int, encodedEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the UTF-8 representation of the string the object is initialized with, and converts them to str offsets. This implementation ignores the raiseOnError argument and will always raise UnicodeDecodeError if the indices are invalid.
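A prefix-based sketch shows why an invalid byte offset leads to UnicodeDecodeError: a prefix that ends inside a multi-byte sequence cannot be decoded. The helper names are hypothetical and this is not necessarily how the class is implemented.

```python
def str_to_utf8_offset(text: str, str_offset: int) -> int:
    """Hypothetical helper: number of UTF-8 bytes before str_offset."""
    return len(text[:str_offset].encode("utf-8"))


def utf8_to_str_offset(text: str, byte_offset: int) -> int:
    """Hypothetical helper: number of characters before byte_offset.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte sequence.
    """
    encoded = text.encode("utf-8")
    return len(encoded[:byte_offset].decode("utf-8"))


text = "aé😂"  # 1 + 2 + 4 bytes in UTF-8
byte_offsets = [str_to_utf8_offset(text, i) for i in range(len(text) + 1)]
print(byte_offsets)  # [0, 1, 3, 7]
```

Here utf8_to_str_offset(text, 3) gives 2, while utf8_to_str_offset(text, 2) raises UnicodeDecodeError because byte 2 lies inside the encoding of “é”.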

class textUtils.IdentityOffsetConverter(text: str)

Bases: OffsetConverter

This is a dummy converter that assumes 1:1 correspondence between encoded and decoded characters.

property encodedStringLength: int

Returns the length of the string in its subclass-specific encoded representation.

strToEncodedOffsets(strStart: int, strEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the str representation of the string the object is initialized with, and converts them to subclass-specific encoded string offsets.

@param strStart: The start offset in the str representation of the string.

@param strEnd: The end offset in the str representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{strLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if strEnd < strStart

encodedToStrOffsets(encodedStart: int, encodedEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int, int]

This method takes two offsets from the subclass-specific encoded representation of the string the object is initialized with, and converts them to str offsets.

@param encodedStart: The start offset in the encoded representation of the string.

@param encodedEnd: The end offset in the encoded representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{encodedStringLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if encodedEnd < encodedStart

decoded: str
class textUtils.UnicodeNormalizationOffsetConverter(text: str, normalizationForm: str = 'NFKC')

Bases: OffsetConverter

Object that holds a string in both its decoded and its Unicode normalized form. The object allows for easy conversion between offsets in the original string, which may or may not be normalized, and offsets in its normalized form.

For example, when using the NFKC algorithm, the “ĳ” ligature (U+0133) normalizes to “ij”, which takes two characters instead of one.
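The length change can be seen directly with unicodedata. The offset helper below is a deliberately naive sketch: it normalizes one character at a time and so ignores composition and reordering across characters, which a real converter has to handle. Its name is an assumption.

```python
import unicodedata
from typing import List

original = "\u0133"  # LATIN SMALL LIGATURE IJ
normalized = unicodedata.normalize("NFKC", original)
print(len(original), len(normalized), normalized)  # 1 2 ij


def naive_str_to_normalized_offsets(text: str, form: str = "NFKC") -> List[int]:
    """Hypothetical helper: normalized offset of each original character."""
    offsets: List[int] = []
    position = 0
    for ch in text:
        offsets.append(position)
        # Advance by however many characters this one expands to.
        position += len(unicodedata.normalize(form, ch))
    return offsets


mapped = naive_str_to_normalized_offsets("a\u0133b")
print(mapped)  # [0, 1, 3]
```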

computedStrToEncodedOffsets: list[int]
computedEncodedToStrOffsets: list[int]
normalizationForm: str
_processReordered(a: str, b: str) → Generator[int, None, None]

Yields the offset in b of every character in a.

property encodedStringLength: int

Returns the length of the string in its normalized representation.

strToEncodedOffsets(strStart: int, strEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int]

This method takes two offsets from the str representation of the string the object is initialized with, and converts them to subclass-specific encoded string offsets.

@param strStart: The start offset in the str representation of the string.

@param strEnd: The end offset in the str representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{strLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if strEnd < strStart

encodedToStrOffsets(encodedStart: int, encodedEnd: int | None = None, raiseOnError: bool = False) → int | Tuple[int]

This method takes two offsets from the subclass-specific encoded representation of the string the object is initialized with, and converts them to str offsets.

@param encodedStart: The start offset in the encoded representation of the string.

@param encodedEnd: The end offset in the encoded representation of the string. This offset is exclusive.

@param raiseOnError: Raises an IndexError when one of the given offsets exceeds L{encodedStringLength} or is lower than zero. If C{False}, the out of range offset will be bounded to the range of the string.

@raise ValueError: if encodedEnd < encodedStart

textUtils.isUnicodeNormalized(text: str, normalizationForm: str = 'NFKC') → bool

Convenience function to wrap unicodedata.is_normalized with a default normalization form.

textUtils.unicodeNormalize(text: str, normalizationForm: str = 'NFKC') → str

Convenience function to wrap unicodedata.normalize with a default normalization form.
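Wrappers of this kind could be as small as the sketch below. The snake_case names and the DEFAULT_FORM constant are assumptions made for illustration. Note that unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

DEFAULT_FORM = "NFKC"  # assumed to mirror the documented default


def is_unicode_normalized(text: str, form: str = DEFAULT_FORM) -> bool:
    """Hypothetical wrapper around unicodedata.is_normalized."""
    return unicodedata.is_normalized(form, text)


def unicode_normalize(text: str, form: str = DEFAULT_FORM) -> str:
    """Hypothetical wrapper around unicodedata.normalize."""
    return unicodedata.normalize(form, text)


ligature = "\u0133"  # LATIN SMALL LIGATURE IJ
already_normalized = is_unicode_normalized(ligature)
result = unicode_normalize(ligature)
```

The ligature has only a compatibility decomposition, so it is changed by NFKC but left alone by NFC.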

textUtils.getOffsetConverter(encoding: str) → Type[OffsetConverter]

Submodules

textUtils.uniscribe module

Wrapper functions for NVDAHelper uniscribe functions.

textUtils.uniscribe.splitAtCharacterBoundaries(text: str) → Generator[str, None, None]

Splits a given string into real visible characters (or glyphs), thereby respecting character boundaries. Contrary to just iterating over a string, this respects surrogate pairs, decomposed characters, etc.
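To give a feel for the difference from plain iteration, here is a rough standard-library approximation that keeps combining marks attached to their base character. It does not call Uniscribe and is not equivalent to it (for instance, it does not handle emoji joined by zero-width joiners). The function name is hypothetical.

```python
import unicodedata
from typing import Iterator


def split_at_combining_boundaries(text: str) -> Iterator[str]:
    """Hypothetical approximation: group each base character with the
    combining marks that follow it."""
    cluster = ""
    for ch in text:
        # A character with combining class 0 starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


decomposed = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
parts = list(split_at_combining_boundaries(decomposed))
```

Plain iteration over decomposed yields three items, whereas this grouping yields two.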