Module pywander.nlp.tokenize

Classes

class ChineseSentenceTokenizer

A sentence tokenizer for Chinese text, built on RegexpTokenizer. It splits a string at the sentence terminators 。, ? and !, then recombines each sentence with its trailing terminator via combine_odd_even(), so every returned sentence keeps its final punctuation.

The docstring below is inherited from RegexpTokenizer, a tokenizer that splits a string using a regular expression which matches either the tokens or the separators between tokens:

>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

Parameters of the inherited constructor:

- pattern (str): The pattern used to build this tokenizer. This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:…), instead.
- gaps (bool): True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
- discard_empty (bool): True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
- flags (int): The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
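For reference, a short sketch of how the gaps parameter changes behaviour, assuming the base class follows the NLTK-style RegexpTokenizer API that this inherited docstring describes:

>>> from nltk.tokenize import RegexpTokenizer
>>> word_tok = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # pattern matches the tokens
>>> word_tok.tokenize("Good muffins cost $3.88 in New York.")
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
>>> gap_tok = RegexpTokenizer(r'\s+', gaps=True)  # pattern matches the separators
>>> gap_tok.tokenize("Good muffins cost $3.88 in New York.")
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']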

Source code
class ChineseSentenceTokenizer(RegexpTokenizer):
    def __init__(self):
        # Split on the sentence terminators 。, ? and !.  gaps=True makes
        # the pattern match separators between tokens, and the capturing
        # group keeps each terminator in the split result.
        RegexpTokenizer.__init__(self, r"(。|?|!)", gaps=True)

    def tokenize(self, text):
        # The split result alternates sentence, terminator, sentence, ...;
        # combine_odd_even() merges each adjacent pair, so every sentence
        # keeps its final punctuation.
        res = super(ChineseSentenceTokenizer, self).tokenize(text)
        return combine_odd_even(res)
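A minimal usage sketch; the import path follows the module name above, and the expected output assumes combine_odd_even() pairs each sentence with the terminator that follows it, as the source suggests:

>>> from pywander.nlp.tokenize import ChineseSentenceTokenizer
>>> tokenizer = ChineseSentenceTokenizer()
>>> tokenizer.tokenize("今天天气很好。我们去公园吧!好不好?")
['今天天气很好。', '我们去公园吧!', '好不好?']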

Ancestors

- RegexpTokenizer
- TokenizerI


class NewlineTokenizer

A tokenizer that splits a string into lines, one token per line. The interface is inherited from TokenizerI, a processing interface for tokenizing a string; subclasses must define tokenize() or tokenize_sents() (or both).

Source code
class NewlineTokenizer(TokenizerI):
    def tokenize(self, text):
        # One token per line; splitlines() recognizes \n, \r\n, etc. and
        # drops the line-break characters themselves.
        return text.splitlines()
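Usage sketch (the import path is assumed from the module name above):

>>> from pywander.nlp.tokenize import NewlineTokenizer
>>> NewlineTokenizer().tokenize("first line\nsecond\r\nthird")
['first line', 'second', 'third']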

Ancestors

- TokenizerI


class SimpleTokenizer

A tokenizer that splits a string on whitespace using str.split(). The interface is inherited from TokenizerI, a processing interface for tokenizing a string; subclasses must define tokenize() or tokenize_sents() (or both).

Source code
class SimpleTokenizer(TokenizerI):
    def tokenize(self, text):
        # Whitespace tokenization: str.split() with no arguments splits on
        # runs of whitespace and never returns empty tokens.
        return text.split()
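Usage sketch (import path assumed); leading, trailing, and repeated whitespace never produce empty tokens:

>>> from pywander.nlp.tokenize import SimpleTokenizer
>>> SimpleTokenizer().tokenize("  Good   muffins\tcost $3.88\n")
['Good', 'muffins', 'cost', '$3.88']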

Ancestors

- TokenizerI
