Module pywander.nlp.tokenize
Classes
class ChineseSentenceTokenizer
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

    >>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

:type pattern: str
:param pattern: The pattern used to build this tokenizer. (This pattern
    must not contain capturing parentheses; use non-capturing
    parentheses, e.g. (?:...), instead.)
:type gaps: bool
:param gaps: True if this tokenizer's pattern should be used to find
    separators between tokens; False if this tokenizer's pattern should
    be used to find the tokens themselves.
:type discard_empty: bool
:param discard_empty: True if any empty tokens '' generated by the
    tokenizer should be discarded. Empty tokens can only be generated
    if gaps == True.
:type flags: int
:param flags: The regexp flags used to compile this tokenizer's
    pattern. By default, the following flags are used:
    re.UNICODE | re.MULTILINE | re.DOTALL.
class ChineseSentenceTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r"(。|?|!)", gaps=True)

    def tokenize(self, text):
        res = super(ChineseSentenceTokenizer, self).tokenize(text)
        return combine_odd_even(res)
Ancestors
- RegexpTokenizer
- TokenizerI
- abc.ABC
Inherited members
class NewlineTokenizer
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
class NewlineTokenizer(TokenizerI):
    def tokenize(self, text):
        return text.splitlines()
Ancestors
- TokenizerI
- abc.ABC
Inherited members
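NewlineTokenizer is a thin wrapper around str.splitlines, which handles \n, \r\n, and \r uniformly and drops the line terminators. A quick sketch of the equivalent call (the function name here is illustrative, not part of the module):

```python
def tokenize_lines(text):
    # str.splitlines splits on any line boundary and discards terminators.
    return text.splitlines()

print(tokenize_lines("first\nsecond\r\nthird"))
# → ['first', 'second', 'third']
```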
class SimpleTokenizer
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
class SimpleTokenizer(TokenizerI):
    def tokenize(self, text):
        return text.split()
Ancestors
- TokenizerI
- abc.ABC
Inherited members
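SimpleTokenizer delegates to str.split with no argument, which splits on runs of any whitespace (spaces, tabs, newlines) and never produces empty tokens. A short sketch of that behavior (function name illustrative):

```python
def tokenize_simple(text):
    # str.split() with no separator collapses consecutive whitespace and
    # strips leading/trailing whitespace, so no empty tokens appear.
    return text.split()

print(tokenize_simple("  hello\tworld\nfoo  "))
# → ['hello', 'world', 'foo']
```

This differs from NewlineTokenizer in that a multi-word line yields several tokens, and blank lines contribute nothing.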