API reference

class vibrato.Vibrato(dict_data, /, ignore_space=False, max_grouping_len=0)

Python binding of the Vibrato tokenizer.

Examples:

>>> import vibrato
>>> with open('path/to/system.dic', 'rb') as fp:
...     dict_data = fp.read()
>>> tokenizer = vibrato.Vibrato(dict_data)
>>> tokens = tokenizer.tokenize('社長は火星猫だ')
>>> len(tokens)
5
>>> list(tokens)
[Token { surface: "社長", feature: "名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,," },
 Token { surface: "は", feature: "助詞,係助詞,*,*,*,*,は,ハ,ワ,," },
 Token { surface: "火星", feature: "名詞,一般,*,*,*,*,火星,カセイ,カセイ,," },
 Token { surface: "猫", feature: "名詞,一般,*,*,*,*,猫,ネコ,ネコ,," },
 Token { surface: "だ", feature: "助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ,," }]
>>> tokens[0].surface()
'社長'
>>> tokens[0].feature()
'名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,,'
>>> tokens[0].start()
0
>>> tokens[0].end()
2

Parameters:

dict_data (bytes) – The dictionary content as a byte sequence.
ignore_space (bool) – If True, spaces are excluded from the output tokens. This option exists for compatibility with MeCab; enable it to obtain the same results as MeCab.
max_grouping_len (int) – The maximum grouping length for unknown words. The default, 0, places no limit on the length. This option exists for compatibility with MeCab; set it to 24 to obtain the same results as MeCab.

Return type:

vibrato.Vibrato
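
The two compatibility options described above can be passed together when MeCab-equivalent output is wanted. The following is a minimal sketch, not part of the library: `MECAB_COMPAT_OPTIONS` and `load_mecab_compatible` are hypothetical names introduced here, and the dictionary path is supplied by the caller.

```python
# Keyword arguments that the parameter descriptions above associate with
# MeCab-compatible output. The constant name is hypothetical.
MECAB_COMPAT_OPTIONS = {"ignore_space": True, "max_grouping_len": 24}


def load_mecab_compatible(dict_path):
    """Build a tokenizer configured for MeCab-compatible results.

    Assumes `dict_path` points at a dictionary file whose raw bytes can be
    passed directly to vibrato.Vibrato.
    """
    import vibrato  # imported lazily so this module loads without vibrato

    with open(dict_path, "rb") as fp:
        dict_data = fp.read()
    return vibrato.Vibrato(dict_data, **MECAB_COMPAT_OPTIONS)
```

Keeping the options in one dictionary makes it easy to reuse the same settings with `from_textdict`, which accepts the same two keyword arguments.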

static from_textdict(lex_data, matrix_data, char_data, unk_data, /, ignore_space=False, max_grouping_len=0)

Create a tokenizer from a text-format dictionary.

Parameters:

lex_data (str) – The content of lex.csv.
matrix_data (str) – The content of matrix.def.
char_data (str) – The content of char.def.
unk_data (str) – The content of unk.def.
ignore_space (bool) – If True, spaces are excluded from the output tokens. This option exists for compatibility with MeCab; enable it to obtain the same results as MeCab.
max_grouping_len (int) – The maximum grouping length for unknown words. The default, 0, places no limit on the length. This option exists for compatibility with MeCab; set it to 24 to obtain the same results as MeCab.

Return type:

vibrato.Vibrato
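
Because `from_textdict` takes file contents rather than paths, the four files have to be read first. A minimal sketch under stated assumptions: the files sit in one directory under the names listed above and are UTF-8 encoded (the encoding is an assumption), and `read_textdict` and `build_from_textdict` are hypothetical helper names.

```python
from pathlib import Path

# File names taken from the parameter descriptions above, in argument order.
TEXTDICT_FILES = ("lex.csv", "matrix.def", "char.def", "unk.def")


def read_textdict(dict_dir, encoding="utf-8"):
    """Return the four file contents in the order from_textdict expects.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    base = Path(dict_dir)
    return tuple(
        (base / name).read_text(encoding=encoding) for name in TEXTDICT_FILES
    )


def build_from_textdict(dict_dir, **options):
    """Create a tokenizer from a directory holding the four text files.

    `options` is forwarded as keyword arguments, for example
    ignore_space=True or max_grouping_len=24.
    """
    import vibrato  # imported lazily so read_textdict works without vibrato

    return vibrato.Vibrato.from_textdict(*read_textdict(dict_dir), **options)
```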

tokenize(text, /)

Tokenize the given text and return the result as a list of tokens.

Parameters: text (str) – The text to tokenize.
Return type: vibrato.TokenList

tokenize_to_surfaces(text, /)

Tokenize the given text and return the result as a list of surface strings.

Parameters: text (str) – The text to tokenize.
Return type: list[str]

class vibrato.TokenList

A list of Token objects returned by the tokenizer.

__getitem__(key, /): Return self[key].

__iter__(): Implement iter(self).

__len__(): Return len(self).

class vibrato.TokenIterator

An iterator that yields Token objects.

__next__(): Implement next(self).

class vibrato.Token

Representation of a token.

end()

Return the end position (exclusive) in characters.

Return type: int

feature()

Return the feature of this token.

Return type: str
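
In the class example above, the feature is a single comma-separated string. A minimal sketch for splitting it into fields follows. `split_feature` is a hypothetical helper, the number and meaning of the fields depend on the dictionary in use, and using the `csv` module to handle fields that might be quoted is an assumption about the format.

```python
import csv


def split_feature(feature):
    """Split a feature string into its comma-separated fields.

    csv.reader is used instead of str.split so that a quoted field
    containing a comma would stay in one piece (an assumption about
    how such fields are written).
    """
    return next(csv.reader([feature]))


# Feature string copied from the example at the top of this reference.
fields = split_feature("名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,,")
print(fields[0])    # first field
print(len(fields))  # trailing empty fields are preserved
```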

start()

Return the start position (inclusive) in characters.

Return type: int

surface()

Return the surface of this token.

Return type: str
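
Since `start()` and `end()` are character positions with an exclusive end, slicing the input text with them should give back the surface. The sketch below checks this. It assumes the positions line up with Python `str` indices, as the `0` and `2` for '社長' in the class example suggest. `token_spans`, `spans_match_text`, and `_SpanToken` are hypothetical names, and the stand-in tokens take their positions from surface lengths rather than from a real tokenizer.

```python
def token_spans(tokens):
    """Return (surface, start, end) triples for an iterable of tokens."""
    return [(t.surface(), t.start(), t.end()) for t in tokens]


def spans_match_text(text, tokens):
    """Check that text[start:end] reproduces each token's surface."""
    return all(text[t.start():t.end()] == t.surface() for t in tokens)


# Stand-in for vibrato.Token so the sketch runs without a dictionary.
class _SpanToken:
    def __init__(self, surface, start):
        self._surface, self._start = surface, start

    def surface(self):
        return self._surface

    def start(self):
        return self._start

    def end(self):
        return self._start + len(self._surface)


text = "社長は火星猫だ"
tokens, pos = [], 0
for s in ["社長", "は", "火星", "猫", "だ"]:  # surfaces from the class example
    tokens.append(_SpanToken(s, pos))
    pos += len(s)

assert spans_match_text(text, tokens)
```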

VIBRATO_VERSION (str) – The version number of the vibrato library used by this wrapper. It can be used to check compatibility with a model file.