tokenize
Syntax
tokenize(text, parser, [full=false], [lowercase=true], [stem=false])
Arguments
text is a STRING scalar specifying the text to be tokenized.
parser is a STRING scalar specifying the tokenizer. It has no default value and must be explicitly set. Options include:
- none: not tokenized
- english: tokenizes based on spaces and punctuation
lowercase specifies whether to convert words to lowercase (the original data is not modified). It only takes effect when parser is set to english. The default value is true, which suits case-insensitive scenarios.
stem specifies whether to match English words by their stem. It only takes effect when parser=english and lowercase=true. The default value is false, which means words are matched exactly.
Note: The full parameter does not apply to English text.
Details
Tokenize the input text according to the specified configurations.
Return value: A STRING vector containing the tokenization result.
Examples
text1 = "The sun was shining brightly as I walked down the street, enjoying the warmth of the summer day."
tokenize(text=text1, parser='english', lowercase=false, stem=true)
// output:["The","sun","shine","bright","I","walk","down","street","enjoy","warmth","summer","day"]
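The remaining settings can be tried the same way. The calls below are a minimal sketch that uses only the parameters listed under Syntax. The sample sentence text2 is a hypothetical input chosen for illustration, and no output is shown because the exact result depends on the tokenizer.

```
// text2 is a hypothetical sample sentence, not taken from the reference
text2 = "Dogs were barking loudly in the Garden."

// Defaults: lowercase=true, stem=false
tokenize(text=text2, parser='english')

// Stemming, with lowercase left at its default of true
tokenize(text=text2, parser='english', stem=true)

// parser='none': the text is not tokenized
tokenize(text=text2, parser='none')
```

Comparing the first two results should show the effect of stem, since every other setting is unchanged.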