tokenize

Syntax

tokenize(text, parser, [full=false], [lowercase=true], [stem=false])

Arguments

text is a STRING scalar specifying the text to be tokenized.

parser is a STRING scalar specifying the tokenizer. It has no default value and must be specified explicitly. Options include:

  • none: the text is not tokenized
  • english: the text is tokenized on spaces and punctuation

lowercase specifies whether to convert words to lowercase (without affecting the original data). It only takes effect when parser is set to english. The default value is true, which suits case-insensitive scenarios.
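
As a rough illustration of what the none and english options and the lowercase flag describe, the following Python sketch splits text on spaces and punctuation and optionally lowercases the result. It is not the implementation behind tokenize: the helper name sketch_tokenize is hypothetical, returning the whole text as a single element for none is an assumption, and the splitting rule (any run of non-alphanumeric characters) is a simplification that ignores stem and full.

```python
import re


def sketch_tokenize(text, parser, lowercase=True):
    """Hypothetical stand-in for illustration only, not the real tokenize."""
    if parser == "none":
        # Assumption: "not tokenized" means the text is returned as one unit.
        return [text]
    if parser != "english":
        raise ValueError("this sketch only covers 'none' and 'english'")
    # Split on any run of characters that are not ASCII letters or digits,
    # then drop the empty strings produced at the edges.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    if lowercase:
        # Only the returned tokens are lowercased; the input is unchanged.
        tokens = [t.lower() for t in tokens]
    return tokens


print(sketch_tokenize("Hello, World! It works.", "english"))
# ['hello', 'world', 'it', 'works']
print(sketch_tokenize("Hello, World! It works.", "english", lowercase=False))
# ['Hello', 'World', 'It', 'works']
```

The real function may treat apostrophes, hyphens, and common words differently from this sketch.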

stem specifies whether to match English words by their stems. It only takes effect when parser=english and lowercase=true. The default value is false, meaning words are matched exactly.

Note: The full parameter is not applicable to English text.

Details

Tokenizes the input text according to the specified configuration.

Return value: A STRING vector containing the tokenization result.

Examples

text1 = "The sun was shining brightly as I walked down the street, enjoying the warmth of the summer day."
tokenize(text=text1, parser='english', lowercase=false, stem=true)
// output: ["The","sun","shine","bright","I","walk","down","street","enjoy","warmth","summer","day"]