matchFuzzy

Syntax

matchFuzzy(textCol, term, minimumSimilarity, prefixLength, [scoreColName])

Arguments

textCol The column to be searched, i.e., the column with text indexing set in the PKEY engine.

term A STRING scalar specifying the term to search for. Only single-word searches are supported.

minimumSimilarity A DOUBLE scalar in the range [0,1] representing the minimum similarity a word must have to term to be returned as a search result.

prefixLength A non-negative integer specifying how many leading characters of a matching word must be identical to the leading characters of term.

scoreColName (optional) A STRING scalar representing the name of the text search score column in the output. The default value is null, in which case the score column is not output. The search score represents the degree of match within the partition, and scores from different partitions are not comparable.

Details

This function is used in the where clause of a SQL statement to perform word-based fuzzy searches on a column with text indexing set in the PKEY engine. The search results remain highly relevant even if term contains spelling errors or the words in the output text do not match it exactly.

  • When minimumSimilarity is set to 1, words in the output text must match the searched term exactly, which is equivalent to the matchAny function.
  • Only single-word searches are supported. Null values are returned when searching for multiple terms.
  • When prefixLength is larger than the length of term, it is automatically adjusted to the length of term.

Examples

// Generate data for queries
stringColumn = ["There are some apples and oranges.","Mike likes apples.","Alice likes oranges.","Mike gives Alice an apple.","Alice gives Mike an orange.","John likes peaches, so he does not give them to anyone.","Mike, can you give me some apples?","Alice, can you give me some oranges?","Alice made apple pie."]
t = table([1,1,1,2,2,2,3,3,3] as id1, [1,2,3,1,2,3,1,2,3] as id2, stringColumn as remark) 
if(existsDatabase("dfs://textDB")) dropDatabase("dfs://textDB")
db = database(directory="dfs://textDB", partitionType=VALUE, partitionScheme=[1,2,3], engine="PKEY")
pt = createPartitionedTable(dbHandle=db, table=t, tableName="pt", partitionColumns="id1",primaryKey=`id1`id2,indexes={"remark":"textindex(parser=english, full=false, lowercase=true, stem=true)"})
pt.tableInsert(t)

// Fuzzy search for "make"; with prefixLength=1, matching words must start with "m"
select * from pt where matchFuzzy(textCol=remark,term="make",minimumSimilarity=0.6,prefixLength=1)
id1 id2 remark
1 2 Mike likes apples.
2 1 Mike gives Alice an apple.
2 2 Alice gives Mike an orange.
3 1 Mike, can you give me some apples?
3 3 Alice made apple pie.
// Fuzzy search for "make"; with prefixLength=2, matching words must start with "ma"; output the score in a column named score
select * from pt where matchFuzzy(textCol=remark,term="make",minimumSimilarity=0.6,prefixLength=2,scoreColName="score")
id1 id2 remark score
3 3 Alice made apple pie. 0.7027325630187988
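
// A minimal sketch of two rules from the Details section, assuming the pt table created above. No output is shown because the results have not been verified here.
// With minimumSimilarity=1 the word must match exactly, so these two queries are expected to return the same rows.
// The positional call form matchAny(textCol, terms) is an assumption.
select * from pt where matchFuzzy(textCol=remark,term="apple",minimumSimilarity=1.0,prefixLength=0)
select * from pt where matchAny(remark,"apple")
// prefixLength=10 exceeds the length of "make" (4 characters), so it is expected to be adjusted to 4.
select * from pt where matchFuzzy(textCol=remark,term="make",minimumSimilarity=0.6,prefixLength=10)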