Support for relaxed query terms against indexed fields #25970

mattweber · 2023-02-09T19:21:23Z

Is your feature request related to a problem? Please describe.

I cannot use relaxed query matching (wildcards, regular expressions, fuzzy matching) against indexed fields, or inside the query operators that work on them, such as phrase/near/onear.

Describe the solution you'd like

Add support for relaxed query terms against indexed fields.
Ideally support analysis where possible (wildcard terms can be lowercased, have character normalization, etc).
Add the ability to cap the number of term expansions, where hitting the cap is either an error or a signal to stop expanding.
Be smart about duplicated relaxed terms: for example, ((a* NEAR b) OR (a* ONEAR c)) should perform the expensive term dictionary scan for a* only once.
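
A minimal sketch of the behaviour requested in the list above, written in Python purely for illustration. It is not Vespa code: `PrefixExpander`, `normalize`, `max_expansions` and `fail_on_limit` are hypothetical names, the term dictionary is modelled as a sorted in-memory list, and only trailing-`*` wildcards are handled.

```python
import bisect
import unicodedata


class ExpansionLimitExceeded(Exception):
    """Raised when a relaxed term expands to more terms than allowed."""


def normalize(term):
    """Hypothetical analysis step: lowercase and strip accents."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class PrefixExpander:
    """Expands 'abc*' style terms against a sorted term dictionary."""

    def __init__(self, sorted_terms, max_expansions, fail_on_limit=True):
        self.terms = sorted_terms          # assumed sorted and already normalized
        self.max_expansions = max_expansions
        self.fail_on_limit = fail_on_limit
        self.cache = {}                    # normalized prefix -> expansion
        self.scans = 0                     # number of dictionary scans performed

    def expand(self, pattern):
        prefix = normalize(pattern.rstrip("*"))
        if prefix in self.cache:           # duplicated relaxed term: reuse result
            return self.cache[prefix]
        self.scans += 1
        matches = []
        start = bisect.bisect_left(self.terms, prefix)
        for i in range(start, len(self.terms)):
            term = self.terms[i]
            if not term.startswith(prefix):
                break                      # sorted order: no later term can match
            if len(matches) == self.max_expansions:
                if self.fail_on_limit:
                    raise ExpansionLimitExceeded(pattern)
                break                      # limit treated as "stop expanding"
            matches.append(term)
        self.cache[prefix] = matches
        return matches
```

With this shape, a query such as ((a* NEAR b) OR (a* ONEAR c)) would call `expand("a*")` twice but scan the dictionary once, and the caller chooses whether exceeding the limit fails the query or truncates the expansion.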

Describe alternatives you've considered

Using attributes, which are in-memory and don't work with phrase/near/onear.
Doing the expansion outside of the engine or in a query component.

Additional context
Wildcards and proximity operations against large free-text fields are a very popular scenario in enterprise search use cases.

bratseth · 2023-03-13T10:06:04Z

Refer to https://swtch.com/~rsc/regexp/regexp4.html
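
The linked article describes regular expression matching backed by a trigram index. A rough Python sketch of that idea, simplified here to `*` wildcards over a list of terms: the function names are illustrative assumptions, not taken from the article or from Vespa.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(terms):
    """Map each trigram to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in trigrams(term):
            index[gram].add(term)
    return index


def wildcard_search(index, all_terms, pattern):
    """Prefilter candidates by required trigrams, then verify with a regex."""
    pieces = pattern.split("*")
    required = set()
    for piece in pieces:
        required |= trigrams(piece)        # literal runs shorter than 3 add nothing
    candidates = set(all_terms)            # no usable trigram means a full scan
    for gram in required:
        candidates &= index.get(gram, set())
    regex = re.compile(".*".join(re.escape(piece) for piece in pieces))
    return sorted(term for term in candidates if regex.fullmatch(term))
```

The trigram step only narrows the candidate set; the final regex check is still needed, because a term can contain every required trigram without matching the pattern.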

Alexander-Mark · 2025-02-06T02:03:34Z

We have a similar situation where fuzzy matching against an indexed array<string> field in streaming mode would be very convenient.

If we try fuzzy matching directly, the query fails with:

"FUZZY(waste management,1,0,false) toc_label:waste management field is not a string attribute"

We could then try n-gram matching, but:

n-gram matching is not supported for streaming search

We could try substring/prefix matching, which is a slight improvement but still doesn't handle typos.
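
To make the gap concrete, here is a small Python sketch comparing prefix matching with an edit-distance check. It assumes the first numeric argument in the FUZZY(...) term above is the maximum edit distance; the helper names are made up for illustration and are not a Vespa API.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(query, value, max_edit_distance=1):
    """True if value is within max_edit_distance edits of query."""
    return levenshtein(query.lower(), value.lower()) <= max_edit_distance


def prefix_match(query, value):
    """True if value starts with query, ignoring case."""
    return value.lower().startswith(query.lower())
```

For example, the misspelled query "waste managment" is one edit away from "waste management", so `fuzzy_match` accepts it while `prefix_match` rejects it.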

Our only other option at the moment is a synthetic string attribute field declared outside the document:

field myStringArrayAttribute type array<string> {
    indexing: input myStringArray | attribute
}

But then the string field would be stored in memory, significantly increasing memory usage and defeating the point of using streaming mode. Is that understanding correct?

It would be great if Vespa supported either of the following to help in this situation:

n-gram support in streaming mode. I've opened an issue here: n-gram matching support in streaming mode #33051
fuzzy matching support for indexed fields.

jobergum added the enhancement and good first issue labels Feb 9, 2023

mattweber mentioned this issue Feb 9, 2023

Support physical fieldsets / composite fields #25971

Closed

johans1 assigned bratseth Feb 15, 2023

johans1 added this to the later milestone Feb 15, 2023

johans1 assigned vekterli Feb 15, 2023

johans1 removed the good first issue label Feb 15, 2023

jobergum mentioned this issue Sep 27, 2023

MatchCount didn't calculate number of keywords occurs in text #28628

Closed

Alexander-Mark mentioned this issue Feb 6, 2025

n-gram matching support in streaming mode #33051

Open
