Support for relaxed query terms against indexed fields #25970

Open
mattweber opened this issue Feb 9, 2023 · 2 comments

@mattweber

Is your feature request related to a problem? Please describe.

I am unable to use relaxed query matching (wildcards, regular expressions, fuzzy matching) against indexed fields, or inside the query operators that work on them, such as phrase/near/onear.

Describe the solution you'd like

  • Add support for relaxed query terms against indexed fields.
  • Ideally, apply analysis where possible (for example, wildcard terms can be lowercased and character-normalized).
  • Add a configurable maximum number of term expansions, where hitting the limit either raises an error or simply stops further expansion.
  • Be smart about duplicated relaxed terms: for example, ((a* NEAR b) OR (a* ONEAR c)) should perform the expensive term dictionary scan for a* only once.
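
To make the last two bullets concrete, here is a minimal Python sketch of the requested behavior. It is only an illustration, not how Vespa works internally: the class `TermExpander`, the parameters `max_expansions` and `on_limit`, and the in-memory sorted term list are all hypothetical stand-ins for the real term dictionary. It handles trailing-`*` prefix wildcards only, lowercases the pattern as a stand-in for analysis, and caches each expansion so a repeated `a*` costs one scan.

```python
from bisect import bisect_left


class ExpansionLimitExceeded(Exception):
    """Raised when a relaxed term expands to more terms than allowed."""


class TermExpander:
    """Hypothetical prefix-wildcard expander over a sorted term dictionary."""

    def __init__(self, terms, max_expansions=3, on_limit="error"):
        # Terms are assumed to be already analyzed (e.g. lowercased) at index time.
        self.terms = sorted(set(terms))
        self.max_expansions = max_expansions
        self.on_limit = on_limit  # "error" or "truncate"
        self.scans = 0            # number of dictionary scans actually performed
        self._cache = {}

    def expand(self, pattern):
        # Only trailing-* prefix wildcards are handled in this sketch.
        if not pattern.endswith("*"):
            return [pattern]
        prefix = pattern[:-1].lower()  # stand-in for query-side analysis
        if prefix in self._cache:
            return self._cache[prefix]  # duplicate relaxed term: no second scan
        self.scans += 1
        matches = []
        i = bisect_left(self.terms, prefix)  # jump to the first candidate term
        while i < len(self.terms) and self.terms[i].startswith(prefix):
            if len(matches) == self.max_expansions:
                if self.on_limit == "error":
                    raise ExpansionLimitExceeded(pattern)
                break  # "truncate": stop expanding, keep what we have
            matches.append(self.terms[i])
            i += 1
        self._cache[prefix] = matches
        return matches


expander = TermExpander(["apple", "apply", "apt", "banana"],
                        max_expansions=2, on_limit="truncate")
left = expander.expand("a*")   # used by (a* NEAR b)
right = expander.expand("A*")  # used by (a* ONEAR c); served from the cache
```

With `on_limit="error"` the same query would raise `ExpansionLimitExceeded` instead of silently truncating, which covers both behaviors asked for above.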

Describe alternatives you've considered

  • Using attributes, which are in-memory and don't work with phrase/near/onear.
  • Doing the expansion outside of the engine or in a query component.

Additional context
Wildcards and proximity operations against large free-text fields are a very popular scenario in enterprise search use cases.

@bratseth
Member

@Alexander-Mark

We have a similar situation where fuzzy matching against an indexed array<string> field in streaming mode would be very convenient.

If we simply try fuzzy matching, we get this error:

"FUZZY(waste management,1,0,false) toc_label:waste management field is not a string attribute"

We could then try n-gram matching, however:

n-gram matching is not supported for streaming search

We could try substring/prefix matching, which is a slight improvement but still doesn't handle typos.
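
To show the gap concretely, here is a small Python sketch comparing prefix matching with edit-distance matching on a misspelled query. It uses plain Levenshtein distance as an assumption about what "fuzzy" means here, and the helper names (`edit_distance`, `prefix_match`, `fuzzy_match`) are made up for illustration; none of this is Vespa code.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def prefix_match(query, value):
    """True if the stored value starts with the query string."""
    return value.startswith(query)


def fuzzy_match(query, value, max_edits=1):
    """True if the query is within max_edits edits of the stored value."""
    return edit_distance(query, value) <= max_edits


label = "waste management"
typo = "waste managment"  # missing an "e"

prefix_hit = prefix_match(typo, label)  # False: the typo breaks the prefix
fuzzy_hit = fuzzy_match(typo, label)    # True: one insertion fixes it
```

A prefix or substring match needs the query to appear verbatim in the value, so a single dropped character is enough to miss, while an edit distance of 1 still matches.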

So our only other option currently is a synthetic string attribute field defined outside the document:

field myStringArrayAttribute type array<string> {
    indexing: input myStringArray | attribute
}

But then the string field would be stored in memory, significantly increasing memory usage and defeating the point of using streaming mode. Is that understanding correct?

It would be great if Vespa could support an option to help us in this situation:

  1. n-gram support in streaming mode. I've opened an issue here: n-gram matching support in streaming mode #33051
  2. fuzzy matching support for indexed fields.
