FTS indexing and underscores

michael.renaud · July 21, 2022, 12:39pm

Hi

I have a case in which the full-text search on a screen produces unexpected results:

1st search “13_”, the object “13_TRANS[…]” is found

2nd search with “trans”, not found

And it’s consistent across all screens.

The user expects the object found by the 1st search to also appear in the 2nd search. I suspect this is related to the underscore in the name.

I’ve read some material about issues with Lucene indexing and underscores. Is there some specific configuration I could change?

Regards
Michael

gorbunkov · July 26, 2022, 11:42am

Hi,

In FTS, the underscore character is not treated as a word separator, so it doesn’t split the word into separate tokens. The whole name “13_TRANS[…]” is indexed as a single token, and a search for “trans” only matches tokens that begin with “trans”. That’s why your second search finds nothing. You may try adding * at the beginning of the search query (see docs). In that case the term is matched anywhere inside the word, not only at its beginning.
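To make the behavior concrete, here is a minimal plain-Java sketch that does not use Lucene or the FTS add-on. It copies the isTokenChar rule quoted later in this thread and splits text into runs of token characters. The class name, the sample name "13_TRANS_EXAMPLE", and the prefixMatch and infixMatch helpers are assumptions made for illustration; they only approximate a default search and a search with a leading *, and are not the add-on's actual search code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UnderscoreTokenDemo {

    // Same rule as the FTS.isTokenChar method quoted in this thread.
    public static boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '/'
                || c == '\\' || c == '$' || c == '^';
    }

    // Splits text into maximal runs of token characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical stand-in for a default search: a token must start with the term.
    public static boolean prefixMatch(String text, String term) {
        String t = term.toLowerCase(Locale.ROOT);
        return tokenize(text).stream()
                .anyMatch(tok -> tok.toLowerCase(Locale.ROOT).startsWith(t));
    }

    // Hypothetical stand-in for a search with a leading *: the term may occur anywhere in a token.
    public static boolean infixMatch(String text, String term) {
        String t = term.toLowerCase(Locale.ROOT);
        return tokenize(text).stream()
                .anyMatch(tok -> tok.toLowerCase(Locale.ROOT).contains(t));
    }

    public static void main(String[] args) {
        String name = "13_TRANS_EXAMPLE"; // made-up object name
        System.out.println(tokenize(name));             // one token, underscores included
        System.out.println(prefixMatch(name, "13_"));   // true
        System.out.println(prefixMatch(name, "trans")); // false
        System.out.println(infixMatch(name, "trans"));  // true
    }
}
```

With this rule the whole name stays one token, so only a term that matches its beginning (or a wildcard search) can find it.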

Alternatively, you may try to reconfigure the analyzer so that it treats the underscore character as a word separator. You’ll probably need to override the com.haulmont.fts.core.sys.IndexWriterProviderBean. This discussion may be helpful.

You will also need to change the com.haulmont.fts.core.sys.EntityAttributeTokenizer class, because it delegates to FTS.isTokenChar, which treats the underscore as part of a token rather than as a separator:

public class EntityAttributeTokenizer extends CharTokenizer {

    public EntityAttributeTokenizer() {
        super();
    }

    @Override
    protected boolean isTokenChar(int c) {
        return FTS.isTokenChar(c);
    }
}

// The FTS.isTokenChar method that the tokenizer delegates to:
    public static boolean isTokenChar(int c) {
        // '_' is listed as a token character, so it never splits a word
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '/' || c == '\\' || c == '$' || c == '^';
    }
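As a rough illustration of the effect of such a change, the sketch below uses a variant of the rule above with the underscore check removed. It is plain Java without Lucene. The class name and the sample name "13_TRANS_EXAMPLE" are made up, and this is not the add-on's code. In a real project the same predicate would presumably go into a custom CharTokenizer subclass wired in through the overridden bean, and existing indexes would most likely need to be rebuilt afterwards.

```java
import java.util.ArrayList;
import java.util.List;

public class UnderscoreSeparatorSketch {

    // Hypothetical variant of the quoted rule: '_' is no longer a token character.
    public static boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '/'
                || c == '\\' || c == '$' || c == '^';
    }

    // Splits text into maximal runs of token characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [13, TRANS, EXAMPLE]
        System.out.println(tokenize("13_TRANS_EXAMPLE"));
    }
}
```

Once “TRANS” is a token of its own, a search for “trans” can match it from the beginning of the word, assuming the search side compares case-insensitively and tokenizes queries with the same rule.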

michael.renaud · August 5, 2022, 12:39pm

Hi @gorbunkov
That’s clear, thanks for the hint. I will try that.
Michael