1. Introduction

In a previous article, we demonstrated how to configure and use Spring Data Elasticsearch for a project. In this article we will examine several query types offered by Elasticsearch and we’ll also talk about field analyzers and their impact on search results.

2. Analyzers

By default, all stored string fields are processed by an analyzer. An analyzer consists of exactly one tokenizer and zero or more token filters, and it may be preceded by one or more character filters.

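The default analyzer splits the string at common word separators (such as spaces or punctuation) and converts every token to lowercase. Depending on the Elasticsearch version and configuration, it may also drop common English stop words; in recent versions, the standard analyzer's stop word list is empty by default.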
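
To make the splitting and lowercasing concrete, here is a minimal sketch in plain Java. The class and method names are hypothetical, and the regular expression is a deliberate simplification; this is an illustration of the idea, not how the Elasticsearch analyzer is implemented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only: a rough model of the default analysis step.
final class SimpleAnalyzerSketch {

    // Splits on anything that is not a letter or digit, then lowercases each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) { // a leading separator yields an empty first element
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this sketch, SimpleAnalyzerSketch.analyze("Spring Data Elasticsearch") yields the tokens spring, data and elasticsearch.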

Elasticsearch can also be configured to index the same field as both analyzed and not-analyzed.

For example, in an Article class, suppose we store the title field as a standard analyzed field. The same field with the suffix verbatim will be stored as a not-analyzed field:

@MultiField(
  mainField = @Field(type = Text, fielddata = true),
  otherFields = {
      @InnerField(suffix = "verbatim", type = Keyword)
  }
)
private String title;

Here, we apply the @MultiField annotation to tell Spring Data that we would like this field to be indexed in several ways. The main field will use the name title and will be analyzed according to the rules described above.

But we also provide a second annotation, @InnerField, which describes an additional indexing of the title field. We use FieldType.Keyword to indicate that we do not want an analyzer applied during this additional indexing, and that the value should be stored in a sub-field with the suffix verbatim.

2.1. Analyzed Fields

Let’s look at an example. Suppose an article with the title “Spring Data Elasticsearch” is added to our index. The default analyzer will break up the string at the space characters and produce lowercase tokens: “spring”, “data”, and “elasticsearch”.

Now we may use any combination of these terms to match a document:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "elasticsearch data"))
  .build();

2.2. Non-analyzed Fields

A non-analyzed field is not tokenized, so it can only be matched as a whole when using match or term queries:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title.verbatim", "Second Article About Elasticsearch"))
  .build();

Using a match query on this field, we can only search by the full title, and the match is case-sensitive.

3. Match Query

A match query accepts text, numbers and dates.

There are three types of “match” query:

  • boolean
  • phrase
  • phrase_prefix

In this section we will explore the boolean match query.

3.1. Matching with Boolean Operators

boolean is the default type of match query; we can specify which boolean operator to use (or is the default):

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title","Search engines").operator(AND))
  .build();
List<Article> articles = getElasticsearchTemplate()
  .queryForList(searchQuery, Article.class);

This query would return an article with the title “Search engines”, because both terms from the title are specified and combined with the and operator. But what will happen if we search with the default (or) operator when only one of the terms matches?

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "Engines Solutions"))
  .build();
List<Article> articles = getElasticsearchTemplate()
  .queryForList(searchQuery, Article.class);
assertEquals(1, articles.size());
assertEquals("Search engines", articles.get(0).getTitle());

The “Search engines” article is still matched, but it will have a lower score because not all of the terms matched.

The scores of the individual matching terms add up to the total score of each resulting document.

There may be situations in which a document containing a rare term from the query ranks higher than a document that contains several common terms.
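
The difference between the two operators can be sketched in plain Java. This is a simplified model under stated assumptions: the class, enum and method names are hypothetical, the tokenization is the naive split-and-lowercase described earlier, and scoring is left out entirely:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of how the operator of a boolean match query decides a hit.
final class MatchOperatorSketch {

    enum Operator { AND, OR }

    // Naive analysis: split on non-letter/non-digit characters and lowercase.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                result.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    static boolean matches(String fieldValue, String queryText, Operator operator) {
        Set<String> docTerms = new HashSet<>(tokens(fieldValue));
        List<String> queryTerms = tokens(queryText);
        if (queryTerms.isEmpty()) {
            return false;
        }
        if (operator == Operator.AND) {
            return docTerms.containsAll(queryTerms); // every query term must be present
        }
        for (String term : queryTerms) {             // OR: one present term is enough
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model, the query text "Engines Solutions" matches the title "Search engines" with Operator.OR but not with Operator.AND, which mirrors the behaviour of the two queries above.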

3.2. Fuzziness

When the user makes a typo in a word, it is still possible to match it with a search by specifying a fuzziness parameter, which allows inexact matching.

For string fields fuzziness means the edit distance: the number of one-character changes that need to be made to one string to make it the same as another string.

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "spring date elasticsearch")
    .operator(AND)
    .fuzziness(Fuzziness.ONE)
    .prefixLength(3))
  .build();

The prefix_length parameter is used to improve performance. In this case, we require that the first three characters should match exactly, which reduces the number of possible combinations.
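
The edit distance and the prefix check described above can be sketched as follows. This is a plain Levenshtein distance (insertions, deletions and substitutions only) written for illustration; the class and method names are hypothetical, and Elasticsearch itself can additionally treat a swap of two adjacent characters as a single edit:

```java
// Hypothetical illustration of fuzziness: edit distance plus an exact-prefix requirement.
final class FuzzinessSketch {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // The first prefixLength characters must match exactly; the rest may differ by maxEdits.
    static boolean fuzzyMatch(String queryTerm, String indexedTerm, int maxEdits, int prefixLength) {
        String prefix = queryTerm.substring(0, Math.min(prefixLength, queryTerm.length()));
        return indexedTerm.startsWith(prefix) && distance(queryTerm, indexedTerm) <= maxEdits;
    }
}
```

Here, "date" and "data" are one edit apart and share the prefix "dat", so FuzzinessSketch.fuzzyMatch("date", "data", 1, 3) returns true, whereas "gate" is rejected by the prefix check before any distance is considered relevant.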

5. Phrase Search

Phrase search is stricter than a boolean match, although we can relax it with the slop parameter. This parameter tells the phrase query how far apart terms are allowed to be while still considering the document a match.

In other words, it represents the number of times you need to move a term in order to make the query and document match:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchPhraseQuery("title", "spring elasticsearch").slop(1))
  .build();

Here the query will match the document with the title “Spring Data Elasticsearch” because we set the slop to one.
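
One simplified way to model slop is to compare where each query term sits in the document with where it sits in the query, and to measure how far those shifts spread. The sketch below does this in plain Java; the class and method names are hypothetical, and it is only a rough model for illustration, not the scorer that Elasticsearch actually uses:

```java
import java.util.List;

// Hypothetical, simplified model of a phrase match with slop.
final class SlopSketch {

    static boolean matchesPhrase(List<String> docTokens, List<String> queryTerms, int slop) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        return search(docTokens, queryTerms, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop);
    }

    // Tries every document position for each query term and tracks the smallest and
    // largest shift (document position minus position in the query).
    private static boolean search(List<String> doc, List<String> query, int termIndex,
                                  int minShift, int maxShift, int slop) {
        if (termIndex == query.size()) {
            return maxShift - minShift <= slop; // spread of shifts must fit within the slop
        }
        for (int pos = 0; pos < doc.size(); pos++) {
            if (doc.get(pos).equals(query.get(termIndex))) {
                int shift = pos - termIndex;
                if (search(doc, query, termIndex + 1,
                        Math.min(minShift, shift), Math.max(maxShift, shift), slop)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the tokens spring, data, elasticsearch and the query terms spring, elasticsearch, the shifts are 0 and 1, so the sketch reports a match with a slop of 1 and no match with a slop of 0.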

6. Multi Match Query

To search in multiple fields, we can use QueryBuilders#multiMatchQuery() and specify all the fields to match:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(multiMatchQuery("tutorial")
    .field("title")
    .field("tags")
    .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
  .build();

Here we search the title and tags fields for a match.

Notice that here we use the “best fields” scoring strategy. It takes the maximum score among the fields as the document score.
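
The "maximum score among the fields" rule can be written down in a few lines of plain Java. The class name, method name and the per-field scores used in the example are all hypothetical; real scores come from Elasticsearch and are not computed here:

```java
import java.util.Map;

// Hypothetical illustration of the "best fields" idea: the best single field wins.
final class BestFieldsSketch {

    // Takes one already-computed score per field and returns the largest one.
    static double bestFieldsScore(Map<String, Double> scoresByField) {
        double best = 0.0;
        for (double score : scoresByField.values()) {
            best = Math.max(best, score);
        }
        return best;
    }
}
```

If a document hypothetically scored 1.2 on title and 0.4 on tags, this sketch would report 1.2 as the document score.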

7. Aggregations

In our Article class we have also defined a tags field, which is non-analyzed. We could easily create a tag cloud by using an aggregation.

Remember that, because the field is non-analyzed, the tags will not be tokenized:

TermsAggregationBuilder aggregation = AggregationBuilders.terms("top_tags")
  .field("tags")
  .order(Terms.Order.count(false));
SearchResponse response = client.prepareSearch("blog")
  .setTypes("article")
  .addAggregation(aggregation)
  .execute().actionGet();

Map<String, Aggregation> results = response.getAggregations().asMap();
StringTerms topTags = (StringTerms) results.get("top_tags");

List<String> keys = topTags.getBuckets()
  .stream()
  .map(b -> b.getKeyAsString())
  .collect(toList());
assertEquals(asList("elasticsearch", "spring data", "search engines", "tutorial"), keys);
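
Conceptually, the terms aggregation counts how many documents carry each tag and orders the buckets by that count. The plain Java sketch below models this in memory; the class and method names and the sample data are hypothetical, and breaking ties alphabetically is an assumption of the sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a terms aggregation ordered by descending count.
final class TagCloudSketch {

    static List<String> topTags(List<List<String>> tagsPerArticle) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> tags : tagsPerArticle) {
            for (String tag : tags) {
                // Tags are not tokenized, so "spring data" stays a single bucket key.
                counts.merge(tag, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> buckets = new ArrayList<>(counts.entrySet());
        buckets.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // highest count first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Long> bucket : buckets) {
            keys.add(bucket.getKey());
        }
        return keys;
    }
}
```

Because each tag is used as a whole, a multi-word tag such as "spring data" is counted as one entry instead of being split into "spring" and "data".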

8. Summary

In this article we discussed the difference between analyzed and non-analyzed fields, and how this distinction affects search.

We also learned about several types of queries provided by Elasticsearch, such as the match query, the phrase match query, and the multi-match query, along with fuzzy matching and a terms aggregation.

Elasticsearch provides many other types of queries, such as geo queries, script queries and compound queries. You can read about them in the Elasticsearch documentation and explore the Spring Data Elasticsearch API in order to use these queries in your code.

You can find a project containing the examples used in this article in the GitHub repository.

Vijay Mohan
Guest
Vijay Mohan

Newer versions of Spring Data removed @NestedField and added the @InnerField annotation. Hope this is helpful!

Eugen Paraschiv
Guest

Definitely useful Vijay – thanks. I’m adding it to the content calendar to update it soon. Cheers,
Eugen.

Eugen Paraschiv
Guest

Quick followup on this – the update is done. Cheers,
Eugen.

hsen
Guest
hsen

Thanks Eugen, nice article.
Can Elasticsearch be used to search for words in files?
E.g. how many times a word occurs in a file, etc.?

Eugen Paraschiv
Guest

The simple answer is yes, but not exactly like you’re describing there.
Simply put, you’ll have to index the contents of the file, and then – sure, you’ll be able to perform search normally.