Solr is one of the most popular Lucene-based search solutions. It's fast, distributed, robust, flexible and has an active developer community behind it. SolrCloud is the new, distributed version of Solr.
One of its key features is near real-time (NRT) search, i.e., documents become available for search almost as soon as they are indexed.
2. Indexing in SolrCloud
A collection in Solr is made up of one or more shards, and each shard has one or more replicas. One of the replicas of a shard is selected as the leader for that shard when a collection is created. Indexing then proceeds as follows:
- When a client tries to index a document, the document is first assigned a shard based on the hash of the id of the document
- The client gets the URL of the leader of that shard from ZooKeeper, and finally, the index request is made to that URL
- The shard leader indexes the document locally before sending it to replicas
- Once the leader receives an acknowledgment from all active and recovering replicas, it returns confirmation to the indexing client application
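The routing step in the list above can be sketched as follows. This is a toy model under stated assumptions, not Solr's router: it uses CRC32 and a simple modulo, whereas Solr's default router hashes ids with MurmurHash3 and assigns each shard a hash range. The class name ShardRouterSketch and the leader URLs are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Toy model of hash-based document routing; NOT Solr's actual router.
class ShardRouterSketch {
    // Hypothetical leader URL for each shard, in shard order.
    private final List<String> leaderUrls;

    ShardRouterSketch(List<String> leaderUrls) {
        this.leaderUrls = new ArrayList<>(leaderUrls);
    }

    // Map a document id to a shard index by hashing the id.
    int shardFor(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        // getValue() is non-negative, so the remainder is a valid index.
        return (int) (crc.getValue() % leaderUrls.size());
    }

    // The URL the index request would be sent to.
    String leaderUrlFor(String docId) {
        return leaderUrls.get(shardFor(docId));
    }
}
```

The point of hashing the id is that the same document always lands on the same shard, so an update to an existing id reaches the shard that already holds it.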
When we index a document in Solr, it doesn't go to the on-disk index directly. It's first written to what is called a tlog (transaction log). Solr uses the transaction log to ensure that documents are not lost before they are committed, in case of a system crash.
If the system crashes before the documents in the transaction log are committed, i.e., persisted to disk, the transaction log is replayed when the system comes back up, leading to zero loss of documents.
Every index/update request is logged to the transaction log, which continues to grow until we issue a hard commit.
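To make the replay idea concrete, here is a minimal sketch of a write-ahead log. It is a toy model, not Solr's implementation: the lists stand in for the on-disk log, the on-disk index, and the in-memory state, and the name TlogSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a transaction log with crash recovery; NOT Solr's implementation.
class TlogSketch {
    private final List<String> tlog = new ArrayList<>();      // stands in for the on-disk log
    private final List<String> committed = new ArrayList<>(); // stands in for the on-disk index
    private List<String> inMemory = new ArrayList<>();        // uncommitted state, lost on crash

    // Log the update first, so it survives a crash, then buffer it.
    void index(String doc) {
        tlog.add(doc);
        inMemory.add(doc);
    }

    // Persist all buffered documents and start a fresh log.
    void hardCommit() {
        committed.addAll(inMemory);
        inMemory.clear();
        tlog.clear();
    }

    // Simulate a crash: everything that was only in memory is gone.
    void crash() {
        inMemory = new ArrayList<>();
    }

    // On restart, replay the log to rebuild the uncommitted state.
    void recover() {
        inMemory.addAll(tlog);
    }

    int durableCount() { return committed.size(); }
    int pendingCount() { return inMemory.size(); }
    int tlogSize() { return tlog.size(); }
}
```

In this model, a document indexed after the last commit disappears from memory on crash() but reappears after recover(), because it was logged before it was buffered.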
3. Commits in SolrCloud
A commit operation means finalizing a change and persisting that change on disk. SolrCloud provides two kinds of commit operations viz. a commit and a soft commit.
3.1. Commit (Hard Commit)
A commit or hard commit is one in which Solr flushes all uncommitted documents in a transaction log to disk. The active transaction log is processed, and then a new transaction log file is opened.
It also refreshes a component called a searcher so that the newly committed documents become available for searching. A searcher can be considered as a read-only view of all committed documents in the index.
The commit operation can be done explicitly by the client by calling the commit API:
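    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Connect through the ZooKeeper ensemble that manages the cluster
    String zkHostString = "zkServer1:2181,zkServer2:2181,zkServer3:2181/solr";
    SolrClient solr = new CloudSolrClient.Builder()
      .withZkHost(zkHostString)
      .build();

    // Build the document to index
    SolrInputDocument doc1 = new SolrInputDocument();
    doc1.addField("id", "123abc");
    doc1.addField("date", "14/10/2017");
    doc1.addField("book", "To kill a mockingbird");
    doc1.addField("author", "Harper Lee");

    // Note: CloudSolrClient needs a target collection, either a default
    // collection set on the client or the add(collection, doc) overload
    solr.add(doc1);
    solr.commit(); // explicit hard commit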
Alternatively, it can be automated as autoCommit by specifying it in the solrconfig.xml file, see section 3.3.
3.2. SoftCommit
Softcommit was added in Solr 4, primarily to support the NRT feature of SolrCloud. It's a mechanism for making documents searchable in near real-time by skipping the costly aspects of hard commits.
During a softcommit, the transaction log is not truncated; it continues to grow. However, a new searcher is opened, which makes the documents indexed since the last softcommit visible for searching. Also, some of the top-level caches in Solr are invalidated, so it's not a completely free operation.
When we specify the maxTime for softcommit as 1000, it means that the document will be available in queries no later than 1 second from the time it got indexed.
This feature grants SolrCloud the power of near real-time searching, as new documents can be made searchable even without hard-committing them. Softcommit is usually configured as autoSoftCommit in the solrconfig.xml file, see section 3.3, although a client can also request one explicitly, for example by sending softCommit=true with an update request.
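The difference between the two commit types can be sketched with a small model. It is a simplification under stated assumptions, not how Solr is implemented: "persisting" is reduced to a counter, the searcher is an immutable copy of a list, and CommitSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model contrasting hard and soft commits; NOT Solr's implementation.
class CommitSketch {
    private final List<String> tlog = new ArrayList<>();     // stands in for the transaction log
    private final List<String> indexed = new ArrayList<>();  // every document accepted so far
    private int persistedCount = 0;                          // documents "flushed to disk"
    private List<String> searcher = Collections.emptyList(); // read-only view used by queries

    void index(String doc) {
        tlog.add(doc);
        indexed.add(doc);
    }

    // Soft commit: open a new searcher, leave the transaction log alone.
    void softCommit() {
        openNewSearcher();
    }

    // Hard commit: persist everything, roll the log, optionally open a searcher.
    void hardCommit(boolean openSearcher) {
        persistedCount = indexed.size();
        tlog.clear();
        if (openSearcher) {
            openNewSearcher();
        }
    }

    private void openNewSearcher() {
        searcher = Collections.unmodifiableList(new ArrayList<>(indexed));
    }

    List<String> search() { return searcher; }
    int persistedCount() { return persistedCount; }
    int tlogSize() { return tlog.size(); }
}
```

In this model, softCommit() changes what search() returns without touching persistedCount or the log, while hardCommit(false) does the opposite: it persists and rolls the log but leaves the old searcher in place.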
3.3. Autocommit and Autosoftcommit
The solrconfig.xml file is one of the most important configuration files in SolrCloud. It is generated at the time of collection creation. To enable autoCommit or autoSoftCommit, we need to update the following sections in the file:
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>30000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>6000</maxTime>
      <maxDocs>1000</maxDocs>
    </autoSoftCommit>
maxTime: The number of milliseconds since the earliest uncommitted update after which the next commit/softcommit should happen.
maxDocs: The number of updates that have occurred since the last commit and after which the next commit/softcommit should happen.
openSearcher: This property tells Solr whether to open a new searcher after a commit operation or not. If it's true, after a commit, the old searcher is closed, and a new searcher is opened, making the committed documents visible for searching. If it's false, the documents are persisted but won't be visible for searching until a later commit or softcommit opens a new searcher.
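How maxTime and maxDocs interact can be illustrated with a small tracker: a commit becomes due as soon as either threshold is reached. This is a sketch of the rule as described above, not Solr's scheduling code; the class AutoCommitTrackerSketch is hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```java
// Toy model of the maxTime/maxDocs trigger rule; NOT Solr's scheduling code.
class AutoCommitTrackerSketch {
    private final long maxTimeMs;
    private final int maxDocs;
    private long earliestPendingAtMs = -1; // time of the earliest uncommitted update
    private int pendingDocs = 0;           // updates since the last commit

    AutoCommitTrackerSketch(long maxTimeMs, int maxDocs) {
        this.maxTimeMs = maxTimeMs;
        this.maxDocs = maxDocs;
    }

    // Record an update; the first one after a commit starts the timer.
    void onUpdate(long nowMs) {
        if (pendingDocs == 0) {
            earliestPendingAtMs = nowMs;
        }
        pendingDocs++;
    }

    // A commit is due when either threshold has been reached.
    boolean commitDue(long nowMs) {
        if (pendingDocs == 0) {
            return false; // nothing to commit
        }
        return pendingDocs >= maxDocs
            || nowMs - earliestPendingAtMs >= maxTimeMs;
    }

    // Reset both counters after a commit.
    void onCommit() {
        pendingDocs = 0;
        earliestPendingAtMs = -1;
    }
}
```

With the autoCommit values shown above (maxTime 30000, maxDocs 10000), a single update would be committed about 30 seconds after it arrived, while a burst of 10000 updates would trigger a commit right away.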
4. Near Real-Time Search
Near Real-Time Searching is achieved in Solr using a combination of commit and softcommit. As mentioned before, when a document is added to Solr, it won’t be visible in search results until it’s committed to the index.
Normal commits are costly, which is why softcommits are useful. But, as softcommit doesn't persist the documents, we do need to set the autocommit maxTime interval (or maxDocs) to a reasonable value, depending upon the load we are expecting.
4.1. Real-Time Gets
There is another feature provided by Solr which is, in fact, real-time: the get API. The get API can return a document that has not even been soft committed yet.
It searches directly in the transaction logs if the document is not found in the index. So we can fire a get API call immediately after the index call returns, and we'll still be able to retrieve the document.
However, as with most things that sound too good, there is a catch. We need to pass the id of the document in the get API call. Of course, we can provide other filter queries along with the id, but without the id, the call doesn't work.
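As an illustration, a request to the get API is an HTTP call to the /get handler of a collection, with the id and an optional fq parameter. The helper below only assembles such a URL; the class name, the base URL, the collection name books, and the filter query are assumptions made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a URL for the real-time get handler; host and collection are hypothetical.
class RealTimeGetUrlSketch {
    static String build(String baseUrl, String collection, String id, String filterQuery) {
        try {
            // The id parameter is mandatory for a real-time get.
            StringBuilder url = new StringBuilder(baseUrl)
                .append('/').append(collection)
                .append("/get?id=").append(URLEncoder.encode(id, "UTF-8"));
            // A filter query is optional and only narrows the result.
            if (filterQuery != null) {
                url.append("&fq=").append(URLEncoder.encode(filterQuery, "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, build("http://localhost:8983/solr", "books", "123abc", "author:Lee") yields http://localhost:8983/solr/books/get?id=123abc&fq=author%3ALee. From SolrJ, the getById method on the client serves the same purpose.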
5. Conclusion
Solr gives us quite a bit of flexibility in tuning its NRT capability. To get the best performance out of the server, we need to experiment with the commit and softcommit settings, based on our use case and expected load.
We shouldn't keep our commit interval too long, or else our transaction log will grow to a considerable size. At the same time, we shouldn't execute our softcommits too frequently, since each one opens a new searcher and invalidates caches.
It's also advisable to performance-test our system properly before we go to production. We should check whether the documents are becoming searchable within our desired time interval.