Elasticsearch document inconsistency

Problem :­
A particular search query would show results sometimes and sometimes won't.


Root cause analysis :­
A pattern observed with the hits and misses was of 011100 occuring in a repeated way.
Elasticsearch by default delegates the queries amongst the nodes till it finds the results and hence, if you want to direct your query to a particular node, a "preference" parameter can be set to "nodeID"

The way to get a nodeID (the one that I found) is ­
curl ­XGET 'localhost:9200/_nodes'

wherein every node has a ID associated in the json header.
The nodeID can be used as ­
searchRequestBuilder.setPreference("_only_node:xyz");

Using this preference, one can narrow down the shard that was showing this inconsistent behaviour for the given document. The shard stats showed that it had the same number of "numDocs" and hence it seemed for some reason, the document was not searchable on that shard.

Found a similar issue in elasticsearch groups ­

Resolution :­
The elasticsearch logs did not show any issue on the nodes and that "refresh" and "flush" on the index did not produce any exception/error in the logs and did not resolve the issue.
As per the issue thread, people had recreated their indexes or stuck to using the primary shard(which wasnt an option as recreating the index would take a lot of time and hitting the primary shard only always gave us no hits)
The no­-downtime way was to bring the replica shards down to 0 and create them again. (which luckily fixed the issue for us)

curl ­XPUT 'localhost:9200/my_index/_settings' ­d '{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Run an indexing job to fill back in the documents lost in the process.

Possible Causes :­
We had some downtimes in the cluster due to high CPU usage and maybe such outages would have caused an inconsistency in the shards. Though, we still kept getting consistent results for most of the queries we had.
So the guess is that any such issue can cause elasticsearch shards to become inconsistent and the resolution is variable/dependent on the number of documents in the index and the possible downtime that can be taken.

0 comments :