The ElasticSearch cat APIs
I like ElasticSearch. It’s a great piece of open source technology. Although it was built as a Lucene-based search engine, it can do more than just that: it’s an awesome analytics engine and also a pretty good NoSQL database.
Interacting with ElasticSearch happens through the REST API, and the output is JSON. JSON is cool, JSON is fun, but it’s not really made to be read by humans.
That’s where the ElasticSearch cat APIs come into play.
Cat? Are there animals involved?
Not really … the ElasticSearch cat APIs have nothing to do with feline creatures. The name refers to the cat command in Unix. Instead of outputting JSON, the cat APIs send their output as plain text, line by line. No JSON parsing required: items are separated by a new line, properties of an item by whitespace.
Makes sense, right?
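To make the "one item per line, one property per column" idea concrete, here is a minimal Python sketch. The function name parse_cat_rows is made up for this illustration, and the sample text is the /_cat/nodes line shown later in this post. It assumes that no column value contains a space.

```python
def parse_cat_rows(text):
    """Split cat-style output into a list of rows, each a list of column values."""
    rows = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            # split() without arguments also copes with padded columns
            rows.append(line.split())
    return rows


# Sample taken from the /_cat/nodes output in this post
sample = "localhost 127.0.0.1 3 43 1.13 d * Sabretooth\n"
rows = parse_cat_rows(sample)
print(rows[0][0], rows[0][-1])
```

Without a header line you have to know what each position means, which is exactly the problem the next section deals with.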
Calling them
Calling them is quite easy: you just issue a GET request to the “_cat” resource of your ElasticSearch server. With curl, that looks like this:
curl "http://localhost:9200/_cat"
This is the output you get:
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
As you can see, the output contains a pretty extensive list of meta information items you can query.
Let’s try calling a specific one:
curl "http://localhost:9200/_cat/nodes"
Here’s the output:
localhost 127.0.0.1 3 43 1.13 d * Sabretooth
What does all of this mean?
There’s no header line with the column names. Or is there?
By adding the “v” parameter to the query string of a cat API call, we get more verbose output that includes a header line.
This is what the URL looks like:
curl "http://localhost:9200/_cat/nodes?v"
And here is some meaningful output:
host      ip        heap.percent ram.percent load node.role master name
localhost 127.0.0.1            3          45 1.72 d         *      Sabretooth
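Once the header line is there, each row can be turned into a small dictionary keyed by column name. The sketch below is only an illustration: parse_cat_table is a hypothetical helper, the sample is the verbose output above, and it assumes that no value contains a space (a node name with a space in it would break this).

```python
def parse_cat_table(text):
    """Turn verbose ("?v") cat output into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Sample taken from the /_cat/nodes?v output in this post
sample = (
    "host ip heap.percent ram.percent load node.role master name\n"
    "localhost 127.0.0.1 3 45 1.72 d * Sabretooth\n"
)
nodes = parse_cat_table(sample)

# The master node is the row with "*" in the master column
masters = [node["name"] for node in nodes if node["master"] == "*"]
print(masters)
```

All values stay strings here; converting heap.percent or load to numbers is left to the caller.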
We can also limit the number of columns by adding the “h” parameter to the query string, followed by a comma-separated list of column names.
curl "http://localhost:9200/_cat/nodes?v&h=host,ip,name"
The example above adds the column names and outputs the server host, the ip and the name of the server.
host      ip        name
localhost 127.0.0.1 Sabretooth
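If you build these URLs from a script, a tiny helper keeps the query string tidy. This is just a sketch: cat_url and its arguments are names I made up, and it only covers the "v" and "h" parameters shown above.

```python
def cat_url(base, endpoint, columns=None, verbose=False):
    """Build a cat API URL such as http://localhost:9200/_cat/nodes?v&h=host,ip,name."""
    url = f"{base.rstrip('/')}/_cat/{endpoint}"
    params = []
    if verbose:
        params.append("v")  # adds the header line with the column names
    if columns:
        params.append("h=" + ",".join(columns))  # limits the output to these columns
    return url + ("?" + "&".join(params) if params else "")


print(cat_url("http://localhost:9200", "nodes", ["host", "ip", "name"], verbose=True))
```

The helper does no escaping, which is fine for plain column names like the ones used in this post.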
Another thing we can do is perform a “help” call on a specific API. This call gives more information about the meaning of each column.
curl "http://localhost:9200/_cat/nodes?help"
The output lists a lot more fields than you saw before. That’s because an API call only shows a default set of columns; the other fields only appear when you explicitly request them using the “h” parameter.
id           | id,nodeId        | unique node id
pid          | p                | process id
host         | h                | host name
ip           | i                | ip address
port         | po               | bound transport port
version      | v                | es version
build        | b                | es build hash
jdk          | j                | jdk version
disk.avail   | d,disk,diskAvail | available disk space
heap.current | hc,heapCurrent   | used heap
heap.percent | hp,heapPercent   | used heap ratio
heap.max     | hm,heapMax       | max configured heap
ram.current  | rc,ramCurrent    | used machine memory
ram.percent  | rp,ramPercent    | used machine memory ratio
ram.max      | rm,ramMax        | total machine memory
...
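The help output has three pipe-separated parts per line: the column name, its aliases and a description. As a rough sketch, this is how you could load it into a lookup table. parse_cat_help is a hypothetical name, and the sample contains a few lines copied from the output above.

```python
def parse_cat_help(text):
    """Map each column name to its aliases and description."""
    columns = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # skip blank lines or anything that is not name | aliases | description
        name, aliases, description = parts
        columns[name] = {"aliases": aliases.split(","), "description": description}
    return columns


# A few lines copied from the /_cat/nodes?help output in this post
sample = """\
id | id,nodeId | unique node id
heap.percent | hp,heapPercent | used heap ratio
ram.percent | rp,ramPercent | used machine memory ratio
"""
help_info = parse_cat_help(sample)
print(help_info["ram.percent"]["aliases"])
```

Either the full column name or one of its aliases can then be passed to the “h” parameter.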
What kind of APIs are available?
- Allocation: how many shards are allocated to each node in the cluster and how much disk space they use
- Shards: information about the allocation of (specific) shards on each server in the cluster
- Master: information about the master server in the cluster
- Indices: information about (specific) indices in the cluster
- Segments: low-level information about the Lucene segments in the shards of an index
- Count: count documents in (specific) indices
- Recovery: information about shard recoveries, both ongoing and completed, for example when a shard is moved to a different node in the cluster
- Health: display the cluster health
- Pending tasks: cluster-level changes (such as creating an index or updating a mapping) that are queued but have not been executed yet
- Aliases: information about aliases given to specific indices
- Thread pool: thread pool statistics per node
- Plugins: a list of running plugins per node
- Fielddata: how much heap memory fielddata is using on each node, optionally for specific fields
Pick one and dig deeper?
The cat API documentation is pretty extensive, and I could just quote the docs line by line. That wouldn’t be too useful. Instead, I’ll pick one and explain why and how I use it.
The “health” API is the most important one to me. If the cluster is not healthy, searches will not return consistent data sets. Depending on the health status, the cluster is either:
- Green: it’s all good man. Saul Goodman 😉 The nodes are up, the shards for each index are loaded and the replicas have been recovered on separate nodes.
- Yellow: something is wrong. All primary shards are available, but not all replicas have been recovered. If a node goes down, there can be data loss.
- Red: some primary shards are missing. This means there is data missing right now. This is bad, but not necessarily disastrous, as some nodes might still be rebooting.
You can call the health API from your monitoring system and request only the status column:
curl "http://localhost:9200/_cat/health?h=status"
If the output is not “green”, engineers should be alerted. Very convenient!
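A monitoring check along those lines could look roughly like this in Python. Treat it as a sketch under a few assumptions: the function names are mine, the cluster is assumed to listen on http://localhost:9200 as in the curl examples, and I chose to treat an unreachable cluster as a reason to alert.

```python
from urllib.request import urlopen


def needs_alert(status_output):
    """Return True unless the health status is "green"."""
    return status_output.strip().lower() != "green"


def fetch_status(base="http://localhost:9200", timeout=5):
    """Fetch only the status column from the cat health API."""
    with urlopen(f"{base}/_cat/health?h=status", timeout=timeout) as response:
        return response.read().decode("utf-8")


def check_cluster(base="http://localhost:9200"):
    """Return True when engineers should be alerted."""
    try:
        return needs_alert(fetch_status(base))
    except OSError:
        # Connection errors and timeouts: no answer is bad news too (my assumption)
        return True


# Demonstration on literal outputs, so no running cluster is needed
for output in ("green\n", "yellow\n", "red\n"):
    print(output.strip(), needs_alert(output))
```

In a real setup, check_cluster() would be what your monitoring system calls on a schedule.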
Let’s look at some video footage
I recorded a short video where I feature a couple of cat APIs on a 3 node cluster. The cluster runs on my laptop.
What I’m doing in this video is showing random API calls that are focused on the cluster, the nodes in the cluster and the indices running on the cluster.
I’m creating an index called “myindex” with a type called “mytype”. At first the index is empty, then I’m adding a document, then another one. Using the API calls I’m checking the size of the index, the allocation of the shards in the cluster and the cluster state.
Have a look:
Why should you use the cat APIs?
Long story short: the cat APIs are the easiest way to manage an ElasticSearch cluster.
OK, you can’t really change anything using these APIs, but at least you get a very detailed view on the current status of “things”. And these things could vary.
Questions that could be answered are:
- Is the cluster doing OK?
- Are all primary shards loaded? What about the replicas?
- What’s the master node in the cluster?
- How much RAM is each node consuming?
- Which fields does index x have?
- How is index x scattered across the nodes of our cluster?
Try it yourself, you’ll love it!