Rolf, 28/03/2022 (5 min read)

Kotlin Health Monitor

No matter how impressive the application, it needs to report its health so we can make sure it stays available to its users. It is quite common to provide a /health endpoint in your application which returns a JSON response with a small report on the application's health. Of course, there are many frameworks and libraries, like Spring and Dropwizard, which can provide such an endpoint and also monitor parts of your application, but we were looking for something more lightweight.

A good health endpoint provides some key figures about the application that can be plotted in a graph. This enables system administrators to set limits on them, or to monitor dangerous trends like ever-increasing memory usage or a steep drop in webshop sales. Monitoring these requires different tactics: checking memory usage when no one asks for it is an expensive waste of time, and counting all items in a table every time someone asks for it generates needless load on the database.

Separation of concerns

Measuring system performance or uptime can roughly be divided into counting and checking. The number of sales in a webshop can be counted as they happen, and the counter can be kept in memory.

For things we can't count, like memory usage or database connectivity, it gets a bit trickier. The code has to connect to the database, or check memory usage, each time the information is requested.
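
To make the difference concrete, here is a minimal sketch of both styles. It is not taken from our demo project: SalesCounter, recordSale and usedMemoryBytes are hypothetical names chosen for this illustration.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Counting: bump an in-memory counter at the moment the event happens.
// SalesCounter is a hypothetical name, used only for this illustration.
object SalesCounter {
    private val sales = AtomicLong(0)

    // Called by the (hypothetical) checkout code for every completed sale.
    fun recordSale(): Long = sales.incrementAndGet()

    // Reading the counter is cheap: no database query is needed.
    fun current(): Long = sales.get()
}

// Checking: the value is only computed at the moment somebody asks for it.
val usedMemoryBytes: () -> Long = {
    val runtime = Runtime.getRuntime()
    runtime.totalMemory() - runtime.freeMemory()
}
```

The counter is updated by the code that makes the sale, while the memory check does nothing until it is invoked.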

With both ways of measuring, the health endpoint can get quite messy. It needs access to the inner workings of every part of your application, and there is no longer a good separation of concerns:

[Figure: the health endpoint accessing the parts of the application directly]

To separate the responsibilities, we introduced a simple HealthService class which is a singleton and can be called by different parts of the system to report their status or register callbacks:

[Figure: the parts of the application reporting indirectly, through the HealthService]

The good thing is that this not only separates concerns, it also makes it easy to register more systems and values with the HealthService. There are a few simple but effective tricks in this HealthService which work quite nicely for us, and we thought we'd share them with you.

So how does it work?

If you just want to dive in and start using it, go check out our Health Service Demo project on GitHub. Everything discussed below can be seen in action in HealthServiceTest.kt. For the people who stuck around for the tour, here we go:

When looking at HealthService.kt, we notice that the Kotlin keyword object is used to make it a singleton:

object HealthService {

Because of this, we are sure that there is only one HealthService living in our application, and we can access it by simply referencing it. So if we want to report the number of sales to the HealthService, we can simply do:

HealthService.updateItem("sales", 123)

For monitoring database connectivity or external systems, however, it would be nice if these checks were only done on request. To make that possible, you can tell the HealthService to call a function at the moment the information is needed. To monitor our database connectivity we could write a function in our repository class that looks somewhat like this:

fun canConnectToDatabase(): Boolean {
    return try {
        /* do checks here */
        true
    } catch (e: Exception) {
        // A check that throws counts as "cannot connect".
        false
    }
}

Now that we have a function that we can call, we can register that with the HealthService like so:

HealthService.registerCallback("database") {
    canConnectToDatabase()
}

After this, every time the HealthService is asked to getCurrentHealth(), it will call your callback function to see if the database can still be reached.
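
For readers who want a feel for the mechanics without opening the repository, the following is a minimal sketch of how such a singleton could be put together. It is not the code from HealthService.kt: SketchHealthService is a hypothetical stand-in, only the three function names are taken from the calls shown above, and reporting a failing callback as an error string is an assumption of this sketch.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the real HealthService, for illustration only.
object SketchHealthService {
    // Values pushed in by other parts of the application ("counting").
    private val items = ConcurrentHashMap<String, Any>()

    // Functions that are only invoked when the health is requested ("checking").
    private val callbacks = ConcurrentHashMap<String, () -> Any>()

    fun updateItem(name: String, value: Any) {
        items[name] = value
    }

    fun registerCallback(name: String, callback: () -> Any) {
        callbacks[name] = callback
    }

    // Builds the report: stored items first, then the result of every callback.
    fun getCurrentHealth(): Map<String, Any> {
        val report = LinkedHashMap<String, Any>(items)
        for ((name, callback) in callbacks) {
            // Assumption of this sketch: a failing check is reported as an
            // error string instead of breaking the whole report.
            report[name] = try {
                callback()
            } catch (e: Exception) {
                "error: ${e.message}"
            }
        }
        return report
    }
}
```

Because the callbacks are stored rather than executed at registration time, registering a check costs nothing until somebody actually asks for the health report.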

Exposing a health endpoint

Now that we have a simple class which is able to keep track of the health of our system, we need to expose it to the outside world in a meaningful way. In HealthDemo.kt we've created a small Ktor demo application where the HealthService is exposed as /health by just a few lines of code:

routing {
    get("/health") {
        call.respond(
            HttpStatusCode.OK, 
            HealthService.getCurrentHealth()
        )
    }
}

At the top of the DemoApplication class you can also spot a bit of code to make sure dates and times are rendered in a meaningful way. This is out of scope of this article, but you can have a look at ZonedDateTimeModule.kt to see how this is done.

That's it, we have a working /health endpoint! Let's see what it returns:

{
    "applicationStartedAt":"2021-03-15T09:30:17.222565+01:00[Europe/Amsterdam]",
    "healthTimestamp":"2021-03-15T11:38:37.045973+01:00[Europe/Amsterdam]",
    "sales":123,
    "database":true
}

Rendering this on a webpage or processing it in your favorite server monitoring software should be a breeze from this point on.

Preventing denial of service

Exposing a health endpoint to the outside world is of course a risky thing to do when calling it can cause system load. That is why we have to take some security measures when adding this endpoint to the application.

An easy and effective protection measure is to block access to the health endpoint from the outside world. If your application is hosted on AWS, this is relatively easy to do with the Web Application Firewall.

Even when the endpoint is only accessible from whitelisted locations, we still need to protect it from request floods. We've chosen a simple solution for that: we cache the health endpoint results for about a second. This means that if the HealthService is called multiple times per second, it will only call its callbacks once per second. This is usually accurate and responsive enough for any system administrator, and prevents the system from being brought down by an endless loop in a buggy curl script or a really motivated system admin constantly reloading the page.
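
The caching idea can be sketched in a few lines. This is not the code from our repository: CachedValue and its parameters are hypothetical, and the clock is passed in as a function purely so the behaviour can be tested without waiting.

```kotlin
// Hypothetical helper: caches the result of an expensive computation for a fixed time.
class CachedValue<T>(
    private val maxAgeMillis: Long = 1_000,
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val compute: () -> T
) {
    // Timestamp of the last computation together with its result.
    private var entry: Pair<Long, T>? = null

    // Synchronized, so a flood of concurrent requests triggers at most
    // one recomputation per cache period.
    @Synchronized
    fun get(): T {
        val now = clock()
        val current = entry
        if (current != null && now - current.first < maxAgeMillis) {
            return current.second // still fresh: skip the expensive work
        }
        val fresh = compute()
        entry = now to fresh
        return fresh
    }
}
```

Wrapping the part that runs the callbacks in something like this means a thousand requests in the same second still result in a single database check.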

Why not Dropwizard?

As mentioned at the beginning of this article, there are quite a few nice monitoring libraries out there. At the time of writing our health monitor, it seemed that requests to the /health endpoint in a Dropwizard application resulted in direct system calls, potentially opening the door to denial-of-service attacks. This has been fixed in the latest versions of Dropwizard, so be sure to check it out.

What if the system [insert favourite doom scenario]?

To be able to report its health, the system needs to respond. At some point everything breaks, and health monitoring code suffers from that too. There is no point in spending a lot of time making your health monitoring code more robust than the rest of your system. A big part of the responsibility for monitoring the health of an application lies with the caller of the health endpoint. When monitoring a health endpoint, it is a good idea to monitor response times and report any anomaly in response time or response contents.
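
On the calling side, such a check could look roughly like the sketch below. Everything in it is an assumption made for illustration: ProbeResult and probe are hypothetical names, fetch stands in for whatever performs the actual HTTP request, and the content check is deliberately naive.

```kotlin
// Outcome of one probe of the health endpoint, as seen by the monitoring side.
data class ProbeResult(val healthy: Boolean, val elapsedMillis: Long, val reason: String)

// `fetch` stands in for whatever performs the HTTP request and returns the response body.
fun probe(maxMillis: Long, expectedKeys: List<String>, fetch: () -> String): ProbeResult {
    val start = System.nanoTime()
    val body = try {
        fetch()
    } catch (e: Exception) {
        // No response at all is the clearest anomaly.
        val elapsed = (System.nanoTime() - start) / 1_000_000
        return ProbeResult(false, elapsed, "no response: ${e.message}")
    }
    val elapsed = (System.nanoTime() - start) / 1_000_000

    // Naive content check: every expected key must appear in the body.
    val missing = expectedKeys.filterNot { body.contains("\"$it\"") }
    return when {
        elapsed > maxMillis -> ProbeResult(false, elapsed, "slow response")
        missing.isNotEmpty() -> ProbeResult(false, elapsed, "missing keys: $missing")
        else -> ProbeResult(true, elapsed, "ok")
    }
}
```

A monitoring job could run this every few seconds and raise an alert as soon as a result is not healthy, which also covers the case where the application is too broken to answer at all.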

Available for free!

As mentioned before, all this is available at our Health Service Demo project on GitHub. The code is MIT licensed, so please go and use it to your advantage.

Rolf