At Indix, we started evaluating Mesos and Marathon about a year and a half ago for some internal Docker-based microservices. We’ve come a long way since then and now have close to 30 different production applications running on that infrastructure. As we grew from a small 5-node Mesos cluster to a large one with 70+ nodes, we realized there were some missing pieces in the puzzle. In this blog post, we’ll be talking about one of them: alerting.
Every application deployed on Marathon can have one or more health checks configured to identify whether a task is healthy. If a task is found unhealthy, Marathon automatically launches a replacement. All of this happens behind the scenes with zero user intervention. But a number of things can still go wrong: we could be running out of resources, a new version of the app could fail its health check, or there could be a sudden outage on the underlying Mesos slave infrastructure. Any of these could mean downtime or reduced performance for your application. Marathon doesn’t provide an alerting mechanism for detecting these situations, but it has a really awesome API from which we can derive this information.
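To give a feel for what "deriving this information" can look like, here is a minimal sketch in Python. It is not how marathon-alerts is implemented. It assumes the `/v2/apps` endpoint returns an `apps` list whose entries carry the `instances` and `tasksHealthy` counters, and the app id `/search-api` is a made-up example.

```python
import json
from urllib.request import urlopen


def fetch_apps(marathon_url):
    """Return the list of app definitions from Marathon's /v2/apps endpoint."""
    with urlopen(f"{marathon_url}/v2/apps") as resp:
        return json.load(resp)["apps"]


def healthy_fraction(app):
    """Fraction of requested instances that are healthy, or None if none are requested."""
    wanted = app.get("instances", 0)
    if wanted == 0:
        return None  # app is scaled to zero, so a ratio is not meaningful
    return app.get("tasksHealthy", 0) / wanted


# Hand-written payload standing in for one entry of a live /v2/apps response.
sample = {"id": "/search-api", "instances": 4, "tasksRunning": 3, "tasksHealthy": 3}
print(sample["id"], healthy_fraction(sample))
```

Against a real cluster you would loop over `fetch_apps("http://marathon.example:8080")` (a placeholder URL) and compare each fraction with a threshold.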
Since Marathon is a PaaS system, we expect multiple teams to share the same infrastructure for their deployments. This taught us to look at alerting at the app level rather than at the infrastructure level. Inspired by tools like kubernetes-alerts and consul-alerts, we built marathon-alerts.
Marathon-alerts helps you set up a set of checks for each app that’s deployed on a Marathon infrastructure, and push alerts to a notifier implementation – currently only Slack. It supports the following three checks:
It also supports the following levels for each check: Warning, Critical, Resolved and Pass.
As the names suggest, an app check is labelled “Critical” when it crosses its critical threshold, “Warning” when it crosses its warning threshold, and so on. Similarly, if a check that is already in the “Warning” or “Critical” state passes, it is labelled “Resolved.” Any passing checks after that are labelled simply “Pass”.
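The labelling rules above can be sketched as a small state function. This is only an illustration under our own assumptions: the function name, the threshold values, and the convention that lower values are worse (as with a fraction of healthy tasks) are ours, not taken from the marathon-alerts source.

```python
def next_level(previous, value, warn_threshold, critical_threshold):
    """Label a check result given the previous label; lower values are worse."""
    if value <= critical_threshold:
        return "Critical"
    if value <= warn_threshold:
        return "Warning"
    # The check passes: report "Resolved" once if it was previously failing.
    if previous in ("Warning", "Critical"):
        return "Resolved"
    return "Pass"


# Walk a series of readings through the state function.
level = "Pass"
history = []
for reading in [1.0, 0.7, 0.4, 1.0, 1.0]:
    level = next_level(level, reading, warn_threshold=0.8, critical_threshold=0.5)
    history.append(level)
print(history)
```

Keeping the previous label as an explicit argument is what lets "Resolved" fire exactly once before the check settles back to "Pass".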
Fig. 1 – Screenshot of a Critical check for one of our production apps triggered by marathon-alerts.
Alerts are useful only if they reach the people who can act on them. In marathon-alerts, apart from the channel the alert should go to, you can also specify who should be tagged in the alert so that they are notified, as in Fig. 1.
Any alerting system needs escalation features: for “Warning” checks it is fine to post to Slack, but for “Critical” alerts it’s best to page the current on-call person. At the time of writing, we don’t have PagerDuty integration, but we do have the ability to route different check levels to different notifiers.
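Routing by level can be pictured as a small lookup from level to a list of notifiers. The sketch below is an assumption-laden illustration, not the marathon-alerts code: the `Router` class and the two list-backed "notifiers" are hypothetical stand-ins for real Slack or pager clients.

```python
from collections import defaultdict


class Router:
    """Send each alert to every notifier registered for its level."""

    def __init__(self):
        self._routes = defaultdict(list)

    def register(self, level, notifier):
        self._routes[level].append(notifier)

    def dispatch(self, level, message):
        """Deliver the message and return how many notifiers received it."""
        notifiers = self._routes.get(level, [])
        for notify in notifiers:
            notify(message)
        return len(notifiers)


# Plain lists stand in for a chat channel and a paging service.
chat, pager = [], []
router = Router()
router.register("Warning", chat.append)
router.register("Critical", chat.append)
router.register("Critical", pager.append)

router.dispatch("Warning", "search-api: 3 of 4 tasks healthy")
router.dispatch("Critical", "search-api: 1 of 4 tasks healthy")
print(len(chat), len(pager))
```

Because notifiers are just callables, adding a new destination later means registering one more function rather than changing the routing logic.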
At Indix, we also host staging apps on the same Marathon infrastructure. While an app is under development, it tends to go down pretty often, which would cause the alerting system to spam us. For cases like these, you can disable alerts for an app and turn them back on when needed.
All of this configuration can be done at the app level by specifying it as labels in the app specification that’s sent as the payload to Marathon. You can find all the properties that can be overridden in the README.
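As a rough illustration of label-driven overrides, the sketch below merges an app's labels over a set of defaults. The label keys (`alerts.enabled`, `alerts.warn_threshold`, `alerts.critical_threshold`) and the default values are hypothetical placeholders chosen for this example; the README lists the real property names.

```python
# Hypothetical defaults and label keys, for illustration only.
DEFAULTS = {"enabled": True, "warn_threshold": 0.8, "critical_threshold": 0.5}


def settings_for(app):
    """Merge per-app label overrides over the defaults."""
    labels = app.get("labels", {})  # Marathon labels are string-to-string pairs
    settings = dict(DEFAULTS)
    if "alerts.enabled" in labels:
        settings["enabled"] = labels["alerts.enabled"].strip().lower() == "true"
    for key in ("warn_threshold", "critical_threshold"):
        label = f"alerts.{key}"
        if label in labels:
            settings[key] = float(labels[label])
    return settings


# A staging app that silences alerts, and a production app with a stricter threshold.
staging = {"id": "/staging/search-api", "labels": {"alerts.enabled": "false"}}
production = {"id": "/search-api", "labels": {"alerts.critical_threshold": "0.75"}}
print(settings_for(staging))
print(settings_for(production))
```

Since label values always arrive as strings, the parsing step is where booleans and numbers get converted, and an app with no labels simply inherits the defaults.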
We’re happy to open source marathon-alerts under the Apache 2.0 license. Fork away! You can download the latest binary from Releases, or use marathonctl to deploy the latest release to your Marathon cluster via the marathon.json.conf available in the project root.