Monitoring distributed backends at SVTi, a technical overview

Dickerson’s Hierarchy of Reliability – monitoring as the foundation

At SVTi we are strong believers in the microservice trend, where each team builds and maintains fine-grained, reusable services. Our journey started a few years ago when we decided to split our monolithic architecture into a distributed one.

One of the great challenges engineering organizations face with distributed systems is that your critical applications are spread out, which makes it very difficult to get a holistic view of the entire system. Without the proper tooling and strategies you’re in the dark when it comes to failure detection, alerting and root cause analysis.

This blog post aims to give an insight into the monitoring tools used at SVTi, and in particular by the team “Videosamarbeten”, which supplies the metadata for SVT’s online video services such as SVT Play.

Our principles when choosing tooling are:

  • Open source first – both from a cost perspective and to foster innovation
  • Configure instead of build – our engineering division is small in comparison to the big players, so we prefer to reuse existing solutions rather than building our own.
  • Usability – the tooling must be simple and extensible

 

Resource monitoring

The most generic monitoring comes from our internal cloud infrastructure, which is based on Docker and Marathon (https://mesosphere.github.io/marathon/). All applications deployed in our internal cloud automatically get their own basic dashboard with metrics such as memory/CPU usage, response times and log levels. The dashboards are built with a Grafana UI (http://grafana.org/) on top of Graphite as the storage backend.

These dashboards are great for basic-level monitoring, verifying that our applications are up and running without consuming too many resources.

 

Application logs

When deploying lots and lots of applications you’ll want a central logging platform where all your application logs are aggregated and easily searchable. The days of tediously logging in to a server to download log files are gone, as that takes a lot of time from the teams and does not scale. Being able to supply your development organization with one frontend for all application logs is critical, and a game changer when it comes to quickly finding the root cause of an error.

Our logging infrastructure is built on a custom variant of the now classic ELK stack (https://www.elastic.co/products), meaning ElasticSearch as the searchable database and Kibana as the UI. The L typically denotes Logstash as the log forwarder, but our infra team has instead used a setup consisting of Heka coupled with a Kafka messaging bus. All applications deployed to our internal cloud automatically get their syslogs forwarded to the logging datastore, which means that all teams at SVTi get centralized application logging for free, without any manual setup.

Coming from a background where we had to resort to manually searching log files, this is something our team wouldn’t be able to live without again.
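
Since everything ends up in ElasticSearch, the logs are not only browsable through Kibana but also queryable programmatically. As a rough sketch (the cluster hostname, index pattern and field names below are made-up examples rather than our actual setup), fetching the latest error entries for a single application with the ElasticSearch low-level REST client could look something like this:

    import org.apache.http.HttpHost;
    import org.apache.http.util.EntityUtils;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class LatestErrors {
        public static void main(String[] args) throws Exception {
            // Connect to the central logging cluster (hostname is a placeholder).
            RestClient client = RestClient.builder(
                    new HttpHost("elasticsearch.example.internal", 9200, "http")).build();

            // Search the application log indices for the most recent ERROR entries
            // from one service; the index pattern and field names are assumptions.
            Request search = new Request("GET", "/applogs-*/_search");
            search.setJsonEntity(
                "{\"size\": 20,"
                + " \"sort\": [{\"@timestamp\": \"desc\"}],"
                + " \"query\": {\"bool\": {\"filter\": ["
                + "   {\"term\": {\"level\": \"ERROR\"}},"
                + "   {\"term\": {\"application\": \"metadata-api\"}}"
                + " ]}}}");

            Response response = client.performRequest(search);
            System.out.println(EntityUtils.toString(response.getEntity()));
            client.close();
        }
    }

The same kind of programmatic access to the log data is what we use further down to build the Vizceral graph from our access logs.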

Our logging infrastructure also contains all the access logs from our internal load balancers, which makes it possible to generate lots of interesting charts, metrics and dashboards (more about that later).

 

Application metrics

Moving into more fine-grained application metrics, we again utilize the powerful combination of Grafana and Graphite. This is where we store time series and business metrics that are specific to each application. Each team is responsible for setting up the dashboards that are of interest for their particular applications and services.
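
Getting a custom metric into Graphite takes very little code: Carbon, Graphite’s ingestion daemon, accepts a plaintext protocol of one line per data point, consisting of a metric path, a value and a Unix timestamp. As a minimal sketch (the hostname and metric name are made-up examples), pushing a business metric straight over a socket looks like this:

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    public class GraphitePush {
        public static void main(String[] args) throws Exception {
            // Carbon's plaintext listener runs on port 2003 by default.
            try (Socket socket = new Socket("graphite.example.internal", 2003);
                 Writer out = new OutputStreamWriter(socket.getOutputStream(), "UTF-8")) {

                long epochSeconds = System.currentTimeMillis() / 1000;
                // One data point per line: "<metric.path> <value> <timestamp>\n"
                out.write("metadata-api.videos.published 42 " + epochSeconds + "\n");
                out.flush();
            }
        }
    }

In practice an application would normally use a metrics library that batches and reports on a schedule, as in the Hystrix example below, but the wire format really is this simple.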

As an example, a lot of our teams use Hystrix (https://github.com/Netflix/Hystrix), a Java reliability library from Netflix that shields APIs from cascading failures. A nice side benefit of using Hystrix is that all external calls get instrumented out of the box. All you have to do is forward the metrics to the central Graphite storage, and you get full surveillance of the latencies, timeouts etc. for all your external invocations.
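
The forwarding itself is only a small amount of wiring. As a hedged sketch, one way to do it is to combine Hystrix’s Codahale metrics publisher with the Dropwizard Metrics Graphite reporter, so that every HystrixCommand reports into a MetricRegistry that is shipped to Graphite once a minute (the command, group key, prefix and hostname below are made-up examples):

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.contrib.codahalemetricspublisher.HystrixCodaHaleMetricsPublisher;
    import com.netflix.hystrix.strategy.HystrixPlugins;

    public class HystrixGraphiteSetup {

        // A typical command wrapping a call to an external service.
        static class FetchMetadata extends HystrixCommand<String> {
            FetchMetadata() {
                super(HystrixCommandGroupKey.Factory.asKey("MetadataApi"));
            }

            @Override
            protected String run() {
                // Call the external service here; latency, timeouts and circuit
                // breaker state are instrumented automatically by Hystrix.
                return "metadata";
            }

            @Override
            protected String getFallback() {
                return "fallback-metadata";
            }
        }

        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();

            // Let Hystrix publish all command metrics into the registry.
            HystrixPlugins.getInstance()
                    .registerMetricsPublisher(new HystrixCodaHaleMetricsPublisher(registry));

            // Forward the registry to the central Graphite storage every minute.
            Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.internal", 2003));
            GraphiteReporter.forRegistry(registry)
                    .prefixedWith("metadata-api")
                    .build(graphite)
                    .start(1, TimeUnit.MINUTES);

            System.out.println(new FetchMetadata().execute());
        }
    }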

This setup has saved us countless times and helped us identify and solve performance problems both in our networking infrastructure as well as our applications.

 

Service topology and high level API overview

All of the aforementioned monitoring dashboards are great solutions for visualizing and monitoring single, specific applications, but so far none of them deals with the challenge of visualizing the interconnections between your APIs and how they behave together.

We needed a dashboard where, with one quick glance, we could answer questions such as:

  • Which services use each other?
  • Where is our critical path with the most traffic?
  • Do we have any APIs that are showing signs of failure?

As is often the case, it turned out that we were not alone in our requirements. Netflix recently open-sourced their internal dashboard Vizceral (https://github.com/Netflix/vizceral), a graph-based UI that displays on a single screen how all your services call each other in real time.

We generate the graph of interconnected nodes from the access logs that are already available in ElasticSearch (see application logs above). All that is needed is a custom query against the database and a transformation to the format required by Vizceral, and the entire graph of our real-time API communication unfolds.
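
The transformation itself is straightforward. The sketch below assumes the aggregated request and error counts per caller/callee pair have already been pulled out of the access logs (for example with a terms aggregation in ElasticSearch) and only shows how they map onto the node-and-connection structure that Vizceral consumes; the service names and counts are invented, and the exact schema is documented in the Vizceral repository:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class VizceralTraffic {

        // One aggregated access-log bucket: calls from one service to another.
        static class Edge {
            final String source, target;
            final long requests, errors;
            Edge(String source, String target, long requests, long errors) {
                this.source = source; this.target = target;
                this.requests = requests; this.errors = errors;
            }
        }

        public static void main(String[] args) throws Exception {
            // Pretend these counts came from an aggregation over the access logs.
            List<Edge> edges = Arrays.asList(
                    new Edge("svt-play", "metadata-api", 1200, 3),
                    new Edge("metadata-api", "rights-api", 800, 0));

            // Nodes: every distinct service name seen in the access logs.
            Set<String> services = new LinkedHashSet<>();
            for (Edge e : edges) { services.add(e.source); services.add(e.target); }
            List<Map<String, Object>> nodes = new ArrayList<>();
            for (String name : services) {
                Map<String, Object> node = new LinkedHashMap<>();
                node.put("name", name);
                nodes.add(node);
            }

            // Connections: healthy traffic as "normal", failed requests as "danger".
            List<Map<String, Object>> connections = new ArrayList<>();
            for (Edge e : edges) {
                Map<String, Object> metrics = new LinkedHashMap<>();
                metrics.put("normal", e.requests - e.errors);
                metrics.put("danger", e.errors);
                Map<String, Object> connection = new LinkedHashMap<>();
                connection.put("source", e.source);
                connection.put("target", e.target);
                connection.put("metrics", metrics);
                connections.add(connection);
            }

            Map<String, Object> traffic = new LinkedHashMap<>();
            traffic.put("renderer", "region");
            traffic.put("name", "svti");
            traffic.put("nodes", nodes);
            traffic.put("connections", connections);

            // The resulting JSON document is what the Vizceral UI renders.
            System.out.println(new ObjectMapper()
                    .writerWithDefaultPrettyPrinter().writeValueAsString(traffic));
        }
    }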

This dashboard has become an integral piece of our monitoring and is a good first step to notify us about new users of our APIs as well as errors in production.

 

Call traces – distributed performance tuning

Performance tuning an application that relies on other services can quickly become difficult, as a change in one external system can cause your application’s performance to degrade without any change in your own code. It is therefore vital to understand what happens in all participating applications when your app fetches data from a mesh of connected backend services.

Our latest tool for dealing with this complexity is Zipkin (http://zipkin.io/), a distributed tracing solution built and deployed internally at Twitter. The idea is that all your services propagate correlation IDs between each other and report tracing information to a central server for later querying.

This makes it possible to see the whole tree of calls across all participating services and to find out whether you have specific bottlenecks further down in your call chain.
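
To give a feel for what the instrumentation amounts to, here is a minimal sketch using Brave, the Java tracer library for Zipkin (the service name, Zipkin URL and span name are made-up examples). In a real setup you would typically rely on the ready-made Brave integrations for your HTTP server and client, so that the correlation IDs are propagated automatically in request headers:

    import brave.Span;
    import brave.Tracer;
    import brave.Tracing;
    import zipkin2.reporter.AsyncReporter;
    import zipkin2.reporter.okhttp3.OkHttpSender;

    public class TracingExample {
        public static void main(String[] args) throws Exception {
            // Report finished spans to the central Zipkin server.
            OkHttpSender sender =
                    OkHttpSender.create("http://zipkin.example.internal:9411/api/v2/spans");
            AsyncReporter<zipkin2.Span> reporter = AsyncReporter.create(sender);

            Tracing tracing = Tracing.newBuilder()
                    .localServiceName("metadata-api")
                    .spanReporter(reporter)
                    .build();
            Tracer tracer = tracing.tracer();

            // One span per unit of work; downstream services join the same trace
            // ID, which is what ties the whole call tree together in the UI.
            Span span = tracer.newTrace().name("fetch-episode-metadata").start();
            try {
                // ... call downstream services here, passing the trace context
                // along in HTTP headers (B3 propagation) so their spans join in.
            } finally {
                span.finish();
            }

            tracing.close();
            reporter.close();
            sender.close();
        }
    }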

 

Zipkin is currently used primarily as a development tool to find out where the time is spent during a service call. We’ve just started experimenting with it, but have already identified multiple areas where we can improve our backend services to lower our overall latencies.

 

Conclusion

When making the decision to move towards a microservice-based architecture, you need to understand that it brings a lot of complexity with it. Without a proper monitoring setup it becomes impossible to know how your system behaves, and you’re basically left in the dark. You cannot improve what you cannot see, so making sure that your teams have the proper tooling to troubleshoot your infrastructure is one of the first steps you need to take before you can realize the goals of a distributed architecture.

At SVTi we’ve invested countless engineering hours in our internal cloud setup and in the tools for delivering true DevOps capabilities to our engineers. This makes our product teams faster and ultimately leads to better products, as less time is spent on infrastructure concerns.

The key to fostering innovation is to make your teams autonomous and allow them to operate independently. This requires excellent tooling, and our monitoring is a small but important part of that puzzle.

On a personal level I’ve yet to experience a more sophisticated infrastructure setup than the one at SVTi, and it’s a developer’s dream to work in this kind of flexible hosting environment, with the culture and technical know-how that make everything possible.

If you are interested in site reliability, scaling and monitoring, SVTi is a great place to work and we’re currently hiring. Just drop us a note and we’ll keep in touch!

 


Links

Grafana: http://grafana.org/

ELK (ElasticSearch, Logstash): https://www.elastic.co/products

Heka (log forwarding): https://hekad.readthedocs.io

Hystrix (fault tolerance library): https://github.com/Netflix/Hystrix

Vizceral: https://github.com/Netflix/vizceral

Zipkin: http://zipkin.io/