IT Monitoring in the Era of Containers: Tapping into eBPF Observability


Article by Daniella Pontes and Luca Deri

Containers are a game-changer for everyone. As an abstraction between the infrastructure and application layers, containers are a team sport: they concern IT system engineers, Ops, NetOps and DevOps alike. However, professionals in different roles approach container monitoring from different perspectives, all of them valid, important, and necessary to build a complete monitoring strategy.

With the separation of applications into containerized microservices running in clusters, each container packages the resources needed to execute its part of the work. However, unless its workload, its counterparts, and the network stitching the pieces together are equally performant, the sum of the parts will fall short of delivering an application environment that meets performance and service-level objectives.

New monitoring metrics for containers and networks

Container monitoring from an infrastructure perspective has come a long way with systems like Kubernetes. The emphasis has shifted from the health of individual containers to cluster health, because Kubernetes simplified orchestration by providing a logical layer that commoditizes the infrastructure on which microservices run, while automating deployment and optimizing resource assignment. Infrastructure costs went down and agility went up: a great value proposition. However, this new approach brings both benefits and drawbacks.

On the one hand, a misbehaving or underperforming microservice, even when running on a cluster that complies with its declared desired state, can still have a destabilizing effect on the network and on overall application performance. On the other hand, observing from within the containers how processes are handling the workload, and which services are generating it, can yield substantial application performance improvements. That is easier said than done, given the disposable nature of containers and the short window of time available to make these observations.

Troubled or struggling microservices can move around from container to container, from host to host, and from interface to interface. Finding the epicenter of an underperforming application in a containerized microservice environment by looking only from the outside would require tracking down a moving, intermittent target: a diagnosis team's worst nightmare.

IT managers must keep a sharp eye on metrics from the orchestration system, containers and nodes, and NetOps must keep network latency, service availability, responsiveness and bandwidth consumption under control. But unless the health of the microservices running on this containerized infrastructure, and of their vast, meshed inter-container network, is also watched closely, containerized application monitoring will suffer from shortsightedness.

It is clear that monitoring only from the outside will not fit the bill, especially for full-stack teams, for whom every layer and moving part needs to work in harmony. With that in mind, monitoring the performance of microservices and how they impact inter-container flows, rather than merely the availability of resources (handled by killing pods and containers and spinning up new ones), will show where trouble is brewing. For that, it is necessary to have visibility into what is going on from the inside.

Measuring resource and service metrics

Baseline resource monitoring metrics such as CPU, memory and disk space are well covered in Kubernetes on three main fronts: containers, nodes and the master node. Additional application-specific metrics and event monitoring can be done by deploying sidecars. However, symptoms seen in resource usage, saturation and failures can be better understood when correlated with information collected from a microservice performance perspective, which includes:

  • Latency: time to service a request
  • Traffic generation: communication with other services
  • Error: how often errors occur
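
To make these signals concrete, here is a minimal sketch (not tied to any specific product) that renders one interval's worth of latency, traffic and error observations as InfluxDB line protocol, tagged with hypothetical pod and container names so they can later be correlated with infrastructure metrics.

```python
import time

def to_line_protocol(measurement, tags, fields, ts_ns):
    """Render one InfluxDB line-protocol point: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Hypothetical service-level observations for one scrape interval.
observation = {
    "pod": "checkout-7d9f",      # illustrative pod name
    "container": "payment-svc",  # illustrative container name
    "latency_ms": 42.7,          # latency: time to service a request
    "requests": 1280,            # traffic: calls to/from other services
    "errors": 3,                 # errors: how often errors occurred
}

point = to_line_protocol(
    "service_health",
    {"pod": observation["pod"], "container": observation["container"]},
    {
        "latency_ms": observation["latency_ms"],
        "requests": f'{observation["requests"]}i',  # "i" suffix marks integer fields
        "errors": f'{observation["errors"]}i',
    },
    time.time_ns(),
)
print(point)
```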

Container activity impacts infrastructure, network utilization and application performance; therefore, it should not be overlooked. Nothing should be taken for granted, because speed and complexity don't leave much room for guessing. A thoughtful monitoring plan for containerized application environments must include ways to verify what is going on inside and between containers, identifying users, processes, pods and containers exhibiting faulty, abusive or suspicious behavior.

Approaches such as in-and-out packet inspection, and grouping packets into flows based on IP, port and protocol, provide a viable way to monitor bandwidth utilization and to detect illegitimate, malicious and malformed traffic. However, that was before container proliferation. For containerized applications, the packet paradigm is no longer enough to provide the necessary visibility: packets carry no context (i.e. application, user, process, pod or container), so monitoring them will not deliver the accountability being sought. Sometimes services interact inside a system rather than over a network, so there is no wire on which packets could be captured. To make matters worse, the rate at which networks and applications run today demands that network packet analyzers crank up their processing speed, putting a tremendous load on the CPU.

One could also argue that packets are used simply because networks are built on them, but applications do not see packets. They operate in terms of data sent and received, action latency and code responses. So, instead of using packet-level data to infer application information, it makes more sense to read system metrics directly and extract the desired information without the need for “translations.” As a nice bonus, this system-introspection approach places a lighter computational load on the system.
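
To illustrate the flow idea, the sketch below (with invented packet records) groups packets into flows keyed by the classic 5-tuple of source IP, destination IP, source port, destination port and protocol. Note that nothing in that key identifies the application, user, process, pod or container behind the traffic, which is precisely the missing context described above.

```python
from collections import defaultdict

# Invented packet records; in practice these would come from a capture library.
packets = [
    {"src": "10.0.1.5", "dst": "10.0.2.9", "sport": 51332, "dport": 443, "proto": "TCP", "bytes": 1500},
    {"src": "10.0.1.5", "dst": "10.0.2.9", "sport": 51332, "dport": 443, "proto": "TCP", "bytes": 900},
    {"src": "10.0.3.7", "dst": "10.0.2.9", "sport": 40211, "dport": 443, "proto": "TCP", "bytes": 1200},
]

# Aggregate packets into flows keyed by the 5-tuple.
flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
for p in packets:
    key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
    flows[key]["packets"] += 1
    flows[key]["bytes"] += p["bytes"]

for key, stats in flows.items():
    print(key, stats)  # no user/process/pod/container context is available here
```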

One way to get system information about the workload running in a container is to use eBPF to collect kernel events. This technology started as the Berkeley Packet Filter (BPF), a Linux kernel mechanism for in-kernel packet filtering created by Steven McCanne and Van Jacobson at Lawrence Berkeley National Laboratory. Its extended version, eBPF, can process many types of kernel events and execute actions beyond packet filtering. Using eBPF, you can correlate kernel events with network flow data to identify which containers are participating in a communication session, which in turn points to the users, processes and containers exhibiting abnormal behavior.
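
As a minimal illustration of the idea (a sketch, not the ntop or InfluxData implementation), the following uses the BCC toolkit to attach a kprobe to the kernel's tcp_v4_connect() function and count outbound connection attempts per process ID; user-space tooling can then map those PIDs to containers and users. It assumes a Linux host with BCC installed and root privileges.

```python
from bcc import BPF
import time

# eBPF program (C): count tcp_v4_connect() calls per PID in a kernel hash map.
bpf_text = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(connect_count, u32, u64);

int trace_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits = process ID
    u64 *count = connect_count.lookup(&pid);
    if (count) {
        (*count)++;
    } else {
        u64 one = 1;
        connect_count.update(&pid, &one);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")

print("Tracing tcp_v4_connect() for 10 seconds...")
time.sleep(10)

# Read the kernel map from user space; PIDs can be mapped to containers/users here.
for pid, count in b["connect_count"].items():
    print(f"pid={pid.value} connection_attempts={count.value}")
```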

InfluxData and its partner ntop are taking the next step in monitoring containerized application environments with the use of the extended Berkeley Packet Filter (eBPF). This work will shed light on activities within the container to guide IT to find out where things are broken or breaking, and who is causing the performance issue.

Correlate network metrics with eBPF event data

System introspection via eBPF plays the role of adding context to infrastructure and network monitoring. This allows root causes to be identified quickly, and it can act as the only viable source of interaction information in container deployment scenarios, for two main reasons:

  • With current network speeds, any meaningful sampling rate generates an insurmountable volume of packets to process for inspection, consuming precious CPU cycles. If deployed in a hosted cloud environment, those cycles will slowly but surely eat up the budget.
  • Traffic between microservices may never reach a monitored interface, leaving no opportunity to capture it.

Adding eBPF monitoring to packet inspection binds system events to network traffic, and in doing so provides the contextual multi-dimensionality necessary to reduce the entropy of monitored data and turn alerts into actionable information. Narrowing activity down to who and what is responsible for the behavior being monitored is something neither approach can do in isolation, let alone do efficiently and effectively.
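
A toy illustration of that binding, using invented records: flow records keyed by the 5-tuple are joined with eBPF-derived socket events that carry PID, container and user, so each flow gains the who and what context that neither source provides on its own.

```python
# Invented examples of the two data sources being correlated.
flow_records = [
    {"key": ("10.0.1.5", "10.0.2.9", 51332, 443, "TCP"), "bytes": 2400, "latency_ms": 12.5},
]

# eBPF-derived socket events: same 5-tuple plus process/container/user context.
ebpf_socket_events = [
    {"key": ("10.0.1.5", "10.0.2.9", 51332, 443, "TCP"),
     "pid": 4711, "container": "payment-svc", "user": "app"},
]

# Index the eBPF events by 5-tuple, then enrich each flow with its context.
context_by_key = {e["key"]: e for e in ebpf_socket_events}

for flow in flow_records:
    ctx = context_by_key.get(flow["key"], {})
    enriched = {**flow, **{k: ctx.get(k) for k in ("pid", "container", "user")}}
    print(enriched)  # the flow now carries who/what context alongside network metrics
```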

In this new accelerated and complex world of ephemeral container infrastructure and fragmented applications, organizations must become data-driven in order to cope with ever-increasing performance expectations. Monitoring has to go beyond availability, consumption and performance of isolated resources. It must offer both binocular and panoramic views to give a clear understanding of what is going on and what could be brewing undetected. Monitoring must shed light on trends while providing real-time insights, and ultimately aim for anticipation and automation.

In order to pursue such goals, IT should look into inter-container traffic and intra-container events and correlate this data with other metrics and information gathered, thereby identifying where the trouble starts: container, process, and user.

One place for metrics, events and network traffic

eBPF opens one more channel of observability into the inner workings of the system, which is needed to link anomalous behavior to performance variations in distributed containerized environments. But in order to connect the dots, all of this information (metrics, events, and intra- and inter-container activity) needs to land in one platform capable of fully utilizing the data and cross-analyzing it.

All data should come to the same place, reducing the burden of setup, ramp-up, management and gathering pieces of information from multiple siloed sources. The ntop eBPF solution uses InfluxDB as its time series storage engine and is therefore ready to include all types of monitoring data: metrics, kernel events, logs, traces and business KPIs. This converged set of metrics and events is useful for alerting and prediction modeling. Container-era complexity demands one integrated data source, one engine for analyzing multiple data types, and one UI for all visualizations. Bringing it all together compounds insights and perspectives, leading to a monitoring solution that enables more intelligent alerts and actionable information.
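
As a closing sketch (an assumption for illustration, not ntop's actual integration), an eBPF-derived, context-rich observation could be written to InfluxDB with the 1.x Python client roughly like this; the database name, measurement and tag values are hypothetical.

```python
from influxdb import InfluxDBClient  # InfluxDB 1.x Python client

# Assumes a local InfluxDB 1.x instance and a pre-created "telemetry" database.
client = InfluxDBClient(host="localhost", port=8086, database="telemetry")

# One eBPF-derived observation: network metrics plus who/what context as tags.
points = [
    {
        "measurement": "container_flows",   # illustrative measurement name
        "tags": {
            "container": "payment-svc",     # illustrative tag values
            "pod": "checkout-7d9f",
            "process": "gunicorn",
            "user": "app",
        },
        "fields": {
            "bytes_sent": 2400,
            "latency_ms": 12.5,
            "errors": 0,
        },
    }
]

client.write_points(points)
```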