JVM Profiling in Kubernetes with Java Flight Recorder

Abhinav Rohatgi · OLX Engineering · Apr 15, 2021


At OLX Autos, we run a lot of JVM workloads in our Kubernetes cluster, and many of our applications are developed in Java or Kotlin. We occasionally run into performance issues with these workloads, whether it is thread contention or frequent major GC cycles that cause undesired autoscaling and unnecessary resource usage. We use NewRelic as our primary Application Performance Monitoring (APM) tool, and though it helps us determine whether a performance issue is actually caused by the JVM, it falls short when we want to drill down to the root cause within the JVM. This is where detailed JVM profiling comes in.

There are plenty of JVM profiling tools available and most of them are geared towards profiling local applications. Although some of them are suitable for production usage, getting them to work in a dynamic environment like Kubernetes is a real challenge. In this post, we’ll discuss our approach to profiling JVM applications running in Kubernetes.

A Short Primer on JVM Profiling

JVM Profiling is the process of performing application diagnostics to figure out any performance, memory, or I/O-related bottlenecks that could be plaguing the applications that run on JVMs.

The JVM profiler monitors Java byte-code constructs and operations at the JVM level. These constructs and operations include object creation, iterative executions (including recursive calls), method executions, thread executions, and garbage collections. It then generates relevant metrics for these and sends them to a reporter, which can be a file, a stream, or a downstream application.

How Profilers Work

Profilers usually consist of two components:

  • The first component gathers the metrics/events from a running application and logs them to a reporter (generally a file, stream, or downstream application). In a JVM profiler, this component generally looks at the JVM-specific events such as thread samples (which show where the program spends its time), lock profiles, and garbage collection details.
  • The second component consumes the metrics/events from the reporter and generates relevant visualizations, such as time trends and flame graphs, which help developers analyze their applications.

Choosing the right Profiler

We evaluated a few options commonly used across the industry for JVM application profiling. Notable mentions include:

  • Async Profiler
  • JProfiler
  • VisualVM
  • YourKit
  • Java Flight Recorder (JFR)

Our Evaluation Criteria

  • Low overhead: The profiler should be suitable for continuous profiling in production.
  • Low TCO: We did not want to spend big bucks on it.
  • Detailed profiling and visualization: The profiler should be able to capture detailed metrics and allow for visualizations.

Java Flight Recorder

Based on our evaluation, we chose Java Flight Recorder (JFR). Primary reasons for choosing JFR:

  • It supports profiling applications that run locally as well as in production.
  • It can profile applications retrospectively as well as live.
  • It has minimal performance overhead, thanks to internal buffers that limit disk I/O operations.
  • It reduces total cost of ownership, as it is freely available with the JDK.
  • It is extensible and community-driven, being a community-maintained open-source project.

Challenges of JVM Profiling in Kubernetes

JVM profilers are typically used against applications running locally, even though local profiling may not yield the same insights you would get from profiling an application under actual production load. This is because most profilers introduce a performance overhead of their own, so using them with an application running in production is usually met with hesitation and apprehension. Even if a profiler is low-overhead and suitable for production use (like JFR), using it in a dynamic environment such as Kubernetes has its own set of challenges.

Let’s take a look at different ways we can profile applications running in production and the associated challenges when that environment is Kubernetes:

Profiling by connecting to a remote process

Most JVM profilers support profiling remote applications such as those running in production. They run locally and connect to the remote process via mechanisms like JMX to collect and visualize profiling data.

This is not straightforward in Kubernetes as application pods are not directly exposed to the outside world. It needs to be done via a host of workarounds which can be tricky and may not be aligned with infra maintenance standards.
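
As an illustration of the kind of workaround involved, one option is to expose a JMX endpoint on the JVM and tunnel to the pod with kubectl port-forward. This is only a sketch, with the port number and pod name as placeholders, and with authentication and SSL disabled purely for brevity:

# JVM flags that expose a JMX endpoint on port 9010 (do not disable auth/SSL in a real setup)
JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=127.0.0.1"

# Tunnel the JMX port from the pod to the local machine, then point the profiler at localhost:9010
kubectl port-forward pod/<pod-name> 9010:9010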

Production agent-based profiling

In this approach, the profiling data is collected on the remote server by running a profiling agent along with the application. This data can then be fetched and visualized using the profiler’s UI application.

In the case of JFR, it collects profiling events from the running application and writes them to a .jfr file. The profiling data in the .jfr file can then be visualized using Java Mission Control (JMC).
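
For reference, a JFR recording can also be started and dumped on demand against a running JVM using jcmd, which ships with the JDK. A minimal sketch, assuming a full JDK image and the application running as PID 1 in the container:

# Start a 2-minute recording using the built-in "profile" settings
jcmd 1 JFR.start name=adhoc settings=profile duration=120s filename=/tmp/adhoc.jfr

# Dump whatever the recording has captured so far without waiting for it to finish
jcmd 1 JFR.dump name=adhoc filename=/tmp/adhoc.jfr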

This is again quite challenging with Kubernetes.

  • Application pods in Kubernetes are ephemeral in nature. Running pods are frequently terminated as part of the deployment or downscaling activities. If we enable profiling in a pod and store profiling data locally, as soon as the pod is terminated, the profiling data will be lost. So we need to figure out a way in which the profiling data can persist and remain available even after the pod has been terminated.
  • Even if we can have a pod running long enough for profiling, the storage space on these pods can become a concern. Application pods are generally stateless in nature and hence do not have high storage requirements. This can lead to a shortage of storage space when it comes to keeping our profiling data on the application pods.
  • Even if we have the required storage space available, SSHing into the pods to copy profiling data is cumbersome and not a great security posture. We need to make the profiling data available to developers easily without requiring Ops intervention or production SSH access.

Operationalizing JFR in Kubernetes

Approach

To overcome the challenges that the Kubernetes environment poses for JVM profiling, we came up with an approach in which we baked the profiling steps into the Dockerfiles of our applications. This enables profiling of these applications when their containers are deployed into our production environment as application pods. The applications then write the profiling data as small file chunks (jfr files) to the local filesystem of the application pods. These chunks are synced to Amazon S3 and deleted from the local filesystem, which limits the storage requirements on the application pods.

The jfr file chunks can be retrieved and stitched together for any arbitrary time period using an in-house tool called Springboard. The consolidated jfr file can then be opened in JDK Mission Control on a developer's local machine to retrospectively analyze the profiled application.

High-Level Architecture

Below is the high-level architecture of the approach discussed above, followed by a more detailed explanation of the various stages involved.

Enabling profiling for applications

Enabling profiling for an application is an on-demand step in the CI/CD pipeline. It is done by setting the right pipeline variable when required; by default, the application is deployed without profiling enabled.

This is achieved by modifying the application Dockerfile and embedding steps that are executed only when the profiling feature flag is enabled in the pipeline. The application Docker image that is built then has the steps required for profiling baked into it.
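
For illustration, building the image with the flag turned on could look like the following, where the image name is a placeholder and ENABLE_PROFILING is the build argument used in the Dockerfile snippets below:

# Pass the profiling feature flag as a Docker build argument from the pipeline
docker build --build-arg ENABLE_PROFILING=TRUE -t my-app:profiling-enabled .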

Once this application is deployed, the steps that are baked in perform 2 basic tasks:

Task 1: Add the runtime argument that enables application profiling at startup.

ARG ENABLE_PROFILING
ENV JFR_ARG=${ENABLE_PROFILING:+"-XX:StartFlightRecording=disk=true,maxage=3m,dumponexit=true,filename=/tmp/jfr/dump/logs/<APP_NAME>.jfr -XX:FlightRecorderOptions=maxchunksize=1M,memorysize=1M,repository=/tmp/jfr/dump/logs"}
ENV JAVA_OPTS="$CONTAINER_MEMORY $G1_GC $SECURITY_OPTS $SHOW_SETTINGS $JFR_ARG"

Task 2: Set up the Amazon S3 sync script that syncs JFR files from the local filesystem.

RUN if [ "$ENABLE_PROFILING" = "TRUE" ] ; then apt-get -y install awscli ; fi
RUN if [ "$ENABLE_PROFILING" = "TRUE" ] ; then touch /tmp/cron.log ; fi
COPY s3_sync.sh /tmp/s3_sync.sh
RUN if [ "$ENABLE_PROFILING" = "TRUE" ] ; then (crontab -l; echo "*/3 * * * * sh /tmp/s3_sync.sh >> /tmp/cron.log 2>&1") | crontab ; fi

A few things to note in these tasks:

  • In the first task, the runtime arguments for JFR include the attributes maxchunksize and memorysize, which let us create chunks of a defined size, and maxage, which defines the maximum age of data kept on disk. Together these parameters rotate the recording so that only a set number of uniquely named file chunks are kept on disk, preventing the disk space on the application container from bloating up.
  • In the second task, s3_sync.sh is a custom script we created to copy the JFR file chunks to a defined location in Amazon S3. It runs periodically via a cron job set up on the container; a sketch of what such a script could look like follows this list.
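
The exact contents of s3_sync.sh are specific to our setup; the following is only a sketch of what such a script could look like, with the bucket name and the use of HOSTNAME as the pod identifier being assumptions:

#!/bin/sh
# Ship finished JFR chunk files to S3, then delete the local copies to cap disk usage.
REPO_DIR=/tmp/jfr/dump/logs
DEST=s3://<profiling-bucket>/jfr/${HOSTNAME}

# Skip chunks modified in the last couple of minutes so we don't copy the chunk JFR is still writing.
find "$REPO_DIR" -name '*.jfr' -mmin +2 | while read -r chunk; do
  aws s3 cp "$chunk" "$DEST/$(basename "$chunk")" && rm -f "$chunk"
done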

Retrieving JFR Files

The JFR files synced to Amazon S3 can later be retrieved using an in-house tooling application called Springboard. The developer specifies a time range and the ID of the application pod they want to analyze, and the application then creates a background job that performs the following tasks sequentially:

  • Fetch the JFR Files from Amazon S3 by filtering them based on the pod ID and time range provided by the developer.
  • Stitch the JFR Files using the jfr utility to create a single JFR File.
  • Upload the stitched JFR File back to Amazon S3.
  • Record the download URL for the stitched file as the output of the job.

Once the job is complete and the download URL is available, the developer can download the JFR file to their local system for further analysis.
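
The stitching step relies on the jfr command-line tool that ships with the JDK. Given a directory containing the downloaded chunk files, assembling and sanity-checking the recording looks roughly like this (paths are placeholders):

# Combine all chunk files under the given directory into a single recording
jfr assemble /tmp/downloaded-chunks consolidated.jfr

# Print a quick summary of the events captured in the assembled recording
jfr summary consolidated.jfr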

Visualizing JFR Files with Java Mission Control

Once we have a recorded and stitched JFR file, we can import that into Java Mission Control (JMC). JMC will read the time series data and generate insights and graphs that can help you dig out the performance gremlins (if any).

It also generates a nifty summary report that indicates if there are any performance bottlenecks that it has automatically detected.
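
If JMC is not at hand, the same jfr tool can also print individual events from the recording for a quick command-line look; the event names below are standard JFR event types:

# Inspect garbage collection and monitor-contention events directly from the terminal
jfr print --events GarbageCollection,JavaMonitorEnter consolidated.jfr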

Road Ahead

While we have a system in place, there is still plenty of scope for improvement in how it is integrated with our applications.

First, we want to completely decouple the activity of syncing the JFR files from the application by using a sidecar container, attached dynamically to the application pod via a Kubernetes admission controller. This would spare applications from changing their setup to enable profiling, and the separate container would also help us manage resources better.

Second, we plan to look at JFR Event Streaming, available since Java 14, and how we can leverage it to move away from a file-based approach altogether.

These are exciting times for anyone looking into this space. With Java Flight Recorder now open source, it is all the more appealing to contribute to the development of a tool that provides so much value.
