The spark-operator is a convenient way to start managing Spark applications in a Kubernetes-native way. It does, however, lack some basic functionality, like livenessProbes, to name one. Let's see how we can overcome these shortcomings quite easily.

The problem

For the openEO project, we deploy a long-running Spark application that exposes a web API endpoint. Because this is not a regular batch job, like most Spark jobs are, we need the functionality that we're used to when deploying standard web applications, such as a livenessProbe, readinessProbe and startupProbe. We run our Spark applications via the spark-operator, which provides a CRD called SparkApplication. This CRD, however, is missing these health checks.
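
For reference, a trimmed-down SparkApplication for such a service could look like the sketch below; the names, image and paths are made up for illustration, and the point is that the driver section offers no field for probes:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: openeo-webapp                # hypothetical name
spec:
  type: Python
  mode: cluster
  image: "example.org/openeo:latest"                          # placeholder image
  mainApplicationFile: "local:///opt/spark/work-dir/app.py"   # placeholder path
  driver:
    cores: 1
    memory: "2g"
    # no livenessProbe / readinessProbe / startupProbe can be set here
  executor:
    instances: 2
    cores: 2
    memory: "4g"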

Solution #1: Robusta

Because we mainly wanted a way to automatically restart the pod when it becomes unhealthy, we first looked into Robusta. Integrated with Prometheus and Alertmanager, it allowed us to write a playbook that restarts the pod when the API endpoint becomes unavailable. While this worked quite well, it wasn't really a Kubernetes-native solution, and it was never as fast as what we're used to when deploying regular web applications with health checks.

Solution #2: podTemplate

Spark has a feature to define a podTemplate. The podTemplate contains the health checks, and the Spark driver and/or executors are created from this template. While this solution sounds like exactly what we need, I never got it to work correctly with the spark-operator. A podTemplate for our driver could have been as simple as:

---
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-driver
      livenessProbe:
        httpGet:
          path: /health
          port: http
        initialDelaySeconds: 60
        periodSeconds: 5

We then reference this file with

spark.kubernetes.driver.podTemplateFile: "/opt/spark/work-dir/driver_template.yaml"

in the sparkConf of the SparkApplication. The pods were created from this template, but the regular settings defined in the SparkApplication manifest, like volumes, were no longer added to the pod. While we could have written it all in the sparkConf, rewriting all of our SparkApplications would have been too big a change.
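
Although we didn't go this route in the end, for reference this is roughly how that setting sits inside the SparkApplication manifest (illustrative excerpt):

spec:
  sparkConf:
    "spark.kubernetes.driver.podTemplateFile": "/opt/spark/work-dir/driver_template.yaml"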

Solution #3: Kyverno

The podTemplate let us add some custom settings to the pods of our SparkApplication, but it didn't work end-to-end for our use case. So we went looking for another solution that lets us patch our SparkApplications: enter Kyverno.

Kyverno is a policy manager. It gives teams the possibility to enforce certain policies when Kubernetes resources are submitted to the API server. An example: cluster operators expect every developer to set resource limits for their applications; when a manifest doesn't contain these limits, Kyverno refuses to apply the resource to the cluster. These validation policies, however, don't help with our problem.
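
As an aside, such a validation policy could look roughly like the sketch below; it's a minimal example based on Kyverno's validate rules, not a policy from our setup:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits      # illustrative policy
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"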

Kyverno can do more than just validate: it can also mutate resources when they are submitted. Any setting can be added to a resource with a mutating policy. This is exactly what we need!

Install Kyverno

We're installing Kyverno with Helm:

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno -n kyverno --create-namespace

Check and configure the chart's values according to your needs. The defaults allow you to follow the next steps.
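
Before applying any policies, it doesn't hurt to verify that the Kyverno pods are running:

kubectl -n kyverno get pods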

Kyverno policies are defined as regular Kubernetes resources in a YAML manifest. For our use case, we need to be able to add, for example, a livenessProbe:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-liveness-probe
spec:
  rules:
  - name: add-liveness-probe
    match:
      all:
        - resources:
            kinds:
              - Pod
            selector:
              matchLabels:
                driver-type: webapp
    mutate:
      patchesJson6902: |
        - path: "/spec/containers/0/livenessProbe"
          op: add
          value:
            failureThreshold: 1
            initialDelaySeconds: 60
            periodSeconds: 5
        - path: "/spec/containers/0/livenessProbe/httpGet"
          op: add
          value:
            path: /health
            port: http

Let's break it down:

    match:
      all:
        - resources:
            kinds:
              - Pod
            selector:
              matchLabels:
                driver-type: webapp

We don't want to add the livenessProbe to every pod that is submitted to our cluster, so we do some simple matching. First we define that we only want to target Pods. Then we narrow that down to pods that have the label driver-type=webapp.
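
For this match to work, the driver pod needs to carry that label. With the spark-operator, it can be set in the driver section of the SparkApplication, roughly like this (illustrative excerpt):

spec:
  driver:
    labels:
      driver-type: webapp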

    mutate:
      patchesJson6902: |
        - path: "/spec/containers/0/livenessProbe"
          op: add
          value:
            failureThreshold: 1
            initialDelaySeconds: 60
            periodSeconds: 5
        - path: "/spec/containers/0/livenessProbe/httpGet"
          op: add
          value:
            path: /health
            port: http

Then we add our mutating rule. We perform a JSON patch, where the path defines exactly where the livenessProbe should be added: here it goes into the pod spec, on the first container. We add a livenessProbe section and define its settings. If we need to go a level deeper, like for the httpGet block, we have to add another JSON patch with a different path.
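
If you have the Kyverno CLI installed, you can also dry-run the policy against a pod manifest before rolling it out; the file names below are hypothetical:

kyverno apply add-liveness-probe.yaml --resource driver-pod.yaml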

Conclusion

When we deploy the Spark applications, we can see that Kyverno adds the necessary extra fields to our pods. We now have native livenessProbes that allow Kubernetes to detect when our application should be restarted because the HTTP endpoint is no longer healthy. With exactly the same procedure, we can also add startupProbes and readinessProbes.
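
As a sketch, a readinessProbe rule would only differ in the patch paths and probe settings, assuming the same /health endpoint:

    mutate:
      patchesJson6902: |
        - path: "/spec/containers/0/readinessProbe"
          op: add
          value:
            failureThreshold: 3
            periodSeconds: 5
        - path: "/spec/containers/0/readinessProbe/httpGet"
          op: add
          value:
            path: /health
            port: http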

If we encounter more shortcomings of the spark-operator, we will certainly check whether we can solve them with Kyverno.