Google Kubernetes Engine: "The node was low on resource: ephemeral-storage ... which exceeds its request of 0"

I have a GKE cluster where I create jobs through Django; the jobs run my C++ code images, and the builds are triggered through GitHub. It was working just fine until now. However, I recently pushed a new commit to GitHub (a really small change, three or four lines of basic operations) and it built an image as usual. But this time, the job created through my Django call failed with "Pod errors: BackoffLimitExceeded, Error with exit code 137", and the job never completes.

I did some digging into the problem, and by running kubectl describe POD_NAME I got this output from a failed pod:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-nqgnl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Normal   Scheduled            7m32s  default-scheduler  Successfully assigned default/xvb8zfzrhhmz-jk9vf to gke-cluster-1-default-pool-ee7e99bb-xzhk
  Normal   Pulling              7m7s   kubelet            Pulling image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest"
  Normal   Pulled               4m1s   kubelet            Successfully pulled image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest" in 3m6.343917225s
  Normal   Created              4m1s   kubelet            Created container jobcontainer
  Normal   Started              4m     kubelet            Started container jobcontainer
  Warning  Evicted              3m29s  kubelet            The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
  Normal   Killing              3m29s  kubelet            Stopping container jobcontainer
  Warning  ExceededGracePeriod  3m19s  kubelet            Container runtime did not kill the pod within specified grace period.

The error occurs because of this line:

The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.

I do not have a YAML file where I set the pod configuration; instead I have a Django call that handles the configuration, which looks like this:

# Imports used by the two functions below:
import base64
import json
from tempfile import NamedTemporaryFile

import google.auth
import google.auth.transport.requests
import kubernetes
from django.conf import settings
from google.cloud.container_v1 import ClusterManagerClient
from kubernetes import client
from kubernetes.client.rest import ApiException


def kube_create_job_object(name, container_image, namespace="default", container_name="jobcontainer", env_vars={}):
    # Body is the object Body
    body = client.V1Job(api_version="batch/v1", kind="Job")
    # Body needs Metadata
    # Attention: Each JOB must have a different name!
    body.metadata = client.V1ObjectMeta(namespace=namespace, name=name)
    # And a Status
    body.status = client.V1JobStatus()
    # Now we start with the Template...
    template = client.V1PodTemplate()
    template.template = client.V1PodTemplateSpec()
    # Passing Arguments in Env:
    env_list = []
    for env_name, env_value in env_vars.items():
        env_list.append( client.V1EnvVar(name=env_name, value=env_value) )

    print(env_list)
    security = client.V1SecurityContext(privileged=True, allow_privilege_escalation=True, capabilities=client.V1Capabilities(add=["CAP_SYS_ADMIN"]))
    container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security)
    template.template.spec = client.V1PodSpec(containers=[container], restart_policy='Never')
    body.spec = client.V1JobSpec(backoff_limit=0, ttl_seconds_after_finished=600, template=template.template)
    return body



def kube_create_job(manifest, output_uuid, output_signed_url, webhook_url, valgrind, sleep, isaudioonly):
    # Authenticate to Google Cloud and look up the GKE cluster so we can talk
    # to its Kubernetes API endpoint directly.
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform', ])
    credentials.refresh(google.auth.transport.requests.Request())
    cluster_manager = ClusterManagerClient(credentials=credentials)
    cluster = cluster_manager.get_cluster(name=f"path/to/cluster")

    with NamedTemporaryFile(delete=False) as ca_cert:
        ca_cert.write(base64.b64decode(cluster.master_auth.cluster_ca_certificate))

    # Point the kubernetes client at the cluster endpoint, using the OAuth
    # bearer token and verifying TLS against the CA certificate written above.
    config = client.Configuration()
    config.host = f'https://{cluster.endpoint}:443'
    config.verify_ssl = True
    config.api_key = {"authorization": "Bearer " + credentials.token}
    config.username = credentials._service_account_email
    config.ssl_ca_cert = ca_cert.name
    client.Configuration.set_default(config)

    # Setup K8 configs
    api_instance = kubernetes.client.BatchV1Api(kubernetes.client.ApiClient(config))

    container_image = get_first_success_build_from_list_builds(client)
    name = id_generator()
    body = kube_create_job_object(name, container_image,
                                  env_vars={
                                            "PROJECT"           : json.dumps(manifest),
                                            "BUCKET"            : settings.GS_BUCKET_NAME,
                                           })
    try:
        api_response = api_instance.create_namespaced_job("default", body, pretty=True)
        print(api_response)
    except ApiException as e:
        print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
    return body

What causes this, and how can I fix it? Am I supposed to set the resource request/limit variables to some value, and if so, how can I do that inside my Django job call?

It looks like you are running out of storage on the actual node itself. Since your job spec has no request for ephemeral storage, the pod can be scheduled onto any node, and in this case that particular node did not have enough disk available. Note the QoS Class: BestEffort in your describe output: with no requests set at all, the container's ephemeral-storage request defaults to 0, so any disk usage "exceeds its request", which makes this pod the kubelet's first eviction candidate once the node comes under disk pressure. Keep in mind that ephemeral storage covers the container's writable layer, its logs, and any emptyDir volumes, so even a job that only writes temporary files consumes it.
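
To confirm the node-side picture, you can list each node's allocatable ephemeral storage with the same kubernetes client you already use. A quick sketch, assuming the client configuration from kube_create_job has already been applied via client.Configuration.set_default(config):

# List how much ephemeral storage the scheduler considers allocatable on each
# node; compare this against what your job actually writes to disk.
core = kubernetes.client.CoreV1Api()
for node in core.list_node().items:
    print(node.metadata.name, node.status.allocatable.get("ephemeral-storage"))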

I'm not a Python expert, but for the fix itself it looks like you should be able to do something like:

storage_size = "1Gi"  # placeholder quantity; size it to your job's actual disk usage
requests = {'ephemeral-storage': storage_size}
resources = client.V1ResourceRequirements(requests=requests)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)
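
For completeness, here is a minimal sketch of how that plugs into the kube_create_job_object function from the question, with a limit added as well. The "1Gi" and "5Gi" quantities are placeholders chosen for illustration, not values from the question, so size them to your workload:

# Request ephemeral storage so the scheduler only places the pod on nodes with
# enough free disk, and cap it so a runaway job cannot fill the node.
resources = client.V1ResourceRequirements(
    requests={"ephemeral-storage": "1Gi"},
    limits={"ephemeral-storage": "5Gi"},
)
container = client.V1Container(
    name=container_name,
    image=container_image,
    env=env_list,
    stdin=True,
    security_context=security,
    resources=resources,
)

With a request in place, the pod's usage no longer automatically "exceeds its request of 0", so it stops being the first eviction candidate under disk pressure; and if usage ever goes over the limit, the kubelet evicts the pod regardless of node pressure, which at least fails the job with an unambiguous message.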