Building Tubi Data Runtime on JupyterHub: Architecture, Authentication, Storage, GPU Support, and Autoscaling
This article details how Tubi built the Tubi Data Runtime platform on JupyterHub using Kubernetes, covering authentication with Okta SSO, custom Docker images, shared EFS storage, multi‑service support, GPU enablement, node affinity, cluster autoscaling, and monitoring with Prometheus.
Overview
Tubi Data Runtime (TDR) started as a Python library but needed a production‑grade platform. JupyterHub was chosen as the entry point, allowing users to access a unified URL and launch personal Jupyter services.
Basic Features
Running on AWS, TDR uses kops, kubectl and the official JupyterHub Helm chart. The Helm chart simplifies deployment but additional work is required for production readiness.
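For reference, deploying the official chart against a kops-built cluster typically boils down to a few Helm commands like the following (the release and namespace names here are illustrative, not Tubi's actual values):
# Add the JupyterHub chart repository and install/upgrade with a custom values.yaml
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm upgrade --install jhub jupyterhub/jupyterhub \
  --namespace jhub --create-namespace \
  --values values.yaml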
Authentication / Single Sign‑On
TDR integrates Okta SSO (OAuth 2.0). Adding the following to values.yaml enables Okta login:
auth:
  type: custom
  custom:
    className: oauthenticator.generic.GenericOAuthenticator
    config:
      login_service: "Okta"
      client_id: "{{ okta_client_id }}"
      client_secret: "{{ okta_client_secret }}"
      token_url: https://{{ tubi_okta_domain }}/oauth2/v1/token
      userdata_url: https://{{ tubi_okta_domain }}/oauth2/v1/userinfo
      userdata_method: GET
      userdata_params: {'state': 'state'}
      username_key: preferred_username
After this configuration, users can click the TDR icon on Okta to log in directly.
Deeply Customized Docker Image
The official Helm chart uses a standard JupyterLab image. TDR builds its own Alpine‑based image to keep size small and to embed core TDR features.
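As a rough, hypothetical sketch of that approach (base image, packages, and paths are illustrative; the real image additionally installs the private TDR library and its dependencies):
# Illustrative only -- the actual TDR image is private and more elaborate.
FROM python:3.8-alpine
# Build tooling needed to compile wheels on Alpine (musl ships few prebuilt wheels).
RUN apk add --no-cache build-base libffi-dev openssl-dev
# jupyterhub provides the jupyterhub-singleuser entrypoint that KubeSpawner launches;
# the TDR library itself would be installed from the private package index here.
RUN pip install --no-cache-dir jupyterhub jupyterlab
WORKDIR /home/tubi/notebooks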
Shared Storage
Each user pod mounts two directories from an AWS EFS volume: a private home directory and a shared directory. The following Helm snippet configures the mounts:
singleuser:
  storage:
    homeMountPath: '/home/tubi/notebooks/{username}'
    type: "static"
    static:
      pvcName: "efs-persist"
      subPath: 'home/{username}'
    extraVolumeMounts:
      - name: home
        mountPath: /home/tubi/notebooks/shared
        subPath: shared/notebooks
The PVC only needs to be defined once; the extraVolumeMounts array can be extended as needed.
Multiple Jupyter Services per User
Setting c.JupyterHub.allow_named_servers = True lets a user run the default server and additional named servers simultaneously, with the caveat that links must reference the correct server name.
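A minimal way to set this through the Helm chart (the extraConfig key name below is arbitrary, and older chart versions take a single string instead of a map) looks like:
hub:
  extraConfig:
    namedServers: |
      c.JupyterHub.allow_named_servers = True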
Advanced Features
Deep Learning (GPU)
Two separate images (CPU and GPU) are provided. The GPU image includes CUDA and cuDNN, and the NVIDIA device plugin is required. Users select the desired image via the JupyterHub profile list:
singleuser:
  profileList:
    - display_name: "Default"
      description: |
        Tubi Data Runtime
      default: True
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime
        extra_resource_limits:
          nvidia.com/gpu: "0"
    - display_name: "GPU"
      description: |
        Tubi Data Runtime with GPU Support. 1 GPU ONLY.
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime-gpu
        extra_resource_limits:
          nvidia.com/gpu: "1"
TensorBoard Proxy Patch
To make TensorBoard reachable through JupyterHub, a small patch to tensorboard/notebook.py rewrites the URL to use the Jupyter Server Proxy.
diff --git a/tensorboard/notebook.py b/tensorboard/notebook.py
index fe0e13aa..ab774377 100644
--- a/tensorboard/notebook.py
+++ b/tensorboard/notebook.py
@@ -378,8 +378,17 @@ def _display_ipython(port, height, display_handle):
const frame = document.getElementById(%JSON_ID%);
- const url = new URL("/", window.location);
- url.port = %PORT%;
+ var baseUrl = "/";
+ try {
+ baseUrl = JSON.parse(document.getElementById('jupyter-config-data').text || '').baseUrl;
+ } catch {
+ try {
+ baseUrl = $('body').data('baseUrl');
+ } catch {}
+ }
+ const url = new URL(baseUrl, window.location) + "proxy/%PORT%/";
frame.src = url;
})();
Node Affinity
CPU pods prefer CPU instance groups, while GPU pods require GPU instance groups. The Helm configuration uses node_affinity_preferred and node_affinity_required:
singleuser:
  profileList:
    - display_name: "Default"
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime
        node_affinity_preferred:
          - weight: 1
            preference:
              matchExpressions:
                - key: kops.k8s.io/instancegroup
                  operator: NotIn
                  values:
                    - gpu
        extra_resource_limits:
          nvidia.com/gpu: "0"
    - display_name: "GPU"
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime-gpu
        node_affinity_required:
          - matchExpressions:
              - key: kops.k8s.io/instancegroup
                operator: In
                values:
                  - gpu
        extra_resource_limits:
          nvidia.com/gpu: "1"
Cluster Autoscaling
The AWS Cluster Autoscaler is enabled by adding the required IAM permissions, tagging the instance groups (their Auto Scaling Groups) with k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<CLUSTER-NAME> so the autoscaler can auto-discover them, and adjusting the SSL certificates path mounted by the autoscaler. This reduces idle node costs by roughly 50%.
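With kops, those discovery tags can be attached by adding cloudLabels to each instance group so they propagate to the underlying Auto Scaling Groups. A sketch under that assumption (cluster name and sizes are placeholders):
# Hypothetical kops instance group spec fragment; cloudLabels become ASG tags
# that the Cluster Autoscaler's auto-discovery mode looks for.
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/<CLUSTER-NAME>: ""
  minSize: 0
  maxSize: 10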
Monitoring
Prometheus, kube‑prometheus, Grafana, and Alertmanager are deployed for internal monitoring. Access is restricted to company IP ranges.
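One way to express that IP restriction, assuming the dashboards are exposed through a LoadBalancer Service (the CIDR below is a documentation placeholder, not Tubi's real range), is loadBalancerSourceRanges:
# Hypothetical Service fragment restricting dashboard access to office CIDRs.
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
    - 203.0.113.0/24   # placeholder company IP range
  selector:
    app: grafana
  ports:
    - port: 80
      targetPort: 3000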
Conclusion
TDR now provides a production‑grade, cloud‑native data platform with authentication, storage, GPU support, autoscaling, and observability. Future work includes fine‑grained permission control, task scheduling, reproducibility, and model serving to make the platform usable by non‑technical analysts.
