Building Tubi Data Runtime on JupyterHub: Architecture, Authentication, Storage, GPU Support, and Autoscaling
This article details how Tubi built the Tubi Data Runtime platform on JupyterHub using Kubernetes, covering authentication with Okta SSO, custom Docker images, shared EFS storage, multi‑service support, GPU enablement, node affinity, cluster autoscaling, and monitoring with Prometheus.
Overview
Tubi Data Runtime (TDR) started as a Python library but needed a production‑grade platform. JupyterHub was chosen as the entry point, allowing users to access a unified URL and launch personal Jupyter services.
Basic Features
Running on AWS, TDR uses kops, kubectl and the official JupyterHub Helm chart. The Helm chart simplifies deployment but additional work is required for production readiness.
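For reference, deploying the official chart against a kops-built cluster typically boils down to a few Helm commands like the following (the release and namespace names here are illustrative, not Tubi's actual values):
# Add the JupyterHub chart repository and install/upgrade with a custom values.yaml
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm upgrade --install jhub jupyterhub/jupyterhub \
  --namespace jhub --create-namespace \
  --values values.yaml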
Authentication / Single Sign‑On
TDR integrates Okta SSO (OAuth 2.0). Adding the following to values.yaml enables Okta login:
auth:
  type: custom
  custom:
    className: oauthenticator.generic.GenericOAuthenticator
    config:
      login_service: "Okta"
      client_id: "{{ okta_client_id }}"
      client_secret: "{{ okta_client_secret }}"
      token_url: https://{{ tubi_okta_domain }}/oauth2/v1/token
      userdata_url: https://{{ tubi_okta_domain }}/oauth2/v1/userinfo
      userdata_method: GET
      userdata_params: {'state': 'state'}
      username_key: preferred_username
After this configuration, users can click the TDR icon on Okta to log in directly.
Deeply Customized Docker Image
The official Helm chart uses a standard JupyterLab image. TDR builds its own Alpine‑based image to keep size small and to embed core TDR features.
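As a rough, hypothetical sketch of that approach (base image, packages, and paths are illustrative; the real image additionally installs the private TDR library and its dependencies):
# Illustrative only -- the actual TDR image is private and more elaborate.
FROM python:3.8-alpine
# Build tooling needed to compile wheels on Alpine (musl ships few prebuilt wheels).
RUN apk add --no-cache build-base libffi-dev openssl-dev
# jupyterhub provides the jupyterhub-singleuser entrypoint that KubeSpawner launches;
# the TDR library itself would be installed from the private package index here.
RUN pip install --no-cache-dir jupyterhub jupyterlab
WORKDIR /home/tubi/notebooks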
Shared Storage
Each user pod mounts two directories from an AWS EFS volume: a private home directory and a shared directory. The following Helm snippet configures the mounts:
singleuser:
  storage:
    homeMountPath: '/home/tubi/notebooks/{username}'
    type: "static"
    static:
      pvcName: "efs-persist"
      subPath: 'home/{username}'
    extraVolumeMounts:
      - name: home
        mountPath: /home/tubi/notebooks/shared
        subPath: shared/notebooks
The PVC only needs to be defined once; the extraVolumeMounts array can be extended as needed.
Multiple Jupyter Services per User
Setting c.JupyterHub.allow_named_servers = True lets a user run the default server and additional named servers simultaneously, with the caveat that links must reference the correct server name.
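A minimal way to set this through the Helm chart (the extraConfig key name below is arbitrary, and older chart versions take a single string instead of a map) looks like:
hub:
  extraConfig:
    namedServers: |
      c.JupyterHub.allow_named_servers = True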
Advanced Features
Deep Learning (GPU)
Two separate images (CPU and GPU) are provided. The GPU image includes CUDA and cuDNN, and the NVIDIA device plugin is required. Users select the desired image via the JupyterHub profile list:
singleuser:
  profileList:
    - display_name: "Default"
      description: |
        Tubi Data Runtime
      default: True
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime
        extra_resource_limits:
          nvidia.com/gpu: "0"
    - display_name: "GPU"
      description: |
        Tubi Data Runtime with GPU Support. 1 GPU ONLY.
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime-gpu
        extra_resource_limits:
          nvidia.com/gpu: "1"
TensorBoard Proxy Patch
To make TensorBoard reachable through JupyterHub, a small patch to tensorboard/notebook.py rewrites the URL to use the Jupyter Server Proxy.
diff --git a/tensorboard/notebook.py b/tensorboard/notebook.py
index fe0e13aa..ab774377 100644
--- a/tensorboard/notebook.py
+++ b/tensorboard/notebook.py
@@ -378,8 +378,17 @@ def _display_ipython(port, height, display_handle):
const frame = document.getElementById(%JSON_ID%);
- const url = new URL("/", window.location);
- url.port = %PORT%;
+ var baseUrl = "/";
+ try {
+ baseUrl = JSON.parse(document.getElementById('jupyter-config-data').text || '').baseUrl;
+ } catch {
+ try {
+ baseUrl = $('body').data('baseUrl');
+ } catch {}
+ }
+ const url = new URL(baseUrl, window.location) + "proxy/%PORT%/";
frame.src = url;
})();
Node Affinity
CPU pods prefer CPU instance groups, while GPU pods require GPU instance groups. The Helm configuration uses node_affinity_preferred and node_affinity_required:
singleuser:
  profileList:
    - display_name: "Default"
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime
        node_affinity_preferred:
          - weight: 1
            preference:
              matchExpressions:
                - key: kops.k8s.io/instancegroup
                  operator: NotIn
                  values:
                    - gpu
        extra_resource_limits:
          nvidia.com/gpu: "0"
    - display_name: "GPU"
      kubespawner_override:
        image: <private-docker-registry>/tubi-data-runtime-gpu
        node_affinity_required:
          - matchExpressions:
              - key: kops.k8s.io/instancegroup
                operator: In
                values:
                  - gpu
        extra_resource_limits:
          nvidia.com/gpu: "1"
Cluster Autoscaling
The AWS Cluster Autoscaler is enabled by adding the required IAM permissions, tagging the instance groups (their Auto Scaling Groups) with k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<CLUSTER-NAME> so the autoscaler can auto-discover them, and adjusting the SSL certificates path mounted by the autoscaler. This reduces idle node costs by roughly 50%.
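With kops, those discovery tags can be attached by adding cloudLabels to each instance group so they propagate to the underlying Auto Scaling Groups. A sketch under that assumption (cluster name and sizes are placeholders):
# Hypothetical kops instance group spec fragment; cloudLabels become ASG tags
# that the Cluster Autoscaler's auto-discovery mode looks for.
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/<CLUSTER-NAME>: ""
  minSize: 0
  maxSize: 10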
Monitoring
Prometheus, kube‑prometheus, Grafana, and Alertmanager are deployed for internal monitoring. Access is restricted to company IP ranges.
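One way to express that IP restriction, assuming the dashboards are exposed through a LoadBalancer Service (the CIDR below is a documentation placeholder, not Tubi's real range), is loadBalancerSourceRanges:
# Hypothetical Service fragment restricting dashboard access to office CIDRs.
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
    - 203.0.113.0/24   # placeholder company IP range
  selector:
    app: grafana
  ports:
    - port: 80
      targetPort: 3000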
Conclusion
TDR now provides a production‑grade, cloud‑native data platform with authentication, storage, GPU support, autoscaling, and observability. Future work includes fine‑grained permission control, task scheduling, reproducibility, and model serving to make the platform usable by non‑technical analysts.
