Optimizing Data Access in Tubi Data Runtime: Redshift Connector, SQL Cell Magic, and JupyterLab Extensions
This article explains how Tubi Data Runtime (TDR) streamlines data access on JupyterHub by introducing an optimized Redshift connector, custom SQL cell magic, and JupyterLab extensions for data exploration, reducing latency and resource usage while enhancing collaboration and usability for data scientists and engineers.
Tubi uses Jupyter Notebook as the unified platform for data analysis and data science. All notebooks run on a customized JupyterHub deployment, referred to as Tubi Data Runtime (TDR), which is built on Kubernetes.
Optimized Data Connector – Most queries to AWS Redshift and S3 return less than 1 GB of data, which fits comfortably in memory. The legacy workflow required launching a Spark cluster, loading the Redshift result into a Spark DataFrame, and then converting it to Pandas, which incurred high latency:

    df = query_redshift(sc, sql).toPandas()  # 95% of Redshift queries

TDR replaces this with a direct Redshift UNLOAD command that writes CSV shards to S3, reads them concurrently with multiprocessing, and concatenates the parts with pandas.concat:

    def to_df(self, nproc=None, **kwargs):
        parser_options = self._get_parser_options(**kwargs)
        # temp_credentials holds (key, secret, token)
        s3 = S3FileSystem(key=self.temp_credentials[0], secret=self.temp_credentials[1], token=self.temp_credentials[2])
        # One argument tuple per file listed in the manifest
        args = [(s3.open(f, 'rb'), self.manifest, True, parser_options, self.temp_credentials)
                for f in self.manifest.iter_files()]
        with util.create_executor(nproc) as executor:
            parts = executor(_read_part, args)
        # ignore_index=True already produces a fresh 0..n-1 index
        return pd.concat(parts, ignore_index=True)

The helper create_executor uses multiprocessing.Pool.starmap to read CSV parts in parallel:
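
    import multiprocessing as mp
    from contextlib import contextmanager

    @contextmanager
    def create_executor(nproc):
        # nproc=None lets Pool pick a worker count from the available CPUs
        pool = mp.Pool(nproc)
        try:
            # Callers invoke the yielded function as executor(func, args)
            yield pool.starmap
        finally:
            pool.close()
            pool.join()

Because Redshift UNLOAD generates the CSV files directly, they are well-formed, which eliminates read errors caused by malformed rows.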
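The UNLOAD step that produces those shards is not shown in the article's code. A minimal sketch of how such a statement could be assembled follows. The function name build_unload_sql, the S3 prefix, and the IAM role ARN are hypothetical, and the chosen options (FORMAT AS CSV, HEADER, MANIFEST, PARALLEL ON) are one plausible combination rather than TDR's confirmed configuration.

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role: str) -> str:
    """Wrap a SELECT in a Redshift UNLOAD statement that writes CSV shards.

    All three arguments are illustrative; the article does not document
    the parameters TDR actually uses.
    """
    # UNLOAD receives the query as a quoted string, so single quotes inside
    # the query are doubled and a trailing semicolon is dropped.
    escaped = query.strip().rstrip(";").replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"  # CSV output with quoting handled by Redshift
        "HEADER\n"         # each file starts with the column names
        "MANIFEST\n"       # also write a manifest listing every file
        "PARALLEL ON"      # write multiple files in parallel
    )


if __name__ == "__main__":
    print(build_unload_sql(
        "SELECT platform, COUNT(*) FROM events WHERE platform = 'roku' GROUP BY 1;",
        "s3://example-bucket/tmp/unload/run-001/part_",
        "arn:aws:iam::123456789012:role/example-unload-role",
    ))
```

The manifest written by UNLOAD is what lets a reader such as to_df enumerate the shards instead of listing the S3 prefix.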
SQL Cell Magic – To lower the barrier for users comfortable with SQL but new to Python, TDR provides a %%sql cell magic. The magic executes the cell content via tdr.query_redshift(cell).to_df() and stores the resulting DataFrame in the variable df:

    from IPython.core.magic import cell_magic, Magics, magics_class

    @magics_class
    class TubiDataRuntimeMagics(Magics):
        @cell_magic
        def sql(self, line='', cell=None):
            import tubi_data_runtime as tdr
            # Expose the result to the notebook namespace as `df`
            self.shell.user_ns['df'] = tdr.query_redshift(cell).to_df()

The magic also supports Jinja2 templating, allowing reusable SQL snippets such as {{ott_app}} to be inserted into queries.
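How a snippet might be expanded before the query reaches Redshift can be sketched as follows. To stay dependency-free, the example uses a small regular-expression substitution as a stand-in for Jinja2 and does not replicate it (no filters, loops, or conditionals). The render_sql helper and the contents of the ott_app snippet are hypothetical.

```python
import re

# Hypothetical snippet registry; the article does not define ott_app.
SNIPPETS = {
    "ott_app": "platform IN ('roku', 'firetv', 'androidtv')",
}

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_sql(template: str, snippets: dict) -> str:
    """Replace each {{ name }} placeholder with its snippet text."""
    def substitute(match):
        name = match.group(1)
        if name not in snippets:
            raise KeyError(f"unknown SQL snippet: {name}")
        return snippets[name]

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    cell = "SELECT COUNT(*) FROM events WHERE {{ott_app}}"
    print(render_sql(cell, SNIPPETS))
```

With Jinja2 itself, the equivalent call would be jinja2.Template(cell).render(**SNIPPETS), which a magic could apply to the cell text before passing it to the query function.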
JupyterLab Extension for Data Explorer – To provide an integrated visual exploration experience, a JupyterLab extension renders the nteract data explorer directly inside a notebook cell and persists its state in notebook metadata:

    renderModel(model: IRenderMime.IMimeModel): Promise<void> {
      const data = model.data[this._mimeType] as JSONObject; // data explorer data
      const metadata = model.metadata.dataExplorer as JSONObject; // data explorer metadata
      const onMetadataChange = (data: object) => {
        // Store the explorer state in the output metadata, then save the notebook
        model.setData({ metadata: { ...model.metadata, dataExplorer: data } });
        findNotebookPanel(this).context.save();
      };
      return new Promise<void>((resolve) => {
        ReactDOM.render(<DataExplorer data={data} metadata={metadata} onMetadataChange={onMetadataChange} />, this.node, resolve);
      });
    }

Combined with the SQL magic, users can retrieve data with a single %%sql cell and visualize it with display(df), so a beginner needs only two cells to go from query to visualization.
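On the Python side, display() hands the renderer a MIME bundle. The sketch below shows that mechanism with a plain class instead of a DataFrame. It assumes the extension is registered for application/vnd.dataresource+json, the Table Schema media type that nteract's data explorer consumes; the article does not state which MIME type TDR uses, and the TableResource class is hypothetical.

```python
DATA_RESOURCE_MIME = "application/vnd.dataresource+json"

# Map Python value types to Table Schema field types.
_FIELD_TYPES = {bool: "boolean", int: "integer", float: "number", str: "string"}


class TableResource:
    """Wrap a list of row dicts so IPython's display() emits a data resource."""

    def __init__(self, rows):
        self.rows = list(rows)

    def _schema(self):
        # Infer field types from the first row; unknown types fall back to "any".
        first = self.rows[0] if self.rows else {}
        fields = [
            {"name": name, "type": _FIELD_TYPES.get(type(value), "any")}
            for name, value in first.items()
        ]
        return {"fields": fields}

    def _repr_mimebundle_(self, include=None, exclude=None):
        # IPython calls this hook to collect all representations at once.
        return {
            DATA_RESOURCE_MIME: {"schema": self._schema(), "data": self.rows},
            "text/plain": f"<TableResource with {len(self.rows)} rows>",
        }


if __name__ == "__main__":
    table = TableResource([{"title": "Example", "views": 42, "rating": 4.5}])
    print(table._repr_mimebundle_()[DATA_RESOURCE_MIME]["schema"])
```

For pandas DataFrames, setting pd.options.display.html.table_schema = True makes display(df) publish a Table Schema representation under the same media type.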
Collaboration Features – The team built extensions for shareable notebook URLs (using JupyterHub /user-redirect/) and for directory copy-paste operations, and open-sourced the extensions on GitHub.
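A shareable link of this kind can be built from the notebook's path. The sketch below assumes the hub is served under the default /hub/ prefix, that JupyterLab's /lab/tree/ route opens the file, and that the same path exists on every recipient's server. The hostname and the shareable_url helper are hypothetical.

```python
from urllib.parse import quote


def shareable_url(hub_base: str, notebook_path: str) -> str:
    """Build a link that JupyterHub redirects to the visitor's own server."""
    # JupyterHub rewrites /hub/user-redirect/<rest> to /user/<name>/<rest>
    # for whoever is logged in, so one URL works for every recipient.
    encoded = quote(notebook_path.lstrip("/"))  # keeps "/" and escapes spaces
    return f"{hub_base.rstrip('/')}/hub/user-redirect/lab/tree/{encoded}"


if __name__ == "__main__":
    print(shareable_url("https://tdr.example.com", "shared/weekly report.ipynb"))
```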
Additional Capabilities – TDR bundles pandas_profiling for quick data quality reports, a searchable data catalog built from Redshift metadata, and a Plotly‑based 3D content explorer for media assets. Future plans include integrating TensorBoard and connecting TDR to remote Spark and TensorFlow clusters.
Bitu Technology
Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
