
Optimizing Data Access in Tubi Data Runtime: Redshift Connector, SQL Cell Magic, and JupyterLab Extensions

This article explains how Tubi Data Runtime (TDR) streamlines data access on JupyterHub with an optimized Redshift connector, a custom SQL cell magic, and JupyterLab extensions for data exploration. Together these reduce latency and resource usage while improving collaboration and usability for data scientists and engineers.

Bitu Technology

Tubi uses Jupyter Notebook as the unified platform for data analysis and data science. All notebooks run on a customized JupyterHub deployment, referred to as Tubi Data Runtime (TDR), which is built on Kubernetes.

Optimized Data Connector – Most queries to AWS Redshift and S3 return less than 1 GB of data, which fits comfortably in memory. The legacy workflow required launching a Spark cluster, converting Redshift results to a Spark DataFrame, then to Pandas, incurring high latency. TDR replaces this with a direct Redshift UNLOAD command that writes CSV shards to S3, reads them concurrently with multiprocessing, and concatenates the parts with pandas.concat:

# Legacy path, used by ~95% of Redshift queries: a full Spark round-trip
df = query_redshift(sc, sql).toPandas()

The replacement reads the UNLOAD output from S3 directly:
def to_df(self, nproc=None, **kwargs):
    """Read the UNLOAD CSV shards from S3 in parallel and concatenate them."""
    parser_options = self._get_parser_options(**kwargs)
    # Temporary credentials scoped to the UNLOAD output location
    s3 = S3FileSystem(key=self.temp_credentials[0],
                      secret=self.temp_credentials[1],
                      token=self.temp_credentials[2])
    # One argument tuple per CSV shard listed in the UNLOAD manifest
    args = [(s3.open(f, 'rb'), self.manifest, True, parser_options, self.temp_credentials)
            for f in self.manifest.iter_files()]
    with util.create_executor(nproc) as executor:
        parts = executor(_read_part, args)  # one DataFrame per shard
    return pd.concat(parts, ignore_index=True)
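The read-and-concatenate step can be exercised locally with in-memory buffers standing in for the S3 shards (the data here is illustrative):

```python
import io
import pandas as pd

# Local stand-in for the CSV shards that Redshift UNLOAD writes to S3.
shards = [
    io.StringIO("id,views\n1,10\n2,20\n"),
    io.StringIO("id,views\n3,30\n"),
]
parts = [pd.read_csv(buf) for buf in shards]  # one DataFrame per shard
df = pd.concat(parts, ignore_index=True)      # single contiguous index
# df has 3 rows spanning both shards
```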

The helper create_executor uses multiprocessing.Pool.starmap to read CSV parts in parallel:

from contextlib import contextmanager
import multiprocessing as mp

@contextmanager
def create_executor(nproc):
    # Yield the pool's starmap so callers can map a function over argument tuples
    pool = mp.Pool(nproc)
    try:
        yield pool.starmap
    finally:
        pool.close()  # no more tasks
        pool.join()   # wait for workers to exit
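The helper can be exercised with a trivial function; this sketch substitutes a thread pool so the demo is safe to run anywhere (the TDR helper uses `mp.Pool`, which has the same `starmap` interface but forks worker processes):

```python
from contextlib import contextmanager
from multiprocessing.pool import ThreadPool  # thread-based stand-in for mp.Pool

@contextmanager
def create_executor(nproc):
    pool = ThreadPool(nproc)
    try:
        yield pool.starmap  # callable as executor(func, iterable_of_arg_tuples)
    finally:
        pool.close()
        pool.join()

# Each tuple becomes one call: pow(2, 3), pow(3, 2), pow(4, 1)
with create_executor(2) as executor:
    results = executor(pow, [(2, 3), (3, 2), (4, 1)])
# results == [8, 9, 4]
```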

Because the CSV files are generated directly by Redshift UNLOAD, they are well‑formed, eliminating read errors.
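For reference, the UNLOAD statement behind this flow looks roughly like the following sketch; the table, S3 prefix, and IAM role are hypothetical placeholders, and the exact options TDR uses are not given in the source:

```python
def build_unload(sql: str, s3_prefix: str, iam_role: str) -> str:
    # UNLOAD takes the query as a quoted string, so single quotes are doubled.
    # MANIFEST makes Redshift also write a manifest file listing every shard,
    # which is what to_df() iterates over.
    escaped = sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER ',' ESCAPE ALLOWOVERWRITE MANIFEST"
    )

stmt = build_unload(
    "SELECT * FROM events WHERE ds = '2020-01-01'",   # hypothetical query
    "s3://tdr-tmp/unload/part_",                      # hypothetical prefix
    "arn:aws:iam::123456789012:role/redshift-unload", # hypothetical role
)
```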

SQL Cell Magic – To lower the barrier for users comfortable with SQL but new to Python, TDR provides a %%sql cell magic. Running a cell that starts with %%sql executes the cell body via tdr.query_redshift(cell).to_df() and binds the resulting DataFrame to the variable df:

from IPython.core.magic import cell_magic, Magics, magics_class

@magics_class
class TubiDataRuntimeMagics(Magics):
    @cell_magic
    def sql(self, line='', cell=None):
        # Run the cell body against Redshift and expose the result as `df`
        import tubi_data_runtime as tdr
        self.shell.user_ns['df'] = tdr.query_redshift(cell).to_df()

# Registered at startup, e.g.: get_ipython().register_magics(TubiDataRuntimeMagics)

The magic also supports Jinja2 templating, allowing reusable SQL snippets such as {{ott_app}} to be inserted into queries.
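The substitution step can be illustrated with a minimal stdlib-only stand-in for Jinja2 (the real magic uses Jinja2 itself; the snippet registry and query below are hypothetical):

```python
import re

# Hypothetical registry of reusable SQL snippets keyed by template name.
SNIPPETS = {"ott_app": "platform IN ('ROKU', 'AMAZON', 'ANDROID_TV')"}

def render_sql(template: str, snippets: dict) -> str:
    # Replace each {{name}} placeholder with the registered snippet text.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: snippets[m.group(1)], template)

query = render_sql("SELECT * FROM plays WHERE {{ott_app}}", SNIPPETS)
# query == "SELECT * FROM plays WHERE platform IN ('ROKU', 'AMAZON', 'ANDROID_TV')"
```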

JupyterLab Extension for Data Explorer – To provide an integrated visual exploration experience, a JupyterLab extension renders the nteract data explorer directly inside a notebook cell and persists its state in notebook metadata:

renderModel(model: IRenderMime.IMimeModel): Promise<void> {
    const data = model.data[this._mimeType] as JSONObject;        // table payload
    const metadata = model.metadata.dataExplorer as JSONObject;   // saved explorer state
    const onMetadataChange = (data: object) => {
        // Persist the explorer state in the notebook metadata, then save the notebook
        model.setData({ metadata: { ...model.metadata, dataExplorer: data } });
        findNotebookPanel(this).context.save();
    };
    return new Promise<void>((resolve) => {
        ReactDOM.render(
            <DataExplorer data={data} metadata={metadata} onMetadataChange={onMetadataChange} />,
            this.node,
            resolve
        );
    });
}

Combined with the SQL magic, users can retrieve data with a single %sql cell and visualize it with display(df), dramatically simplifying the workflow for beginners.

Collaboration Features – The team built extensions for shareable notebook URLs (using JupyterHub's /user-redirect/ path) and directory copy‑paste operations, and has open‑sourced these extensions on GitHub.

Additional Capabilities – TDR bundles pandas_profiling for quick data quality reports, a searchable data catalog built from Redshift metadata, and a Plotly‑based 3D content explorer for media assets. Future plans include integrating TensorBoard and connecting TDR to remote Spark and TensorFlow clusters.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Python · Kubernetes · Big Data · JupyterHub · Data Connector · Redshift · SQL Magic
Written by

Bitu Technology

Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
