End‑to‑End Breast Cancer Prediction Solution Using Decision Tree on Tencent Cloud AI Platform

This guide walks through an end‑to‑end breast‑cancer prediction pipeline on Tencent Cloud: offline decision‑tree training with TI‑ONE, serving the model as a PMML service, real‑time feature generation via Oceanus and CKafka, and live inference whose results are written to ClickHouse, with every component deployed in the same VPC.

Tencent Cloud Developer

Nov 19, 2021

This document describes a complete AI‑driven solution for breast‑cancer prediction. It uses the TI‑ONE machine‑learning platform, the TI‑EMS model service, Tencent Cloud Oceanus (Flink), CKafka, ClickHouse, and COS to build, serve, and evaluate a decision‑tree classification model.

1. Overview

Artificial‑intelligence techniques are increasingly applied to medical diagnosis, disease prediction, and therapy selection. Breast cancer is described here as second only to lung cancer in incidence; machine‑learning algorithms can help identify the most predictive clinical features and so support earlier detection.

2. Solution Architecture

The workflow consists of:

Offline model training on TI‑ONE; the trained model is stored in COS.

Model packaging as a PMML service via TI‑EMS.

Real‑time feature generation using Oceanus (Flink) with a Datagen connector, publishing CSV‑formatted features to CKafka.

Oceanus consumes CKafka data, calls the TI‑EMS PMML service, and writes prediction results to ClickHouse.

All services are deployed within the same VPC to ensure network connectivity.

3. Pre‑deployment Preparations

• Create a private VPC and subnets.

• Provision Oceanus clusters, CKafka instances, COS buckets, and ClickHouse clusters, and enable the TI‑ONE and TI‑EMS services.

• Configure role authorizations for each service.

4. Offline Model Training

Use the public Breast Cancer Wisconsin dataset (569 samples, 32 fields, of which 10 are selected for this demo). Train a decision‑tree classifier on TI‑ONE, evaluate it with binary‑classification metrics, and save the model in PMML format to the model repository.
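The text does not say which binary‑classification metrics TI‑ONE reports. As a rough sketch, the usual ones (accuracy, precision, recall, F1) can be derived from confusion‑matrix counts as below. The class name `BinaryMetrics` and the label encoding (1 = malignant, 0 = benign) are assumptions made for illustration, not part of the original solution.

```java
/** Binary-classification metrics derived from confusion-matrix counts (illustrative only). */
final class BinaryMetrics {
    final long tp, fp, tn, fn;

    BinaryMetrics(long tp, long fp, long tn, long fn) {
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    /** Counts outcomes from parallel label arrays; assumes 1 = malignant (positive), 0 = benign. */
    static BinaryMetrics of(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays must have the same length");
        }
        long tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean isPositive = actual[i] == 1;
            boolean saidPositive = predicted[i] == 1;
            if (isPositive && saidPositive) tp++;
            else if (!isPositive && saidPositive) fp++;
            else if (!isPositive) tn++;
            else fn++;
        }
        return new BinaryMetrics(tp, fp, tn, fn);
    }

    /** Share of all samples that were classified correctly. */
    double accuracy() {
        long total = tp + fp + tn + fn;
        return total == 0 ? 0.0 : (double) (tp + tn) / total;
    }

    /** Share of predicted positives that are truly positive. */
    double precision() {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Share of true positives that the model found. */
    double recall() {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    double f1() {
        double p = precision();
        double r = recall();
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

In a screening setting, recall on the malignant class is often the metric to watch, since a missed malignant case is usually costlier than a false alarm.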

5. Real‑time Feature Engineering

Generate synthetic patient records with a Flink SQL source table:

-- random source: simulates real-time feature data from patient records
CREATE TABLE random_source (
    ClumpThickness INT,
    UniformityOfCellSize INT,
    UniformityOfCellShape INT,
    MarginalAdhsion INT,
    SingleEpithelialCellSize INT,
    BareNuclei INT,
    BlandChromation INT,
    NormalNucleoli INT,
    Mitoses INT
) WITH (
  'connector' = 'datagen',
  'rows-per-second'='1',
  'fields.ClumpThickness.kind'='random',
  'fields.ClumpThickness.min'='0',
  'fields.ClumpThickness.max'='10',
  ...
);

Define a Kafka sink to write the generated features:

CREATE TABLE KafkaSink (
    ClumpThickness INT,
    UniformityOfCellSize INT,
    UniformityOfCellShape INT,
    MarginalAdhsion INT,
    SingleEpithelialCellSize INT,
    BareNuclei INT,
    BlandChromation INT,
    NormalNucleoli INT,
    Mitoses INT
) WITH (
    'connector' = 'kafka',
    'topic' = 'topic-decision-tree-predict-1',
    'properties.bootstrap.servers' = '172.28.28.211:9092',
    'properties.group.id' = 'RealTimeFeatures',
    'format' = 'csv'
);

Insert the synthetic data into the sink:

INSERT INTO KafkaSink SELECT * FROM random_source;

6. Real‑time Prediction

Two approaches are provided:

Public HTTP endpoint: after starting the model service in TI‑EMS, obtain a public URL and invoke it with JSON payloads (the original article shows an example curl command).

VPC‑internal call from Oceanus JAR job: a Flink Java program reads from CKafka, formats the payload, sends an HTTP POST to the model service, and writes the prediction back to ClickHouse.

Key Java code (simplified):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.HttpClientBuilder;
... 
public class OnlinePredict {
    public static void main(String[] args) throws Exception {
        // build Kafka source, transform data, call model service, write to ClickHouse
    }
    // helper methods: buildKafkaSource, inputDataTransfer, sendHttpData
}
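The skeleton above only names the `inputDataTransfer` step. A minimal sketch of what such a step might do is to turn one CSV record from CKafka into a JSON body keyed by feature name. The class `FeaturePayload`, its method names, and the flat JSON layout are hypothetical; the request schema that the TI‑EMS PMML service actually expects is not shown here and should be checked against the service itself. The feature identifiers are copied unchanged from the table definitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: converts a CSV feature record into a flat JSON object. */
final class FeaturePayload {
    // Column order follows the KafkaSink table; identifiers are kept exactly as in the source.
    static final String[] FEATURES = {
        "ClumpThickness", "UniformityOfCellSize", "UniformityOfCellShape",
        "MarginalAdhsion", "SingleEpithelialCellSize", "BareNuclei",
        "BlandChromation", "NormalNucleoli", "Mitoses"
    };

    /** Parses one comma-separated record into an ordered feature map. */
    static Map<String, Integer> parseCsv(String line) {
        String[] parts = line.trim().split(",", -1);
        if (parts.length != FEATURES.length) {
            throw new IllegalArgumentException(
                "expected " + FEATURES.length + " fields but got " + parts.length);
        }
        Map<String, Integer> row = new LinkedHashMap<>();
        for (int i = 0; i < FEATURES.length; i++) {
            // Integer.parseInt rejects empty or non-numeric fields with NumberFormatException.
            row.put(FEATURES[i], Integer.parseInt(parts[i].trim()));
        }
        return row;
    }

    /** Serializes the map as a JSON object; keys are fixed identifiers, so no escaping is needed. */
    static String toJson(Map<String, Integer> row) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        row.forEach((name, value) -> json.add("\"" + name + "\":" + value));
        return json.toString();
    }
}
```

Calling `FeaturePayload.toJson(FeaturePayload.parseCsv(record))` on each Kafka record would produce the string that a helper such as `sendHttpData` then posts to the model service. A real job would more likely build the body with the JSON library already listed among the dependencies.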

Configuration for the Kafka source (properties file):

kafka.source.bootstrap.servers=172.28.28.211:9092
kafka.source.topic=topic-decision-tree-predict-1
kafka.source.group.id=RealTimePredict1
kafka.source.auto.offset.reset=latest
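How `buildKafkaSource` reads these entries is not shown. One plausible approach, sketched below, is to load the file with `java.util.Properties` and strip the `kafka.source.` prefix so that the remaining keys (`bootstrap.servers`, `group.id`, `auto.offset.reset`) match standard Kafka consumer settings, while the topic is read separately. The class `KafkaSourceConfig` and its methods are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: maps prefixed job settings onto Kafka consumer properties. */
final class KafkaSourceConfig {
    static final String PREFIX = "kafka.source.";

    /** Parses properties-file text; a real job would read the file from its classpath or arguments. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the topic name, which is passed to the consumer separately from its properties. */
    static String topic(Properties fileProps) {
        return fileProps.getProperty(PREFIX + "topic");
    }

    /** Copies every kafka.source.* entry except the topic, dropping the prefix. */
    static Properties consumerProperties(Properties fileProps) {
        Properties consumer = new Properties();
        for (String key : fileProps.stringPropertyNames()) {
            if (key.startsWith(PREFIX) && !key.equals(PREFIX + "topic")) {
                consumer.setProperty(key.substring(PREFIX.length()), fileProps.getProperty(key));
            }
        }
        return consumer;
    }
}
```

The resulting `Properties` object and topic name are the kind of arguments the `FlinkKafkaConsumer` constructor takes alongside a deserialization schema. Note that `auto.offset.reset=latest` means a job starting without committed offsets only sees records produced after it starts.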

The required Maven dependencies (Flink, Kafka, ClickHouse connector, HTTP client, JSON library) are listed in the original article.

7. Summary

The solution demonstrates how to combine offline model training with real‑time feature engineering and online inference on Tencent Cloud. It highlights the flexibility of choosing different data warehouses (ClickHouse, Hive, MySQL, etc.) and the ability to scale the PMML service for higher throughput.
