End‑to‑End Breast Cancer Prediction Solution Using Decision Tree on Tencent Cloud AI Platform

This guide details an end‑to‑end breast‑cancer prediction pipeline on Tencent Cloud, covering offline decision‑tree training with TI‑ONE, model packaging as a PMML service, real‑time feature generation via Oceanus and CKafka, and live inference stored in ClickHouse, all within a secure VPC.

Tencent Cloud Developer

This document describes a complete AI‑driven solution for breast‑cancer prediction. It uses the TI‑ONE machine‑learning platform, the TI‑EMS model service, Tencent Cloud Oceanus (Flink), CKafka, ClickHouse, and COS to build, serve, and evaluate a decision‑tree classification model.

1. Overview

Artificial‑intelligence techniques are increasingly applied in medical diagnosis, disease prediction, and therapy selection. For breast cancer—second only to lung cancer in incidence—machine‑learning algorithms can identify the most predictive clinical features, enabling early detection.

2. Solution Architecture

The workflow consists of:

• Offline model training on TI‑ONE; the trained model is stored in COS.
• Model packaging as a PMML service via TI‑EMS.
• Real‑time feature generation using Oceanus (Flink) with a Datagen connector, publishing CSV‑formatted features to CKafka.
• Real‑time prediction: Oceanus consumes the CKafka data, calls the TI‑EMS PMML service, and writes prediction results to ClickHouse.

All services are deployed within the same VPC to ensure network connectivity.

3. Pre‑deployment Preparations

• Create a private VPC and subnets.
• Provision Oceanus clusters, CKafka instances, COS buckets, ClickHouse clusters, and enable TI‑ONE/TI‑EMS services.
• Configure role authorizations for each service.

4. Offline Model Training

Use the public Breast Cancer Wisconsin dataset (569 samples, 32 features, 10 selected for this demo). Train a decision‑tree classifier on TI‑ONE, evaluate the binary‑classification metrics, and save the model to the model repository (PMML format).
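
TI‑ONE reports the binary‑classification metrics itself, so nothing below is required by the platform. As a minimal sketch of what those numbers mean, the following Java class derives accuracy, precision, recall, and F1 from true and predicted labels. The class name BinaryMetricsSketch and the label encoding (1 = malignant, 0 = benign) are assumptions made for illustration, not taken from the original pipeline.

```java
// Hypothetical helper for illustration; TI-ONE computes these metrics on its own.
final class BinaryMetricsSketch {
    final long tp, fp, tn, fn;

    private BinaryMetricsSketch(long tp, long fp, long tn, long fn) {
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    // Assumed encoding: 1 = malignant (positive class), anything else = benign.
    static BinaryMetricsSketch of(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        long tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean a = actual[i] == 1;
            boolean p = predicted[i] == 1;
            if (a && p) tp++;
            else if (!a && p) fp++;
            else if (!a && !p) tn++;
            else fn++;
        }
        return new BinaryMetricsSketch(tp, fp, tn, fn);
    }

    // Share of all samples that were classified correctly.
    double accuracy() {
        long n = tp + fp + tn + fn;
        return n == 0 ? 0.0 : (double) (tp + tn) / n;
    }

    // Share of predicted positives that are truly positive.
    double precision() {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Share of true positives that the model found.
    double recall() {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    double f1() {
        double p = precision();
        double r = recall();
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

For a screening task such as this one, recall on the malignant class is usually the number to watch, since a missed positive costs more than a false alarm.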

5. Real‑time Feature Engineering

Generate synthetic patient records with a Flink SQL source table:

-- random source: simulates real-time feature data from patient medical records
CREATE TABLE random_source (
    ClumpThickness INT,
    UniformityOfCellSize INT,
    UniformityOfCellShape INT,
    MarginalAdhsion INT,
    SingleEpithelialCellSize INT,
    BareNuclei INT,
    BlandChromation INT,
    NormalNucleoli INT,
    Mitoses INT
) WITH (
  'connector' = 'datagen',
  'rows-per-second'='1',
  'fields.ClumpThickness.kind'='random',
  'fields.ClumpThickness.min'='0',
  'fields.ClumpThickness.max'='10',
  ...
);

Define a Kafka sink to write the generated features:

CREATE TABLE KafkaSink (
    ClumpThickness INT,
    UniformityOfCellSize INT,
    UniformityOfCellShape INT,
    MarginalAdhsion INT,
    SingleEpithelialCellSize INT,
    BareNuclei INT,
    BlandChromation INT,
    NormalNucleoli INT,
    Mitoses INT
) WITH (
    'connector' = 'kafka',
    'topic' = 'topic-decision-tree-predict-1',
    'properties.bootstrap.servers' = '172.28.28.211:9092',
    'properties.group.id' = 'RealTimeFeatures',
    'format' = 'csv'
);

Insert the synthetic data into the sink:

INSERT INTO KafkaSink SELECT * FROM random_source;

6. Real‑time Prediction

Two approaches are provided:

Public HTTP endpoint: after starting the model service in TI‑EMS, obtain a public URL and invoke it with a JSON payload, for example via curl.

VPC‑internal call from Oceanus JAR job: a Flink Java program reads from CKafka, formats the payload, sends an HTTP POST to the model service, and writes the prediction back to ClickHouse.
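
Both approaches end in an HTTP POST that carries a JSON body. The sketch below builds such a request with the JDK's own java.net.http API (Java 11 or later), swapped in here for the Apache HttpClient that the article's job imports so that the example needs no extra dependency. The endpoint URL is a placeholder and the class name PredictRequestSketch is hypothetical; the real URL comes from the TI‑EMS console, and any authentication headers the service may require are not shown.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper; the JDK HTTP client stands in for Apache HttpClient.
final class PredictRequestSketch {
    // Placeholder only: replace with the URL shown for your TI-EMS service.
    static final String ENDPOINT = "https://example.invalid/v1/predict";

    // Builds (but does not send) a JSON POST request for the model service.
    static HttpRequest build(String endpoint, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(5))            // fail fast inside a streaming job
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

Sending the request is then a single call: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Inside a Flink operator, one client instance would typically be created once and reused for every record rather than built per call.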

Key Java code (simplified):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.HttpClientBuilder;
... 
public class OnlinePredict {
    public static void main(String[] args) throws Exception {
        // build Kafka source, transform data, call model service, write to ClickHouse
    }
    // helper methods: buildKafkaSource, inputDataTransfer, sendHttpData
}
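
The inputDataTransfer step is only named above, not shown. As a rough sketch of what such a step might do, the class below turns one CSV record from the topic into a flat JSON object keyed by the column names of the KafkaSink table. The JSON shape is an assumption: the request body that the TI‑EMS PMML service actually expects is not given in this article, so treat the output format and the class name InputTransferSketch as hypothetical.

```java
import java.util.StringJoiner;

// Hypothetical stand-in for the job's inputDataTransfer helper.
final class InputTransferSketch {
    // Column order and spelling copied from the KafkaSink table definition.
    static final String[] FEATURES = {
        "ClumpThickness", "UniformityOfCellSize", "UniformityOfCellShape",
        "MarginalAdhsion", "SingleEpithelialCellSize", "BareNuclei",
        "BlandChromation", "NormalNucleoli", "Mitoses"
    };

    // Converts "5,1,1,1,2,1,3,1,1" into {"ClumpThickness":5,...}.
    // The flat-object payload shape is assumed, not confirmed by the source.
    static String toJson(String csvLine) {
        String[] parts = csvLine.trim().split(",", -1);
        if (parts.length != FEATURES.length) {
            throw new IllegalArgumentException(
                "expected " + FEATURES.length + " fields, got " + parts.length);
        }
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (int i = 0; i < FEATURES.length; i++) {
            // parseInt rejects non-numeric fields with NumberFormatException.
            int value = Integer.parseInt(parts[i].trim());
            json.add("\"" + FEATURES[i] + "\":" + value);
        }
        return json.toString();
    }
}
```

Validating the field count before building the payload keeps a malformed record from reaching the model service; in a streaming job such records would usually be logged or routed to a side output instead of failing the whole pipeline.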

Configuration for the Kafka source (properties file):

kafka.source.bootstrap.servers=172.28.28.211:9092
kafka.source.topic=topic-decision-tree-predict-1
kafka.source.group.id=RealTimePredict1
kafka.source.auto.offset.reset=latest

Required Maven dependencies (Flink, Kafka, ClickHouse connector, HTTP client, JSON library) are listed in the document.

7. Summary

The solution demonstrates how to combine offline model training with real‑time feature engineering and online inference on Tencent Cloud. It highlights the flexibility of choosing different data warehouses (ClickHouse, Hive, MySQL, etc.) and the ability to scale the PMML service for higher throughput.
