How to Build AI-Powered iOS Apps with Core ML, Create ML, and Vision

This article explains how to integrate artificial‑intelligence capabilities such as image classification, speech‑to‑text, and facial‑expression analysis into iOS applications using Apple’s Core ML, Create ML, and Vision frameworks, providing step‑by‑step guidance, code samples, and future‑direction insights.

iKang Technology Team

Dec 12, 2024

Why Integrate AI on Mobile?

Embedding AI into mobile apps enhances user experience, improves processing efficiency, and enables offline operation through on‑device inference, which reduces latency and protects privacy.

Enhanced UX: intelligent recommendations, voice assistants, image recognition.

Higher efficiency: large‑scale data processing and predictive analysis.

Offline capability: Core ML runs models locally, eliminating network dependence.

Expected Outcomes

1. Image Classification

Goal: user uploads a photo, the app automatically identifies the content (e.g., car or landscape).

Output: class label with confidence, e.g., "Car (95%)".

2. Speech Recognition

Goal: convert spoken input to text in real time.

Output: transcribed text such as "Hello, how's the weather today?".

3. Facial Expression Analysis

Goal: detect emotional state (happy, sad, etc.) from a face image.

Output: emotion label, e.g., "Happy", plus a confidence score when the classifier provides one.

Core Frameworks for AI on iOS

Core ML

Apple’s on‑device machine‑learning framework. It runs models in the Core ML format (.mlmodel or .mlpackage); models trained in other frameworks such as TensorFlow or PyTorch must first be converted, typically with Apple’s coremltools Python package. Inference can use the CPU, GPU, and Neural Engine. It can handle tasks such as deep‑learning image recognition, text processing, and speech recognition.
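
As a minimal sketch of how an app can steer that hardware choice, the helper below builds an MLModelConfiguration. The function name makeConfiguration and its cpuOnly flag are illustrative assumptions, not part of the original project; computeUnits and its .cpuOnly and .all values are standard Core ML API.

```swift
import CoreML

/// Hypothetical helper: chooses which processors Core ML may use for inference.
func makeConfiguration(cpuOnly: Bool) -> MLModelConfiguration {
    let configuration = MLModelConfiguration()
    // .all lets Core ML pick among CPU, GPU, and Neural Engine;
    // .cpuOnly restricts inference to the CPU, which can help when debugging.
    configuration.computeUnits = cpuOnly ? .cpuOnly : .all
    return configuration
}
```

The result can be passed to an Xcode-generated model class, for example try MobileNetV2(configuration: makeConfiguration(cpuOnly: false)).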

Create ML

A graphical tool and Swift API for quickly creating, training, and exporting .mlmodel files. It offers a low‑code workflow for common tasks like image classification, text classification, object detection, and time‑series prediction.
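
For the Swift API route, a training run might look like the sketch below. It assumes a macOS playground or command-line tool and a training folder whose subfolders are named after the classes (for example TrainingData/Car and TrainingData/Landscape). The function name and both URL parameters are placeholders chosen for illustration.

```swift
import CreateML
import Foundation

/// Hypothetical helper: trains an image classifier from labeled folders
/// and exports it as a .mlmodel file that can be added to an Xcode project.
func trainImageClassifier(trainingDirectory: URL, outputURL: URL) throws {
    // Each subfolder name under trainingDirectory becomes a class label.
    let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDirectory))
    // Fraction of training images that the finished model misclassifies.
    print("Training error: \(classifier.trainingMetrics.classificationError)")
    try classifier.write(to: outputURL)
}
```

Training time and accuracy depend heavily on the size and balance of the data set, so treat the printed error only as a first sanity check.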

Vision

A powerful computer‑vision framework tightly integrated with Core ML. It provides image‑and‑video analysis capabilities such as face detection, object recognition, barcode scanning, and real‑time tracking, all running locally on the device.
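
Vision reports results such as face bounding boxes in normalized coordinates (0 to 1) with the origin at the bottom-left, while UIKit views use points with a top-left origin. The sketch below converts between the two. The function name viewRect is an assumption made for illustration, and the code assumes the image fills the view exactly, with no letterboxing.

```swift
import Foundation

/// Hypothetical helper: maps a Vision-style normalized rectangle
/// (origin bottom-left) to view coordinates (origin top-left).
func viewRect(forNormalizedRect box: CGRect, in viewSize: CGSize) -> CGRect {
    CGRect(x: box.minX * viewSize.width,
           // Flip the y axis: the top edge of the box becomes its distance from the top of the view.
           y: (1 - box.maxY) * viewSize.height,
           width: box.width * viewSize.width,
           height: box.height * viewSize.height)
}
```

For example, a box at (0.25, 0.5) with size 0.5 × 0.25 in a 200 × 400 view maps to a 100 × 100 rectangle at (50, 100).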

Implementation Steps

Image Classification

Pre‑process the image: resize it to the model’s required input dimensions (224×224 for the models used here). Data augmentation applies only when training a model, not at inference time.

Feature extraction: use a CNN (e.g., ResNet‑50 or MobileNet V2) to obtain hierarchical features.

Classification: map features to class probabilities via a fully‑connected layer and select the highest‑confidence label.
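
The classification step can be illustrated in plain Swift. Core ML classifier models normally return probabilities already (as in classLabelProbs), so this is only a sketch of what the final layer does; the function names softmax and topLabel are assumptions introduced here.

```swift
import Foundation

/// Converts raw scores (logits) into probabilities that sum to 1.
func softmax(_ logits: [Double]) -> [Double] {
    guard let maxLogit = logits.max() else { return [] }
    // Subtracting the maximum keeps exp() from overflowing without changing the result.
    let exps = logits.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Picks the most probable class and formats it like "Car (95%)".
/// Returns nil when the inputs are empty or their counts differ.
func topLabel(logits: [Double], labels: [String]) -> String? {
    let probabilities = softmax(logits)
    guard logits.count == labels.count,
          let best = probabilities.indices.max(by: { probabilities[$0] < probabilities[$1] })
    else { return nil }
    return "\(labels[best]) (\(Int((probabilities[best] * 100).rounded()))%)"
}
```

For example, topLabel(logits: [2.0, 0.0], labels: ["Car", "Landscape"]) returns "Car (88%)".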

Core code example:

import Combine   // ObservableObject, @Published
import CoreML
import UIKit
import Vision

enum RecognitionModel {
    case vision        // Vision framework
    case resnet        // ResNet‑50 model
    case mobileNet     // MobileNet V2 model
}

class ImageClassifier: ObservableObject {
    @Published var image: UIImage?
    @Published var imageLabel: String?

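    private func classifyWithVision(image: UIImage) {
        // Bail out instead of force-unwrapping if the image cannot be converted.
        guard let ciImage = CIImage(image: image) else { return }
        let request = VNClassifyImageRequest { [weak self] request, _ in
            guard let results = request.results as? [VNClassificationObservation],
                  let topResult = results.first else { return }
            let label = "\(topResult.identifier) (\(Int(topResult.confidence * 100))%)"
            // @Published properties drive the UI, so update them on the main queue.
            DispatchQueue.main.async { self?.imageLabel = label }
        }
        let handler = VNImageRequestHandler(ciImage: ciImage)
        do { try handler.perform([request]) } catch { print("Vision classification failed: \(error)") }
    }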

    private func classifyWithResNet(image: UIImage) {
        // Return early if loading, resizing, or prediction fails instead of force-unwrapping.
        guard let model = try? Resnet50FP16(configuration: MLModelConfiguration()),
              let buffer = image.resize(to: CGSize(width: 224, height: 224))?.cvPixelBuffer,
              let output = try? model.prediction(image: buffer) else { return }
        let confidence = output.classLabelProbs[output.classLabel] ?? 0
        let label = "\(output.classLabel) (\(Int(confidence * 100))%)"
        DispatchQueue.main.async { self.imageLabel = label }
    }

    private func classifyWithMobileNet(image: UIImage) {
        // Same flow as ResNet; MobileNet V2 also takes a 224x224 input.
        guard let model = try? MobileNetV2(configuration: MLModelConfiguration()),
              let buffer = image.resize(to: CGSize(width: 224, height: 224))?.cvPixelBuffer,
              let output = try? model.prediction(image: buffer) else { return }
        let confidence = output.classLabelProbs[output.classLabel] ?? 0
        let label = "\(output.classLabel) (\(Int(confidence * 100))%)"
        DispatchQueue.main.async { self.imageLabel = label }
    }
}

extension UIImage {
    func resize(to size: CGSize) -> UIImage? {
        UIGraphicsBeginImageContextWithOptions(size, false, 1.0)
        self.draw(in: CGRect(origin: .zero, size: size))
        let resized = UIGraphicsGetImageFromCurrentImageContext()
        UIGraphicsEndImageContext()
        return resized
    }
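    var cvPixelBuffer: CVPixelBuffer? {
        guard let cgImage = self.cgImage else { return nil }
        // resize(to:) renders at scale 1.0, so the point size equals the pixel size here.
        let width = Int(self.size.width), height = Int(self.size.height)
        let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
                     kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue] as CFDictionary
        var pixelBuffer: CVPixelBuffer?
        guard CVPixelBufferCreate(kCFAllocatorDefault, width, height, kCVPixelFormatType_32ARGB, attrs, &pixelBuffer) == kCVReturnSuccess,
              let buffer = pixelBuffer else { return nil }
        // Creating the buffer only allocates memory; the image pixels must be drawn into it.
        CVPixelBufferLockBaseAddress(buffer, [])
        defer { CVPixelBufferUnlockBaseAddress(buffer, []) }
        guard let context = CGContext(data: CVPixelBufferGetBaseAddress(buffer), width: width, height: height,
                                      bitsPerComponent: 8, bytesPerRow: CVPixelBufferGetBytesPerRow(buffer),
                                      space: CGColorSpaceCreateDeviceRGB(),
                                      bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue) else { return nil }
        context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
        return buffer
    }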
}
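
An alternative to resizing by hand is to let Vision drive the Core ML model, since Vision scales and converts the image itself. The following is a sketch under stated assumptions rather than the original project's code: the function name classify and its completion signature are hypothetical, and image orientation is ignored for brevity.

```swift
import CoreML
import UIKit
import Vision

/// Hypothetical helper: classifies an image with any Core ML image classifier via Vision.
/// The completion handler is called on the main queue with a label such as "Car (95%)", or nil on failure.
func classify(_ image: UIImage, with mlModel: MLModel, completion: @escaping (String?) -> Void) {
    guard let cgImage = image.cgImage,
          let visionModel = try? VNCoreMLModel(for: mlModel) else {
        completion(nil)
        return
    }
    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        let top = (request.results as? [VNClassificationObservation])?.first
        let label = top.map { "\($0.identifier) (\(Int($0.confidence * 100))%)" }
        DispatchQueue.main.async { completion(label) }
    }
    // Crop the center square instead of stretching the image to the model's input size.
    request.imageCropAndScaleOption = .centerCrop
    // Run inference off the main thread so the UI stays responsive.
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
        } catch {
            DispatchQueue.main.async { completion(nil) }
        }
    }
}
```

With an Xcode-generated class, the underlying model is available through its model property, for example classify(photo, with: mobileNet.model) { label in ... }.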

Speech Recognition

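Audio capture: use AVAudioEngine to tap the microphone input and collect audio buffers; the recognizer handles feature extraction internally.

Real‑time transcription: append the audio buffers to an SFSpeechAudioBufferRecognitionRequest and run it with SFSpeechRecognizer. Recognition may use Apple’s servers unless requiresOnDeviceRecognition is set and the recognizer reports supportsOnDeviceRecognition for the chosen locale.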

Optional file recognition: use SFSpeechURLRecognitionRequest for pre‑recorded audio files.

Core code example:

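import AVFoundation
import Combine   // ObservableObject, @Published
import Speech

class SpeechRecognitionManager: ObservableObject {
    // "zh-CN" transcribes Mandarin; change the identifier to match the spoken language.
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "zh-CN"))
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()
    @Published var isRecording = false
    @Published var recognizedText = ""

    // Call on the main thread, after speech-recognition and microphone permission have been granted.
    func startRecording() {
        guard let recognizer = speechRecognizer, recognizer.isAvailable, !audioEngine.isRunning else { return }
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        // Keep audio on the device when the recognizer supports that for this locale.
        request.requiresOnDeviceRecognition = recognizer.supportsOnDeviceRecognition
        recognitionRequest = request
        let inputNode = audioEngine.inputNode
        do {
            let audioSession = AVAudioSession.sharedInstance()
            try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
            try audioSession.setActive(true)
            // Stream microphone buffers into the recognition request.
            inputNode.installTap(onBus: 0, bufferSize: 1024, format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
                request.append(buffer)
            }
            audioEngine.prepare()
            try audioEngine.start()
        } catch {
            inputNode.removeTap(onBus: 0)
            recognitionRequest = nil
            return
        }
        isRecording = true
        recognitionTask = recognizer.recognitionTask(with: request) { [weak self] result, _ in
            guard let text = result?.bestTranscription.formattedString else { return }
            // The handler may run off the main thread; publish UI state on the main queue.
            DispatchQueue.main.async { self?.recognizedText = text }
        }
    }

    func stopRecording() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest?.endAudio()   // lets the recognizer deliver its final result
        recognitionRequest = nil
        isRecording = false
    }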

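    func recognizeAudioFile(url: URL) {
        let request = SFSpeechURLRecognitionRequest(url: url)
        request.shouldReportPartialResults = true
        // Keep a reference to the task so it can be cancelled later.
        recognitionTask = speechRecognizer?.recognitionTask(with: request) { [weak self] result, _ in
            guard let text = result?.bestTranscription.formattedString else { return }
            // Publish on the main queue because recognizedText drives the UI.
            DispatchQueue.main.async { self?.recognizedText = text }
        }
    }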
}
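
Speech recognition and microphone access both require user permission before recording can start. The sketch below shows one way to request them; the function name requestSpeechPermissions is an assumption, and the app's Info.plist must contain the NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription keys or the system will terminate the app when access is requested.

```swift
import AVFoundation
import Speech

/// Hypothetical helper: asks for speech-recognition and microphone permission,
/// then reports on the main queue whether both were granted.
func requestSpeechPermissions(completion: @escaping (Bool) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        guard status == .authorized else {
            DispatchQueue.main.async { completion(false) }
            return
        }
        // Deprecated as of iOS 17 in favor of AVAudioApplication, but still available.
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            DispatchQueue.main.async { completion(granted) }
        }
    }
}
```

A typical call site would start recording only inside the completion handler, once the result is true.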

Facial Expression Analysis

Face detection: use Vision’s VNDetectFaceLandmarksRequest to locate each face and its landmarks (VNDetectFaceRectanglesRequest returns only bounding boxes).

Feature extraction: compute mouth aspect ratio, mouth angle, eyebrow height, eye opening.

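Emotion classification: map the extracted features to an emotion, either with threshold rules (as in the snippet that follows) or with a trained Core ML classifier; a CNN would instead take the cropped face image as input.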

Result display: show the emotion label, plus a confidence score when the classifier provides one.

Core code snippet for feature calculation:

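import Foundation  // CGPoint, CGFloat, atan2

// Assumed result type; `.neutral` is a fallback added here for faces that match no rule.
enum Emotion { case happy, sad, angry, surprise, neutral }

// All point arrays are Vision landmark points normalized to the face bounding box (0...1, y up).
func classifyEmotion(mouth: [CGPoint], leftBrow: [CGPoint], rightBrow: [CGPoint],
                     leftEye: [CGPoint], rightEye: [CGPoint]) -> Emotion {
    // Extent (max - min) of a coordinate list; 0 when the list is empty.
    func span(_ values: [CGFloat]) -> CGFloat { (values.max() ?? 0) - (values.min() ?? 0) }
    guard mouth.count > 1, !leftBrow.isEmpty, !rightBrow.isEmpty else { return .neutral }
    let mouthHeight = span(mouth.map { $0.y })
    let mouthWidth = span(mouth.map { $0.x })
    guard mouthWidth > 0 else { return .neutral }  // avoid dividing by zero
    let mouthAspectRatio = mouthHeight / mouthWidth
    // Assumes the lip points start at one mouth corner and reach the other halfway through the array.
    let mouthCorners = [mouth[0], mouth[mouth.count / 2]]
    let mouthAngle = atan2(mouthCorners[1].y - mouthCorners[0].y, mouthCorners[1].x - mouthCorners[0].x)
    let leftBrowHeight = leftBrow.map { $0.y }.reduce(0, +) / CGFloat(leftBrow.count)
    let rightBrowHeight = rightBrow.map { $0.y }.reduce(0, +) / CGFloat(rightBrow.count)
    let avgBrowHeight = (leftBrowHeight + rightBrowHeight) / 2
    let avgEyeHeight = (span(leftEye.map { $0.y }) + span(rightEye.map { $0.y })) / 2
    // Hand-tuned thresholds; expect to recalibrate them against real faces.
    if mouthAngle > 0.05 && mouthAspectRatio < 0.35 && mouthWidth > 0.3 { return .happy }
    if mouthAngle < -0.05 && avgBrowHeight < 0.65 && mouthWidth < 0.35 { return .sad }
    if avgBrowHeight < 0.55 && mouthHeight < 0.15 && mouthWidth < 0.3 { return .angry }
    if avgBrowHeight > 0.65 && mouthHeight > 0.25 && avgEyeHeight > 0.2 { return .surprise }
    return .neutral
}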
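
The feature calculation takes arrays of landmark points for the mouth, eyebrows, and eyes as its input. One way to obtain them from Vision is sketched below, assuming Xcode 13 or later (where the request's results are typed as face observations). The FacePoints struct and the detectFacePoints function are hypothetical names introduced for this example, and only the first detected face is used.

```swift
import CoreGraphics
import Vision

/// Hypothetical container for the landmark groups used by the heuristic.
/// Points are normalized to the face bounding box, with y increasing upward.
struct FacePoints {
    let mouth, leftBrow, rightBrow, leftEye, rightEye: [CGPoint]
}

/// Runs face-landmark detection synchronously and returns the points of the first face, if any.
func detectFacePoints(in cgImage: CGImage) throws -> FacePoints? {
    let request = VNDetectFaceLandmarksRequest()
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
    guard let landmarks = request.results?.first?.landmarks,
          let mouth = landmarks.outerLips,
          let leftBrow = landmarks.leftEyebrow,
          let rightBrow = landmarks.rightEyebrow,
          let leftEye = landmarks.leftEye,
          let rightEye = landmarks.rightEye else { return nil }
    return FacePoints(mouth: mouth.normalizedPoints,
                      leftBrow: leftBrow.normalizedPoints,
                      rightBrow: rightBrow.normalizedPoints,
                      leftEye: leftEye.normalizedPoints,
                      rightEye: rightEye.normalizedPoints)
}
```

Because perform runs synchronously, call this from a background queue when processing camera frames or large photos.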

Future Development Directions

Edge computing proliferation: real‑time processing for image, speech, and multimodal fusion.

Personalized on‑device models: training with Create ML directly on the user’s device, combined with privacy‑preserving techniques such as differential privacy.

Multimodal AI integration: combining voice commands with image classification and emotion analysis for richer interactions.

No‑code/low‑code AI development: more user‑friendly Create ML interfaces that allow drag‑and‑drop model generation and rapid Xcode integration.

Application Scenarios

Healthcare: disease detection from medical images, health monitoring via voice or facial cues, rehabilitation guidance.

Smart Assistants: multimodal interaction, task prioritization through sentiment analysis, privacy‑first local voice command processing.

Enterprise Services: intelligent customer support, meeting transcription, predictive analytics on business KPIs.
