How to Build AI-Powered iOS Apps with Core ML, Create ML, and Vision
This article explains how to integrate artificial‑intelligence capabilities such as image classification, speech‑to‑text, and facial‑expression analysis into iOS applications using Apple’s Core ML, Create ML, and Vision frameworks, providing step‑by‑step guidance, code samples, and future‑direction insights.
Why Integrate AI on Mobile?
Embedding AI into mobile apps enhances user experience, improves processing efficiency, and enables offline operation through on‑device inference, which reduces latency and protects privacy.
Enhanced UX: intelligent recommendations, voice assistants, image recognition.
Higher efficiency: large‑scale data processing and predictive analysis.
Offline capability: Core ML runs models locally, eliminating network dependence.
Expected Outcomes
1. Image Classification
Goal: the user uploads a photo and the app automatically identifies its content (e.g., car or landscape).
Output: class label with confidence, e.g., "Car (95%)".
2. Speech Recognition
Goal: convert spoken input to text in real time.
Output: transcribed text such as "Hello, how's the weather today?".
3. Facial Expression Analysis
Goal: detect emotional state (happy, sad, etc.) from a face image.
Output: emotion label with confidence, e.g., "Happy".
Core Frameworks for AI on iOS
Core ML
Apple’s on‑device machine‑learning framework. It runs models in its own format (.mlmodel or .mlpackage); models trained in other frameworks such as TensorFlow or PyTorch must first be converted, typically with Apple’s coremltools package. At inference time Core ML can use the CPU, GPU, and Neural Engine. It can handle tasks such as deep‑learning image recognition, text processing, and audio analysis.
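As a small illustration of the hardware selection described above, the following configuration sketch shows how an app can tell Core ML which compute units it may use. It assumes only the standard MLModelConfiguration API; the commented‑out alternatives are options to try, not recommendations from the original article.

```swift
import CoreML

// Sketch: choose which hardware Core ML may use for inference.
let configuration = MLModelConfiguration()
configuration.computeUnits = .all            // CPU, GPU, and Neural Engine (the default)
// configuration.computeUnits = .cpuOnly     // CPU only; handy when debugging numerical differences
// configuration.computeUnits = .cpuAndGPU   // exclude the Neural Engine
```

The configuration is then passed to the model class that Xcode generates from the model file, for example Resnet50FP16(configuration: configuration) as in the image‑classification code later in this article.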
Create ML
A graphical tool and Swift API for quickly creating, training, and exporting .mlmodel files. It offers a low‑code workflow for common tasks like image classification, text classification, object detection, and time‑series prediction.
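The Swift API route can be sketched as follows, for example in a macOS playground. This is a hedged illustration rather than the article's own workflow: the folder paths are hypothetical placeholders, and the training folder is assumed to contain one subfolder per class label.

```swift
import CreateML
import Foundation

// Hypothetical paths: each subfolder of TrainingData is named after a class label
// (for example "Car" and "Landscape") and contains that class's images.
let trainingDir = URL(fileURLWithPath: "/path/to/TrainingData")
let outputURL = URL(fileURLWithPath: "/path/to/PhotoClassifier.mlmodel")

do {
    // Train an image classifier from the labeled folders.
    let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDir))

    // classificationError is a fraction in 0...1, so convert it to an accuracy percentage.
    let accuracy = (1.0 - classifier.trainingMetrics.classificationError) * 100
    print("Training accuracy: \(accuracy)%")

    // Export the trained model so it can be added to an Xcode project.
    try classifier.write(to: outputURL)
} catch {
    print("Training failed: \(error)")
}
```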
Vision
A powerful computer‑vision framework tightly integrated with Core ML. It provides image‑and‑video analysis capabilities such as face detection, object recognition, barcode scanning, and real‑time tracking, all running locally on the device.
Implementation Steps
Image Classification
Pre‑process the image: resize it to the model’s required input dimensions (224×224 for the models used here). Data augmentation applies only when training a model, not at inference time.
Feature extraction: use a CNN (e.g., ResNet‑50 or MobileNet V2) to obtain hierarchical features.
Classification: map features to class probabilities via a fully‑connected layer and select the highest‑confidence label.
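The classification step above can be sketched in plain Swift. This is an illustrative example, not code from the article: softmax and topLabel are hypothetical helper names, and the input is assumed to be the raw scores (logits) from a model's final layer. Core ML classifier models normally return probabilities already, so an app rarely needs to do this by hand.

```swift
import Foundation

/// Converts raw model scores (logits) into probabilities using a numerically stable softmax.
func softmax(_ logits: [Double]) -> [Double] {
    guard let maxLogit = logits.max() else { return [] }
    // Subtracting the maximum avoids overflow in exp() without changing the result.
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

/// Returns the highest-probability label formatted like "Car (95%)",
/// or nil when the input is empty or the two arrays differ in length.
func topLabel(logits: [Double], labels: [String]) -> String? {
    guard !logits.isEmpty, logits.count == labels.count else { return nil }
    let probs = softmax(logits)
    guard let best = probs.indices.max(by: { probs[$0] < probs[$1] }) else { return nil }
    return "\(labels[best]) (\(Int((probs[best] * 100).rounded()))%)"
}
```

For example, topLabel(logits: [2.0, 0.0], labels: ["Car", "Landscape"]) returns "Car (88%)".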
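Core code example:
import Combine
import CoreML
import UIKit
import Vision

enum RecognitionModel {
    case vision     // Vision framework's built-in classifier
    case resnet     // ResNet‑50 model
    case mobileNet  // MobileNet V2 model
}

class ImageClassifier: ObservableObject {
    @Published var image: UIImage?
    @Published var imageLabel: String?

    // @Published properties drive the UI, so update them on the main thread.
    private func publish(_ label: String, confidence: Double) {
        DispatchQueue.main.async {
            self.imageLabel = "\(label) (\(Int(confidence * 100))%)"
        }
    }

    private func classifyWithVision(image: UIImage) {
        guard let ciImage = CIImage(image: image) else { return }
        let request = VNClassifyImageRequest { request, _ in
            guard let results = request.results as? [VNClassificationObservation],
                  let topResult = results.first else { return }
            self.publish(topResult.identifier, confidence: Double(topResult.confidence))
        }
        let handler = VNImageRequestHandler(ciImage: ciImage)
        try? handler.perform([request])
    }

    // Resnet50FP16 and MobileNetV2 are the classes Xcode generates from the bundled model files.
    private func classifyWithResNet(image: UIImage) {
        guard let model = try? Resnet50FP16(configuration: MLModelConfiguration()),
              let buffer = image.resize(to: CGSize(width: 224, height: 224))?.cvPixelBuffer,
              let output = try? model.prediction(image: buffer),
              let confidence = output.classLabelProbs[output.classLabel] else { return }
        publish(output.classLabel, confidence: confidence)
    }

    private func classifyWithMobileNet(image: UIImage) {
        guard let model = try? MobileNetV2(configuration: MLModelConfiguration()),
              let buffer = image.resize(to: CGSize(width: 224, height: 224))?.cvPixelBuffer,
              let output = try? model.prediction(image: buffer),
              let confidence = output.classLabelProbs[output.classLabel] else { return }
        publish(output.classLabel, confidence: confidence)
    }
}

extension UIImage {
    func resize(to size: CGSize) -> UIImage? {
        UIGraphicsBeginImageContextWithOptions(size, false, 1.0)
        defer { UIGraphicsEndImageContext() }
        self.draw(in: CGRect(origin: .zero, size: size))
        return UIGraphicsGetImageFromCurrentImageContext()
    }

    var cvPixelBuffer: CVPixelBuffer? {
        guard let cgImage = self.cgImage else { return nil }
        let width = Int(size.width), height = Int(size.height)
        let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
                     kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue] as CFDictionary
        var pixelBuffer: CVPixelBuffer?
        guard CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                  kCVPixelFormatType_32ARGB, attrs, &pixelBuffer) == kCVReturnSuccess,
              let buffer = pixelBuffer else { return nil }
        CVPixelBufferLockBaseAddress(buffer, [])
        defer { CVPixelBufferUnlockBaseAddress(buffer, []) }
        // Draw the image into the buffer's memory; without this step the buffer stays empty.
        guard let context = CGContext(data: CVPixelBufferGetBaseAddress(buffer),
                                      width: width, height: height, bitsPerComponent: 8,
                                      bytesPerRow: CVPixelBufferGetBytesPerRow(buffer),
                                      space: CGColorSpaceCreateDeviceRGB(),
                                      bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue) else { return nil }
        context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
        return buffer
    }
}
Speech Recognition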
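Audio capture: record audio buffers with AVAudioEngine; the Speech framework performs feature extraction internally.
Real‑time transcription: feed the audio buffers to SFSpeechRecognizer for speech‑to‑text conversion. Recognition stays on‑device only when the device and locale support it and the request sets requiresOnDeviceRecognition; otherwise audio may be sent to Apple’s servers.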
Optional file recognition: use SFSpeechURLRecognitionRequest for pre‑recorded audio files.
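Before any of these steps can run, the app needs the user's permission, and on‑device recognition has to be requested explicitly. The sketch below illustrates one way to do both; makeRecognitionRequest is a hypothetical helper name, not part of the Speech framework or the original article. The app's Info.plist must also contain NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription entries.

```swift
import Speech

/// Sketch: ask for speech-recognition permission, then build a request that
/// stays on-device when the recognizer supports it for its locale.
/// Calls completion with nil if permission is denied or the recognizer is unavailable.
func makeRecognitionRequest(for recognizer: SFSpeechRecognizer,
                            completion: @escaping (SFSpeechAudioBufferRecognitionRequest?) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        guard status == .authorized, recognizer.isAvailable else {
            completion(nil)
            return
        }
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        // Only require on-device recognition where it is supported;
        // otherwise the request falls back to server-based recognition.
        if recognizer.supportsOnDeviceRecognition {
            request.requiresOnDeviceRecognition = true
        }
        completion(request)
    }
}
```

The authorization callback may arrive on a background queue, so any UI update made in the completion handler should be dispatched to the main queue.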
Core code example:
import AVFoundation
import Combine
import Speech

class SpeechRecognitionManager: ObservableObject {
    // "zh-CN" recognizes Mandarin; use another identifier such as "en-US" for English.
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "zh-CN"))
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()
    @Published var isRecording = false
    @Published var recognizedText = ""

    // Call only after the user has granted speech‑recognition and microphone permission.
    func startRecording() {
        guard !isRecording, let speechRecognizer = speechRecognizer else { return }
        do {
            let audioSession = AVAudioSession.sharedInstance()
            try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
            try audioSession.setActive(true)

            let request = SFSpeechAudioBufferRecognitionRequest()
            request.shouldReportPartialResults = true
            recognitionRequest = request

            // Result handlers run on the recognizer's queue, which defaults to the main queue.
            recognitionTask = speechRecognizer.recognitionTask(with: request) { [weak self] result, _ in
                if let result = result { self?.recognizedText = result.bestTranscription.formattedString }
            }

            let inputNode = audioEngine.inputNode
            let format = inputNode.outputFormat(forBus: 0)
            inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
                request.append(buffer)
            }
            audioEngine.prepare()
            try audioEngine.start()
            isRecording = true
        } catch {
            stopRecording()
        }
    }

    // Stops audio capture and lets the recognizer finish with the audio received so far.
    func stopRecording() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest?.endAudio()
        recognitionTask = nil
        recognitionRequest = nil
        isRecording = false
    }

    func recognizeAudioFile(url: URL) {
        let request = SFSpeechURLRecognitionRequest(url: url)
        request.shouldReportPartialResults = true
        speechRecognizer?.recognitionTask(with: request) { [weak self] result, _ in
            if let result = result { self?.recognizedText = result.bestTranscription.formattedString }
        }
    }
}
Facial Expression Analysis
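Face detection: use Vision’s VNDetectFaceLandmarksRequest to locate each face and its landmarks (VNDetectFaceRectanglesRequest returns only bounding boxes).
Feature extraction: compute mouth aspect ratio, mouth angle, eyebrow height, and eye opening from the landmark points.
Emotion classification: feed the extracted features, or the cropped face image, into a Core ML model (typically a CNN for images) to predict the emotion. The snippet below uses simple threshold rules instead.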
Result display: show emotion label with confidence.
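The face‑detection step can be sketched with Vision as follows. FacePoints and detectFacePoints are hypothetical names introduced for illustration; the sketch assumes a single face and uses the outer‑lip, eyebrow, and eye regions as inputs for the feature step.

```swift
import CoreGraphics
import Vision

/// Landmark point sets used for expression features, normalized to the face bounding box.
struct FacePoints {
    let mouth, leftBrow, rightBrow, leftEye, rightEye: [CGPoint]
}

/// Returns landmark points for the first detected face,
/// or nil if no face or no landmarks are found.
func detectFacePoints(in image: CGImage) throws -> FacePoints? {
    let request = VNDetectFaceLandmarksRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])  // synchronous; run off the main thread in a real app

    guard let face = (request.results as? [VNFaceObservation])?.first,
          let landmarks = face.landmarks,
          let mouth = landmarks.outerLips?.normalizedPoints,
          let leftBrow = landmarks.leftEyebrow?.normalizedPoints,
          let rightBrow = landmarks.rightEyebrow?.normalizedPoints,
          let leftEye = landmarks.leftEye?.normalizedPoints,
          let rightEye = landmarks.rightEye?.normalizedPoints else { return nil }

    return FacePoints(mouth: mouth, leftBrow: leftBrow, rightBrow: rightBrow,
                      leftEye: leftEye, rightEye: rightEye)
}
```

Vision's normalized coordinates place the origin at the bottom‑left of the face bounding box, so larger y values are higher on the face. Any thresholds applied to these points depend on that convention.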
Core code snippet for feature calculation:
import CoreGraphics
import Foundation

enum Emotion { case happy, sad, angry, surprise, neutral }  // .neutral is an assumed fallback

// Inputs are landmark points normalized to the face bounding box.
func classifyEmotion(mouth: [CGPoint], leftBrow: [CGPoint], rightBrow: [CGPoint],
                     leftEye: [CGPoint], rightEye: [CGPoint]) -> Emotion {
    // Vertical extent, horizontal extent, and mean height of a point set (0 if empty).
    func height(_ p: [CGPoint]) -> CGFloat { (p.map(\.y).max() ?? 0) - (p.map(\.y).min() ?? 0) }
    func width(_ p: [CGPoint]) -> CGFloat { (p.map(\.x).max() ?? 0) - (p.map(\.x).min() ?? 0) }
    func meanY(_ p: [CGPoint]) -> CGFloat { p.isEmpty ? 0 : p.map(\.y).reduce(0, +) / CGFloat(p.count) }

    let mouthHeight = height(mouth)
    let mouthWidth = width(mouth)
    guard mouthWidth > 0 else { return .neutral }
    let mouthAspectRatio = mouthHeight / mouthWidth
    // Assumes mouth[0] and mouth[mouth.count / 2] are the two mouth corners.
    let mouthCorners = [mouth[0], mouth[mouth.count / 2]]
    let mouthAngle = atan2(mouthCorners[1].y - mouthCorners[0].y, mouthCorners[1].x - mouthCorners[0].x)
    let avgBrowHeight = (meanY(leftBrow) + meanY(rightBrow)) / 2
    let avgEyeHeight = (height(leftEye) + height(rightEye)) / 2

    // Hand‑tuned thresholds, checked in order; the first match wins.
    if mouthAngle > 0.05 && mouthAspectRatio < 0.35 && mouthWidth > 0.3 { return .happy }
    if mouthAngle < -0.05 && avgBrowHeight < 0.65 && mouthWidth < 0.35 { return .sad }
    if avgBrowHeight < 0.55 && mouthHeight < 0.15 && mouthWidth < 0.3 { return .angry }
    if avgBrowHeight > 0.65 && mouthHeight > 0.25 && avgEyeHeight > 0.2 { return .surprise }
    return .neutral
}
Future Development Directions
Edge computing proliferation: real‑time processing for image, speech, and multimodal fusion.
Personalized on‑device models: training with Create ML directly on the user’s device, combined with privacy‑preserving techniques such as differential privacy.
Multimodal AI integration: combining voice commands with image classification and emotion analysis for richer interactions.
No‑code/low‑code AI development: more user‑friendly Create ML interfaces that allow drag‑and‑drop model generation and rapid Xcode integration.
Application Scenarios
Healthcare: disease detection from medical images, health monitoring via voice or facial cues, rehabilitation guidance.
Smart Assistants: multimodal interaction, task prioritization through sentiment analysis, privacy‑first local voice command processing.
Enterprise Services: intelligent customer support, meeting transcription, predictive analytics on business KPIs.
iKang Technology Team
The iKang tech team shares their technical and practical experiences in medical‑health projects.
