Fine‑Tune Any Large Model on Apple Silicon with mlx‑tune
This article introduces mlx‑tune, a community project that wraps the MLX library in Unsloth's API so that large language, vision, TTS, STT, OCR, and embedding models can be fine‑tuned locally on Apple Silicon Macs. It outlines the prototype‑to‑cloud workflow, gives installation steps and code examples, and discusses the project's capabilities and limitations.
Mac fine‑tuning limitation
Unsloth relies on Triton, which does not support macOS. Mac users therefore cannot run Unsloth locally; they must either rent cloud GPUs even for small experiments or rewrite their code against the native mlx‑lm API.
mlx‑tune solution
mlx‑tune (github.com/ARahim3/mlx-tune) wraps the MLX library with Unsloth’s API. A script written for a Mac can be run on a CUDA cluster by changing only the import statements.
# Unsloth (CUDA)
from unsloth import FastLanguageModel
from trl import SFTTrainer

# mlx-tune (Apple Silicon)
from mlx_tune import FastLanguageModel
from mlx_tune import SFTTrainer

# The rest of the code is identical!
Supported training methods
SFT : standard instruction fine‑tuning.
DPO / ORPO / KTO / SimPO : full coverage of preference‑learning methods.
GRPO : DeepSeek‑style multi‑generation with reward training.
CPT : continual pre‑training with decoupled learning rates.
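To make the preference‑learning line concrete, here is a minimal sketch of the DPO loss on a single preference pair in plain Python. This illustrates the method itself, not mlx‑tune's internal implementation; in practice the log‑probabilities come from the policy being trained and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much the policy prefers the chosen answer over the rejected one,
    # measured relative to the frozen reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy already prefers "chosen"
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is log 2; it falls toward 0 as the policy's preference for the chosen response grows.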
Multimodal capabilities
Vision : fine‑tuning of Gemma 4, Qwen3.5, PaliGemma, LLaVA, Pixtral VLMs.
TTS : Orpheus, OuteTTS, Spark‑TTS, Sesame/CSM, Qwen3‑TTS.
STT : Whisper, Moonshine, Qwen3‑ASR, NVIDIA Canary, Voxtral.
Embedding : BERT, ModernBERT, Qwen3‑Embedding, Harrier (with contrastive learning).
OCR : DeepSeek‑OCR, GLM‑OCR, olmOCR, Qwen‑VL (built‑in CER/WER metrics).
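The CER/WER metrics mentioned for OCR both reduce to normalized edit distance: character‑level for CER, word‑level for WER. A self‑contained sketch of that computation (illustrative only; mlx‑tune's built‑in metrics may differ in normalization details):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```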
Advanced features
MoE fine‑tuning : supports 39+ MoE architectures, including Qwen3.5‑35B, Mixtral, DeepSeek series.
Gemma 4 Audio : 12‑layer Conformer tower for native 16 kHz audio processing.
LFM2 : Liquid AI hybrid convolution + GQA architecture.
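Grouped‑query attention (GQA), mentioned for LFM2, shares each key/value head across a group of query heads to shrink the KV cache. A minimal numpy sketch of the head grouping (an illustration of the general technique, not the actual LFM2 or mlx‑tune code):

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    where n_kv_heads divides n_q_heads."""
    group = q.shape[0] // k.shape[0]
    # Each KV head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ v

# 8 query heads attend through only 2 KV heads
q = np.random.randn(8, 4, 16)
k = np.random.randn(2, 4, 16)
v = np.random.randn(2, 4, 16)
out = gqa_attention(q, k, v)
```

With 8 query heads and 2 KV heads, the KV cache shrinks 4x while the output keeps the full (8, seq, d) shape.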
Installation
Recommended installer: uv.
# Standard install
uv pip install mlx-tune
# With audio support
uv pip install 'mlx-tune[audio]'
brew install ffmpeg
Minimal SFT example (4‑bit quantized Llama‑3.2)
from mlx_tune import FastLanguageModel, SFTTrainer, SFTConfig
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Llama-3.2-1B-Instruct-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:100]")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        max_steps=50,
    ),
)
trainer.train()
trainer.save_pretrained("lora_model")
trainer.save_pretrained_merged("merged", tokenizer)
trainer.save_pretrained_gguf("model", tokenizer)  # GGUF for Ollama
Vision fine‑tuning example
from mlx_tune import FastVisionModel, UnslothVisionDataCollator, VLMSFTTrainer
from mlx_tune.vlm import VLMSFTConfig
model, processor = FastVisionModel.from_pretrained("mlx-community/Qwen3.5-0.8B-bf16")
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
FastVisionModel.for_training(model)
trainer = VLMSFTTrainer(
    model=model,
    tokenizer=processor,
    data_collator=UnslothVisionDataCollator(model, processor),
    train_dataset=dataset,  # an image-text dataset loaded beforehand (omitted in the original)
    args=VLMSFTConfig(max_steps=30, learning_rate=2e-4),
)
trainer.train()
TTS fine‑tuning example
from mlx_tune import FastTTSModel, TTSSFTTrainer, TTSSFTConfig, TTSDataCollator
from datasets import load_dataset, Audio
model, tokenizer = FastTTSModel.from_pretrained("mlx-community/orpheus-3b-0.1-ft-bf16")
model = FastTTSModel.get_peft_model(model, r=16, lora_alpha=16)
dataset = load_dataset("MrDragonFox/Elise", split="train[:100]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
trainer = TTSSFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=TTSDataCollator(model, tokenizer),
    train_dataset=dataset,
    args=TTSSFTConfig(output_dir="./tts_output", max_steps=60),
)
trainer.train()
Workflow overview
The same code base can be used for local prototyping on a Mac and for large‑scale training on a CUDA cluster.
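That hand‑off can be sketched as a runtime backend switch. This is a sketch under the assumption that both stacks expose the same class names, as the article's import comparison shows; `pick_backend` is a hypothetical helper, not part of either library:

```python
import platform

def pick_backend():
    """Choose the fine-tuning backend module name for the current machine."""
    # Apple Silicon Macs get the MLX-based stack; everything else uses CUDA/Unsloth.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx_tune"
    return "unsloth"

# The chosen module could then be loaded dynamically, e.g.:
#   mod = importlib.import_module(pick_backend())
#   FastLanguageModel = mod.FastLanguageModel
```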
Local Mac (mlx‑tune) Cloud GPU (Unsloth)
├── Quick experiments ├── Large‑scale training
├── Small dataset validation ├── Full dataset
├── Seconds‑level iteration ├── Production‑grade optimization
└── Same code ──────────────────└── Same code
Export formats
HuggingFace : standard checkpoint.
GGUF : directly usable by Ollama or llama.cpp.
push_to_hub : one‑click upload to HuggingFace Hub.
Limitations
Training speed is slower than on an A100 GPU with Unsloth, due to hardware constraints.
GGUF export has restrictions for quantized base models; non‑quantized models are recommended.
Memory is limited by the unified memory of the Mac (up to 512 GB on Mac Studio).
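The unified‑memory ceiling can be sanity‑checked with a back‑of‑the‑envelope weight footprint. This rough estimate covers weights only; activations, the KV cache, and optimizer state add more on top:

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in GB."""
    return n_params * bits_per_param / 8 / 1e9

# A 1B-parameter model quantized to 4 bits needs about 0.5 GB for weights,
# while the same model in bf16 (16 bits) needs about 2 GB.
```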
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.