Running Qwen3‑Embedding on CPU‑Only Machines and Storing Vectors in Redis 8
This guide explains how to run the Qwen3‑Embedding‑0.6B model on a CPU‑only server, configure key parameters, optionally use Intel Extension for PyTorch, and efficiently store the resulting vectors in Redis 8 with proper serialization and indexing.
Environment: two Intel servers (8 CPU cores, 16 GB RAM each) running Python 3.11.6. The embedding model is Qwen/Qwen3-Embedding-0.6B.
1. Model selection – stability on CPU
The goal is reliable execution on CPU‑only hardware. Qwen3-Embedding-0.6B was chosen because it provides the best Chinese text vectorization among small models while keeping memory and compute requirements modest.
Use a modest batch_size (e.g., 2) and increase the number of passes if needed.
Use short texts (e.g., poetry) to limit padding and token overhead.
Set num_threads to the number of physical cores (8); oversubscribing threads on a small box is what caused crashes in testing. A minimal sketch of these settings follows.
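The sketch below uses placeholder sample texts; the actual tokenizer/model pipeline is covered in sections 3–4:
import torch

# Match threads to physical cores to avoid oversubscription on an 8-core box
torch.set_num_threads(8)

texts = ["床前明月光", "疑是地上霜"]  # short texts keep padding overhead low
batch_size = 2
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
# Each small batch is then fed to the tokenizer/model pipeline from section 4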
2. Model acquisition – ModelScope first, HuggingFace fallback
In mainland China, direct access to HuggingFace can be unstable. The recommended workflow tries ModelScope first and falls back to a HuggingFace mirror if necessary.
Optional HuggingFace mirror configuration:
import os
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

Loading the model with automatic fallback:
import os
import logging

import torch
from transformers import AutoModel, AutoTokenizer

logger = logging.getLogger(__name__)

# Configuration (values match the YAML in section 3)
model_name = "Qwen/Qwen3-Embedding-0.6B"
cache_dir = "./models"
trust_remote_code = True
device = "cpu"

# Set mirror (if needed)
hf_endpoint = os.environ.get("HF_ENDPOINT", "https://hf-mirror.com")
if "huggingface.co" not in hf_endpoint:
    os.environ["HF_ENDPOINT"] = hf_endpoint
    logger.info(f"Using HuggingFace mirror: {hf_endpoint}")

try:
    # Try ModelScope download
    from modelscope import snapshot_download
    logger.info("Downloading model from ModelScope…")
    model_ref = snapshot_download(model_name, cache_dir=cache_dir, revision="master")
    logger.info(f"Model downloaded: {model_ref}")
except Exception as e:
    logger.warning(f"ModelScope download failed: {e}")
    logger.info(f"Falling back to HuggingFace mirror ({hf_endpoint})…")
    model_ref = model_name

# Unified loading (model_ref may be a local path or a HF repo id)
tokenizer = AutoTokenizer.from_pretrained(
    model_ref, cache_dir=cache_dir, trust_remote_code=trust_remote_code
)
model = AutoModel.from_pretrained(
    model_ref, cache_dir=cache_dir, trust_remote_code=trust_remote_code
).to(device)
model.eval()

3. CPU‑side key parameters
num_threads: number of CPU threads (set to 8 for an 8‑core server).
batch_size: how many texts are processed together (e.g., 2).
max_length: maximum token length; 512 tokens are sufficient for short Chinese poems.
Example configuration (YAML‑style):
model:
  name: "Qwen/Qwen3-Embedding-0.6B"
  cache_dir: "./models"
  device: "cpu"
  num_threads: 8
  use_modelscope: true
  trust_remote_code: true

vectorization:
  batch_size: 2
  max_length: 512
  normalize: true
  query_instruction: "query: "

Initialize the thread pool with:
torch.set_num_threads(num_threads)

4. Vectorization pipeline
Clean text: collapse consecutive whitespace into a single space.
Query prefix: prepend "query: " to query strings; document texts are used as‑is after cleaning.
Mean pooling: average token embeddings weighted by attention_mask to obtain a sentence vector (a sketch of this helper follows the list).
L2 normalization: optionally apply an L2 norm so that cosine similarity can be computed as a dot product.
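The batch code below calls a _mean_pooling helper that the guide does not show. A minimal sketch of the standard masked-average implementation, assuming the token embeddings are the first element of the model output:
import torch

def _mean_pooling(model_output, attention_mask):
    # Last hidden state: (batch, seq_len, dim)
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the embedding dimension
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over real (non-padding) tokens and divide by their count
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts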
The core of the batch loop (batch_texts is the current slice of cleaned input texts; max_length and normalize come from the section 3 config):
# Tokenize the cleaned batch, padding/truncating to max_length
encoded_input = tokenizer(
    batch_texts, padding=True, truncation=True,
    max_length=max_length, return_tensors="pt",
)
with torch.no_grad():
    # Forward pass
    model_output = model(**encoded_input)
    # Mean pooling
    embeddings = _mean_pooling(model_output, encoded_input["attention_mask"])
# Optional L2 normalization
if normalize:
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
batch_vectors = embeddings.cpu().numpy()

5. Storing vectors in Redis
The pipeline returns a NumPy array of shape (N, dim). Each vector must be converted to float32, serialized to raw bytes, and loaded into a Redis vector index.
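The add_vectors method below loads documents through self.vl_index, which looks like a redisvl SearchIndex; that index has to exist first. A hedged sketch, assuming the redisvl library, a 1024-dimensional embedding (Qwen3-Embedding-0.6B's default output size), and hypothetical index/prefix names:
from redisvl.index import SearchIndex

schema = {
    "index": {"name": "docs", "prefix": "doc"},  # hypothetical names
    "fields": [
        {"name": "text", "type": "text"},
        {"name": "metadata", "type": "text"},
        {"name": "vector", "type": "vector", "attrs": {
            "dims": 1024,                 # Qwen3-Embedding-0.6B output size
            "algorithm": "hnsw",          # ANN index suitable for larger corpora
            "distance_metric": "cosine",  # matches L2-normalized vectors
            "datatype": "float32",
        }},
    ],
}

vl_index = SearchIndex.from_dict(schema)
vl_index.connect("redis://localhost:6379")
vl_index.create(overwrite=True)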
import json

def add_vectors(self, texts, vectors, metadata=None, batch_size=100):
    docs = []
    for i, (text, vec) in enumerate(zip(texts, vectors)):
        docs.append({
            "text": text,
            # Redis vector fields expect raw float32 bytes
            "vector": vec.astype("float32").tobytes(),
            "metadata": json.dumps(metadata[i], ensure_ascii=False) if metadata else "",
        })
    # Load in batches to bound the size of each round trip
    for start in range(0, len(docs), batch_size):
        self.vl_index.load(docs[start:start + batch_size])
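To read vectors back, a KNN search can go through redisvl's VectorQuery. A sketch under the same redisvl assumption; embed_query() is a hypothetical wrapper around the section 4 pipeline:
import numpy as np
from redisvl.query import VectorQuery

query_vec = embed_query("query: 静夜思")  # hypothetical helper; returns an np.ndarray

q = VectorQuery(
    vector=query_vec.astype(np.float32).tolist(),
    vector_field_name="vector",
    return_fields=["text", "metadata"],
    num_results=5,
)
results = vl_index.query(q)  # list of dicts, each with a distance score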
6. Optional Intel Extension for PyTorch (IPEX)
On Intel CPUs you can install intel-extension-for-pytorch (IPEX) to potentially accelerate certain operators. Use it only after verifying performance gains with benchmarks, and keep the PyTorch version strictly aligned with the IPEX build.
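If benchmarks do justify it, the usual inference-time pattern is a single ipex.optimize() call after switching the model to eval mode. A minimal sketch:
import torch
import intel_extension_for_pytorch as ipex

model.eval()
# Applies operator fusions and layout optimizations for Intel CPUs;
# float32 keeps numerics close to the unoptimized baseline
model = ipex.optimize(model, dtype=torch.float32)

with torch.no_grad():
    model_output = model(**encoded_input)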
7. Dependency versions
transformers>=4.40.0,<5.0.0
torch==2.10.0 # CPU‑only, supports Python 3.11/3.12
modelscope>=1.12.0
sentencepiece>=0.1.99
onnxruntime>=1.23.0
accelerate>=1.10.0
intel-extension-for-pytorch==2.8.0 # optional; keep the version aligned with the installed torch build (see section 6)