Running Qwen3‑Embedding on CPU‑Only Machines and Storing Vectors in Redis 8
This guide explains how to run the Qwen3‑Embedding‑0.6B model on a CPU‑only server, configure key parameters, optionally use Intel Extension for PyTorch, and efficiently store the resulting vectors in Redis 8 with proper serialization and indexing.
Environment: two Intel servers (8 CPU cores, 16 GB RAM each) running Python 3.11.6. The embedding model is Qwen/Qwen3-Embedding-0.6B.
1. Model selection – stability on CPU
The goal is reliable execution on CPU‑only hardware. Qwen3-Embedding-0.6B was chosen because it provides the best Chinese text vectorization among small models while keeping memory and compute requirements modest.
Use a modest batch_size (e.g., 2) and increase the number of passes if needed.
Use short texts (e.g., poetry) to limit padding and token overhead.
Set num_threads to the number of physical cores (8); oversubscribing threads on a small box is what caused crashes in testing. A minimal sketch of these settings follows.
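The sketch below uses placeholder sample texts; the actual tokenizer/model pipeline is covered in sections 3–4:
import torch

# Match threads to physical cores to avoid oversubscription on an 8-core box
torch.set_num_threads(8)

texts = ["床前明月光", "疑是地上霜"]  # short texts keep padding overhead low
batch_size = 2
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
# Each small batch is then fed to the tokenizer/model pipeline from section 4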
2. Model acquisition – ModelScope first, HuggingFace fallback
In mainland China, direct access to HuggingFace can be unstable. The recommended workflow tries ModelScope first and falls back to a HuggingFace mirror if necessary.
Optional HuggingFace mirror configuration:
import os
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

Loading the model with automatic fallback:
import os
import logging

import torch
from transformers import AutoModel, AutoTokenizer

logger = logging.getLogger(__name__)

# Configuration (values match the YAML in section 3)
model_name = "Qwen/Qwen3-Embedding-0.6B"
cache_dir = "./models"
trust_remote_code = True
device = "cpu"

# Set mirror (if needed)
hf_endpoint = os.environ.get("HF_ENDPOINT", "https://hf-mirror.com")
if "huggingface.co" not in hf_endpoint:
    os.environ["HF_ENDPOINT"] = hf_endpoint
    logger.info(f"Using HuggingFace mirror: {hf_endpoint}")

try:
    # Try ModelScope download
    from modelscope import snapshot_download
    logger.info("Downloading model from ModelScope…")
    model_ref = snapshot_download(model_name, cache_dir=cache_dir, revision="master")
    logger.info(f"Model downloaded: {model_ref}")
except Exception as e:
    logger.warning(f"ModelScope download failed: {e}")
    logger.info(f"Falling back to HuggingFace mirror ({hf_endpoint})…")
    model_ref = model_name

# Unified loading (model_ref may be a local path or a HF repo id)
tokenizer = AutoTokenizer.from_pretrained(
    model_ref, cache_dir=cache_dir, trust_remote_code=trust_remote_code
)
model = AutoModel.from_pretrained(
    model_ref, cache_dir=cache_dir, trust_remote_code=trust_remote_code
).to(device)
model.eval()

3. CPU‑side key parameters
num_threads: number of CPU threads (set to 8 for an 8‑core server).
batch_size: how many texts are processed together (e.g., 2).
max_length: maximum token length; 512 tokens are sufficient for short Chinese poems.
Example configuration (YAML‑style):
model:
  name: "Qwen/Qwen3-Embedding-0.6B"
  cache_dir: "./models"
  device: "cpu"
  num_threads: 8
  use_modelscope: true
  trust_remote_code: true

vectorization:
  batch_size: 2
  max_length: 512
  normalize: true
  query_instruction: "query: "

Initialize the thread pool with:
torch.set_num_threads(num_threads)

4. Vectorization pipeline
Clean text: collapse consecutive whitespace into a single space.
Query prefix: prepend "query: " to query strings; document texts are used as‑is after cleaning.
Mean pooling: average token embeddings weighted by attention_mask to obtain a sentence vector (a sketch of this helper follows the list).
L2 normalization: optionally apply an L2 norm so that cosine similarity can be computed as a dot product.
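The batch code below calls a _mean_pooling helper that the guide does not show. A minimal sketch of the standard masked-average implementation, assuming the token embeddings are the first element of the model output:
import torch

def _mean_pooling(model_output, attention_mask):
    # Last hidden state: (batch, seq_len, dim)
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the embedding dimension
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over real (non-padding) tokens and divide by their count
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts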
The core of the batch loop (batch_texts is the current slice of cleaned input texts; max_length and normalize come from the section 3 config):
# Tokenize the cleaned batch, padding/truncating to max_length
encoded_input = tokenizer(
    batch_texts, padding=True, truncation=True,
    max_length=max_length, return_tensors="pt",
)
with torch.no_grad():
    # Forward pass
    model_output = model(**encoded_input)
    # Mean pooling
    embeddings = _mean_pooling(model_output, encoded_input["attention_mask"])
# Optional L2 normalization
if normalize:
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
batch_vectors = embeddings.cpu().numpy()

5. Storing vectors in Redis
The pipeline returns a NumPy array of shape (N, dim). Each vector must be converted to float32, serialized to raw bytes, and loaded into a Redis vector index.
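The add_vectors method below loads documents through self.vl_index, which looks like a redisvl SearchIndex; that index has to exist first. A hedged sketch, assuming the redisvl library, a 1024-dimensional embedding (Qwen3-Embedding-0.6B's default output size), and hypothetical index/prefix names:
from redisvl.index import SearchIndex

schema = {
    "index": {"name": "docs", "prefix": "doc"},  # hypothetical names
    "fields": [
        {"name": "text", "type": "text"},
        {"name": "metadata", "type": "text"},
        {"name": "vector", "type": "vector", "attrs": {
            "dims": 1024,                 # Qwen3-Embedding-0.6B output size
            "algorithm": "hnsw",          # ANN index suitable for larger corpora
            "distance_metric": "cosine",  # matches L2-normalized vectors
            "datatype": "float32",
        }},
    ],
}

vl_index = SearchIndex.from_dict(schema)
vl_index.connect("redis://localhost:6379")
vl_index.create(overwrite=True)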
import json

def add_vectors(self, texts, vectors, metadata=None, batch_size=100):
    docs = []
    for i, (text, vec) in enumerate(zip(texts, vectors)):
        docs.append({
            "text": text,
            # Redis vector fields expect raw float32 bytes
            "vector": vec.astype("float32").tobytes(),
            "metadata": json.dumps(metadata[i], ensure_ascii=False) if metadata else "",
        })
    # Load in batches to bound the size of each round trip
    for start in range(0, len(docs), batch_size):
        self.vl_index.load(docs[start:start + batch_size])
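To read vectors back, a KNN search can go through redisvl's VectorQuery. A sketch under the same redisvl assumption; embed_query() is a hypothetical wrapper around the section 4 pipeline:
import numpy as np
from redisvl.query import VectorQuery

query_vec = embed_query("query: 静夜思")  # hypothetical helper; returns an np.ndarray

q = VectorQuery(
    vector=query_vec.astype(np.float32).tolist(),
    vector_field_name="vector",
    return_fields=["text", "metadata"],
    num_results=5,
)
results = vl_index.query(q)  # list of dicts, each with a distance score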
6. Optional Intel Extension for PyTorch (IPEX)
On Intel CPUs you can install intel-extension-for-pytorch (IPEX) to potentially accelerate certain operators. Use it only after verifying performance gains with benchmarks, and keep the PyTorch version strictly aligned with the IPEX build.
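If benchmarks do justify it, the usual inference-time pattern is a single ipex.optimize() call after switching the model to eval mode. A minimal sketch:
import torch
import intel_extension_for_pytorch as ipex

model.eval()
# Applies operator fusions and layout optimizations for Intel CPUs;
# float32 keeps numerics close to the unoptimized baseline
model = ipex.optimize(model, dtype=torch.float32)

with torch.no_grad():
    model_output = model(**encoded_input)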
7. Dependency versions
transformers>=4.40.0,<5.0.0
torch==2.10.0 # CPU‑only, supports Python 3.11/3.12
modelscope>=1.12.0
sentencepiece>=0.1.99
onnxruntime>=1.23.0
accelerate>=1.10.0
intel-extension-for-pytorch==2.8.0 # optional; keep the version aligned with the installed torch build (see section 6)