Free LLM API Tokens: Complete Provider List, Limits, and Usage Tips

This guide compiles free large‑language‑model APIs from official vendors and third‑party platforms, detailing each service's token quotas, rate limits, base URLs, usage restrictions, and available models. It also offers practical advice on token optimization, multi‑platform rotation, rate‑limit handling, and key security.

AI Engineer Programming

Free LLM APIs – Official Provider Quotas

Cohere

Application URL: dashboard.cohere.com/api-keys

Free quota: 1,000 API calls per month

Rate limit: 20 RPM

Base URL: https://api.cohere.com/v2

Usage restriction: non‑commercial use only

Available models: Command A (111B, 256K context, 4K output, text), Command R+, Command R, Command R7B, Embed 4 (text + image), Rerank 3.5 (10 RPM)

Google Gemini

Application URL: aistudio.google.com/app/apikey

Rate limits: Flash 10 RPM / 250 RPD, Flash‑Lite 15 RPM / 1,000 RPD

Base URL: https://generativelanguage.googleapis.com/v1beta

Usage note: free‑tier prompts may be used by Google to improve its products

Available models: Gemini 2.5 Flash and Flash‑Lite (1M context, 65K output, text + image + audio + video)

Mistral AI

Application URL: console.mistral.ai/api-keys (register for the Experiment plan)

Free quota: ~1 billion tokens per month

Rate limit: ~1 RPS · 500 K TPM per model

Base URL: https://api.mistral.ai/v1

Available models (all free): Mistral Small 4 (256K context, 256K output, text + image + code), Mistral Medium 3 (128K/128K, text), Mistral Large 3 (256K/256K, text), Mistral Nemo 12B (128K/128K, text), Codestral (256K/256K, code‑only), Pixtral Large (128K/128K, text + image)

Zhipu AI (Z AI)

Application URL: open.bigmodel.cn/usercenter/apikeys

Rate limit: 1 concurrent request per model

Base URL: https://open.bigmodel.cn/api/paas/v4

Free models: GLM‑4.7‑Flash (200K context, 128K output, text), GLM‑4.5‑Flash (128K context, ~8K output, text), GLM‑4.6V‑Flash (128K context, ~4K output, text + image)

Free LLM APIs – Third‑Party Inference Platforms

Cerebras

Application URL: cloud.cerebras.ai

Daily token limit: 1 M tokens (shared)

Rate limit: 30 RPM · 14,400 RPD · 1 M TPD

Base URL: https://api.cerebras.ai/v1

Free models: llama3.1‑8b, gpt‑oss‑120b, qwen‑3‑235b‑a22b‑instruct‑2507, zai‑glm‑4.7 (all 8K context on the free tier)

Groq

Application URL: console.groq.com/keys

Rate limit: 30 RPM · 14,400 RPD

Base URL: https://api.groq.com/openai/v1

Free models include: llama‑3.3‑70B‑versatile (131K context, 32K output), llama‑3.1‑8b‑instant (131K/131K), llama‑4‑scout‑17b (131K/8K, text + vision), llama‑4‑maverick‑17b (131K/8K, 15 RPM · 500 RPD), qwen3‑32b, gpt‑oss‑120b, kimi‑k2‑instruct (262K/262K), deepseek‑r1‑distill‑70b (131K/8K, inference), whisper‑large‑v3/turbo (audio → text, 20 RPM)

GitHub Models

Application URL: github.com/marketplace/models

Single‑request limit: 8K input / 4K output

Base URL: https://models.inference.ai.azure.com

Selected free models (45+): gpt‑4.1, gpt‑4.1‑mini, o4‑mini, o3‑mini, gpt‑4o, Llama‑4‑Scout‑17B‑16E, Llama‑4‑Maverick‑17B‑128E, DeepSeek‑R1, Meta‑Llama‑3.3‑70B, Mistral‑Small‑3.1, plus 35 more (text / image)

OpenRouter

Application URL: openrouter.ai/keys (add ":free" suffix to model name for free tier)

Rate limit: 20 RPM · 200 RPD shared across all free models

Base URL: https://openrouter.ai/api/v1

Selected free models: deepseek‑r1‑0528, deepseek‑chat‑v3‑0324, qwen‑3.6‑plus, qwen‑3‑coder‑480b‑a35b, meta‑llama‑4‑scout, meta‑llama‑4‑maverick, openai‑gpt‑oss‑120b, nvidia‑nemotron‑3‑super‑120b, google‑gemma‑4‑31b‑it, mistralai‑devstral‑2512, minimax‑minimax‑m2.5, plus ~23 more (see openrouter.ai/models?q=:free)
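
Because OpenRouter exposes an OpenAI‑compatible chat‑completions endpoint, a request can be built with only the standard library. The helper below is an illustrative sketch, not an official SDK call: it appends the ":free" suffix described above and prepares a POST request against the base URL. The function and variable names are my own.

```python
import json
import os
import urllib.request

BASE_URL = "https://openrouter.ai/api/v1"

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-completions request routed to OpenRouter's free tier."""
    if not model.endswith(":free"):
        model = f"{model}:free"  # the ':free' suffix selects the free pool
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send the request (requires a real key in OPENROUTER_API_KEY):
# req = build_request("deepseek/deepseek-r1-0528", "Hello!",
#                     os.environ["OPENROUTER_API_KEY"])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The same pattern works for any provider in this list that advertises an OpenAI‑compatible base URL; only the URL and key change.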

NVIDIA NIM

Application URL: build.nvidia.com/explore/discover (join NVIDIA Developer Program for free access)

Rate limit: ~40 RPM, no daily token cap

Base URL: https://integrate.api.nvidia.com/v1

Selected free models (100+): deepseek‑ai/deepseek‑r1 (128K / ~163K), nvidia/nemotron‑3‑super‑120b‑a12b (262K / 262K), nvidia/llama‑3.1‑nemotron‑ultra‑253b (128K / 4K), meta/llama‑3.1‑405b‑instruct (128K / 4K), qwen/qwen2.5‑72b‑instruct (128K / 8K), minimax/minimax‑m2.7 (128K / 8K), nvidia/nemotron‑nano‑2‑vl (128K / 8K, vision + text + video), plus ~90 more (text, image, video, audio, embeddings)

SiliconFlow

Application URL: cloud.siliconflow.cn/account/ak (register for 14 CNY credit)

Free‑tier rate: 1,000 RPM · 50 K TPM

Base URL: https://api.siliconflow.cn/v1

Free models: Qwen‑3‑8B, DeepSeek‑R1‑0528‑Qwen‑3‑8B, DeepSeek‑R1‑Distill‑Qwen‑7B, THUDM/glm‑4‑9b‑chat, THUDM/GLM‑4.1V‑9B‑Thinking (text, vision, inference), DeepSeek‑OCR (visual OCR)

How to Maximize Free Tokens

Assign Models by Task Type

Lightweight daily tasks (code completion, formatting): use Zhipu GLM‑4.7‑Flash / GLM‑4.5‑Flash (permanently free)

Medium‑complex tasks (debugging, refactoring, doc generation): use GLM Coding Plan Lite / Qwen‑3.6‑Plus (free on OpenRouter)

Very long document processing: Kimi API (unlimited tokens) or Gemini (1 M context)

Complex multi‑step agent workflows: GLM‑5.1 or MiniMax M2.7

Top‑tier inference needs: Claude Opus 4.6/4.7 (pay‑as‑you‑go)

Build a Multi‑Platform Rotation

Do not funnel all traffic through a single provider. Use OpenRouter as a unified entry point and switch between GLM‑5.1, Qwen‑3.6‑Plus, MiniMax M2.5, etc. When a platform hits its rate limit, automatically fall back to the next one.
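
A minimal sketch of this fallback pattern is shown below. The provider names, the `RateLimited` exception, and the injectable `call` function are all illustrative assumptions, not a real SDK; the point is the control flow: try each provider in order and move on when one is throttled.

```python
class RateLimited(Exception):
    """Raised by a provider call when it returns HTTP 429."""

# Illustrative rotation order; adjust to the platforms you actually use.
PROVIDERS = ["openrouter", "groq", "cerebras", "nvidia-nim"]

def complete_with_fallback(prompt, call, providers=PROVIDERS):
    """Try providers in order; return the first successful response.

    `call(provider_name, prompt)` is any function that performs the real
    API request and raises RateLimited on a 429 response.
    """
    last_error = None
    for name in providers:
        try:
            return call(name, prompt)
        except RateLimited as exc:
            last_error = exc  # this provider is throttled; try the next one
    raise RuntimeError("all providers rate-limited") from last_error
```

In practice the `call` function would wrap the HTTP request for each provider's base URL; keeping it injectable also makes the rotation logic trivial to unit-test.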

Handle Rate‑Limit Restrictions

Almost every free API enforces RPM and RPD limits. Implement exponential back‑off retry logic in code: on receiving a 429 error, wait, then retry instead of crashing.
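
A minimal back‑off sketch, assuming a `send` callable that raises a (hypothetical) `RateLimitError` on a 429 response: delays double on each attempt, are capped, and get a little random jitter so many clients do not retry in lockstep.

```python
import random
import time

class RateLimitError(Exception):
    """Illustrative stand-in for whatever your HTTP client raises on 429."""

def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0):
    """Yield the wait time before each retry: base * 2**attempt, capped."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt)

def call_with_backoff(send, retries: int = 5):
    """Call `send()`; on rate-limit errors, wait and retry with back-off."""
    for delay in backoff_delays(retries):
        try:
            return send()
        except RateLimitError:
            time.sleep(delay + random.uniform(0, 0.5))  # add jitter
    return send()  # final attempt; let any error propagate to the caller
```

With the defaults, the waits are 1 s, 2 s, 4 s, 8 s, 16 s before giving the request one last try.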

Secure API Keys

Leaked keys allow others to consume your quota or incur charges. Never hard‑code keys in source files or push them to public repositories; store them in environment variables or a dedicated secret‑management system.
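
One simple way to enforce this in Python is a helper that reads the key from the environment and fails fast with a clear message when it is missing. The variable name below is just an example:

```python
import os

def require_key(var: str = "MISTRAL_API_KEY") -> str:
    """Load an API key from the environment, failing loudly if unset."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(
            f"{var} is not set; export it in your shell or load it from a "
            "secret manager. Never hard-code keys in source files."
        )
    return key
```

Failing at startup is far better than discovering a missing or hard‑coded key after a deploy.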

Key Takeaways

For independent developers and students, combining Mistral (≈1 B tokens / month), Groq (14,400 RPD), and GitHub Models (GPT‑4.1 / o4‑mini) enables a completely zero‑cost early‑stage AI product validation. Free quotas are suitable for development, testing, and learning, but should not be relied upon for production because they lack SLA guarantees, priority queuing, and have strict rate limits.

Data source: GitHub search; repository mnfst/awesome-free-llm-apis (CC0 license), continuously updated.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI · LLM · Model comparison · Rate limiting · Free API · Token limits