
OpenAI Builds Its Own Inference Chip While GLM‑5.2 Gains Traction – AI Race Shifts to Compute Control

The article analyzes how OpenAI’s self‑developed Jalapeño inference chip, GLM‑5.2’s integration into model gateways, GitHub’s one‑click credential revocation, and Cloudflare’s cache‑TTL adjustments illustrate a broader industry shift toward full‑stack, controllable AI infrastructure that prioritizes cost, latency, and governance.

Programmer DD

Jun 25, 2026


Today's headline: AI competition is shifting from "which model is stronger" to "who controls compute, cost, model access, cache freshness, and enterprise security boundaries." Model capability still matters, but developers increasingly feel the effects of stable, cheap, and governable underlying systems.

OpenAI and Broadcom unveil Jalapeño inference chip

On June 24, OpenAI and Broadcom announced Jalapeño, OpenAI's first self‑designed AI processor, optimized for large‑language‑model inference. The chip was designed around the requirements of large‑model inference, and engineering samples have already run ML workloads, including GPT‑5.3‑Codex‑Spark, at target frequency and power. OpenAI describes Jalapeño as the first step in a multi‑generation compute platform, with deployment slated to begin in 2026.

The announcement signals that AI product competition has entered a full‑stack era. When model companies optimize chips, kernels, memory, networking, and scheduling themselves, inference cost, latency, throughput, and supply‑chain stability become the decisive factors for developers.

GLM‑5.2 continues to heat up

GLM‑5.2 remains a focal point for developers worldwide. On June 24, Vercel announced that GLM‑5.2 Fast (model ID zai/glm-5.2-fast) is now available through its AI Gateway. In internal serverless tests, the Wafer provider delivered higher throughput than competing providers, measuring over 170 tok/s on small‑context queries and over 200 tok/s on large‑context queries.

This integration highlights a shift in model‑gateway competition from merely offering a unified API to emphasizing throughput, cost transparency, bring‑your‑own‑key (BYOK) support, retry logic, and observability. It also shows open models moving from community downloads into production‑grade gateways, SDKs, and enterprise call chains.
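Retry logic is one of the gateway features listed here. As a rough illustration, the following minimal Python sketch shows client‑side retries with exponential backoff and full jitter. It does not use Vercel's AI Gateway SDK; `with_retries` and `TransientError` are hypothetical names, and treating failures such as HTTP 429 or 503 as "transient" is an assumption you would adapt to your gateway.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for retryable failures (e.g. HTTP 429/503 from a gateway)."""


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying TransientError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Backoff doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay;
            # full jitter spreads retries out so clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the helper testable; in production you would leave the default `time.sleep`. Capping attempts matters because a gateway may already retry or fall back across providers on its own, and stacking unbounded client retries on top of that multiplies load during an outage.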

GitHub adds one‑click enterprise credential revocation

On the same day, GitHub shipped an update that lets enterprise owners, and members with the "Manage enterprise credentials" permission, bulk‑revoke SSO authorizations, personal access tokens, SSH keys, and OAuth tokens for all users or for selected users. For Enterprise Managed User (EMU) accounts, user tokens and SSH keys can also be deleted. Individual users can revoke their own credentials from their credential settings page.

In the era of coding agents, credential sprawl and token leakage are harder to audit. One‑click revocation acts as an “emergency power‑off” for security incidents, and teams adopting agents should build credential lifecycle management, audit logs, and incident‑response processes into their deployment standards.
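To make credential lifecycle management more concrete, the following Python sketch plans which credentials to revoke from a local inventory. It supports two modes: an incident mode that targets every credential for all users or for selected users (similar in scope to GitHub's bulk revocation), and a routine mode that only targets credentials unused for longer than a threshold. The `Credential` record and `plan_revocation` function are hypothetical. The sketch does not call GitHub's API, and the actual revocation would go through GitHub's enterprise settings or your own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Credential:
    """Hypothetical inventory record for one credential (not a GitHub API type)."""
    user: str
    kind: str                    # e.g. "pat", "ssh_key", "oauth_token", "sso_authorization"
    last_used: Optional[datetime]  # None if the credential was never used


def plan_revocation(
    creds: Iterable[Credential],
    users: Optional[Set[str]] = None,
    stale_after: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> List[Credential]:
    """Select credentials to revoke.

    users=None targets every user (incident "power-off"); otherwise only listed users.
    stale_after=None selects every matching credential; otherwise only credentials
    unused for longer than stale_after (never-used ones are always selected).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    selected: List[Credential] = []
    for c in creds:
        if users is not None and c.user not in users:
            continue  # not in the targeted user set
        if stale_after is not None and c.last_used is not None and now - c.last_used <= stale_after:
            continue  # recently used: keep it during routine cleanup
        selected.append(c)
    return selected
```

Keeping planning separate from execution means the selected list can be written to an audit log and reviewed before anything is actually revoked.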

Cloudflare AI Search adjusts similarity cache TTL

Cloudflare updated its AI Search similarity cache on June 24. Cache retention changed from a fixed 30 days to a configurable cache_ttl that defaults to 48 hours. Developers can now set the TTL anywhere from 10 minutes to 6 days, or manually clear all cached responses.

This change helps Retrieval‑Augmented Generation (RAG) applications balance latency and inference cost against answer freshness. The shorter default and the explicit clearing interface are a reminder that cache strategy must account for staleness, invalidation, and traceability, not just hit rate.
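The freshness‑versus‑cost trade‑off can be illustrated with a minimal in‑memory TTL cache in Python. It borrows the bounds reported for Cloudflare's cache_ttl (10 minutes to 6 days, default 48 hours), expressed here in seconds as an assumption, and adds a clear‑all operation. It is not Cloudflare's implementation: it uses exact‑match keys rather than similarity matching, and `TTLResponseCache` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Optional, Tuple, TypeVar

V = TypeVar("V")

# Bounds mirroring the reported cache_ttl range; the seconds unit is an assumption.
MIN_TTL = 10 * 60            # 10 minutes
MAX_TTL = 6 * 24 * 60 * 60   # 6 days
DEFAULT_TTL = 48 * 60 * 60   # 48 hours


class TTLResponseCache(Generic[V]):
    """Hypothetical exact-match response cache with a bounded TTL and clear-all."""

    def __init__(self, ttl_seconds: int = DEFAULT_TTL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not MIN_TTL <= ttl_seconds <= MAX_TTL:
            raise ValueError(f"ttl must be between {MIN_TTL} and {MAX_TTL} seconds")
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}  # key -> (expires_at, value)

    def put(self, key: Hashable, value: V) -> None:
        self._entries[key] = (self._clock() + self.ttl, value)

    def get(self, key: Hashable) -> Optional[V]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # stale: evict lazily and treat as a miss
            return None
        return value

    def clear_all(self) -> None:
        """Drop every cached response, e.g. after the underlying documents change."""
        self._entries.clear()
```

Expiry is checked on read, so stale entries are never served even though they may stay in memory until the next lookup. A production cache would also need size limits and a way to trace which source documents produced each cached answer.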

Enterprise Agent deployments illustrate workflow value

Reports from the Amazon Web Services China Summit describe how Xpeng Motors built an internal AI coding and agent platform on Kiro, Amazon Bedrock, and Amazon EKS. The platform has accumulated more than 700 skills, connected more than 400 APIs, generates more than 100 AI‑assisted PRs per day, and has completed over 140,000 workflow runs. Similar cases from Kimi, Cheetah Mobile, and Insta360 reinforce the point that faster AI‑assisted coding for individuals does not automatically translate into organizational efficiency.

Effective agent adoption requires closing the loop across requirements, data, models, tools, testing, deployment, operations, and governance. Rather than simply adding more AI assistants, companies should restructure their processes so that agents become manageable production units.

Today's keyword is "full‑stack controllable": OpenAI is building its own chip to control cost and supply, GLM‑5.2 is entering model gateways, and GitHub and Cloudflare are updating their security and caching controls. Developers will experience model capability through the stability of the entire infrastructure chain rather than through isolated benchmark scores.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.


Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"

