Why Neural Networks Are the Hidden Engine Behind Modern AI: From Basics to Large Language Models

This comprehensive guide walks through the fundamentals of neural networks, activation functions, training methods, and how they power large language models, while also covering tokenization, self‑attention, transformer architectures, AI infrastructure, and practical usage through agents and retrieval‑augmented generation.

Why Neural Networks Power Modern AI

Artificial‑intelligence algorithms differ from traditional ones: they learn from data instead of being explicitly programmed. The core of this learning is the neural network, a layered structure that mimics the brain’s signal flow.

1. Basic Concepts of Neural Networks

A neural network consists of an input layer, one or more hidden layers, and an output layer. Each layer contains neurons (nodes) that receive a weighted sum of signals from the previous layer, apply an activation function, and pass the result forward. For readers familiar with computer-network architecture, the three layer types are loosely analogous to the access, aggregation, and core layers of a network.

2. Activation Functions and Parameters

Each neuron first computes a weighted sum z = w·x + b, where w is the weight vector and b is the bias. The sum is then transformed by an activation function (e.g., step, sigmoid, ReLU) to introduce non-linearity, enabling the network to model complex relationships. The choice of activation function affects training stability and expressiveness.
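
A minimal sketch of a single neuron with a sigmoid activation; the inputs, weights, and bias below are illustrative values, not from the article:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), introducing non-linearity.
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer (illustrative)
w = np.array([0.4, 0.1, -0.6])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum: w·x + b
a = sigmoid(z)                   # activation passed to the next layer
print(z, a)
```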

The collection of all weights and biases forms the model’s parameters. Large models may contain billions of parameters (e.g., GPT‑3 with 175 billion, DeepSeek V3.2 with 670 billion).

3. Training Process

Training alternates between forward propagation (computing outputs) and backward propagation (computing gradients). A loss function such as cross‑entropy measures the error between the predicted output and the target value. Using the chain rule, the partial derivative of the loss with respect to each parameter is computed; together these form the gradient. Gradient descent then updates each parameter by moving it in the direction opposite to its gradient, scaled by a learning rate.
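
A minimal sketch of this loop for one linear neuron, using a squared-error loss for brevity (LLMs typically use cross-entropy) and hand-derived chain-rule gradients; all values are illustrative:

```python
# Toy data and parameters (illustrative values, not from the article).
x, target = 2.0, 1.0
w, b = 0.1, 0.0
lr = 0.1                      # learning rate

for step in range(5):
    # Forward propagation
    y = w * x + b             # prediction
    loss = (y - target) ** 2  # squared-error loss for simplicity

    # Backward propagation via the chain rule
    dloss_dy = 2 * (y - target)
    dw = dloss_dy * x         # d(loss)/d(w)
    db = dloss_dy             # d(loss)/d(b)

    # Gradient descent: move parameters opposite to the gradient
    w -= lr * dw
    b -= lr * db
    print(step, round(loss, 4))
```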

Matrix multiplication underlies both forward and backward passes. For a layer with weight matrix W and input vector x, the activation is a = f(W·x + b). Large‑scale training distributes these operations across many GPUs.
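
The same computation at the layer level, sketched as one matrix–vector product with an assumed ReLU activation and illustrative shapes:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One hidden layer: 4 neurons, each reading 3 inputs (shapes are illustrative).
W = np.random.randn(4, 3)   # weight matrix, one row per neuron
b = np.zeros(4)             # bias vector
x = np.array([0.5, -1.2, 3.0])

a = relu(W @ x + b)         # a = f(W·x + b), a single matrix-vector product
print(a.shape)              # (4,) — this layer's output, fed to the next layer
```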

4. Large Language Models (LLMs)

LLMs specialize in natural‑language tasks. Input text is first tokenized (e.g., using BPE) into discrete tokens that are looked up in a vocabulary. Tokens are then embedded into high‑dimensional vectors (tensors) that capture semantic similarity via Euclidean distance or cosine similarity.
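
A minimal sketch of the lookup-and-embed step, using a toy vocabulary and a random embedding table in place of a trained BPE tokenizer and learned weights:

```python
import numpy as np

# Placeholder vocabulary and embedding table (a real model uses BPE and learned weights).
vocab = {"the": 0, "cat": 1, "dog": 2, "sat": 3}
emb = np.random.randn(len(vocab), 8)       # 8-dimensional embeddings for illustration

def embed(token):
    return emb[vocab[token]]               # look up the token's vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embed("cat"), embed("dog")))  # similarity between two token vectors
```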

The transformer architecture replaces recurrent networks with self‑attention. Each token computes attention scores against all other tokens (matrix × matrix), producing a weighted sum of value vectors. This allows parallel processing of the entire sequence and captures long‑range dependencies.
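
A minimal single-head, scaled dot-product self-attention sketch in NumPy; the projection matrices are random placeholders standing in for learned weights:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model = 5, 16                 # 5 tokens, 16-dim embeddings (illustrative)
X = np.random.randn(seq_len, d_model)    # token embeddings
Wq = np.random.randn(d_model, d_model)   # learned projections in a real model
Wk = np.random.randn(d_model, d_model)
Wv = np.random.randn(d_model, d_model)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_model)      # every token scores every other token
attn = softmax(scores)                   # attention weights, one row per token
out = attn @ V                           # weighted sum of value vectors
print(out.shape)                         # (5, 16)
```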

Variants such as sparse attention limit calculations to the most relevant token pairs, reducing computational cost. Models also use a temperature parameter in the softmax layer to control the sharpness of the output probability distribution.
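
A small sketch of how the temperature parameter reshapes the softmax distribution over next-token logits (illustrative values):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature        # divide logits by the temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])            # illustrative next-token logits
print(softmax_with_temperature(logits, 0.5))  # sharper: the top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness when sampling
```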

5. AI Infrastructure

Training LLMs requires massive parallel compute. Modern data-center GPUs (e.g., the NVIDIA H100) contain over 16,000 CUDA cores, and the CUDA platform schedules matrix operations across these cores. Two parallelism strategies are used:

Data parallelism: each GPU holds a full replica of the model and processes a different mini‑batch. Gradients are aggregated via AllReduce to keep parameters synchronized across replicas.

Model (tensor) parallelism: a single model is split across GPUs; intermediate tensors are exchanged using AllGather and AllReduce operations.

High‑speed interconnects such as NVLink and NVSwitch provide up to 900 GB/s bandwidth between GPUs within a server, while ConnectX‑7 NICs deliver 400 Gbps external links. NCCL (NVIDIA Collective Communications Library) abstracts these communications with calls like ncclAllReduce and ncclAllGather.
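
A minimal sketch of the data-parallel gradient synchronization described above, using PyTorch's torch.distributed over the NCCL backend; the launch setup (one process per GPU, e.g. via torchrun) is assumed and not from the article:

```python
import torch
import torch.distributed as dist

def sync_gradients(model, world_size):
    # Average each gradient across replicas; NCCL maps this to ncclAllReduce.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# Typical usage when launched with one process per GPU (e.g. torchrun):
# dist.init_process_group(backend="nccl")
# ...forward pass, then loss.backward()...
# sync_gradients(model, dist.get_world_size())
# optimizer.step()
```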

6. Using LLMs in Practice

End users typically interact with LLMs through agents: software components that combine perception (e.g., of text and images) with action. Agents rely on protocols such as MCP (Model Context Protocol) to request up‑to‑date data from external tools, overcoming the static‑knowledge limitation of models whose weights are frozen after training.
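
A hypothetical sketch of that agent pattern; call_llm and call_mcp_tool are placeholder callables (not a real MCP client API) standing in for the model and for an MCP tool invocation:

```python
def run_agent(user_query, call_llm, call_mcp_tool, max_steps=3):
    """Hypothetical agent loop: the model may request external data via a tool."""
    context = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = call_llm(context)                # placeholder model call
        if reply.get("tool") is None:            # model answered directly
            return reply["content"]
        # Model asked for fresh data; fetch it through the (placeholder) MCP tool.
        result = call_mcp_tool(reply["tool"], reply.get("arguments", {}))
        context.append({"role": "tool", "content": str(result)})
    return "No final answer within the step limit."
```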

To mitigate hallucinations, retrieval‑augmented generation (RAG) injects relevant documents from a vector‑searchable knowledge base into the prompt. The A2A (Agent‑to‑Agent) protocol, an emerging industry standard, lets different agents cooperate and share information.
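
A minimal sketch of the retrieval step, assuming an in-memory NumPy vector store and a placeholder embed_text function standing in for a real embedding model:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve(query_vec, doc_vecs, docs, k=2):
    # Rank knowledge-base documents by similarity to the query vector.
    scores = [cosine(query_vec, d) for d in doc_vecs]
    best = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in best]

def build_prompt(question, embed_text, doc_vecs, docs):
    # embed_text stands in for a real embedding model (an assumption here).
    context = "\n".join(retrieve(embed_text(question), doc_vecs, docs))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```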

Despite these advances, challenges remain: high compute cost, outdated knowledge, and occasional inaccurate (hallucinated) outputs. Ongoing research focuses on better alignment, efficient fine‑tuning (e.g., knowledge distillation), and more robust retrieval mechanisms.

Artificial Intelligence, deep learning, Transformer, large language models, neural networks, agent systems, GPU infrastructure
Written by

Tencent Technical Engineering

Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
