Understanding NAS: Core Algorithms and Python Implementations
This article reviews Neural Architecture Search (NAS), explains its bi‑level optimization formulation, compares three major search strategies—reinforcement learning, evolutionary algorithms, and differentiable gradient‑based methods—provides complete Python code for each, and analyzes experimental results highlighting performance trade‑offs and remaining challenges.
NAS Technical Framework and Core Principles
Neural Architecture Search (NAS) reformulates neural network design as an automated optimization problem, converting the manual, time‑consuming process of crafting architectures (e.g., AlexNet, ResNet, Transformer) into a searchable space. The NAS workflow consists of three key components: defining the search space, selecting a search strategy, and evaluating candidate architectures.
Bi‑level Optimization Formulation
The objective is to find the architecture \(\alpha^*\) that minimizes validation loss \(L_{val}\) while simultaneously training the network weights \(w\) for that architecture. Formally:
\[
\min_{\alpha \in A} \; L_{val}(w^*(\alpha), \alpha)
\quad \text{s.t.} \quad
w^*(\alpha) = \arg\min_{w} L_{train}(w, \alpha)
\]
Here, \(A\) denotes the set of all possible architectures, \(L_{train}\) is the training loss, and \(L_{val}\) is the validation loss. The outer optimization searches over discrete architectures, while the inner optimization trains the weights.
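All three strategies below score candidates with an `evaluate_architecture` helper that carries out the inner optimization: build a network from the candidate's configuration, train its weights on the training split, and return the validation loss. The article does not reproduce this helper (or the `activation_map` it relies on), so the following is a minimal sketch consistent with how the search code calls it; the exact training loop used in the original experiments may differ.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Assumed mapping from search-space names to torch.nn activation classes.
activation_map = {'ReLU': nn.ReLU, 'LeakyReLU': nn.LeakyReLU, 'Tanh': nn.Tanh}

def build_model(architecture, input_dim=1, output_dim=1):
    """Assemble an MLP from one candidate architecture dictionary."""
    activation = activation_map[architecture['activation_function']]
    layers = []
    in_features = input_dim
    for _ in range(architecture['num_hidden_layers']):
        layers += [nn.Linear(in_features, architecture['hidden_layer_size']), activation()]
        if architecture['dropout_rate'] > 0:
            layers.append(nn.Dropout(architecture['dropout_rate']))
        in_features = architecture['hidden_layer_size']
    layers.append(nn.Linear(in_features, output_dim))
    return nn.Sequential(*layers)

def evaluate_architecture(architecture, X_train, y_train, X_val, y_val, num_epochs=10):
    """Inner optimization: train this candidate's weights, return validation MSE."""
    model = build_model(architecture)
    criterion = nn.MSELoss()
    optimizer_cls = {'Adam': optim.Adam, 'SGD': optim.SGD, 'RMSprop': optim.RMSprop}
    optimizer = optimizer_cls[architecture['optimizer']](
        model.parameters(), lr=architecture['learning_rate'])
    model.train()
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return criterion(model(X_val), y_val).item()
```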
Three Core Search Strategies
Reinforcement‑Learning‑Based Search
The RL approach treats architecture generation as a sequential decision process. A controller RNN proposes architecture components, receives a reward equal to the negative validation loss, and updates its policy via policy‑gradient methods.
import random

import torch
import torch.nn as nn
import torch.optim as optim

class ArchitectureController(nn.Module):
    def __init__(self, search_space):
        super(ArchitectureController, self).__init__()
        self.search_space = search_space
        self.keys = list(search_space.keys())
        self.vocab_size = [len(search_space[key]) for key in self.keys]
        self.num_actions = len(self.keys)
        # One RNN step per architecture decision; each policy head
        # scores the options for one search-space key.
        self.rnn = nn.RNN(input_size=1, hidden_size=64, num_layers=1)
        self.policy_heads = nn.ModuleList([nn.Linear(64, vs) for vs in self.vocab_size])

    def forward(self, input, hidden):
        output, hidden = self.rnn(input, hidden)
        logits = [head(output.squeeze(0)) for head in self.policy_heads]
        return logits, hidden
def run_rl_search(search_space, X_train, y_train, X_val, y_val, num_epochs=10, num_episodes=5):
    controller = ArchitectureController(search_space)
    controller_optimizer = optim.Adam(controller.parameters(), lr=0.01)
    best_loss = float('inf')
    best_arch = None
    for episode in range(num_episodes):
        controller_optimizer.zero_grad()
        hidden = torch.zeros(1, 1, 64)
        log_probs = []
        architecture = {}
        # Sample one decision per search-space key from the controller's policy.
        for i, key in enumerate(controller.keys):
            logits, hidden = controller(torch.zeros(1, 1, 1), hidden)
            dist = torch.distributions.Categorical(logits=logits[i])
            action = dist.sample()
            architecture[key] = search_space[key][action.item()]
            log_probs.append(dist.log_prob(action))
        val_loss = evaluate_architecture(architecture, X_train, y_train, X_val, y_val, num_epochs=num_epochs)
        # REINFORCE update: the reward is the negative validation loss.
        reward = -val_loss
        policy_loss = torch.sum(torch.stack(log_probs) * -reward)
        policy_loss.backward()
        controller_optimizer.step()
        if val_loss < best_loss:
            best_loss = val_loss
            best_arch = architecture
    return best_arch, best_loss

Evolutionary-Algorithm Search
The EA approach maintains a population of architectures and applies selection, crossover, and mutation to evolve better solutions over generations.
def run_evolutionary_search(X, y, search_space, population_size=10, num_generations=5):
    best_loss = float('inf')
    best_arch = None
    # 80/20 train/validation split.
    split_idx = int(len(X) * 0.8)
    X_train, X_val = X[:split_idx], X[split_idx:]
    y_train, y_val = y[:split_idx], y[split_idx:]
    # Initialize a random population of architectures.
    population = []
    for _ in range(population_size):
        architecture = {k: random.choice(v) for k, v in search_space.items()}
        population.append(architecture)
    for _ in range(num_generations):
        fitness = []
        for arch in population:
            loss = evaluate_architecture(arch, X_train, y_train, X_val, y_val, num_epochs=10)
            fitness.append((loss, arch))
            if loss < best_loss:
                best_loss = loss
                best_arch = arch
        # Keep the better half as elites, then refill the population
        # via uniform crossover plus a single-gene mutation.
        fitness.sort(key=lambda x: x[0])
        elites = [arch for _, arch in fitness[:population_size // 2]]
        new_population = elites.copy()
        while len(new_population) < population_size:
            p1, p2 = random.sample(elites, 2)
            child = {k: random.choice([p1[k], p2[k]]) for k in p1}
            mutation_key = random.choice(list(search_space.keys()))
            child[mutation_key] = random.choice(search_space[mutation_key])
            new_population.append(child)
        population = new_population
    return best_arch, best_loss

Gradient-Based Differentiable Search (DARTS)
DARTS builds a super‑network containing all candidate operations and learns continuous architecture weights via gradient descent, alternating between updating network weights and architecture parameters.
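Concretely, DARTS makes the discrete operation choice differentiable by relaxing it into a softmax-weighted mixture over the candidate set \(O\):

```latex
\bar{o}(x) = \sum_{o \in O} \frac{\exp(\alpha_o)}{\sum_{o' \in O} \exp(\alpha_{o'})}\, o(x)
```

After search, the relaxation is collapsed by keeping the operation with the largest \(\alpha_o\) at each position, which is what the `discretize` method in the code below does.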
class Cell(nn.Module):
    """Mixed operation: a weighted sum of every candidate op's output."""
    def __init__(self, in_features, out_features, ops):
        super(Cell, self).__init__()
        self.ops = nn.ModuleList([nn.Sequential(nn.Linear(in_features, out_features), op()) for op in ops])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class Model(nn.Module):
    def __init__(self, search_space):
        super(Model, self).__init__()
        self.ops_list = [activation_map[name] for name in search_space['activation_function']]
        self.num_ops = len(self.ops_list)
        self.num_hidden_layers = max(search_space['num_hidden_layers'])
        self.hidden_layer_size = search_space['hidden_layer_size'][0]
        # Architecture parameters: one row of operation logits per layer.
        self.alphas = nn.Parameter(torch.randn(self.num_hidden_layers, self.num_ops))
        self.layers = nn.ModuleList([nn.Linear(1, self.hidden_layer_size)])
        for _ in range(self.num_hidden_layers - 1):
            self.layers.append(Cell(self.hidden_layer_size, self.hidden_layer_size, self.ops_list))
        self.output_layer = nn.Linear(self.hidden_layer_size, 1)

    def forward(self, x):
        arch_weights = nn.functional.softmax(self.alphas, dim=-1)
        out = x
        for i, layer in enumerate(self.layers):
            if isinstance(layer, nn.Linear):
                out = layer(out)
            else:
                out = layer(out, arch_weights[i - 1])
        return self.output_layer(out)

    def discretize(self):
        """Collapse the continuous relaxation to a discrete architecture."""
        architecture = {
            'num_hidden_layers': self.num_hidden_layers,
            'hidden_layer_size': self.hidden_layer_size,
            'learning_rate': 0.001,
            'optimizer': 'Adam',
            'dropout_rate': 0.0,
        }
        best_op_indices = self.alphas.argmax(dim=-1)
        best_ops = [self.ops_list[i].__name__ for i in best_op_indices]
        architecture['activation_function'] = best_ops[0]
        return architecture

def run_gradient_based_search(search_space, X_train, y_train, X_val, y_val, num_epochs=50):
    model = Model(search_space)
    criterion = nn.MSELoss()
    optimizer_alpha = optim.Adam([model.alphas], lr=0.001)
    weight_params = [p for p in model.parameters() if p is not model.alphas]
    optimizer_w = optim.Adam(weight_params, lr=0.01)
    for epoch in range(num_epochs):
        # Step 1: update network weights on the training split.
        optimizer_w.zero_grad()
        outputs = model(X_train)
        loss_w = criterion(outputs, y_train)
        loss_w.backward()
        optimizer_w.step()
        # Step 2: update architecture parameters on the validation split.
        optimizer_alpha.zero_grad()
        val_outputs = model(X_val)
        loss_alpha = criterion(val_outputs, y_val)
        loss_alpha.backward()
        optimizer_alpha.step()
    # Retrain the discretized architecture from scratch for a fair final score.
    best_arch = model.discretize()
    final_loss = evaluate_architecture(best_arch, X_train, y_train, X_val, y_val, num_epochs=50)
    return best_arch, final_loss

Experimental Setup and Results
The search space includes:
search_space = {
    'num_hidden_layers': [1, 2, 3, 4, 5],
    'hidden_layer_size': [32, 64, 128, 256, 512],
    'activation_function': ['ReLU', 'LeakyReLU', 'Tanh'],
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'optimizer': ['Adam', 'SGD', 'RMSprop'],
    'dropout_rate': [0.0, 0.2, 0.4, 0.6]
}

Each method was run with the same data split (80% training, 20% validation). The RL search used 5 episodes, the EA search used a population of 10 over 5 generations, and the gradient-based search ran for 50 epochs.
Results
Evolutionary Algorithm (EA): Best validation MSE = 0.1498, discovered in the 2nd generation. Architecture – 5 hidden layers, 512 units each, Tanh activation, learning rate 0.1, SGD optimizer, dropout 0.2.
Reinforcement Learning (RL): Best validation MSE = 0.2744, found in episode 5. Architecture – 4 hidden layers, 64 units each, Tanh activation, learning rate 0.1, RMSprop optimizer, dropout 0.2. Losses per episode: 1.1483, 3.2017, 4.0062, 2.5762, 0.2744.
Gradient-Based (DARTS): Final validation MSE = 3.6725 after 50 epochs, the worst result of the three methods. Architecture – 5 hidden layers, 32 units each, LeakyReLU activation, learning rate 0.001, Adam optimizer, no dropout. Training loss decreased from 0.0938 (epoch 10) to 0.0114 (epoch 50), but validation loss remained high.
The EA method achieved the lowest MSE, indicating strong global search capability via population diversity. RL showed high variance due to exploration‑exploitation trade‑offs, while DARTS, despite its computational efficiency, struggled with the chosen search space and hyper‑parameters.
Analysis and Discussion
These results illustrate fundamental differences among NAS strategies. Evolutionary search avoids local minima by maintaining multiple candidates, reinforcement learning can incorporate multi‑objective rewards but suffers from sample inefficiency, and differentiable search offers fast gradient updates but is sensitive to search‑space design. The experiments also highlight two persistent challenges for NAS: (1) the massive computational cost of evaluating many architectures, and (2) limited generalization across tasks, as most methods over‑fit to the specific dataset used for search.
Recent work such as LangVision‑LoRA‑NAS and Jet‑Nemotron demonstrates the growing interest in applying NAS to large‑scale language and vision models, suggesting a future where NAS scales to massive pre‑trained models.
Conclusion
NAS provides a powerful automated tool for discovering high‑performance neural architectures, reducing reliance on expert intuition. While evolutionary algorithms currently deliver the best empirical performance on the presented benchmark, each method has distinct trade‑offs that must be considered when selecting a NAS approach for a given problem.
