Understanding NAS: Core Algorithms and Python Implementations
This article reviews Neural Architecture Search (NAS), explains its bi‑level optimization formulation, compares three major search strategies—reinforcement learning, evolutionary algorithms, and differentiable gradient‑based methods—provides complete Python code for each, and analyzes experimental results highlighting performance trade‑offs and remaining challenges.
NAS Technical Framework and Core Principles
Neural Architecture Search (NAS) reformulates neural network design as an automated optimization problem, converting the manual, time‑consuming process of crafting architectures (e.g., AlexNet, ResNet, Transformer) into a searchable space. The NAS workflow consists of three key components: defining the search space, selecting a search strategy, and evaluating candidate architectures.
Bi‑level Optimization Formulation
The objective is to find the architecture \(\alpha^*\) that minimizes validation loss \(L_{val}\) while simultaneously training the network weights \(w\) for that architecture. Formally:
\[
\min_{\alpha \in A} \; L_{val}(w^*(\alpha), \alpha)
\quad \text{s.t.} \quad
w^*(\alpha) = \arg\min_{w} L_{train}(w, \alpha)
\]
Here, \(A\) denotes the set of all possible architectures, \(L_{train}\) is the training loss, and \(L_{val}\) is the validation loss. The outer optimization searches over discrete architectures, while the inner optimization trains the weights.
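All three strategies below score candidates with an `evaluate_architecture` helper that carries out the inner optimization: build a network from the candidate's configuration, train its weights on the training split, and return the validation loss. The article does not reproduce this helper (or the `activation_map` it relies on), so the following is a minimal sketch consistent with how the search code calls it; the exact training loop used in the original experiments may differ.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Assumed mapping from search-space names to torch.nn activation classes.
activation_map = {'ReLU': nn.ReLU, 'LeakyReLU': nn.LeakyReLU, 'Tanh': nn.Tanh}

def build_model(architecture, input_dim=1, output_dim=1):
    """Assemble an MLP from one candidate architecture dictionary."""
    activation = activation_map[architecture['activation_function']]
    layers = []
    in_features = input_dim
    for _ in range(architecture['num_hidden_layers']):
        layers += [nn.Linear(in_features, architecture['hidden_layer_size']), activation()]
        if architecture['dropout_rate'] > 0:
            layers.append(nn.Dropout(architecture['dropout_rate']))
        in_features = architecture['hidden_layer_size']
    layers.append(nn.Linear(in_features, output_dim))
    return nn.Sequential(*layers)

def evaluate_architecture(architecture, X_train, y_train, X_val, y_val, num_epochs=10):
    """Inner optimization: train this candidate's weights, return validation MSE."""
    model = build_model(architecture)
    criterion = nn.MSELoss()
    optimizer_cls = {'Adam': optim.Adam, 'SGD': optim.SGD, 'RMSprop': optim.RMSprop}
    optimizer = optimizer_cls[architecture['optimizer']](
        model.parameters(), lr=architecture['learning_rate'])
    model.train()
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return criterion(model(X_val), y_val).item()
```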
Three Core Search Strategies
Reinforcement‑Learning‑Based Search
The RL approach treats architecture generation as a sequential decision process. A controller RNN proposes architecture components, receives a reward equal to the negative validation loss, and updates its policy via policy‑gradient methods.
import random

import torch
import torch.nn as nn
import torch.optim as optim

class ArchitectureController(nn.Module):
    def __init__(self, search_space):
        super(ArchitectureController, self).__init__()
        self.search_space = search_space
        self.keys = list(search_space.keys())
        self.vocab_size = [len(search_space[key]) for key in self.keys]
        self.num_actions = len(self.keys)
        # One RNN step per architecture decision; each policy head
        # scores the options for one search-space key.
        self.rnn = nn.RNN(input_size=1, hidden_size=64, num_layers=1)
        self.policy_heads = nn.ModuleList([nn.Linear(64, vs) for vs in self.vocab_size])

    def forward(self, input, hidden):
        output, hidden = self.rnn(input, hidden)
        logits = [head(output.squeeze(0)) for head in self.policy_heads]
        return logits, hidden
def run_rl_search(search_space, X_train, y_train, X_val, y_val, num_epochs=10, num_episodes=5):
    controller = ArchitectureController(search_space)
    controller_optimizer = optim.Adam(controller.parameters(), lr=0.01)
    best_loss = float('inf')
    best_arch = None
    for episode in range(num_episodes):
        controller_optimizer.zero_grad()
        hidden = torch.zeros(1, 1, 64)
        log_probs = []
        architecture = {}
        # Sample one decision per search-space key from the controller's policy.
        for i, key in enumerate(controller.keys):
            logits, hidden = controller(torch.zeros(1, 1, 1), hidden)
            dist = torch.distributions.Categorical(logits=logits[i])
            action = dist.sample()
            architecture[key] = search_space[key][action.item()]
            log_probs.append(dist.log_prob(action))
        val_loss = evaluate_architecture(architecture, X_train, y_train, X_val, y_val, num_epochs=num_epochs)
        # REINFORCE update: the reward is the negative validation loss.
        reward = -val_loss
        policy_loss = torch.sum(torch.stack(log_probs) * -reward)
        policy_loss.backward()
        controller_optimizer.step()
        if val_loss < best_loss:
            best_loss = val_loss
            best_arch = architecture
    return best_arch, best_loss

Evolutionary-Algorithm Search
The EA approach maintains a population of architectures and applies selection, crossover, and mutation to evolve better solutions over generations.
def run_evolutionary_search(X, y, search_space, population_size=10, num_generations=5):
    best_loss = float('inf')
    best_arch = None
    # 80/20 train/validation split.
    split_idx = int(len(X) * 0.8)
    X_train, X_val = X[:split_idx], X[split_idx:]
    y_train, y_val = y[:split_idx], y[split_idx:]
    # Initialize a random population of architectures.
    population = []
    for _ in range(population_size):
        architecture = {k: random.choice(v) for k, v in search_space.items()}
        population.append(architecture)
    for _ in range(num_generations):
        fitness = []
        for arch in population:
            loss = evaluate_architecture(arch, X_train, y_train, X_val, y_val, num_epochs=10)
            fitness.append((loss, arch))
            if loss < best_loss:
                best_loss = loss
                best_arch = arch
        # Keep the better half as elites, then refill the population
        # via uniform crossover plus a single-gene mutation.
        fitness.sort(key=lambda x: x[0])
        elites = [arch for _, arch in fitness[:population_size // 2]]
        new_population = elites.copy()
        while len(new_population) < population_size:
            p1, p2 = random.sample(elites, 2)
            child = {k: random.choice([p1[k], p2[k]]) for k in p1}
            mutation_key = random.choice(list(search_space.keys()))
            child[mutation_key] = random.choice(search_space[mutation_key])
            new_population.append(child)
        population = new_population
    return best_arch, best_loss

Gradient-Based Differentiable Search (DARTS)
DARTS builds a super‑network containing all candidate operations and learns continuous architecture weights via gradient descent, alternating between updating network weights and architecture parameters.
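Concretely, DARTS makes the discrete operation choice differentiable by relaxing it into a softmax-weighted mixture over the candidate set \(O\):

```latex
\bar{o}(x) = \sum_{o \in O} \frac{\exp(\alpha_o)}{\sum_{o' \in O} \exp(\alpha_{o'})}\, o(x)
```

After search, the relaxation is collapsed by keeping the operation with the largest \(\alpha_o\) at each position, which is what the `discretize` method in the code below does.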
class Cell(nn.Module):
    """Mixed operation: a weighted sum of every candidate op's output."""
    def __init__(self, in_features, out_features, ops):
        super(Cell, self).__init__()
        self.ops = nn.ModuleList([nn.Sequential(nn.Linear(in_features, out_features), op()) for op in ops])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class Model(nn.Module):
    def __init__(self, search_space):
        super(Model, self).__init__()
        self.ops_list = [activation_map[name] for name in search_space['activation_function']]
        self.num_ops = len(self.ops_list)
        self.num_hidden_layers = max(search_space['num_hidden_layers'])
        self.hidden_layer_size = search_space['hidden_layer_size'][0]
        # Architecture parameters: one row of operation logits per layer.
        self.alphas = nn.Parameter(torch.randn(self.num_hidden_layers, self.num_ops))
        self.layers = nn.ModuleList([nn.Linear(1, self.hidden_layer_size)])
        for _ in range(self.num_hidden_layers - 1):
            self.layers.append(Cell(self.hidden_layer_size, self.hidden_layer_size, self.ops_list))
        self.output_layer = nn.Linear(self.hidden_layer_size, 1)

    def forward(self, x):
        arch_weights = nn.functional.softmax(self.alphas, dim=-1)
        out = x
        for i, layer in enumerate(self.layers):
            if isinstance(layer, nn.Linear):
                out = layer(out)
            else:
                out = layer(out, arch_weights[i - 1])
        return self.output_layer(out)

    def discretize(self):
        """Collapse the continuous relaxation to a discrete architecture."""
        architecture = {
            'num_hidden_layers': self.num_hidden_layers,
            'hidden_layer_size': self.hidden_layer_size,
            'learning_rate': 0.001,
            'optimizer': 'Adam',
            'dropout_rate': 0.0,
        }
        best_op_indices = self.alphas.argmax(dim=-1)
        best_ops = [self.ops_list[i].__name__ for i in best_op_indices]
        architecture['activation_function'] = best_ops[0]
        return architecture

def run_gradient_based_search(search_space, X_train, y_train, X_val, y_val, num_epochs=50):
    model = Model(search_space)
    criterion = nn.MSELoss()
    optimizer_alpha = optim.Adam([model.alphas], lr=0.001)
    weight_params = [p for p in model.parameters() if p is not model.alphas]
    optimizer_w = optim.Adam(weight_params, lr=0.01)
    for epoch in range(num_epochs):
        # Step 1: update network weights on the training split.
        optimizer_w.zero_grad()
        outputs = model(X_train)
        loss_w = criterion(outputs, y_train)
        loss_w.backward()
        optimizer_w.step()
        # Step 2: update architecture parameters on the validation split.
        optimizer_alpha.zero_grad()
        val_outputs = model(X_val)
        loss_alpha = criterion(val_outputs, y_val)
        loss_alpha.backward()
        optimizer_alpha.step()
    # Retrain the discretized architecture from scratch for a fair final score.
    best_arch = model.discretize()
    final_loss = evaluate_architecture(best_arch, X_train, y_train, X_val, y_val, num_epochs=50)
    return best_arch, final_loss

Experimental Setup and Results
The search space includes:
search_space = {
    'num_hidden_layers': [1, 2, 3, 4, 5],
    'hidden_layer_size': [32, 64, 128, 256, 512],
    'activation_function': ['ReLU', 'LeakyReLU', 'Tanh'],
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'optimizer': ['Adam', 'SGD', 'RMSprop'],
    'dropout_rate': [0.0, 0.2, 0.4, 0.6]
}

Each method was run with the same data split (80% training, 20% validation). The RL search used 5 episodes, the EA search used a population of 10 over 5 generations, and the gradient-based search ran for 50 epochs.
Results
Evolutionary Algorithm (EA): Best validation MSE = 0.1498, discovered in the 2nd generation. Architecture – 5 hidden layers, 512 units each, Tanh activation, learning rate 0.1, SGD optimizer, dropout 0.2.
Reinforcement Learning (RL): Best validation MSE = 0.2744, found in episode 5. Architecture – 4 hidden layers, 64 units each, Tanh activation, learning rate 0.1, RMSprop optimizer, dropout 0.2. Losses per episode: 1.1483, 3.2017, 4.0062, 2.5762, 0.2744.
Gradient-Based (DARTS): Final validation MSE = 3.6725 after 50 epochs, the worst result of the three methods. Architecture – 5 hidden layers, 32 units each, LeakyReLU activation, learning rate 0.001, Adam optimizer, no dropout. Training loss decreased from 0.0938 (epoch 10) to 0.0114 (epoch 50), but validation loss remained high.
The EA method achieved the lowest MSE, indicating strong global search capability via population diversity. RL showed high variance due to exploration‑exploitation trade‑offs, while DARTS, despite its computational efficiency, struggled with the chosen search space and hyper‑parameters.
Analysis and Discussion
These results illustrate fundamental differences among NAS strategies. Evolutionary search avoids local minima by maintaining multiple candidates, reinforcement learning can incorporate multi‑objective rewards but suffers from sample inefficiency, and differentiable search offers fast gradient updates but is sensitive to search‑space design. The experiments also highlight two persistent challenges for NAS: (1) the massive computational cost of evaluating many architectures, and (2) limited generalization across tasks, as most methods over‑fit to the specific dataset used for search.
Recent work such as LangVision‑LoRA‑NAS and Jet‑Nemotron demonstrates the growing interest in applying NAS to large‑scale language and vision models, suggesting a future where NAS scales to massive pre‑trained models.
Conclusion
NAS provides a powerful automated tool for discovering high‑performance neural architectures, reducing reliance on expert intuition. While evolutionary algorithms currently deliver the best empirical performance on the presented benchmark, each method has distinct trade‑offs that must be considered when selecting a NAS approach for a given problem.
