Building a Collaborative Denoising Autoencoder with PyTorch Lightning

This article explains the collaborative denoising autoencoder (CDAE) for recommendation, walks through data preparation with MovieLens, shows a full PyTorch Lightning implementation, tunes hyper‑parameters using Ray Tune and CometML, and reports detailed evaluation metrics.


Autoencoders are neural networks that compress their input into a bottleneck representation and then reconstruct the original input from it. Constraints such as injected input noise, regularization, or a hidden layer with fewer units than the input force the model to capture the essential structure of the data rather than memorize it. In recommendation systems, where rating matrices are extremely sparse, a collaborative denoising autoencoder (CDAE) can exploit this to learn effective low-dimensional user-item representations.
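Before adding the collaborative pieces, the denoising idea alone can be sketched in a few lines of PyTorch (an illustrative toy, not code from the article; layer sizes and activations are arbitrary choices):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    """Minimal denoising autoencoder: corrupt the input, reconstruct the original."""
    def __init__(self, n_inputs, n_hidden, corruption=0.5):
        super().__init__()
        self.corruption = corruption
        self.encoder = nn.Linear(n_inputs, n_hidden)  # bottleneck: n_hidden < n_inputs
        self.decoder = nn.Linear(n_hidden, n_inputs)

    def forward(self, x):
        # Randomly zero out inputs during training; the target stays the clean vector.
        corrupted = F.dropout(x, self.corruption, training=self.training)
        hidden = torch.tanh(self.encoder(corrupted))
        return torch.sigmoid(self.decoder(hidden))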

The CDAE architecture augments a denoising autoencoder with a user-specific embedding that is added to the encoding of the corrupted rating vector. The input layer includes a user node, the hidden layer has fewer units than the input and output layers, and the output layer must approximate the original ratings despite the corruption. At inference time the user's rating vector is left intact, passed through the network, and the top-N highest-scoring items become the recommendations.
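Formally, in the notation of the original CDAE paper (Wu et al., WSDM 2016), the hidden representation of user u and the reconstructed score for item i are

z_u = h\left(W^{\top}\tilde{y}_u + V_u + b\right), \qquad \hat{y}_{ui} = f\left(W_i'^{\top} z_u + b_i'\right)

where \tilde{y}_u is the corrupted rating vector, V_u the user embedding, and h and f the hidden and output activations.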

Data preparation uses the MovieLens 1M dataset, keeping only ratings of 4 or higher and discarding users and items with fewer than five ratings. The resulting matrix has 6,034 users, 3,125 items, and a density of about 3 %.

Number of rows: 6034
Number of cols: 3125
Density: 3.046%
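The preprocessing behind these numbers might look roughly as follows (a sketch only: the file path, column names, and the iterative filtering loop are assumptions, not code from the article):

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Load MovieLens 1M ratings (standard ratings.dat layout assumed).
ratings = pd.read_csv("ml-1m/ratings.dat", sep="::", engine="python",
                      names=["user", "item", "rating", "ts"])

# Keep only positive feedback: ratings of 4 or higher.
ratings = ratings[ratings["rating"] >= 4]

# Drop users and items with fewer than five ratings; loop because each
# pass can push other users/items below the threshold.
while True:
    keep = (ratings.groupby("user")["item"].transform("size") >= 5) & \
           (ratings.groupby("item")["user"].transform("size") >= 5)
    if keep.all():
        break
    ratings = ratings[keep]

# Build a binary (implicit-feedback) user-item matrix.
rows = ratings["user"].astype("category").cat.codes
cols = ratings["item"].astype("category").cat.codes
mat = csr_matrix((np.ones(len(ratings)), (rows, cols)))
print(f"Density: {mat.nnz / (mat.shape[0] * mat.shape[1]):.3%}")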

For the split, at least 20% of each qualifying user's ratings are held out as the validation set, and the same per-user procedure is repeated to build the test set. Three DataLoaders are then created: a training loader over the sparse training matrix, a test loader that pairs the training matrix with the held-out target matrix, and an inference loader that supplies user IDs.

import numpy as np
from torch.utils.data import Dataset

class RecoSparseTrainDataset(Dataset):
    """Serves one user's sparse rating row as a dense vector plus the user index."""
    def __init__(self, sparse_mat):
        self.sparse_mat = sparse_mat

    def __len__(self):
        return self.sparse_mat.shape[0]

    def __getitem__(self, idx):
        # Densify a single row; cast to float32 so the default collate yields FloatTensors.
        batch_matrix = self.sparse_mat[idx].toarray().squeeze().astype(np.float32)
        return batch_matrix, idx

class RecoSparseTestSet(Dataset):
    """Pairs each user's training row (model input) with the held-out row (target)."""
    def __init__(self, train_mat, test_mat):
        assert train_mat.shape == test_mat.shape
        self.train_mat = train_mat
        self.test_mat = test_mat

    def __len__(self):
        return self.train_mat.shape[0]

    def __getitem__(self, idx):
        train_matrix = self.train_mat[idx].toarray().squeeze().astype(np.float32)
        test_matrix = self.test_mat[idx].toarray().squeeze().astype(np.float32)
        return train_matrix, test_matrix, idx
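The inference loader described above only needs to supply user IDs; one simple realization (the class name is hypothetical, reusing the imports above) also returns the intact rating row so the model can score all items:

class RecoInferenceDataset(Dataset):
    """Yields each user's intact rating row and ID so the model can score all items."""
    def __init__(self, sparse_mat):
        self.sparse_mat = sparse_mat

    def __len__(self):
        return self.sparse_mat.shape[0]

    def __getitem__(self, idx):
        # At inference the rating vector is left uncorrupted (no dropout is applied).
        return self.sparse_mat[idx].toarray().squeeze().astype(np.float32), idx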

The model is defined as a PyTorch Lightning module. It embeds users, encodes corrupted ratings, adds the embedding, applies an activation, and decodes back to the item space. Binary cross‑entropy loss is used, and training, validation, and test steps log loss and compute precision, recall, and NDCG.

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class CDAE(pl.LightningModule):
    def __init__(self, model_conf, novelty_per_item, num_users, num_items, remove_observed=False):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.remove_observed = remove_observed
        self.hidden_dim = model_conf["hidden_dim"]
        self.corruption_ratio = model_conf["corruption_ratio"]
        self.act = model_conf["activation"]
        self.out_act = "sigmoid"  # sigmoid outputs pair with BCELoss
        self.user_embedding = nn.Embedding(self.num_users, self.hidden_dim)
        self.encoder = nn.Linear(self.num_items, self.hidden_dim)
        self.decoder = nn.Linear(self.hidden_dim, self.num_items)
        self.criterion = nn.BCELoss(reduction='sum')
        self.save_hyperparameters(model_conf, ignore=["novelty_per_item", "remove_observed"])

    def __apply_activation(self, name, tensor):
        # Resolve the configured activation name ('sigmoid' or 'tanh').
        return torch.sigmoid(tensor) if name == 'sigmoid' else torch.tanh(tensor)

    def forward(self, x):
        rating_matrix, user_idx = x
        # Corrupt the input ratings with dropout, but only in training mode.
        corrupted = F.dropout(rating_matrix, self.corruption_ratio, training=self.training)
        embedded_users = self.user_embedding(user_idx)
        encoded = self.encoder(corrupted)
        # Inject the user-specific embedding before the hidden activation.
        enc = self.__apply_activation(self.act, torch.add(embedded_users, encoded))
        dec = self.decoder(enc)
        return self.__apply_activation(self.out_act, dec)
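The training, validation, and test hooks are omitted from the excerpt above. A sketch of how they might look, continuing the class (the Adam optimizer, the recommend helper, and the hparams keys are assumptions consistent with the search space below):

    # Continuing class CDAE -- hooks omitted from the excerpt above (sketch).
    def training_step(self, batch, batch_idx):
        rating_matrix, user_idx = batch
        pred = self((rating_matrix, user_idx))
        loss = self.criterion(pred, rating_matrix)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Optimizer choice assumed; lr and weight decay come from the tuned config.
        return torch.optim.Adam(self.parameters(),
                                lr=self.hparams["learning_rate"],
                                weight_decay=self.hparams["wd"])

    def recommend(self, rating_matrix, user_idx, top_n=20):
        # Score every item, optionally mask already-rated ones, return top-N indices.
        scores = self((rating_matrix, user_idx))
        if self.remove_observed:
            scores = scores.masked_fill(rating_matrix > 0, float("-inf"))
        return torch.topk(scores, top_n, dim=1).indices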

Hyper-parameter tuning combines Ray Tune and CometML. The search space covers the hidden dimension, corruption ratio, activation function, negative-sample probability, learning rate, and weight decay. Each trial reports validation loss and precision@20 back to Ray Tune, while the CometLoggerCallback forwards all logged metrics to CometML.

from ray import tune

search_space_conf = {
    "hidden_dim": tune.grid_search([50, 100, 200]),
    "corruption_ratio": tune.grid_search([0.3, 0.5, 0.8]),
    "activation": tune.grid_search(['sigmoid', 'tanh']),
    "negative_sample_prob": tune.grid_search([0, 0.5, 1]),
    "learning_rate": tune.grid_search([0.1, 0.05, 0.01]),
    "wd": tune.grid_search([0, 0.01, 0.001]),  # weight decay
}
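Wiring the search space into a tuning run might look roughly like this (a sketch: train_cdae and fit_and_evaluate are hypothetical placeholders, the project name is invented, and the CometLoggerCallback import path varies across Ray versions):

from ray import tune
# Import path differs across Ray versions; newer releases expose it as
# ray.air.integrations.comet.CometLoggerCallback instead.
from ray.tune.integration.comet import CometLoggerCallback

def train_cdae(config):
    # Placeholder trainable: build the loaders and a CDAE from `config`,
    # fit with pl.Trainer, then report the metrics Ray Tune optimizes.
    val_loss, precision_at_20 = fit_and_evaluate(config)  # hypothetical helper
    tune.report(val_loss=val_loss, precision_at_20=precision_at_20)

analysis = tune.run(
    train_cdae,
    config=search_space_conf,
    metric="val_loss",
    mode="min",
    callbacks=[CometLoggerCallback(project_name="cdae-tuning")],
)
print(analysis.best_config)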

The tuning run selects the best configuration:

{'hidden_dim': 100, 'corruption_ratio': 0.3, 'learning_rate': 0.3, 'wd': 0, 'activation': 'tanh'}

Evaluation on the test set yields the following metrics at a cutoff of 20:

Metric            Value @20
NDCG              0.1978
Precision         0.081
Recall            0.214
Coverage          0.018
Gini Diversity    0.547
Novelty           1.865
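For reference, ranking metrics like these are computed per user and then averaged; a minimal sketch for precision, recall, and NDCG at k (a hypothetical helper, not the article's evaluation code):

import numpy as np

def metrics_at_k(ranked_items, relevant_items, k=20):
    """Precision, recall, and (binary-relevance) NDCG at k for a single user."""
    top_k = np.asarray(ranked_items[:k])
    hits = np.isin(top_k, list(relevant_items))
    precision = hits.sum() / k
    recall = hits.sum() / max(len(relevant_items), 1)
    # DCG over the top-k list; the ideal DCG places all relevant items first.
    dcg = (hits / np.log2(np.arange(2, len(top_k) + 2))).sum()
    idcg = (1.0 / np.log2(np.arange(2, min(len(relevant_items), k) + 2))).sum()
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

Averaging these per-user values over all test users produces aggregate numbers like those in the table.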

Across the search, the hidden dimension and the corruption ratio had the strongest influence on final performance. For related implementations, see recommendation libraries such as MetaRec on GitHub.

[Figure: CDAE architecture diagram]
[Figure: Parallel coordinate plot of the hyper-parameter search]
