Building a Collaborative Denoising Autoencoder with PyTorch Lightning

This article explains the collaborative denoising autoencoder (CDAE) for recommendation, walks through data preparation with MovieLens, shows a full PyTorch Lightning implementation, tunes hyper‑parameters using Ray Tune and CometML, and reports detailed evaluation metrics.


Autoencoders are neural networks that compress their input into a bottleneck representation and then reconstruct the original input from it. Constraints such as injected input noise, regularization, or a hidden layer with fewer units than the input force the model to capture the essential structure of the data rather than memorize it. In recommendation systems, where rating matrices are extremely sparse, a collaborative denoising autoencoder (CDAE) can exploit this to learn effective low-dimensional user-item representations.
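Before adding the collaborative pieces, the denoising idea alone can be sketched in a few lines of PyTorch (an illustrative toy, not code from the article; layer sizes and activations are arbitrary choices):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    """Minimal denoising autoencoder: corrupt the input, reconstruct the original."""
    def __init__(self, n_inputs, n_hidden, corruption=0.5):
        super().__init__()
        self.corruption = corruption
        self.encoder = nn.Linear(n_inputs, n_hidden)  # bottleneck: n_hidden < n_inputs
        self.decoder = nn.Linear(n_hidden, n_inputs)

    def forward(self, x):
        # Randomly zero out inputs during training; the target stays the clean vector.
        corrupted = F.dropout(x, self.corruption, training=self.training)
        hidden = torch.tanh(self.encoder(corrupted))
        return torch.sigmoid(self.decoder(hidden))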

The CDAE architecture augments a denoising autoencoder with a user-specific embedding that is added to the encoding of the corrupted rating vector. The input layer includes a user node, the hidden layer has fewer units than the input and output layers, and the output layer must approximate the original ratings despite the corruption. At inference time the user's rating vector is left intact, passed through the network, and the top-N highest-scoring items become the recommendations.
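Formally, in the notation of the original CDAE paper (Wu et al., WSDM 2016), the hidden representation of user u and the reconstructed score for item i are

z_u = h\left(W^{\top}\tilde{y}_u + V_u + b\right), \qquad \hat{y}_{ui} = f\left(W_i'^{\top} z_u + b_i'\right)

where \tilde{y}_u is the corrupted rating vector, V_u the user embedding, and h and f the hidden and output activations.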

Data preparation uses the MovieLens 1M dataset, keeping only ratings of 4 or higher and discarding users and items with fewer than five ratings. The resulting matrix has 6,034 users, 3,125 items, and a density of about 3 %.

Number of rows: 6034
Number of cols: 3125
Density: 3.046%
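The preprocessing behind these numbers might look roughly as follows (a sketch only: the file path, column names, and the iterative filtering loop are assumptions, not code from the article):

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Load MovieLens 1M ratings (standard ratings.dat layout assumed).
ratings = pd.read_csv("ml-1m/ratings.dat", sep="::", engine="python",
                      names=["user", "item", "rating", "ts"])

# Keep only positive feedback: ratings of 4 or higher.
ratings = ratings[ratings["rating"] >= 4]

# Drop users and items with fewer than five ratings; loop because each
# pass can push other users/items below the threshold.
while True:
    keep = (ratings.groupby("user")["item"].transform("size") >= 5) & \
           (ratings.groupby("item")["user"].transform("size") >= 5)
    if keep.all():
        break
    ratings = ratings[keep]

# Build a binary (implicit-feedback) user-item matrix.
rows = ratings["user"].astype("category").cat.codes
cols = ratings["item"].astype("category").cat.codes
mat = csr_matrix((np.ones(len(ratings)), (rows, cols)))
print(f"Density: {mat.nnz / (mat.shape[0] * mat.shape[1]):.3%}")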

For the split, at least 20% of each qualifying user's ratings are held out as the validation set, and the same per-user procedure is repeated to build the test set. Three DataLoaders are then created: a training loader over the sparse training matrix, a test loader that pairs the training matrix with the held-out target matrix, and an inference loader that supplies user IDs.

import numpy as np
from torch.utils.data import Dataset

class RecoSparseTrainDataset(Dataset):
    """Serves one user's sparse rating row as a dense vector plus the user index."""
    def __init__(self, sparse_mat):
        self.sparse_mat = sparse_mat

    def __len__(self):
        return self.sparse_mat.shape[0]

    def __getitem__(self, idx):
        # Densify a single row; cast to float32 so the default collate yields FloatTensors.
        batch_matrix = self.sparse_mat[idx].toarray().squeeze().astype(np.float32)
        return batch_matrix, idx

class RecoSparseTestSet(Dataset):
    """Pairs each user's training row (model input) with the held-out row (target)."""
    def __init__(self, train_mat, test_mat):
        assert train_mat.shape == test_mat.shape
        self.train_mat = train_mat
        self.test_mat = test_mat

    def __len__(self):
        return self.train_mat.shape[0]

    def __getitem__(self, idx):
        train_matrix = self.train_mat[idx].toarray().squeeze().astype(np.float32)
        test_matrix = self.test_mat[idx].toarray().squeeze().astype(np.float32)
        return train_matrix, test_matrix, idx
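The inference loader described above only needs to supply user IDs; one simple realization (the class name is hypothetical, reusing the imports above) also returns the intact rating row so the model can score all items:

class RecoInferenceDataset(Dataset):
    """Yields each user's intact rating row and ID so the model can score all items."""
    def __init__(self, sparse_mat):
        self.sparse_mat = sparse_mat

    def __len__(self):
        return self.sparse_mat.shape[0]

    def __getitem__(self, idx):
        # At inference the rating vector is left uncorrupted (no dropout is applied).
        return self.sparse_mat[idx].toarray().squeeze().astype(np.float32), idx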

The model is defined as a PyTorch Lightning module. It embeds users, encodes corrupted ratings, adds the embedding, applies an activation, and decodes back to the item space. Binary cross‑entropy loss is used, and training, validation, and test steps log loss and compute precision, recall, and NDCG.

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class CDAE(pl.LightningModule):
    def __init__(self, model_conf, novelty_per_item, num_users, num_items, remove_observed=False):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.remove_observed = remove_observed
        self.hidden_dim = model_conf["hidden_dim"]
        self.corruption_ratio = model_conf["corruption_ratio"]
        self.act = model_conf["activation"]
        self.out_act = "sigmoid"  # sigmoid outputs pair with BCELoss
        self.user_embedding = nn.Embedding(self.num_users, self.hidden_dim)
        self.encoder = nn.Linear(self.num_items, self.hidden_dim)
        self.decoder = nn.Linear(self.hidden_dim, self.num_items)
        self.criterion = nn.BCELoss(reduction='sum')
        self.save_hyperparameters(model_conf, ignore=["novelty_per_item", "remove_observed"])

    def __apply_activation(self, name, tensor):
        # Resolve the configured activation name ('sigmoid' or 'tanh').
        return torch.sigmoid(tensor) if name == 'sigmoid' else torch.tanh(tensor)

    def forward(self, x):
        rating_matrix, user_idx = x
        # Corrupt the input ratings with dropout, but only in training mode.
        corrupted = F.dropout(rating_matrix, self.corruption_ratio, training=self.training)
        embedded_users = self.user_embedding(user_idx)
        encoded = self.encoder(corrupted)
        # Inject the user-specific embedding before the hidden activation.
        enc = self.__apply_activation(self.act, torch.add(embedded_users, encoded))
        dec = self.decoder(enc)
        return self.__apply_activation(self.out_act, dec)
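The training, validation, and test hooks are omitted from the excerpt above. A sketch of how they might look, continuing the class (the Adam optimizer, the recommend helper, and the hparams keys are assumptions consistent with the search space below):

    # Continuing class CDAE -- hooks omitted from the excerpt above (sketch).
    def training_step(self, batch, batch_idx):
        rating_matrix, user_idx = batch
        pred = self((rating_matrix, user_idx))
        loss = self.criterion(pred, rating_matrix)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Optimizer choice assumed; lr and weight decay come from the tuned config.
        return torch.optim.Adam(self.parameters(),
                                lr=self.hparams["learning_rate"],
                                weight_decay=self.hparams["wd"])

    def recommend(self, rating_matrix, user_idx, top_n=20):
        # Score every item, optionally mask already-rated ones, return top-N indices.
        scores = self((rating_matrix, user_idx))
        if self.remove_observed:
            scores = scores.masked_fill(rating_matrix > 0, float("-inf"))
        return torch.topk(scores, top_n, dim=1).indices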

Hyper-parameter tuning combines Ray Tune and CometML. The search space covers the hidden dimension, corruption ratio, activation function, negative-sample probability, learning rate, and weight decay. Each trial reports validation loss and precision@20 back to Ray Tune, while the CometLoggerCallback forwards all logged metrics to CometML.

from ray import tune

search_space_conf = {
    "hidden_dim": tune.grid_search([50, 100, 200]),
    "corruption_ratio": tune.grid_search([0.3, 0.5, 0.8]),
    "activation": tune.grid_search(['sigmoid', 'tanh']),
    "negative_sample_prob": tune.grid_search([0, 0.5, 1]),
    "learning_rate": tune.grid_search([0.1, 0.05, 0.01]),
    "wd": tune.grid_search([0, 0.01, 0.001]),  # weight decay
}
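Wiring the search space into a tuning run might look roughly like this (a sketch: train_cdae and fit_and_evaluate are hypothetical placeholders, the project name is invented, and the CometLoggerCallback import path varies across Ray versions):

from ray import tune
# Import path differs across Ray versions; newer releases expose it as
# ray.air.integrations.comet.CometLoggerCallback instead.
from ray.tune.integration.comet import CometLoggerCallback

def train_cdae(config):
    # Placeholder trainable: build the loaders and a CDAE from `config`,
    # fit with pl.Trainer, then report the metrics Ray Tune optimizes.
    val_loss, precision_at_20 = fit_and_evaluate(config)  # hypothetical helper
    tune.report(val_loss=val_loss, precision_at_20=precision_at_20)

analysis = tune.run(
    train_cdae,
    config=search_space_conf,
    metric="val_loss",
    mode="min",
    callbacks=[CometLoggerCallback(project_name="cdae-tuning")],
)
print(analysis.best_config)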

The tuning run selects the best configuration:

{'hidden_dim': 100, 'corruption_ratio': 0.3, 'learning_rate': 0.3, 'wd': 0, 'activation': 'tanh'}

Evaluation on the test set yields the following metrics at a cutoff of 20:

Metric            Value @20
NDCG              0.1978
Precision         0.081
Recall            0.214
Coverage          0.018
Gini Diversity    0.547
Novelty           1.865
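For reference, ranking metrics like these are computed per user and then averaged; a minimal sketch for precision, recall, and NDCG at k (a hypothetical helper, not the article's evaluation code):

import numpy as np

def metrics_at_k(ranked_items, relevant_items, k=20):
    """Precision, recall, and (binary-relevance) NDCG at k for a single user."""
    top_k = np.asarray(ranked_items[:k])
    hits = np.isin(top_k, list(relevant_items))
    precision = hits.sum() / k
    recall = hits.sum() / max(len(relevant_items), 1)
    # DCG over the top-k list; the ideal DCG places all relevant items first.
    dcg = (hits / np.log2(np.arange(2, len(top_k) + 2))).sum()
    idcg = (1.0 / np.log2(np.arange(2, min(len(relevant_items), k) + 2))).sum()
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

Averaging these per-user values over all test users produces aggregate numbers like those in the table.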

Across the search, the hidden dimension and the corruption ratio had the strongest influence on final performance. For related implementations, see recommendation libraries such as MetaRec on GitHub.

[Figure: CDAE architecture diagram]
[Figure: Parallel coordinate plot of the hyper-parameter search]
