Reinforcement Learning for Recommendation System Mixing: Concepts, Practice, and Evaluation

This article explains how reinforcement learning, with its focus on maximizing long-term reward, can improve recommendation system mixing. It covers basic RL concepts, differences from supervised learning, multi-armed bandit approaches, a practical OpenAI Gym experiment, a new AUC metric for offline evaluation, online gains, and advanced model optimizations.

DataFunTalk

Compared with traditional supervised learning, reinforcement learning (RL) can maximize long‑term reward, which is especially valuable for recommendation systems that need to look beyond immediate clicks.

The article introduces RL basics, including the classic <A, S, R, P> tuple (Action, State, Reward, and the transition model P), and contrasts RL with supervised and unsupervised learning, highlighting its focus on long-term gains.
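To make "long-term gains" concrete: the quantity an RL agent usually maximizes is the discounted return, the sum of future rewards weighted by powers of a discount factor. The sketch below is illustrative only; the two reward sequences and the discount value are made-up assumptions, not data from the article.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum over k of gamma**k * rewards[k], accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical sessions: one item earns an immediate click and then the user
# leaves; the other earns nothing at first but keeps the user engaged.
clickbait = [1.0, 0.0, 0.0, 0.0]
engaging = [0.0, 1.0, 1.0, 1.0]

print(discounted_return(clickbait))
print(discounted_return(engaging))
```

A model trained only on the immediate click would prefer the first sequence, while the discounted return prefers the second.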

It presents the multi-armed bandit (MAB) problem as a classic formulation of the exploration vs. exploitation trade-off, and mentions AlphaGo as an example that combines policy-based and value-based networks.
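As a rough illustration of the exploration vs. exploitation trade-off, the following sketch plays a Bernoulli bandit with an epsilon-greedy rule. The arm probabilities, step count, and function name are hypothetical choices for the example; the article does not specify which bandit algorithm it uses.

```python
import random


def run_epsilon_greedy(true_rates, steps=5000, eps=0.1, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy; return (estimates, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts


# Hypothetical success rates for three content sources.
values, counts = run_epsilon_greedy([0.2, 0.5, 0.8])
print(values, counts)
```

With eps set to 0 the agent never explores and can lock onto whichever arm happened to pay off first; with eps set to 1 it never uses what it has learned.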

For hands-on practice, the article suggests OpenAI's gym CartPole environment and provides a complete tabular Q-learning implementation that discretizes the continuous observation into bins:

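import gym
import random
import numpy

# Written against the classic gym API (gym < 0.26), where reset() returns
# the observation and step() returns (obs, reward, done, info).

N_BINS = [5, 5, 5, 5]      # number of bin edges per observation dimension
LEARNING_RATE = 0.05
DISCOUNT_FACTOR = 0.9
EPS = 0.3                  # probability of taking the non-greedy action

# Bin ranges for cart position, cart velocity, pole angle, pole angular velocity
MIN_VALUES = [-0.5, -2.0, -0.5, -3.0]
MAX_VALUES = [0.5, 2.0, 0.5, 3.0]
BINS = [numpy.linspace(MIN_VALUES[i], MAX_VALUES[i], N_BINS[i]) for i in range(4)]

def discretize(obs):
    # Map the continuous 4-dimensional observation to a tuple of bin indices
    return tuple(int(numpy.digitize(obs[i], BINS[i])) for i in range(4))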

# Q-table: maps (state, action) to its estimated value; missing entries count as 0
qv = {}

env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)
an = env.action_space.n  # number of actions (2 for CartPole)

def get(s, a):
    return qv.get((s, a), 0)

def update(s, a, s1, r):
    # Tabular Q-learning update toward r + DISCOUNT_FACTOR * max over a' of Q(s1, a')
    nows = get(s, a)
    best_next = max(get(s1, 0), get(s1, 1))
    qv[(s, a)] = nows + LEARNING_RATE * (r + DISCOUNT_FACTOR * best_next - nows)

# Training: act greedily, but flip to the other action with probability EPS
for i in range(500000):
    obs = env.reset()
    if i % 1000 == 0:
        print(i)
    for _ in range(5000):
        s = discretize(obs)
        nowa = 1 if get(s, 1) > get(s, 0) else 0
        if random.random() <= EPS:
            nowa = 1 - nowa
        obs, reward, done, info = env.step(nowa)
        s1 = discretize(obs)
        if done:
            reward = -10  # penalize the step that ends the episode
        update(s, nowa, s1, reward)
        if done:
            break

# Evaluation: run one rendered episode with the greedy policy (no exploration)
for i_episode in range(1):
    obs = env.reset()
    for t in range(5000):
        env.render()
        s = discretize(obs)
        maxa = 1 if get(s, 1) > get(s, 0) else 0
        obs, reward, done, info = env.step(maxa)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
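
For reference, the update function in the listing above applies the standard tabular Q-learning rule, with LEARNING_RATE as the step size α, DISCOUNT_FACTOR as γ, and the maximum taken over the two CartPole actions:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in \{0, 1\}} Q(s', a') - Q(s, a) \right],
\qquad \alpha = 0.05, \quad \gamma = 0.9
```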

The article then discusses why RL is needed for recommendation mixing, pointing out challenges such as heterogeneous data, differing objectives across content types, high computational cost, and varying content quality.

It models the recommendation process as a Markov Decision Process where the system is the agent, recommended items are actions, and user feedback (clicks, negative feedback, exits) serves as reward.
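One way to picture this MDP framing is the small sketch below. All names and reward values here (SessionState, REWARD_BY_FEEDBACK, the item IDs) are hypothetical; the article does not give its state features or reward weights.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical reward values per feedback type; assumed for illustration only.
REWARD_BY_FEEDBACK = {
    "click": 1.0,
    "skip": 0.0,
    "negative_feedback": -1.0,
    "exit": -0.5,
}


@dataclass
class SessionState:
    """State: what the agent knows about the user so far in this session."""
    user_id: str
    shown_items: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)


def step(state, action_item, user_feedback):
    """Apply one transition and return (next_state, reward, done)."""
    reward = REWARD_BY_FEEDBACK[user_feedback]
    next_state = SessionState(
        user_id=state.user_id,
        shown_items=state.shown_items + [action_item],
        feedback=state.feedback + [user_feedback],
    )
    done = user_feedback == "exit"  # the episode ends when the user leaves
    return next_state, reward, done


s0 = SessionState(user_id="u1")
s1, r1, done1 = step(s0, "video_42", "click")
s2, r2, done2 = step(s1, "article_7", "exit")
print(r1, done1, r2, done2)
```

The shape mirrors the gym interface used in the CartPole example: the recommender picks an action, and the user plays the role of the environment that returns the next state and reward.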

To evaluate models offline, a new AUC metric is proposed: the probability that, for a pair of items, the item with the higher cumulative reward is ranked higher. The article argues that this reflects long-term gains better than traditional CTR-based AUC.
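A minimal sketch of such a pairwise metric follows. It is one reading of the description above, not the article's exact definition: the function name, the handling of ties, and the sample numbers are assumptions.

```python
from itertools import combinations


def reward_auc(scores, cumulative_rewards):
    """Fraction of item pairs where the higher-reward item also has the higher score.

    Pairs with equal cumulative reward are skipped; tied scores count as half.
    """
    concordant = 0.0
    total = 0
    pairs = combinations(zip(scores, cumulative_rewards), 2)
    for (s_i, r_i), (s_j, r_j) in pairs:
        if r_i == r_j:
            continue
        total += 1
        if s_i == s_j:
            concordant += 0.5
        elif (s_i > s_j) == (r_i > r_j):
            concordant += 1.0
    if total == 0:
        raise ValueError("need at least one pair with different rewards")
    return concordant / total


# Made-up model scores and observed cumulative rewards for four items.
scores = [0.9, 0.4, 0.7, 0.1]
rewards = [3.2, 0.0, 1.5, 2.0]
print(reward_auc(scores, rewards))
```

When the rewards are binary click labels, this reduces to the usual pairwise definition of AUC.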

Online experiments show a 7% increase in total dwell time compared with rule‑based mixing, and a 1‑2% improvement over a supervised learning baseline.

Further model optimizations are described: session-based recommendation using a Personal DQN with RNN-encoded states, Bloom embedding with Dueling DQN to reduce hash collisions, and Dueling Double DQN (DDDQN) for more stable learning.
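The two DQN variants named here can be sketched in a few lines. This shows only the textbook forms of the dueling aggregation and the Double DQN target, with made-up numbers; it assumes nothing about the article's actual network or features.

```python
def dueling_q_values(state_value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a of A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.9):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Made-up values for one transition with three candidate actions.
q_online = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
q_target = dueling_q_values(state_value=0.8, advantages=[0.1, 0.4, -0.5])

target = double_dqn_target(1.0, False, q_online, q_target)
plain_dqn_target = 1.0 + 0.9 * max(q_target)  # for comparison only
print(target, plain_dqn_target)
```

Here the online estimates favor action 0 while the target estimates peak at action 1, so the Double DQN target comes out lower than the plain max-based one. Avoiding that kind of overestimation is the usual argument for Double DQN being more stable.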

Negative feedback is incorporated as a negative reward, and focal loss is applied to address its sparsity, achieving a 19% reduction in negative feedback rate.
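For readers unfamiliar with focal loss, the sketch below shows its standard binary form, which down-weights easy examples so that rare positives (here, negative-feedback events) contribute more to training. How the article wires it into the DQN is not stated; the defaults for gamma and alpha are common choices from the focal loss literature, not values from the article.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class (here: negative feedback).
    y: 1 if the user gave negative feedback, else 0.
    gamma: focusing parameter (unrelated to the RL discount factor).
    alpha: weight on the positive class.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.05, 0)  # well-classified: no negative feedback, low prediction
hard = focal_loss(0.05, 1)  # missed negative feedback
print(easy, hard)
```

With gamma set to 0 the expression reduces to class-weighted cross-entropy, which makes the effect of the focusing term easy to compare.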

The article concludes with reflections on the similarity between RL’s actor‑critic architecture and GANs, suggesting potential fusion of the two approaches for future improvements.

