Reinforcement Learning for Recommendation System Mixing: Concepts, Practice, and Evaluation

This article explains how reinforcement learning, with its focus on maximizing long‑term reward, can improve recommendation system mixing by covering basic RL concepts, differences from supervised learning, multi‑armed bandit approaches, practical OpenAI Gym experiments, new AUC metrics, online gains, and advanced model optimizations.

DataFunTalk

Nov 12, 2020

Compared with traditional supervised learning, reinforcement learning (RL) can maximize long‑term reward, which is especially valuable for recommendation systems that need to look beyond immediate clicks.

The article introduces RL basics, including the classic <A, S, R, P> tuple (Action, State, Reward, and transition Probability, i.e. the environment model), and contrasts RL with supervised and unsupervised learning, highlighting its focus on long-term gains rather than per-sample labels.
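The long-term objective mentioned above is usually written as a discounted return. The following is a minimal sketch, not taken from the article: the function name and the reward sequences are illustrative assumptions, and the discount of 0.9 simply mirrors the DISCOUNT_FACTOR used in the Q-learning example.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum over t of gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical reward sequences: an immediate click followed by nothing,
# versus no immediate click but two later ones.
immediate = discounted_return([1.0, 0.0, 0.0])
patient = discounted_return([0.0, 1.0, 1.0])
print(immediate, patient)
```

A supervised click model only sees the first reward of each sequence, so it would prefer the first trajectory; the discounted return prefers the second.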

It presents the multi-armed bandit (MAB) problem as the classic setting for the exploration-versus-exploitation trade-off, and mentions AlphaGo as an example that combines policy-based and value-based networks.
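The exploration-versus-exploitation trade-off can be illustrated with an epsilon-greedy bandit. This is a sketch under stated assumptions: the click probabilities, the function name run_bandit, and the choice of epsilon-greedy (rather than UCB or Thompson sampling) are illustrative and not taken from the article.

```python
import random


def run_bandit(click_probs, steps=5000, eps=0.1, seed=0):
    """Epsilon-greedy play on a Bernoulli bandit; returns (pull counts, value estimates)."""
    rng = random.Random(seed)
    n_arms = len(click_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: avoids storing every reward.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# Hypothetical click-through rates for three content types.
counts, values = run_bandit([0.1, 0.5, 0.3])
print(counts, values)
```

A bandit has no state, so it cannot account for how one recommendation changes what the user does next; that is the gap the full MDP formulation addresses.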

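For hands-on practice, the article suggests using OpenAI's gym environment (CartPole) and provides a complete tabular Q-learning implementation that discretizes the continuous observations into bins. The code targets the classic gym API (before version 0.26), where reset() returns the observation and step() returns four values: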

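import random

import gym
import numpy

# Number of bins per observation dimension
# (cart position, cart velocity, pole angle, pole angular velocity).
N_BINS = [5, 5, 5, 5]
LEARNING_RATE = 0.05
DISCOUNT_FACTOR = 0.9
EPS = 0.3  # exploration probability

# Bin edges span this range; values outside fall into the first or last bin.
MIN_VALUES = [-0.5, -2.0, -0.5, -3.0]
MAX_VALUES = [0.5, 2.0, 0.5, 3.0]
BINS = [numpy.linspace(MIN_VALUES[i], MAX_VALUES[i], N_BINS[i]) for i in range(4)]

def discretize(obs):
    # Map a continuous observation to a tuple of bin indices usable as a dict key.
    return tuple(int(numpy.digitize(obs[i], BINS[i])) for i in range(4))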

# Sparse Q-table: (state, action) -> estimated value; missing entries count as 0.
qv = {}

env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)
an = env.action_space.n

def get(s, a):
    return qv.get((s, a), 0)

def update(s, a, s1, r):
    # Q-learning update: move Q(s, a) toward r + gamma * max over a' of Q(s1, a').
    nows = get(s, a)
    best_next = max(get(s1, 0), get(s1, 1))
    qv[(s, a)] = nows + LEARNING_RATE * (r + DISCOUNT_FACTOR * best_next - nows)

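# Training: epsilon-greedy Q-learning over many episodes.
for i in range(500000):
    obs = env.reset()
    if i % 1000 == 0:
        print(i)
    for _ in range(5000):
        s = discretize(obs)
        # Greedy action under the current Q-table (ties go to action 0).
        nowa = 1 if get(s, 1) > get(s, 0) else 0
        # With probability EPS, explore by taking the other action.
        if random.random() <= EPS:
            nowa = 1 - nowa
        obs, reward, done, info = env.step(nowa)
        s1 = discretize(obs)
        if done:
            reward = -10  # penalize the step that ends the episode
        update(s, nowa, s1, reward)
        if done:
            break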

# Evaluation: run one episode greedily (no exploration) with rendering.
for i_episode in range(1):
    obs = env.reset()
    for t in range(5000):
        env.render()
        s = discretize(obs)
        maxa = 1 if get(s, 1) > get(s, 0) else 0
        obs, reward, done, info = env.step(maxa)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break

The article then discusses why RL is needed for recommendation mixing, pointing out challenges such as heterogeneous data, differing objectives across content types, high computational cost, and varying content quality.

It models the recommendation process as a Markov Decision Process where the system is the agent, recommended items are actions, and user feedback (clicks, negative feedback, exits) serves as reward.
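One way to picture this MDP is as a log of transitions, one per recommended slot. The sketch below is illustrative only: the Transition class, the state encoding, and the reward weights in REWARD_BY_FEEDBACK are assumptions, since the article does not publish its actual reward values.

```python
from dataclasses import dataclass

# Hypothetical reward weights for each kind of user feedback.
REWARD_BY_FEEDBACK = {"click": 1.0, "skip": 0.0, "negative": -1.0, "exit": -2.0}


@dataclass
class Transition:
    state: tuple       # e.g. the content types the user has seen so far
    action: str        # content type the system recommends next
    feedback: str      # observed user response to that recommendation
    next_state: tuple

    @property
    def reward(self):
        return REWARD_BY_FEEDBACK[self.feedback]

    @property
    def done(self):
        # The episode (session) ends when the user leaves.
        return self.feedback == "exit"


# A made-up three-step session.
session = [
    Transition((), "article", "click", ("article",)),
    Transition(("article",), "video", "negative", ("article", "video")),
    Transition(("article", "video"), "ad", "exit", ("article", "video", "ad")),
]
session_reward = sum(t.reward for t in session)
print(session_reward, session[-1].done)
```

Framing exits as a terminal event with negative reward is what lets the agent trade a slightly less clickable item now for a longer session later.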

To evaluate models offline, a new AUC metric is proposed: for a pair of items, it measures the probability that the one with the higher cumulative reward is ranked higher by the model. The argument is that this reflects long-term gains better than traditional CTR-based AUC, which only considers the immediate click.
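The pairwise definition above can be computed directly. The following is one possible reading of the metric, not the article's exact formula: the function name, the handling of ties (half credit for tied scores, pairs with equal reward skipped), and the sample numbers are assumptions.

```python
def pairwise_auc(scores, cumulative_rewards):
    """Fraction of item pairs where the higher-reward item also gets the higher score."""
    concordant = 0.0
    comparable = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if cumulative_rewards[i] == cumulative_rewards[j]:
                continue  # no preferred item in this pair
            comparable += 1
            if cumulative_rewards[i] > cumulative_rewards[j]:
                hi, lo = i, j
            else:
                hi, lo = j, i
            if scores[hi] > scores[lo]:
                concordant += 1.0
            elif scores[hi] == scores[lo]:
                concordant += 0.5  # tied scores count as a coin flip
    return concordant / comparable if comparable else float("nan")


# Hypothetical model scores and observed cumulative rewards for three items.
print(pairwise_auc([0.9, 0.1, 0.5], [3.0, 2.0, 1.0]))
```

The double loop is quadratic in the number of items, which is fine for a sketch; a production evaluation would typically sort once and count inversions instead.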

Online experiments show a 7% increase in total dwell time compared with rule‑based mixing, and a 1‑2% improvement over a supervised learning baseline.

Further model optimizations are described, including session-based recommendation using a Personal DQN with RNN-encoded states, Bloom embedding to reduce hash collisions paired with a Dueling DQN, and Double DQN for more stable learning (the dueling and double variants combined are commonly abbreviated DDDQN).
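The two DQN variants named above each change one small piece of the computation. The sketch below shows those pieces on plain Python lists, with no neural network; the function names and sample numbers are illustrative assumptions, and this is the textbook form of each technique rather than the article's implementation.

```python
def dueling_q(value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.9):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward  # no bootstrapping past the end of a session
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for a two-action next state.
print(dueling_q(1.0, [1.0, -1.0]))
print(double_dqn_target(1.0, False, [0.2, 0.8], [0.5, 0.3]))
```

Separating action selection from action evaluation is what reduces the overestimation bias of the plain max in Q-learning, which is the usual explanation for the stability gain.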

Negative feedback is incorporated as a negative reward, and focal loss is applied to address its sparsity, achieving a 19% reduction in negative feedback rate.
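Focal loss down-weights easy examples so that rare events, such as negative feedback here, contribute more to the gradient. The following is a minimal sketch of the standard binary focal loss for a single prediction; the defaults gamma=2.0 and alpha=0.25 are the commonly cited values from the focal loss literature, not values reported by the article.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability p of the positive class and label y."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p     # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Label 1 stands for "user gave negative feedback" in this hypothetical setup.
easy = focal_loss(0.9, 1)   # confident and correct
hard = focal_loss(0.1, 1)   # confident and wrong
print(easy, hard)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes it easy to compare against a baseline.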

The article concludes with reflections on the similarity between RL’s actor‑critic architecture and GANs, suggesting potential fusion of the two approaches for future improvements.

