Building a Student Model with TensorFlow: Deep Knowledge Tracing for Adaptive Learning
This article reviews how Liulishuo applied TensorFlow to implement a Deep Knowledge Tracing (DKT) student model for an adaptive learning system, covering the problem background, model architecture, TensorFlow implementation details, multi‑GPU training, and practical deployment considerations.
On February 15, 2017, Google held the first TensorFlow Dev Summit and released TensorFlow 1.0. A follow-up event in Shanghai on March 18 reviewed the summit, and this article shares how Liulishuo used TensorFlow to build a student model for its adaptive learning system.
1. Application Background
What is Adaptive Learning?
Adaptive learning aims to improve student efficiency by personalizing learning paths. Traditional teaching offers a one‑size‑fits‑all path, which is inefficient for students with varying abilities. By matching appropriate content to each student's ability, learning efficiency can be increased.
Student Model
The core challenges are (1) assessing a student's ability and (2) recommending suitable content based on that assessment. This article explains how TensorFlow can be used to construct the student model.
Student learning is treated as a time‑series of interactions; ability is inferred from the correctness of answers over time. Because ability changes, static assessment methods are unsuitable.
Deep Knowledge Tracing (DKT)
To model the learning sequence and evaluate ability at each time step, we adopt the Deep Knowledge Tracing (DKT) model introduced by Piech et al. (NIPS 2015). DKT is essentially a recurrent neural network that maps the sequence of a student's interactions to a sequence of predictions, one per time step.
The diagram shows the DKT model unfolded over time. Input sequence x1, x2, x3 … encodes answer information at timestamps t1, t2, t3 … . Hidden states capture knowledge mastery, and the output predicts the probability of answering each question correctly.
Assuming four questions, the output layer has four nodes (one per question). Using one-hot encoding of the (question, correctness) pair, the input layer size is 4 × 2 = 8. The input connects to an RNN hidden layer, which connects to the output layer, followed by a sigmoid activation.
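As a sketch, the (question, correctness) encoding described above can be written as follows. This is a NumPy illustration; the function name and the "incorrect first, correct second" layout are our assumptions, not taken from the article's demo code.

```python
import numpy as np

NUM_QUESTIONS = 4

def encode_interaction(question_id, correct):
    """One-hot encode a (question, correctness) pair into a
    2 * NUM_QUESTIONS input vector.

    Positions 0..3 flag an incorrect answer to that question;
    positions 4..7 flag a correct answer.
    """
    x = np.zeros(2 * NUM_QUESTIONS)
    x[question_id + correct * NUM_QUESTIONS] = 1.0
    return x

# Question 2 answered correctly -> index 2 + 1 * 4 = 6 is set.
x = encode_interaction(2, 1)
```

With four questions this yields the 8-dimensional input the article mentions; the same scheme scales to any number of questions.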
The loss function is the binary cross-entropy, summed over time steps:

L = Σ_t ℓ(y_t · δ(q_{t+1}), a_{t+1})

where y_t is the vector of predicted correctness probabilities at time t, δ(q_{t+1}) is the one-hot encoding of the question answered at step t+1, a_{t+1} is the 0/1 correctness of that answer, and ℓ denotes the binary cross-entropy loss.
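The loss can be sketched in NumPy as below. The function name and argument layout are ours; the real model computes this over padded batches inside the TensorFlow graph.

```python
import numpy as np

def dkt_loss(preds, question_ids, answers):
    """Binary cross-entropy over one answer sequence.

    preds: (T, num_questions) predicted correctness probabilities (y_t)
    question_ids[t]: id of the question answered at step t+1 (q_{t+1})
    answers[t]: 0/1 correctness of that answer (a_{t+1})
    """
    loss = 0.0
    for y_t, q, a in zip(preds, question_ids, answers):
        p = y_t[q]  # pick the prediction for the question actually answered
        loss += -(a * np.log(p) + (1 - a) * np.log(1 - p))
    return loss
```

Note that only the output node matching the next answered question contributes to the loss at each step; the other predictions are unsupervised at that step.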
2. Model Construction
First, model parameters are initialized and inputs are received via tf.placeholder:
Next, the RNN layer is built using tf.nn.dynamic_rnn with a multi-layer cell, passing each batch's per-example lengths via the sequence_length argument and obtaining the per-step hidden-state tensor state_series and the final state self.current_state:
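What tf.nn.dynamic_rnn does with sequence_length can be sketched in NumPy with a plain tanh cell (illustrative only; the article's model uses a multi-layer cell, and the weight names here are our assumptions):

```python
import numpy as np

def dynamic_rnn_sketch(x, seq_len, W, U, b):
    """Unroll a tanh RNN; past each example's sequence length the state
    is frozen and outputs are zeroed, mirroring tf.nn.dynamic_rnn.

    x: (batch, max_steps, input_dim), seq_len: (batch,) ints.
    Returns (outputs, final_state).
    """
    batch, max_steps, _ = x.shape
    hidden = W.shape[1]
    h = np.zeros((batch, hidden))
    outputs = np.zeros((batch, max_steps, hidden))
    for t in range(max_steps):
        h_new = np.tanh(x[:, t] @ W + h @ U + b)
        alive = (t < seq_len)[:, None]        # which examples are still active
        h = np.where(alive, h_new, h)         # freeze state past seq_len
        outputs[:, t] = np.where(alive, h_new, 0.0)  # zero padded outputs
    return outputs, h
```

This is why padding to max_steps is safe: padded positions neither change the state nor produce outputs.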
The output layer projects the hidden states to the output dimension with a weight matrix and bias, then applies tf.sigmoid:
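The projection can be sketched as follows (NumPy; variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer(state_series, W_out, b_out):
    """Project hidden states to per-question correctness probabilities.

    state_series: (batch, max_steps, hidden)
    Returns (batch, max_steps, num_skills) probabilities in (0, 1).
    """
    return sigmoid(state_series @ W_out + b_out)
```

The sigmoid squashes each output node independently, so every question gets its own correctness probability at every step.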
The prediction tensor self.pred_all has shape (batch_size, self.max_steps, num_skills). To train, we compute the loss and gradients, clip them with tf.clip_by_global_norm to avoid explosion, and apply a gradient descent step via self.train_op:
Gradients are obtained with tf.gradients and then clipped:
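The effect of tf.clip_by_global_norm can be sketched in NumPy: all gradients are rescaled jointly so their combined L2 norm never exceeds the threshold.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Jointly rescale gradients so their global L2 norm is at most
    clip_norm (a NumPy sketch of what tf.clip_by_global_norm computes)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm
```

Unlike per-tensor clipping, this preserves the direction of the overall gradient vector, which matters for RNN training stability.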
Training proceeds by creating a tf.Session and feeding data via a feed_dict inside a step method:
The learning rate is set with an assign_lr method:
Finally, the TensorFlowDKT class is ready for use:
Complete demo code is available at https://github.com/lingochamp/tensorflow-dkt.
3. Engineering Practice
By the end of 2016, Liulishuo's "Understand Your English" course had accumulated billions of answer records. Handling such scale revealed several practical insights.
Truncated BPTT
Sequences can exceed 50,000 steps, which is infeasible for standard BPTT due to memory limits. The solution is to split long sequences into segments and carry the final hidden state of each segment as the initial state of the next. This enables training on arbitrarily long sequences, with backpropagation truncated at segment boundaries.
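The state-carrying trick can be sketched with a toy tanh RNN: chaining segments through the final state reproduces the full forward unroll exactly (gradients, however, stop at segment boundaries). Weight shapes and names here are illustrative.

```python
import numpy as np

def rnn_segment(x, h0, W, U, b):
    """Run a plain tanh RNN over one segment, starting from state h0."""
    h = h0
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W + h @ U + b)
    return h

# Toy check: splitting a 200-step sequence into two 100-step segments and
# feeding the first segment's final state into the second reproduces the
# full-sequence forward pass.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
U = 0.1 * rng.normal(size=(16, 16))
b = np.zeros(16)
x = rng.normal(size=(200, 8))
h0 = np.zeros(16)

full = rnn_segment(x, h0, W, U, b)
carried = rnn_segment(x[100:], rnn_segment(x[:100], h0, W, U, b), W, U, b)
assert np.allclose(full, carried)
```

In the TensorFlow graph this corresponds to feeding self.current_state from one step call back in as the initial state of the next.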
Multi‑GPU Acceleration
When data reaches billions of records, training time becomes significant. A Multi‑Tower data‑parallel architecture is employed, where each GPU holds a model replica sharing parameters. Gradients from each GPU are averaged on the CPU and used to update the shared parameters.
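Averaging gradients across towers amounts to a per-variable mean. A NumPy sketch (the real average_gradients in the CIFAR-10 example operates on (gradient, variable) pairs placed on the CPU device):

```python
import numpy as np

def average_gradients(tower_grads):
    """Average per-variable gradients across towers (GPUs).

    tower_grads[i] is the list of gradients computed on tower i, one
    entry per shared variable, in the same order on every tower.
    """
    return [np.mean(np.stack(grads_per_var), axis=0)
            for grads_per_var in zip(*tower_grads)]
```

Because every tower holds a replica of the same parameters, applying the averaged gradient once is equivalent to training on the combined batch.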
Using TensorFlowDKT as an example, the average_gradients function (see TensorFlow models' CIFAR‑10 multi‑gpu example) aggregates gradients, and a custom feed_dict method supplies data to each tower.
Because some of the ops used by dynamic_rnn have no GPU kernel, the session must be created with soft placement enabled so those ops fall back to the CPU instead of raising an error:
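A sketch of the session configuration, assuming the TensorFlow 1.x API the article was written against (allow_soft_placement is the relevant flag):

```python
import tensorflow as tf  # TensorFlow 1.x API, as used in the article

# allow_soft_placement lets TensorFlow place an op on the CPU when the
# requested device (a GPU) has no kernel for it, instead of failing.
# log_device_placement can be flipped on to audit where ops actually run.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=False)
sess = tf.Session(config=config)
```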
With these steps, multi‑GPU training of the DKT model is achieved.
Model Export
Training is performed with the Python API, while the inference service is implemented in C++. tf.train.write_graph saves only the graph structure, not the variable values. Freezing the graph, which converts variables to constants (via the tensorflow.python.tools.freeze_graph tool or graph_util.convert_variables_to_constants), resolves this by producing a single self-contained file.
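A sketch of the freeze step using the TF 1.x freeze_graph tool; the paths and the output node name pred_all are placeholders for illustration, not values from the article.

```shell
# Merge checkpoint variable values into the GraphDef as constants so the
# C++ inference service can load a single frozen file.
python -m tensorflow.python.tools.freeze_graph \
  --input_graph=model/graph.pb \
  --input_checkpoint=model/model.ckpt \
  --output_graph=model/frozen_dkt.pb \
  --output_node_names=pred_all
```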
Conclusion
TensorFlow is a simple yet powerful deep‑learning framework that lets researchers focus on algorithms rather than low‑level details. Since early 2016, Liulishuo's algorithm team has applied TensorFlow to internal machine‑learning projects, gaining extensive experience that helps deliver smarter services to users.
References
Piech, Chris, et al. "Deep Knowledge Tracing." Advances in Neural Information Processing Systems, 2015.
https://www.tensorflow.org/
https://github.com/tensorflow/tensorflow
Liulishuo Tech Team
Help everyone become a global citizen!