
AI-Based Digital Watermarking for Video: Design, Training Strategies, and Engineering Deployment

This article presents an AI‑driven invisible video watermarking system that combines a convolutional encoder/decoder with SE blocks, a simulated‑JPEG degradation layer, a multi‑term loss, block‑wise processing, anchor‑based alignment, and redundancy voting, achieving high visual fidelity and robust recovery after double compression on large‑scale platforms such as Bilibili.

Bilibili Tech

The rapid growth of online video platforms relies heavily on creators who invest significant effort in producing content. As copyright awareness rises, protecting video assets has become a critical challenge, especially against unauthorized copying, clipping, cross‑platform redistribution, and unlicensed commercial use. Traditional manual verification of infringement is labor‑intensive and often ineffective.

Most mainstream platforms embed visible watermarks (e.g., logos) to declare ownership, but these can be easily removed by simple cropping or by using mature AI‑driven watermark removal tools. Consequently, a continuous arms race between watermarking techniques and removal attacks drives the evolution of more sophisticated methods.

Conventional invisible watermarking approaches—such as Least Significant Bit (LSB) manipulation in the spatial domain or DWT‑DCT embedding in the frequency domain—suffer from poor robustness against compression, transcoding, and quality‑degrading attacks. They also tend to introduce visible artifacts when forced to be highly robust.

AI‑based digital watermarks address many of these shortcomings by achieving strong robustness with minimal visual impact. However, early AI watermark models suffer from low encoding/decoding efficiency, limiting their applicability to large‑scale video pipelines.

Background

Digital watermarking follows three stages: embedding (adding the watermark to the original media), channel propagation (the watermarked media undergoes various distortions), and extraction (recovering the watermark). For video, the process is applied frame‑by‑frame, mirroring image watermarking pipelines.

Algorithm Design

1. Model Architecture

The proposed network consists of two modules: a watermark encoder and a watermark decoder. The encoder extracts features from the input frame using convolutional layers and SE (Squeeze‑and‑Excitation) blocks, reshapes the binary watermark into a 2‑D tensor, and merges it with the image features via deconvolution and channel‑wise concatenation to produce the watermarked frame. The decoder processes a potentially attacked frame with convolution + SE blocks, reduces the feature map to a single channel, flattens it, and outputs the recovered binary watermark.
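To make the channel‑attention step concrete, here is a minimal NumPy sketch of an SE block as used inside the encoder and decoder. The weight shapes, reduction ratio, and random weights are illustrative assumptions; the production model is a trained convolutional network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feat, w1, w2):
    """Squeeze-and-Excitation: reweight channels by global context.

    feat: (C, H, W) feature map
    w1:   (C//r, C) squeeze projection (r = reduction ratio)
    w2:   (C, C//r) excitation projection
    """
    squeeze = feat.mean(axis=(1, 2))                      # (C,) global average pool
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))  # (C,) per-channel gates in (0, 1)
    return feat * excite[:, None, None]                   # rescale each channel
```

The gates let the network emphasize channels that carry watermark‑relevant structure while suppressing the rest, at negligible compute cost.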

2. Degradation Layer

During training, a degradation layer simulates real‑world attacks between the encoder and decoder. Three attack modes are randomly selected per iteration:

No attack (clean transmission)

Simulated JPEG compression (approximate quantization of high‑frequency DCT coefficients)

Real JPEG compression (actual codec)

For realistic JPEG attacks, the quality factor is sampled uniformly from [10, 90], and double compression is applied to mimic platform transcoding. This strategy expands the degradation space and improves robustness against varied compression settings.
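The three attack modes can be sketched as follows. The simulated‑JPEG proxy here simply zeroes high‑frequency DCT coefficients of 8×8 blocks (an illustrative stand‑in for quantization), and the real‑codec branch is stubbed; the `keep` thresholds are assumptions, not the paper's settings.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis as a matrix, so dct2(x) = D @ x @ D.T."""
    k = np.arange(n)
    D = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    D[0] /= np.sqrt(2.0)
    return D

D8 = dct_matrix(8)

def simulated_jpeg(block, keep=6):
    """JPEG proxy: zero DCT coefficients with frequency index sum u+v >= keep."""
    coef = D8 @ block @ D8.T
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    coef[u + v >= keep] = 0.0
    return D8.T @ coef @ D8

def degrade(block, rng):
    """Pick one of the three training-time attack modes at random."""
    mode = rng.choice(["none", "sim_jpeg", "real_jpeg"])
    if mode == "none":
        return block
    if mode == "sim_jpeg":
        return simulated_jpeg(block)
    # Stand-in for the real-codec branch: the pipeline samples a quality
    # factor from U[10, 90] and compresses twice to mimic transcoding.
    return simulated_jpeg(simulated_jpeg(block, keep=5), keep=5)
```

Because the decoder is trained against all three modes per iteration, it cannot overfit to any single degradation pattern.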

3. Loss Functions

The training objective combines several terms:

MSE Loss on the watermarked image to enforce invisibility.

Adversarial loss to align the distribution of watermarked images with natural images.

MSE Loss on the extracted watermark to enforce robustness.

Target PSNR Loss that drives the peak‑signal‑to‑noise ratio toward a predefined value, ensuring high visual quality.

LPIPS Loss (Learned Perceptual Image Patch Similarity) to further reduce perceptual artifacts.

The total loss is expressed as:

L = λ1·L_MSE_image + λ2·L_adv + λ3·L_MSE_wm + λ4·L_PSNR_target + λ5·L_LPIPS, where each λ balances a specific component.
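The combination above can be sketched numerically. The λ values, the 42 dB target, and the squared‑error form of the target‑PSNR term are illustrative assumptions; in training, the adversarial and LPIPS terms would come from a discriminator and a perceptual network rather than being passed in as scalars.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=1.0):
    return 10.0 * np.log10(peak ** 2 / mse(a, b))

def total_loss(cover, stego, wm_true, wm_pred, l_adv, l_lpips,
               lams=(1.0, 0.01, 2.0, 0.001, 0.5), target_psnr=42.0):
    """Weighted sum of the five training terms (lambda values are assumptions)."""
    l1, l2, l3, l4, l5 = lams
    return (l1 * mse(cover, stego)                          # invisibility
            + l2 * l_adv                                    # adversarial term
            + l3 * mse(wm_true, wm_pred)                    # watermark recovery
            + l4 * (psnr(cover, stego) - target_psnr) ** 2  # drive PSNR to target
            + l5 * l_lpips)                                 # perceptual similarity
```

The target‑PSNR term is what distinguishes this objective from a plain invisibility/robustness trade‑off: instead of pushing distortion to zero, it anchors quality at a chosen operating point.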

Engineering Deployment

1. Block Encoding

To meet the throughput demands of Bilibili’s massive video library, watermark embedding is performed on image blocks rather than whole frames. For a 4K frame, a full‑frame encoder would require ~4 TFLOPs and 14 GB of GPU memory, taking ~400 ms per frame. By processing 512 × 512 blocks, the compute drops to ~0.13 TFLOPs, memory to 2.7 GB, and latency to ~12 ms. Only the Y (luminance) channel is watermarked, preserving chroma (U/V) quality.
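A minimal sketch of the tiling step, assuming non‑overlapping 512 × 512 blocks on the Y plane (edge strips smaller than a block are skipped here for brevity; a real embedder would pad or reposition blocks to cover them):

```python
import numpy as np

def tile_y_plane(y, size=512):
    """Split the Y (luminance) plane into non-overlapping size x size blocks.

    Returns a list of (row, col, block) tuples so each watermarked block
    can be written back to its original position.
    """
    h, w = y.shape
    return [(r, c, y[r:r + size, c:c + size])
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]
```

For a 2160 × 3840 (4K) Y plane this yields 4 × 7 = 28 full blocks, each of which passes through the encoder independently, which is where the drop from ~4 TFLOPs to ~0.13 TFLOPs per inference comes from.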

2. Anchor Calibration

Since attacks may involve scaling or cropping, the spatial location of watermarked blocks can shift. Sparse anchor points are embedded in the U channel to mark the positions of Y‑channel blocks. A dedicated anchor‑detection network (inspired by facial‑keypoint detectors) locates these points, enabling a perspective transform that aligns attacked blocks before decoding.
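Once the anchor‑detection network has located four anchor points, the alignment itself is a standard perspective (homography) solve. A NumPy sketch of that math, assuming four non‑degenerate point correspondences (the anchor detector itself is not reproduced here):

```python
import numpy as np

def perspective_from_anchors(src, dst):
    """Solve the 8-DOF homography mapping 4 detected anchor points (src,
    in the attacked frame) back to their canonical positions (dst)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, pt):
    """Apply the homography to one (x, y) point."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```

In deployment the inverse warp is applied to the whole attacked block, restoring it to the encoder's coordinate frame before the decoder runs.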

3. Redundancy and Voting

Multiple watermarked blocks per frame allow bit‑wise voting across decoded watermarks, dramatically reducing bit‑error rates. Additionally, a (7, 4) linear block code can be applied to the watermark payload, providing error‑correction capability when up to one bit per 7‑bit codeword is corrupted.
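Both mechanisms are simple enough to sketch directly. The systematic Hamming(7, 4) layout below is an assumption about which (7, 4) code the pipeline uses; any single‑error‑correcting (7, 4) code behaves equivalently.

```python
import numpy as np

def majority_vote(copies):
    """Bit-wise majority over watermark copies decoded from different blocks."""
    return (np.mean(copies, axis=0) >= 0.5).astype(int)

# Parity-check matrix of a systematic Hamming(7, 4) code,
# codeword layout [d1 d2 d3 d4 p1 p2 p3].
H74 = np.array([[1, 1, 0, 1, 1, 0, 0],
                [1, 0, 1, 1, 0, 1, 0],
                [0, 1, 1, 1, 0, 0, 1]])

def hamming74_encode(d):
    d = np.asarray(d)
    p = np.array([d[0] ^ d[1] ^ d[3],
                  d[0] ^ d[2] ^ d[3],
                  d[1] ^ d[2] ^ d[3]])
    return np.concatenate([d, p])

def hamming74_decode(c):
    """Correct up to one flipped bit per 7-bit codeword, return the 4 data bits."""
    c = np.array(c)
    s = H74 @ c % 2
    if s.any():  # nonzero syndrome: flip the bit whose H-column matches it
        err = next(j for j in range(7) if np.array_equal(H74[:, j], s))
        c[err] ^= 1
    return c[:4]
```

Voting handles uncorrelated per‑block bit errors, while the block code mops up residual errors in the voted result, so the two layers of redundancy are complementary.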

Effect Demonstration

Experiments used a two‑pass H.264 compression (CRF = 32) to simulate piracy chains and embedded the string "bilibili@copyright" (256 bits after padding). On 1080p frames, the watermarked and original images were visually indistinguishable. After double compression, the decoder achieved an 83% recall (83 out of 100 attacked frames correctly recovered the full string) while maintaining 100% precision (no false positives on clean frames).

Real‑world validation involved uploading watermarked videos to Bilibili and Douyin, downloading the transcoded versions, and successfully extracting the original watermark.

Conclusion

The presented AI digital watermark solution balances invisibility, robustness, and large‑scale efficiency, addressing key gaps in existing research such as multi‑pass compression resilience and practical encoding speed. Future work will explore higher payload capacities, adaptive bitrate trade‑offs, and more sophisticated degradation modeling.

Tags: AI, deep learning, anchor calibration, block encoding, compression robustness, digital watermarking, video security
Written by Bilibili Tech, which provides introductions and tutorials on Bilibili-related technologies.