AI-Based Digital Watermarking for Video: Design, Training Strategies, and Engineering Deployment
This article presents an AI‑driven invisible video watermarking system. It combines a convolutional encoder/decoder with SE blocks, a simulated‑JPEG degradation layer, and a multi‑term training loss, and deploys with block‑wise processing, anchor‑based alignment, and redundancy voting, achieving high visual fidelity and robust watermark recovery after double compression on large‑scale platforms such as Bilibili.
The rapid growth of online video platforms relies heavily on creators who invest significant effort in producing content. As copyright awareness rises, protecting video assets has become a critical challenge, especially against unauthorized copying, clipping, cross‑platform redistribution, and unlicensed commercial use. Traditional manual verification of infringement is labor‑intensive and often ineffective.
Most mainstream platforms embed visible watermarks (e.g., logos) to declare ownership, but these can be easily removed by simple cropping or by using mature AI‑driven watermark removal tools. Consequently, a continuous arms race between watermarking techniques and removal attacks drives the evolution of more sophisticated methods.
Conventional invisible watermarking approaches—such as Least Significant Bit (LSB) manipulation in the spatial domain or DWT‑DCT embedding in the frequency domain—suffer from poor robustness against compression, transcoding, and quality‑degrading attacks. They also tend to introduce visible artifacts when forced to be highly robust.
AI‑based digital watermarks address many of these shortcomings by achieving strong robustness with minimal visual impact. However, early AI watermark models suffer from low encoding/decoding efficiency, limiting their applicability to large‑scale video pipelines.
Background
Digital watermarking follows three stages: embedding (adding the watermark to the original media), channel propagation (the watermarked media undergoes various distortions), and extraction (recovering the watermark). For video, the process is applied frame‑by‑frame, mirroring image watermarking pipelines.
Algorithm Design
1. Model Architecture
The proposed network consists of two modules: a watermark encoder and a watermark decoder. The encoder extracts features from the input frame using convolutional layers and SE (Squeeze‑and‑Excitation) blocks, reshapes the binary watermark into a 2‑D tensor, and merges it with the image features via deconvolution and channel‑wise concatenation to produce the watermarked frame. The decoder processes a potentially attacked frame with convolution + SE blocks, reduces the feature map to a single channel, flattens it, and outputs the recovered binary watermark.
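The article does not include code; the NumPy sketch below illustrates only the Squeeze‑and‑Excitation gating used in both modules. The channel count, reduction ratio, and weight initialization are illustrative assumptions, not the paper's values.

```python
import numpy as np

def se_block(feat, w1, b1, w2, b2):
    """Squeeze-and-Excitation: reweight channels of a (C, H, W) feature map."""
    s = feat.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    z = np.maximum(s @ w1 + b1, 0.0)           # excitation FC 1 + ReLU -> (C//r,)
    g = 1.0 / (1.0 + np.exp(-(z @ w2 + b2)))   # excitation FC 2 + sigmoid -> (C,)
    return feat * g[:, None, None]             # scale each channel by its gate

# Illustrative shapes: C = 16 channels, reduction ratio r = 4
rng = np.random.default_rng(0)
C, r = 16, 4
feat = rng.standard_normal((C, 32, 32))
w1, b1 = rng.standard_normal((C, C // r)) * 0.1, np.zeros(C // r)
w2, b2 = rng.standard_normal((C // r, C)) * 0.1, np.zeros(C)
out = se_block(feat, w1, b1, w2, b2)
```

Because the gate is a per‑channel sigmoid in (0, 1), the block can only attenuate channels, which is what lets the network emphasize feature maps that carry the watermark signal.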
2. Degradation Layer
During training, a degradation layer simulates real‑world attacks between the encoder and decoder. Three attack modes are randomly selected per iteration:
No attack (clean transmission)
Simulated JPEG compression (approximate quantization of high‑frequency DCT coefficients)
Real JPEG compression (actual codec)
For realistic JPEG attacks, the quality factor is sampled uniformly from [10, 90], and double compression is applied to mimic platform transcoding. This strategy expands the degradation space and improves robustness across varied compression settings.
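A rough sketch of the per‑iteration attack sampling, using a pure‑NumPy stand‑in for JPEG: here both the "simulated" and the "real" branches are approximated by zeroing high‑frequency 8×8 DCT coefficients, with the cutoff loosely tied to the quality factor. In the actual pipeline the third branch would invoke a real codec; the cutoff mapping below is an assumption for illustration.

```python
import numpy as np

def dct_mat(n=8):
    """Orthonormal DCT-II matrix, so forward = D @ B @ D.T, inverse = D.T @ C @ D."""
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2 / n)
    M[0] *= 1 / np.sqrt(2)
    return M

def sim_jpeg(y, qf):
    """Crude JPEG stand-in: keep only low-frequency 8x8 DCT coefficients.
    Assumes y is uint8 with dimensions divisible by 8."""
    keep = max(1, round(qf / 90 * 15))         # diagonal cutoff, assumption
    D = dct_mat()
    i, j = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    mask = (i + j) < keep                      # True = coefficient survives
    h, w = y.shape
    out = np.empty((h, w), dtype=np.float32)
    for r in range(0, h, 8):
        for c in range(0, w, 8):
            blk = y[r:r + 8, c:c + 8].astype(np.float32)
            out[r:r + 8, c:c + 8] = D.T @ ((D @ blk @ D.T) * mask) @ D
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def degrade(y, rng):
    """Randomly pick one of the three attack modes per training iteration."""
    mode = rng.choice(["none", "sim_jpeg", "double_jpeg"])
    if mode == "none":
        return y                               # clean transmission
    if mode == "sim_jpeg":
        return sim_jpeg(y, 50)
    out = y                                    # double compression, QF ~ U[10, 90]
    for _ in range(2):
        out = sim_jpeg(out, int(rng.integers(10, 91)))
    return out
```

Note that at a high quality factor the mask keeps every coefficient, so the transform round‑trips almost losslessly; lower factors discard progressively more high‑frequency detail.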
3. Loss Functions
The training objective combines several terms:
MSE Loss on the watermarked image to enforce invisibility.
Adversarial loss to align the distribution of watermarked images with natural images.
MSE Loss on the extracted watermark to enforce robustness.
Target PSNR Loss that drives the peak‑signal‑to‑noise ratio toward a predefined value, ensuring high visual quality.
LPIPS Loss (Learned Perceptual Image Patch Similarity) to further reduce perceptual artifacts.
The total loss is expressed as:
L = λ1·L_MSE_image + λ2·L_adv + λ3·L_MSE_wm + λ4·L_PSNR_target + λ5·L_LPIPS, where each λ balances a specific component.
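As a sketch, the terms can be combined as a weighted sum. The squared‑difference form of the target‑PSNR term, the 45 dB target, and the λ values below are assumptions for illustration; the article does not specify them.

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio between two arrays, in dB."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def total_loss(img_mse, adv, wm_mse, psnr_val, lpips,
               target_psnr=45.0, lams=(1.0, 1.0, 1.0, 0.01, 1.0)):
    """Weighted sum of the five loss terms; the PSNR term pulls the
    measured PSNR toward a predefined target value (assumed form)."""
    l1, l2, l3, l4, l5 = lams
    return (l1 * img_mse + l2 * adv + l3 * wm_mse
            + l4 * (psnr_val - target_psnr) ** 2 + l5 * lpips)

# Example: a frame pair 10 gray levels apart has PSNR ~28 dB,
# so the PSNR term penalizes the gap to the 45 dB target.
p = psnr(np.zeros((4, 4)), np.full((4, 4), 10.0))
loss = total_loss(0.1, 0.2, 0.3, p, 0.05)
```

Anchoring PSNR to a target, rather than maximizing it, keeps the optimizer from trading all of its capacity for invisibility at the expense of watermark robustness.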
Engineering Deployment
1. Block Encoding
To meet the throughput demands of Bilibili’s massive video library, watermark embedding is performed on image blocks rather than whole frames. For a 4K frame, a full‑frame encoder would require ~4 TFLOPs and 14 GB of GPU memory, taking ~400 ms per frame. By processing 512 × 512 blocks, the compute drops to ~0.13 TFLOPs, memory to 2.7 GB, and latency to ~12 ms. Only the Y (luminance) channel is watermarked, preserving chroma (U/V) quality.
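A minimal sketch of the tiling step, assuming non‑overlapping 512 × 512 tiles with the right/bottom remainder dropped; the article does not specify the placement strategy, and padding the border would be an equally valid choice.

```python
import numpy as np

def tile_y_channel(y, block=512):
    """Split a luminance (Y) plane into non-overlapping block x block tiles,
    dropping any right/bottom remainder smaller than a full block."""
    h, w = y.shape
    return [y[r:r + block, c:c + block]
            for r in range(0, h - block + 1, block)
            for c in range(0, w - block + 1, block)]

# A 4K Y plane (3840 x 2160) yields a 7 x 4 grid = 28 full tiles;
# each tile is encoded independently, which is what cuts the per-frame
# compute and memory figures quoted above.
tiles = tile_y_channel(np.zeros((2160, 3840), dtype=np.uint8))
```

Working on fixed‑size tiles also means the encoder's compute cost is independent of frame resolution, which is why the latency gap between 1080p and 4K largely disappears.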
2. Anchor Calibration
Since attacks may involve scaling or cropping, the spatial location of watermarked blocks can shift. Sparse anchor points are embedded in the U channel to mark the positions of Y‑channel blocks. A dedicated anchor‑detection network (inspired by facial‑keypoint detectors) locates these points, enabling a perspective transform that aligns attacked blocks before decoding.
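The anchor‑detection network itself cannot be reproduced from the article, but once four anchor points have been matched to their original positions, the alignment step is a standard perspective (homography) estimate, sketched here in plain NumPy via the direct linear transform.

```python
import numpy as np

def homography(src, dst):
    """Estimate the 3x3 perspective transform mapping 4 src points to 4 dst
    points by solving the 8x8 direct-linear-transform system."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, p):
    """Apply a homography to a 2-D point (homogeneous divide included)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

# Example: anchors detected at a square shifted by (10, 5) imply a pure
# translation, which the decoder can then undo before extracting bits.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 5), (11, 5), (11, 6), (10, 6)]
H = homography(src, dst)
```

In production one would estimate the transform from the detected anchors and warp the attacked frame back into block coordinates before running the decoder.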
3. Redundancy and Voting
Multiple watermarked blocks per frame allow bit‑wise voting across decoded watermarks, dramatically reducing bit‑error rates. Additionally, a (7, 4) linear block code can be applied to the watermark payload, providing error‑correction capability when up to one bit per 7‑bit codeword is corrupted.
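Both mechanisms are standard and easy to sketch. The code below implements Hamming(7, 4), the canonical (7, 4) linear block code, alongside bit‑wise majority voting; the exact code layout used in production is not specified in the article.

```python
def ham74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def ham74_decode(c):
    """Decode a 7-bit codeword, correcting at most one flipped bit."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3      # syndrome = 1-indexed error position, 0 = clean
    if pos:
        c[pos - 1] ^= 1             # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]]

def vote(decoded_lists):
    """Bit-wise majority vote across watermarks decoded from several blocks."""
    return [int(sum(bits) * 2 > len(bits)) for bits in zip(*decoded_lists)]
```

Voting and coding are complementary: voting suppresses random per‑block errors first, and the block code then repairs any residual single‑bit error per codeword.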
Effect Demonstration
Experiments used a two‑pass H.264 compression (CRF = 32) to simulate piracy chains and embedded the string "bilibili@copyright" (256 bits after padding). On 1080p frames, the watermarked and original images were visually indistinguishable. After double compression, the decoder achieved 83% recall (83 out of 100 attacked frames correctly recovered the full string) while maintaining 100% precision (no false positives on clean frames).
Real‑world validation involved uploading watermarked videos to Bilibili and Douyin, downloading the transcoded versions, and successfully extracting the original watermark.
Conclusion
The presented AI digital watermark solution balances invisibility, robustness, and large‑scale efficiency, addressing key gaps in existing research such as multi‑pass compression resilience and practical encoding speed. Future work will explore higher payload capacities, adaptive bitrate trade‑offs, and more sophisticated degradation modeling.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.