Application Scenarios and Practical Implementation of NLP in Yuewen's Content Mining Platform
This article details the business background, technical architecture, and practical deployments of natural language processing for content mining at Yuewen. It covers tag construction, knowledge‑graph building, role analysis, recommendation generation, and porn and plagiarism detection, and closes with lessons learned.
Business Background – The rapid growth of online literature in China has turned novels into valuable IP, driving massive readership and cross‑media adaptations. Yuewen, now a 40‑billion‑HKD platform, faces challenges in understanding and monetizing this content.
Industry Status & Writing Patterns – Network novels span genres such as fantasy, sci‑fi, and historical fiction, each with distinct narrative structures and tag vocabularies. Continuous evolution of tags and writing templates necessitates fine‑grained labeling for effective recommendation.
Tag Dimension & Structured Tagging – A hierarchical tag system (generic + genre‑specific) is built through a closed loop of operations, editorial input, and technical validation. The process involves defining tags, collecting user‑filled tags, algorithmic candidate generation, and editorial confirmation.
Technical Architecture – The platform consists of five layers: underlying data, core technologies, foundational operators, application strategies, and business scenarios. Core operators work at paragraph and chapter granularity using end‑to‑end models.
Knowledge‑Base Construction – A knowledge base supports semantic understanding, relationship graph building, and inference. It can be constructed via data‑driven inference or manual curation, with deep‑learning assistance for large‑scale extraction.
Content Mining Goals & Solutions – The aim is to keep improving how content is converted into value by identifying user preferences and linking them to downstream scenarios. Solutions include closing the loop between platform demands, mining operators, and business feedback, and linking multiple platforms so that data flows between them.
Practical Deployments
1. Role Analysis – Uses NER and relationship extraction to identify main characters, compute social ratios, and cluster relationship graphs for persona insights.
2. Tag Construction – Combines rule‑based and similarity‑based methods (semantic vectors, behavior‑based B2V features) with deep‑learning models to generate and refine tags.
3. Recommendation Generation – Generates recommendation sentences by templating structured content or training data‑to‑sequence models on existing book recommendations.
4. Porn Detection – Employs keyword recall and model‑based recall, leveraging rule‑level, structural, and semantic features, optionally expanded with word2vec.
5. Plagiarism Detection – Splits chapters into sentences, filters short sentences, removes named entities, creates MD5 fingerprints, builds Lucene inverted indexes, and flags chapters whose fingerprint overlap exceeds a threshold.
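The plagiarism-detection steps in item 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not Yuewen's implementation: a plain in-memory dictionary stands in for the Lucene inverted index, named entities are passed in as a list instead of coming from an NER model, and the names `fingerprints`, `FingerprintIndex`, `MIN_SENTENCE_CHARS`, and `OVERLAP_THRESHOLD`, as well as both threshold values, are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

MIN_SENTENCE_CHARS = 12   # assumed cutoff; the source gives no value
OVERLAP_THRESHOLD = 0.5   # assumed cutoff; the source gives no value


def split_sentences(text):
    """Split a chapter on common Chinese and Western sentence terminators."""
    return [s.strip() for s in re.split(r"[。！？!?.\n]+", text) if s.strip()]


def fingerprints(chapter_text, entity_names=()):
    """Return the set of MD5 fingerprints for one chapter."""
    prints = set()
    for sentence in split_sentences(chapter_text):
        # Filter short sentences: they are too common to be evidence.
        if len(sentence) < MIN_SENTENCE_CHARS:
            continue
        # Remove named entities so renamed characters still match.
        for name in entity_names:
            sentence = sentence.replace(name, "")
        # Drop whitespace so spacing differences do not change the hash.
        sentence = re.sub(r"\s+", "", sentence)
        if sentence:
            prints.add(hashlib.md5(sentence.encode("utf-8")).hexdigest())
    return prints


class FingerprintIndex:
    """In-memory stand-in for an inverted index: fingerprint -> chapter ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, chapter_id, prints):
        for fp in prints:
            self._postings[fp].add(chapter_id)

    def find_suspects(self, prints, threshold=OVERLAP_THRESHOLD):
        """Return {chapter_id: overlap} for indexed chapters at or above threshold.

        Overlap is the share of the query chapter's fingerprints that also
        occur in the indexed chapter.
        """
        if not prints:
            return {}
        hits = defaultdict(int)
        for fp in prints:
            for chapter_id in self._postings.get(fp, ()):
                hits[chapter_id] += 1
        return {
            chapter_id: count / len(prints)
            for chapter_id, count in hits.items()
            if count / len(prints) >= threshold
        }


if __name__ == "__main__":
    index = FingerprintIndex()
    original = ("The old swordsman Li Yun climbed the frozen mountain alone. "
                "He found the hidden cave behind the waterfall.")
    index.add("book-A/ch1", fingerprints(original, ["Li Yun"]))

    suspect = ("The old swordsman Zhang Wei climbed the frozen mountain alone. "
               "He found the hidden cave behind the waterfall.")
    print(index.find_suspects(fingerprints(suspect, ["Zhang Wei"])))
```

Hashing whole sentences only catches verbatim copying once entities are stripped; lightly paraphrased text would need a fuzzier fingerprint, which the source does not describe.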
Practice Summary – Emphasizes tight integration of technology with business needs, efficient sample generation, and leveraging user behavior as a rich source of implicit knowledge for NLP models.
Author & Recruitment – Presented by Ma Yufeng, senior R&D engineer at Yuewen, with a background at Baidu. Recruitment details for a Text Mining Engineer position in Shanghai are provided.
Community Note – The DataFun community promotes the sharing of practical AI and big‑data experience through offline salons and online resources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.