Tagged articles
9 articles
Sohu Smart Platform Tech Team
Aug 9, 2025 · Artificial Intelligence

How SimHash and Cosine Similarity Accelerate Large-Scale Text Deduplication

This article explains why traditional pairwise text comparison is impractical for massive news corpora, introduces cosine similarity and SimHash as efficient deduplication techniques, walks through their mathematical foundations and step‑by‑step implementation with code examples, and discusses trade‑offs such as accuracy versus speed.

Big Data · Cosine Similarity · SimHash
12 min read
ZhongAn Tech Team
Sep 3, 2024 · Big Data

Real-Time Log Clustering Architecture and Continuous Clustering Algorithm

This article gives an overview of a log clustering system, covering its background and its architecture built on Filebeat, Kafka, Flink, Elasticsearch, and Grafana, and introduces a continuous clustering algorithm that uses SimHash and Hamming distance for real‑time log governance and anomaly detection.

Flink · Log Clustering · Real-time analytics
14 min read
Sohu Tech Products
Feb 28, 2024 · Big Data

How SimHash and Cosine Similarity Accelerate Large‑Scale Text Deduplication

This article explains why massive news feeds need efficient deduplication, compares cosine similarity and SimHash for measuring text similarity, walks through a step‑by‑step implementation with Java code, and shows how an indexing strategy that trades space for time can reduce duplicate‑detection complexity from O(n²) to near O(1).

Big Data · Cosine Similarity · Near-Duplicate Detection
14 min read
NetEase Cloud Music Tech Team
Mar 31, 2022 · Industry Insights

How Implicit Relationship Chains Solve Cold‑Start Problems at NetEase Cloud Music

This article details NetEase Cloud Music's technical approach to building implicit user relationship chains—using SimHash, Item2Vec, and MetaPath2Vec embeddings, large‑scale vector search, and a unified service architecture—to address cold‑start challenges across multiple business scenarios.

Item2Vec · MetaPath2Vec · Recommendation Systems
20 min read
Architect
Oct 18, 2021 · Fundamentals

Understanding Simhash: From Traditional Hash to Random Projection and LSH

This article explains the principles behind Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, random hyperplane hashing, implementation steps, query optimization with the pigeonhole principle, and the algorithm's limitations in short‑text scenarios.

Locality Sensitive Hashing · Random Projection · SimHash
18 min read
Sohu Tech Products
Mar 17, 2021 · Big Data

Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Big Data · Cosine Similarity · Locality Sensitive Hashing
24 min read