Tag

Simhash

ZhongAn Tech Team
Sep 3, 2024 · Big Data

Real-Time Log Clustering Architecture and Continuous Clustering Algorithm

This article gives an overview of a log clustering system: its background, its architecture built on Filebeat, Kafka, Flink, Elasticsearch, and Grafana, and a continuous clustering algorithm that uses SimHash and Hamming distance for real‑time log governance and anomaly detection.

Flink · Log Clustering · Simhash
14 min read
Architect
Oct 18, 2021 · Fundamentals

Understanding Simhash: From Traditional Hash to Random Projection and LSH

This article explains the principles behind Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, random hyperplane hashing, implementation steps, query optimization with the pigeonhole principle, and the algorithm's limitations in short‑text scenarios.

Locality Sensitive Hashing · Random Projection · Simhash
18 min read
Sohu Tech Products
Mar 17, 2021 · Big Data

Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Algorithm · Big Data · Locality Sensitive Hashing
24 min read
360 Quality & Efficiency
Oct 19, 2018 · Big Data

Information Fingerprint and Simhash Algorithm for Large-Scale Duplicate Detection

This article explains the concept of information fingerprints, reviews traditional set‑equality methods, introduces the Simhash algorithm for reducing high‑dimensional text features to compact fingerprints, and shows how partitioning 64‑bit fingerprints enables efficient duplicate detection on massive web data.

Big Data · Duplicate Detection · Simhash
6 min read