Big Data · 15 min read

Using Spark for Machine Learning, New Word Discovery, and Intelligent Q&A

The article explains how to leverage Apache Spark for machine‑learning tasks, large‑scale new‑word discovery, and simple intelligent question‑answering by using Spark‑Shell, Scala code, and word2vec‑based similarity, while sharing practical tips and performance considerations.


In this technical talk, the author William Zhu, a real‑time computing engineer, shares three Spark‑based topics: machine learning with Spark‑Shell, new‑word discovery, and a simple intelligent Q&A system.

1. Machine learning with Spark – Spark 1.5 provides a unified platform for real‑time, batch, SQL, and ML libraries. Spark‑Shell, combined with Scala, lets users write code as easily as shell scripts, but the code runs on a cluster of hundreds or thousands of nodes. Unlike traditional sampling‑based methods, Spark can process full‑scale data, making algorithms such as Naïve Bayes, Word2Vec, and linear regression practical for data analysts and engineers.

2. New‑word discovery – Exploiting Spark’s massive parallelism, the author processed more than 2 million blog posts (≈200 GB) and extracted roughly 80,000 candidate terms (Chinese, English, and mixed). Five attributes were computed for each term: cohesion, freedom, frequency, IDF, and overlapping substrings. The preprocessing pipeline has five steps: (1) strip HTML tags, (2) tokenize the text into small blocks, (3) cap candidate length at five characters, (4) extract Chinese/English tokens, and (5) remove special characters. Memory‑intensive operations such as groupByKey were replaced with reduceByKey to avoid out‑of‑memory failures. The final filtered list became the foundation of the company’s domain‑specific lexicon.
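The five preprocessing steps can be sketched as a small, local Scala pipeline. This is an illustrative assumption, not the author's actual code: the function names and regexes are made up, and in the talk this logic would run inside Spark transformations over the full corpus rather than on a local string.

```scala
// A minimal local sketch of the cleaning pipeline; names and regexes are
// illustrative, not the production code described in the talk.
object NewWordSketch {
  // (1) strip HTML tags
  def stripHtml(doc: String): String = doc.replaceAll("<[^>]*>", " ")

  // (5) remove special characters, keeping Han characters, letters, digits
  def removeSpecial(text: String): String =
    text.replaceAll("[^\\p{IsHan}A-Za-z0-9\\s]", " ")

  // (2)+(4) tokenize into contiguous Chinese or English blocks,
  // then (3) cap candidate length at five characters
  def clean(doc: String, maxLen: Int = 5): Seq[String] =
    "[\\p{IsHan}]+|[A-Za-z]+".r
      .findAllIn(removeSpecial(stripHtml(doc)))
      .filter(_.length <= maxLen)
      .toSeq
}
```

In a real run, each step would map over an RDD of documents, and candidate generation would also enumerate substrings of the blocks before scoring them on cohesion, freedom, and the other attributes.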

3. Intelligent Q&A – The prototype compares the similarity of two titles using word2vec vectors. Queries from search‑engine logs serve as the training corpus, producing 50‑dimensional embeddings for each word. Sentence vectors are obtained by element‑wise addition of word vectors (e.g., A[1,3,5] + B[1,3,7] = [2,6,12]). Titles with cosine similarity above 0.9 are treated as direct answers, while those above 0.7 are offered as references. This approach enables cross‑product knowledge linking, as demonstrated with CSDN’s Q&A system.
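The vector addition and similarity test above can be sketched in a few lines of Scala. This is a hedged, local illustration: the object and method names are assumptions, and the real system would read 50‑dimensional word2vec embeddings trained on the query logs rather than the toy vectors shown here.

```scala
// Local sketch of sentence-vector construction and cosine similarity;
// in the talk, word vectors come from a word2vec model trained on queries.
object TitleSimSketch {
  // Element-wise sum of word vectors yields a crude sentence vector.
  def sentenceVector(wordVecs: Seq[Array[Double]]): Array[Double] =
    wordVecs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }
}
```

With the article's example, summing [1,3,5] and [1,3,7] gives [2,6,12]; two identical sentence vectors score a cosine similarity of 1.0, and a threshold of 0.9 or 0.7 would then decide between "direct answer" and "reference".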

The author concludes with two key recommendations: data analysts and algorithm engineers should adopt Spark‑Shell (which now supports Python and R) for full‑scale analytics, and platform designers can draw on his experience when building machine‑learning pipelines.

Selected Q&A – Highlights include learning Scala as the entry point to Spark, why RAID is unnecessary given HDFS replication, handling memory pressure by preferring reduceByKey over groupByKey, and practical tips on vector addition and unsupervised Chinese tokenization.
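The reduceByKey-over-groupByKey tip can be illustrated with plain Scala collections. This is a local analogue, not Spark code: in Spark, reduceByKey combines values per key on each partition before shuffling, while groupByKey ships and materializes every value per key, which is what caused the out-of-memory failures mentioned above.

```scala
// Local analogue of the two aggregation styles on (word, 1) pairs.
object ReduceVsGroupSketch {
  val pairs = Seq(("spark", 1), ("scala", 1), ("spark", 1), ("spark", 1))

  // groupByKey-style: materializes the full value list per key, then sums.
  val grouped: Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // reduceByKey-style: folds each value into one running total per key,
  // never holding the whole value list in memory at once.
  val reduced: Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
}
```

Both styles produce identical counts; the difference is purely in how much intermediate state is held, which is why the folding style scales to the 200 GB corpus where the grouping style failed.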

Tags: Big Data · Machine Learning · Spark · Scala · Intelligent Q&A · New Word Discovery
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
