Semantic Search on Wikipedia with Weaviate, GraphQL, Sentence‑BERT, and BERT Q&A

This article walks through building a large‑scale semantic search system on the English Wikipedia using the Weaviate vector database, GraphQL queries, and pre‑trained Sentence‑BERT and BERT Q&A models, covering dataset preparation, schema design, import pipelines, query examples, and production deployment strategies.


To perform large‑scale semantic search, a vector search engine is required. The guide uses the open‑source English Wikipedia dump (2021‑10‑09) containing 11,348,257 articles, 27,377,159 paragraphs, and 125,447,595 cross‑references. The service runs on a Google Cloud VM with 12 CPUs, 100 GB RAM, a 250 GB SSD, and an NVIDIA Tesla P4.

The ML models employed are multi-qa-MiniLM-L6-cos-v1 and bert-large-uncased-whole-word-masking-finetuned-squad, both available as pre‑trained modules inside Weaviate.
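Both models map text to dense vectors that are compared by cosine similarity (the "cos" in multi-qa-MiniLM-L6-cos-v1). A minimal illustration with hypothetical 4‑dimensional vectors (real MiniLM embeddings are 384‑dimensional, produced by the model's encode step inside Weaviate):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real pipeline would obtain these from
# the Sentence-BERT model, not hand-written lists.
query_vec = [0.1, 0.3, -0.2, 0.4]
para_vec = [0.2, 0.1, -0.1, 0.5]

print(round(cosine_similarity(query_vec, para_vec), 4))  # → 0.8854
```

The closer this score is to 1.0, the more semantically similar the query and paragraph are; Weaviate surfaces a related value as `certainty` in its query responses.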

Data import proceeds in two phases: (1) clean the Wikipedia dump and generate a JSON Lines file; (2) import the JSON Lines file into Weaviate. The process can be run manually, or a ready‑made file can be downloaded.
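Phase 1 boils down to flattening each cleaned article into one JSON object per line. A minimal sketch, assuming illustrative field names (`title`, `paragraphs`) that do not necessarily match the layout of the ready‑made file:

```python
import json

def articles_to_jsonl(articles):
    """Serialize cleaned article records as JSON Lines: one object per line."""
    return "\n".join(json.dumps(a, ensure_ascii=False) for a in articles)

# Hypothetical records produced by the dump-cleaning phase.
sample = [
    {"title": "Jazz", "paragraphs": ["Jazz is a music genre that originated..."]},
    {"title": "Saxophone", "paragraphs": ["The saxophone is a woodwind instrument..."]},
]

jsonl = articles_to_jsonl(sample)
for line in jsonl.splitlines():
    record = json.loads(line)  # each line round-trips independently
```

The appeal of JSON Lines here is that the importer can stream the file record by record instead of loading 11 million articles into memory at once.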

Schema creation defines two classes, Article and Paragraph, so that each paragraph links back to its article, forming a graph. This enables GraphQL queries that traverse article‑paragraph relationships.
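In the Python client, the two classes can be declared as plain dictionaries. The sketch below is an assumption-laden outline: the property names (`title`, `content`, `order`) and cross-references (`inArticle`, `hasParagraphs`) mirror the fields used in the queries later in this article, but the exact data types and module settings of the original schema may differ:

```python
# Schema sketch: an Article class and a Paragraph class linked in both
# directions, so GraphQL queries can traverse article <-> paragraph.
article_class = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["string"]},
        {"name": "hasParagraphs", "dataType": ["Paragraph"]},  # cross-reference
    ],
}

paragraph_class = {
    "class": "Paragraph",
    "properties": [
        {"name": "title", "dataType": ["string"]},
        {"name": "content", "dataType": ["text"]},       # field that gets vectorized
        {"name": "order", "dataType": ["int"]},          # position within the article
        {"name": "inArticle", "dataType": ["Article"]},  # back-reference to the article
    ],
}

schema = {"classes": [article_class, paragraph_class]}
# A Weaviate client would register this via its schema API
# (e.g. one create-class call per class against a running instance).
```

Because `dataType` can name another class, the schema itself encodes the graph structure that the later `inArticle { ... }` and `hasParagraphs { ... }` queries walk.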

Weaviate class structure

Paragraph content is vectorised with the Sentence‑BERT transformer; the resulting vectors power all semantic queries.

Import runs on the same hardware as the dataset preparation, but with four GPUs instead of one. A Docker Compose file mounts an external volume for persistent storage, sets CLUSTER_HOSTNAME to identify the cluster, and uses a Weaviate load balancer to route traffic to available transformer modules, accelerating import speed.
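A sketch of the Compose settings just described, assuming Weaviate's published image names and documented environment variables; service names, ports, and the volume path are placeholders, and the real setup replicates the transformer service behind the load balancer:

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate
    ports:
      - "8080:8080"
    volumes:
      - /var/weaviate:/var/lib/weaviate     # external volume for persistent storage
    environment:
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      CLUSTER_HOSTNAME: 'node1'             # identifies this node in the cluster
      ENABLE_MODULES: text2vec-transformers,qna-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      QNA_INFERENCE_API: http://qna-transformers:8080
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '1'                      # use the GPUs during import
  qna-transformers:
    image: semitechnologies/qna-transformers:bert-large-uncased-whole-word-masking-finetuned-squad
    environment:
      ENABLE_CUDA: '1'
```

Vectorization dominates import time, so scaling out the `t2v-transformers` service across the four GPUs is what actually accelerates the load.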

Docker environment setup

Querying the data

Weaviate enables two modules—semantic search and Q&A—accessible via GraphQL. Four example queries are provided:

Natural‑language question: asks "Where is the States General of The Netherlands located?" and returns a single answer with certainty ≈ 0.68.

{
  Get {
    Paragraph(
      ask: {
        question: "Where is the States General of The Netherlands located?"
        properties: ["content"]
      }
      limit: 1
    ) {
      _additional { answer { result certainty } }
      content title
    }
  }
}

Generic concept search: uses the nearText filter to find paragraphs about "Italian food" (limit 50).

{
  Get {
    Paragraph(
      nearText: { concepts: ["Italian food"] }
      limit: 50
    ) {
      content order title inArticle { title }
    }
  }
}

Mixing scalar and vector filters: searches paragraphs about saxophonist Michael Brecker, filtering by the scalar field inArticle and limiting to one result.

{
  Get {
    Paragraph(
      ask: { question: "What was Michael Brecker's first saxophone?" properties: ["content"] }
      where: { operator: Equal path: ["inArticle", "Article", "title"] valueString: "Michael Brecker" }
      limit: 1
    ) {
      _additional { answer { result } }
      content order title inArticle { title }
    }
  }
}

Combining concept search with graph relations: retrieves paragraphs about "jazz saxophone players" and follows the graph to linked articles.

{
  Get {
    Paragraph(
      nearText: { concepts: ["jazz saxophone players"] }
      limit: 25
    ) {
      content order title inArticle {
        title
        hasParagraphs { title }
      }
    }
  }
}
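All four queries above are ordinary GraphQL and can be sent to Weaviate's standard `/v1/graphql` endpoint as an HTTP POST with a JSON body of the form `{"query": ...}`. A minimal sketch that builds such a payload (the host and port are assumptions; only the payload construction is shown, not the network call):

```python
import json

def graphql_payload(query: str) -> str:
    """Wrap a raw GraphQL query string in the JSON body Weaviate expects."""
    return json.dumps({"query": query})

near_text_query = """
{
  Get {
    Paragraph(nearText: { concepts: ["jazz saxophone players"] } limit: 25) {
      content order title inArticle { title }
    }
  }
}
"""

body = graphql_payload(near_text_query)
# This body would be POSTed to e.g. http://localhost:8080/v1/graphql
# with header Content-Type: application/json.
```

The client libraries (Python, JavaScript, Go, etc.) build the same payload under the hood, so the raw queries in this article translate directly.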

Production strategy

Weaviate is designed for production‑grade machine‑learning workloads. The dataset can run on a single‑machine Docker setup; for larger deployments, a Kubernetes cluster can be launched (link omitted for brevity).

Scalability hinges on three components:

Data (the Wikipedia corpus)

Machine‑learning models (Sentence‑BERT, BERT Q&A)

Vector search engine (Weaviate)

The article demonstrates how to combine open‑source ML models with a vector database to turn the Wikipedia corpus into a production‑ready semantic search solution.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Vector Database, Semantic Search, GraphQL, Sentence-BERT, Wikipedia, Weaviate
Written by Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
