Semantic Search on Wikipedia with Weaviate, GraphQL, Sentence‑BERT, and BERT Q&A

This article walks through building a large‑scale semantic search system on the English Wikipedia using the Weaviate vector database, GraphQL queries, and pre‑trained Sentence‑BERT and BERT Q&A models, covering dataset preparation, schema design, import pipelines, query examples, and production deployment strategies.


To perform large‑scale semantic search, a vector search engine is required. The guide uses the open‑source English Wikipedia dump (2021‑10‑09) containing 11,348,257 articles, 27,377,159 paragraphs, and 125,447,595 cross‑references. The service runs on a Google Cloud VM with 12 CPUs, 100 GB RAM, a 250 GB SSD, and an NVIDIA Tesla P4.

The ML models employed are multi-qa-MiniLM-L6-cos-v1 and bert-large-uncased-whole-word-masking-finetuned-squad, both available as pre‑trained modules inside Weaviate.
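Both models map text to dense vectors that are compared by cosine similarity (the "cos" in multi-qa-MiniLM-L6-cos-v1). A minimal illustration with hypothetical 4‑dimensional vectors (real MiniLM embeddings are 384‑dimensional, produced by the model's encode step inside Weaviate):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real pipeline would obtain these from
# the Sentence-BERT model, not hand-written lists.
query_vec = [0.1, 0.3, -0.2, 0.4]
para_vec = [0.2, 0.1, -0.1, 0.5]

print(round(cosine_similarity(query_vec, para_vec), 4))  # → 0.8854
```

The closer this score is to 1.0, the more semantically similar the query and paragraph are; Weaviate surfaces a related value as `certainty` in its query responses.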

Data import proceeds in two phases: (1) clean the Wikipedia dump and generate a JSON Lines file; (2) import the JSON Lines file into Weaviate. The process can be run manually, or a ready‑made file can be downloaded.
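Phase 1 boils down to flattening each cleaned article into one JSON object per line. A minimal sketch, assuming illustrative field names (`title`, `paragraphs`) that do not necessarily match the layout of the ready‑made file:

```python
import json

def articles_to_jsonl(articles):
    """Serialize cleaned article records as JSON Lines: one object per line."""
    return "\n".join(json.dumps(a, ensure_ascii=False) for a in articles)

# Hypothetical records produced by the dump-cleaning phase.
sample = [
    {"title": "Jazz", "paragraphs": ["Jazz is a music genre that originated..."]},
    {"title": "Saxophone", "paragraphs": ["The saxophone is a woodwind instrument..."]},
]

jsonl = articles_to_jsonl(sample)
for line in jsonl.splitlines():
    record = json.loads(line)  # each line round-trips independently
```

The appeal of JSON Lines here is that the importer can stream the file record by record instead of loading 11 million articles into memory at once.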

Schema creation defines two classes, Article and Paragraph, so that each paragraph links back to its article, forming a graph. This enables GraphQL queries that traverse article‑paragraph relationships.
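In the Python client, the two classes can be declared as plain dictionaries. The sketch below is an assumption-laden outline: the property names (`title`, `content`, `order`) and cross-references (`inArticle`, `hasParagraphs`) mirror the fields used in the queries later in this article, but the exact data types and module settings of the original schema may differ:

```python
# Schema sketch: an Article class and a Paragraph class linked in both
# directions, so GraphQL queries can traverse article <-> paragraph.
article_class = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["string"]},
        {"name": "hasParagraphs", "dataType": ["Paragraph"]},  # cross-reference
    ],
}

paragraph_class = {
    "class": "Paragraph",
    "properties": [
        {"name": "title", "dataType": ["string"]},
        {"name": "content", "dataType": ["text"]},       # field that gets vectorized
        {"name": "order", "dataType": ["int"]},          # position within the article
        {"name": "inArticle", "dataType": ["Article"]},  # back-reference to the article
    ],
}

schema = {"classes": [article_class, paragraph_class]}
# A Weaviate client would register this via its schema API
# (e.g. one create-class call per class against a running instance).
```

Because `dataType` can name another class, the schema itself encodes the graph structure that the later `inArticle { ... }` and `hasParagraphs { ... }` queries walk.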

Weaviate class structure

Paragraph content is vectorised with the Sentence‑BERT transformer; the resulting vectors power all semantic queries.

Import runs on the same hardware as the dataset preparation, but with four GPUs instead of one. A Docker Compose file mounts an external volume for persistent storage, sets CLUSTER_HOSTNAME to identify the cluster, and uses a Weaviate load balancer to route traffic to available transformer modules, accelerating import speed.
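A sketch of the Compose settings just described, assuming Weaviate's published image names and documented environment variables; service names, ports, and the volume path are placeholders, and the real setup replicates the transformer service behind the load balancer:

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate
    ports:
      - "8080:8080"
    volumes:
      - /var/weaviate:/var/lib/weaviate     # external volume for persistent storage
    environment:
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      CLUSTER_HOSTNAME: 'node1'             # identifies this node in the cluster
      ENABLE_MODULES: text2vec-transformers,qna-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      QNA_INFERENCE_API: http://qna-transformers:8080
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '1'                      # use the GPUs during import
  qna-transformers:
    image: semitechnologies/qna-transformers:bert-large-uncased-whole-word-masking-finetuned-squad
    environment:
      ENABLE_CUDA: '1'
```

Vectorization dominates import time, so scaling out the `t2v-transformers` service across the four GPUs is what actually accelerates the load.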

Docker environment setup

Querying the data

Weaviate enables two modules—semantic search and Q&A—accessible via GraphQL. Four example queries are provided:

Natural‑language question: asks "Where is the States General of The Netherlands located?" and returns a single answer with certainty ≈ 0.68.

{
  Get {
    Paragraph(
      ask: {
        question: "Where is the States General of The Netherlands located?"
        properties: ["content"]
      }
      limit: 1
    ) {
      _additional { answer { result certainty } }
      content title
    }
  }
}

Generic concept search: uses the nearText filter to find paragraphs about "Italian food" (limit 50).

{
  Get {
    Paragraph(
      nearText: { concepts: ["Italian food"] }
      limit: 50
    ) {
      content order title inArticle { title }
    }
  }
}

Mixing scalar and vector filters: searches paragraphs about saxophonist Michael Brecker, filtering by the scalar field inArticle and limiting to one result.

{
  Get {
    Paragraph(
      ask: { question: "What was Michael Brecker's first saxophone?" properties: ["content"] }
      where: { operator: Equal path: ["inArticle", "Article", "title"] valueString: "Michael Brecker" }
      limit: 1
    ) {
      _additional { answer { result } }
      content order title inArticle { title }
    }
  }
}

Combining concept search with graph relations: retrieves paragraphs about "jazz saxophone players" and follows the graph to linked articles.

{
  Get {
    Paragraph(
      nearText: { concepts: ["jazz saxophone players"] }
      limit: 25
    ) {
      content order title inArticle {
        title
        hasParagraphs { title }
      }
    }
  }
}
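All four queries above are ordinary GraphQL and can be sent to Weaviate's standard `/v1/graphql` endpoint as an HTTP POST with a JSON body of the form `{"query": ...}`. A minimal sketch that builds such a payload (the host and port are assumptions; only the payload construction is shown, not the network call):

```python
import json

def graphql_payload(query: str) -> str:
    """Wrap a raw GraphQL query string in the JSON body Weaviate expects."""
    return json.dumps({"query": query})

near_text_query = """
{
  Get {
    Paragraph(nearText: { concepts: ["jazz saxophone players"] } limit: 25) {
      content order title inArticle { title }
    }
  }
}
"""

body = graphql_payload(near_text_query)
# This body would be POSTed to e.g. http://localhost:8080/v1/graphql
# with header Content-Type: application/json.
```

The client libraries (Python, JavaScript, Go, etc.) build the same payload under the hood, so the raw queries in this article translate directly.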

Production strategy

Weaviate is designed for production‑grade machine‑learning workloads. The dataset can run on a single‑machine Docker setup; for larger deployments, a Kubernetes cluster can be launched (link omitted for brevity).

Scalability hinges on three components:

Data (the Wikipedia corpus)

Machine‑learning models (Sentence‑BERT, BERT Q&A)

Vector search engine (Weaviate)

The article demonstrates how to combine open‑source ML models with a vector database to turn the Wikipedia corpus into a production‑ready semantic search solution.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Vector Database, Semantic Search, GraphQL, Sentence-BERT, Wikipedia, Weaviate
Written by Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
