
Implementing a Simple University Paper Plagiarism Detection System in Python

This article outlines the design and implementation of a basic university paper plagiarism detection system in Python. It covers text preprocessing with NLTK, TF‑IDF weighting, cosine similarity calculation, and a sample in‑memory paper database, and closes with notes on scalability, user interface, and legal considerations.

Test Development Learning Exchange

Building a complete university paper plagiarism detection system involves complex text processing, similarity computation, and database operations, and requires a large corpus of published papers together with appropriate copyright permissions.

Typical systems use algorithms such as cosine similarity, Jaccard similarity, and TF‑IDF weighting, and perform preprocessing steps including tokenization, stop‑word removal, and stemming.
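Of the algorithms listed above, Jaccard similarity does not appear in the TF‑IDF example that follows, so here is a minimal, self‑contained sketch: it compares the token *sets* of two documents, ignoring term frequency entirely (the function name and tokenization by whitespace are illustrative choices, not part of the article's system).

```python
from typing import Set

def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity of two documents' token sets: |A & B| / |A | B|."""
    tokens_a: Set[str] = set(doc_a.lower().split())
    tokens_b: Set[str] = set(doc_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty documents: define similarity as 0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# "the" and "cat" are shared; the union has four tokens -> 2 / 4 = 0.5
print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```

Because it discards frequency information, Jaccard is cheaper than TF‑IDF plus cosine similarity but cruder: it treats a word used once and a word used fifty times identically.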

In practice, it is recommended to use professional plagiarism services (e.g., Turnitin, Grammarly, CNKI), which provide mature and compliant solutions. If you develop your own system, you must also consider data acquisition, user‑interface design, report generation, security, and privacy, possibly leveraging cloud services and big‑data technologies.

The following example demonstrates how to use the NLTK library for preprocessing, compute TF‑IDF weights, calculate cosine similarity, and compare a query paper against a simple in‑memory dictionary representing a paper database.

import string
from collections import Counter
from math import log, sqrt
from typing import Dict, List

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
# A simple in-memory paper database (a real application should use a database system)
SIMPLE_PAPER_DATABASE: Dict[str, str] = {
    "paper1": "This is the content of the first paper.",
    "paper2": "The second paper has its own unique content.",
    # ... more papers
}
# Text preprocessing
def preprocess_text(text: str) -> List[str]:
    """
    Perform basic text preprocessing: lowercase, remove punctuation,
    tokenize, and remove stop words.

    Args:
        text (str): The input text.

    Returns:
        list[str]: A list of preprocessed tokens.
    """
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]
# Compute TF-IDF weights
def compute_tfidf(tokens: List[str], all_tokens: List[List[str]]) -> Dict[str, float]:
    """
    Compute TF-IDF weights for one document's tokens against all documents.

    Args:
        tokens (list[str]): Tokens from a single document.
        all_tokens (list[list[str]]): Token lists for all documents.

    Returns:
        dict[str, float]: Dictionary mapping each token to its TF-IDF weight.
    """
    token_counts = Counter(tokens)
    tfidf_weights = {}
    for token, count in token_counts.items():
        # Document frequency: the number of documents containing the token
        df = sum(token in doc_tokens for doc_tokens in all_tokens)
        idf = log((len(all_tokens) + 1) / (df + 1)) + 1  # smoothed IDF
        tfidf_weights[token] = (count / len(tokens)) * idf
    return tfidf_weights
# Compute cosine similarity
def cosine_similarity(vec1: Dict[str, float], vec2: Dict[str, float]) -> float:
    """
    Compute the cosine similarity between two token-weight vectors.

    Args:
        vec1 (dict[str, float]): Token-weight vector 1.
        vec2 (dict[str, float]): Token-weight vector 2.

    Returns:
        float: Cosine similarity between 0 and 1, where 1 indicates identical vectors.
    """
    shared_tokens = set(vec1.keys()) | set(vec2.keys())
    dot_product = sum(vec1.get(token, 0.0) * vec2.get(token, 0.0) for token in shared_tokens)
    norm1 = sqrt(sum(val ** 2 for val in vec1.values()))
    norm2 = sqrt(sum(val ** 2 for val in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)
def compare_papers(paper_id: str, query_paper_content: str) -> Dict[str, float]:
    """
    Compare a query paper's content against all papers in the database.

    Args:
        paper_id (str): ID of the query paper.
        query_paper_content (str): Content of the query paper.

    Returns:
        dict[str, float]: Dictionary mapping each paper ID to its similarity
        score with the query paper.
    """
    query_tokens = preprocess_text(query_paper_content)
    all_tokens = [preprocess_text(content) for content in SIMPLE_PAPER_DATABASE.values()]
    query_tfidf = compute_tfidf(query_tokens, all_tokens)
    similarities = {}
    for other_paper_id, other_paper_content in SIMPLE_PAPER_DATABASE.items():
        if other_paper_id == paper_id:  # skip comparing the paper with itself
            continue
        other_tokens = preprocess_text(other_paper_content)
        other_tfidf = compute_tfidf(other_tokens, all_tokens)
        similarities[other_paper_id] = cosine_similarity(query_tfidf, other_tfidf)
    return similarities
# Example usage
query_paper_id = "paper1"
query_paper_content = "This is the content of the first paper, with some added text."
results = compare_papers(query_paper_id, query_paper_content)
for other_paper_id, similarity in results.items():
    print(f"Similarity between '{query_paper_id}' and '{other_paper_id}': {similarity * 100:.2f}%")
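Rather than only printing raw percentages, a practical checker usually ranks the matches and flags those above a configurable cutoff. A minimal sketch of that post-processing step (the function name and the 0.8 threshold are illustrative assumptions, not values from the system above):

```python
from typing import Dict, List, Tuple

SUSPICION_THRESHOLD = 0.8  # assumed cutoff; tune it against your own corpus

def flag_suspects(similarities: Dict[str, float],
                  threshold: float = SUSPICION_THRESHOLD) -> List[Tuple[str, float]]:
    """Return (paper_id, score) pairs at or above the threshold, highest first."""
    flagged = [(pid, score) for pid, score in similarities.items() if score >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

print(flag_suspects({"paper2": 0.35, "paper3": 0.92, "paper4": 0.81}))
# [('paper3', 0.92), ('paper4', 0.81)]
```

Ranking matters because a reviewer typically inspects only the top few matches; the threshold itself should be calibrated empirically, since legitimate papers in the same subfield share substantial vocabulary.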

For real‑world deployment you would need scalable storage (e.g., MySQL, MongoDB), distributed processing frameworks (e.g., Spark) for large datasets, a user‑friendly web or desktop UI, compliance with copyright laws, more sophisticated preprocessing (stemming, lemmatization, synonym replacement), and advanced similarity methods such as sentence‑level metrics or deep‑learning models.
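As a taste of the more sophisticated preprocessing mentioned above, the sketch below shows a deliberately naive suffix-stripping stemmer, so that "paper" and "papers" map to the same token before weighting. This is a toy stand-in for illustration only; a production system would use NLTK's PorterStemmer or WordNetLemmatizer instead.

```python
def naive_stem(token: str) -> str:
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ly", "ed", "es", "s"):
        # keep at least a 3-character stem so short words survive intact
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

print([naive_stem(t) for t in ["papers", "testing", "quickly", "cat"]])
# ['paper', 'test', 'quick', 'cat']
```

Even this crude normalization can noticeably raise similarity scores between paraphrased passages, which is exactly why real systems invest in proper stemming or lemmatization.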
