
Data Science Q&A: Overfitting, Experimental Design, Tall/Wide Data, Chart Junk, Outliers, Extreme Value Theory, Recommendation Engines, and Visualization

This article presents a series of data‑science questions and expert answers covering overfitting, experimental design for user behavior, the distinction between tall and wide data, detecting chart junk, outlier detection methods, extreme‑value theory for rare events, recommendation‑engine fundamentals, and techniques for visualizing high‑dimensional data.


On KDnuggets, the most‑read article in January titled “20 Questions to Detect Fake Data Scientists” sparked a follow‑up where editors answered those questions and added a previously omitted 21st question, providing a concise shortcut for aspiring data scientists.

The article highlights key terminology (rendered in blue in the original) that readers should become familiar with, even if the underlying concepts are initially unclear.

Original content was published on KDnuggets and translated by Baixue (a veteran IT professional) and Longxing Biaoju (an internet specialist). Important reference material and tools mentioned are available via a shared drive.

Part Two addresses overfitting, experimental design, tall vs. wide data, and the reliability of media‑reported statistics, with contributions from KDnuggets editor Gregory Piatetsky.

Overfitting: Defined as results that are accidental, non‑reproducible, and often arise from excessive hypothesis testing without proper statistical control. Prevention methods include choosing the simplest hypothesis, regularization, randomization testing, nested cross‑validation, controlling the false discovery rate, and using the reusable holdout method.
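The core symptom can be shown in a few lines: a model flexible enough to memorize noise looks excellent on training data but fails on held-out data. This is a minimal sketch with simulated data (all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.2, size=x.size)   # linear truth + noise

# split into interleaved train / holdout points
x_train, y_train = x[::2], y[::2]
x_hold, y_hold = x[1::2], y[1::2]

def holdout_mse(deg):
    """Fit a degree-`deg` polynomial on train, score on holdout."""
    coeffs = np.polyfit(x_train, y_train, deg)
    pred = np.polyval(coeffs, x_hold)
    return float(np.mean((pred - y_hold) ** 2))

# the degree-9 fit passes through every training point (memorizing the
# noise) but generalizes far worse than the simple linear model
print(holdout_mse(1), holdout_mse(9))
```

Choosing the simplest hypothesis that fits, as the answer above recommends, corresponds here to preferring the low-degree model despite its higher training error.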

Experimental Design for User Behavior (Day 12): Steps include defining the research question (e.g., impact of page load time on user satisfaction), identifying variables, constructing hypotheses, selecting a factorial design, choosing between within‑participants or between‑participants setups, specifying tasks and procedures, defining measurements (latency, frequency, duration, intensity), and analyzing results.

Figure 12: A flaw in the experimental design.
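For the between-participants setup described above, the final analysis step might look like this sketch. The satisfaction scores, group sizes, and effect are all hypothetical; the comparison uses SciPy's Welch t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# hypothetical satisfaction scores (1-7 scale) for two between-participants
# groups exposed to different page load times
fast = rng.normal(5.8, 0.8, 30)   # page loads in ~1 s
slow = rng.normal(4.9, 0.8, 30)   # page loads in ~5 s

# Welch's t-test (does not assume equal variances)
t, p = stats.ttest_ind(fast, slow, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```

A real study would also pre-register the hypothesis and check assumptions before drawing conclusions; this only illustrates where the measurement and analysis steps land.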

Tall/Wide Data (Day 13): “Tall” data have many rows and few columns, while “wide” data have few rows but many columns (e.g., genomics). Feature‑reduction techniques such as Lasso are recommended to avoid false positives.

Figure 13: Different methods for tall and wide data.
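Lasso's use on wide data can be sketched as follows, assuming scikit-learn is available. With far more columns than rows, the L1 penalty zeroes out most coefficients, keeping false positives in check (the data here are simulated, with only 5 truly informative features):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 50, 500                      # "wide": far more columns than rows
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:5] = 3.0                 # only the first 5 features matter
y = X @ true_coef + rng.normal(0, 0.5, n)

# the L1 penalty drives most coefficients exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"{len(selected)} of {p} features kept")
```

The fitted model retains a small subset of features rather than all 500, which is exactly the false-positive control the answer above recommends.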

Assessing Media Statistics (Day 14): Assess the source's credibility and its audience's likely bias, and look for methodological details such as sample size and margin of error. Zack Lipton's rule of thumb suggests that statistics published in newspapers are often unreliable.

Figure 14a: Misleading bar chart from Fox News.

Figure 14b: An objective presentation of the same data.

Chart Junk (Day 15): Unnecessary visual elements that distract from the data’s message, a concept introduced by Edward Tufte in 1983.

Figure 15: Example of chart junk.

Outlier Detection (Day 16): Methods include Z‑score, modified Z‑score, box plots, Grubbs’ test, Tietjen‑Moore test, Kimber test, and moving‑window filters. Two robust approaches are the Inter‑Quartile Range (IQR) method and Tukey’s fences, both using Q1, Q3, and 1.5 × IQR thresholds.

When outliers are found, qualitative assessment is essential before removal; they may indicate new trends or valuable insights.
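The IQR/Tukey's-fences approach mentioned above is simple enough to sketch directly. Points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are flagged, but per the caution above they are only flagged, not silently deleted:

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return points outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]   # 95 is an obvious outlier
print(tukey_outliers(data))            # -> [95]
```

The k = 1.5 multiplier is Tukey's conventional choice; using k = 3 gives the stricter "far out" fences for extreme outliers.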

Extreme Value Theory & Monte Carlo (Day 17): EVT models rare events using Gumbel, Fréchet, or Weibull distributions (or the combined Generalized Extreme Value distribution). After selecting an appropriate model, probabilities of extreme events can be estimated.
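A sketch of the workflow using SciPy's `genextreme` (the Generalized Extreme Value distribution named above): collect block maxima, fit the GEV, then read off tail probabilities. The "daily" data here are simulated Gumbel draws standing in for real measurements:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(3)

# simulate 100 "years" of daily values, then take each year's maximum
daily = rng.gumbel(loc=20, scale=5, size=(100, 365))
annual_max = daily.max(axis=1)

# fit the Generalized Extreme Value distribution to the block maxima
shape, loc, scale = genextreme.fit(annual_max)

# estimated probability that a yearly maximum exceeds an extreme level
p_exceed = genextreme.sf(70, shape, loc, scale)
print(p_exceed)
```

The reciprocal of such an exceedance probability gives the familiar "return period" (e.g. a 100-year flood level).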

Recommendation Engines (Day 18): Systems like Netflix or Amazon use collaborative filtering (based on user behavior) and content‑based filtering (based on item attributes). Examples include Last.fm’s collaborative approach and Pandora’s content‑based method.

False Positives & Negatives (Day 19): In binary classification, a false positive signals a condition that isn’t present, while a false negative misses a present condition; their costs can differ dramatically, especially in medical testing.
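The asymmetric-cost point can be made concrete with a small sketch. The labels and the 10:1 cost ratio are invented for illustration:

```python
# hypothetical medical screening: 1 = disease present, 0 = absent
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

# false positive: flagged healthy patient; false negative: missed disease
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# in screening, a missed disease is typically far costlier than an
# unnecessary follow-up test; here we weight it 10x (an assumption)
cost = 10 * fn + 1 * fp
print(fp, fn, cost)
```

Because the two error types carry different costs, classifiers for such problems are usually tuned by moving the decision threshold rather than by maximizing raw accuracy.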

Visualization Tools & 5‑Dimensional Data (Day 20): Common tools include R, Python, Tableau, and Excel. Multi‑dimensional data can be shown using 3‑D scatter plots with size, color, shape, and animation for time, or parallel‑coordinates plots for higher dimensions.

Figure 20a: 5‑D Iris dataset scatter plot (size = sepal length, color = sepal width, shape = class, x‑axis = petal length, y‑axis = petal width).

Figure 20b: Parallel coordinates view of the Iris dataset.
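The 5-D scatter mapping described in Figure 20a can be sketched with matplotlib; random data stands in for the Iris measurements here, and the off-screen Agg backend is used so the script runs headless:

```python
import matplotlib
matplotlib.use("Agg")              # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
n = 30
petal_len = rng.uniform(1, 7, n)   # dim 1 -> x-axis
petal_wid = rng.uniform(0, 3, n)   # dim 2 -> y-axis
sepal_len = rng.uniform(4, 8, n)   # dim 3 -> marker size
sepal_wid = rng.uniform(2, 5, n)   # dim 4 -> marker color
species = rng.integers(0, 3, n)    # dim 5 -> marker shape

fig, ax = plt.subplots()
for cls, marker in zip(range(3), ["o", "s", "^"]):
    m = species == cls
    ax.scatter(petal_len[m], petal_wid[m],
               s=sepal_len[m] * 20, c=sepal_wid[m],
               marker=marker, cmap="viridis")
ax.set_xlabel("petal length")
ax.set_ylabel("petal width")
print(len(ax.collections))         # one scatter collection per species
```

Beyond five dimensions this encoding saturates, which is where the parallel-coordinates plot of Figure 20b takes over.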

Tags: recommendation systems, overfitting, visualization, experimental design, outliers, chart junk, extreme value theory, tall and wide data
Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.
