Why Good Data Beats Fancy AI: Lessons from MySQL HeatWave Golf Predictions

The article walks through five iterative experiments using MySQL HeatWave AutoML on a golf league dataset, showing how data quality, feature selection, and over‑fitting affect predictive accuracy and emphasizing that high‑quality data is essential for reliable AI models.

Aikesheng Open Source Community

I. Preparation

For the MySQL Shorts Episode 89 demo, the author built a golf‑league dataset containing player information, scores, courses, and sometimes weather data, with records dating back to 2013. This dataset was used to train a model that predicts a golfer’s score based on past performance and course conditions.

Golf league dataset

II. Data

golf_train : training table that includes the score column the model learns to predict; 11,465 rows per iteration.

golf_test : testing table mirroring golf_train columns but with score set to NULL; 4,914 rows per iteration.

golf_scores : evaluation table containing the actual scores from golf_test.

III. Steps

Call sys.ml_train() on golf_train to train the model.

Load the newly created model with sys.ml_model_load().

Run predictions on golf_test using sys.ml_predict_table(), which writes results to a new table golf_predict.

Evaluate predictions with the following query:

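-- Distribution of absolute errors between rounded predictions and actual scores.
-- "distinct" is unnecessary here because "group by" already yields one row per diff.
select abs(round(gp.prediction, 0) - gs.score) as diff,
       count(*) as count
from   golf_predict gp
join   golf_scores gs on gp.id = gs.id
group by diff
order by diff;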

The query returns each distinct absolute difference between the rounded prediction and the actual score, along with how many predictions missed by that many strokes.
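
As a rough sketch, the first three steps might look like the calls below. The schema name golf, the regression task option, and the session variable @golf_model are assumptions for illustration; the article does not show the exact arguments that were used.

```sql
-- Assumption: the tables live in a schema named `golf`.
-- Train a regression model with `score` as the target column.
CALL sys.ML_TRAIN(
  'golf.golf_train',
  'score',
  JSON_OBJECT('task', 'regression'),
  @golf_model            -- receives the generated model handle
);

-- Load the trained model into HeatWave before using it.
CALL sys.ML_MODEL_LOAD(@golf_model, NULL);

-- Score every row of golf_test and write the results to golf_predict.
-- The final NULL means "no extra options"; depending on the HeatWave
-- version this argument may be optional.
CALL sys.ML_PREDICT_TABLE(
  'golf.golf_test',
  @golf_model,
  'golf.golf_predict',
  NULL
);
```

ML_PREDICT_TABLE creates the output table itself, so golf_predict would need to be dropped before each new iteration.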

IV. Iterations

A. First iteration

Using the original golf_train (which included score, net_score, and handicap), every one of the 4,914 predictions fell within the 5‑stroke error margin (100%), and 4,910 of them matched the actual score exactly. However, the author’s son pointed out that this was severe over‑fitting: because a net score is the gross score minus the handicap, the model had essentially learned to add net_score and handicap to reproduce score.

+------+-------+
| diff | count |
+------+-------+
|    0 |  4910 |
|    1 |     3 |
|    2 |     1 |
+------+-------+
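
The "within 5 strokes" figure quoted for each iteration can be derived from the same join as the evaluation query. The following is a minimal sketch; the column aliases are illustrative and not taken from the article.

```sql
-- In MySQL a comparison evaluates to 1 (true) or 0 (false),
-- so SUM() over it counts the rows that satisfy the condition.
SELECT SUM(ABS(ROUND(gp.prediction, 0) - gs.score) <= 5) AS within_5,
       COUNT(*)                                           AS total,
       ROUND(100 * SUM(ABS(ROUND(gp.prediction, 0) - gs.score) <= 5)
             / COUNT(*), 1)                               AS pct_within_5
FROM   golf_predict gp
JOIN   golf_scores  gs ON gp.id = gs.id;
```

Running this after each iteration gives a single comparable number instead of requiring the diff rows to be added up by hand.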

B. Second iteration

Removed net_score and handicap from golf_train. The remaining columns are listed in the schema below. The model’s accuracy dropped to 82% (4,018 of 4,914 predictions within 5 strokes). The author then examined the model explanation table and noticed heavy reliance on the team_name column, which is problematic because teams consist of two players and the column can leak information about a teammate’s performance.

+-----------------+---------------+
| Field           | Type          |
+-----------------+---------------+
| id              | int           |
| golfer_name     | varchar(100)  |
| match_date      | date          |
| scheduled_date  | varchar(10)   |
| score           | int           |
| course_name     | varchar(75)   |
| hole_group_name | varchar(75)   |
| slope           | decimal(10,2) |
| rating          | decimal(10,2) |
| team_name       | varchar(50)   |
| week_name       | varchar(75)   |
| division_name   | varchar(50)   |
| season_name     | varchar(50)   |
| league_name     | varchar(50)   |
+-----------------+---------------+

+------+-------+
| diff | count |
+------+-------+
|    0 |   446 |
|    1 |   931 |
|    2 |   817 |
|    3 |   769 |
|    4 |   598 |
|    5 |   457 |
|    6 |   324 |
|    7 |   214 |
|    8 |   126 |
|    9 |    92 |
|   10 |    62 |
|   11 |    30 |
|   12 |    21 |
|   13 |    11 |
|   14 |     3 |
|   15 |     2 |
|   16 |     5 |
|   17 |     3 |
|   18 |     1 |
|   19 |     1 |
|   20 |     1 |
+------+-------+
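
The article does not say how the columns were removed. One option that avoids altering the table is the exclude_column_list training option, sketched below under the assumption that the schema is named golf and the task is regression; the variable name @golf_model_v2 is hypothetical.

```sql
-- Assumption: net_score and handicap still exist in golf.golf_train
-- and are simply hidden from training instead of being dropped.
CALL sys.ML_TRAIN(
  'golf.golf_train',
  'score',
  JSON_OBJECT(
    'task', 'regression',
    'exclude_column_list', JSON_ARRAY('net_score', 'handicap')
  ),
  @golf_model_v2
);
```

Adding 'team_name' to the JSON_ARRAY would extend the same approach to a run that also leaves out the team column.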

C. Third iteration

Removed the team_name column from golf_train. The schema now lacks any team identifier. Accuracy fell further to 63% (3,118 of 4,914 predictions within 5 strokes).

+------+-------+
| diff | count |
+------+-------+
|    0 |   325 |
|    1 |   612 |
|    2 |   581 |
|    3 |   577 |
|    4 |   550 |
|    5 |   473 |
|    6 |   454 |
|    7 |   370 |
|    8 |   312 |
|    9 |   233 |
|   10 |   152 |
|   11 |   118 |
|   12 |    69 |
|   13 |    41 |
|   14 |    23 |
|   15 |    13 |
|   16 |     6 |
|   17 |     2 |
|   18 |     3 |
+------+-------+

D. Fourth iteration

Added weather‑related columns (temperature, conditions, wind_speed, humidity) to golf_train. The model’s accuracy remained at 63% (3,117 of 4,914 predictions within 5 strokes, versus 3,118 in the third iteration) despite the extra features, indicating that weather did not improve predictive power for this dataset.

+------+-------+
| diff | count |
+------+-------+
|    0 |   325 |
|    1 |   610 |
|    2 |   589 |
|    3 |   573 |
|    4 |   556 |
|    5 |   464 |
|    6 |   458 |
|    7 |   369 |
|    8 |   313 |
|    9 |   230 |
|   10 |   154 |
|   11 |   116 |
|   12 |    69 |
|   13 |    41 |
|   14 |    23 |
|   15 |    13 |
|   16 |     6 |
|   17 |     2 |
|   18 |     3 |
+------+-------+
Fourth iteration accuracy

E. Fifth iteration

Re‑introduced the handicap column (a true measure of player ability) while keeping the weather features. This configuration boosted accuracy to 84% (4,118 of 4,914 predictions within 5 strokes), the best result of any iteration after the over‑fitted first one.

The model predicted 531 of 4,914 scores exactly and produced fewer large errors than in the fourth iteration.
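
To compare exact hits and large misses between iterations at a glance, the error distribution could be collapsed into a few buckets. This is only a sketch: the bucket boundaries, including treating a miss of 11 or more strokes as a large error, are assumptions rather than thresholds used in the article.

```sql
SELECT CASE
         WHEN diff = 0   THEN '0 (exact)'
         WHEN diff <= 5  THEN '1-5'
         WHEN diff <= 10 THEN '6-10'
         ELSE '11+'
       END      AS bucket,
       COUNT(*) AS predictions
FROM (
  -- Absolute error per prediction, as in the evaluation query.
  SELECT ABS(ROUND(gp.prediction, 0) - gs.score) AS diff
  FROM   golf_predict gp
  JOIN   golf_scores  gs ON gp.id = gs.id
) d
GROUP BY bucket
ORDER BY MIN(diff);   -- keep buckets in ascending error order
```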

V. Conclusion

The experiments demonstrate that simply adding more columns does not guarantee better predictions; thoughtful feature selection and high‑quality data are crucial. Over‑fitting can produce deceptively perfect results that do not generalize. The author, a developer learning AI, concludes that data scientists should curate datasets carefully, remembering the adage “garbage in, garbage out.”
