
Datafaker: A Powerful Tool for Bulk Test Data Generation

Datafaker is a Python command‑line utility that creates large volumes of synthetic test data for databases, streams, files, and messaging systems, offering flexible metadata rules, multi‑backend support, and rich options for quick data provisioning.

Woodpecker Software Testing

Overview

Datafaker is a test‑data generation tool that works with Python 2.7 and Python 3.4+. It can produce bulk records for relational databases (MySQL, TiDB, Oracle, PostgreSQL, SQL Server) as well as Hive, Kafka, Elasticsearch, HBase, and plain files.

Installation

Ensure Python 3 and pip3 are installed, then run:

pip3 install datafaker

For MySQL/TiDB on CentOS 7, also install the required client library:

sudo yum install python3-devel mysql-devel
pip3 install mysqlclient

Metadata File (meta.txt)

The generator reads a metadata file in which each line describes one column using three fields separated by ||:

Field name

Data type (e.g., int, varchar(20))

Comment, which may embed a construction rule such as [:inc(id,1)] or [:enum(file://names.txt)]

Lines starting with # are ignored. If a rule is present in the comment, the parser gives it priority; otherwise it falls back to the declared type.
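As an illustration of this three-field format (not datafaker's actual parser), a few lines of Python can split a metadata line into its fields and extract the optional rule marker:

```python
import re

def parse_meta_line(line):
    """Split one meta.txt line into (name, dtype, comment, rule).

    The [:rule] marker, when present, is pulled out of the comment;
    rule is None when the comment carries no marker.
    """
    name, dtype, comment = line.strip().split("||")
    match = re.search(r"\[:(.+)\]", comment)
    rule = match.group(1) if match else None
    return name, dtype, comment, rule

print(parse_meta_line("id||int||auto-increment id[:inc(id,1)]"))
# ('id', 'int', 'auto-increment id[:inc(id,1)]', 'inc(id,1)')
```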

Creating a Sample Table

Example MySQL table definition for a student record:

create table stu (
  id int unsigned auto_increment primary key COMMENT 'auto-increment id',
  name varchar(20) not null COMMENT 'student name',
  school varchar(20) not null COMMENT 'school name',
  nickname varchar(20) not null COMMENT 'student nickname',
  age int not null COMMENT 'student age',
  class_num int not null COMMENT 'class size',
  score decimal(4,2) not null COMMENT 'score',
  phone bigint not null COMMENT 'phone number',
  email varchar(64) COMMENT 'home email',
  ip varchar(32) COMMENT 'IP address',
  address text COMMENT 'home address'
) engine=InnoDB default charset=utf8;

The corresponding meta.txt entry for the school column can use an enum loaded from names.txt:

school||varchar(20)||school name[:enum(file://names.txt)]

The file names.txt contains five school names, one per line.
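Putting the pieces together, a complete meta.txt for the stu table could look like the sketch below. Only the id and school rules come from the article's examples; the remaining columns simply rely on their declared types, and the comments are English translations.

```
# meta.txt — one line per column: name||type||comment[:optional rule]
id||int||auto-increment id[:inc(id,1)]
name||varchar(20)||student name
school||varchar(20)||school name[:enum(file://names.txt)]
nickname||varchar(20)||student nickname
age||int||student age
class_num||int||class size
score||decimal(4,2)||score
phone||bigint||phone number
email||varchar(64)||home email
ip||varchar(32)||IP address
address||text||home address
```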

Generating Data

First verify the installation and review the command‑line usage:

$ datafaker --version
0.0.8

$ datafaker --help
usage: datafaker [options] [dbtype] [connect] [table] [num]

Read metadata and output 10 rows to MySQL:

$ datafaker rdb mysql+mysqldb://root:root@localhost:3600/test stu 10 \
    --meta meta.txt --outspliter ,,
generated records : 10
printed records   : 10
time used         : 0.458 s

When the primary‑key column is auto‑increment, the inc rule's start value must be set above the largest existing key, e.g. id[:inc(id,11)] after the 10 rows generated above, to avoid duplicate‑key errors.
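The effect of the inc rule can be sketched with a plain counter; starting it at 11 (one past the 10 rows generated above) keeps new ids collision-free:

```python
import itertools

def inc(start):
    """Mimic the inc rule: an unbounded integer sequence from `start`."""
    return itertools.count(start)

ids = inc(11)  # ids 1-10 already exist in the table
print([next(ids) for _ in range(3)])  # [11, 12, 13]
```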

Other Backends

Hive: Write 1,000 records to a Hive table stu in database test.

datafaker hive://yarn@localhost:10000/test stu 1000 \
    --meta hive_meta.txt

File Output: Produce JSON lines in /home/out.txt.

datafaker file /home/out.txt 10 \
    --meta meta.txt --format json
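The JSON format writes one object per line. A quick sketch of the expected shape (the row values here are invented for illustration, not produced by datafaker):

```python
import json

# Two made-up rows, serialized one JSON object per line.
rows = [
    {"id": 11, "name": "Alice", "age": 16},
    {"id": 12, "name": "Bob", "age": 17},
]
with open("out.txt", "w", encoding="utf-8") as fh:
    for row in rows:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
```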

Kafka: Send one record per second to topic hello.

$ datafaker kafka localhost:9092 hello 1 \
    --meta meta.txt --interval 1

HBase: Create a table with rowkey and column family Cf.

datafaker hbase localhost:9090 test-table \
    --meta hbase.txt

The first line of hbase.txt must define the rowkey, optionally with a pattern like rowkey(0,1,4) to concatenate column values.
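A plausible reading of rowkey(0,1,4) — concatenating the values of columns 0, 1, and 4 — can be sketched as follows; the underscore separator is an assumption, not something the article specifies:

```python
def build_rowkey(indices, row):
    """Join the column values at `indices` into a single rowkey string.

    Separator choice ("_") is illustrative only.
    """
    return "_".join(str(row[i]) for i in indices)

row = ["1001", "Alice", "Central High", "Ali", "16"]
print(build_rowkey((0, 1, 4), row))  # 1001_Alice_16
```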

Elasticsearch: Index 100 documents into example1/tp1.

datafaker es localhost:9200 example1/tp1 100 \
    --auth elastic:elastic --meta meta.txt

Oracle: Insert 10 rows using an SQLAlchemy connection string.

datafaker rdb oracle://root:[email protected]:1521/helowin stu 10 \
    --meta meta.txt

Rule Priority and Benefits

The parser first looks for rule markers in the third column of meta.txt. If none are found, it falls back to the declared column type. This allows users to fine‑tune data generation for existing tables (by exporting DESC output) or for new tables (by annotating the DDL directly).
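That priority order can be sketched as a simple fallback chain; the rule handling and type-based defaults below are illustrative stand-ins, not datafaker's internals:

```python
import random
import re

def value_for(dtype, comment):
    """Rule marker in the comment wins; otherwise fall back on the type."""
    match = re.search(r"\[:(.+)\]", comment)
    if match:
        return f"<apply rule {match.group(1)}>"   # rule takes priority
    if dtype.startswith("int"):
        return random.randint(0, 100)             # type-based default
    if dtype.startswith("varchar"):
        return "sample-text"                      # type-based default
    return None

print(value_for("int", "student age"))  # some random int in 0..100
```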

Verification

An example consumer validates the generated data; the article includes a screenshot of the verification UI.


Summary of Capabilities

Supports MySQL, TiDB, Oracle, PostgreSQL, SQL Server, Hive, HBase, Kafka, Elasticsearch, and plain files.

Customizable metadata rules: incremental IDs, enums from files, range constraints, and type‑based defaults.

Command‑line options for output format (JSON, text), column separators, and batch size.

Fast generation: 10 records in under half a second for MySQL.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, Elasticsearch, Metadata, Kafka, MySQL, test data generation, datafaker
Written by

Woodpecker Software Testing

The Woodpecker Software Testing public account shares software testing knowledge, connects testing enthusiasts, founded by Gu Xiang, website: www.3testing.com. Author of five books, including "Mastering JMeter Through Case Studies".
