Datafaker: A Powerful Tool for Bulk Test Data Generation
Datafaker is a Python‑compatible utility that creates large volumes of synthetic test data for databases, streams, files, and messaging systems, offering flexible metadata rules, multi‑backend support, and command‑line options for quick data provisioning.
Overview
Datafaker is a test‑data generation tool that works with Python 2.7 and Python 3.4+. It can produce bulk records for relational databases, Hive, Kafka, Elasticsearch, Oracle, HBase, and plain files.
Installation
Ensure Python 3 and pip3 are installed, then run: pip3 install datafaker For MySQL/TiDB on CentOS 7, install the required client library:
sudo yum install python3-devel mysql-devel
pip3 install mysqlclientMetadata File (meta.txt)
The generator reads a metadata file where each line describes a column using three fields separated by ||:
Field name
Data type (e.g., int, varchar(20))
Comment that may contain construction rules such as [:inc(id,1)] or enum(file://names.txt) Lines starting with # are ignored. If a rule is present in the comment, the parser gives it priority; otherwise it falls back to the declared type.
Creating a Sample Table
Example MySQL table definition for a student record:
create table stu (
id int unsigned auto_increment primary key COMMENT '自增id',
name varchar(20) not null COMMENT '学生名字',
school varchar(20) not null COMMENT '学校名字',
nickname varchar(20) not null COMMENT '学生小名',
age int not null COMMENT '学生年龄',
class_num int not null COMMENT '班级人数',
score decimal(4,2) not null COMMENT '成绩',
phone bigint not null COMMENT '电话号码',
email varchar(64) COMMENT '家庭网络邮箱',
ip varchar(32) COMMENT 'IP地址',
address text COMMENT '家庭地址'
) engine=InnoDB default charset=utf8;The corresponding meta.txt entry for the school column can use an enum loaded from names.txt:
school||varchar(20)||学校名字[:enum(file://names.txt)] names.txtcontains five school names, one per line.
Generating Data
Generate 10 records and print them to the console:
$ datafaker --version
0.0.8
$ datafaker --help
usage: datafaker [options] [dbtype] [connect] [table] [num]Read metadata and output 10 rows to MySQL:
$ datafaker rdb mysql+mysqldb://root:root@localhost:3600/test stu 10 \
--meta meta.txt --outspliter ,,
generated records : 10
printed records : 10
time used : 0.458 sWhen the primary‑key column is auto‑increment, the start value must be set higher than existing keys, e.g. id[:inc(id,11)], to avoid duplicate‑key errors.
Other Backends
Hive : Write 1 000 records to a Hive table stu in database test.
datafaker hive://yarn@localhost:10000/test stu 1000 \
--meta hive_meta.txtFile Output : Produce JSON lines in /home/out.txt.
datafaker file /home/out.txt 10 \
--meta meta.txt --format jsonKafka : Send one record per second to topic hello.
$ datafaker kafka localhost:9092 hello 1 \
--meta meta.txt --interval 1HBase : Create a table with rowkey and column family Cf.
datafaker hbase localhost:9090 test-table \
--meta hbase.txtThe first line of hbase.txt must define the rowkey, optionally with a pattern like rowkey(0,1,4) to concatenate column values.
Elasticsearch : Index 100 documents into example1/tp1.
datafaker es localhost:9200 example1/tp1 100 \
--auth elastic:elastic --meta meta.txtOracle : Insert 10 rows using an SQLAlchemy connection string.
datafaker rdb oracle://root:[email protected]:1521/helowin stu 10 \
--meta meta.txtRule Priority and Benefits
The parser first looks for rule markers in the third column of meta.txt. If none are found, it falls back to the declared column type. This allows users to fine‑tune data generation for existing tables (by exporting DESC output) or for new tables (by annotating the DDL directly).
Verification
An example consumer validates the generated data; the article includes a screenshot of the verification UI.
Summary of Capabilities
Supports MySQL, TiDB, Oracle, PostgreSQL, SQL Server, Hive, HBase, Kafka, Elasticsearch, and plain files.
Customizable metadata rules: incremental IDs, enums from files, range constraints, and type‑based defaults.
Command‑line options for output format (JSON, text), column separators, and batch size.
Fast generation: 10 records in under half a second for MySQL.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Woodpecker Software Testing
The Woodpecker Software Testing public account shares software testing knowledge, connects testing enthusiasts, founded by Gu Xiang, website: www.3testing.com. Author of five books, including "Mastering JMeter Through Case Studies".
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
