What Is a Big Data Development Engineer? Roles, Skills, and Differences from Traditional Development

The article explains what a big data development engineer does, the tools and skills required such as Hadoop, Hive, Spark and Kafka, how they process massive logs to compute metrics like PV and UV, and compares this role with conventional business system development.

Java Captain

Oct 1, 2018

Many students are unclear about the role of a big data development engineer, so this article briefly introduces what the position entails, what data development looks like in internet companies today, and how it differs from typical Java or PHP engineering work.

What Is Not Big Data Development

Simply using a database, whether relational (MySQL, SQL Server, Oracle) or NoSQL (MongoDB, Redis), is not big data development, even when the tables hold tens or hundreds of millions of rows.

Querying business‑system databases to produce reports is not big data development.

Collecting front‑end (web, H5, native mobile) click‑through data and storing it in a database is not big data development.

What Is Big Data Development

1. Required Skills

Job listings for big data development engineers commonly mention tools such as Hadoop, Hive, HBase, Spark, and Kafka.

2. What Big Data Developers Do

In a word: statistics.

Two main metrics: Page Views (PV) and Unique Visitors (UV).

They compute various PV and UV indicators for web pages, buttons, heat‑maps, and, in the mobile era, for content exposure and clicks within information streams.

These metrics matter because they measure user engagement, and for ad‑supported products more visits and longer sessions translate into advertising revenue.

3. How They Do It

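Because the volume of logs (often billions of records per day) is too large for traditional relational databases, big data engineers process the raw log files directly with distributed tools such as Hadoop, Hive, and Spark rather than loading them into a database first.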

Typical sources include server logs (Apache, Tomcat, WebLogic, Nginx). For example, an Apache access log line:

218.69.234.153 - - [23/Sep/2018:21:08:00 +0800] "GET /2018/09/python-scrapy-%e7%99%bb%e5%bd%95%e7%9f%a5%e4%b9%8e%e8%bf%87%e7%a8%8b/ HTTP/1.1" 200 12466

After parsing, it becomes four columns (IP, timestamp, HTTP status, URL) for easier aggregation:

218.69.234.153 2018-09-23 21:08:00 200 /2018/09/python-scrapy-%e7%99%bb%e5%bd%95%e7%9f%a5%e4%b9%8e%e8%bf%87%e7%a8%8b/
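The parsing step can be sketched in a few lines of Python. This is a minimal illustration, assuming the log follows the Apache Common Log Format shown above; the names `LOG_PATTERN` and `parse_line` are hypothetical, and a production job would run equivalent logic inside a distributed engine rather than a single script.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return (ip, timestamp, status, url), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # %z consumes the "+0800" offset; the local wall-clock time is kept as written.
    # %b (month abbreviation) depends on the locale; an English/C locale is assumed.
    ts = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return (
        match.group("ip"),
        ts.strftime("%Y-%m-%d %H:%M:%S"),
        int(match.group("status")),
        match.group("url"),
    )

sample = (
    '218.69.234.153 - - [23/Sep/2018:21:08:00 +0800] '
    '"GET /2018/09/python-scrapy-%e7%99%bb%e5%bd%95%e7%9f%a5%e4%b9%8e%e8%bf%87%e7%a8%8b/ HTTP/1.1" '
    '200 12466'
)
print(parse_line(sample))
```

Returning `None` for lines that do not match lets the caller count and skip malformed records instead of failing the whole job.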

Counting the rows gives PV; counting distinct user identifiers gives UV. The sample log carries no user ID, so the IP address is the only available stand‑in, and a rough one.
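As a small illustration of these two aggregations, the sketch below computes PV and UV per URL over parsed rows. It assumes the IP address stands in for the user identifier, and the function name `pv_uv_by_url` and the sample rows (apart from the first IP) are made up for the example.

```python
from collections import defaultdict

def pv_uv_by_url(rows):
    """rows: iterable of (ip, timestamp, status, url) tuples.

    Returns {url: (pv, uv)} where pv is the row count and uv is the
    number of distinct IPs seen for that URL.
    """
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for ip, _timestamp, _status, url in rows:
        pv[url] += 1
        visitors[url].add(ip)  # assumption: IP approximates a user identifier
    return {url: (pv[url], len(visitors[url])) for url in pv}

# Hypothetical parsed rows in the four-column layout shown above.
rows = [
    ("218.69.234.153", "2018-09-23 21:08:00", 200, "/a"),
    ("218.69.234.153", "2018-09-23 21:09:10", 200, "/a"),
    ("10.0.0.7", "2018-09-23 21:09:30", 200, "/a"),
    ("10.0.0.7", "2018-09-23 21:10:05", 200, "/b"),
]
result = pv_uv_by_url(rows)
print(result)
```

Holding every distinct identifier in an in-memory set works for a sample like this but not for billions of records per day, which is one reason the real computation is usually distributed.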

Key challenges include:

Massive log volume (several terabytes per day for large companies).

Timeliness – offline jobs must finish within a required window, while real‑time jobs need sub‑minute latency.

Accuracy – statistical results must be precise.

Real‑time processing – metrics such as the number of online users every five minutes require streaming technologies (e.g., Spark Streaming, Flink).

Monitoring – ensuring tasks complete, data is produced, and results are sane.

Disaster recovery – re‑processing missing intervals when failures occur.
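Two of the challenges above, the five‑minute online‑user metric and the re‑processing of missing intervals, can be sketched together. This is a plain‑Python illustration over an in‑memory list, not Spark Streaming or Flink code; the function names, user IDs, and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def window_start(ts):
    """Floor a datetime to the start of its five-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def users_per_window(events):
    """events: iterable of (user_id, datetime).

    Returns {window_start: number of distinct users in that window}.
    """
    seen = {}
    for user_id, ts in events:
        seen.setdefault(window_start(ts), set()).add(user_id)
    return {start: len(users) for start, users in seen.items()}

def missing_windows(counts, first, last):
    """List the window starts in [first, last] that have no result."""
    missing = []
    current = first
    while current <= last:
        if current not in counts:
            missing.append(current)
        current += WINDOW
    return missing

# Hypothetical events: nothing arrives between 21:05 and 21:10.
events = [
    ("u1", datetime(2018, 9, 23, 21, 1, 12)),
    ("u2", datetime(2018, 9, 23, 21, 3, 40)),
    ("u1", datetime(2018, 9, 23, 21, 4, 59)),
    ("u3", datetime(2018, 9, 23, 21, 11, 5)),
]
counts = users_per_window(events)
gaps = missing_windows(
    counts, datetime(2018, 9, 23, 21, 0), datetime(2018, 9, 23, 21, 10)
)
print(counts)
print(gaps)
```

An empty window may mean a failed job or simply no traffic, so a monitoring check would normally compare it against expected volume before triggering a re‑run.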

Big Data Development vs. Traditional Business Development

Traditional business systems (e.g., HR, payroll, billing) focus on CRUD operations against databases and require deep domain knowledge and high‑availability services.

Big data development, by contrast, focuses on processing massive volumes of text logs to compute timely, accurate statistics, with the emphasis on data freshness, correctness, and fault tolerance.

Disclaimer: The views expressed are personal and may not represent the full scope of the big data development profession.
