
My Journey into Big Data: From Early Mistakes to the Lambda Architecture

The article recounts the author’s early encounters with big‑data challenges, the shift from relational to NoSQL systems, the development of an immutable‑data batch architecture, and the eventual formulation of the Lambda Architecture, illustrating how simplicity and fault‑tolerance can replace complex incremental designs.

Architecture Digest

When I first entered the world of big data, I felt like I was on the software‑development frontier of the American West. Many people were abandoning relational databases and their familiar comforts for NoSQL databases with highly restricted data models designed to scale to thousands of machines. The number of NoSQL databases was overwhelming, and many differed only slightly. A new project called Hadoop began to emerge, promising deep analysis on massive data sets, but figuring out how to use these new tools was confusing.

At that time I was trying to solve the scalability problems my company faced. The system architecture was dauntingly complex: a web of sharded relational databases, queues, worker nodes, master nodes, and slave nodes. Data corruption had seeped into the databases, and the application carried special code to handle it. The slave nodes always lagged behind. I decided to explore other big‑data technologies to see if there was a better design for our data architecture.

An early experience in my software‑engineering career profoundly shaped my view of how systems should be architected. A colleague spent weeks collecting internet data into a shared file system, waiting for enough data to analyze. One day, during routine maintenance, I accidentally deleted all his data, delaying his project by weeks.

I knew I had made a big mistake, but as a junior software engineer I did not know what the consequences would be. I wondered if I would be fired for my carelessness. I sent an apology email to the team, and to my surprise everyone responded with sympathy. I’ll never forget the moment a coworker patted me on the back and said, “Congratulations! You’re now a professional software engineer!”

His joke expressed an unspoken truth of software development: we don’t know how to create perfect software. Bugs can and do get deployed to production. If an application can write to a database, a bug can write to that database too. When I set out to redesign our data architecture, this experience had taught me that the new architecture needed to be scalable, tolerant of machine failures, easy to reason about, and also tolerant of human error.

Refactoring that system led me onto a path of questioning everything I believed about databases and data management. I devised an architecture based on immutable data and batch processing, and was surprised that, compared with a system based only on incremental computation, the new system was much simpler. Everything became easier – operations, evolving the system to support new features, recovering from human error, and performance optimization. The approach is generic and seems applicable to any data system.
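To make the idea concrete, here is a minimal sketch of immutable data plus batch recomputation. It is an illustration under stated assumptions, not the system described above: the `PageView` record, the `append` and `batch_view` helpers, and the `buggy_normalize` function are all hypothetical names invented for this example. The point it tries to show is that when raw facts are only ever appended and views are recomputed from scratch, a deployed bug can produce a wrong view but cannot destroy the underlying data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """One immutable fact (hypothetical schema): a user viewed a URL at a time."""
    user: str
    url: str
    timestamp: int


def append(log, event):
    """Record a new fact by returning a longer log; existing facts are never edited."""
    return log + (event,)


def batch_view(log, normalize=lambda url: url):
    """Recompute the whole view (page views per URL) from the full log."""
    return Counter(normalize(event.url) for event in log)


def buggy_normalize(url):
    """Hypothetical bug: drops the path, so every page collapses to its host."""
    return url.split("/")[0]


# Build the master data set by appending facts.
log = ()
for user, url, ts in [
    ("alice", "example.com/a", 1),
    ("bob", "example.com/a", 2),
    ("alice", "example.com/b", 3),
]:
    log = append(log, PageView(user, url, ts))

# A buggy deployment produces a wrong view, but the raw facts are untouched.
bad_view = batch_view(log, buggy_normalize)

# Recovery is just: fix the code and recompute from the same log.
good_view = batch_view(log)

print(dict(bad_view))
print(dict(good_view))
```

In an incrementally updated design, the buggy code would have overwritten the stored counts in place, and the original per‑page information would be gone. Here the recovery step is a recomputation rather than a repair, which is one way to read the claim that recovering from human error became easier.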

Yet something bothered me. When I looked at the rest of the industry, I found almost no one using similar techniques. Instead, architectures built on massive clusters of incrementally updated databases were widely accepted despite their daunting complexity. The approach I had developed avoided many of those complexities entirely or greatly reduced them.

In the following years I expanded the method and formally named it the Lambda Architecture. While working at the startup BackType, our five‑person team built a social‑media analytics product that supported diverse real‑time analysis on more than 100 TB of data. Our small team also handled cluster management, deployment, operations, and monitoring of hundreds of machines. When we showed the product to others, they were amazed that only five people built it, often asking, “How could so few people do so much?” My answer was simple: “It’s not what we did, but what we didn’t do.” By using the Lambda Architecture we avoided the complexities that plague traditional architectures, dramatically improving our efficiency.

The big‑data movement merely amplified data‑architecture complexities that have existed for decades. Large‑scale databases built on incremental updates suffer from this complexity, leading to errors, heavy operations, and reduced productivity. Although SQL and NoSQL databases are often portrayed as opposites, at a fundamental level they are the same. They both encourage the same architecture – one that inevitably carries complexity. Complexity is a vicious beast; whether you admit it or not, it will bite you.

To spread knowledge about the Lambda Architecture and how it avoids the complexities of traditional designs, I wrote this book. It is the book I wish I had had when I first started working with big data. I hope you treat it as a journey that challenges what you think you know about data systems, and that you discover along the way that working with big data can be elegant, simple, and fun.

Recommended Reading: Big Data Systems: Principles and Best Practices for Building Scalable Real‑Time Data Systems (ISBN: 978-7-111-55294-9, Authors: Nathan Marz, James Warren, Published: December 2016). The book, written by the “father of Storm” and former chief engineer at Twitter, provides an authoritative, comprehensive analysis of how enterprise teams can better leverage big‑data systems from a system‑building perspective.

Tags: Data Engineering, big data, scalability, system design, lambda architecture, Immutable Data