Popular Big Data Tools and Their Descriptions
This article provides an extensive overview of more than ninety open‑source and commercial big‑data tools—including ETL platforms, resource managers, storage systems, messaging queues, processing engines, and visualization libraries—detailing their core functions, typical use cases, and notable adopters.
Talend Open Studio is the first open‑source ETL tool for data integration, used by enterprises such as AIG, Comcast, and GE.
DYSON (by Tianma Technology) is a smart web‑data collection system that crawls and extracts dispersed information from web pages.
YARN is Hadoop’s next‑generation resource manager; it separates cluster‑wide resource management from per‑application job scheduling and monitoring, removing the scalability bottleneck of the original MapReduce JobTracker.
Mesos is an open‑source cluster manager from UC Berkeley’s AMPLab that abstracts CPU, memory, and storage across physical or virtual machines.
Datale is a Hadoop‑based big‑data development suite from Tianma Technology.
Ambari provides a web UI for configuring, managing, and monitoring Hadoop clusters, supporting components such as HDFS, Hive, HBase, and ZooKeeper.
ZooKeeper is a distributed coordination service essential for Hadoop and HBase consistency.
Thrift is an Apache project enabling cross‑language RPC for high‑volume data transfer between services.
Chukwa is an open‑source data‑collection system built on HDFS/MapReduce for monitoring large distributed systems.
Lustre is a high‑performance, fault‑tolerant parallel file system capable of scaling to thousands of nodes and petabytes of data.
HDFS (Hadoop Distributed File System) offers high‑throughput, fault‑tolerant storage for massive data sets on commodity hardware.
GlusterFS aggregates storage across servers into a single networked parallel file system.
Alluxio (formerly Tachyon) is a memory‑centric distributed file system that provides fast, fault‑tolerant data sharing for Spark and MapReduce.
Presto is an open‑source distributed SQL query engine for interactive analytics on petabyte‑scale data.
Drill enables SQL queries over Hadoop, NoSQL, and cloud storage without requiring data movement.
Hive and Spark SQL (the successor to Shark) provide SQL‑like querying on Hadoop data, with Spark SQL running queries in memory on Spark.
Kafka is a high‑throughput distributed publish‑subscribe messaging system widely used for real‑time data pipelines.
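The publish‑subscribe model can be illustrated with a small in‑memory sketch. This is not Kafka's client API: `ToyBroker`, `publish`, and `poll` are hypothetical names, and the sketch omits partitions, replication, and persistence. It only shows the core idea that a topic is an append‑only log and that each consumer group tracks its own read offset, so several groups can read the same messages independently.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this consumer group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
for event in ("click", "view", "purchase"):
    broker.publish("events", event)

# Two independent consumer groups read the same topic at their own pace.
analytics_batch = broker.poll("analytics", "events")
billing_batch = broker.poll("billing", "events", max_messages=2)
```

Because offsets are kept per consumer group rather than per message, publishing never has to wait for slow readers, which is one reason this log‑based design suits high‑throughput pipelines.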
Spark and Spark Streaming deliver fast, in‑memory data processing and micro‑batch streaming capabilities.
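Micro‑batching treats a stream as a sequence of small batch jobs. The sketch below uses only the Python standard library rather than Spark itself, and the helper names `micro_batches` and `running_word_counts` are hypothetical. One simplifying assumption: batches here are cut by record count, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size records from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(lines, batch_size):
    """Process each micro-batch as a small batch job and keep running totals."""
    totals = Counter()
    snapshots = []
    for batch in micro_batches(lines, batch_size):
        for line in batch:
            totals.update(line.split())
        # Record the state after each batch, as a streaming job would emit it.
        snapshots.append(dict(totals))
    return snapshots


lines = ["spark streaming", "spark batch", "streaming"]
snapshots = running_word_counts(lines, batch_size=2)
```

Results become visible only at batch boundaries, so end‑to‑end latency is bounded below by the batch size (or, in a time‑based system, the batch interval).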
Storm is a low‑latency, real‑time stream processing framework, and Trident is a higher‑level micro‑batching API built on top of it.
Flink offers efficient, distributed stream and batch processing with support for iterative algorithms.
Samza is a LinkedIn‑originated stream processing framework built on Kafka and YARN.
Elasticsearch and Solr are distributed search engines based on Lucene, providing full‑text search and analytics.
Cassandra, HBase, MongoDB, and Redis are NoSQL databases covering the wide‑column (Cassandra, HBase), document (MongoDB), and key‑value (Redis) models.
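The three data models differ mainly in how a record is shaped and addressed. The sketch below mimics each one with plain Python dictionaries. It is a simplified illustration, not any database's client API; the sample keys, field names, and the `get_cell` helper are hypothetical.

```python
# Key-value (Redis-style): an opaque value looked up by a single key.
key_value = {"session:42": "alice"}

# Document (MongoDB-style): a self-describing record that may nest
# lists and sub-documents.
document = {
    "_id": 42,
    "name": "alice",
    "tags": ["admin", "ops"],
    "address": {"city": "Berlin"},
}

# Wide-column (Cassandra/HBase-style, simplified): row key -> column family
# -> columns. Different rows may carry different columns.
wide_column = {
    "user#42": {
        "profile": {"name": "alice"},
        "activity": {"2024-01-01": "login"},
    },
    "user#43": {
        "profile": {"name": "bob", "email": "bob@example.com"},
    },
}


def get_cell(table, row_key, family, column, default=None):
    """Look up one cell; missing rows, families, or columns yield the default."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


name = get_cell(wide_column, "user#42", "profile", "name")
missing = get_cell(wide_column, "user#43", "activity", "2024-01-01")
```

The sparse lookup in `get_cell` reflects why wide‑column stores suit data where each row populates only a few of many possible columns.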
Impala offers fast, interactive SQL queries on data stored in HDFS or HBase.
Tableau and Power BI‑style alternatives (e.g., Infogram, ChartBlocks, Datawrapper, Plotly, Highcharts) provide visual analytics, charting, and dashboarding capabilities.
Overall, the list showcases a comprehensive ecosystem of tools for data ingestion, storage, processing, messaging, and visualization that together form the modern big‑data stack.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
