Comprehensive Overview of Open‑Source Big Data Tools and Platforms
This article presents a categorized catalogue of sixty open‑source big‑data projects (Hadoop‑related utilities, analytics platforms, databases, BI solutions, data‑mining packages, query engines, programming languages, search tools, and in‑memory technologies), noting each project's primary function, supported operating systems, and official link.
Analysts predict that Hadoop will become ubiquitous among large enterprises, with the market expected to grow at a 58% CAGR and exceed $1 billion by 2020; IBM has even allocated 3,500 researchers to develop Apache Spark, a component of the Hadoop ecosystem.
This article does not aim to rank tools but to introduce them by category, inviting readers to suggest additional open‑source big‑data or Hadoop utilities.
1. Hadoop‑Related Tools
1. Hadoop – Apache Hadoop is synonymous with big data, forming a comprehensive ecosystem for highly scalable distributed computing. Supported OS: Windows, Linux, OS X. Link: http://hadoop.apache.org
2. Ambari – A web‑based interface for configuring, managing, and monitoring Hadoop clusters, offering a REST API for integration. Supported OS: Windows, Linux, OS X. Link: http://ambari.apache.org
3. Avro – Provides a data‑serialization system with rich structures and a compact format; schemas are defined in JSON and integrate easily with dynamic languages. OS‑independent. Link: http://avro.apache.org
4. Cascading – A Hadoop‑based application development platform offering commercial support and training. OS‑independent. Link: http://www.cascading.org/projects/cascading/
5. Chukwa – Built on Hadoop, it collects data from large distributed systems for monitoring and includes analysis and visualization tools. Supported OS: Linux, OS X. Link: http://chukwa.apache.org
6. Flume – Collects log data from other applications and delivers it to Hadoop; it is fault‑tolerant and highly configurable. Supported OS: Linux, OS X. Link: https://cwiki.apache.org/confluence/display/FLUME/Home
7. HBase – A distributed database designed for billions of rows and millions of columns, offering random real‑time read/write access; similar to Google’s Bigtable and built on HDFS. OS‑independent. Link: http://hbase.apache.org
8. HDFS – Hadoop Distributed File System, a fault‑tolerant, highly scalable file system written in Java; can also be used independently. Supported OS: Windows, Linux, OS X. Link: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
9. Hive – A data‑warehouse system for Hadoop that provides HiveQL, an SQL‑like query language. OS‑independent. Link: http://hive.apache.org
10. Hivemall – Extends Hive with a collection of scalable machine‑learning algorithms for classification, recommendation, k‑NN, anomaly detection, and feature hashing. OS‑independent. Link: https://github.com/myui/hivemall
11. Mahout – Provides a scalable environment for building machine‑learning applications, including MapReduce‑based algorithms and newer Scala/Spark implementations. OS‑independent. Link: http://mahout.apache.org
12. MapReduce – The programming model that underpins Hadoop for processing large distributed data sets; also used by CouchDB, MongoDB, and Riak. OS‑independent. Link: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
13. Oozie – A workflow scheduler for managing Hadoop jobs, capable of triggering tasks based on time or data availability and integrating with MapReduce, Pig, Hive, Sqoop, etc. Supported OS: Linux, OS X. Link: http://oozie.apache.org
14. Pig – A platform for analyzing large data sets that uses the Pig Latin language to simplify parallel programming, optimization, and scalability. OS‑independent. Link: http://pig.apache.org
15. Sqoop – Facilitates bulk transfer of data between relational databases and Hadoop, supporting import to Hive/HBase and export to RDBMS. OS‑independent. Link: http://sqoop.apache.org
16. Spark – An alternative to MapReduce, Spark is an in‑memory data‑processing engine that its developers say can run up to 100× faster than MapReduce in memory and 10× faster on disk; it works with Hadoop, Mesos, or standalone. Supported OS: Windows, Linux, OS X. Link: http://spark.apache.org
17. Tez – Built on Hadoop YARN, Tez provides a DAG‑based application framework that simplifies complex Hive and Pig tasks. Supported OS: Windows, Linux, OS X. Link: http://tez.apache.org
18. ZooKeeper – A centralized service for maintaining configuration information, naming, distributed synchronization, and group services, enabling coordination among Hadoop nodes. Supported OS: Linux, Windows (dev only), OS X (dev only). Link: http://zookeeper.apache.org
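To make the MapReduce programming model from item 12 more concrete, here is a minimal single‑process sketch of a word count in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. It shows the three logical steps a MapReduce framework distributes across a cluster: map each record to key/value pairs, group the pairs by key, then reduce each group to a result.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data tools", "open source big data"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; this sketch only mirrors the data flow on one machine.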
2. Big‑Data Analysis Platforms and Tools
19. Disco – A distributed computing framework originally from Nokia, similar to Hadoop and based on MapReduce, with its own distributed file system and key/value store. Supported OS: Linux, OS X. Link: http://discoproject.org
20. HPCC – An alternative big‑data platform promising high speed and scalability; available as a free community edition and commercial versions with services. Supported OS: Linux. Link: http://hpccsystems.com
21. Lumify – An open‑source data integration, analysis, and visualization platform from Altamira; an online demo is available. Supported OS: Linux. Link: http://lumify.io
22. Pandas – Python‑based data structures and analysis tools that allow Python to be used alongside R for big‑data analytics. Supported OS: Windows, Linux, OS X. Link: http://pandas.pydata.org
23. Storm – An Apache project providing real‑time stream processing, used by Twitter, Weather Channel, WebMD, Alibaba, Yelp, Spotify, etc. Supported OS: Linux. Link: https://storm.apache.org
3. Databases / Data Warehouses
24. Blazegraph – Formerly “Bigdata”, a highly scalable, high‑performance database available under open‑source and commercial licenses. OS‑independent. Link: http://www.systap.com/bigdata
25. Cassandra – A NoSQL database originally developed by Facebook, now used by over 1,500 organizations (e.g., Apple, CERN, Netflix). Supports massive clusters; Apple’s deployment has >75,000 nodes and >10 PB of data. OS‑independent. Link: http://cassandra.apache.org
26. CouchDB – Stores data as JSON documents accessible via web browsers and JavaScript; offers high availability and scalability. Supported OS: Windows, Linux, OS X, Android. Link: http://couchdb.apache.org
27. FlockDB – A fast, highly scalable graph database from Twitter, suited for social‑network data; the open‑source version has not been updated recently. OS‑independent. Link: https://github.com/twitter/flockdb
28. Hibari – An Erlang‑based distributed ordered key‑value store emphasizing strong consistency; originally from Gemini Mobile Technologies and now used by telecom operators in Europe and Asia. OS‑independent. Link: http://hibari.github.io/hibari-doc/
29. Hypertable – A Hadoop‑compatible big‑data database promising ultra‑high performance; used by companies such as Baidu and Yelp. Supported OS: Linux, OS X. Link: http://hypertable.org
30. Impala – Cloudera’s SQL‑based analytics database for Hadoop, marketed as a leading open‑source solution; available as a standalone download and as part of Cloudera’s commercial suite. Supported OS: Linux, OS X. Link: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
31. InfoBright Community Edition – A column‑oriented database designed for analytics with high compression; commercial versions and support are offered. Supported OS: Windows, Linux. Link: http://www.infobright.org
32. MongoDB – A popular NoSQL database with over 10 million downloads; offers enterprise editions, support, and training. Supported OS: Windows, Linux, OS X, Solaris. Link: http://www.mongodb.org
33. Neo4j – Marketed as the fastest, most scalable native graph database, used by companies such as Walmart and CrunchBase. Supported OS: Windows, Linux. Link: http://neo4j.org
34. OrientDB – A multi‑model database combining graph and document features; commercial support and services are available. OS‑independent. Link: http://www.orientdb.org/index.htm
35. Pivotal Greenplum Database – An enterprise‑grade analytics database claimed to be the best in its class, part of Pivotal’s big‑data suite. Supported OS: Windows, Linux, OS X. Link: http://pivotal.io/big-data/pivotal-greenplum-database
36. Riak – A fully featured distributed NoSQL database (KV) and object store (S2) with open‑source and commercial editions, plus integrations for Spark, Redis, and Solr. Supported OS: Linux, OS X. Link: http://basho.com/riak-0-10-is-full-of-great-stuff/
37. Redis – A key‑value cache and storage system sponsored by Pivotal; a Windows port exists on GitHub. Supported OS: Linux. Link: http://redis.io
4. Business Intelligence
38. Talend Open Studio – An open‑source data‑integration suite with over 2 million downloads; the company also offers commercial big‑data, cloud, and master‑data‑management tools. Supported OS: Windows, Linux, OS X. Link: http://www.talend.com/index.php
39. Jaspersoft – Provides flexible, embeddable BI tools with community, professional, and AWS editions; used by organizations such as AIG and GE. OS‑independent. Link: http://www.jaspersoft.com
40. Pentaho – Offers a suite of data‑integration and analytics tools; three community editions are available alongside commercial support. Supported OS: Windows, Linux, OS X. Link: http://community.pentaho.com
41. SpagoBI – Marketed as an open‑source leader, it provides BI, middleware, QA software, and a Java EE application framework; free core version plus paid support. OS‑independent. Link: http://www.spagoworld.org/xwiki/bin/view/SpagoWorld/
42. KNIME – The Konstanz Information Miner, an open‑source analytics and reporting platform with commercial extensions. Supported OS: Windows, Linux, OS X. Link: http://www.knime.org
43. BIRT – The Business Intelligence and Reporting Tools project, part of Eclipse, enabling embedded visualizations and reports; supported by Actuate, IBM, and Innovent Solutions. OS‑independent. Link: http://www.eclipse.org/birt/
5. Data Mining
44. DataMelt – Successor to jHepWork, it handles mathematical computation, data mining, statistical analysis, and visualization; supports Java, Jython, Groovy, JRuby, and BeanShell. OS‑independent. Link: http://jwork.org/dmelt/
45. KEEL – The Knowledge Extraction based on Evolutionary Learning framework, a Java‑based machine‑learning toolkit for classification, clustering, pattern mining, etc. OS‑independent. Link: http://keel.es
46. Orange – An open‑source data‑mining suite offering visual programming and Python scripting for analysis and visualization. Supported OS: Windows, Linux, OS X. Link: http://orange.biolab.si
47. RapidMiner – A data‑science platform with over 250,000 users (e.g., PayPal, Deloitte); it comes in open‑source and commercial editions, and the free version is limited to CSV/Excel input. OS‑independent. Link: https://rapidminer.com
48. Rattle – A graphical front‑end for the R language that simplifies statistical summaries, modeling, and data transformation. Supported OS: Windows, Linux, OS X. Link: http://rattle.togaware.com
49. SPMF – Provides 93 algorithms for sequential pattern mining, association rule mining, clustering, etc.; usable standalone or integrated into Java programs. OS‑independent. Link: http://www.philippe-fournier-viger.com/spmf/
50. Weka – A Java‑based suite of machine‑learning algorithms for data preprocessing, classification, clustering, association rules, and visualization. Supported OS: Windows, Linux, OS X. Link: http://www.cs.waikato.ac.nz/~ml/weka/
6. Query Engines
51. Drill – An Apache project enabling SQL‑based queries across Hadoop, NoSQL databases, and cloud storage (e.g., HBase, MongoDB, S3, Azure Blob). Supported OS: Windows, Linux, OS X. Link: http://drill.apache.org
7. Programming Languages
52. R – A language and environment for statistical computing and graphics, offering extensive big‑data tools for processing, computation, and visualization. Supported OS: Windows, Linux, OS X. Link: http://www.r-project.org
53. ECL – Enterprise Control Language used to build big‑data applications on the HPCC platform; HPCC provides an IDE, tutorials, and related tools. Supported OS: Linux. Link: http://hpccsystems.com/download/docs/ecl-language-reference
8. Big‑Data Search
54. Lucene – A Java‑based full‑text search library capable of indexing over 150 GB per hour on modern hardware; sponsored by the Apache Software Foundation. OS‑independent. Link: http://lucene.apache.org/core/
55. Solr – Built on Lucene, a highly reliable, scalable enterprise search platform used by companies such as Netflix and Instagram. OS‑independent. Link: http://lucene.apache.org/solr/
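Full‑text search libraries such as Lucene are built around an inverted index, which maps each term to the documents that contain it. The sketch below is a toy illustration of that idea in plain Python, not Lucene's or Solr's actual implementation or API; the function names (build_index, search) and the whitespace tokenization are simplifying assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Apache Lucene search library",
        2: "Solr is built on Lucene",
        3: "Apache Ignite in memory",
    }
    index = build_index(docs)
    print(search(index, "apache lucene"))
```

Production engines add analysis (stemming, stop words), relevance scoring, and compressed on‑disk index segments on top of this basic structure.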
9. In‑Memory Technologies
56. Ignite – An Apache project offering a high‑performance, integrated, distributed in‑memory platform for real‑time analytics, featuring data grids, compute grids, streaming, Hadoop acceleration, and more. OS‑independent. Link: https://ignite.incubator.apache.org
57. Terracotta – Provides the BigMemory in‑memory data‑management platform, claimed to be among the world’s best; offers commercial editions, support, consulting, and training. OS‑independent. Link: http://www.terracotta.org
58. Pivotal GemFire/Geode – Pivotal has open‑sourced key components of its big‑data suite, including the GemFire in‑memory NoSQL database, which is now developed as the Apache Geode project; commercial versions also exist. Supported OS: Windows, Linux. Link: http://pivotal.io/big-data/pivotal-gemfire
59. GridGain – Built on Apache Ignite, GridGain provides in‑memory data structures for fast big‑data processing and a Hadoop accelerator; offers both enterprise and free community editions. Supported OS: Windows, Linux, OS X. Link: http://www.gridgain.com
60. Infinispan – A Red Hat JBoss project delivering a distributed in‑memory data grid for caching, NoSQL storage, and clustering. OS‑independent. Link: http://www.jboss.org/infinispan.html
Original source: 51CTO thebigdata.cn