Comprehensive Overview of Open‑Source Big Data Tools and Platforms

This article presents a categorized catalogue of 60 open‑source big‑data projects (Hadoop‑related utilities, analytics platforms, databases, BI solutions, data‑mining packages, query engines, programming languages, search tools, and in‑memory technologies), highlighting their primary functions, supported operating systems, and official links.

Qunar Tech Salon

Aug 17, 2015

Analysts predict that Hadoop will become ubiquitous among large enterprises, with the market expected to grow at a 58% CAGR and exceed $1 billion by 2020. IBM has even assigned 3,500 researchers to the development of Apache Spark, a component of the Hadoop ecosystem.

This article does not aim to rank tools but to introduce them by category, inviting readers to suggest additional open‑source big‑data or Hadoop utilities.

1. Hadoop‑related tools

1. Hadoop – Apache Hadoop is synonymous with big data, forming a comprehensive ecosystem for highly scalable distributed computing. Supported OS: Windows, Linux, OS X. Link: http://hadoop.apache.org

2. Ambari – A web‑based interface for configuring, managing, and monitoring Hadoop clusters, offering a REST API for integration. Supported OS: Windows, Linux, OS X. Link: http://ambari.apache.org

3. Avro – Provides a data‑serialization system with rich structures and a compact format; schemas are defined in JSON and integrate easily with dynamic languages. OS‑independent. Link: http://avro.apache.org

4. Cascading – A Hadoop‑based application development platform offering commercial support and training. OS‑independent. Link: http://www.cascading.org/projects/cascading/

5. Chukwa – Built on Hadoop, it collects data from large distributed systems for monitoring and includes analysis and visualization tools. Supported OS: Linux, OS X. Link: http://chukwa.apache.org

6. Flume – Collects log data from other applications and delivers it to Hadoop; it is fault‑tolerant and highly configurable. Supported OS: Linux, OS X. Link: https://cwiki.apache.org/confluence/display/FLUME/Home

7. HBase – A distributed database designed for billions of rows and millions of columns, offering random real‑time read/write access; similar to Google’s Bigtable and built on HDFS. OS‑independent. Link: http://hbase.apache.org

8. HDFS – Hadoop Distributed File System, a fault‑tolerant, highly scalable file system written in Java; can also be used independently. Supported OS: Windows, Linux, OS X. Link: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

9. Hive – A data‑warehouse system for Hadoop that provides HiveQL, an SQL‑like query language. OS‑independent. Link: http://hive.apache.org

10. Hivemall – Extends Hive with a collection of scalable machine‑learning algorithms for classification, recommendation, k‑NN, anomaly detection, and feature hashing. OS‑independent. Link: https://github.com/myui/hivemall

11. Mahout – Provides a scalable environment for building machine‑learning applications, including MapReduce‑based algorithms and newer Scala/Spark implementations. OS‑independent. Link: http://mahout.apache.org

12. MapReduce – The programming model that underpins Hadoop for processing large distributed data sets; also used by CouchDB, MongoDB, and Riak. OS‑independent. Link: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

13. Oozie – A workflow scheduler for managing Hadoop jobs, capable of triggering tasks based on time or data availability and integrating with MapReduce, Pig, Hive, Sqoop, etc. Supported OS: Linux, OS X. Link: http://oozie.apache.org

14. Pig – A platform for analyzing large data sets that uses the Pig Latin language to simplify parallel programming, optimization, and scalability. OS‑independent. Link: http://pig.apache.org

15. Sqoop – Facilitates bulk transfer of data between relational databases and Hadoop, supporting import to Hive/HBase and export to RDBMS. OS‑independent. Link: http://sqoop.apache.org

16. Spark – An alternative to MapReduce, Spark is an in‑memory data‑processing engine that is claimed to run programs up to 100× faster than Hadoop MapReduce in memory and 10× faster on disk; it works with Hadoop, Mesos, or standalone. Supported OS: Windows, Linux, OS X. Link: http://spark.apache.org

17. Tez – Built on Hadoop YARN, Tez provides a DAG‑based application framework that simplifies complex Hive and Pig tasks. Supported OS: Windows, Linux, OS X. Link: http://tez.apache.org

18. ZooKeeper – A centralized service for maintaining configuration information, naming, distributed synchronization, and group services, enabling coordination among Hadoop nodes. Supported OS: Linux, Windows (dev only), OS X (dev only). Link: http://zookeeper.apache.org
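
To make the MapReduce programming model from entry 12 more concrete, here is a minimal single‑process sketch in plain Python. It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real framework would run the map and reduce steps in parallel across many nodes and handle the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (done by the framework in a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data tools", "open source big data"]
    print(word_count(sample))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be distributed independently, which is the property the model relies on for scalability.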

2. Big‑Data Analysis Platforms and Tools

19. Disco – A distributed computing framework originally from Nokia, similar to Hadoop and based on MapReduce, with its own distributed file system and key/value store. Supported OS: Linux, OS X. Link: http://discoproject.org

20. HPCC – An alternative big‑data platform promising high speed and scalability; available as a free community edition and commercial versions with services. Supported OS: Linux. Link: http://hpccsystems.com

21. Lumify – An open‑source data integration, analysis, and visualization platform from Altamira; an online demo is available for trying it out. Supported OS: Linux. Link: http://lumify.io

22. pandas – Python‑based data structures and analysis tools that let an analysis workflow stay in Python rather than switching to a domain‑specific language such as R. Supported OS: Windows, Linux, OS X. Link: http://pandas.pydata.org

23. Storm – An Apache project providing real‑time stream processing, used by Twitter, Weather Channel, WebMD, Alibaba, Yelp, Spotify, etc. Supported OS: Linux. Link: https://storm.apache.org

3. Databases / Data Warehouses

24. Blazegraph – Formerly “Bigdata”, a highly scalable, high‑performance database available under open‑source and commercial licenses. OS‑independent. Link: http://www.systap.com/bigdata

25. Cassandra – A NoSQL database originally developed by Facebook, now used by over 1,500 organizations (e.g., Apple, CERN, Netflix). Supports massive clusters; Apple’s deployment has >75,000 nodes and >10 PB of data. OS‑independent. Link: http://cassandra.apache.org

26. CouchDB – Stores data as JSON documents accessible via web browsers and JavaScript; offers high availability and scalability. Supported OS: Windows, Linux, OS X, Android. Link: http://couchdb.apache.org

27. FlockDB – A fast, highly scalable graph database from Twitter, suited for social‑network data; the open‑source version has not been updated recently. OS‑independent. Link: https://github.com/twitter/flockdb

28. Hibari – An Erlang‑based distributed ordered key‑value store emphasizing strong consistency; originally from Gemini Mobile Technologies and now used by telecom operators in Europe and Asia. OS‑independent. Link: http://hibari.github.io/hibari-doc/

29. Hypertable – A Hadoop‑compatible big‑data database promising ultra‑high performance; used by companies such as Baidu and Yelp. Supported OS: Linux, OS X. Link: http://hypertable.org

30. Impala – Cloudera’s SQL‑based analytics database for Hadoop, marketed as a leading open‑source solution; available as a standalone download and as part of Cloudera’s commercial suite. Supported OS: Linux, OS X. Link: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

31. InfoBright Community Edition – A column‑oriented database designed for analytics with high compression; commercial versions and support are offered. Supported OS: Windows, Linux. Link: http://www.infobright.org

32. MongoDB – A popular NoSQL database with over 10 million downloads; offers enterprise editions, support, and training. Supported OS: Windows, Linux, OS X, Solaris. Link: http://www.mongodb.org

33. Neo4j – Marketed as the fastest, most scalable native graph database, used by companies such as Walmart and CrunchBase. Supported OS: Windows, Linux. Link: http://neo4j.org

34. OrientDB – A multi‑model database combining graph and document features; commercial support and services are available. OS‑independent. Link: http://www.orientdb.org/index.htm

35. Pivotal Greenplum Database – An enterprise‑grade analytics database claimed to be the best in its class, part of Pivotal’s big‑data suite. Supported OS: Windows, Linux, OS X. Link: http://pivotal.io/big-data/pivotal-greenplum-database

36. Riak – A fully featured distributed NoSQL database (KV) and object store (S2) with open‑source and commercial editions, plus integrations for Spark, Redis, and Solr. Supported OS: Linux, OS X. Link: http://basho.com/riak-0-10-is-full-of-great-stuff/

37. Redis – A key‑value cache and storage system sponsored by Pivotal; a Windows port exists on GitHub. Supported OS: Linux. Link: http://redis.io

4. Business Intelligence

38. Talend Open Studio – An open‑source data‑integration suite with over 2 million downloads; the company also offers commercial big‑data, cloud, and master‑data‑management tools. Supported OS: Windows, Linux, OS X. Link: http://www.talend.com/index.php

39. Jaspersoft – Provides flexible, embeddable BI tools with community, professional, and AWS editions; used by organizations such as AIG and GE. OS‑independent. Link: http://www.jaspersoft.com

40. Pentaho – Offers a suite of data‑integration and analytics tools; three community editions are available alongside commercial support. Supported OS: Windows, Linux, OS X. Link: http://community.pentaho.com

41. SpagoBI – Marketed as an open‑source leader, it provides BI, middleware, QA software, and a Java EE application framework; free core version plus paid support. OS‑independent. Link: http://www.spagoworld.org/xwiki/bin/view/SpagoWorld/

42. KNIME – The Konstanz Information Miner, an open‑source analytics and reporting platform with commercial extensions. Supported OS: Windows, Linux, OS X. Link: http://www.knime.org

43. BIRT – The Business Intelligence and Reporting Tools project, part of Eclipse, enabling embedded visualizations and reports; supported by Actuate, IBM, and Innovent Solutions. OS‑independent. Link: http://www.eclipse.org/birt/

5. Data Mining

44. DataMelt – Successor to jHepWork, it handles mathematical computation, data mining, statistical analysis, and visualization; supports Java, Jython, Groovy, JRuby, and BeanShell. OS‑independent. Link: http://jwork.org/dmelt/

45. KEEL – The Knowledge Extraction based on Evolutionary Learning framework, a Java‑based machine‑learning toolkit for classification, clustering, pattern mining, etc. OS‑independent. Link: http://keel.es

46. Orange – An open‑source data‑mining suite offering visual programming and Python scripting for analysis and visualization. Supported OS: Windows, Linux, OS X. Link: http://orange.biolab.si

47. RapidMiner – A data‑science platform with over 250,000 users (e.g., PayPal, Deloitte); offers open‑source and commercial editions, with the free version limited to CSV/Excel input. OS‑independent. Link: https://rapidminer.com

48. Rattle – A graphical front‑end for the R language that simplifies statistical summaries, modeling, and data transformation. Supported OS: Windows, Linux, OS X. Link: http://rattle.togaware.com

49. SPMF – Provides 93 algorithms for sequential pattern mining, association rule mining, clustering, etc.; usable standalone or integrated into Java programs. OS‑independent. Link: http://www.philippe-fournier-viger.com/spmf/

50. Weka – A Java‑based suite of machine‑learning algorithms for data preprocessing, classification, clustering, association rules, and visualization. Supported OS: Windows, Linux, OS X. Link: http://www.cs.waikato.ac.nz/~ml/weka/

6. Query Engines

51. Drill – An Apache project enabling SQL‑based queries across Hadoop, NoSQL databases, and cloud storage (e.g., HBase, MongoDB, S3, Azure Blob). Supported OS: Windows, Linux, OS X. Link: http://drill.apache.org

7. Programming Languages

52. R – A language and environment for statistical computing and graphics, offering extensive big‑data tools for processing, computation, and visualization. Supported OS: Windows, Linux, OS X. Link: http://www.r-project.org

53. ECL – Enterprise Control Language used to build big‑data applications on the HPCC platform; HPCC provides an IDE, tutorials, and related tools. Supported OS: Linux. Link: http://hpccsystems.com/download/docs/ecl-language-reference

8. Big‑Data Search

54. Lucene – A Java‑based full‑text search library capable of indexing over 150 GB per hour on modern hardware; sponsored by the Apache Software Foundation. OS‑independent. Link: http://lucene.apache.org/core/

55. Solr – Built on Lucene, a highly reliable, scalable enterprise search platform used by companies such as Netflix and Instagram. OS‑independent. Link: http://lucene.apache.org/solr/
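
Full‑text search libraries such as Lucene are built around an inverted index, a mapping from each term to the documents that contain it. The sketch below illustrates only that core idea in plain Python; it is not Lucene's API, the names `build_index` and `search` are hypothetical, and real engines add text analysis, relevance scoring, and on‑disk index segments on top.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return sorted ids of documents that contain every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids inserting empty entries into the index for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    docs = {
        1: "Hadoop distributed file system",
        2: "distributed search platform",
        3: "full text search library",
    }
    index = build_index(docs)
    print(search(index, "distributed search"))
```

Looking up a term costs one dictionary access regardless of how many documents exist, which is why this structure suits large collections better than scanning every document per query.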

9. In‑Memory Technologies

56. Ignite – An Apache project offering a high‑performance, integrated, distributed in‑memory platform for real‑time analytics, featuring data grids, compute grids, streaming, Hadoop acceleration, and more. OS‑independent. Link: https://ignite.incubator.apache.org

57. Terracotta – Provides the BigMemory in‑memory data‑management platform, claimed to be among the world’s best; offers commercial editions, support, consulting, and training. OS‑independent. Link: http://www.terracotta.org

58. Pivotal GemFire/Geode – Pivotal opened the source of its key big‑data components, including the GemFire in‑memory NoSQL database, now managed under the Apache Geode project; commercial versions also exist. Supported OS: Windows, Linux. Link: http://pivotal.io/big-data/pivotal-gemfire

59. GridGain – Built on Apache Ignite, GridGain provides in‑memory data structures for fast big‑data processing and a Hadoop accelerator; offers both enterprise and free community editions. Supported OS: Windows, Linux, OS X. Link: http://www.gridgain.com

60. Infinispan – A Red Hat JBoss project delivering a distributed in‑memory data grid for caching, NoSQL storage, and clustering. OS‑independent. Link: http://www.jboss.org/infinispan.html

Original source: 51CTO thebigdata.cn
