Overview of 50 Additional Big Data Terms and Apache Projects

This article gives an overview of fifty additional big-data terms and Apache open-source projects, covering Kafka, Hive, Spark, data cleaning, AI, graph databases, and many other tools and techniques used in modern data engineering and analytics.

Architects Research Society

Jul 23, 2017

In the first article we introduced the following terms: algorithm, analysis, descriptive analysis, prescriptive analysis, predictive analysis, batch processing, Cassandra, cloud computing, cluster computing, dark data, data lake, data mining, data scientist, distributed file system, ETL, Hadoop, in-memory computing, IoT, machine learning, MapReduce, NoSQL, R, Spark, stream processing, and structured and unstructured data.

Now let's look at 50 more big data terms.

The Apache Software Foundation (ASF) hosts more than 350 open-source projects, many of them related to big data. Explaining them all would take a whole day, so I have picked a few popular ones.

Apache Kafka: Named after the famous Czech writer, Kafka is used to build real-time data pipelines and streaming applications. It is popular because it can store, manage and process streams of data in a fault-tolerant way and is reputedly very fast. It is especially widely used in social-network environments.

Apache Mahout: Mahout provides a library of pre‑built algorithms for machine learning and data mining, as well as an environment for creating new algorithms. In other words, a machine‑learning heaven.

Apache Oozie: In any programming environment you need a workflow system to schedule and run jobs in a predefined order and with predefined dependencies. Oozie provides this for big data jobs written in Pig, MapReduce, Hive and other tools.
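To make "jobs with predefined dependencies" concrete, here is a minimal sketch in plain Python. It is not Oozie's API; the job names are hypothetical, and the standard-library `graphlib` module merely stands in for a scheduler that works out a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the set of jobs it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "hive_report": {"clean_logs"},
    "pig_summary": {"clean_logs"},
    "publish": {"hive_report", "pig_summary"},
}

def execution_order(jobs):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(jobs).static_order())

order = execution_order(workflow)
print(order)
```

The two middle jobs have no dependency on each other, so a real scheduler would be free to run them in parallel.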

Apache Drill, Apache Impala, Apache Spark SQL: All three provide fast, interactive SQL-style access to data stored in Hadoop. If you already know SQL and work with data held in big-data stores (e.g., HBase or HDFS), they are very useful.

Apache Hive: Know SQL? Then you’re in good hands with Hive. Hive helps read, write and manage large datasets residing in distributed storage using SQL.

Apache Pig: Pig is a platform for creating query execution routines on large distributed datasets. The scripting language is called Pig Latin (no, I didn’t make it up). Pig is said to be easy to understand and learn, though I wonder how many people actually do.

Apache Sqoop: A tool for moving data between Hadoop and non-Hadoop data stores such as data warehouses and relational databases.

Apache Storm: A free, open-source, real-time distributed computing system. It makes it easier to process unstructured data continuously and instantly, work that would otherwise be done as batch processing with Hadoop.

Artificial Intelligence (AI): Why is AI here? Isn’t it a separate field, you may ask. All these emerging technologies are so interrelated that it is best to simply keep learning. AI is about developing intelligent machines and software that can perceive their environment, take appropriate actions, and continue learning. Sounds like machine learning? Join my “confused” club.

Behavioral analysis: Ever wonder how Google shows you ads for products you seem to need? Behavioral analysis focuses on understanding why consumers and applications act the way they do. It links data points such as browsing patterns, social media interactions and e-commerce actions, and tries to predict outcomes.

Brontobytes: A brontobyte is 1 followed by 27 zeros (10^27) bytes, the size of tomorrow’s digital universe. While we’re here, let’s talk about terabytes, petabytes, exabytes, zettabytes, yottabytes and brontobytes.
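The relationship between these units is just repeated multiplication by 1,000, which a few lines of Python can illustrate. This is a sketch using decimal units; the function names are my own, and "brontobyte" is an informal term rather than an official unit.

```python
# Decimal units: each step up the list is a factor of 1,000.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte",
         "exabyte", "zettabyte", "yottabyte", "brontobyte"]

def bytes_in(unit):
    """Number of bytes in one `unit`, e.g. 10**27 for a brontobyte."""
    return 1000 ** (UNITS.index(unit) + 1)

def convert(amount, from_unit, to_unit):
    """Convert an amount from one unit to another (result is a float)."""
    return amount * bytes_in(from_unit) / bytes_in(to_unit)

print(bytes_in("brontobyte"))                # 1 followed by 27 zeros
print(convert(1, "zettabyte", "terabyte"))   # how many terabytes in a zettabyte
```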

Business Intelligence (BI): I will reuse Gartner’s definition – BI is an umbrella term that includes applications, infrastructure, tools and best practices to access and analyze information to improve decision‑making and performance.

Biometric technology: This is James-Bond-style technology combined with analytics, identifying individuals by one or more physical traits such as the face, iris or fingerprint.

Click‑stream analysis: used to analyze users’ online clicks while browsing. Ever wonder why certain Google ads keep following you even when you switch sites? The system knows what you click.

Cluster analysis: An exploratory analysis that tries to identify structure in data. It is also called segmentation analysis or taxonomy analysis. It attempts to group cases (observations, participants, respondents) that belong together. Because it is exploratory, it makes no distinction between dependent and independent variables. SPSS offers various clustering methods for binary, nominal, ordinal and scale (interval or ratio) data.
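As a rough illustration of grouping cases that belong together, here is a tiny one-dimensional k-means sketch. K-means is only one of many clustering methods and is my choice for the example, not something the article prescribes; the ages and starting centers are made up.

```python
def kmeans_1d(values, centers, rounds=10):
    """Assign each number to its nearest center, then move each center
    to the mean of its group; repeat for a fixed number of rounds."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

ages = [21, 23, 25, 61, 64, 67]  # made-up survey ages
centers, groups = kmeans_1d(ages, centers=[20.0, 30.0])
print(centers, groups)
```

Note that no "correct answer" column is involved: the groups emerge from the data alone, which is what makes the analysis exploratory.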

Comparative analysis: In this article I will dive deeper into analysis, because the holy grail of big data is analyzing data. As the name suggests, comparative analysis uses statistical techniques such as pattern analysis, filtering and decision‑tree analysis to compare multiple processes, datasets or other objects. It can be used in healthcare to compare large numbers of medical records, documents, images, etc., for more effective and accurate diagnosis.

Connection analysis: You have probably seen spider‑web graphs that connect topics to identify influencers. Connection analysis helps you discover these inter‑relationships and influences among people, products and systems in a network, even combining data from multiple networks.

Data analyst: A data analyst is a very important and popular role; besides preparing reports, they collect, manipulate and analyze data. I will write a more detailed article about data analysts.

Data cleaning: Self‑explanatory – it involves detecting and correcting or removing inaccurate data or records from a database. Remember “dirty data”? Using a mix of manual and automated tools and algorithms, analysts can correct and enrich data to improve its quality. Dirty data leads to wrong analysis and bad decisions.
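A minimal sketch of the automated side of data cleaning might look like the following. The records, field names and validity rules are all made up for illustration; real cleaning rules depend on the data set.

```python
def clean_records(records):
    """Trim whitespace, normalize email case, and drop invalid or duplicate rows."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("name", "").strip()
        email = rec.get("email", "").strip().lower()
        if not name or "@" not in email:
            continue  # inaccurate or incomplete record: remove it
        if email in seen:
            continue  # duplicate record: keep only the first occurrence
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": " Ada ", "email": "ADA@example.com"},
    {"name": "Ada", "email": "ada@example.com "},
    {"name": "", "email": "nobody@example.com"},
    {"name": "Bob", "email": "bob-at-example.com"},
]
print(clean_records(raw))
```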

DaaS: You have SaaS, PaaS and now DaaS, which stands for Data as a Service. By giving customers on-demand access to cloud-hosted data, DaaS providers help them obtain high-quality data quickly.

Data virtualization: A data-management approach that allows applications to retrieve and manipulate data without needing to know where it is stored or how it is formatted. For example, social networks use this approach to store our photos on their networks.

Dirty data: Now big data is sexy, and people keep adding adjectives like dark data, dirty data, small data and now smart data. Dirty data is simply data that is inaccurate, duplicate or inconsistent.

Fuzzy logic: How often are we 100% certain that something is true? Our brains aggregate data into partial truths, then abstract a threshold that determines our response. Fuzzy logic is a form of computing that mimics human reasoning by working with degrees of partial truth rather than the absolute 0 or 1 of Boolean logic. It is used in natural language processing and other data-related fields.
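Partial truth is easy to show in code. In the sketch below, the "warm" thresholds of 15 and 25 degrees are arbitrary assumptions of mine; the min, max and complement operators are the commonly used fuzzy versions of AND, OR and NOT.

```python
def warm(temp_c):
    """Degree (0.0 to 1.0) to which a temperature counts as 'warm'."""
    if temp_c <= 15:
        return 0.0
    if temp_c >= 25:
        return 1.0
    return (temp_c - 15) / 10  # linear ramp between the two thresholds

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

print(warm(20))                  # partly true rather than simply True or False
print(fuzzy_and(warm(20), 0.8))  # combined with another partial truth
```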

Gamification: In a typical game you have elements such as points, competition and rules. Gamification in big data uses these concepts to collect or analyze data, or to motivate users.

Graph databases: Graph databases use nodes and edges to represent entities such as people or companies and the relationships between them. Relationship data of this kind is what lets a retailer such as Amazon suggest other products you might buy.
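The node-and-edge idea can be sketched without any database at all. The purchase data below is made up, and this is only a toy illustration of following edges, not how Amazon or any particular graph database actually works.

```python
from collections import Counter

# Edges of a tiny purchase graph: (customer node, product node). Made-up data.
BOUGHT = [
    ("alice", "tent"), ("alice", "stove"),
    ("bob", "tent"), ("bob", "stove"), ("bob", "lantern"),
    ("carol", "tent"), ("carol", "lantern"),
]

def also_bought(product, edges=BOUGHT):
    """Rank other products by how many buyers of `product` also bought them."""
    buyers = {c for c, p in edges if p == product}
    counts = Counter(p for c, p in edges if c in buyers and p != product)
    return counts.most_common()

print(also_bought("stove"))
```

The query walks two hops: from a product to its buyers, then from those buyers to their other products.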

Hadoop User Experience (Hue): Hue is an open‑source web interface that makes Hadoop easier. It provides a file browser for HDFS, a job designer for MapReduce, Oozie workflow editor, shell, Impala and Hive UI, and a set of Hadoop APIs.

HANA: High-performance Analytic Appliance, SAP’s in-memory platform designed for large-scale data transactions and analytics.

HBase: A distributed, column-oriented database. It uses HDFS for its underlying storage and supports both batch-style computation using MapReduce and interactive, transactional-style lookups.

Load balancing: Distributing workload across multiple computers or servers to achieve optimal system performance and utilization.
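Round-robin is probably the simplest way to spread a workload, and it fits in a few lines. This is a sketch with hypothetical server names; production load balancers also account for server health and current load.

```python
from itertools import cycle

def round_robin(servers, requests):
    """Assign each request to the next server in a repeating cycle."""
    assignment = {s: [] for s in servers}
    for server, request in zip(cycle(servers), requests):
        assignment[server].append(request)
    return assignment

print(round_robin(["a", "b", "c"], range(7)))
```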

Metadata: “Metadata is data about other data.” It summarizes basic information about data (e.g., author, creation date, file size), which makes it easier to find and work with specific instances of data. Metadata is also used for images, video, spreadsheets, web pages, etc.

MongoDB: MongoDB is a cross‑platform open‑source database that uses a document‑oriented data model instead of traditional table‑based relational structures, facilitating integration of structured and unstructured data in certain applications.

Mashup: The term means much the same as it does in everyday life. Basically, a mashup combines different data sets into a single application (e.g., merging real-estate listings with demographic or geographic data). It makes for very cool visualizations.
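The real-estate example can be sketched as a simple join on a shared key. All of the listings, ZIP codes and income figures here are made up, and `mash_up` is a hypothetical helper name.

```python
listings = [
    {"address": "12 Oak St", "zip": "94110", "price": 950_000},
    {"address": "7 Elm Ave", "zip": "73301", "price": 310_000},
]
demographics = {
    "94110": {"median_income": 120_000},
    "73301": {"median_income": 65_000},
}

def mash_up(listings, demographics):
    """Attach the demographic fields for each listing's ZIP code, if any."""
    return [{**home, **demographics.get(home["zip"], {})} for home in listings]

for row in mash_up(listings, demographics):
    print(row)
```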

Multidimensional databases: Databases optimized for OLAP applications and data warehouses. If you wonder what a data warehouse is, it is essentially a central repository of multiple data sources.

MultiValue databases: A type of NoSQL and multidimensional database that can handle three-dimensional data directly, which makes it well suited to working with HTML and XML strings.

Natural language processing: Software algorithms that enable computers to understand human language more accurately, allowing more natural and effective interaction.

Neural networks: According to http://neuralnetworksanddeeplearning.com/, neural networks are biologically‑inspired programming models that enable computers to learn from observed data. Deep learning, a set of powerful neural‑network learning techniques, is closely related.

Pattern recognition: Pattern recognition occurs when an algorithm locates recurring patterns or regularities in large data sets or across disparate data sets. It is closely tied to machine learning and data mining and helps researchers discover insights that would otherwise remain hidden.

RFID: Radio‑frequency identification; a sensor that transmits data wirelessly via non‑contact RF fields. With the IoT revolution, RFID tags can be embedded in virtually anything, generating massive amounts of data to analyze.

SaaS: Software as a Service enables vendors to host applications and make them available over the Internet. SaaS providers deliver services through the cloud.

Semi-structured data: Data that is not captured or formatted in conventional ways, such as XML documents, emails or JSON. It is not fully unstructured either, because it contains some structural elements such as tags or tables.
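A short sketch shows what "some structure, but not a fixed one" means in practice. The JSON records and column names below are made up; the point is that each record carries its own field names, and not every record has every field.

```python
import json

raw = """[
  {"id": 1, "name": "Ada", "tags": ["admin"]},
  {"id": 2, "name": "Bob", "email": "bob@example.com"},
  {"id": 3}
]"""

def to_rows(text, columns=("id", "name", "email")):
    """Flatten JSON records into fixed-width rows, filling gaps with None."""
    return [tuple(rec.get(col) for col in columns) for rec in json.loads(text)]

for row in to_rows(raw):
    print(row)
```

Forcing the records into fixed columns loses the `tags` field and leaves holes, which is exactly the mismatch between semi-structured data and a rigid table.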

Sentiment analysis: Involves capturing and tracking consumer opinions, emotions or feelings expressed in various interactions or documents (social media, call center transcripts, surveys, etc.). Text analysis and NLP are typical activities. The goal is to determine the sentiment toward a company, product, service, person or event.
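The simplest form of sentiment analysis counts words from positive and negative word lists. The sketch below uses tiny made-up lists; real systems rely on much larger lexicons or trained models and handle negation, sarcasm and context.

```python
import re

# Tiny made-up word lists; real lexicons are far larger.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "hate"}

def sentiment(text):
    """Score = positive words minus negative words; the sign gives the label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return score, label

print(sentiment("Love the app, support was helpful!"))
print(sentiment("Slow and broken."))
```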

Spatial analysis: Refers to analyzing spatial data such as geographic or topological data to identify patterns and regularities within data distributed across physical space.

Stream processing: Aims to operate on real‑time and streaming data via “continuous” queries. With constant streams from social networks, there is a clear need for stream processing and analysis to compute mathematical or statistical results without interruption.
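A "continuous" query can be sketched as a Python generator that emits an updated result for every incoming event. This is only an illustration of the idea with made-up numbers, not the API of any stream-processing engine.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per incoming event."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

for mean in running_mean([4, 8, 6]):
    print(mean)
```

Because the generator keeps only a count and a total, it never needs the whole data set in memory and could run over an unbounded source.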

Smart data: Supposedly useful and actionable data that has been filtered by algorithms.

Terabytes: A relatively large unit of digital data; one terabyte (TB) equals 1,000 gigabytes. It is estimated that 10 TB could hold the entire printed collection of the U.S. Library of Congress, and a single TB could hold 1,000 copies of the Encyclopaedia Britannica.

Visualization – with proper visualization, raw data becomes usable. Visualization does not just mean simple charts; it refers to complex graphics that can contain many data variables while remaining understandable and readable.

Yottabytes: About 1,000 zettabytes, or roughly 250 trillion DVDs. Today the entire digital universe is 1 zettabyte, which doubles every 18 months.

Zettabytes: About 1,000 exabytes, or 1 billion terabytes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
