Essential Big Data Glossary: Key Terms Every Data Professional Should Know
This article presents an A‑to‑Z glossary of common big‑data terminology, offering concise definitions for concepts such as aggregation, algorithms, analytics, AI, cloud computing, databases, machine learning, and more, to help readers quickly grasp the core vocabulary of the big‑data ecosystem.
The emergence of big data has introduced many new terms that are often difficult to understand. This article provides a list of common big‑data terms as a starting point for deeper exploration; some definitions are drawn from related blog posts.
A
Aggregation – The process of searching, merging, and displaying data.
Algorithms – Step‑by‑step procedures or mathematical formulas used to perform a specific data analysis.
Analytics – The discovery of meaningful patterns and insights in data.
Anomaly detection – Searching for data items that do not match expected patterns or behavior; also known as outliers, exceptions, surprises, or contaminants, often providing actionable information.
Anonymization – Making data anonymous by removing all personally identifiable information.
Application – Computer software that implements a specific function.
Artificial Intelligence – Developing intelligent machines and software that can perceive their environment, react accordingly, and even learn autonomously.
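As a rough illustration of the Aggregation entry above, the sketch below merges individual records into per‑group totals. The field names `region` and `amount` and the sample records are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def aggregate_sales(records):
    """Sum the 'amount' field per 'region' (both field names are hypothetical)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)


# Made-up sample records, for illustration only.
records = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 30.0},
]

totals = aggregate_sales(records)
print(totals)
```

Real aggregation tools do the same kind of grouping, only across far larger and more varied sources.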
B
Behavioural Analytics – Analyzes user behavior (what they do, why they do it, and what they have done) to draw conclusions, focusing on human‑centric patterns in data.
Big Data Scientist – A person who designs big‑data algorithms to make large datasets useful.
Big data startup – An emerging company developing cutting‑edge big‑data technologies.
Biometrics – Identity verification based on personal characteristics.
BB: Brontobytes – Approximately 1,000 Yottabytes (10^27 bytes, a 1 followed by 27 zeros); an informal unit used to describe the scale of the future digital universe.
Business Intelligence – A set of theories, methodologies, and processes that make data easier to understand.
C
Classification analysis – A systematic process for assigning data items to categories in order to obtain important, relevant information from the data.
Cloud computing – Distributed computing systems built on a network, with data stored off‑site (in the cloud).
Clustering analysis – Grouping similar objects together into clusters to analyze differences and similarities among data.
Cold data storage – Storing rarely used old data on low‑power servers, which makes retrieval time‑consuming.
Comparative analysis – Step‑by‑step comparison and calculation when pattern‑matching in very large datasets.
Complex structured data – Data composed of two or more complex, interrelated parts that cannot be simply parsed by SQL or similar tools.
Computer generated data – Data such as log files generated by computers.
Concurrency – Executing multiple tasks or processes simultaneously.
Correlation analysis – A data‑analysis method used to determine whether variables are positively or negatively correlated.
CRM: Customer Relationship Management – Technology for managing sales and business processes; big data influences CRM strategies.
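The Correlation analysis entry above can be made concrete with the Pearson correlation coefficient, one common measure of linear correlation. This is a minimal sketch; the function name `pearson` and the sample values are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two standard-deviation numerators.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # values rise together
negative = pearson([1, 2, 3], [3, 2, 1])         # one rises as the other falls
print(positive, negative)
```

A value near +1 indicates a positive correlation, a value near -1 a negative one, and a value near 0 little linear relationship.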
D
Dashboard – A graphical display that presents the results of data analysis in chart form.
Data aggregation tools – Tools that convert data from many sources into a new, unified data source.
Data analyst – A professional who performs data analysis, modeling, cleaning, and processing.
Database – A repository that stores data using a specific technology.
Database-as-a-Service – Cloud‑deployed, pay‑as‑you‑go databases, e.g., those offered by Amazon Web Services (AWS).
DBMS: Database Management System – Collects, stores, and provides access to data.
Data centre – A physical location housing servers that store data.
Data cleansing – Reviewing and validating data to delete duplicates, correct errors, and ensure consistency.
Data custodian – A technical professional responsible for maintaining the technical environment needed for data storage.
Data ethical guidelines – Principles that help organizations keep data transparent, secure, and private.
Data feed – A data stream, such as Twitter subscriptions or RSS.
Data marketplace – An online venue for buying and selling data sets.
Data mining – Extracting specific patterns or information from a data set.
Data modelling – Using modeling techniques to analyze data objects and uncover underlying meanings.
Data set – A collection of large amounts of data.
Data virtualization – Integrating data from various sources (databases, applications, file systems, web technologies, big‑data technologies) to obtain richer information.
De‑identification – Also called anonymization; ensures individuals cannot be identified from data.
Discriminant analysis – A statistical method that classifies data into groups and derives classification rules from known information.
Distributed File System – A file system that spreads data across multiple machines, providing a simplified, highly available way to store, analyze, and process data.
Document Store Databases – Document‑oriented databases designed to store and manage semi‑structured document data.
E
Exploratory analysis – Discovering patterns in data without a standard process, revealing main characteristics of a data set.
EB: Exabytes – Approximately 1,000 petabytes; the world generates about 1 EB of new information daily.
ETL: Extract, Transform and Load – A process for databases or data warehouses that extracts data from various sources, transforms it to meet business needs, and loads it into a database.
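The Data cleansing and ETL entries above can be sketched together as a tiny pipeline. This is only an illustration under stated assumptions: the CSV content, the `sales` table, and the column names are hypothetical, and an in‑memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with inconsistent casing and one duplicate row.
RAW_CSV = "name,city,amount\nalice,Berlin,10.5\nBOB,berlin,4\nalice,Berlin,10.5\n"


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize casing, convert types, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        rec = (row["name"].strip().title(),
               row["city"].strip().title(),
               float(row["amount"]))
        if rec not in seen:
            seen.add(rec)
            cleaned.append(rec)
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(row_count, total)
```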
F
Failover – Automatically switches tasks to another available server or node when a server fails.
Fault‑tolerant design – A system design that continues operating even when part of it fails.
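The Failover entry above can be sketched as a loop that tries servers in order until one succeeds. The `primary` and `standby` functions are hypothetical stand‑ins for real servers, and using `ConnectionError` as the failure signal is an assumption made for this example.

```python
def call_with_failover(servers, request):
    """Try each server in order and return the first successful response."""
    errors = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next server.
            errors.append(str(exc))
    raise RuntimeError(f"all {len(servers)} servers failed: {errors}")


def primary(request):
    """Hypothetical primary server that is currently unavailable."""
    raise ConnectionError("primary is down")


def standby(request):
    """Hypothetical standby server that takes over."""
    return f"standby handled {request}"


result = call_with_failover([primary, standby], "job-1")
print(result)
```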
G
Gamification – Applying game thinking and mechanics to non‑game domains to create and detect data in a friendly, effective way.
Graph Databases – Store data using graph structures (nodes, edges, properties) that allow direct relationships between elements.
Grid computing – Connecting many geographically distributed computers, often via the cloud, to solve specific problems.
H
Hadoop – An open‑source distributed framework for developing distributed programs and performing big‑data computation and storage.
HBase – An open‑source, non‑relational, distributed database used together with the Hadoop framework.
HDFS – Hadoop Distributed File System, designed to run on commodity hardware.
HPC: High‑Performance‑Computing – Using supercomputers to solve extremely complex computational problems.
I
IMDB: In‑Memory Database – A database management system that stores data in main memory rather than on disk, enabling high‑speed processing.
Internet of Things – Embedding sensors in ordinary devices so they can connect to networks anytime, anywhere.
J
Juridical data compliance – Ensuring that data stored in cloud solutions across different countries complies with local laws.
K
Key‑Value Databases – Store data as a key pointing to a specific record, enabling fast lookup of basic data types.
L
Latency – The delay between a request and the system's response to it.
Legacy system – An old application, technology, or computing system that is no longer supported.
Load balancing – Distributing workload across multiple computers or servers to achieve optimal results and maximum utilization.
Location data – GPS information, i.e., geographic location data.
Log file – Files automatically generated by computer systems that record operational processes.
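The Load balancing entry above can be illustrated with round‑robin assignment, one of the simplest balancing strategies. The task and server names are hypothetical, and production load balancers typically also account for server health and current load.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(tasks, servers):
    """Assign each task to the next server in a repeating rotation."""
    rotation = cycle(servers)
    return {task: next(rotation) for task in tasks}


# Hypothetical task and server names.
assignment = assign_round_robin(
    ["t1", "t2", "t3", "t4", "t5", "t6"], ["a", "b", "c"])
load = Counter(assignment.values())
print(assignment)
print(load)
```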
M
Machine2Machine data (M2M) – Content exchanged between two or more machines.
Machine data – Data generated by sensors or algorithms on machines.
Machine learning – A subset of AI where machines learn from tasks they perform, improving over time.
MapReduce – A software framework for processing large‑scale data (Map: mapping, Reduce: reducing).
MPP: Massively Parallel Processing – Using multiple processors or computers simultaneously to handle a single computational task.
Metadata – Data that describes other data, i.e., information about data attributes.
MongoDB – An open‑source NoSQL (non‑relational) database.
Multi‑Dimensional Databases – Databases optimized for online analytical processing (OLAP) and data warehousing.
MultiValue Databases – A type of NoSQL database capable of handling three‑dimensional data, especially long strings, HTML, and XML.
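The MapReduce entry above can be sketched with the classic word‑count example. This single‑process version only illustrates the map, shuffle, and reduce phases; it is not the Hadoop API, and the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the framework performs the shuffle over the network.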
N
Natural Language Processing – A branch of computer science that studies interaction between computers and human language.
Network analysis – Analyzing relationships between nodes in a network or graph.
NewSQL – A class of relational database systems that aim to offer the scalability of NoSQL systems while keeping SQL and the transactional (ACID) guarantees of traditional databases.
NoSQL – Often read as “Not only SQL”; databases that go beyond the traditional relational model to handle massive scale and high concurrency, frequently by relaxing strict consistency guarantees.
O
Object Databases – Store data as objects for object‑oriented programming, differing from relational and graph databases.
Object‑based Image Analysis – Analyzes groups of related pixels (objects) rather than individual pixels in digital images.
Operational Databases – Databases that support routine organizational operations, typically using online transaction processing.
Optimization analysis – An algorithm‑driven optimization process during product design, allowing testing against preset criteria.
Ontology – A philosophical concept that defines a set of concepts and relationships within a domain, elevating data to a knowledge‑level representation.
Outlier detection – Identifying data points that deviate significantly from the average, indicating potential system issues.
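The Outlier detection entry above can be sketched with a z‑score test, which flags values that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0 and the sample readings are assumptions for illustration; other methods (for example, median‑based ones) are often more robust.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one obviously unusual value.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = find_outliers(readings)
print(outliers)
```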
P
Pattern Recognition – Using algorithms to identify patterns in data and predict new data from the same source.
PB: Petabytes – Approximately 1,000 terabytes; CERN’s Large Hadron Collider generates about 1 PB of particle data per second.
PaaS: Platform‑as‑a‑Service – A service that provides all necessary platform components for cloud solutions.
Predictive analysis – A valuable big‑data analysis method that forecasts near‑future behavior using various data sets (historical, transactional, social, personal).
Privacy – Separating personally identifiable data from other data to ensure user privacy.
Public data – Information or data sets created with public funding.
Q
Quantified Self – Using applications to track every user action throughout the day for better behavior understanding.
Query – Searching for information that answers a specific question.
R
Re‑identification – Merging multiple data sets to identify personal information from anonymized data.
Regression analysis – Determining the dependency relationship between two variables, assuming a one‑way causal link.
RFID – Radio‑frequency identification; a wireless, non‑contact technology that uses tags and readers to identify objects and transmit data.
Real‑time data – Data created, processed, stored, analyzed, and displayed within milliseconds.
Recommendation engine – Algorithms that suggest products to users based on previous purchase behavior.
Routing analysis – Analyzing multiple variables to find optimal transportation paths, reducing fuel costs and improving efficiency.
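The Regression analysis entry above can be illustrated with simple linear regression fitted by ordinary least squares, where one variable is treated as depending on the other. The function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```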
S
Semi‑structured data – Data without strict storage structure but organized using tags or markers to maintain hierarchy.
Sentiment Analysis – Algorithms that determine how people feel about specific topics.
Signal analysis – Analyzing product performance by measuring physical quantities over time or space, often using sensor data.
Similarity searches – Querying databases for objects most similar to a given object, regardless of type.
Simulation analysis – Simulating real‑world processes or systems, considering multiple variables to ensure optimal product performance.
Smart grid – Using sensors to monitor energy networks in real time, improving efficiency.
SaaS: Software‑as‑a‑Service – Web‑based applications accessed via a browser.
Spatial analysis – Analyzing geographic or topological data to discover patterns and rules in spatial distributions.
SQL – Structured Query Language; a language used to query and manage data in relational databases.
Structured data – Data that can be organized into rows and columns, such as records or fields, and precisely located.
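The SQL and Structured data entries above can be illustrated with a small query over rows and columns. The sketch uses Python's built‑in `sqlite3` module with an in‑memory database; the `orders` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured data: every row has the same named, typed columns.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 20.0), ("ben", 5.0), ("ann", 7.5)],
)

# An SQL query that answers: how much has each customer spent in total?
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
```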
T
TB: Terabytes – Approximately 1,000 gigabytes; a 1 TB capacity can store about 300 hours of video.
Time series analysis – Analyzing well‑defined data collected at regular time intervals.
Topological Data Analysis – Focuses on composite data models, cluster identification, and statistical significance of data.
Transactional data – Dynamic data that changes over time.
Transparency – Providing consumers with clear information about how their data is used and processed.
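The Time series analysis entry above can be illustrated with a simple moving average, a basic technique for smoothing values collected at regular intervals. The window size and the sample series are assumptions made for this example.

```python
def moving_average(series, window):
    """Return the average of each consecutive run of `window` values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical readings taken at regular intervals.
smoothed = moving_average([10, 12, 14, 16, 18], window=3)
print(smoothed)
```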
U
Un‑structured data – Data that is generally text‑heavy but may also contain dates, numbers, and facts.
V
Value – One of the 4 Vs of big data; usable data that creates significant value for organizations, society, and consumers.
Variability – Data meaning changes rapidly; the same word can have different meanings in different contexts.
Variety – Data appears in many forms, including structured, semi‑structured, un‑structured, and complex structured data.
Velocity – The speed at which data is created, stored, analyzed, virtualized, and processed in the big‑data era.
Veracity – Ensuring data correctness, which is essential for accurate analysis.
Visualization – Complex charts that convey large amounts of information in an easily understandable way.
Volume – The size of data, ranging from megabytes to Brontobytes.
W
Weather data – An important open public data source that, when combined with other data, can provide deep analytical insights.
X
XML Databases – Databases that store data in XML format, often associated with document‑oriented databases, allowing queries, export, and serialization.
Y
YB: Yottabytes – Approximately 1,000 Zettabytes, equivalent to roughly 250 trillion DVDs; the global digital universe is said to be about 1 YB and to double every 18 months.
Z
ZB: Zettabytes – Approximately 1,000 Exabytes; annual global network traffic was projected to pass 1 ZB around 2016.
Storage Capacity Units Conversion Table
1 Bit = Binary Digit
8 Bits = 1 Byte
1,000 Bytes = 1 Kilobyte
1,000 Kilobytes = 1 Megabyte
1,000 Megabytes = 1 Gigabyte
1,000 Gigabytes = 1 Terabyte
1,000 Terabytes = 1 Petabyte
1,000 Petabytes = 1 Exabyte
1,000 Exabytes = 1 Zettabyte
1,000 Zettabytes = 1 Yottabyte
1,000 Yottabytes = 1 Brontobyte
1,000 Brontobytes = 1 Geopbyte
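The conversion table above can be expressed as a small helper that turns a raw byte count into a readable figure. This sketch follows the decimal (factor of 1,000) steps used in the table and stops at yottabytes; the function name `human_readable` is hypothetical.

```python
# Unit names in ascending order, each 1,000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes):
    """Format a byte count using the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_readable(1500))
print(human_readable(10 ** 12))
```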
MaGe Linux Operations
