What’s New in Apache CarbonData 1.0.0? 80+ Features Boost Big Data Performance

Apache CarbonData 1.0.0, released while the project is in Apache incubation, adds more than 80 new features and bug fixes. Highlights include a new data loading solution, Spark 2.1 integration, UPDATE/DELETE SQL support, adaptive compression for numeric types, a B‑Tree LRU cache, the V2 format for faster first‑query performance, a vectorized reader, bucket‑table joins, off‑heap memory, single‑pass loading, and pre‑generated dictionaries, all aimed at faster, more flexible, and more efficient columnar storage for big‑data workloads.

Huawei Cloud Developer Alliance

Feb 7, 2017

Release Overview

On the second day of the Lunar New Year, Apache CarbonData released its fourth stable version, CarbonData 1.0.0. Developed by Huawei and contributed to the Apache Software Foundation, CarbonData is a columnar storage file format for the Hadoop ecosystem. It supports indexing, compression, and encoding, and aims to serve multiple analytical needs with faster interactive queries. The project is currently in Apache incubation.

Key New Features (80+)

New data loading solution

Integration with Spark 2.1

Support for UPDATE/DELETE SQL

Adaptive compression for int/bigint/decimal types

Custom Date/Timestamp format per column

B‑Tree with LRU cache

CarbonData V2 format for faster first‑query performance

Vectorized reader

Bucket‑table support for fast joins

Off‑heap memory usage to reduce GC

Single‑pass data loading

Pre‑generated dictionaries for data loading

New Data Loading Solution

The previous version relied on the Kettle engine, which was not designed for big‑data scenarios and was hard to maintain. CarbonData 1.0.0 introduces a modular, Kettle‑free loading solution that improves performance.

Support for Spark 2.1 Integration

Spark 2.1 brings many new features and performance improvements. CarbonData now allows direct use of these Spark 2.1 capabilities.

Support UPDATE/DELETE SQL

Standard SQL syntax can be used to delete and update CarbonData tables. This feature is currently available only for Spark 1.5/1.6; Spark 2.1 users need to wait for future releases.

Adaptive Compression for Numeric Types

int, bigint, and decimal columns can now use adaptive compression to reduce storage size, selecting the best compression technique based on the data.
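
One way to picture the idea is width selection for integers: inspect the value range of a column and store it in the narrowest fixed-width type that fits. The sketch below is only an illustration of that general technique, not CarbonData's actual encoder; the function names and the four candidate widths are assumptions made for this example.

```python
import struct

# Candidate fixed-width encodings, narrowest first: (struct code, min, max).
# These four widths are an assumption for this sketch, not CarbonData's list.
_WIDTHS = [
    ("b", -2**7, 2**7 - 1),    # 1 byte
    ("h", -2**15, 2**15 - 1),  # 2 bytes
    ("i", -2**31, 2**31 - 1),  # 4 bytes
    ("q", -2**63, 2**63 - 1),  # 8 bytes
]

def choose_width(values):
    """Return the struct code of the narrowest type that holds every value."""
    lo, hi = min(values), max(values)
    for code, wmin, wmax in _WIDTHS:
        if wmin <= lo and hi <= wmax:
            return code
    raise ValueError("value does not fit in 64 bits")

def encode_column(values):
    """Pack a non-empty list of ints with the narrowest width that fits."""
    code = choose_width(values)
    return code, struct.pack(f"<{len(values)}{code}", *values)

def decode_column(code, payload):
    """Reverse encode_column using the stored width code."""
    count = len(payload) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", payload))
```

A column holding only small values such as `[3, 120, -7]` is packed at one byte per value here, where a fixed 64-bit layout would use eight.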

Custom Date/Timestamp Formats

Users can define Date/Timestamp formats for each column during data loading, and also set default formats to avoid repeated definitions.
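
The per-column lookup with a fallback default can be sketched as follows. This is a hypothetical illustration in plain Python: the function name, the column names, and the use of `strptime` patterns are assumptions for the example and do not reflect CarbonData's load options or its pattern syntax.

```python
from datetime import datetime

# Hypothetical default, used when a column has no format of its own.
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_row(row, column_formats, default_format=DEFAULT_FORMAT):
    """Parse each string in `row` with its column's format, else the default."""
    return {
        column: datetime.strptime(text, column_formats.get(column, default_format))
        for column, text in row.items()
    }

# Only order_date needs its own format; updated_at falls back to the default.
row = {"order_date": "07/02/2017", "updated_at": "2017-02-07 10:30:00"}
parsed = parse_row(row, {"order_date": "%d/%m/%Y"})
```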

B‑Tree LRU Cache

The B‑Tree stores block and blocklet information. With LRU (least recently used) caching, recently accessed block/blocklet metadata stays in memory, and the entries that have gone unused the longest are evicted automatically.
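
A minimal sketch of LRU eviction for metadata entries, assuming a fixed entry-count capacity and a caller-supplied loader function. Both are simplifications chosen for this example; the class and method names are hypothetical and not part of CarbonData.

```python
from collections import OrderedDict

class LRUMetadataCache:
    """Keeps at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader           # hypothetical: reads metadata on a miss
        self._entries = OrderedDict()  # oldest access first, newest last

    def get(self, block_id):
        if block_id in self._entries:
            # Hit: mark the entry as most recently used.
            self._entries.move_to_end(block_id)
            return self._entries[block_id]
        # Miss: load, insert as newest, then evict the oldest if over capacity.
        metadata = self.loader(block_id)
        self._entries[block_id] = metadata
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return metadata

    def cached_ids(self):
        """Return cached ids from least to most recently used."""
        return list(self._entries)
```

With a capacity of 2, accessing `b1`, `b2`, `b1`, `b3` in that order leaves `b1` and `b3` cached: `b2` is the entry that went unused the longest when `b3` arrived.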

CarbonData V2 Format Improves First‑Query Performance

The V2 format has a cleaner layout, stores less metadata, and reads metadata only when it is needed. Tests show first‑query response time about 50% lower than with V1, along with lower I/O consumption.

Vectorized Reader

This feature reads data in batches, reducing GC time and improving scan performance.
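
The batching idea can be sketched in a few lines: instead of handing the consumer one row object at a time, the reader yields slices of each column covering many rows. This is a conceptual illustration only; the function names and the in-memory column layout are assumptions, and the real reader works inside Spark on CarbonData files.

```python
def read_batches(columns, batch_size):
    """Yield dicts of column slices, `batch_size` rows at a time.

    `columns` maps a column name to a list of values; all lists are
    assumed to have the same length.
    """
    total = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, total, batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }

def sum_column(columns, name, batch_size=1024):
    """Aggregate one column batch by batch, with no per-row objects."""
    return sum(sum(batch[name]) for batch in read_batches(columns, batch_size))
```

Fewer, larger units of work mean fewer short-lived objects, which is where the GC saving described above comes from.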

Bucket‑Table Fast Join

If both tables are bucketed on the same column with the same number of buckets, joins can avoid shuffle, boosting join performance. This feature is supported in Spark 2.1.
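
Why matching buckets remove the shuffle can be shown with a small in-memory sketch. Rows with equal keys always hash to the same bucket number in both tables, so bucket i of one table only ever needs to be joined with bucket i of the other. The helper names below are hypothetical, and Python's built-in `hash` stands in for whatever hash function the engine uses.

```python
from collections import defaultdict

def bucketize(rows, key, num_buckets):
    """Distribute rows into `num_buckets` lists by hash of the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets

def bucket_join(left_buckets, right_buckets, key):
    """Inner-join two bucketed tables pairwise, bucket i with bucket i."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables need the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small hash index over one side of this bucket pair only.
        index = defaultdict(list)
        for row in right:
            index[row[key]].append(row)
        for row in left:
            for match in index.get(row[key], []):
                joined.append({**match, **row})
    return joined
```

Each bucket pair is independent of the others, so in a distributed engine the pairs can be joined locally without moving rows between nodes.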

Off‑Heap Memory to Reduce GC

Storing data off‑heap reduces GC overhead, which improves both loading and reading performance.

Single‑Pass Data Loading

Previously, data loading required two jobs: dictionary generation, followed by the data load itself. The new single‑pass approach combines both steps, improving performance for scenarios with minimal dictionary updates.
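
The difference between the two approaches can be sketched with a toy dictionary encoder. This is an illustration under simplifying assumptions: the function names, the in-memory dictionary, and the codes starting at 1 are all choices made for this example, not CarbonData's implementation.

```python
def load_two_pass(values):
    """Pass 1 builds the dictionary; pass 2 encodes with it."""
    dictionary = {}
    for value in values:
        dictionary.setdefault(value, len(dictionary) + 1)
    encoded = [dictionary[value] for value in values]
    return dictionary, encoded

def load_single_pass(values, dictionary=None):
    """Encode while reading, assigning a new code on first sight of a value.

    `dictionary` is an optional existing mapping from earlier loads;
    it is copied rather than modified in place.
    """
    dictionary = dict(dictionary or {})
    encoded = []
    for value in values:
        code = dictionary.get(value)
        if code is None:
            code = len(dictionary) + 1
            dictionary[value] = code
        encoded.append(code)
    return dictionary, encoded
```

Both functions produce the same result, but the single-pass version reads the input once. When most values are already in the dictionary from earlier loads, the new-code branch is rarely taken.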

Pre‑Generated Dictionaries for Data Loading

Users can generate dictionaries in advance and also provide custom dictionaries to accelerate data loading.
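
A rough sketch of why a ready-made dictionary helps: with the mapping built ahead of time, the load reduces to pure lookups and no distinct-value scan is needed. The one-value-per-line file layout and the function names below are assumptions for this example and do not describe CarbonData's dictionary file format.

```python
def read_dictionary(lines):
    """Map each distinct non-empty line to a code, in order, starting at 1."""
    dictionary = {}
    for line in lines:
        value = line.strip()
        if value and value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary

def encode_with_dictionary(values, dictionary):
    """Encode by lookup only; a value missing from the dictionary raises KeyError."""
    return [dictionary[value] for value in values]

# The dictionary is prepared once, ahead of the load.
country_dictionary = read_dictionary(["cn\n", "us\n", "de\n"])
encoded = encode_with_dictionary(["us", "us", "cn"], country_dictionary)
```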

Additional Resources

Download Apache CarbonData 1.0.0: https://www.apache.org/dyn/closer.lua/incubator/carbondata/1.0.0-incubating

Community links:

GitHub source: https://github.com/apache/incubator-carbondata

Mailing list: [email protected]

Apache JIRA: https://issues.apache.org/jira/browse/CARBONDATA/

Project homepage: http://carbondata.apache.org
