How StarRocks Keeps Metadata Consistent Across FE Nodes

This article explains the roles of StarRocks FE and BE nodes, details the metadata stored in the FE, describes the leader‑follower‑observer architecture, and shows how BDB JE (Berkeley DB Java Edition) replication, journal logs, and checkpoint mechanisms keep metadata synchronized and durable even after node failures.

1. What metadata does the FE store?

StarRocks FE nodes form the front end of the cluster: they manage metadata, client connections, query planning, and scheduling. All metadata is held in memory under a central manager, GlobalStateMgr, and protected by read‑write locks so that concurrent reads scale while updates stay atomic.
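A minimal sketch of that locking pattern, using illustrative names rather than the FE's actual classes:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: many readers look up metadata concurrently, while a
// writer takes the exclusive lock to apply an atomic update.
class MetadataHolder {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Long> tableIdByName = new HashMap<>();

  long getTableId(String name) {
    lock.readLock().lock(); // shared: concurrent reads proceed in parallel
    try {
      return tableIdByName.getOrDefault(name, -1L);
    } finally {
      lock.readLock().unlock();
    }
  }

  void putTable(String name, long id) {
    lock.writeLock().lock(); // exclusive: readers wait until the update lands
    try {
      tableIdByName.put(name, id);
    } finally {
      lock.writeLock().unlock();
    }
  }
}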

1.1 Catalog, database and table definitions

Catalogs group databases and tables; StarRocks supports internal tables, external Hive/Iceberg tables, and data‑lake catalogs. The hierarchy is Catalog → Database → Table. Each catalog carries its own access permissions, and external catalogs can be queried in place, without migrating data into StarRocks.

Catalog types: Internal, External, Data Lake.

Database: basic info (ID, name, status).

OlapTable: table state, index metadata, key types, partition info, distribution info.

Partition: visible version, next version, partition state.

PartitionInfo: data properties, replication number, in‑memory flag.

MaterializedIndex: index metadata.
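Resolving a table walks this hierarchy top‑down. A simplified sketch, where the interfaces are stand‑ins rather than the FE's real signatures:

// Illustrative Catalog -> Database -> Table walk; types are stand-ins.
interface Table {}
interface Database { Table getTable(String name); }
interface Catalog { Database getDb(String name); }

static Table resolve(Catalog catalog, String dbName, String tableName) {
  Database db = catalog.getDb(dbName); // catalog-level lookup
  if (db == null) {
    throw new IllegalArgumentException("unknown database: " + dbName);
  }
  Table table = db.getTable(tableName); // database-level lookup
  if (table == null) {
    throw new IllegalArgumentException("unknown table: " + tableName);
  }
  return table; // an OlapTable, an external Hive table, etc.
}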

1.2 Data distribution

Each table/partition contains multiple tablets; each tablet has replicas on different BE nodes.

@SerializedName(value = JSON_KEY_ID)
protected long id;
// Replicas list
@SerializedName(value = "replicas")
private List<Replica> replicas;

Replica: status, version, backend location.

TabletMeta: hierarchical info (dbId, tableId, partitionId, indexId, storageMedium).

TabletInvertedIndex: maps tablet IDs to their metadata.
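A rough sketch of what the inverted index provides, reduced to a plain map; the real TabletInvertedIndex also tracks replica state per backend and is lock‑protected:

import java.util.HashMap;
import java.util.Map;

// Illustrative: tablet ID -> hierarchical metadata, so a tablet reported
// in a BE heartbeat can be located without scanning every table.
class SimpleTabletIndex {
  record Meta(long dbId, long tableId, long partitionId, long indexId) {}

  private final Map<Long, Meta> metaByTabletId = new HashMap<>();

  void addTablet(long tabletId, Meta meta) {
    metaByTabletId.put(tabletId, meta);
  }

  Meta getTabletMeta(long tabletId) {
    return metaByTabletId.get(tabletId); // null if the tablet is unknown
  }
}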

1.3 Node information

FE, BE, and Compute Node lists, status, heartbeat timestamps, and disk capacity are stored in classes such as SystemInfoService, Frontend, and Backend.

@SerializedName(value = "r")
private FrontendNodeType role;
@SerializedName(value = "n")
private String nodeName;
@SerializedName(value = "h")
private String host;
@SerializedName(value = "e")
private int editLogPort;
@SerializedName(value = "q")
private int queryPort;
@SerializedName(value = "rpc")
private int rpcPort;
private long replayedJournalId;
private long lastUpdateTime;
private boolean isAlive;

1.4 Load and tasks

Running import/export jobs, routine loads, stream loads, schema changes, and clone tasks are tracked with IDs, status, progress, configuration, error info, retry counts, and authentication details.
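Each such job is in essence a small state machine. A representative sketch; the states and fields are illustrative, not the FE's exact job model:

// Representative job record: identity, lifecycle state, progress,
// last error, and a retry budget.
class TrackedJob {
  enum State { PENDING, RUNNING, FINISHED, CANCELLED }

  final long jobId;
  volatile State state = State.PENDING;
  volatile int progressPercent = 0;
  volatile int retryCount = 0;
  volatile String errorMsg = null;

  TrackedJob(long jobId) { this.jobId = jobId; }

  void fail(String msg, int maxRetries) {
    errorMsg = msg;
    if (++retryCount > maxRetries) {
      state = State.CANCELLED; // retry budget exhausted, give up
    } else {
      state = State.PENDING;   // requeue for another attempt
    }
  }
}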

1.5 Permissions and security

User accounts.

Roles and role‑based permissions.

Privileges at database, table, and column levels.

Authentication configuration.

1.6 Resources and configuration

Resource groups, tags, variables, and system parameters are managed here; for example, a warehouse named big_cluster can be bound to BE nodes 1, 2, and 3.

1.7 Transactions and versions

Each import transaction has an ID, status, and visible version. Example: transaction 20001 committed, version 102 visible to all queries.
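That visible version is what makes an import atomic to readers: a query planned before the commit still reads version 101, while one planned afterwards reads 102. A minimal sketch of the gate, not the FE's actual transaction manager:

// Illustrative visibility gate: queries only read versions that a
// committed transaction has already published.
class PartitionVersions {
  private volatile long visibleVersion = 101;

  // Called when a transaction (e.g., 20001) commits.
  void commit(long newVersion) {
    visibleVersion = newVersion; // e.g., 101 -> 102
  }

  long versionForNewQuery() {
    return visibleVersion; // snapshot taken at plan time
  }
}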

1.8 Statistics

TableStatistic: row count, NDV, histograms.

ColumnStatistic: column‑level stats.

TabletStat and CompactionTask info.

private final StatisticsMetaManager statisticsMetaManager;
private final TabletStatMgr tabletStatMgr;
private final CompactionMgr compactionMgr;

2. How do FE nodes synchronize metadata?

StarRocks uses a Leader‑Follower‑Observer architecture with BDB JE replication. The leader FE writes changes to a journal; BDB JE replicates the journal to followers and observers.

graph TB
subgraph "FE Cluster"
  L[Leader FE]
  F1[Follower FE 1]
  F2[Follower FE 2]
  O1[Observer FE 1]
  O2[Observer FE 2]
end
subgraph "BDB JE Replication Group"
  BDB_L[BDB Leader]
  BDB_F1[BDB Replica 1]
  BDB_F2[BDB Replica 2]
  BDB_O1[BDB Secondary 1]
  BDB_O2[BDB Secondary 2]
end
subgraph "Metadata Sync Flow"
  direction TB
  W[Write Request] --> L
  L --> J[Journal]
  J --> BDB_L
  BDB_L -->|replication| BDB_F1
  BDB_L -->|replication| BDB_F2
  BDB_L -->|replication| BDB_O1
  BDB_L -->|replication| BDB_O2
  BDB_F1 --> R1[Replay Log]
  BDB_F2 --> R2[Replay Log]
  BDB_O1 --> R3[Replay Log]
  BDB_O2 --> R4[Replay Log]
  R1 --> F1
  R2 --> F2
  R3 --> O1
  R4 --> O2
end

2.1 Overall architecture

Leader FE: sole writer, handles all metadata changes.

Follower FE: read‑only, participates in election.

Observer FE: read‑only, does not vote, used for query scaling.

2.2 Core sync mechanism

StarRocks configures BDB JE via ReplicationConfig (node name, host/port, group name, timeouts, durability, consistency policy). The journal system records every metadata change and persists it to BDB JE.

replicationConfig = new ReplicationConfig();
replicationConfig.setNodeName(selfNodeName);
replicationConfig.setNodeHostPort(selfNodeHostPort);
replicationConfig.setGroupName(STARROCKS_JOURNAL_GROUP);
replicationConfig.setConfigParam(ReplicationConfig.ENV_UNKNOWN_STATE_TIMEOUT, "10");
replicationConfig.setMaxClockDelta(Config.max_bdbje_clock_delta_ms, TimeUnit.MILLISECONDS);
// ... other parameters ...
if (isElectable) {
  replicationConfig.setReplicaAckTimeout(Config.bdbje_replica_ack_timeout_second, TimeUnit.SECONDS);
  replicationConfig.setConsistencyPolicy(new NoConsistencyRequiredPolicy());
} else {
  replicationConfig.setNodeType(NodeType.SECONDARY);
  replicationConfig.setConsistencyPolicy(new NoConsistencyRequiredPolicy());
}
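In outline, appending a journal entry means writing a key/value pair into the replicated BDB JE database: the key is a monotonically increasing journal ID, the value the serialized operation. A hedged sketch; the FE's real journal writer adds batching, retries, and rollback handling:

import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;

// Simplified journal append: key = journal ID, value = serialized operation.
static void appendJournal(Database journalDb, long journalId, byte[] serializedOp) {
  DatabaseEntry key = new DatabaseEntry();
  TupleBinding.getPrimitiveBinding(Long.class).objectToEntry(journalId, key);
  DatabaseEntry value = new DatabaseEntry(serializedOp);
  // On a replicated environment, put() returns only after the configured
  // durability (e.g., a majority of replica acks) is satisfied.
  journalDb.put(null, key, value);
}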

2.3 Election mechanism

The HAProtocol interface defines methods such as fencing(), getLeader(), and node‑list queries. BDBHA implements this interface, managing node roles, election logic, and epoch handling to avoid split‑brain.
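Abridged, the contract looks roughly like this; the method names follow those mentioned above, while the exact signatures and return types are assumptions:

import java.net.InetSocketAddress;
import java.util.List;

// Sketch of the HA contract: fencing stops a deposed leader from writing,
// and the queries expose replication-group membership.
interface HAProtocolSketch {
  boolean fencing();             // raise the epoch; stale-leader writes are rejected
  InetSocketAddress getLeader(); // address of the current leader
  List<InetSocketAddress> getElectableNodes(boolean leaderIncluded);
  List<InetSocketAddress> getObserverNodes();
}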

2.4 Data consistency guarantees

Write consistency uses a quorum: a majority of the electable nodes (leader plus followers) must acknowledge each write before it commits. Read consistency relies on version control: followers replay journals up to the latest version.

public static void loadJournal(GlobalStateMgr globalStateMgr, JournalEntity journal) {
  short opCode = journal.getOpCode();
  switch (opCode) {
    case OperationType.OP_CREATE_DB:
      globalStateMgr.replayCreateDb((Database) journal.getData());
      break;
    case OperationType.OP_CREATE_TABLE:
      globalStateMgr.replayCreateTable((CreateTableInfo) journal.getData());
      break;
    case OperationType.OP_ADD_BACKEND:
      GlobalStateMgr.getCurrentSystemInfo().replayAddBackend((Backend) journal.getData());
      break;
    // ... other operation types ...
  }
}
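The write quorum above is the usual majority rule, counted over electable nodes only; observers neither vote nor acknowledge writes:

// Majority quorum over electable (leader + follower) nodes; observers
// do not count toward it.
static int writeQuorum(int electableNodeCount) {
  return electableNodeCount / 2 + 1; // e.g., 3 electable nodes -> 2 acks
}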

2.5 Failure recovery

When a node becomes leader, it opens the journal, performs fencing, replays all logs up to the max journal ID, and re‑initializes the journal writer.

try {
  journal.open();
  if (!haProtocol.fencing()) {
    throw new Exception("fencing failed. will exit");
  }
  long maxJournalId = journal.getMaxJournalId();
  replayJournal(maxJournalId);
  journalWriter.init(maxJournalId);
} catch (Exception e) {
  LOG.error("failed to init journal after transfer to leader! will exit", e);
  System.exit(-1);
}

3. How does StarRocks prevent metadata loss after FE crashes?

All metadata changes are written to BDB JE journal files (suffix .jdb) and periodically checkpointed into image files (image.<txid>). On FE startup, the latest image is loaded first, then the incremental journals after it are replayed to restore the full state.
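A condensed sketch of that startup sequence, with stand‑in types for the FE's image and journal readers:

import java.util.function.Consumer;

// Illustrative recovery: load the newest image, then replay every journal
// entry written after that checkpoint. Types are stand-ins.
interface ImageStore { long loadLatestImage(); } // returns the image's journal ID
interface JournalReader { long getMaxJournalId(); Object read(long id); }

static void recoverOnStartup(ImageStore images, JournalReader journal,
                             Consumer<Object> replay) {
  long imageJournalId = images.loadLatestImage(); // full snapshot, e.g., image.<txid>
  long maxJournalId = journal.getMaxJournalId();
  for (long id = imageJournalId + 1; id <= maxJournalId; id++) {
    replay.accept(journal.read(id)); // re-apply each logged change
  }
}

Checkpointing exists precisely to bound that replay loop: the more recent the image, the fewer journal entries have to be re‑applied after a crash.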
