
How Banks Can Tame Petabytes of Unstructured Data: Architecture and Best Practices

This article presents a comprehensive design and deployment plan for a bank's unstructured data service platform, covering data growth challenges, lifecycle management, three‑tier storage architecture, Elasticsearch indexing, fault‑tolerant disaster recovery, monitoring, and future development directions.

dbaplus Community

Nov 19, 2020


Introduction

Banking systems generate massive amounts of unstructured data: images, audio, video, documents, XML/HTML, chat logs, and emails. Image files alone are created at a rate approaching 100,000 per day, with annual growth of about 10 TB, and the overall volume of unstructured data is increasing geometrically, beyond what traditional relational databases can handle.

Design Objectives

Provide a highly reliable, efficient, and scalable platform capable of storing petabyte‑level unstructured data, supporting fast metadata search, tiered storage, and automated lifecycle management.

Three‑Layer Architecture

Presentation Layer

Exposes HTTP/HTTPS RESTful APIs to various business channels. The layer handles authentication, encryption, and request routing without embedding business logic.

Business‑Logic Layer

Consists of the Unstructured Data Central Processing (UCP) service and a third‑party upload service. UCP is responsible for persisting files and metadata, while the upload service validates, caches, and load‑balances incoming data. Standard CRUD operations (upload, download, replace, delete) are available for common formats (jpg, png, pdf, docx, xls, etc.). Non‑standard interfaces are customized for specific loan‑system workflows.

Data Layer

Built on an Elasticsearch (ES) cluster for distributed indexing and search, complemented by a lightweight relational database for configuration and logs, and a software‑defined storage fabric for the actual file payloads.

Core Components

Interface Service: REST APIs with Content‑Type handling for single or batch file operations. Supports upload, download, replace, and delete. Non‑standard endpoints add fields such as control IDs for loan‑system integration.

Index Service: ES builds in‑memory inverted indexes by running text through analyzers (character filters, tokenizers, and token filters), then flushes immutable segment files to disk. The cluster has 5 nodes and 10 primary shards, each with replica copies, so the index stays available when a node fails.
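The shard layout described above could be expressed as a request body for Elasticsearch's create-index API (`PUT /<index>`). The sketch below only builds and inspects that body. The replica count of 2 is taken from the fault-tolerance discussion later in the article, and the metadata field names are hypothetical, since the article does not list the actual mapping. The typeless `mappings` form assumes Elasticsearch 7 or later.

```python
import json


def build_index_body(primary_shards: int = 10, replicas: int = 2) -> dict:
    """Build a body for Elasticsearch's create-index API (PUT /<index>)."""
    return {
        "settings": {
            # number_of_shards is fixed at creation time;
            # number_of_replicas can be changed later.
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {
                # Hypothetical metadata fields, for illustration only.
                "file_name": {"type": "text"},
                "business_id": {"type": "keyword"},
                "content_type": {"type": "keyword"},
                "storage_path": {"type": "keyword"},
                "created_at": {"type": "date"},
            }
        },
    }


def total_shard_copies(body: dict) -> int:
    """Primaries plus all replica copies the cluster must host."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


body = build_index_body()
print(json.dumps(body["settings"], indent=2))
print("shard copies to place on the cluster:", total_shard_copies(body))
```

With 10 primaries and 2 replicas each, the 5 nodes host 30 shard copies in total, or 6 per node when evenly balanced.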

Storage Service: a tiered storage architecture with three tiers:

Online tier – SSD‑based NAS (≈10 TB, deduplication & compression) for low‑latency access.

Near‑line tier – GlusterFS on commodity servers (≈100 TB, three‑copy redundancy) for less‑frequently accessed data.

Offline tier – Tape library or optical media for long‑term cold data, with lower cost per GB.

Lifecycle Management & Tier Migration

Data automatically migrates from online → near‑line → offline based on age and access frequency. Migration includes MD5/hash verification, secure deletion of source copies, and path updates so business applications retain transparent access. Policies can be triggered manually or via a scheduler.
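A minimal sketch of one migration step, assuming both tiers are mounted as ordinary file system paths. The age and access thresholds in `target_tier`, the `path_index` dictionary (standing in for the path stored with the metadata), and all function names are hypothetical. The source file is removed with a plain delete; secure erasure is not shown.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def target_tier(age_days: int, reads_last_30_days: int) -> str:
    """Pick a tier from age and access frequency (thresholds are made up)."""
    if age_days <= 90 or reads_last_30_days >= 10:
        return "online"
    if age_days <= 3 * 365 or reads_last_30_days >= 1:
        return "nearline"
    return "offline"


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(src: Path, dst_dir: Path, path_index: dict) -> Path:
    """Copy src to dst_dir, verify the copy, update the path, then delete src."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    expected = md5_of(src)
    shutil.copy2(src, dst)
    if md5_of(dst) != expected:
        dst.unlink()  # discard the bad copy and keep the source untouched
        raise IOError(f"checksum mismatch while migrating {src}")
    path_index[src.name] = str(dst)  # readers now resolve to the new tier
    src.unlink()  # plain delete; a real system would erase securely
    return dst


# Demo on temporary directories standing in for the two tiers.
root = Path(tempfile.mkdtemp())
online_file = root / "online" / "scan_0001.jpg"
online_file.parent.mkdir()
online_file.write_bytes(b"fake image bytes")
index = {online_file.name: str(online_file)}
if target_tier(age_days=400, reads_last_30_days=0) == "nearline":
    migrate_file(online_file, root / "nearline", index)
print(index)
```

Updating the path only after the checksum matches, and deleting the source only after the path is updated, means a failure at any step leaves at least one readable copy that the index still points to.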

Fault Tolerance & Disaster Recovery

Both application and storage layers use load‑balancing and active‑active configurations. ES’s multi‑replica cluster provides inherent fault tolerance: in a 5‑node cluster with 2 replicas of every primary shard, each shard has three copies on three different nodes, so at least one copy survives any two node failures. Cross‑data‑center replication is achieved with:

SnapMirror for SSD‑NAS synchronous replication (RPO = 0).

Kafka‑driven metadata sync between primary and DR sites.

Dual‑center GlusterFS deployment for near‑line data, enabling rapid failover.
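The replica arithmetic at the start of this section can be checked with a small simulation. It is a sketch under simplifying assumptions: shard copies are placed round-robin (Elasticsearch's real allocator differs, but it likewise never puts two copies of the same shard on one node), and only data availability is modelled, not master election.

```python
from itertools import combinations


def allocate(nodes: int, primaries: int, replicas: int) -> dict:
    """Place each shard's copies on distinct nodes, round-robin."""
    copies = replicas + 1
    if copies > nodes:
        raise ValueError("more copies per shard than nodes")
    return {
        shard: {(shard + k) % nodes for k in range(copies)}
        for shard in range(primaries)
    }


def survives(placement: dict, failed: set) -> bool:
    """True if every shard still has a copy on some surviving node."""
    return all(holders - failed for holders in placement.values())


def max_tolerated_failures(nodes: int, primaries: int, replicas: int) -> int:
    """Largest n such that ANY n simultaneous node failures lose no shard."""
    placement = allocate(nodes, primaries, replicas)
    tolerated = 0
    for n_failed in range(1, nodes + 1):
        if all(
            survives(placement, set(group))
            for group in combinations(range(nodes), n_failed)
        ):
            tolerated = n_failed
        else:
            break
    return tolerated


print(max_tolerated_failures(nodes=5, primaries=10, replicas=2))
print(max_tolerated_failures(nodes=5, primaries=10, replicas=1))
```

The first call reports 2 and the second 1: the number of tolerated node failures equals the replica count, which is why the two-failure figure depends on configuring 2 replicas rather than the default of 1.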

Monitoring & Performance

Service processes and business ports are continuously monitored. Typical latency under peak load:

Write latency < 150 ms.

Read latency < 250 ms (including batch download scenarios).

Health checks include process status, port availability, and ES cluster health metrics.
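Two of these checks can be sketched with the Python standard library alone. `GET /_cluster/health` is Elasticsearch's cluster health API and returns a `status` of green, yellow, or red. The mapping of those statuses to alert levels, the function names, and the host and port in the usage comment are assumptions. Process status checks are not shown.

```python
import json
import socket
import urllib.request


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_cluster_health(base_url: str, timeout: float = 5.0) -> dict:
    """Call Elasticsearch's cluster health API and return the parsed JSON."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def classify_cluster_health(payload: dict) -> str:
    """Map the cluster status to an alert level (mapping is an assumption)."""
    status = payload.get("status")
    if status == "green":  # all primary and replica shards assigned
        return "ok"
    if status == "yellow":  # all primaries assigned, some replicas missing
        return "warning"
    return "critical"  # red, or no usable status in the response


# Usage against a hypothetical node address:
#   if port_is_open("es-node-1", 9200):
#       level = classify_cluster_health(fetch_cluster_health("http://es-node-1:9200"))
print(classify_cluster_health({"status": "yellow"}))
```

A yellow status is worth alerting on in this design, because it means some replica copies are unassigned and the cluster can no longer absorb the full number of node failures it was sized for.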

Future Enhancements

Planned improvements include:

Integration of optical media to further reduce archival cost.

Automated validation of backup data for rapid disaster recovery.

Extending the platform to handle voice recordings, video contracts, and blockchain‑based electronic evidence.

Leveraging ES for log analytics to support intelligent data‑center operations.

