How Banks Can Tame Petabytes of Unstructured Data: Architecture and Best Practices
This article presents a comprehensive design and deployment plan for a bank's unstructured data service platform, covering data growth challenges, lifecycle management, three‑tier storage architecture, Elasticsearch indexing, fault‑tolerant disaster recovery, monitoring, and future development directions.
Introduction
Banking systems generate massive amounts of unstructured data: images, audio, video, documents, XML/HTML, chat logs, and emails. Close to 100,000 image files are created each day, with annual growth of about 10 TB, and the overall volume of unstructured data is growing geometrically, beyond what traditional relational databases are designed to handle.
Design Objectives
Provide a highly reliable, efficient, and scalable platform capable of storing petabyte‑level unstructured data, supporting fast metadata search, tiered storage, and automated lifecycle management.
Three‑Layer Architecture
Presentation Layer
Exposes HTTP/HTTPS RESTful APIs to various business channels. The layer handles authentication, encryption, and request routing without embedding business logic.
Business‑Logic Layer
Consists of the Unstructured Data Central Processing (UCP) service and a third‑party upload service. UCP is responsible for persisting files and metadata, while the upload service validates, caches, and load‑balances incoming data. Standard CRUD operations (upload, download, replace, delete) are available for common formats (jpg, png, pdf, docx, xls, etc.). Non‑standard interfaces are customized for specific loan‑system workflows.
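The validation step performed by the upload service could look roughly like the sketch below. The article does not describe the actual checks, so the extension whitelist (taken from the formats named above), the size limit, and the function name validate_upload are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed whitelist, based on the formats the article names ("jpg, png, pdf, docx, xls, etc.").
ALLOWED_EXTENSIONS = {"jpg", "png", "pdf", "docx", "xls"}
# Assumed size cap; the article does not state one.
MAX_BYTES = 50 * 1024 * 1024


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of validation errors; an empty list means the file is accepted."""
    errors = []
    name = PurePosixPath(filename).name
    suffix = PurePosixPath(name).suffix.lower().lstrip(".")
    if name != filename:
        # Reject names that carry directory components such as "../".
        errors.append("filename must not contain a path")
    if suffix not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes <= 0:
        errors.append("empty file")
    elif size_bytes > MAX_BYTES:
        errors.append("file too large")
    return errors
```

Returning a list of errors rather than raising on the first one lets a batch upload report every problem with a file in a single response.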
Data Layer
Built on an Elasticsearch (ES) cluster for distributed indexing and search, complemented by a lightweight relational database for configuration and logs, and a software‑defined storage fabric for the actual file payloads.
Core Components
Interface Service : REST APIs with Content‑Type handling for single or batch file operations. Supports upload, download, replace, and delete. Non‑standard endpoints add fields such as control IDs for loan‑system integration.
Index Service : ES builds inverted indexes in memory by running text through an analyzer (character filters, a tokenizer, and token filters), then flushes them to disk as immutable segment files. The cluster comprises 5 nodes and 10 primary shards, each with replica copies, so the index stays available when a node fails.
Storage Service : Tiered storage architecture
Online tier – SSD‑based NAS (≈10 TB, deduplication & compression) for low‑latency access.
Near‑line tier – GlusterFS on commodity servers (≈100 TB, three‑copy redundancy) for less‑frequently accessed data.
Offline tier – Tape library or optical media for long‑term cold data, with lower cost per GB.
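To make the Index Service description above more concrete, here is a minimal pure-Python sketch of the analysis steps (character filter, tokenizer, token filters) feeding an inverted index. It is a teaching model only and is not how Elasticsearch is implemented internally; the stopword list, the sample documents, and all function names are assumptions.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "for"}  # illustrative stopword list


def char_filter(text: str) -> str:
    # Character filter: rewrite characters before tokenizing.
    return text.replace("&", " and ")


def tokenize(text: str) -> list[str]:
    # Tokenizer: split the text into alphanumeric tokens.
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filters(tokens: list[str]) -> list[str]:
    # Token filters: lowercase every token, then drop stopwords.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOPWORDS]


def analyze(text: str) -> list[str]:
    return token_filters(tokenize(char_filter(text)))


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical metadata descriptions for two stored files.
docs = {
    "doc-1": "Loan contract for the applicant",
    "doc-2": "Scanned ID & loan application",
}
index = build_inverted_index(docs)
```

A lookup such as index["loan"] returns both document IDs without scanning the documents themselves, which is the property that makes metadata search fast at this scale.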
Lifecycle Management & Tier Migration
Data automatically migrates from online → near‑line → offline based on age and access frequency. Migration includes MD5/hash verification, secure deletion of source copies, and path updates so business applications retain transparent access. Policies can be triggered manually or via a scheduler.
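The migration steps described above (copy, verify by hash, update the path, delete the source) can be sketched as follows. This is a simplified illustration under stated assumptions: the catalog dictionary stands in for whatever metadata store holds file paths, the directory names are invented, and the source copy is removed with a plain delete rather than a secure erase.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large payloads are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(file_id: str, catalog: dict[str, str], target_dir: str) -> str:
    """Copy a file to the next tier, verify it, repoint the catalog, then drop the source."""
    source = catalog[file_id]
    target = os.path.join(target_dir, os.path.basename(source))
    shutil.copy2(source, target)
    if md5_of(source) != md5_of(target):
        os.remove(target)  # discard the bad copy and keep the source untouched
        raise IOError(f"checksum mismatch while migrating {file_id}")
    catalog[file_id] = target  # path update: callers resolve the file by ID
    os.remove(source)          # plain delete; secure erasure is out of scope here
    return target


# Demonstration in a temporary directory (hypothetical tier folders).
with tempfile.TemporaryDirectory() as root:
    online = os.path.join(root, "online")
    nearline = os.path.join(root, "nearline")
    os.makedirs(online)
    os.makedirs(nearline)
    src = os.path.join(online, "scan-001.jpg")
    with open(src, "wb") as fh:
        fh.write(b"example payload")
    catalog = {"scan-001": src}
    new_path = migrate("scan-001", catalog, nearline)
    migrated_ok = os.path.exists(new_path) and not os.path.exists(src)
```

The source is deleted only after the hash comparison succeeds and the catalog has been repointed, so a failure at any earlier step leaves the original file reachable under its old path.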
Fault Tolerance & Disaster Recovery
Both application and storage layers use load balancing and active-active configurations. The ES cluster's replica shards provide inherent fault tolerance: with 2 replicas per primary shard (three copies in total, each on a different node), a 5-node cluster can lose up to two nodes without losing any shard. Cross-data-center replication is achieved with:
SnapMirror for SSD‑NAS synchronous replication (RPO = 0).
Kafka‑driven metadata sync between primary and DR sites.
Dual‑center GlusterFS deployment for near‑line data, enabling rapid failover.
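The claim that a 5-node cluster with 2 replicas tolerates two node failures can be checked with a small model. The sketch below assumes a simple round-robin shard placement, which is not Elasticsearch's actual allocator; it only preserves the one property that matters here, namely that copies of the same shard sit on different nodes. The settings keys number_of_shards and number_of_replicas are real index settings, while the helper functions are hypothetical.

```python
from itertools import combinations

NODES = 5
PRIMARY_SHARDS = 10
REPLICAS = 2  # replica copies per primary, in addition to the primary itself

# Index settings as they might appear in a create-index request body.
index_settings = {
    "settings": {
        "number_of_shards": PRIMARY_SHARDS,
        "number_of_replicas": REPLICAS,
    }
}


def allocate(nodes: int, shards: int, replicas: int) -> dict[int, set[int]]:
    """Simplified round-robin placement: each shard's copies land on distinct nodes."""
    copies = replicas + 1
    return {s: {(s + c) % nodes for c in range(copies)} for s in range(shards)}


def survives(allocation: dict[int, set[int]], failed: tuple[int, ...]) -> bool:
    """True if every shard still has at least one copy on a surviving node."""
    return all(holders - set(failed) for holders in allocation.values())


def max_tolerated_failures(nodes: int, shards: int, replicas: int) -> int:
    """Largest k such that any combination of k failed nodes loses no shard."""
    allocation = allocate(nodes, shards, replicas)
    tolerated = 0
    for k in range(1, nodes + 1):
        if all(survives(allocation, failed) for failed in combinations(range(nodes), k)):
            tolerated = k
        else:
            break
    return tolerated


tolerated = max_tolerated_failures(NODES, PRIMARY_SHARDS, REPLICAS)
```

The model covers data availability only. Cluster coordination is a separate concern: assuming all five nodes are master-eligible, a majority of three still remains after two failures.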
Monitoring & Performance
Service processes and business ports are continuously monitored. Typical latency under peak load:
Write latency < 150 ms.
Read latency < 250 ms (including batch download scenarios).
Health checks include process status, port availability, and ES cluster health metrics.
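A port check and an interpretation of ES cluster health might be sketched as below. The status and number_of_nodes fields are part of the response returned by the ES cluster health API; the alert levels, the expected node count of 5 (taken from the cluster size given earlier), and the function names are assumptions. Fetching the health document over HTTP is left out so the example stays self-contained.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_cluster_health(health: dict, expected_nodes: int = 5) -> str:
    """Map a cluster health response to an assumed alert level."""
    status = health.get("status")
    if status not in {"green", "yellow"}:
        # "red" (or a missing status) means at least one primary shard is unavailable.
        return "critical"
    if status == "yellow" or health.get("number_of_nodes", 0) < expected_nodes:
        # Some replicas are unassigned, or a node has dropped out of the cluster.
        return "warning"
    return "ok"


# Hand-written sample response, not captured from a real cluster.
sample = {"status": "yellow", "number_of_nodes": 4}
level = classify_cluster_health(sample)
```

Treating a reduced node count as a warning even when the status is green gives operators a chance to react before a second failure reduces redundancy further.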
Future Enhancements
Planned improvements include:
Integration of optical media to further reduce archival cost.
Automated validation of backup data for rapid disaster recovery.
Extending the platform to handle voice recordings, video contracts, and blockchain‑based electronic evidence.
Leveraging ES for log analytics to support intelligent data‑center operations.
dbaplus Community
