Practices and Reflections on Enterprise Cloud Platforms
The article shares the author’s experience designing and operating enterprise‑grade cloud platforms, covering resource and application management, the Platform EGO architecture, comparisons with Mesos, Yarn and Kubernetes, and practical insights on scaling, scheduling, security, and architectural evolution.
In this talk the speaker, an experienced cloud platform architect, shares practical experiences and reflections on building enterprise‑grade cloud application platforms, focusing on resource management, application management, and overall system design.
Resource Management and Application Management – Cloud‑based platforms are divided into two areas: (1) resource management technologies such as private clouds (OpenStack, CloudStack) and cluster management (Docker, Kubernetes, Mesos, etc.); (2) application construction and management, including deployment, monitoring, and elastic scaling.
The speaker reviews classic cluster technologies (Google’s GFS, MapReduce, BigTable, and Borg, followed by Mesos, Yarn, and Kubernetes) and notes how they have evolved alongside container technologies.
Platform EGO – The Platform EGO system represents a 2.0 evolution of IBM Platform DCOS, originally designed in 2004‑2005 as a two‑layer scheduler (LSF + Symphony). It provides a resource manager, process execution manager, and various plugins for scheduling, deployment, events, and security.
EGO Architecture – The architecture follows a SOA model with core services (LIM, PEM, Master) and upper‑level services (initd, name service, cron, web API, portal, analytics). It supports C, Java, Python, and RESTful APIs. The Master is plugin‑based, loading resource, security, scheduling, provisioning, event, and execution plugins.
EGO Components
LIM (Load Information Manager) collects system metrics in a master‑slave fashion and abstracts resources as name‑type‑value triples, similar to the DMTF CIM model.
PEM (Process Execution Manager) executes tasks on allocated slots (CPU, memory, etc.) and abstracts work units as Containers/Activities.
Scheduling plugins allow custom policies; the author designed a DSL for administrators to write Python‑based scheduling strategies.
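The resource triples and pluggable policies described above can be sketched roughly as follows. This is only an illustration, not EGO's actual API or DSL: `Resource`, `host_view`, `POLICIES`, `policy`, and `least_loaded` are hypothetical names, and the metric names `cpu_load` and `free_slots` are assumed.

```python
from typing import Callable, Dict, List, NamedTuple


class Resource(NamedTuple):
    """One name-type-value triple, in the spirit of what LIM reports."""
    name: str
    type: str
    value: object


# Hypothetical registry: policy name -> function that ranks candidate hosts.
Policy = Callable[[Dict[str, Dict[str, object]]], List[str]]
POLICIES: Dict[str, Policy] = {}


def policy(name: str):
    """Decorator that registers a scheduling policy under a name."""
    def register(fn: Policy) -> Policy:
        POLICIES[name] = fn
        return fn
    return register


def host_view(triples: List[Resource]) -> Dict[str, object]:
    """Collapse a host's triples into a name -> value mapping."""
    return {r.name: r.value for r in triples}


@policy("least_loaded")
def least_loaded(hosts):
    """Rank hosts that still have free slots by ascending CPU load."""
    eligible = [h for h, m in hosts.items() if m["free_slots"] > 0]
    return sorted(eligible, key=lambda h: hosts[h]["cpu_load"])


hosts = {
    "node-a": host_view([Resource("cpu_load", "float", 0.72),
                         Resource("free_slots", "int", 2)]),
    "node-b": host_view([Resource("cpu_load", "float", 0.15),
                         Resource("free_slots", "int", 4)]),
    "node-c": host_view([Resource("cpu_load", "float", 0.05),
                         Resource("free_slots", "int", 0)]),
}
ranking = POLICIES["least_loaded"](hosts)
print(ranking)  # ['node-b', 'node-a']
```

Here node-c is excluded despite its low load because it has no free slots. Keeping policies as plain registered functions is one way to let administrators add strategies without touching the scheduler core.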
EGO Scheduling Strategy – The strategy considers application topology, resource load, network bandwidth, and runtime characteristics, and it handles resource fragmentation, migration, and multi‑tenant isolation.
Comparison with Mesos, Yarn, Kubernetes – Enterprise platforms need strong management capabilities (≈70% management, 30% core functionality). EGO introduces concepts such as users, consumers, and sessions, adopts the CIM model for resource abstraction, and supports priority, pre‑allocation, real‑time allocation, and pre‑emptive scheduling.
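To make priority‑based, pre‑emptive allocation concrete, the sketch below picks which running allocations to reclaim when a higher‑priority consumer asks for slots. It is an assumption‑laden illustration rather than EGO's algorithm: `Allocation` and `pick_victims` are hypothetical, and "larger number means higher priority" is an assumed convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Allocation:
    consumer: str
    priority: int   # larger number = more important (assumed convention)
    slots: int


def pick_victims(running: List[Allocation], free_slots: int,
                 request: Allocation) -> Optional[List[Allocation]]:
    """Return the allocations to preempt so `request` fits, or None.

    An empty list means the request fits without preempting anyone.
    Only strictly lower-priority allocations are eligible, lowest first.
    """
    needed = request.slots - free_slots
    victims: List[Allocation] = []
    candidates = sorted((a for a in running if a.priority < request.priority),
                        key=lambda a: a.priority)
    for alloc in candidates:
        if needed <= 0:
            break
        victims.append(alloc)
        needed -= alloc.slots
    return victims if needed <= 0 else None


running = [Allocation("batch", 1, 4), Allocation("reporting", 3, 2),
           Allocation("trading", 9, 6)]
urgent = Allocation("risk", 7, 5)
victims = pick_victims(running, free_slots=1, request=urgent)
print([v.consumer for v in victims])  # ['batch']
```

Reclaiming from the lowest priority first keeps disruption away from important consumers. A real scheduler would also weigh migration cost and tenant quotas, which this sketch ignores.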
Application Management – The SkyForm product automates resource provisioning, software deployment, and configuration for OpenStack, CloudStack, Tomcat, MySQL, Hadoop, and HPC. It introduced a custom package format and a graphical application designer.
Feedback highlighted challenges in user adoption and the need for topology visualization, performance monitoring, and metric‑based auto‑scaling. The system integrates with monitoring tools (Zabbix, New Relic) and exposes APIs for custom collectors.
Architecture Evolution – Inspired by AWS, the architecture evolved to separate deployment, CloudWatch‑like monitoring, and Auto‑Scaling services, supporting containers, VMs, and legacy applications, and embracing micro‑service design.
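A metric‑driven Auto‑Scaling service of the kind described here can be reduced to a small decision function. The sketch below assumes a threshold rule on an averaged metric. `ScalingGroup` and `desired_size` are hypothetical names, and the thresholds are illustrative, not values from the talk.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class ScalingGroup:
    min_size: int
    max_size: int
    scale_out_above: float  # e.g. average CPU %, an assumed metric
    scale_in_below: float
    step: int = 1


def desired_size(group: ScalingGroup, current: int,
                 samples: Sequence[float]) -> int:
    """Return the target instance count given recent metric samples."""
    if not samples:
        target = current          # no data: hold steady
    else:
        avg = mean(samples)
        if avg > group.scale_out_above:
            target = current + group.step
        elif avg < group.scale_in_below:
            target = current - group.step
        else:
            target = current
    # Clamp so the group never drops below its baseline or exceeds its cap.
    return max(group.min_size, min(group.max_size, target))


web = ScalingGroup(min_size=2, max_size=6,
                   scale_out_above=75.0, scale_in_below=25.0)
print(desired_size(web, 3, [82.0, 91.0, 78.0]))  # 4
print(desired_size(web, 2, [10.0, 12.0, 8.0]))   # 2
```

The second call shows the minimum baseline at work: load is low, but the group stays at two instances. Separating the decision from the metric source lets the same rule apply to containers, VMs, or legacy applications.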
Key lessons include studying AWS services (IAM, Auto‑Scaling), adopting service‑oriented and plugin‑based designs, and ensuring multi‑tenant support.
Q&A Highlights
Resource‑centric architectures focus on resource supply, whereas application‑centric architectures focus on business logic.
HDFS is best run on physical machines for I/O performance.
EGO is used in finance and telecom sectors.
Legacy components can be wrapped with façade/adaptor patterns to expose service interfaces.
Fine‑grained resource management uses tags, groups, and a rule language for selection and ordering.
LIM provides real‑time monitoring by collecting CPU, memory, and disk metrics, with optional integration with Zabbix or New Relic.
Auto‑Scaling does not require service downtime; multiple scaling groups maintain a minimum baseline.
High‑availability relies on load‑balancing and shared storage; distributed consensus (etcd/ZooKeeper) can be used for larger scales.
Scheduling is often modeled as a knapsack or linear‑programming problem and solved with heuristics that produce near‑optimal allocations quickly.
Fault tolerance is achieved through shared storage, restart‑able components, and application‑level consistency handling.
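The knapsack‑style scheduling mentioned in the Q&A can be illustrated with a greedy heuristic that ranks jobs by value per slot. This is a textbook approximation chosen for illustration, not the scheduler's actual algorithm: `Job` and `greedy_pack` are hypothetical names, and the numeric `value` for business priority is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Job(NamedTuple):
    name: str
    slots: int     # resource demand (the knapsack "weight"), must be > 0
    value: float   # assumed business value (the knapsack "value")


def greedy_pack(jobs: List[Job], capacity: int) -> Tuple[List[str], int]:
    """Place jobs in order of value-per-slot while they still fit.

    Fast, but only approximate: it can miss the optimal combination.
    """
    chosen: List[str] = []
    free = capacity
    for job in sorted(jobs, key=lambda j: j.value / j.slots, reverse=True):
        if job.slots <= free:
            chosen.append(job.name)
            free -= job.slots
    return chosen, capacity - free


jobs = [Job("etl", 6, 30.0), Job("web", 2, 16.0),
        Job("report", 3, 12.0), Job("index", 5, 20.0)]
placed, used = greedy_pack(jobs, capacity=10)
print(placed, used)  # ['web', 'etl'] 8
```

In this example the greedy choice yields a total value of 46, while the best possible packing (web, report, index) reaches 48. The result is close to optimal and computed in a single sorted pass, which is the trade‑off such heuristics make.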
The session concluded with a reminder to prioritize construction, operation, maintenance, and optimization when designing enterprise cloud platforms.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
