Intelligent cluster reliability and self-healing: a new approach for enterprise storage environments

By Madhukar Gunjan Chakhaiyar, Product Test Architect, Wipro.


Increasing network content, database applications, rich email, and e-commerce are driving rapid growth in the volume of data moving across public and enterprise networks. In turn, this is driving the demand for affordable, reliable and accessible storage available on the network. IT administrators often have to deal with data inaccessibility and interrupted information flow as a result of cluster environment failures. These failures have become a growing concern for large-scale computational platforms, as they have an adverse effect on system performance, and any downtime can have a severe impact on the entire organisation. A new approach, the cluster reliability aware application, supports networks that run mission-critical databases.


To keep data accessible at all times, a robust cluster environment is needed, one that can perform failure pattern diagnostics, offload critical applications and processes, pre-determine node failures, determine node capability, periodically analyse cluster health and state, determine the workload on each cluster node for load balancing, track the cluster failover rate, fail over critical applications or processes to a suitable node, assess each node's reliability over long runs, continuously track the state of every cluster node, and provide a self-heal mechanism that senses a failure and dynamically takes the necessary action to heal it at runtime. To meet these needs, any network that runs a mission-critical database requires a cluster reliability aware application.
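To illustrate what a periodic health and state scan of this kind could look like, the Python sketch below polls each node and flags those that appear likely to fail, so that critical work can be offloaded in time. The node names, thresholds and the collection stub are hypothetical; a real implementation would query node agents or the cluster manager over the dedicated network.

```python
from dataclasses import dataclass

@dataclass
class NodeHealth:
    name: str
    cpu_load: float          # utilisation between 0.0 and 1.0
    failed_heartbeats: int   # consecutive missed heartbeats
    recent_errors: int       # errors logged during the last scan interval

def collect_health(node_name: str) -> NodeHealth:
    # Placeholder: a real collector would query the node's agent or the
    # cluster manager's API over the dedicated network.
    return NodeHealth(name=node_name, cpu_load=0.42,
                      failed_heartbeats=0, recent_errors=0)

def scan_cluster(nodes, heartbeat_limit=3, error_limit=10):
    """One periodic pass: flag nodes that look likely to fail soon."""
    at_risk = []
    for node in nodes:
        health = collect_health(node)
        if (health.failed_heartbeats >= heartbeat_limit
                or health.recent_errors >= error_limit):
            at_risk.append(health)
    return at_risk

if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c"]
    # In practice this scan would repeat at a fixed interval.
    for node in scan_cluster(cluster):
        print(f"{node.name}: at risk, offload critical applications")
```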


The traditional Redundant Array of Independent Nodes (RAIN) storage approach has emerged as the architecture of choice because it creates reliable communication between nodes in a cluster, with a built-in acknowledgement scheme that ensures reliable packet delivery. RAIN minimises the number of nodes in the chain connecting client and server. The nodes are also more robust and independent of each other. By allowing nodes to join and leave a cluster as and when needed, RAIN architectures insulate a storage cluster from the failure of one or more nodes. By replicating data on multiple nodes, RAIN-type archives can automatically compensate for node failure or removal.
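To make the replication idea concrete, here is a minimal Python sketch of placing copies of an object on several nodes and serving a read from any surviving replica. The hash-ranked placement and node names are purely illustrative, not RAIN's actual algorithm.

```python
import hashlib

def replica_nodes(object_key: str, nodes: list[str], copies: int = 3) -> list[str]:
    """Pick `copies` distinct nodes for an object by ranking nodes on a hash
    of (object_key, node): a simple, deterministic hash-based placement."""
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{object_key}:{n}".encode()).hexdigest(),
    )
    return ranked[:copies]

def read(object_key: str, nodes: list[str], alive: set[str]) -> str:
    """A read succeeds as long as at least one replica's node survives."""
    for node in replica_nodes(object_key, nodes):
        if node in alive:
            return f"served '{object_key}' from {node}"
    raise RuntimeError("all replicas lost")

if __name__ == "__main__":
    cluster = ["node-1", "node-2", "node-3", "node-4", "node-5"]
    alive = set(cluster) - {"node-2"}   # one node has failed or left the cluster
    print(read("invoice-2024.db", cluster, alive))
```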


In an Enterprise Storage Environment, RAIN technology is widely established. Typically, RAIN systems are delivered as hardware appliances built from identical components within a closed system. Despite its benefits, RAIN is limited: it does not “see” the actual information in a packet and can therefore not allocate priorities. The technology also requires the installation of switches to interconnect clients, which is not only expensive but also requires traffic load-balancing across the switches. If one of these switches fails, network performance, load and reliability are affected, and the switch has to be repaired as quickly as possible. Whilst RAIN mitigates a system failure by moving the I/O path to existing or new hardware, it cannot predict failure. In addition, RAIN cannot analyse error patterns and logs to find a remedy for issues before they recur. In other words: RAIN has no self-heal mechanism.
A new approach is Cluster Reliability Aware Application and Self-Heal Intelligence (CRASSHI) – an intelligent mechanism that sits on top of heterogeneous cluster nodes to implement reliability and predictability. CRASSHI can run on dedicated centralised hardware, for example a high-performing computational node or a dedicated high-end hardware server. This dedicated server should be connected to every existing cluster node via a dedicated network with dedicated bandwidth for smooth and flawless data transactions.


The CRASSHI mechanism could also run as a logical application on top of each node, with the instances synced with each other via a dedicated high-performing network through which the clustered nodes are connected. In this case the CRASSHI application should be configured with separate network and TCP/IP credentials, and should use a dedicated network interface card for smooth and flawless data transactions.
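A sketch of how a per-node CRASSHI agent could pin its sync traffic to the dedicated interface is shown below. The IP addresses, port and payload are hypothetical assumptions; the article only requires a separate network identity and a dedicated NIC.

```python
import socket

# Hypothetical addresses for the dedicated CRASSHI network.
DEDICATED_LOCAL_IP = "10.20.0.11"   # IP assigned to this node's dedicated NIC
PEER_AGENT = ("10.20.0.12", 9500)   # CRASSHI agent on a peer node

def send_health_report(payload: bytes) -> None:
    """Send a report to a peer agent, forcing the traffic onto the dedicated
    NIC by binding the outgoing socket to that interface's IP address."""
    with socket.create_connection(PEER_AGENT, timeout=5,
                                  source_address=(DEDICATED_LOCAL_IP, 0)) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    send_health_report(b'{"node": "node-a", "state": "healthy"}')
```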


The mechanism then analyses the events and logs that occur at runtime on the cluster nodes. It uses the logs of the existing nodes in the cluster environment to determine a range of metrics, such as the types of failure that occur, the failure patterns that occur frequently, the cascading effect of a failure within the cluster nodes, the impact of a failure on applications, the workload on each cluster node for load balancing, and so on. CRASSHI closely examines the logs collected via the dedicated network from each existing node, and its decision-making algorithm filters the required data from core dumps, logs and so on.
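The kind of filtering the decision-making algorithm performs can be pictured with a short Python sketch that classifies raw log lines into failure categories and counts how often each occurs. The log format and regular expressions are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> <node> <severity> <message>"
FAILURE_PATTERNS = {
    "link_timeout": re.compile(r"link .* timed out", re.IGNORECASE),
    "disk_error":   re.compile(r"I/O error|disk failure", re.IGNORECASE),
    "failover":     re.compile(r"failover (started|completed)", re.IGNORECASE),
}

def classify(line: str) -> str | None:
    """Map a raw log line to a known failure category, or None if irrelevant."""
    for name, pattern in FAILURE_PATTERNS.items():
        if pattern.search(line):
            return name
    return None

def failure_frequency(log_lines) -> Counter:
    """Count how often each failure category appears across the collected logs."""
    counts = Counter()
    for line in log_lines:
        category = classify(line)
        if category:
            counts[category] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "2024-05-01T10:00:01 node-b ERROR link eth2 timed out",
        "2024-05-01T10:00:05 node-b WARN  failover started for volume vol7",
        "2024-05-01T10:02:11 node-c INFO  scrub completed",
    ]
    print(failure_frequency(sample))  # Counter({'link_timeout': 1, 'failover': 1})
```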


This approach has distinctive advantages:
· CRASSHI detects potential link failures and instructs an impacted node to heal itself before the link reaches its timeout value and disconnects. CRASSHI is also equipped to understand the mean time between failures of each node. Based on cluster node log analysis, the logical intelligence module understands the time between arrivals of failures on a cluster node and predicts a failure before it happens (a minimal sketch of this follows the list).
· CRASSHI continuously monitors the functional event log and scheduler log related to the functional operations a node was performing at the time of a failure. After examining the logs, the mechanism determines the frequency of functional failures on each existing cluster node in order to anticipate failures in the future. It determines anticipatory operation failures by analysing the probability as well as the sensitivity of function failures. CRASSHI then instructs the self-heal algorithm to construct a functional scheduler that runs operations on the cluster node at defined intervals, to avoid frequent functional failure.
· CRASSHI continuously monitors the event log and failover log of each existing cluster node to determine the impact of a node failure on other nodes. It examines the factors that will affect other nodes during failover, and keeps a record in its database of the factors impacting cluster node failover as well as the impact on other nodes during a failure. As and when such a node failure occurs, the self-heal mechanism springs into action to take preventive measures and stop the cascading effect of the failure.
· CRASSHI checks the survival timespan of each existing cluster node under standard workload conditions from the logs. Based on hardware log analysis, it determines the most and least reliable nodes. It then runs less critical applications on the less reliable nodes and mission-critical applications on the most reliable nodes in order to balance the workload across the cluster.
· CRASSHI maintains a record of component health by monitoring the hardware logs of each cluster node's components. This generates a predictive component-failure alarm and instructs the self-heal mechanism to apply proactive remedies. The self-heal mechanism switches running applications from a failing node to a reliable cluster node and automatically informs the IT administrator.
· CRASSHI monitors the performance characteristics and application job-handling capability of each existing cluster node in the cluster environment from its performance and application logs. These parameters are captured and maintained in a database. If a cluster node handling mission-critical applications fails, the self-heal algorithm allocates the job or application to the cluster node best able to handle it without failure.
· CRASSHI identifies failure patterns and their occurrence intervals. It also monitors the impact of events in the cluster environment. Analysing the logs helps identify the most critical damage mechanisms associated with cluster node failure, which results in a reliable classification of failures. The impact of a cluster node failure caused by unrecognised events is handled by the self-heal algorithm with appropriate remedies.
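As referenced in the first point above, a minimal sketch of estimating mean time between failures from a node's failure history, and naively predicting the next failure, could look like the following. The dates and the simple "last failure plus MTBF" rule are illustrative only.

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_time_between_failures(failure_times: list[datetime]) -> timedelta:
    """Estimate MTBF for one node from the timestamps of its past failures."""
    ordered = sorted(failure_times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))

def predicted_next_failure(failure_times: list[datetime]) -> datetime:
    """Naive prediction: the last observed failure plus the estimated MTBF."""
    return max(failure_times) + mean_time_between_failures(failure_times)

if __name__ == "__main__":
    history = [
        datetime(2024, 1, 3, 8, 0),
        datetime(2024, 2, 14, 21, 30),
        datetime(2024, 3, 29, 5, 15),
    ]
    print(f"MTBF: {mean_time_between_failures(history)}")
    print(f"Start pre-emptive healing before: {predicted_next_failure(history)}")
```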


CRASSHI supports scalability along the three dimensions of a cluster node: capacity, performance and operations. A single administrator can manage and scale any number of cluster nodes with high cluster performance by dynamically redeploying workloads across cluster nodes. CRASSHI achieves full robustness of the cluster environment and maximum predictability of failure. The mechanism and its self-heal algorithm balance the workload across cluster node hardware, which reduces hardware crashes. The operating time of the cluster is maximised, resulting in negligible downtime. Running suitable jobs on the right node hardware and intelligent prioritisation of jobs prolong the operational runtime.
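One way to picture the dynamic workload redeployment described above is a small placement routine that sends the most critical jobs to the most reliable nodes. The job names, reliability scores and the round-robin fallback are assumptions for illustration, not CRASSHI's actual scheduling policy.

```python
def redeploy(workloads: dict[str, str], reliability: dict[str, float]) -> dict[str, str]:
    """Assign workloads to nodes so that critical jobs land on the most reliable
    nodes. `workloads` maps a job name to its criticality ('critical' or
    'normal'); `reliability` maps a node name to a score between 0 and 1."""
    nodes_by_reliability = sorted(reliability, key=reliability.get, reverse=True)
    ordered_jobs = sorted(workloads, key=lambda j: workloads[j] != "critical")
    placement = {}
    for i, job in enumerate(ordered_jobs):
        # Critical jobs are placed first; remaining jobs cycle round-robin.
        placement[job] = nodes_by_reliability[i % len(nodes_by_reliability)]
    return placement

if __name__ == "__main__":
    jobs = {"billing-db": "critical", "report-batch": "normal", "log-archiver": "normal"}
    scores = {"node-a": 0.99, "node-b": 0.93, "node-c": 0.88}
    print(redeploy(jobs, scores))
    # {'billing-db': 'node-a', 'report-batch': 'node-b', 'log-archiver': 'node-c'}
```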


Ultimately, the major benefits to an organisation are seamless data accessibility, a self-healing cluster framework, stable ROI for data storage, a simplified cluster framework and less hardware maintenance. With this, CRASSHI optimises not only performance but also cost.
