
How to Realize Your File Requirements in Azure - Part 3

High-Performance Computing - Challenges in the Cloud

The cloud is effectively based on an agile pool of infrastructure resources – compute above all – that allows working environments to expand and contract as needed, regardless of the size of the task at hand. By extension, one would expect high-performance computing (HPC) environments to be the perfect workloads to move to the cloud, because compute capacity is effectively limitless.

The struggle is that it is never really that simple. While cloud providers continue to increase the capabilities of their compute nodes, every application relies on data. To get workloads such as oil and gas exploration, genomic sequencing, or electronic design automation (semiconductor design) applications to perform in the cloud, the storage these datasets reside on must provide not only the speed to keep up with compute, but also the resilience and availability to ensure that these critical workloads can get up and running and be processed in the timeliest fashion.

The following items highlight the requirements for moving these high-performance computing workloads to the cloud:

High Data Availability. High-performance computing workloads often ask a lot of the data layer in order for computations to complete as rapidly as possible, and these runs can be long – in some cases on the order of days. This means that a compute run cannot afford a disruption for any reason, especially the dataset becoming unavailable. It is imperative that the dataset be highly available to the applications in order to minimize costs.

Simple, intuitive interface. HPC workloads are regularly part of a business operation and the specialists that run these environments are looking for solutions that don’t require storage-centric knowledge. A goal of the cloud is to make resources simple and easy to consume and storage should be no different.

Shared Access to data. HPC applications are often highly parallelized, where each compute node in a cluster needs access to data. Shared access to the dataset amongst a large set of compute clients allows for a broader distribution of the workload and, therefore, a faster completion time for the computational task.

Scalability. Datasets grow and contract depending on the task at hand, and the cloud infrastructure must grow and contract regularly to meet the needs of the computation. This is required for both the total capacity of the dataset and the performance needed for computation.

Reliability. Because the storage environment is an integral part of the HPC platform, it must be available for sustained access by the compute nodes, without fail. If access to the storage is interrupted, a compute run might come to a halt, potentially causing a major disruption for all dependent applications and systems.

High-Performance Computing with Azure NetApp Files

Azure NetApp Files looks like a service that was custom-built for a high-performance computing environment, which is exactly true. The service levels and the base feature capabilities are in place to address the critical needs of many HPC datasets.

HPC environments realize the following enterprise benefits:

Unprecedented Cloud Performance. The Ultra service level provides performance for even the most stringent applications. Azure NetApp Files operates at sub-millisecond latencies – unprecedented for a cloud file service – enabling companies to move their HPC workloads to the cloud and run them as if they were on-premises, in a way that was never possible before.

Large-scale Shared Access. Azure NetApp Files offers wide-scale access to both Linux and Windows file shares. Highly parallelized HPC architectures can grow their compute node count to reduce the computational time for a given task, lowering overall costs to the business through shorter development cycles and quicker data response times.

Configurable Service Levels. Storage is allocated in accordance with the service level defined when a volume is created. Azure NetApp Files uniquely allows the service level to be changed on demand to best suit the performance requirements of the computational task (see the sketch after this list). This gives users and administrators agile control of performance – and, effectively, of cost – by allocating faster storage only when the task requires it. As a core design principle of the service, whatever service level an application requires, Azure NetApp Files delivers consistent performance at all times, without variation.

Highly Available Service. Azure NetApp Files is a cloud-native service built on highly available NetApp technology deployed directly within the Azure data centers. It ensures that even the most critical applications always have access to the required data. A service is only as good as the infrastructure from which it operates and, with Azure NetApp Files, you can be confident that the service will be available when your HPC applications require it most.

Scalability. HPC workloads run on datasets of varying size, and Azure NetApp Files makes it simple and easy, through the Azure portal or via the Azure CLI, to expand and contract the capacity of a given volume on the fly. This substantially reduces the administrative overhead of reacting to a change in dataset requirements, as the sketch below illustrates.
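
As a rough sketch of what this agility can look like in practice, the Azure CLI commands below resize an existing volume and then move it to a capacity pool created at a different service level. All resource names (hpc-rg, hpc-account, premium-pool, ultra-pool, scratch-vol) are placeholders, and exact flag names and size units can vary between CLI versions, so treat this as an illustration rather than a definitive recipe.

# Grow an existing volume's quota on the fly (8 TiB expressed in GiB here;
# check "az netappfiles volume update --help" for the units your CLI expects).
az netappfiles volume update \
  --resource-group hpc-rg \
  --account-name hpc-account \
  --pool-name premium-pool \
  --name scratch-vol \
  --usage-threshold 8192

# Change the volume's service level by moving it into a capacity pool
# that was created with a different service level (for example, Ultra).
az netappfiles volume pool-change \
  --resource-group hpc-rg \
  --account-name hpc-account \
  --pool-name premium-pool \
  --name scratch-vol \
  --new-pool-resource-id "/subscriptions/<sub-id>/resourceGroups/hpc-rg/providers/Microsoft.NetApp/netAppAccounts/hpc-account/capacityPools/ultra-pool"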

“Azure NetApp Files looks like a service that was custom-built for a high-performance computing environment, which is exactly true.”

Data Analytics – Challenges in the Cloud

Data is the lifeblood of a company, and the search for ways to use enterprise data for better business outcomes has driven the growth of data analytics, while the flexibility and scale of compute in the cloud has brought these two forces together.

Data in large organizations grows organically, normally spreading across various databases and other repositories. One of the first tasks in setting up a data analytics platform is to consolidate all relevant data into a single repository, or data lake. This repository can then be accessed by a cluster of compute nodes that apply different algorithms to the data in an attempt to find patterns and gain insights.

Data lakes are simultaneously accessed by hundreds or thousands of compute nodes, which requires the host storage service to guarantee scalable and predictable I/O performance.

This can be difficult to achieve with a custom file server solution built on cloud compute and storage resources, because managing the capacity and performance characteristics of the underlying disks becomes more and more challenging as the deployment grows.

Another major challenge for creating a centralized data repository is keeping the data synchronized with the source data after the initial baseline copy is created. As the source data continues to change, updates must be applied efficiently to the analytics data repository.

Once the raw source data is synchronized with a data lake, a certain amount of preprocessing may be required to optimize the data for processing by downstream analytics engines. Data engineers require independent copies of the data to work with in order to develop these data transformation routines. Due to the huge volumes of information in a data lake, it’s very difficult to maintain multiple, up-to-date test copies of the data.

With the data lake ready to serve out data, a compute cluster can be used to target the repository and execute analytics workloads. Azure HDInsight is a managed service for building distributed data processing solutions that natively supports cluster scheduling, management, and MapReduce operations. The foundation is a clustered file service for horizontally scaling out data storage with fault tolerance. Each compute node can operate on its local data, as well as on the data in the rest of the cluster. In this way, a cluster is able to support a wide range of other data processing platforms.

Creating a data lake in the cloud means that it is no longer necessary to set up a Hadoop cluster manually. Cloud services such as HDInsight provide ready-to-use solutions for Hadoop processing and data warehousing.

Data Analytics with Azure NetApp Files

The successful establishment of data lakes in the cloud opens up a world of possibilities for data analysis. Azure NetApp Files is part of an easy-to-use, robust, high-performance platform with the precise feature set required to create and support data lake environments.

As described in the previous sections, Azure NetApp Files offers many advantages for data operations. Here is a summary of how Azure NetApp Files benefits data analytics solutions:

Robust infrastructure. Azure NetApp Files, built on NetApp technology, brings decades of enterprise storage experience to the infrastructure underpinning the service. The service's highly available infrastructure, with dedicated connections to the Azure compute environment, means that it is fully optimized and readily available.

High I/O performance. Processing large volumes of data, as is typical in analytics environments, requires consistent, high-performance I/O systems to ensure that data is readily available to compute resources. With Azure NetApp Files’ three service levels that can be changed on demand, the data lake performance can be tailored to the analytics engine requirements.

Scalability. Azure NetApp Files scales data access to a level that is not possible with other shared file services. As analytics clusters grow in size, the storage environments they depend on must continue to provide predictable high performance. This can be especially difficult to achieve with custom-built NAS solutions.

Faster results. Analytics environments usually require temporary working copies of the data for preprocessing operations – for example, when testing data transformations that enrich the source data. Using the snapshot and cloning technology that comes with Azure NetApp Files, writable cloned volumes can be created in a very short time, and multiple clones of the same source volume can be created concurrently (see the sketch after this list).

Data consolidation. Data can be seamlessly synchronized to and from multiple data sources. Data can be consolidated from cloud-based environments, on-premises systems, and even across cloud platforms. Consolidating data from multiple sources into Azure NetApp Files brings consistent performance with enterprise data protection.

Multiple Service Levels. With multiple service levels, environments have the flexibility to match data performance and cost to the need at hand. Unlike with other data services, choosing the wrong performance level is of little concern: because the level can be changed on demand, making a change takes just a few clicks.
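
To make the cloning point concrete, here is a minimal sketch using the Azure CLI: take a snapshot of the data lake volume, then create a writable clone from that snapshot for a data engineer to test transformation routines against. Resource names and the region are placeholders, and flag names can vary between CLI versions.

# Take a point-in-time snapshot of the data lake volume.
az netappfiles snapshot create \
  --resource-group analytics-rg \
  --account-name analytics-account \
  --pool-name premium-pool \
  --volume-name datalake-vol \
  --name preproc-test-snap \
  --location westeurope

# Create a writable clone of the volume from that snapshot.
az netappfiles volume create \
  --resource-group analytics-rg \
  --account-name analytics-account \
  --pool-name premium-pool \
  --name datalake-clone \
  --location westeurope \
  --service-level Premium \
  --usage-threshold 4096 \
  --file-path "datalake-clone" \
  --vnet analytics-vnet \
  --subnet anf-subnet \
  --snapshot-id "<resource ID of preproc-test-snap>"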

With Azure NetApp Files, NetApp brings to bear decades of experience in building enterprise NAS solutions. This means that the service easily scales to meet the most demanding conditions, providing concurrent access to thousands of client hosts and applications. Scalability to this degree is a challenging requirement for large-scale environments and is impossible to achieve with custom-built NAS solutions.

“Azure NetApp Files is part of an easy-to-use, robust, high-performance platform...”

Conclusion and Next Steps

High-performance computing and analytics workloads are complex enterprise applications that depend heavily on the I/O systems they use. For the best results, storage services must combine performance, data protection, scalability, security, and flexibility into a single solution.

High-performance, scalable and highly available shared file storage is crucial to delivering a data analytics platform. The ability to effectively manage data from multiple source systems can be another major obstacle. Azure NetApp Files provides cloud-based file service solutions that address the major challenges in creating a repository for data analytics workloads, and can be used with custom-built Apache Hadoop clusters or public cloud analytics services.

Azure NetApp Files has been purpose-built to deliver the highest levels of I/O performance and scalability. End users simply input the size of the storage volume they need, choose the appropriate service level for their performance requirements, and NetApp takes care of the rest. This removes the significant burden on organizations of managing in-house NAS solutions.
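
For readers who want a feel for that workflow, the sketch below shows the basic provisioning flow with the Azure CLI: create a NetApp account, a capacity pool at a chosen service level and size, and a volume with its quota and export path. Names, region, and network details are placeholders; consult the current CLI help for exact flags.

# Account that holds the capacity pools (all names are placeholders).
az netappfiles account create \
  --resource-group hpc-rg \
  --name hpc-account \
  --location westeurope

# Capacity pool: the service level (Standard, Premium, or Ultra) and
# the pool size in TiB are chosen here.
az netappfiles pool create \
  --resource-group hpc-rg \
  --account-name hpc-account \
  --name premium-pool \
  --location westeurope \
  --service-level Premium \
  --size 4

# Volume: specify the quota, export path, and the delegated subnet
# the volume is served from.
az netappfiles volume create \
  --resource-group hpc-rg \
  --account-name hpc-account \
  --pool-name premium-pool \
  --name scratch-vol \
  --location westeurope \
  --service-level Premium \
  --usage-threshold 2048 \
  --file-path "scratch" \
  --vnet hpc-vnet \
  --subnet anf-subnet \
  --protocol-types NFSv3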

The synchronization capabilities of Azure NetApp Files allow data from multiple systems to be consolidated into a single storage volume. Data can also be synchronized out of Azure NetApp Files to provide integration with other external systems. Volume cloning adds to the ability to manage and work with large volumes of data.

Deciding whether Azure NetApp Files is right for you is simple, because it is a service you can easily run from your Azure portal. For instructions on how to get started, contact us by email at [email protected]