Modern Infrastructure for Spark-based Applications


Apache Spark

We are living in an era of data deluge, and as a result the term "big data" is appearing in many contexts, including meteorology, genomics, complex physics simulations, biological and environmental research, finance, IoT and healthcare.

Apache Spark is an open source cluster computing framework for large-scale data processing. It provides parallel distributed processing, fault tolerance and scalability for big-data workloads.

Key Benefits

  • Improve accuracy of decisions by accessing high volumes of data for analysis
  • Mix multiple Spark-based analytics workloads using a common infrastructure
  • Increase operational flexibility
  • Deliver more application density per rack
  • Reduce costs by taking advantage of shared resources and easier management
  • Run multiple modeling operations on the same data set at full speed

Pavilion Overview

  • Fastest block storage for flexible private cloud deployments
  • Latency of direct-attached SSDs
  • Up to 920 TB in 4U
  • Frictionless deployment
  • Data resiliency & high availability
  • Space-efficient instant snapshots and clones
  • Thin provisioning
  • Pay-as-you-grow scalability
  • Expand for capacity or performance, independently
  • Increase storage utilization up to 10X or more

Storage Challenges and Apache Spark

Large-scale data analytics platforms face many challenges related to data management and data storage. Some of these challenges may be:

All of these challenges can be overcome by using an appropriate shared storage solution.

NVMe Storage for Spark-based Big Data Workloads

NVMe is a storage technology designed for flash media, and it is inherently parallel. It is 250 times more parallel than SAS and 2,000 times more parallel than SATA. Modern web (transactional) and machine learning/AI (real-time analytics) applications are likewise built on massively parallel, clustered databases and filesystems in order to meet their performance requirements.
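The parallelism figures above are not derived in the text. One plausible reading is that they compare per-queue command depth. The short sketch below works through that arithmetic under this assumption. The queue depths used (64K commands for NVMe, 254 for SAS, 32 for SATA) are commonly cited figures rather than values from this document, and the names `QUEUE_DEPTH` and `depth_ratio` are purely illustrative.

```python
# Commonly cited per-queue command depths.
# These are assumptions for illustration, not figures taken from this brief.
QUEUE_DEPTH = {
    "NVMe": 65536,  # up to 64K commands per queue
    "SAS": 254,     # single queue, commonly cited as 254 commands
    "SATA": 32,     # single AHCI queue, 32 commands
}


def depth_ratio(base: str, other: str) -> float:
    """Return how many times deeper the queue of `base` is than that of `other`."""
    return QUEUE_DEPTH[base] / QUEUE_DEPTH[other]


nvme_vs_sas = depth_ratio("NVMe", "SAS")
nvme_vs_sata = depth_ratio("NVMe", "SATA")

print(f"NVMe vs SAS:  about {nvme_vs_sas:.0f}x")
print(f"NVMe vs SATA: about {nvme_vs_sata:.0f}x")
```

Under these assumed depths the ratios come out near 258 and 2048, which is in line with the rounded 250x and 2,000x figures. NVMe also allows many queues per device, so the gap in total outstanding commands would be larger still.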

By leveraging NVMe in Spark-based environments using a storage platform that supports data management operations, organizations can now gain much more value from their data. Multiple applications and analytic workloads can easily be applied to the same data set, or multiple data sets, using common rack-scale infrastructure.

Operational Benefits

Infrastructure Benefits

The diagram represents the reference architecture for a modern Apache Spark implementation across complex persistent layers for modern applications, leveraging scalable NVMe-based shared storage resources.

Run Multiple Spark-Based Applications using a common, high-speed storage platform with Pavilion