Apache Spark is a cluster computing platform designed to be fast and general purpose.
On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.

Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours.

One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative
algorithms, interactive queries, and streaming.

By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala,
and SQL, and rich built-in libraries. It also integrates closely with other Big Data

In particular, Spark can run in Hadoop clusters and access any Hadoop data
source, including Cassandra.

Spark is a Unified Stack

The Spark project contains multiple closely integrated components. At its core, Spark
is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster.

Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning.

These components are designed to interoperate closely, letting you combine them like libraries in a software project.

Spark Core

Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems,
and more.

Spark Core is also home to the API that defines resilient distributed data‐ sets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.

Spark Core provides many APIs for building and manipulating these collections.

Spark SQL

Spark SQL is Spark’s package for working with structured data. It allows querying
data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)—and it supports many sources of data, including Hive tables, Parquet,
and JSON.

Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application, thus combining SQL with complex analytics.

This tight integration with the rich computing environment provided by Spark makes Spark SQL unlike any other open source data warehouse tool. Spark SQL was added to Spark in version 1.0.

If you like this post, don’t forget to share 🙂

This article is written by our awesome writer
Comments to: What Is Apache Spark

Your email address will not be published. Required fields are marked *

Attach images - Only PNG, JPG, JPEG and GIF are supported.

New Dark Mode Is Here

Sign In to access the new Dark Mode reading option.

Join our Newsletter

Get our monthly recap with the latest news, articles and resources.

By subscribing you agree to our Privacy Policy.

Latest Articles

Explore Tutorials By Categories


Codeverb is simply an all in one interactive learning portal, we regularly add new topics and keep improving the existing ones, if you have any suggestions, questions, bugs issue or any other queries you can simply reach us via the contact page


Welcome to Codeverb

Ready to learn something new?
Join Codeverb!

Read Smart, Save Time
    Strength indicator
    Log In | Lost Password