Spark is a general-purpose framework for cluster computing, and it is used for a diverse range of applications. While the typical use cases differ, we can roughly classify Spark's users into two categories: data scientists and engineers building data processing applications. Let's take a closer look at each group and how it uses Spark.
Data Science Tasks
Data science, a discipline that has been emerging over the past few years, centers on
analyzing data. While there is no standard definition, for our purposes a data scientist
is somebody whose main task is to analyze and model data.
Data scientists may have experience with SQL, statistics, predictive modeling (machine learning), and programming, usually in Python, Matlab, or R.
Data scientists also have experience with techniques necessary to transform data into formats that can be analyzed for insights (sometimes referred to as data wrangling).
Spark supports the different tasks of data science with a number of components. The Spark shell makes it easy to do interactive data analysis using Python or Scala. Spark
SQL also has a separate SQL shell that can be used to do data exploration using SQL,
or Spark SQL can be used as part of a regular Spark program or in the Spark shell.
Machine learning and data analysis are supported through the MLlib library. In
addition, there is support for calling out to external programs in Matlab or R.
Spark enables data scientists to tackle problems involving larger data sizes than they could handle before with single-machine tools like R or pandas.
Data Processing Applications
The other main use case of Spark can be described in the context of the engineer persona. For our purposes here, we think of engineers as a large class of software developers who use Spark to build production data processing applications.
These developers usually have an understanding of the principles of software engineering,
such as encapsulation, interface design, and object-oriented programming. They frequently have a degree in computer science.
They use their engineering skills to design and build software systems that implement a business use case.
For engineers, Spark provides a simple way to parallelize these applications across
clusters, and hides the complexity of distributed systems programming, network
communication, and fault tolerance.
The system gives them enough control to monitor, inspect, and tune applications while allowing them to implement common tasks quickly. The modular nature of the API (based on passing distributed collections of objects) makes it easy to factor work into reusable libraries and test it locally.
Spark’s users choose to use it for their data processing applications because it provides a wide variety of functionality, is easy to learn and use, and is mature and reliable.