Apache Spark is another of the projects used to handle big data. Many sources begin their explanation of Spark by comparing it with Hadoop. There are indeed aspects of the two that can be compared, but Hadoop and Spark do not work the same way. Let's examine Spark in more detail.
Spark is a big data processing engine. It performs parallel processing on data, is open source, and is written in Scala.
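As a quick illustration, here is a minimal sketch of starting a Spark application in Scala. The application name and the local master setting are arbitrary choices for this example, not anything the post prescribes.

```scala
import org.apache.spark.sql.SparkSession

// Entry point for a Spark application; "local[*]" runs Spark on the
// local machine using all available cores (example settings only).
val spark = SparkSession.builder()
  .appName("hello-spark")   // hypothetical application name
  .master("local[*]")
  .getOrCreate()

println(spark.version)      // confirm the session is up
spark.stop()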
Spark is commonly used in the following areas:
Data integration and ETL (extract, transform, load):
This is the process of cleaning and merging data from many sources so that it can be visualized, processed, and analyzed, and it can be done with Spark, as the sketch below illustrates.
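A minimal ETL sketch in Scala: it assumes hypothetical CSV files ("orders.csv", "customers.csv") and invented column names, cleans and joins them, and writes the merged result as Parquet.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("etl-sketch").master("local[*]").getOrCreate()

// Extract: read two hypothetical CSV sources
val orders    = spark.read.option("header", "true").option("inferSchema", "true").csv("data/orders.csv")
val customers = spark.read.option("header", "true").option("inferSchema", "true").csv("data/customers.csv")

// Transform: drop duplicate orders and rows missing a customer id, then join
val cleaned = orders.dropDuplicates("order_id").na.drop(Seq("customer_id"))
val joined  = cleaned.join(customers, Seq("customer_id"), "left")

// Load: write the merged result as Parquet for later analysis
joined.write.mode("overwrite").parquet("output/orders_enriched")
```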
Machine learning and advanced analytics:
Spark is used to estimate outputs by combining algorithms, detect errors, extract information from stored data, and make decisions based on inputs.
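As an illustration, a small Scala sketch using Spark's MLlib library: a toy in-memory dataset (the feature and label values are invented for this example) is assembled into a feature vector and fed to a logistic regression pipeline.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy training data: three numeric features and a binary label (invented values)
val training = Seq(
  (1.0, 0.5, 2.0, 1.0),
  (0.0, 1.5, 0.3, 0.0),
  (2.0, 0.1, 1.1, 1.0),
  (0.2, 2.0, 0.0, 0.0)
).toDF("f1", "f2", "f3", "label")

// Combine the raw columns into a single feature vector, then fit a classifier
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2", "f3")).setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val model     = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// Predict on the training data just to show the output columns
model.transform(training).select("features", "label", "prediction").show()
```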
Streaming (real-time data processing):
Data is collected and processed continuously so that the error rate stays minimal. Examples of real-time data processing include e-commerce orders dropping into the system the instant they are placed, or account movements at a bank being tracked as they happen.
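A minimal Scala sketch of real-time processing with Spark's Structured Streaming: it keeps a running word count over text arriving on a local socket. The host and port are placeholders; you could feed the socket with a tool such as netcat.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Source: a text stream from a local socket (placeholder host/port)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Continuously split incoming lines into words and keep a running count per word
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// Sink: print the updated counts to the console after each micro-batch
val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```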
Spark Core
Spark Core is the basic engine for large-scale parallel and distributed data processing. It allows Spark to carry out its fundamental tasks and run various workloads through the libraries built on top of it. It works with R, SQL, Python, Scala, and Java, and it underpins the Spark SQL, Spark Streaming, MLlib, and GraphX application programming interfaces (APIs).
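To make Spark Core's role concrete, here is a small Scala sketch that distributes a collection across partitions as an RDD and runs a computation on it in parallel. The numbers and partition count are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("core-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext   // Spark Core's low-level entry point

// Distribute one million numbers across 8 partitions and compute in parallel
val nums = sc.parallelize(1 to 1000000, numSlices = 8)
val sumOfSquares = nums.map(n => n.toLong * n).reduce(_ + _)

println(s"Sum of squares: $sumOfSquares")
```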
Spark SQL
Through this interface, data can be queried with SQL or the Hive Query Language. Spark SQL also offers a performance-enhancing way to work with data held in RDBMS databases.
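For instance, a short Scala sketch of querying a DataFrame with SQL; the table name and columns are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Register a small in-memory DataFrame as a temporary SQL view
val people = Seq(("Alice", 34), ("Bob", 28), ("Carol", 41)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Query it with plain SQL through Spark SQL
spark.sql("SELECT name, age FROM people WHERE age >= 30 ORDER BY age").show()
```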
Spark Streaming
It enables efficient processing of real-time data streams, as in the streaming sketch shown earlier.
MLlib
It is the library used to run machine learning in Apache Spark, as in the pipeline sketch shown earlier.
GraphX
It is the interface used for graphs and graph-parallel computation.
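As a small illustration of GraphX in Scala, the sketch below builds a toy "follows" graph from invented vertices and edges and runs PageRank on it.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphx-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Toy graph: three users (vertices) and their "follows" relationships (edges)
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph    = Graph(vertices, edges)

// Run PageRank until convergence and print each user's score
val ranks = graph.pageRank(0.001).vertices
ranks.join(vertices).collect().foreach { case (_, (rank, name)) => println(s"$name: $rank") }
```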
Spark and Hadoop
Apache Spark and Hadoop are similar but by no means identical systems. They can be compared in several areas, but one is not a substitute for the other. You need to evaluate their features against the project you are planning and decide accordingly.
Spark offers high processing speed, ease of use for continuous operations, graph processing, real-time analysis, and machine learning. But if your project needs a file system that guarantees data integrity, you are not building a real-time application, and your datasets are very large, Hadoop will be the better choice.
If you need both real-time data processing and a place to store the processed data, it can be useful to run the two systems together, with HDFS handling the storage. As you can see, although they have comparable characteristics, Spark and Hadoop cannot be treated as one and the same. Knowing the requirements of your project, you should choose between the two technologies accordingly.
We have provided detailed information about Apache Spark and hope you find it useful. Contact us with any questions, and check out our blog page for more on data science. See you in our next post!
Disclaimer: All rights to the articles and content published here belong to Efilli Software. No content, including text, audio, and video, may be used, published, shared, or modified in whole or in part, even if the source is cited or an active link is provided.