Data Science Series #4: Apache Hadoop

We have previously shared some posts on big data and its processing; you can check out those articles for more on data science and the methods used to analyze big data. Today, in Article 4 of the data science series, we are going to talk about the Apache Hadoop infrastructure.

Apache Hadoop is an open-source infrastructure, written in Java, that allows us to process, store, and manage data in parallel across multiple machines, working on the large data sets produced by architectures such as Apache Kafka. Hadoop consists of four modules: HDFS (Hadoop Distributed File System), MapReduce, Hadoop Common, and YARN.

What Is HDFS (Hadoop Distributed File System)?

HDFS combines the disks of ordinary servers into a single large virtual disk. This virtual disk allows files of very large size to be stored in the system and processed there.
Hadoop consists of four modules, which can also be called its building blocks. In systems developed for big data analysis, each module performs an important task: Hadoop's work divides into four basic phases, and one module corresponds to each. Let's briefly describe these modules.

We will refer to this concept as the “distributed file system”, or DFS for short, since that is the term commonly used in the software world. DFS is the first of the two most important of the four modules: it stores data in an easily accessible way across multiple storage devices.

This stored data is then processed with MapReduce, the second most important module.
A “file system” is the data storage scheme used by a computer, thanks to which data is easily found and used. Normally it is determined by the computer's operating system, but Hadoop has its own file system, known as HDFS (Hadoop DFS), which is independent of these. HDFS sits above the file system of the server computer, meaning it can be accessed from any computer running a supported operating system.
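To show what sitting above the native file system means in practice, here is a minimal sketch of writing and reading an HDFS file through Hadoop's Java FileSystem API. The cluster address and file path are assumptions made for illustration; in a real deployment, fs.defaultFS would come from the cluster's core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");  // hypothetical path

            // Write a file; HDFS splits it into blocks and replicates them
            // across the cluster's DataNodes behind the scenes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same API, regardless of which
            // machines actually hold the blocks.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```

The same code works whether the path points at a single test machine or a thousand-node cluster, which is exactly the abstraction HDFS provides.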

MapReduce converts the captured data into a format suitable for analysis (the “map” step) and then allows mathematical operations to be performed on it, such as calculating the number of women aged 45 and over. This second step is called “reduce” because it boils the mapped data down to a summary result; a short sketch follows below.
The third module, Hadoop Common, provides the Java libraries and tools needed to read data stored under HDFS from different operating systems (Windows, Unix, etc.).
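To make the map and reduce steps concrete, here is a minimal sketch of a Hadoop MapReduce job in Java that counts women aged 45 and over. The input layout (CSV lines of name,gender,age) and the “F” gender code are assumptions made for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WomenOver45Count {

    // Map step: parse each CSV line (assumed layout: name,gender,age)
    // and emit ("women45plus", 1) only for records that match the filter.
    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("women45plus");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 3
                    && "F".equalsIgnoreCase(fields[1].trim())  // assumed gender code
                    && fields[2].trim().matches("\\d+")        // skip headers/bad rows
                    && Integer.parseInt(fields[2].trim()) >= 45) {
                context.write(KEY, ONE);
            }
        }
    }

    // Reduce step: sum the 1s emitted by the mappers into a single count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "women 45+ count");
        job.setJarByClass(WomenOver45Count.class);
        job.setMapperClass(FilterMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a JAR, the job would be submitted with something like hadoop jar women-count.jar WomenOver45Count /input /output, with YARN scheduling the map and reduce tasks across the cluster.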

As the final module, YARN (Yet Another Resource Negotiator) manages the computing resources of the systems that store the data and perform the analysis, scheduling jobs across the cluster.

Although many different tools, libraries, and features have been added to Hadoop over the years, HDFS, MapReduce, Hadoop Common, and Hadoop YARN remain its four core modules.

Using Hadoop

Thanks to Hadoop's flexible structure, companies can extend or modify their data systems with inexpensive, off-the-shelf commodity hardware according to their needs. Today, Hadoop has become one of the most frequently used systems precisely because it runs on hardware that is more affordable and readily available than the alternatives. It is said that more than half of the Fortune 500 companies use this system.

Since almost all of the major online service providers use this system, and anyone can modify it for their own purposes, Hadoop's developers keep making changes in this direction and improving the product. Modifications made by expert developers at large companies such as Amazon and Google have also served as examples for Hadoop's developers and helped them improve the system.

One of the most important and useful features of open-source software is joint development through the collaboration of volunteer and corporate users.
The reason commercial distributions exist is that Hadoop in its raw form, with only the basic tools you will find on its website, is complicated even for IT professionals. Cloudera, for example, has made the Hadoop system easier for companies to install and operate, combining it with services such as training and support into a package for commercial use.

With Hadoop's flexible structure, companies can expand their systems as they grow and scale their data analysis operations accordingly. This system, which has advanced in great strides thanks to its open-source code, has made big data analysis accessible to everyone.
Contact us with your questions about Hadoop, one of the systems we love to use, and stay tuned to our blog page to read more of the data science series!
Don't forget to follow us on Facebook, LinkedIn and Instagram.
See you soon...


Disclaimer: All rights to the articles and content published here belong to Efilli Software. No content, whether text, audio, or video, may be used, published, shared, or modified in whole or in part, even if the source is cited or an active link is provided.