Web proceedings papers

Authors

Genti Daci and Frida Gjermeni

Abstract

Storing, processing and transferring large volumes of data in distributed, large-scale environments such as cloud computing systems poses many challenges today, and Apache Hadoop is a recent, well-known platform that provides such services. Hadoop is organized around two key components: the Hadoop Distributed File System (HDFS) for file storage and MapReduce, a distributed processing system for data-intensive cloud applications. The main features of this architecture are scalability, increased fault tolerance, efficiency and high performance for the whole system. Hadoop today supports scientific applications in fields such as high energy physics, astronomy, genetics and meteorology, and many organizations, including Yahoo! and Facebook, have successfully deployed it as their internal distributed platform. However, the HDFS architecture relies on a master/slave model in which a single name-node server is responsible for managing the namespace and all metadata operations in the file system. This limits the system's growth and ability to scale, since the namespace must fit in the RAM available on the single namespace server, and the name-node also becomes a resource bottleneck. By analyzing the limitations of the HDFS architecture and the challenges of resource distribution, this paper reviews solutions proposed to address these limitations. We do so by discussing and comparing two file systems: the Ceph Distributed File System and the Scalable Distributed File System (SDFS). We discuss and evaluate their main features, strengths and weaknesses with regard to increased scalability and performance, and we qualitatively analyse and compare the algorithms implemented in these schemes, such as CRUSH and RADOS for Ceph and the RA and RM algorithms for SDFS.

Keywords

Algorithms, cloud file systems, Hadoop Distributed File System, performance, Ceph File System, Scalable Distributed File System