TERATEC Forum 2015
Workshop 7 - Wednesday, June 24 from 14:00 to 17:30
Innovative storage and IO technologies for exascale

Performance Comparison of SQL based Big Data Analytics using Lustre and HDFS file systems
Rekha SINGHAL, TCS INNOVATION LAB - PERFORMANCE ENGINEERING; Gabriele PACIUCCI, INTEL TECHNOLOGY

Download the presentation

The performance benefits of parallel processing have led to the migration of existing RDBMS applications to big data technologies such as Hadoop and Hive. This migration brings the additional challenge of matching the performance of a parallel RDBMS while processing data in parallel on a cluster of commodity nodes, which raises the need to replace traditional file systems such as HDFS with parallel file systems such as Lustre. Moreover, the convergence of HPC with big data further motivates a unified file system that avoids data transfer across different subsystems.

In this presentation, we share a performance comparison of HDFS and Intel Lustre for FSI, telecom and insurance SQL workloads. We evaluate the performance of the application on an integrated stack of Hive and Lustre, using Hive extensions such as the Hadoop Adapter for Lustre (HAL) developed by Intel, and compare it against the Hadoop Distributed File System (HDFS). The environment used for this evaluation will be hosted in the Intel BigData Lab in Swindon (UK). The cluster consists of 16 Intel Ivy Bridge nodes connected by an Intel TrueScale InfiniBand network and set up with CDH 5.2. Another similar cluster will be used to measure HDFS performance. We use Intel Enterprise Edition for Lustre 2.2, based on Lustre 2.5, and Hadoop Adapter for Lustre 3.1 for the experiment. Both systems will be evaluated on the performance metric ‘query average response time’ for the FSI workload. Tests will be run for application data volumes varying from 100 GB to 7 TB.

Dr. Rekha Singhal has 17 years of research and teaching experience. She currently works as a Senior Scientist with TCS Innovation Lab. She is a CMG member and serves on the TPC of CMG India. She has worked with the CDAC and TRDDC research centers. One of CDAC's products, Revival 2000, developed under her guidance, received the NASSCOM Technology award. She has published in both international conferences and journals, and has filed patents in India. She has taught in prestigious institutes such as TISS and NITIE. Her research interests are Query Performance Prediction, Big Data System Performance, Database Performance Modelling, Distributed Database Systems, Storage Area Networks, TCP/IP networks and Health IT. She holds a Ph.D. and an M.Tech. from IIT Delhi.

Ing. Gabriele Paciucci is a solution architect in the High Performance Data Division at Intel. In this role, Ing. Paciucci provides technical consultations to partners and customers and evangelizes the Lustre technology worldwide. Gabriele joined Intel in 2013. Previously, Gabriele was a senior software engineer specializing in HPC and cloud solutions based on Open Source software. He has architected a number of high performance computing storage solutions based on Lustre on a variety of hardware platforms since 2006. Gabriele is involved in several Open Source projects and has promoted the adoption of Open Source software and Linux in the enterprise since 2000, when he worked as a software engineer at Red Hat. Ing. Paciucci received his Master's degree in Chemical Engineering from “Università degli Studi di Roma La Sapienza” in 1999.
