Research on Handling Data Skew in MapReduce

WANG Gang; LI Sheng-en

Author(s): WANG Gang, LI Sheng-en
Pages: 201-204
Year: 2016 Issue: 9
Journal: Computer Technology and Development
Keywords: big data; load balancing; sampling
Abstract: With the rapid development of the mobile Internet and the Internet of Things, data volumes are growing explosively, and we have entered the era of big data. As a distributed computing framework, MapReduce can process massive data sets and has become a focus of big data research. However, the performance of MapReduce depends on how the data is distributed. The default hash partition function in MapReduce cannot guarantee load balancing when the data is skewed, so job completion time is determined by the node that has the most data to process. To solve this problem, this paper uses sampling: before the user's job is run, a separate MapReduce job samples the input. Once the key distribution has been learned, the data is partitioned so that load is balanced, taking data locality into account. A WordCount example is tested on an experimental platform. The results show that sampling-based partitioning outperforms hash partitioning, and that sampling-based partitioning with data locality performs much better than sampling-based partitioning without it.
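
The idea of sampling keys first and then partitioning by the learned distribution can be illustrated with a minimal Python sketch. This is not the authors' implementation: it runs in a single process rather than on MapReduce, it ignores data locality entirely, and it assumes a simple greedy rule (heaviest sampled key goes to the least-loaded reducer). All function names here (hash_partition, sample_key_counts, build_balanced_assignment, make_partitioner, reducer_loads) are hypothetical and chosen only for illustration.

```python
import heapq
import random
import zlib
from collections import Counter


def hash_partition(key, num_reducers):
    """Deterministic stand-in for a default hash partitioner (assumption: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def sample_key_counts(records, sample_rate, seed=0):
    """Estimate key frequencies by keeping each record with probability sample_rate."""
    rng = random.Random(seed)
    return Counter(key for key in records if rng.random() < sample_rate)


def build_balanced_assignment(key_counts, num_reducers):
    """Greedy assignment: heaviest keys first, each to the least-loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count, then by key, so the result is deterministic.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def make_partitioner(assignment, num_reducers):
    """Use the sampled assignment; fall back to hashing for keys the sample missed."""
    def partition(key):
        if key in assignment:
            return assignment[key]
        return hash_partition(key, num_reducers)
    return partition


def reducer_loads(records, num_reducers, partition):
    """Count how many records each reducer would receive under a partitioner."""
    loads = [0] * num_reducers
    for key in records:
        loads[partition(key)] += 1
    return loads


if __name__ == "__main__":
    num_reducers = 4
    # A made-up skewed WordCount-style key stream: a few very frequent words.
    records = ["the"] * 500 + ["of"] * 300 + ["and"] * 200
    records += [f"word{i}" for i in range(200)]

    sampled = sample_key_counts(records, sample_rate=0.1)
    assignment = build_balanced_assignment(sampled, num_reducers)

    by_hash = reducer_loads(
        records, num_reducers, lambda k: hash_partition(k, num_reducers)
    )
    by_sample = reducer_loads(
        records, num_reducers, make_partitioner(assignment, num_reducers)
    )
    print("hash partition loads:   ", by_hash)
    print("sampled partition loads:", by_sample)
```

In this sketch a key is never split across reducers, so one dominant key still bounds the best achievable balance. Extending the assignment to prefer reducers that already hold a key's data would be one way to model the data-locality step the abstract describes, but that step is not shown here.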
