Small Files Problem Resolution via Hierarchical Clustering Algorithm.

Big Data

School of Industrial Engineering and Management, Afeka Tel-Aviv Academic College of Engineering, Tel-Aviv, Israel.

Published: June 2024

The Small Files Problem in the Hadoop Distributed File System (HDFS) is an ongoing challenge that has not yet been solved, although various approaches have been developed to tackle the obstacles it creates. Properly managing the size of blocks in a file system is essential because it saves memory and computing time and may reduce bottlenecks. This article proposes a new approach to dealing with small files that uses a Hierarchical Clustering Algorithm. The proposed method identifies files by their structure, applies a special Dendrogram analysis, and then recommends which files can be merged. As a simulation, the proposed algorithm was applied to 100 CSV files with different structures, containing 2-4 columns with different data types (integer, decimal, and text). In addition, 20 non-CSV files were created to demonstrate that the algorithm works only on CSV files. All data were analyzed with a machine learning hierarchical clustering method, and a Dendrogram was created. Following the merge process, seven files from the Dendrogram analysis were chosen as appropriate to be merged, which reduced the memory space used in HDFS. The results also showed that the suggested algorithm led to efficient file management.
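
The abstract does not give the feature encoding, distance measure, or linkage rule, so the following is only a minimal sketch of the general idea (group CSV files by structure, then treat each multi-file group as a merge candidate), not the authors' implementation. Everything named here is an assumption made for illustration: the helpers `infer_type`, `signature`, `distance`, and `single_linkage`, the choice to infer column types from the first data row, the mismatch-count distance, the single-linkage rule, and the sample files.

```python
import csv
import io
from itertools import combinations


def infer_type(value):
    """Classify one cell as 'integer', 'decimal', or 'text' (assumed scheme)."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        return "text"


def signature(csv_text):
    """Structural signature: column types of the first data row (row 0 is the header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return tuple(infer_type(cell) for cell in rows[1])


def distance(a, b):
    """Assumed distance: mismatched column types plus difference in column count."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def single_linkage(signatures, threshold):
    """Agglomerative clustering; stops once the closest pair exceeds the threshold."""
    clusters = [[name] for name in sorted(signatures)]
    merges = []  # (left cluster, right cluster, distance): the dendrogram steps
    while len(clusters) > 1:
        d, i, j = min(
            (min(distance(signatures[a], signatures[b]) for a in c1 for b in c2), i, j)
            for (i, c1), (j, c2) in combinations(enumerate(clusters), 2)
        )
        if d > threshold:
            break
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters, merges


# Hypothetical sample files, standing in for the simulated CSV files.
files = {
    "a.csv": "id,price\n1,2.5\n",
    "b.csv": "id,price\n7,9.75\n",
    "c.csv": "name,city,age\nAda,Paris,36\n",
    "d.csv": "name,city,age\nBob,Rome,41\n",
}
signatures = {name: signature(text) for name, text in files.items()}
clusters, merges = single_linkage(signatures, threshold=0)
merge_candidates = [c for c in clusters if len(c) > 1]
print(merge_candidates)
```

With a threshold of 0, only files with identical structure are grouped, so `a.csv` and `b.csv` form one candidate group and `c.csv` and `d.csv` another. Raising the threshold corresponds to cutting the dendrogram higher and would allow structurally similar, but not identical, files to be grouped.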

Source
http://dx.doi.org/10.1089/big.2022.0181
