Fabrizio Marozzo
Loris Belcastro
Department of Informatics, Modeling, Electronics and Systems (DIMES), University of Calabria, 87036 Rende, Italy
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(20), 10567; https://doi.org/10.3390/app122010567
Submission received: 19 September 2022 / Accepted: 18 October 2022 / Published: 19 October 2022
(This article belongs to the Special Issue Cloud Computing for Big Data Analysis)
With the spread of the Internet of Things, large amounts of digital data are generated and collected from different sources, such as sensors, cameras, in-vehicle infotainment, smart meters, mobile devices, applications, and web services. The large volume of data produced daily, coupled with the speed at which such data are generated and their heterogeneity, has led to interesting new technological challenges in the collection, storage, and analysis of these data. Such data volumes, commonly referred to as big data, can be exploited to extract useful information and produce helpful knowledge for science, industry, and public services [1,2]. Novel technologies, architectures, and algorithms are being developed to capture and analyze big data [3]. For example, in scientific and business fields, researchers and data scientists analyze big data to extract information and knowledge useful for making new discoveries and supporting decision processes [4].
Many researchers have focused their studies on the development of applications for big data analysis in various application fields, including trend discovery, social media analytics, pattern mining, sentiment analysis, and opinion mining. For example, from the analysis of large amounts of user data, we can understand human dynamics and behaviors, including the following: (i) the main tourist attractions and the mobility patterns within a city [5]; (ii) the areas of a city where transport needs to be improved [6] or where it is most suitable to open new businesses [7]; (iii) the purchasing behavior of users browsing an e-commerce site [8]; (iv) the behavior of fans following important sporting events [9]; and (v) the political orientation of citizens, from which the outcome of a political event can be estimated [10].
In this context, cloud computing is a valid and cost-effective solution for supporting big data storage and executing data analytic applications. Cloud computing can be defined as a distributed computing paradigm in which all resources, dynamically scalable and often virtualized, are provided as services over the Internet. As defined by NIST (National Institute of Standards and Technology) [11], cloud computing can be described as follows: “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. From the NIST definition, we can identify five essential characteristics of cloud computing systems, which are on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Due to elastic resource allocation and high computing power, cloud computing represents a compelling solution for big data analytics, allowing faster data analysis and resulting in more timely results and greater data value.
From this perspective, this Special Issue aimed to contribute to the field by presenting the most relevant advances in this research area. Specifically, key scientific fields discussed in the papers that have been selected for this Special Issue include the following:
Programming models and algorithms for distributed computing environments;
Systems for data processing on cloud platforms;
Data analysis workflows for distributed environments;
Scalable data mining algorithms;
Programming models and scalable algorithms for big data;
Big data analytics and applications;
Applications of machine learning in big data;
Cloud-based data mining applications;
Libraries, algorithms, and applications for big social data analysis.
Five papers were accepted for publication in this Special Issue, each focusing on a different topic. The first paper [12] proposes a machine learning approach to predict household energy demand, which exploits the Random Forest algorithm in a cloud environment that scales both horizontally and vertically. Specifically, to ensure scalability and availability for large data volumes, the application has been designed to be executed on a Spark cluster, using the machine learning algorithms included in the native MLlib library.
The second and third papers focus on problems related to the efficiency and throughput of virtual machines and containers in the cloud environment, including performance overload and adaptive resource management issues for big data and scientific workflows. In particular, the second paper [13] investigates the benefits of using the vertical scalability of Docker to implement an adaptive resource management scheme for big data workloads in a container-based cloud environment. During execution, the adaptive resource management scheme periodically monitors the resource usage of running containers and dynamically adjusts the computing resources allocated to them, which results in substantial improvements in overall system throughput. The third paper [14], instead, addresses throughput and efficiency problems of virtual machines and containers in the cloud by exploiting efficient resource provisioning approaches that combine four CPU technologies and methods: hyperthreading, vCPU core selection, vCPU affinity, and vCPU isolation.
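The monitor-and-adjust loop described for the second paper can be illustrated with a minimal Python sketch. This is not the scheme proposed in [13]: the function names `adjust_cpu_limit` and `rebalance`, the utilization thresholds, the step size, and the bounds are all hypothetical values chosen for illustration. Each call takes a container's current CPU limit and its measured utilization and returns a new limit, which is the basic idea behind vertical scaling.

```python
def adjust_cpu_limit(limit, utilization, low=0.3, high=0.8,
                     step=0.5, min_limit=0.5, max_limit=8.0):
    """Return a new CPU limit (in cores) for one container.

    All thresholds and bounds are hypothetical, for illustration only.
    `utilization` is the fraction of the current limit in use (0.0 to 1.0).
    """
    if utilization > high:
        # The container is close to its limit: scale up, within bounds.
        return min(limit + step, max_limit)
    if utilization < low:
        # The container is mostly idle: release resources, within bounds.
        return max(limit - step, min_limit)
    # Utilization is in the comfortable band: leave the limit unchanged.
    return limit


def rebalance(containers):
    """Apply one monitoring round to a mapping name -> (limit, utilization)."""
    return {name: adjust_cpu_limit(limit, util)
            for name, (limit, util) in containers.items()}


if __name__ == "__main__":
    # Made-up snapshot of three containers.
    snapshot = {
        "busy": (2.0, 0.95),
        "idle": (2.0, 0.10),
        "steady": (2.0, 0.50),
    }
    print(rebalance(snapshot))
```

In a real deployment, a periodic monitor would call a function of this kind and then apply the returned limits to the running containers; how that is done depends on the container runtime and is outside the scope of this sketch.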
The fourth paper [15] presents three social big data analysis applications, defined and executed in parallel on a cloud platform by using ParSoDA [16], a programming library written in Java that enables developers to create cloud-based parallel applications for analyzing large volumes of social media data. These applications analyze data from three different perspectives: (i) discovering the main tourist attractions and the mobility patterns (i.e., trajectories) from geotagged posts [17]; (ii) understanding the political orientation of social media users so as to predict the outcome of political events [18]; and (iii) analyzing the hashtags used by social media users to discover the main topics underlying social media conversations and how users refer to them when publishing online content [19].
Finally, the last paper [20] investigates the use of two supervised classification algorithms (i.e., Random Forest and K-Nearest Neighbor) to predict the behavior of criminal networks and turn it into useful information using natural language processing (NLP). Specifically, the authors extracted an unstructured database containing data on the crimes committed. Then, to estimate the criminals’ next actions, they performed a hotspot-based spatial analysis, whose results are passed to the two classifiers for classification and prediction.
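To illustrate the second of the two classifiers named above, the following is a minimal K-Nearest Neighbor sketch in plain Python. It is not the pipeline used in [20]: the coordinates, the labels, and the function name `knn_predict` are invented toy values, and a real study would rely on a tested library implementation. A query point is labeled by majority vote among its k closest training points.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; Euclidean distance is used.
    """
    # Sort the training points by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Count the labels of those neighbors and return the most frequent one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy, made-up data: a 2D location and a category label per point.
    train = [
        ((0.0, 0.0), "category_a"), ((0.1, 0.2), "category_a"),
        ((0.2, 0.1), "category_a"), ((5.0, 5.0), "category_b"),
        ((5.1, 4.9), "category_b"), ((4.8, 5.2), "category_b"),
    ]
    print(knn_predict(train, (0.3, 0.3)))
    print(knn_predict(train, (4.9, 5.0)))
```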
Although the Special Issue has been closed, much research remains to be done on big data and cloud-based analysis, where many issues still need to be addressed, particularly regarding the management and mining of large-scale data archives. As an example, an open issue is the design and optimization of data-intensive computing platforms with a very large number of CPU cores, such as the recent exascale systems. Exascale systems are highly parallel computing systems capable of at least one exaFLOPS (10^18 floating-point operations per second). Their implementation therefore represents a major challenge from a technological and research point of view [21]. The design and development of exascale systems is currently under investigation. Programming paradigms traditionally used in HPC systems (e.g., MPI, OpenMP, OpenCL, MapReduce, and HPF) are neither sufficient nor appropriate for programming software designed to run on systems composed of a very large set of computing elements. To reach exascale size, new programming models and languages must be defined that combine abstraction with both scalability and performance [22]. Hybrid models (shared/distributed memory) and communication mechanisms based on locality and grouping are currently being investigated as promising approaches.
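To give a sense of the scale involved, the short calculation below estimates how many cores a system needs to reach one exaFLOPS, i.e., 10^18 floating-point operations per second. The per-core rate of 10 GFLOPS is an assumed round figure used only for illustration, not a measurement of any specific processor, and the function name `cores_needed` is hypothetical.

```python
EXAFLOPS = 10**18  # floating-point operations per second


def cores_needed(target_flops, flops_per_core):
    """Smallest number of cores whose combined peak rate reaches target_flops."""
    # Ceiling division on integers avoids floating-point rounding.
    return -(-target_flops // flops_per_core)


if __name__ == "__main__":
    assumed_core_rate = 10 * 10**9  # hypothetical: 10 GFLOPS per core
    print(cores_needed(EXAFLOPS, assumed_core_rate))  # 100000000
```

Under this assumption, about one hundred million cores would be needed, which shows why programming models designed for far smaller systems struggle at this scale.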
Data-intensive applications running on exascale systems need to control millions of threads running on a very large set of cores. Such applications will need to avoid or limit synchronization, use less communication and remote memory, and handle the software and hardware faults that may occur. Currently, no available programming language provides solutions to these issues, especially when data-intensive applications are targeted. From a software point of view, these new platforms pose major challenges for software tools and runtime systems, which must be able to handle an extremely high degree of parallelism, communication, and data locality. Porting existing data analysis algorithms (or developing new ones) and designing novel fine-grained runtime models to exploit exascale hardware will be a focus of research in the coming years.
All the authors contributed equally to the structuring, writing and review of this paper. All authors have read and agreed to the published version of the manuscript.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Marozzo, F.; Belcastro, L. Cloud Computing for Big Data Analysis. Appl. Sci. 2022, 12, 10567. https://doi.org/10.3390/app122010567