
Annals of Telecommunications manuscript No. (will be inserted by the editor)

A Simulation as a Service Cloud Middleware

Shashank Shekhar, Hamzah Abdel-Aziz, Michael Walker, Faruk Caglar, Aniruddha Gokhale, Xenofon Koutsoukos

Received: date / Accepted: date

Abstract Many seemingly simple questions that individual users face in their daily lives may actually require a substantial number of computing resources to identify the right answers. For example, a user may want to determine the right thermostat settings for different rooms of a house based on a tolerance range such that the energy consumption and costs can be maximally reduced while still offering comfortable temperatures in the house. Such answers can be determined through simulations. However, some simulation models, as in this example, are stochastic, which requires the execution of a large number of simulation tasks and aggregation of results to ascertain if the outcomes lie within specified confidence intervals. Other simulation models, such as the study of traffic conditions using simulations, may need multiple instances to be executed for a number of different parameters. Cloud computing has opened up new avenues for individuals and organizations with limited resources to obtain answers to problems that hitherto required expensive and computationally-intensive resources. This paper presents SIMaaS, a cloud-based Simulation-as-a-Service that addresses these challenges. We demonstrate how lightweight solutions using Linux containers (e.g., Docker) are better suited to support such services than heavyweight hypervisor-based solutions, which are shown to incur substantial overhead in provisioning virtual machines on demand. Empirical results validating our claims are presented in the context of two case studies.

The final publication is available at Springer via http://dx.doi.org/10.1007/s12243-015-0475-6

Shashank Shekhar, Hamzah Abdel-Aziz, Michael Walker, Faruk Caglar, Aniruddha Gokhale and Xenofon Koutsoukos
Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37235, USA

E-mail: {shashank.shekhar,hamzah.abdelaziz,michael.a.walker.1,faruk.caglar,a.gokhale,xenonfon.koutsoukos}@vanderbilt.edu


Keywords Cloud Computing · Middleware · Linux Container · Simulation-as-a-Service

1 Introduction

With the advent of the Internet of Things (IoT) paradigm [7], which involves the ubiquitous presence of sensors, there is no dearth of collected data. When coupled with technology advances in mobile computing and edge devices, users are expecting newer and different kinds of services that will help them in their daily lives. For example, users may want to determine appropriate temperature settings for their homes such that their energy consumption and energy bills are kept low yet they have comfortable conditions in their homes. Other examples include estimating traffic congestion in a specific part of a city on a special events day. Any service meant to find answers to these questions will very likely require a substantial number of computing resources. Moreover, users will expect a sufficiently low response time from the services.

Deploying these services in-house is unrealistic for the users since the models of these systems are quite complex to develop. Some models may be stochastic in nature, which requires a large number of compute-intensive executions of the models to obtain outcomes that are within a desired statistical confidence interval. Other kinds of simulation models require running a large number of simulation instances with different parameters. Irrespective of the simulation model, individual users and even small businesses cannot be expected to acquire the needed resources in-house. Cloud computing then becomes an attractive option to host such services, particularly because hosting high performance and real-time applications in the cloud is gaining traction [4, 32]. Examples include soft real-time applications such as online video streaming (e.g., Netflix hosted in Amazon EC2), gaming (Microsoft's Xbox One and Sony's PlayStation Now) and telecommunication management [19].

Given these trends, it is important to understand the challenges in hosting such simulations in the cloud. To that end we surveyed prior efforts [17, 27, 28, 29] that focused on deploying parallel discrete event simulations (PDES) [16] in the cloud, which reveal that the performance of the simulation deteriorates as the size of the cluster distributed across the cloud increases. This occurs primarily due to the limited bandwidth and the overhead of the time synchronization protocols needed in the cloud [42]. Thus, cloud deployment for this category of simulations is still limited.

Despite these insights, we surmise that there is another category of simulations that can still benefit from cloud computing. For example, complex system simulations that require statistical validation, or those that compare simulation results under different constraints and parameter values, often need to be run repeatedly and are well suited to cloud hosting. Running these simulations sequentially is not a viable option as user expectations in terms of response times have to be met. Hence there is a need for a simulation platform where a large number of independent simulation instances can be executed in parallel and the number of such simulations can vary elastically to satisfy specified confidence intervals for the results. Cloud computing becomes an attractive platform to host such capabilities [41]. To that end we have architected a cloud-based solution comprising resource management algorithms and middleware called Simulation-as-a-Service (SIMaaS).

It is possible to realize SIMaaS on top of a traditional cloud infrastructure, which utilizes a virtual machine (VM)-based data center to provide resource sharing. However, in a scenario where real-time decisions have to be made based on running a large number of multiple, short-duration simulations in parallel, the considerable setup and tear down overhead imposed by VMs, as demonstrated in Section 5.2, is unacceptable. Likewise, a solution based on maintaining a VM pool, which is used by many cloud resource management frameworks such as [24, 12, 44, 18], is not suitable either since it can lead to resource wastage and may not be able to cater to sudden increases in service demand. Thus, a lightweight solution is desired.

To address these challenges, we make the following key contributions in this paper:

– We propose a cloud middleware for SIMaaS that leverages Linux container [30]-based infrastructure, which has low runtime overhead, a higher level of resource sharing, and very low setup and tear down costs.

– We present a resource management algorithm that reduces the cost to the service provider and enhances the parallelization of the simulation jobs by fanning out more instances until the deadline is met, while simultaneously auto-tuning itself based on feedback.

– We show how the middleware generates different configurations for experimentation, and intelligently schedules the simulations on the Linux container-based cloud to minimize cost while enforcing the deadlines.

– Using two case studies, we show the viability of a Linux container-based SIMaaS solution, and illustrate the performance gains of a Linux container-based approach over hypervisor-based traditional virtualization techniques used in the cloud.

The rest of this paper is organized as follows: Section 2 compares relevant related work with our contributions; Section 3 provides two use cases that drive the key requirements that are met by our solution; Section 4 presents the system architecture in detail; Section 5 validates the effectiveness of our middleware; and finally, Section 6 presents concluding remarks alluding to lessons learned and opportunities for future work.

2 Related Work

This section presents relevant related work and compares it with our contributions. We organize the related work along three dimensions: simulations hosted in the cloud, cloud frameworks that provide resource management with deadlines, and container-based approaches. These dimensions of related work are important because realizing SIMaaS requires effective resource management at the cloud infrastructure level to manage the lifecycle of containers that host and execute the simulation logic such that user-specified deadlines are met.

2.1 Related Work on Cloud-based Simulations

The mJADES [35] effort is closest to our approach in terms of its objective of supporting simulations in the cloud. It is founded on a Java-based architecture and is designed to run multiple concurrent simulations while automatically acquiring resources from an ad hoc federation of cloud providers. DEXSim [15] is a distributed execution framework for replicated simulations that provides two-level parallelism, i.e., at the CPU core-level and at the system-level. This organization delivers better performance to their system. In contrast, SIMaaS does not provide any such scheme; rather it relies on the OS to make effective use of the multiple cores on the physical server by pinning container processes to cores. The RESTful interoperability simulation environment (RISE) [3] is a cloud middleware that applies RESTful APIs to interface with the simulators and allows remote management through Android-based handheld devices. Like RISE, SIMaaS also uses RESTful APIs for clients to interact with our service and for the internal interaction between the containers and the management solution.

In contrast to these works, SIMaaS applies an adaptive resource scheduling policy to meet the deadlines based on the current system performance. Also, our solution uses Linux containers that are more efficient and more suitable to the kinds of simulations hosted by SIMaaS than the VM-based approaches used by these solutions.

CloudSim [11] is a toolkit for modeling and simulating VMs, data centers and resource allocation policies without incurring any cost, which in turn helps to measure feasibility and tune performance bottlenecks. EMUSIM [13] enhances CloudSim by integrating an emulator to achieve the same purpose. SimGrid [14] is another distributed systems simulator used to improve the algorithms for data management infrastructure. We believe that the contributions of SIMaaS are orthogonal to these works. These related projects provide the platforms to evaluate resource allocation algorithms in the cloud, while SIMaaS is a concrete realization of infrastructure middleware that supports different resource allocations. SIMaaS can benefit from these related works in that resource management algorithms can first be evaluated in these platforms and then deployed in the SIMaaS middleware. Additionally, we believe these related works do not yet support Linux container-based simulation of the cloud.


2.2 Related Work on Cloud Resource Management

There has been some work in cloud resource management to meet deadlines. Aneka [12] is a cloud platform that supports quality of service (QoS)-aware provisioning and execution of applications in the cloud. It supports different programming models, such as bag of tasks, distributed threads, MapReduce, actors and workflows. Our work on SIMaaS applies an advanced version of a resource management algorithm that is used by Aneka in the context of our Linux container-based lightweight virtualization solution. Aneka also provides algorithms to provision hybrid clouds to minimize the cost and meet deadlines. Although SIMaaS does not use hybrid clouds, our future work will consider some of the functionalities from Aneka.

Another work close to our resource allocation policy is [10], which employs a cost-efficient scheduling heuristic to meet the deadline. However, this work is sensitive to execution time estimation error, whereas our work self-tunes based on feedback.

CometCloud [24] is a cloud framework that provides autonomic workflow management by addressing changing computational and QoS requirements. It adapts both the application and the infrastructure to fulfill its purpose. CLOUDRB [40] is a cloud resource broker that integrates a deadline-based job scheduling policy with a particle swarm optimization-based resource scheduling mechanism to minimize both cost and execution time to meet a user-specified deadline. Zhu et al. [44] employed a rolling-horizon optimization policy to develop an energy-aware cloud data center for real-time task scheduling. All these efforts provide scheduling algorithms to meet deadlines on virtual machine-based cloud platforms where they maintain a VM pool and scale up or down based on constraints. In contrast to these efforts, our work uses a lightweight virtualization technology based on Linux containers, which provides significant performance improvement and mitigates the need to keep a pool of VMs or containers. We also apply a heuristic-based feedback mechanism to ensure deadlines are met with minimum resources.

In prior work [37, 26], we have designed and deployed multi-layered resource management algorithms integrated with higher-level task (re-)planning mechanisms to provide performance assurances to distributed real-time and embedded applications. These algorithms were integrated within middleware solutions that were deployed on a distributed cluster of machines, which can be viewed as small-scale data centers. These prior works focused primarily on affecting the application, such as migrating application components, load balancing, fault tolerance, deployment planning and, to some extent, scheduling. We view these prior works of ours as complementary to the current work. In the present work, we are more concerned with allocating resources on-demand. A more significant point of distinction is that the prior works focused on distributed applications that are long running, while in the current work we are focusing on applications that have a short running time but where we need to execute a large number of copies of the same application.


2.3 Related Work using Linux Containers

The Docker [34] open source project that we utilize in our framework automates the deployment of applications via software containers utilizing operating system (OS)-level virtualization. Docker itself is not an OS-level virtualization solution; rather, it uses interchangeable execution environments such as Linux Containers (LXC) and its own libcontainer library to provide container access and control.

Previous work exists on the creation [33] and benchmark testing [39] of generic Linux-based containers. Similarly, there exists work that uses containers as a means to provide isolation and a lightweight replacement to hypervisors in use cases such as high performance computing (HPC) [43], reproducible network experiments [21], and peer-to-peer testing environments [8]. The demands and goals of each of these three efforts focus on a different aspect of the benefit stemming from the use of containers. For HPC, the effort focused more on the lightweight nature of containers versus hypervisors. The peer-to-peer testing work focused on the isolation capabilities of containers, whereas the reproducible network experiments paper focused more on the isolation features and the ability to distribute containers as deliverables for others to use in their own testing. Our work leverages or can leverage all these benefits.

3 Motivating Use Cases and Key Requirements for SIMaaS

We now present two use cases belonging to systems modeling that we have used in this paper to bring out the challenges that SIMaaS should address, and to evaluate its capabilities.

3.1 System Modeling Use Cases

System modeling for simulations is a rich area that has been used in a wide range of different engineering disciplines. The type of system modeling depends on the nature of the system to be modeled and the level of abstraction needed to be achieved through the simulation. We use two use cases to highlight the different types of simulations that SIMaaS is geared to support.

3.1.1 Use Case 1: The Multi-room Heating System

In use case 1, we target complex engineering systems which exhibit continuous, discrete, and probabilistic behaviors, known as stochastic hybrid systems (SHS). The computer model we use to construct a formal representation of an SHS and to mathematically analyze and verify it in a computer system is the discrete time stochastic hybrid system (DTSHS) model [1].

We discuss here a DTSHS model of a multi-room heating system [5] with its discretized model developed by [2]. The multi-room heating system consists of h rooms and a limited number of heaters n, where n < h. Each room has at most one heater at a time. Moreover, each room has its own user setting (i.e., constraints) for temperature. However, the rooms exchange heat with their adjacent rooms and with the ambient environment.

Each room's heater switches independently of the heater status of other rooms and their temperatures. The system has a hybrid state where the discrete component is the state of the individual heater, which can be in the ON or OFF state, and the continuous state is the room temperature. A discrete transition function switches the heater's status in each room using a typical controller that switches the heater on if the room temperature drops below a certain threshold x_l and switches the heater off if the room temperature exceeds a certain threshold x_u.

The main challenge for our use case is the limited number of heaters and the need for a control strategy to move a heater between the rooms. Typical system requirements that can be evaluated using simulations are:

– The temperature in each room must always remain above a certain threshold (i.e., user comfort level).

– All rooms share heaters with other rooms (i.e., acquire and relinquish a heater).

In our model of the system, we have used one of many possible strategies where room i can acquire a heater with a probability p_i if:

– p_i ∝ get_i − x_i when x_i < get_i.
– p_i = 0 when x_i ≥ get_i.

where get_i is the control threshold used to determine when room i needs to acquire a heater. The simulation model for this use case uses statistical model checking by Bayesian Interval Estimates [45]. A small sketch of this switching and acquisition logic is given below.
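To make the control strategy concrete, the following minimal Python sketch illustrates both the hysteresis switching on the thresholds x_l, x_u and the probabilistic heater acquisition p_i ∝ get_i − x_i described above. Normalizing the acquisition probabilities across the candidate rooms is our own illustrative assumption; the paper does not specify how the proportionality constant is chosen.

    import random

    # Hysteresis controller: turn the heater on below x_l, off above x_u (sketch).
    def switch_heater(heater_on, temperature, x_l, x_u):
        if temperature < x_l:
            return True
        if temperature > x_u:
            return False
        return heater_on  # keep the previous state inside the band

    # Probabilistic heater acquisition: p_i proportional to (get_i - x_i) when x_i < get_i.
    def pick_room_for_free_heater(temperatures, get_thresholds):
        weights = {}
        for room, x_i in temperatures.items():
            get_i = get_thresholds[room]
            weights[room] = max(get_i - x_i, 0.0)  # p_i = 0 when x_i >= get_i
        total = sum(weights.values())
        if total == 0.0:
            return None  # no room needs a heater
        r = random.uniform(0.0, total)
        acc = 0.0
        for room, w in weights.items():
            acc += w
            if r <= acc:
                return room
        return None

    # Example: three rooms competing for one free heater.
    temps = {"room1": 17.5, "room2": 20.2, "room3": 18.9}
    gets = {"room1": 19.0, "room2": 19.0, "room3": 19.0}
    print(pick_room_for_free_heater(temps, gets))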

3.1.2 Use Case 2: Traffic Simulation for Varying Traffic Density

Use case 2 targets transportation researchers and traffic application providers, such as Transit Now (http://transitnownashville.org/), who want to model and simulate different traffic scenarios within a relevant time window but do not have sufficient resources to do it in-house. We motivate this use case with a microscopic traffic simulator called SUMO [9] that can simulate city-level traffic. The simulator can import a city map in OpenStreetMap [20] format into its own custom format. The user can supply various input parameters such as the number of vehicles, traffic signal logic, turning probability, and maximum lane speed, and study their impact on traffic congestion.

One such "what if" scenario involves the user changing the number of vehicles moving in a particular area of the city and studying its impact. In contrast to use case 1, where all the stochastic simulation instances had nearly the same execution time in ideal conditions, in this use case the simulation execution time varies with the input number of vehicles. Figure 1 illustrates how the execution time varies with the number of vehicles for a simulated duration of 1000 seconds.



Fig. 1: Simulation Execution Time vs. Number of Vehicles

3.2 Problem Statement and Key Requirements for SIMaaS

Based on the two use cases described above, we now bring out the key requirements that must be satisfied by SIMaaS. Addressing these requirements forms the problem statement for our research presented in this paper.

• Requirement 1: Ability to Elastically Execute Multiple Simulations – Recall that the simulation model for use case 1 is stochastic, which means that every simulation execution instance may yield a different simulation trajectory and results. To overcome this problem, we have to use the statistical model checking (SMC) approach based on Bayesian statistics [44, 45]. SMC is a verification method that provides statistical evidence to check whether or not a stochastic system satisfies a wide range of temporal properties with a certain probability and confidence level. The probability that the model satisfies a property can be estimated by running several different simulation trajectories of the model and dividing the number of satisfying trajectories (i.e., true properties) by the total number of simulations. Thus, SMC requires the execution of a large number of simulation tasks.
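As an illustration of this estimation step (not the Bayesian interval estimation of [45]), the following sketch runs N independent simulation trajectories and reports the fraction that satisfy the property, together with a simple normal-approximation confidence interval. The run_trajectory function is a placeholder for a single containerized simulation run.

    import math
    import random

    def run_trajectory() -> bool:
        """Placeholder for one simulation run; returns True if the property held."""
        return random.random() < 0.8  # stand-in for a real simulation outcome

    def estimate_probability(num_runs: int, z: float = 1.96):
        satisfied = sum(run_trajectory() for _ in range(num_runs))
        p_hat = satisfied / num_runs
        # Normal-approximation confidence interval (illustrative only).
        half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / num_runs)
        return p_hat, (p_hat - half_width, p_hat + half_width)

    p, ci = estimate_probability(500)
    print(f"estimated probability {p:.3f}, ~95% CI {ci}")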

On the other hand, although the simulation models for use case 2 are not stochastic, the result of the simulation will often be quite different depending on the parameters supplied to the model. For example, varying the number of vehicles on the road, the number of traffic lights, the number of lanes, and the speed limits will all generate different results. A user may be interested in knowing the results for various scenarios, which in turn requires a number of simulations to be executed seeded with different parameter values. In addition, the service will be used by multiple users who need to execute different numbers of simulations, which is not known to the system a priori. This requirement suggests the need to elastically scale the number of simulation instances to be executed.

In summary, the two use cases require that SIMaaS be able to elastically scale the number of simulations that must be executed.

• Requirement 2: Bounded Response Time – In both our use cases, the user expects that the system respond to their requests within a reasonable amount of time. Thus, the execution of a large number of simulations that are elastically scheduled on the cloud platform, and the result aggregation, must be accomplished within a bounded amount of time so that it is of any utility to the user. Moreover, use case 2 illustrates an additional challenge that requires estimating the expected execution time for previously unknown parameters and ensuring that the system can still respond to user requests in a timely manner.

In summary, SIMaaS must ensure bounded response times to user requests.

• Requirement 3: Result Aggregation – Both our use cases highlight the need for result aggregation. In use case 1, there is a need to aggregate the results from the large number of model executions to illustrate the confidence intervals for the results. In use case 2, the user will need a way to aggregate the results of each run corresponding to the parameter values. Since SIMaaS is meant to be a broadly applicable service, it will require the user to supply the appropriate aggregation logic corresponding to their needs.

In summary, SIMaaS needs an ability to accept user-supplied result aggregation logic, apply it to the results of the simulations, and present the results to the user.

• Requirement 4: Web-based Interface – Since SIMaaS is envisioned as a broadly applicable cloud-based simulation-as-a-service, it will not know the details of the user's simulation model. Instead, it will require the user to supply a simulation model of their system and various parameters to indicate how SIMaaS should run their models. For example, since use case 1 requires stochastic model checking, it will require a large number of simulation trajectories to be executed. Thus, SIMaaS will require the user to supply the simulation image and specify how many such simulations should be executed, the building layout, the number of heaters, the strategy used and so on. Similarly, for use case 2, SIMaaS will need to know how the model should be seeded with different parameter values and how they should be varied, which in turn will dictate the number of simulations to execute and their execution times. Finally, the aggregated results must somehow be displayed to the user.

In summary, SIMaaS should provide a web-based user interface to the users so they can supply both the simulation model and the parameters as well as receive the results using the interface.


4 SIMaaS Cloud Middleware Architecture

A cloud platform is an attractive choice to address the requirements highlighted in Section 3.2 because it can elastically and on-demand execute the multiple different simulation trajectories of the simulation models in parallel, and perform aggregation such as SMC to obtain results within a desired confidence interval. The challenge stems from provisioning these simulation trajectories in the cloud in real-time so that the response times perceived by the user are acceptable. To that end we have architected the SIMaaS cloud-based simulation-as-a-service and its associated middleware as shown in Figure 2. The remainder of this section describes the architecture and shows how it addresses all the requirements outlined earlier.

Fig. 2: System Architecture

4.1 Dynamic Resource Provisioning Algorithm: Addressing Requirements 1 and 2

Requirement 1 calls for elastic deployment of a large number of simulation executions depending on the use case category, which needs dynamic resource management. Requirement 2 calls for timely responses to user requests. Thus, the dynamic resource management algorithm should be geared towards meeting the user needs.

For this paper, we define a QoS-based resource allocation policy that allocates containers for each requested simulation model such that its deadline is met and its cost, i.e., the number of assigned containers, is minimized. We assume that the user provides the following inputs to our allocation algorithm: the simulation model, the number of simulations and their corresponding simulation parameters, and the estimated execution time for some of the simulation parameters.

Formally, we define the requested execution of the simulation model as a job. Each job is made up of several different tasks, each representing an execution instance of that simulation job. At any instant in time k, J(k) is the set of jobs which our allocation algorithm handles. Furthermore, for the j-th job, J_j(k) ∈ J(k), we define its deadline as DL_j, the number of containers it uses as B_j(k), its i-th simulation task as T_ij(k), the simulation parameter of its i-th task as θ_ij and, finally, the expected execution time of its i-th task with its corresponding parameter θ_ij as E_θij(T_ij(k)).

The primary objective of our allocation algorithm is to minimize the resource usage cost, considered in terms of the number of containers used to serve the user simulation request, while maintaining the user constraint stated as meeting the deadline. To formalize this objective, we define it as the following optimization problem:

    min_{B_j}  c(K) = Σ_j Σ_k B_j(k) = c(K − 1) + Σ_j B_j(K)

    subject to  ∀ j ∈ J(k):  R_j(k) / B_j(k) = ( Σ_i E_θij(T_ij(k)) ) / B_j(k) ≤ DL_j

where c(k) is the cost function at time instant k, and R_j(k) is the j-th job's total execution time, which is equal to the summation of the execution times E_θij(T_ij(k)) of all the unserved tasks T_ij(k). This constraint equation calculates the total time a job would take to finish if its simulations' executions have been parallelized using B_j(k) containers. Therefore, it bounds the selection of B_j(k) such that each job finishes before or by the deadline. To tackle this problem, we developed a simple heuristic, shown in Algorithm 1, for efficiently selecting the minimum B_j(k) such that each job finishes its required simulation tasks by its deadline.

Formally, we calculate B_j using the following formula. To simplify the notation, we will omit the time index k throughout the remainder of this section:

    B_j = R_j / DL_j

Two major challenges arise when calculating B_j based on the above formula. First, it is difficult to calculate R_j analytically in a mathematically closed form because a task's execution time varies based on many dynamic factors such as performance interference, the overbooking ratio, etc. Second, the execution times of a job's tasks are not necessarily identical when they have different simulation parameters θ_ij, as demonstrated in Figure 1. Therefore, the above formula is not accurate and we may need to increase the value of B_j calculated above in order to meet the deadline. Moreover, scheduling the job's tasks T_ij in the reserved containers B_j is a non-trivial task.


Input: J, α
while TRUE do
    Wait for (max(feedback event, minimum period));
    foreach J_j ∈ J do
        // Update only jobs with new feedback data
        if HasFeedback(J_j) then
            // Update the estimated error factor
            F_j ← UpdateErrorFactor(E*_θij(T_ij));
            // Update the estimated execution time function
            UpdateExecutionTimeFunction(F_j);
            // Update the number of containers and their scheduling
            extraContainersNeeded ← BestFitDecreasing(DL_j) − B_j;
            if extraContainersNeeded > 0 then
                Reserve(extraContainersNeeded, J_j);
            // Avoid frequent resource allocation and de-allocation
            else if extraContainersNeeded < threshold then
                Release(extraContainersNeeded, J_j);
            end
        end
    end
end

Algorithm 1: QoS-based Resource Allocation Policy

To overcome the first challenge, we base our heuristic calculation of B_j on an estimated execution time E′_θij(T_ij) of each task T_ij and periodically update this estimate, and consequently the B_j calculation, based on the feedback of the actual executed time E*_θij(T_ij). This simple feedback mechanism allows us to maintain our algorithm objective mentioned above while tolerating the effect of estimation errors and handling the dynamic changes in our system environment (e.g., performance interference). Furthermore, our algorithm has to wait for at least a minimum period of time even if there is new feedback, in order to enhance the algorithm's performance by avoiding highly frequent recalculation. In addition, we recalculate F_j and E′_θij(T_ij) in every iteration, and the algorithm is executed only for jobs that have new feedback data.

In order to estimate the execution time R_j of the j-th job, we use an initial execution time function E_θij(T_ij) as the initial function of our estimator. Since the user provides the estimated execution time for only a few parameters, we use a regression algorithm based on a sequential exponential kernel [36] to build the initial function E_θij(T_ij) using the data points provided by the user. Then, we update this function by an error factor F_j calculated using the feedback data. The calculation of the error factor F_j and the estimated execution time function E′_θij(T_ij) are shown in the following equations:

    E′_θij(T_ij) = E_θij(T_ij) × (1 + F_j)

such that:

    F_j = E[ΔE_θij(T_ij) / E_θij(T_ij)] + α × √( Var[ΔE_θij(T_ij) / E_θij(T_ij)] )


    ΔE_θij(T_ij) = E_θij(T_ij) − E*_θij(T_ij)

where α ≥ 0 is an estimator parameter that determines how pessimistic our estimator is, because F_j covers more errors as α increases. For example, when α = 1, 2, 3, F_j will approximately cover the worst-case scenario of 68%, 95%, and 99.7% of the feedback error values, respectively. For a real-time implementation of the error factor calculation, we use an online algorithm developed by Knuth [25] to calculate E[.] and Var[.] incrementally, in order to avoid saving and inspecting the entire feedback data every time a new feedback entry arrives.
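The incremental mean/variance computation referenced above is commonly implemented with Welford's online update (the algorithm described by Knuth); a minimal Python sketch follows. The class and function names are ours, not taken from the SIMaaS implementation.

    class OnlineStats:
        """Welford's online algorithm for incremental mean and variance."""

        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # sum of squared deviations from the running mean

        def update(self, x: float) -> None:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def variance(self) -> float:
            return self.m2 / self.n if self.n > 0 else 0.0

    def error_factor(stats: OnlineStats, alpha: float) -> float:
        # F_j = E[relative error] + alpha * sqrt(Var[relative error])
        return stats.mean + alpha * stats.variance() ** 0.5

    # Each feedback entry contributes the relative error (E - E*) / E.
    stats = OnlineStats()
    for estimated, actual in [(10.0, 11.5), (10.0, 9.8), (12.0, 13.1)]:
        stats.update((estimated - actual) / estimated)
    print(error_factor(stats, alpha=1.5))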

To overcome the second challenge, we use the best-fit decreasing bin packing algorithm [23], where we pass DL_j to the bin packing algorithm as its bin size input, and the estimated task execution times E′_θij(T_ij) as its item sizes. Therefore, the number of slots produced by the bin packing algorithm represents the required containers B_j, and the distribution of the tasks T_ij in each slot represents the tasks' schedule over the containers.
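The following is a minimal best-fit decreasing sketch of this step: tasks are sorted by estimated execution time and each is placed in the container (a bin of capacity DL_j) whose remaining time fits it most tightly, opening a new container when none fits. Variable names and the example values are illustrative.

    def best_fit_decreasing(task_times, deadline):
        """Pack estimated task execution times into bins of capacity `deadline`.

        Returns a list of bins; len(result) is the container count B_j and each
        bin holds the per-container task schedule.
        """
        bins = []       # remaining capacity of each container
        schedule = []   # task times assigned to each container
        for t in sorted(task_times, reverse=True):
            # Pick the container whose remaining capacity fits t most tightly.
            best = None
            for idx, remaining in enumerate(bins):
                if t <= remaining and (best is None or remaining < bins[best]):
                    best = idx
            if best is None:
                bins.append(deadline - t)
                schedule.append([t])
            else:
                bins[best] -= t
                schedule[best].append(t)
        return schedule

    # Example: 8 tasks with estimated times (seconds) and a 60-second deadline.
    plan = best_fit_decreasing([20, 35, 10, 25, 15, 30, 5, 40], deadline=60)
    print(len(plan), plan)  # number of containers and their task schedules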

Note that the system resource constraints limit the number of jobs in J that can be serviced at the same time. Therefore, we use an admission control algorithm, shown in Algorithm 2, to maintain a reliable service. The admission control algorithm is used to accept any new incoming user request that can be handled using the remaining available resources without missing its deadline or the deadlines of the other running jobs. It basically estimates the number of containers needed to serve the new request such that it finishes by its deadline. Then, it checks whether there are idle containers available to serve it. The algorithm has two administrator-configurable parameters (β ≥ 1) and (γ ≥ 0), which add a margin of resources to overcome the estimation error and maintain another margin of resources for other running jobs to be used by the above allocation algorithm.

Input: availableCapacity, newJob
Output: Accepted/Rejected
containersNeeded ← BestFitDecreasing(DL_newJob);
if containersNeeded × β < availableCapacity − γ then
    Accept newJob;
else
    Reject newJob;
end

Algorithm 2: Admission Control Algorithm
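In code, the admission check reduces to a single comparison. The sketch below uses a simple total-work lower bound in place of the bin-packing estimate of Algorithm 2, and the β, γ values are illustrative defaults, not those used by SIMaaS.

    import math

    def admit(task_times, deadline, available_containers, beta=1.2, gamma=2):
        """Admission check in the spirit of Algorithm 2; beta and gamma add margins."""
        # The paper derives `needed` from BestFitDecreasing(DL); a simple
        # lower-bound stand-in is the total work divided by the deadline.
        needed = math.ceil(sum(task_times) / deadline)
        return needed * beta < available_containers - gamma

    # Example: a new job of 8 tasks checked against 25 idle containers.
    print(admit([20, 35, 10, 25, 15, 30, 5, 40], deadline=60,
                available_containers=25))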

4.2 Dynamic Resource Provisioning Middleware: Addressing Requirements 1 and 2

The second aspect of dynamic resource management is the middleware infrastructure that encodes the algorithm and provides the service capabilities. This middleware aspect is described here.


4.2.1 Architectural Elements of the SIMaaS Middleware

The central component of the SIMaaS middleware shown in Figure 2 that is responsible for resource provisioning and handling user requests is the SIMaaS Manager (SM). All the coordination and decision making responsibilities are controlled by this component. It employs the strategy design pattern; thus it has a pluggable design that is used to strategize the virtualization approach to be used by the hosted system. The strategy pattern also allows the SM to swap the scheduling policy if needed; however, we use a single scheduling policy during the life-cycle of SIMaaS to avoid conflicts.

A cloud platform typically uses virtualized resources to host user applications. Different types of virtualization include full virtualization (e.g., KVM), paravirtualization (e.g., Xen) and lightweight containers (e.g., LXC Linux containers). Since full and para virtualization require the entire OS to be booted from scratch whenever a new virtual machine (VM) is scheduled, this boot-up time incurs a delay in the availability of new VMs, not to mention the cost of the application's initialization time. All of these impact the user response time. Since Requirement 2 calls for bounded response times, SIMaaS uses lightweight containers, which suffice for our purpose.

The life cycle of these containers is managed by the Container Manager (CM) shown in Figure 2. The pluggable architecture of the SM allows the CM to switch between various container providers, which can be a Linux container or hypervisor-based VM cloud. The Linux container is the default container provider of the CM. Specifically, we use the Docker [34] container virtualization technology since it provides portable deployment of Linux containers and provides a registry for images to be shared across the hosts, with significant performance gains over hypervisor-based approaches. Thus, the CM is responsible for keeping track of the hosts in the cluster and provisioning the running and tearing down of the Docker containers. It downloads and deploys different images from the Docker registry for instantiating different simulations on the cluster hosts.

Our earlier design of the CM leveraged Shipyard [38] for communicating with the Docker hosts; however, due to the sluggish performance we observed, we had to implement a custom solution with a reduced role for Shipyard. Overcoming the reasons for the sluggish performance and maximally reusing existing artifacts is part of our future investigations, when we will also evaluate other container managers such as Apache Mesos, Google Kubernetes and Docker Swarm.

4.2.2 Resource Instrumentation and Monitoring

Recall that meeting user-specified deadlines is an important goal for SIMaaS (Requirement 2). These deadlines must be met in the context of either the stochastic model checking that requires multiple simultaneous runs of the stochastic simulation models, or simulations executed under a range of parameter values. Thus, SIMaaS must be cognizant of overall system performance so that our resource allocation algorithm can make effective dynamic resource management decisions. To support these system requirements, effective system instrumentation is necessary.

Since SIMaaS uses Linux containers, we leveraged the Performance Monitor (PerfMon) package from the JMeter Plugins group of packages on Linux. PerfMon is an open-source Java application which runs as a service on the hosts to be monitored. Since the monitored statistics are required by the Performance Manager (PM) component rather than a visual rendition, we implemented custom software to tap into PerfMon via its TCP/UDP connection capabilities. PerfMon is by no means the only option available, but it sufficed for our needs.

PerfMon depends on the SIGAR API and uses it for gathering system metrics. The metrics available are classified into eight broad categories: CPU, Memory, Disk I/O, Network I/O, JMX (Java Management Extensions), TCP, Swap, and Custom executions. We are currently not using the JMX, TCP, or Swap metrics, but they are available for use if needed. Each of these categories has parameters to allow customization of the desired returned metrics; e.g., Custom allows for the return of any custom command line execution. We use this to execute a custom script that returns the process id and container id pairs of each running Docker container. This allows us to monitor each individual container's performance precisely.
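A minimal sketch of such a custom script is shown below; it lists the running containers and asks the Docker daemon for the host process id of each one. It assumes the Docker CLI is on the PATH and that docker inspect exposes the State.Pid field; the actual SIMaaS script may differ.

    import subprocess

    def container_pid_pairs():
        """Return (container_id, host_pid) pairs for all running Docker containers."""
        ids = subprocess.check_output(["docker", "ps", "-q"], text=True).split()
        pairs = []
        for cid in ids:
            pid = subprocess.check_output(
                ["docker", "inspect", "--format", "{{.State.Pid}}", cid],
                text=True).strip()
            pairs.append((cid, pid))
        return pairs

    if __name__ == "__main__":
        for cid, pid in container_pid_pairs():
            print(cid, pid)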

4.3 Result Aggregation: Addressing Requirement 3

Stochastic model checking as in use case 1 requires that the results of the multiple simulation runs be aggregated to ascertain whether the specified probabilistic property is met. Similarly, as in use case 2, multiple simulation runs for different simulation parameters result in different outcomes, which must be aggregated and presented to the user. To accomplish this, and thereby satisfy Requirement 3, a key component of our middleware is the Result Aggregator (RA). The RA receives the simulation results from the Docker containers. It uses the ZeroMQ messaging queue service for reliable result delivery. It has two roles: first, it sends feedback to the SM about the completion of tasks for decision making; second, it performs the actual result aggregation.

Since the aggregation logic is application-dependent, it is supplied by the user when the service is hosted, and is activated when the simulation job completes. For use case 1, the aggregation logic performs Bayesian statistical model checking and produces a single string result. On the other hand, the use case 2 aggregation logic parses and collates the XML files produced as the result of the simulation runs.
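As an illustration of how containers might deliver results to the RA over ZeroMQ (using the pyzmq bindings; the PUSH/PULL socket types, port and host name are our assumptions, not details from the SIMaaS implementation), a simple pipeline pattern suffices:

    import zmq

    # Result Aggregator side: collect results from the simulation containers.
    def collect_results(expected, port=5557):
        ctx = zmq.Context()
        sink = ctx.socket(zmq.PULL)
        sink.bind(f"tcp://*:{port}")
        results = [sink.recv_json() for _ in range(expected)]
        sink.close()
        ctx.term()
        return results

    # Container side: push one simulation result to the aggregator.
    def send_result(result, aggregator="tcp://aggregator-host:5557"):
        ctx = zmq.Context()
        sock = ctx.socket(zmq.PUSH)
        sock.connect(aggregator)
        sock.send_json(result)
        sock.close()
        ctx.term()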


4.4 Web Interface to SIMaaS - SIMaaS Workflow and User Interaction: Addressing Requirement 4

Finally, we discuss how the user interacts with SIMaaS through its web-based interface, and the workflow triggered by a typical user action. The interface to the Simulation Manager of SIMaaS is hosted on a lightweight web server, CherryPy [22], to interact with the user and also to receive feedback from other SIMaaS components. The interaction involves two phases. In the design phase, a user interacts with the SIMaaS interface and provides the initial configuration, which includes the simulation executables and the aggregation logic. A container image is generated after including hooks to send the temporary results. This image is then uploaded to a private cloud registry accessible to the container hosts. The aggregation logic is deployed in the Result Aggregator component, which collects the temporary simulation results and generates the final response.

Fig. 3: SIMaaS Interaction Diagram

The execution phase is depicted in Figure 3, wherein time-bounded, on-demand simulation jobs are performed. The user can use a RESTful API (or a web form if the deadline is not immediate) to supply name-value pairs of parameters; an illustrative request is sketched after the parameter lists below. The following parameters are supplied by all users:


– Simulation Model Name: A simulation model name is required to identify the container image and aggregation logic.
– Number of Simulations: The number of simulation instances to run.
– Deadline: The deadline for the job.
– Required Resources: Number of CPUs to be allocated for each container. Other types of resources will be added in the future.
– Simulation Command: The command to initiate the simulation.
– List of (Execution Time Parameter, Estimated Execution Time): A small set of execution time parameter values and the corresponding execution times that is used to generate the regression curve for estimating the unknown execution times for the remaining parameters.

The following parameters are specific to the use case:

– Use Case 1 - Number of Heaters: The number of heaters active in the building (explained in Section 3.1.1).
– Use Case 1 - Sampling Rate: The rate at which data is sampled for the temperature simulation.
– Use Case 1 - Strategy: The model strategy to be used.
– Use Case 1 - Confidence Level: A confidence value for the Bayesian Aggregator (explained in Section 3.1.1).
– Use Case 2 - SUMO Configuration File: A configuration file used by the SUMO simulation for selecting the inputs and deciding the output details.
– Use Case 2 - List of Vehicle Counts: Different vehicle counts to be simulated.

We note that the simulation execution time is also a user input; however, this value can be determined in a sandbox environment, wherein executing a single simulation instance gives the value for a constant-time simulation (such as use case 1), or by executing a subset of simulation instances over a range of parameters (such as use case 2) and using a regression curve to estimate the execution times for the others.
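As a simplified stand-in for the sequential exponential-kernel regression of [36], the following Nadaraya-Watson style sketch interpolates execution times from the few (parameter, time) samples supplied by the user; the bandwidth choice and the sample values are illustrative.

    import math

    def estimate_execution_time(param, samples, bandwidth=20000.0):
        """Kernel-weighted average of user-supplied (parameter, execution time) samples."""
        weights = [math.exp(-abs(param - p) / bandwidth) for p, _ in samples]
        total = sum(weights)
        return sum(w * t for w, (_, t) in zip(weights, samples)) / total

    # User-supplied samples: (vehicle count, measured execution time in seconds).
    samples = [(20000, 55.0), (60000, 210.0), (100000, 420.0)]
    print(estimate_execution_time(80000, samples))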

The request is then forwarded to and processed by the SM. It validates the input, applies admission control as explained in Algorithm 2 using a resource allocation and scheduling policy, and checks if sufficient resources are available. If not, it immediately responds to the user with a failure message. In the future, based on the criticality of the request, some jobs may be swapped out for a higher priority job. If the job can be scheduled, the SM allocates the resources and contacts the CM to run the simulation containers. The containers log their results to the RA, which keeps sending feedback to the SM and performs the aggregation when the desired number of simulation results has been received. The SM also runs a service, applying Algorithm 1 at a configurable interval, to determine if the deadline will be met based on the current performance data, and accordingly contacts the CM to acquire additional resources and run the containers. Once the simulation completes, the RA responds to the user with the result. Currently it uses a shared folder; however, going forward we plan to implement either an interface that can send the response as an asynchronous callback or a notification to the user about the availability of the result.


5 Experimental Validation

This section evaluates the performance properties of the SIMaaS middleware and validates our claims in the context of hosting the use cases described in Section 3.1.

5.1 Experimental Setup

Our setup consists of ten physical hosts, each with the configuration defined in Table 1. The same set of machines was used for experimenting with both Linux containers and virtual machines. Docker version 1.6.0 was used for Linux container virtualization, and QEMU-KVM was used for hypervisor virtualization with QEMU version 2.0.0 and Linux kernel 3.13.0-24.

Table 1: Hardware & Software Specification of Physical Servers

Processor: 2.1 GHz Opteron
Number of CPU cores: 12
Memory: 32 GB
Disk Space: 500 GB
Operating System: Ubuntu 14.04 64-bit

Even though our solution is designed to leverage Linux containers instead of virtual machines, since we did not have access to a large number of physical machines yet had to measure the scalability of our approach, we tested our solution over a homogeneous cluster of 60 virtual machines deployed as Docker hosts for running the simulation tasks, i.e., the Docker containers were spawned inside the VMs. The same set of physical machines was used to host the VMs, with the configuration defined in Table 2.

Table 2: Configuration of VM Cluster Nodes

Kernel: Linux 3.13.0-24
Hypervisor: QEMU-KVM
Number of Virtual Machines: 6
Overbooking Ratio: 2.0
Guest CPUs: 4
Guest Memory: 4 GB
Guest OS: Ubuntu 14.04 64-bit

Note that the SM, CM, RA and PM components of the SIMaaS middleware reside in individual virtual machines deployed on a separate set of hosts, each with 4 virtual CPUs, 8 GB memory and running the Ubuntu 14.04 64-bit operating system, in our private cloud managed by OpenNebula 4.6.2. The Simulation Manager was deployed on the CherryPy 3.6.0 web server. The Container Manager used Shipyard version v2 for managing the Docker hosts. The Performance Monitor relied on a customized PerfMon Server Agent 2.2.3.RC1 residing in each Docker host to collect performance data. The Result Aggregator utilized ZeroMQ version 4.0.4 for receiving simulation results from the Docker containers.

5.2 Validating the Choice of Linux Container-based SIMaaS Solution

We first show why we used the container-based approach in the SIMaaS solution instead of traditional virtual machines. This set of experiments confirms the large difference in startup times between containers in a Linux container-based cloud and virtual machines in a hypervisor-based traditional cloud. In [31], the authors showed that a high start-up time is required on different popular public clouds. We tested similar configurations in our private cloud, managed by OpenNebula and running the QEMU-KVM hypervisor. We used overbooking ratios of 1, 2 and 4 with a minimal image from use case 1. While the startup times were on the order of sub-seconds for our Linux container host, they were 176, 300 and 599 seconds, respectively, for the hypervisor host. The large start-up time can be ascribed to the time taken in cloning the image as the VM file system and booting up the operating system.

Another set of experiments was performed to compare the performance of a host running simulations using Linux containers versus virtual machines. Table 3 shows that the Linux container host performs better in most of the cases as it does not incur the overhead of running another operating system as a VM does.

Table 3: Comparison of Simulation Execution Time

                                          Overbooking  Overbooking  Overbooking
                                            Ratio 1      Ratio 2      Ratio 4
Linux Container (Physical Server - 4s)       4.74 s       7.19 s      13.32 s
Virtual Machine (Physical Server - 4s)       5.17 s       9.71 s      19.05 s
Linux Container (Physical Server - 50s)     50.5 s       98.29 s     180.45 s
Virtual Machine (Physical Server - 50s)     52.4 s       97.56 s     202.5 s

5.3 Workload for Container-based Experimentation

The workload we used in our experiments consists of several jobs corresponding to user requests, each having a number of simulation instances as a bag of independent tasks. The simulations are containerized as Docker images on the Ubuntu 14.04 64-bit operating system. The jobs are based on both use cases described in Section 3.1; however, we created several variations of these use cases by changing the execution parameters. The building heating stochastic simulation jobs have a near-constant execution time, but we used three variations of them with different sampling rates of 10, 5, and 2 milliseconds. The smaller the sampling rate, the better the accuracy of the simulation results, at the expense of longer execution times. The traffic simulation jobs also had several variations based on the range of the vehicle count.

The simulations for each job may have different resource requirements that are provided by the user. For these experiments, we have considered CPU-intensive workloads and modeled the user input as three resource types with 1, 2 or 4 CPUs per container, which is an indication of how much CPU share that container gets. Thus, we convert these values to a CPU share per host, which is passed as the Docker container input.
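A sketch of how such a CPU share can be passed when launching a simulation container is shown below, using the Docker CLI's --cpu-shares flag (a relative CPU weight) from Python; the image name, command and conversion factor are illustrative, and the actual conversion used by SIMaaS may differ.

    import subprocess

    CPU_SHARE_PER_CPU = 1024  # Docker's default relative weight for one CPU

    def run_simulation_container(image, command, cpus_requested):
        """Start a container with a relative CPU share derived from the CPU request."""
        shares = cpus_requested * CPU_SHARE_PER_CPU
        return subprocess.check_output(
            ["docker", "run", "-d", "--cpu-shares", str(shares), image] + command,
            text=True).strip()  # returns the new container id

    # Illustrative image and command names.
    cid = run_simulation_container("simaas/heater-sim", ["run_simulation.sh"],
                                   cpus_requested=2)
    print("started container", cid)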

We generated a synthetic workload to measure the system performance and conducted two sets of experiments. The first set of experiments was conducted to measure the efficacy of our algorithm using a single type of job on a ten physical host cluster, whereas the second set of experiments was performed to demonstrate the scalability of our algorithm using a 60 virtual-host cluster with different types of jobs arriving at different points in time. We applied a Poisson distribution with λ of 1 over a duration of two hours to obtain the job arrival distribution. The number of tasks per job was uniformly distributed from 100 to 500. The deadline per job was also varied as a uniform distribution from 5 minutes to 20 minutes.
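The following sketch generates such a synthetic workload: Poisson arrivals, modeled as exponential inter-arrival times with rate λ = 1 per minute (the time unit of λ is our assumption, since the paper does not state it), a uniform task count from 100 to 500, and a uniform deadline from 5 to 20 minutes.

    import random

    def generate_workload(duration_minutes=120, arrival_rate_per_min=1.0):
        """Synthetic job workload: Poisson arrivals, uniform task counts and deadlines."""
        jobs, t = [], 0.0
        while True:
            # Poisson process: exponential inter-arrival times.
            t += random.expovariate(arrival_rate_per_min)
            if t > duration_minutes:
                break
            jobs.append({
                "arrival_min": round(t, 2),
                "num_tasks": random.randint(100, 500),
                "deadline_min": random.uniform(5, 20),
            })
        return jobs

    workload = generate_workload()
    print(len(workload), workload[:2])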

5.4 Evaluating SIMaaS for Meeting Deadlines and Resource Consumption

We evaluate the ability of the SIMaaS middleware to meet the user-specified deadline and its effectiveness in minimizing the resources consumed. In use case 1, described in Section 3.1.1, the user provides the approximate number of simulations needed for stochastic model checking as an input to attain the desired confidence level for the output [45]. For the second use case, described in Section 3.1.2, the number of simulations is a user input. These studies were conducted for different resource overbooking ratios, simulation counts, deadlines, simulation durations, and execution times. Overbooking refers to the number of times the capacity of a physical resource is exceeded. For example, suppose each container is assigned a single CPU; then for a 12-core system, an overbooking ratio of 2 translates to 24 containers running on the host. This strategy is cost effective when the guests do not consume all the assigned resources at the same time. We run the scheduling policy defined in Algorithm 1 at an interval of 2 seconds; it dynamically allocates extra hosts if the deadline cannot be met with the assigned hosts.


Test 1 – Determining the Error Estimation Parameter (α): These experiments were performed on the ten physical host cluster with use case 1 as the simulation model and the following parameters: a deadline of 2 minutes, 500 simulation tasks, one CPU per container and an overbooking ratio of 2 per host. The purpose of these tests was to determine the error estimation parameter α for our system. This value is used to calculate the error factor, explained in Section 4.1, that plays a crucial role in meeting the deadline and allocating container slots. Recall also that our algorithm attempts to minimize the number of containers while meeting deadlines on a per simulation job basis.

Fig. 4: Container Count and Deadline Variation with α

Figure 4 depicts the simulation results for α values of 0, 1, 1.5, 2 and 3. We observe that values of 0 and 1 were too low and the system missed the deadline. A value of 1.5 was found to meet the deadline while causing lower peak resource usage. This value can be made dynamic based on the user's urgency and the strictness of the deadline.

Test 2 – Studying Variations in Container Count (Bj) with Error Factor (Fj): These tests were conducted to study the impact of changing the feedback-based error factor (Fj) on the container count (Bj), as well as the effect of the input simulation execution time. From Figure 5, we observe that initially the resource consumption varies according to the error factor. Later, the resource consumption stays constant once the error estimate reaches a steady state. We conclude that the error estimate becomes accurate and close to the real value we obtain from the feedback. However, as the simulation moves towards completion, resources get released as fewer simulations remain to be executed.

Fig. 5: Variation in Error Factor (Fj) and Container Count (Bj). (a) 5-second simulation execution time; (b) 10-second; (c) 15-second; (d) 20-second.

Another observation we make from the results is that a pessimistic execution time estimate (here 20 s) results in more initial resource allocation. Resource allocation has its own cost. For example, we measured the cost of deploying the simulation images from the private registry to the Docker hosts for both of our use cases: deploying the heater simulation image of use case 1 took 135.1 s, and deploying the SUMO simulation image of use case 2 took 34.6 s. These values are network-dependent, but the cost is incurred only once per host and can be paid as part of the setup process.

Since resource allocation incurs cost, an optimistic estimate is better because the system can adjust itself. However, if the estimate is too optimistic, the system may not be able to finish the job within the user-defined deadline.
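The one-time image deployment mentioned above, i.e., pulling the simulation images from the private registry onto every Docker host before jobs arrive, could be scripted roughly as follows. The registry address and image names are placeholders; only the use of `docker pull` reflects the text.

import subprocess

# Placeholder image names on a hypothetical private registry.
IMAGES = [
    "registry.example.local:5000/heater-simulation:latest",  # use case 1
    "registry.example.local:5000/sumo-simulation:latest",    # use case 2
]

def warm_docker_host(images=IMAGES):
    """Pre-pull simulation images so container start-up does not pay the transfer cost."""
    for image in images:
        subprocess.run(["docker", "pull", image], check=True)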

Test 3 – Validating the Applicability of the Feedback-based Approach: The goal of applying Algorithm 1 is to spread the load of each job over its deadline period so that multiple jobs can run in parallel while meeting their deadlines. In other words, the system does not schedule a job (and its tasks) immediately upon a request but delays it such that the load is balanced while still ensuring that the deadline will be met.

Figure 6 compares this with a scenario where we do not apply the feedback-based approach and instead allocate resources based on a fixed expected execution time. Estimating an accurate execution time is not a trivial task and may not even be realistically feasible. The execution time does not depend only on the input parameters and hardware; it also depends on the performance interference from other processes running on the shared resource. Too low a value and we miss the deadline, while too high a value results in wasted resources. We observe from the results that the feedback-based approach meets the deadline in all cases while minimizing resource consumption by releasing containers that are no longer needed.
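To make the feedback idea concrete, the sketch below shows one way a container count could be re-estimated at every scheduling interval from the measured execution times; the exact formulas of Algorithm 1 and the error factor of Section 4.1 differ, so the names and arithmetic here are purely illustrative assumptions.

def update_error_factor(measured_exec_s: float, estimated_exec_s: float) -> float:
    """Feedback step: ratio of observed to estimated execution time."""
    return measured_exec_s / estimated_exec_s

def containers_needed(remaining_tasks: int, estimated_exec_s: float,
                      time_to_deadline_s: float, error_factor: float) -> int:
    """Containers required so the remaining tasks finish before the deadline,
    inflated by the feedback-based error factor."""
    if time_to_deadline_s <= estimated_exec_s:
        return remaining_tasks  # no slack left: run everything in parallel
    runs_per_container = int(time_to_deadline_s // estimated_exec_s)
    needed = -(-remaining_tasks // runs_per_container)  # ceiling division
    return max(1, int(needed * error_factor))

# Example: 400 tasks left, ~10 s each, 120 s to the deadline, factor 1.5 -> 51 containers.
print(containers_needed(400, 10.0, 120.0, 1.5))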

Fig. 6: Comparison of Feedback vs No Feedback Approaches (number of containers (Bj) over simulation duration (sec)). (a) 5-second simulation execution time; (b) 10-second; (c) 15-second; (d) 20-second.

Test 4 – Varying Host Overbooking Ratios: This set of experiments was performed to measure the capability of the system to handle multiple parallel requests made to the hosts with varying overbooking ratios, and to study their performance while executing the containers. Table 4 shows the results of the experiments, in which we varied the overbooking ratio from 0.5 to 6. The specified deadline was 4 minutes and the expected simulation time was 10 seconds.


Table 4: System Performance with Varying Host Overbooking Ratios

Overbooking   Max Container   Max Hosts   Simulation        Measured Execution   Turnaround Time   Measured
Ratio         Count           Acquired    Duration (sec)    Time per Sim (sec)   per Sim (sec)     Overhead (%)
0.5           29              5           239.4             9.31                 10.48             12.57
1             41              4           233.6             9.53                 11.52             20.88
2             59              3           231.3             15.62                18.59             19.01
4             114             3           213.3             28.72                32.76             14.07
6             160             3           Deadline missed   41.62                46.9              12.67

We measure the container count and the actual number of hosts acquired by the system to meet the deadline; the system's goal is to minimize this number to keep the economic cost within bounds. We also measure the simulation duration observed by the system user after the system finds the desired solution, the average turnaround time per simulation from the instant it is requested until the results are logged, the actual execution time per simulation, and the corresponding system overhead. This overhead includes the performance interference overhead, resource contention, and the time consumed in data transfer at different components of the SIMaaS workflow shown in Figure 3.
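For clarity, the overhead percentages reported in Tables 4–6 appear to be the relative difference between the per-simulation turnaround time and the raw execution time; this reading is our inference from the tabulated values rather than a formula quoted from the text.

def overhead_percent(turnaround_s: float, execution_s: float) -> float:
    # Relative extra time spent outside raw simulation execution.
    return (turnaround_s - execution_s) / execution_s * 100.0

# e.g. Table 4, overbooking ratio 1: (11.52 - 9.53) / 9.53 * 100 ~= 20.88 %
print(round(overhead_percent(11.52, 9.53), 2))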

From the results we can conclude that for CPU-intensive applications – and simulations tend to fall in this category – the non-overbooked system provides the best results; however, the number of hosts needed is also high, which in turn increases the economic cost. A highly overbooked system also has a high cost and is unable to meet the deadlines due to performance overhead, and should therefore be avoided. Based on the empirical results, a scenario with a low overbooking ratio provides the ideal trade-off, as it needs fewer hosts and is still able to meet the deadlines. We also note that the system overhead remains at a reasonable level of less than 21% during the experiments.

Based on these experiments, Figure 7 illustrates the CPU and memory utilization for use case 1. The simulations have a low memory footprint, but the CPU utilization is quite high. This conforms to our earlier result that no or low overbooking of the host provides better performance. The results were similar for use case 2.

Test 5 – Varying Number of Simulations: The purpose of these tests is to demonstrate the scalability of the SIMaaS middleware with an increasing number of simulations, as needed when the fidelity of statistical model checking increases. The tests were executed with a deadline of 600 seconds while the other parameters were kept the same as in the previous experiments. Table 5 shows the results, which illustrate that the system is able to scale to 5,000 simulations for a job without significant overhead.


Fig. 7: Use Case 1: (a) CPU Utilization Variations with Simulation Count; (b) Memory Utilization Variations with Simulation Count.

Table 5: System Performance with Varying Number of Simulations

Number of     Max Container   Max Hosts   Simulation        Measured Execution   Turnaround Time   Measured
Simulations   Count           Acquired    Duration (sec)    Time per Sim (sec)   per Sim (sec)     Overhead (%)
500           18              1           513.8             4.81                 7.79              61.95
1000          33              2           547.1             5.26                 11.45             117.68
2500          71              3           588.5             5.52                 11.02             99.64
5000          137             6           591.4             5.49                 10.72             95.26

Test 6 – Varying Simulation Execution Time: For these experiments, we vary the sampling rate parameter of use case 1 and use a deadline of 10 minutes to increase the execution time of our simulation model. Table 6 presents the simulation performance for varying execution times. We observe that the overhead reduces significantly as the duration of the container execution increases, which is attributed mainly to a smaller percentage of time being spent in scheduling and starting up containers.

Table 6: System Performance with Varying Simulation Duration

Sampling      Max Container   Max Hosts   Simulation        Measured Execution   Turnaround Time   Measured
Rate (ms)     Count           Acquired    Duration (sec)    Time per Sim (sec)   per Sim (sec)     Overhead (%)
10            6               1           526.3             4.69                 5.87              25.91
5             11              1           533.0             9.35                 10.48             12.09
1             108             9           526.0             63.91                66.81             4.53

Test 7 – Scalability Test and Admission Control with Incoming Workload: These are scalability experiments with the workload described in Section 5.3, using 60 Docker hosts and an incoming request flow generated using a Poisson distribution. The system applied admission control and informed the users of the decision on whether their requests were accepted. Table 7 summarizes the results of these tests.

Table 7: Scalability Test Summary

Number of Jobs                          103
Test Duration                           1h:47min:57s
Number of Simulations Performed         15873
Hosts Utilized                          54 / 60
Jobs Rejected                           1
Number of Jobs that Missed Deadline     1

We observed that one job with a deadline of 618 seconds missed it by 1.39 seconds. This failure can be eliminated with a stricter error estimation parameter (α), explained in Section 4.1. SIMaaS scaled to 54 virtualized hosts during the experiments. In the future, we would like to experiment with a larger cluster to test the system's scalability limit.
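A simplified, hypothetical view of the admission-control check used in this test is sketched below: a job is accepted only if the container slots it needs to meet its deadline fit within the cluster's remaining (possibly overbooked) capacity. The criterion and names are assumptions for illustration, not the published algorithm.

def admit_job(num_tasks: int, estimated_exec_s: float,
              deadline_s: float, free_slots: int) -> bool:
    """Accept a job only if enough container slots remain to meet its deadline."""
    runs_per_slot = max(1, int(deadline_s // estimated_exec_s))
    slots_needed = -(-num_tasks // runs_per_slot)  # ceiling division
    return slots_needed <= free_slots

# Example: 300 tasks of ~10 s each, a 10-minute deadline, 20 free slots -> accepted.
print(admit_job(300, 10.0, 600.0, 20))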

6 Conclusions

This paper described the design and empirical validation of a cloud middleware solution to support the notion of simulation-as-a-service. Our solution is applicable to those systems whose models are stochastic and require a potentially large number of simulation runs to arrive at outcomes that are within statistically relevant confidence intervals, or systems whose models result in different outcomes for different parameters.

We gained many insights during this research, which are listed below; resolving them forms the dimensions of our future investigations:

– Several competing alternatives are available to realize different aspects of cloud hosting. Effective application of software engineering design patterns is necessary to realize the architecture for cloud-based middleware solutions so that individual technologies can be swapped with alternate choices.

– Our empirical results suggest that an overbooking ratio of 2 and an α value of 1.5 provided the best configuration for executing the simulations. However, these conclusions were based on the existing use cases and the small size of our private data center. Moreover, no background traffic was considered. Our future work will explore this dimension as well as determine a mathematical bound for the optimal configuration.

– In our approach, the number of simulations to execute for stochastic model checking was based on published results for the use case. In the future, there will be a need to determine these quantities through modeling and empirical means.


– We have handled basic failures in our system, where a container is scheduled again if it does not start; however, we need advanced fault tolerance mechanisms to handle failures of hosts and of the various SIMaaS components. Our prior work [6] has explored the use of VM-based fault tolerance; for the current work, we will need container-based fault tolerance mechanisms.

– We did not consider a billing model for the end users in our work, but such a model should be considered both to generate revenue for such a service and to ensure that users do not abuse the system.

– As most cloud providers are rolling out Linux container-based application deployment, we need to design a hybrid cloud to leverage it.

– Currently, our middleware architecture is realized as a centralized deployment in our small-scale private data center. In large data centers, we will require a distributed realization of the various entities of our middleware. This will give rise to a number of distributed systems issues, and addressing these forms another dimension of our future work.

All the scripts, source code, and experimental results of SIMaaS are available for download from http://www.dre.vanderbilt.edu/~sshekhar/download/SIMaaS.

Acknowledgments.

This work was supported in part by the National Science Foundation CAREER CNS 0845789 and AFOSR DDDAS FA9550-13-1-0227. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF and AFOSR.

References

1. Abate A, Prandini M, Lygeros J, Sastry S (2008) Probabilistic Reachability and Safety for Controlled Discrete Time Stochastic Hybrid Systems. Automatica 44(11):2724–2734

2. Abate A, Katoen JP, Lygeros J, Prandini M (2010) Approximate Model Checking of Stochastic Hybrid Systems. European Journal of Control 16(6):624–641

3. Al-Zoubi K, Wainer G (2011) Distributed Simulation using RESTful Interoperability Simulation Environment (RISE) Middleware. In: Intelligence-Based Systems Engineering, Springer, pp 129–157

4. Alamri A, Ansari WS, Hassan MM, Hossain MS, Alelaiwi A, Hossain MA (2013) A Survey on Sensor-cloud: Architecture, Applications, and Approaches. International Journal of Distributed Sensor Networks 2013

5. Alur R, Pappas G (2004) Hybrid Systems: Computation and Control: 7th International Workshop, HSCC 2004, Philadelphia, PA, USA, March 25-27, 2004, Proceedings, vol 7. Springer

6. An K, Shekhar S, Caglar F, Gokhale A, Sastry S (2014) A Cloud Middleware for Assuring Performance and High Availability of Soft Real-time Applications. Elsevier Journal of Systems Architecture (JSA) 60(9):757–769, DOI http://dx.doi.org/10.1016/j.sysarc.2014.01.009

7. Atzori L, Iera A, Morabito G (2010) The Internet of Things: A Survey. Computer Networks 54(15):2787–2805

8. Bardac M, Deaconescu R, Florea AM (2010) Scaling Peer-to-Peer Testing using Linux Containers. In: Roedunet International Conference (RoEduNet), 2010 9th, IEEE, pp 287–292, URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5541555

9. Behrisch M, Bieker L, Erdmann J, Krajzewicz D (2011) SUMO – Simulation of Urban MObility: an Overview. In: SIMUL 2011, The Third International Conference on Advances in System Simulation, pp 55–60

10. Van den Bossche R, Vanmechelen K, Broeckhove J (2011) Cost-efficient Scheduling Heuristics for Deadline Constrained Workloads on Hybrid Clouds. In: Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, IEEE, pp 320–327

11. Calheiros RN, Ranjan R, Beloglazov A, De Rose CA, Buyya R (2011) CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms. Software: Practice and Experience 41(1):23–50

12. Calheiros RN, Vecchiola C, Karunamoorthy D, Buyya R (2012) The Aneka Platform and QoS-driven Resource Provisioning for Elastic Applications on Hybrid Clouds. Future Generation Computer Systems 28(6):861–870

13. Calheiros RN, Netto MA, De Rose CA, Buyya R (2013) EMUSIM: An Integrated Emulation and Simulation Environment for Modeling, Evaluation, and Validation of Performance of Cloud Computing Applications. Software: Practice and Experience 43(5):595–612

14. Casanova H, Giersch A, Legrand A, Quinson M, Suter F (2013) SimGrid: A Sustained Effort for the Versatile Simulation of Large-scale Distributed Systems. arXiv preprint arXiv:1309.1630

15. Choi C, Seo KM, Kim TG (2014) DEXSim: An Experimental Environment for Distributed Execution of Replicated Simulators using a Concept of Single-simulation Multiple Scenarios. Simulation p 0037549713520251

16. Fujimoto RM (1990) Parallel Discrete Event Simulation. Communications of the ACM 33(10):30–53

17. Fujimoto RM, Malik AW, Park A (2010) Parallel and Distributed Simulation in the Cloud. SCS M&S Magazine 3:1–10

18. Gao Y, Wang Y, Gupta SK, Pedram M (2013) An Energy and Deadline Aware Resource Provisioning, Scheduling and Optimization Framework for Cloud Systems. In: Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, IEEE Press, p 31

19. García-Valls M, Cucinotta T, Lu C (2014) Challenges in Real-Time Virtualization and Predictable Cloud Computing. Journal of Systems Architecture

20. Haklay M, Weber P (2008) OpenStreetMap: User-generated Street Maps. Pervasive Computing, IEEE 7(4):12–18


21. Handigol N, Heller B, Jeyakumar V, Lantz B, McKeown N (2012) Reproducible Network Experiments using Container-based Emulation. In: Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, ACM, pp 253–264, URL http://dl.acm.org/citation.cfm?id=2413206

22. Hellegouarch S (2007) CherryPy Essentials: Rapid Python Web Application Development. Packt Publishing Ltd

23. Kenyon C, et al (1996) Best-Fit Bin-Packing with Random Order. In: SODA, vol 96, pp 359–364

24. Kim H, El-Khamra Y, Rodero I, Jha S, Parashar M (2011) Autonomic Management of Application Workflows on Hybrid Computing Infrastructure. Scientific Programming 19(2):75–89

25. Knuth DE (1969) The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Revised Edition

26. Lardieri P, Balasubramanian J, Schmidt DC, Thaker G, Gokhale A, Damiano T (2007) A Multi-layered Resource Management Framework for Dynamic Resource Management in Enterprise DRE Systems. Journal of Systems and Software: Special Issue on Dynamic Resource Management in Distributed Real-time Systems 80(7):984–996

27. Ledyayev R, Richter H (2014) High Performance Computing in a Cloud Using OpenStack. In: CLOUD COMPUTING 2014, The Fifth International Conference on Cloud Computing, GRIDs, and Virtualization, pp 108–113

28. Li Z, Li X, Duong T, Cai W, Turner SJ (2013) Accelerating Optimistic HLA-based Simulations in Virtual Execution Environments. In: Proceedings of the 2013 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, ACM, pp 211–220

29. Liu X, He Q, Qiu X, Chen B, Huang K (2012) Cloud-based Computer Simulation: Towards Planting Existing Simulation Software into the Cloud. Simulation Modelling Practice and Theory 26:135–150

30. LXC (2014) Linux Container. URL https://linuxcontainers.org/, last accessed: 10/11/2014

31. Mao M, Humphrey M (2012) A Performance Study on the VM Startup Time in the Cloud. In: Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, IEEE, pp 423–430

32. Mauch V, Kunze M, Hillenbrand M (2013) High Performance Cloud Computing. Future Generation Computer Systems 29(6):1408–1416

33. Menage PB (2007) Adding Generic Process Containers to the Linux Kernel. In: Proceedings of the Linux Symposium, Citeseer, vol 2, pp 45–57, URL https://www.kernel.org/doc/ols/2007/ols2007v2-pages-45-58.pdf

34. Merkel D (2014) Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J 2014(239), URL http://dl.acm.org/citation.cfm?id=2600239.2600241

35. Rak M, Cuomo A, Villano U (2012) Mjades: Concurrent Simulation in the Cloud. In: Complex, Intelligent and Software Intensive Systems (CISIS), 2012 Sixth International Conference on, IEEE, pp 853–860


36. Rasmussen CE, Williams CKI (2005) Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press

37. Shankaran N, Kinnebrew JS, Koutsoukos XD, Lu C, Schmidt DC, Biswas G (2009) An Integrated Planning and Adaptive Resource Management Architecture for Distributed Real-time Embedded Systems. Computers, IEEE Transactions on 58(11):1485–1499

38. Shipyard (2014) Shipyard Project. URL http://shipyard-project.com/, last accessed: 10/11/2014

39. Soltesz S, Potzl H, Fiuczynski ME, Bavier A, Peterson L (2007) Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors. In: ACM SIGOPS Operating Systems Review, ACM, vol 41, pp 275–287, URL http://dl.acm.org/citation.cfm?id=1273025

40. Somasundaram TS, Govindarajan K (2014) CLOUDRB: A Framework for Scheduling and Managing High-Performance Computing (HPC) Applications in Science Cloud. Future Generation Computer Systems 34:47–65

41. Tao F, Zhang L, Venkatesh V, Luo Y, Cheng Y (2011) Cloud Manufacturing: A Computing and Service-oriented Manufacturing Model. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture p 0954405411405575

42. Vanmechelen K, De Munck S, Broeckhove J (2012) Conservative Distributed Discrete Event Simulation on Amazon EC2. In: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), IEEE Computer Society, pp 853–860

43. Xavier MG, Neves MV, Rossi FD, Ferreto TC, Lange T, De Rose CA (2013) Performance Evaluation of Container-based Virtualization for High Performance Computing Environments. In: Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, IEEE, pp 233–240, URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6498558

44. Zhu X, Chen H, Yang LT, Yin S (2013) Energy-Aware Rolling-Horizon Scheduling for Real-Time Tasks in Virtualized Cloud Data Centers. In: High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC EUC), 2013 IEEE 10th International Conference on, IEEE, pp 1119–1126

45. Zuliani P, Platzer A, Clarke EM (2013) Bayesian Statistical Model Checking with Application to Stateflow/Simulink Verification. Formal Methods in System Design 43(2):338–367