
Order no.: 375 · Library no.: 06ENSL0375 · LIP internal no.: PhD2006-04

ÉCOLE NORMALE SUPÉRIEURE DE LYON
Laboratoire de l'Informatique du Parallélisme

THESIS

to obtain the degree of Doctor of the École Normale Supérieure de Lyon

specialty: Computer Science

under the MathIF doctoral school

presented and publicly defended on 28 September 2006

by Pushpinder Kaur CHOUHAN

Automatic Deployment for Application Service Provider Environments

Thesis advisors: Eddy CARON, Frédéric DESPREZ

After review by: Daniel HAGIMONT, Reviewer; Christian PEREZ, Reviewer; Thierry PRIOL, Reviewer

Before the examination committee composed of:

Eddy CARON, Member
Frédéric DESPREZ, Member
Daniel HAGIMONT, Member/Reviewer
John MORRISON, Member
Christian PEREZ, Member/Reviewer
Thierry PRIOL, Member/Reviewer


To my husband


Acknowledgments

I have an incredible chance to thank all the people who helped and taught me to have whatever I wished. I cannot imagine what I would be without them.

First and foremost I would like to thank my parents, Col. Avtar Singh Chouhan and Mrs. Harbans Kaur Chouhan, for educating me in ethics, culture, and sciences, for having faith in me, for unconditional support and encouragement to pursue my studies, even when things went beyond the boundaries of their language, field, and geography.

A special thanks to my PhD advisors, Eddy Caron, Assistant Professor at ENS-Lyon, and Frédéric Desprez, INRIA Researcher and director of LIP. They have been supportive since the day I began working on the subject of "Deployment Planning". After my master's training they helped me by giving me an interesting thesis topic and guided me throughout to accomplish the thesis goal. They supported me, not only by providing a research assistantship, but also academically and emotionally to overcome the rough road to finish this thesis. They inculcated confidence in me when I doubted my abilities. During the thesis, they gave me the moral support and the freedom I needed to move on.

Thanks to my thesis committee members: Daniel Hagimont, Professor at INPT/ENSEEIHT, Jean-François Méhaut, Professor in Polytechnic Grenoble, John Morrison, Professor at University College Cork, Ireland, Christian Perez, INRIA Researcher in project PARIS at IRISA/INRIA-Rennes, and Thierry Priol, INRIA Research Director. Special thanks to Christian, Daniel, and Thierry for vetting my first thesis write-up. Thanks to John for his invaluable guidance and support.

I wish to thank two persons for their support in helping me to validate the research presented in this thesis: Holly Dail and Raphaël Bolze.

Special thanks to Frédéric Vivien, INRIA Researcher at ENS-Lyon, and Yves Robert, Professor at ENS-Lyon, for their input and guidance, which greatly improved the quality of my work.

A special acknowledgment goes to my office mate: Loris Marchal, PhD student at ENS-Lyon. Thanks to Loris and his family. He has been a true friend ever since I arrived at ENS-Lyon. We started sharing an office from the beginning of our Master's training. I have enjoyed Christmas parties at Loris's home every year since 2002 and I wish I could join him for this party for the years to come.

Special thanks to Arnaud Legrand, researcher at ID-IMAG Grenoble, for teaching me Perl, SIMGRID, and working with Linux. I really appreciate his way of answering


my queries. Arnaud knows very well how to clear up doubts. He is the perfect person to work with to improve and gain technical knowledge in a simple and easy manner.

I also thank my other great teammates (Abdelkader, Alan, Antoine, Aurélia, Cédric, Emmanuel, Hélène, Jean-Sébastien, and many others) who have been humorous and supportive in every way. I want to thank Benoit, Cédric, and Jérémie, who helped me in different ways when I resumed my studies at ENS-Lyon.

Thanks to Anne Benoit, Jean-Yves L'Excellent, Alain Darte, Pierre Lescanne, Serge, Dominique - in short, thanks to all members of LIP who helped and taught me, and made the environment so favorable that I really enjoyed working in LIP, ENS-Lyon.

Thanks to Sylvie Boyer, Corinne Iafrate, and Isabelle Pera for their humorous nature and the smiles which have an appreciable role in making the beautiful environment of LIP.

I especially want to thank my friends who made my stay in Lyon comfortable and enjoyable: David, Dino, Dragisa, Eva, Fatiha, Monika, Myriam, Nidhi, Sylvia, and Veronika. Thanks for sharing many joyful and memorable moments with me.

Thanks to Adarsh and Nidhi for making my stay in Cork colorful and unforgettable.

Thanks to my brother Lakhwinder Singh Chouhan, who cherished every moment with me and supported me whenever I needed it.

Last, but not least, I want to thank my wonderful husband Ashish Meena for encouraging me to do my PhD, for his constant support throughout the last couple of years, and for believing in me. I could not have accomplished this without him. Thanks to his family for their faith in me.

God only knows what I'd be without you. - The Beach Boys (Pet Sounds)


Abstract

The main objective of this thesis is to improve the performance of NES (Network Enabled Server) environments so as to use these environments efficiently. Here efficiency means the maximum number of completed requests that can be treated in a time step by these environments.

The very first problem that arises is the scheduling of applications on the selected servers. We have proposed algorithms for the scheduling of sequential tasks on an NES environment. We showed experimentally that deadline scheduling with priority, combined with a fallback mechanism, can increase the overall number of tasks executed by the NES.

Another important factor that influences the efficiency of NES environments is the mapping of the environment's components onto the available resources. Questions such as "which resources should be used?", "how many resources should be used?" and "should the fastest and best-connected resource be used for middleware or as a computational resource?" remained unanswered. In this thesis we give solutions to these questions.

We have shown theoretically that the optimal deployment on a cluster is a Complete Spanning d-ary (CSD) tree. For heterogeneous resources we presented a deployment heuristic, since finding the best deployment among heterogeneous resources amounts to finding the best broadcast tree on a general graph, which is known to be NP-complete.

Finally, we gave a mathematical model that can analyze an existing deployment and can improve its performance by finding and then removing bottlenecks. This is a heuristic approach for improving deployments of NES environments that were defined by other means.

The deployment planning algorithms and heuristics presented in this thesis are validated by using them to deploy the hierarchical middleware DIET on different sites of Grid'5000, a set of distributed computational resources in France.


Résumé

The main objective of this thesis is to improve the use of NES environments so that these environments can be employed efficiently. By efficiency we mean the maximum number of completed requests that can be processed by these environments in a given lapse of time.

The first problem illustrating the use of NES concerns applications using dedicated servers. We implemented algorithms to balance the load across servers. We showed experimentally the positive impact, in terms of overall load, of a scheduler that takes execution deadlines into account together with a priority mechanism with fallback.

This first study on scheduling led us to examine another important factor in the efficiency of NES: the deployment of the environment's components on the available resources. To offer an efficient deployment, one must answer questions such as: "Which resources should be used?", "How many resources are needed?" or "Is the selected resource suited to the needs (in terms of capacity and computing speed, for example)?". In this thesis we propose solutions to answer these deployment planning questions.

For hierarchical NES we notably proved that the optimal deployment in a homogeneous setting is a CSD tree (Complete Spanning d-ary tree). In the case of heterogeneous resources, the problem being NP-complete, we provide heuristics that aim to build the best request distribution tree.

We propose a mathematical model to analyze an existing deployment and to improve it by detecting bottlenecks.

The deployment planning algorithms and heuristics presented in this thesis were validated experimentally using DIET, a hierarchical middleware for the discovery of computational services, on the Grid'5000 platform.


Contents

1 Introduction

2 Network Enabled Server Environments
  2.1 Introduction
  2.2 Other environments
  2.3 Characteristics comparison of NES environments
    2.3.1 Development
    2.3.2 Architecture
    2.3.3 Initialization
    2.3.4 Task execution
    2.3.5 Scheduling and dynamic load balancing
    2.3.6 Fault tolerance
    2.3.7 Data management
    2.3.8 Security
    2.3.9 Deployment and visualization tools
  2.4 Comparison of systems
  2.5 Conclusion

3 Deadline Scheduling with Priority of tasks on NES
  3.1 Introduction
  3.2 Scheduling algorithms for NES environments
    3.2.1 Client-server scheduler with load measurements
    3.2.2 Client-server scheduler with a forecast correction mechanism
    3.2.3 Client-server scheduler with a priority mechanism
  3.3 Simulation results
  3.4 Conclusions

4 Deployment Tools
  4.1 Introduction
  4.2 Software deployment
    4.2.1 Automatic Deployment of Applications in a Grid Environment
    4.2.2 GoDIET
    4.2.3 JXTA Distributed Framework
    4.2.4 Pegasus
    4.2.5 Sekitei
  4.3 System deployment
    4.3.1 Dell OpenManage Deployment Toolkit
    4.3.2 Kadeploy
    4.3.3 Warewulf
  4.4 Conclusion

5 Heuristic to Structure Hierarchical Scheduler
  5.1 Introduction
    5.1.1 Operating models
  5.2 Architectural model
  5.3 Deployment constraints
  5.4 Deployment construction
  5.5 Model implementation for DIET
  5.6 Experimental results
    5.6.1 Experimental design
    5.6.2 Performance model validation
    5.6.3 Deployment selection validation
  5.7 Conclusion

6 Automatic Middleware Deployment Planning on Clusters
  6.1 Platform deployment
    6.1.1 Platform architecture
    6.1.2 Optimal deployment
    6.1.3 Deployment construction
    6.1.4 Request performance modeling
  6.2 Steady-state throughput modeling
  6.3 Experimental results
    6.3.1 Experimental design
    6.3.2 Model parametrization
    6.3.3 Throughput model validation
    6.3.4 Deployment selection validation
    6.3.5 Validation of model for mix workload
  6.4 Model forecasts
  6.5 Conclusion

7 Automatic Middleware Deployment Planning for Grid
  7.1 Platform Deployment
  7.2 Heuristic for middleware deployment on heterogeneous resources
  7.3 Request performance modeling
  7.4 Steady-state throughput modeling
  7.5 Experimental Results
    7.5.1 Experimental Design
    7.5.2 Model Parametrization
    7.5.3 Performance model validation on homogeneous platform
    7.5.4 Heuristic validation on heterogeneous cluster
  7.6 Conclusion

8 Improve Throughput of a Deployed Hierarchical NES
  8.1 Hierarchical deployment model
  8.2 Throughput calculation of an hierarchical NES
    8.2.1 Find a bottleneck
    8.2.2 Remove the bottleneck
  8.3 Parameter measurement
  8.4 Simulation results
    8.4.1 Test-bed
    8.4.2 Computing a good deployment
  8.5 Conclusion

9 Automatic Deployment Planning Tool
  9.1 Introduction
  9.2 Formulation of performance model
  9.3 Working of ADePT
  9.4 ADePT as a deployment planner for ADAGE
  9.5 Conclusion

10 Conclusion

A Bibliography

B Publications


Chapter 1

Introduction

In 1801, Alessandro Volta demonstrated the electrical battery for Napoleon I at the French National Institute, Paris. His invention evolved into a worldwide electrical power grid. Inspired by the electrical power grid's pervasiveness, ease of use, and reliability, computer scientists in the mid-1990s began exploring the design and development of an analogous infrastructure called the computational power grid [73] for wide-area parallel and distributed computing. The motivation for computational grids was initially driven by large-scale, resource (computational and data) intensive scientific applications that require more resources than a single computer (PC, workstation, supercomputer, or cluster) could provide in a single administrative domain. A grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed "autonomous" resources dynamically at runtime, depending on their availability, capability, performance, cost, and the user's quality-of-service requirements. Applications submitted to a grid are a special class of distributed applications with high computing and resource requirements.

Grid platforms are very promising but very challenging to use because of their intrinsic heterogeneity in terms of hardware capacities, software environments, and even system administrator orientations. One solution to facilitate the use of grid platforms is an extension of the classical Remote Procedure Call (RPC) approach, called "GridRPC". Environments based on this concept are a subset of Network Enabled Server (NES) environments [66]. End-users submit their applications to the grid platform through NES environments, as these environments hide all the complexities of the grid platform from users by providing easy access to grid resources. Numerous NES environments already exist. We present a survey of the most prevalent ones (DIET [28], NeOS [14], NetSolve [12], Nimrod [3], Ninf [71], PUNCH [54], and WebCom [68]) in Chapter 2.

An NES environment basically has five components. Clients provide the user interface and submit requests to execute libraries or applications to servers. Servers receive requests from clients and execute libraries or applications on their behalf. The database contains the status (dynamic and static information) of the monitored resources. Monitors dynamically store and maintain the status of the available computational resources in the database. The scheduler selects a potential server from the list of servers maintained in the database by frequent monitoring, and maps client requests to that server.
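The five components and their interactions can be sketched as follows; this is a minimal illustrative model only, with invented names and a least-loaded selection policy, not the actual API of DIET or any other NES middleware.

```python
from dataclasses import dataclass

# Hypothetical sketch of the five NES components described above.
@dataclass
class ServerStatus:
    name: str
    load: float   # dynamic information, refreshed by a Monitor
    speed: float  # static information (e.g. processing speed)

class Database:
    """Contains the status of the monitored resources."""
    def __init__(self):
        self.servers: dict[str, ServerStatus] = {}

class Monitor:
    """Stores and maintains resource status in the database."""
    def __init__(self, db: Database):
        self.db = db
    def report(self, status: ServerStatus) -> None:
        self.db.servers[status.name] = status

class Scheduler:
    """Selects a potential server from the list kept in the database."""
    def __init__(self, db: Database):
        self.db = db
    def select(self) -> str:
        # One of many possible policies: pick the least-loaded server.
        return min(self.db.servers.values(), key=lambda s: s.load).name

# A client request flows through the scheduler to a server.
db = Database()
mon = Monitor(db)
mon.report(ServerStatus("node-a", load=0.9, speed=1000.0))
mon.report(ServerStatus("node-b", load=0.2, speed=800.0))
print(Scheduler(db).select())  # node-b, the least-loaded server
```

A client would then send its request to the returned server, which executes the library or application on the client's behalf.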

The main objective of the thesis is to improve the performance of an NES environment so as to use these environments efficiently. Here efficiency means the maximum number of completed requests that can be treated in a time step by these environments. A completed request is one for which a response has been returned to the client. We calculate an environment's efficiency in terms of throughput (number of completed requests per time unit), because the traditional objective of makespan minimisation is NP-hard in most practical situations [15, 77]. Instead of absolute minimisation of the execution time, we look at asymptotic optimality by searching for optimal steady-state scheduling techniques. In steady-state scheduling [19], startup and shutdown phases are not considered and the precise ordering and allocation of tasks and messages are not required. Instead, the main goal is to characterize the average activities and capacities of each resource during each time unit.
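In symbols (notation introduced here purely for illustration, not taken from the thesis): if $N(t)$ denotes the number of requests completed by time $t$, the steady-state objective is to maximize the throughput

\[
\rho \;=\; \lim_{t \to \infty} \frac{N(t)}{t},
\]

rather than to minimize the makespan $\max_i C_i$, the completion time of the last request, whose exact minimisation is NP-hard in most practical settings.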

To achieve the thesis objective, the main goal is to remove the obstacles that hinder the efficiency of these environments. The first problem which comes into the picture is related to the scheduling of applications on the selected servers. NES environments are able to find the computing power and the storage capacity necessary for the execution of an application, but it remains to determine a schedule that offers the greatest possible effectiveness on the servers. NES environments generally use the Minimum Completion Time (MCT) [64] on-line scheduling algorithm, whereby all applications are scheduled immediately or refused. This approach can overload interactive servers in high-load conditions and does not allow adaptation of the schedule to task dependencies. Thus, we first studied a scheduling technique that can be adopted for scheduling tasks on an NES environment. As a result, we present algorithms for the scheduling of sequential tasks on an NES environment in Chapter 3. We mainly discuss a deadline scheduling with priority strategy that is more appropriate for a multi-client, multi-server scenario.
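As a rough sketch, the MCT policy mentioned above assigns each incoming task, immediately and greedily, to the server that would finish it earliest. Server names and timings below are invented for the example.

```python
# Illustrative sketch of MCT (Minimum Completion Time) on-line scheduling.
def mct_assign(ready_time: dict[str, float], exec_time: dict[str, float]) -> str:
    """Pick the server minimizing (time it becomes free + task execution time).

    ready_time: when each server finishes its current work;
    exec_time:  how long the new task would run on each server.
    """
    return min(ready_time, key=lambda s: ready_time[s] + exec_time[s])

ready = {"fast": 10.0, "slow": 0.0}  # the fast server is busy until t = 10
cost  = {"fast": 2.0,  "slow": 6.0}  # the task itself runs faster on "fast"
best = mct_assign(ready, cost)
print(best)            # "slow": completion 0 + 6 = 6 beats 10 + 2 = 12
ready[best] += cost[best]  # the task is scheduled immediately, never queued for later
```

The greediness is visible here: the decision is made per task with no look-ahead, which is exactly why, as noted above, the schedule cannot adapt to task dependencies or high-load conditions.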

Some methods and technologies have already been added to basic NES environments to improve their performance, such as good performance forecasting tools and hierarchical arrangements of NES components. Still, there are important areas in which the efficiency of NES environments can be increased, as stated above, for instance by improving the scheduling techniques used at the server level to schedule the assigned tasks.

Another important factor that influences the efficiency of NES environments is the mapping of the environment's components onto the available resources. Generally, components of these environments are mapped onto the available resources as defined by the user or the environment's administrator. We call this process deployment. There exist some deployment tools that deploy the NES components on selected resources. In Chapter 4 we present a survey of some deployment tools.

To show that the efficiency of these environments depends on the arrangement of their components on the available resources (nodes), we performed experiments. For the experiments, we used a hierarchical NES environment called DIET (Distributed Interactive Engineering Toolbox). The DIET scheduler can be hierarchically distributed or used


Figure 1.1: Some of the possible deployments. (a) Star: all nodes are connected to one scheduler. (b) Nodes are divided among two schedulers. (c) Four children are added to each node until all 150 nodes are used. (d) A mix of chain and star graph.

in a centralized fashion. Figure 1.2 shows the results of experiments performed with DIET. These experiments used 151 nodes of the Orsay cluster of Grid'5000, a set of distributed computational resources in France. Depending on the number of available nodes, numerous types of deployments are possible. For example, some of the possible deployments with 150 nodes are shown in Figure 1.1. For the experiment we tested the two deployments (a) and (b) shown in Figure 1.1. In the first deployment, one node is dedicated to a centralized scheduler that manages scheduling for the remaining 150 nodes, which are dedicated computational nodes servicing requests. In the second deployment, three nodes are dedicated to scheduling and manage the remaining 148 nodes, which are dedicated to servicing computational requests. In this test the centralized scheduler completed 22,867 requests in the allotted time of about 1400 seconds, while the hierarchical scheduler completed 36,307 requests in the same amount of time.

The distributed configuration performed significantly better, despite the fact that two of the computational servers are dedicated to scheduling and are not available to service computational requests. However, the optimal arrangement of schedulers is unknown. Should we dedicate four machines to scheduling in the above experiment? Perhaps ten? It is clearly impossible to test all possible arrangements for a given environment.
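In throughput terms, the numbers above work out as follows; this is a trivial back-of-the-envelope calculation, shown only to make the comparison concrete.

```python
# Throughput comparison for the Grid'5000 experiment described above:
# 22,867 vs 36,307 completed requests in roughly 1400 seconds.
def throughput(completed_requests: int, duration_s: float) -> float:
    """Completed requests per second: the efficiency metric used in this thesis."""
    return completed_requests / duration_s

centralized = throughput(22_867, 1400.0)  # about 16.3 requests/s
distributed = throughput(36_307, 1400.0)  # about 25.9 requests/s
print(f"speedup: {distributed / centralized:.2f}x")
```

So the three-agent hierarchy delivers roughly 1.6 times the throughput of the centralized scheduler, even after giving up two computational nodes to scheduling.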

For the experiments explained above we used the DIET deployment tool called GoDIET [85]. We provided the deployment hierarchy in the form of an XML file with all the required information, such as which resource is connected to which and the path of the binary


Figure 1.2: Comparison of requests completed by a centralized DIET scheduler (22,867 requests in about 1400 seconds) versus a three-agent distributed DIET scheduler (36,307 requests in the same time).

files, etc. Like GoDIET, all existing deployment tools deploy the components on the given

resources as defined by the users or the NES administrator, with all the required information. However, these deployment tools do not allow physicists, chemists, biologists, etc., to simply deploy middleware on grids according to their applications: a user who wants to deploy an application must be an expert at selecting the appropriate grid resources based on the application type, the characteristics of the middleware components, and the available resources. Moreover, the user should select the resources according to the application, know how to carry out all the file transfers, set up the configuration files, launch remote tasks on each selected resource, and then finally submit the application to the deployed platform.

Even though the deployment plan of the NES components is a very important factor influencing the throughput of an NES environment, neither a theoretical nor a practical framework exists for finding the best organization of the components. Organizing the nodes efficiently and using their computation power efficiently is beyond the scope of end users. In recent years much work has focused on the design and implementation of NES environments, but neither a tool nor algorithms have been developed for an appropriate mapping of an environment's components to the resources. Thus, planning is needed to arrange the resources in such a manner that, when the NES components are deployed on them, the maximum number of requests can be processed in a time step by the NES environment. We call this planning deployment planning. Environment designers often note that deployment planning is an important problem, but still not enough work has been


done for efficient and automatic deployment. To the best of our knowledge, no deployment planning algorithm or model has

been given for arranging the NES components in an efficient way. Questions such as "which resources should be used?", "how many resources should be used?", and "should the fastest and best-connected resource be used for middleware or as a computational resource?" remained unanswered.

In this thesis we give solutions to these questions in Chapters 5, 6, 7, and 8. Chapter 5 presents a heuristic to deploy middleware on homogeneous resources. While implementing this heuristic we found that an optimal deployment is achievable when all resources are homogeneous, so in Chapter 6 we show optimal deployment planning on homogeneous resources. Chapter 7 gives a heuristic for middleware deployment on heterogeneous resources. Chapter 8 presents a greedy algorithm to improve an existing deployment on heterogeneous resources.

We also give an initial design for a tool based on our automatic deployment planning models in Chapter 9.


Chapter 2

Network Enabled Server Environments

2.1 Introduction

One of the early definitions of the grid was provided by Carl Kesselman in [73]: A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities. It was Len Kleinrock who foresaw the upcoming development of the grid in 1969: We will probably see the spread of 'computer utilities', which, like present electric and telephone utilities, will service individual homes and offices across the country.

Computational grids enable the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources (such as supercomputers, compute clusters, storage systems, data sources, instruments, and people) and present them as a single, unified resource for solving large-scale compute- and data-intensive applications (e.g., molecular modeling for drug design, brain activity analysis, and high-energy physics).

For the development of efficient grid environments, several approaches co-exist to offer the user a uniform view of the grid's resources. One resource management approach on the grid is the Application Service Provider (ASP) approach. ASP is based on the client-server computation model. In ASP, providers offer, not necessarily for free, computing resources (hardware and software) to clients. Among the advantages of this approach is that end users do not need to be experts in parallel programming to benefit from high-performance parallel programs and computers. This model is closely related to the classical Remote Procedure Call (RPC) paradigm. On a grid platform, RPC (or GridRPC) offers easy access to available resources from a Web browser, a Problem Solving Environment (PSE), or a simple client program written in C, Fortran, or Java. It also provides more transparency by hiding the selection and allocation of computing resources.
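The transparency described above can be illustrated with a small sketch (illustrative Python, not an actual GridRPC binding; all names are ours): the client names a remote problem and supplies its data, while server discovery, selection, and allocation stay hidden behind the middleware call.

```python
# Illustrative sketch of GridRPC-style transparency (names are ours):
# the client never chooses a server; the middleware does.

class GridRPCMiddleware:
    """Toy stand-in for an NES agent: maps problem names to servers."""
    def __init__(self):
        self.services = {}          # problem name -> list of server callables

    def register(self, problem, server):
        self.services.setdefault(problem, []).append(server)

    def call(self, problem, *args):
        # Server selection is hidden inside this call.
        servers = self.services[problem]
        chosen = servers[0]         # trivial selection policy for the sketch
        return chosen(*args)

middleware = GridRPCMiddleware()
# A "server" offering a dense matrix multiply, registered with the middleware.
middleware.register("dgemm", lambda a, b: [[sum(x * y for x, y in zip(r, c))
                                            for c in zip(*b)] for r in a])

# Client side: one call, no knowledge of where the computation runs.
result = middleware.call("dgemm", [[1, 2]], [[3], [4]])
```

The point of the sketch is the shape of the interface, not the selection policy: a real middleware would pick among registered servers using monitoring data rather than taking the first one.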

In a grid context, this approach requires the implementation of middleware to facilitate client access to remote resources. In the ASP approach, a common way for clients to request resources to solve their problems is to submit a request to the middleware.


The middleware will find the most appropriate server that will solve the problem on behalf of the client using a specific piece of software. Several environments, usually called Network Enabled Servers (NES), have been developed, such as NetSolve, Ninf, NEOS, DIET, etc.

NES environments evolved only a decade ago, when programming paradigms and core middleware such as ORB [36], RMI [81], RPC [76], and web services laid a strong foundation for distributed resource connectivity. Some of them make use of a multiple-tier structure in which clients, agents, and servers communicate efficiently and the most appropriate resource for a computation request is selected.

The main goal of NES environments is to provide end users (biologists, mathematicians, astrophysicists, etc.) with access to computational facilities via potential servers to solve their problems. These problems are large (i.e., require many CPU hours) and cannot be solved using a single machine. They may also require multiple servers during an application's lifetime, so as to obtain the result in an appropriate time or to use some specific feature or library. NES environments provide access to high computing power and large storage capacity at low cost.

Theoretically all NES environments have the same goal, but they are numerous and differ in the underlying models they adopt and the features they provide. In this chapter seven NES environments are presented in a comparative manner: DIET, NeOS, NetSolve, Ninf, Nimrod, PUNCH, and WebCom.

2.2 Other environments

In addition to the NES environments presented in this chapter, other middleware infrastructures are available to execute remote jobs over the grid. Some of them are based on the cycle-stealing concept, unlike NES environments, which use dedicated servers. Other systems are dedicated but do not provide the scheduling system, job management services, and queueing mechanisms that NES environments provide. In this section a small overview of the most common job management systems and grid systems is given for completeness.

Condor1 is a job management system. Although Condor provides a scheduling system, job management services, and a queueing mechanism, it is based on the cycle-stealing concept, unlike NES environments, which use servers dedicated to specific applications. Condor [80] is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, which in turn places them into a queue, chooses when and where to run the jobs following a policy, carefully monitors their progress, and ultimately informs the user upon completion. Condor is the product of the Condor Research Project at the University of Wisconsin-Madison. Most flavors of Unix are supported, as well as Windows XP/2K. Condor also provides a platform for other systems to be built upon it using its core services.

1http://www.cs.wisc.edu/condor/


XtremWeb [27] is a project developed at the University of Paris-Sud, France. XtremWeb2 is a software platform designed to serve as a substrate for global computing experiments. XtremWeb uses CPU idle time for task execution. It is intended to distribute applications over a set of hosts using a cycle-stealing scheme and particularly focuses on multi-parameter applications which have to be computed several times with different inputs and/or parameters, each computation being fully independent from the others. Most flavors of Unix are supported, as well as Windows and Mac OS X.

The Globus system is a dedicated grid system but does not provide a scheduling system, job management services, or queueing mechanisms as NES environments do. Globus [48] is a grid toolkit to build computational grids and grid-based applications. The Globus Toolkit acts as glue to integrate various resources (such as desktop PCs, clusters, supercomputers, databases, visualization instruments, etc.) under different domains of interest called Virtual Organizations. Virtual Organizations are islands of domains where researchers work under common interests. This toolkit is developed by the Globus Alliance3 and many other collaborators around the world. Many systems are built on top of Globus and use its services, such as Ninf-G, Nimrod-G, and Condor-G.

JiPANG (Jini-based Portal Augmenting Grids)4 is a collaborative work of Indiana University, the Electrotechnical Institute, Tokyo Institute of Technology, and the University of Tennessee. JiPANG is a portal system and a toolkit which provides a uniform access interface layer to a variety of grid systems, and it is built on top of the Jini distributed object technology. JiPANG performs uniform higher-level management of the computing services and resources managed by individual grid systems such as Ninf, NetSolve, Globus, etc. In order to give the user a uniform interface to the grids, JiPANG provides a set of simple Java APIs called the JiPANG Toolkits and, furthermore, allows the user to interact with grid systems, again in a uniform way, using the JiPANG browser application.

2.3 Characteristics comparison of NES environments

In this section we compare and contrast seven NES environments according to their characteristics, organized under different headings. The environments are presented in alphabetical order: DIET, NeOS, NetSolve, Nimrod, Ninf, PUNCH, and WebCom. Some characteristics are not present in all of the NES environments presented, so the corresponding sections are omitted where they do not apply.

2http://www.lri.fr/~fedak/XtremWeb/
3http://www.globus.org/alliance/
4http://ninf.is.titech.ac.jp/jipang/


2.3.1 Development

NES environments have been developed around the world for many years. They are the product of research in many domains and are built using different programming technologies.

2.3.1.1 DIET

The Distributed Interactive Engineering Toolbox (DIET) [28] is developed by the GRAAL5 team at École Normale Supérieure de Lyon. The DIET project started in 1997 and DIET is available for all flavors of UNIX6. DIET is implemented in the C++ programming language.

DIET has a hierarchical arrangement of its components to provide scalability. Communication between components is performed through CORBA. The goal of DIET is to build computational servers based on a set of tools. Multiple hierarchies in DIET are achieved either through DIETj [29] or through multiMA [37].

2.3.1.2 NeOS

The Network-Enabled Optimization Server (NeOS) [45] is developed at Argonne National Laboratory. The first version of the NeOS server was available in September 1995, and the latest version7 is compatible with Windows 9x/XP and all flavors of Unix. The NeOS implementation and interfaces are written in Tcl/Tk, Java, and Kestrel. NeOS is an environment for solving optimization problems over the Internet.

2.3.1.3 NetSolve

The Network-enabled Solver-based system (NetSolve) [31] is developed at the University of Tennessee, Knoxville. It8 is available for all popular variants of the UNIX operating system, and parts of the system are available for Microsoft Windows platforms. The first version was available in January 1996. The NetSolve system is implemented in the C programming language, with the exception of the thin upper layers of the client API that serve as environment-specific interfaces to the NetSolve system. NetSolve is a client-server system that enables users to solve complex problems remotely. NetSolve is also known as GridSolve9.

2.3.1.4 Nimrod

Nimrod [4] is developed at the School of Computer Science and Software Engineering, Monash University. It first appeared in August 1995. It is implemented in the

5http://graal.ens-lyon.fr/
6http://graal.ens-lyon.fr/DIET
7http://www-NeOS.mcs.anl.gov/NeOS/
8http://www.cs.utk.edu/netsolve
9http://icl.cs.utk.edu/netsolve


C language. There are different flavors of Nimrod: Nimrod-G, EnFuzion, Nimrod-O, Active Sheets, and the Nimrod Portal. EnFuzion10 and all other flavors of Nimrod11 are available. Nimrod is a tool for performing parametric studies and for job submission across loosely coupled workstations.

2.3.1.5 Ninf

Ninf [6] is a collaborative product of AIST, the University of Tsukuba, Tokyo Institute of Technology, the Real World Computing Partnership, Kyoto University, and NTT Software Inc. The Ninf12 project started in 1994. Ninf is built using the C and C++ languages. There are three flavors of Ninf: the Ninf Portal, Ninf-G, and Ninf-C. The Ninf Portal is an automatic generation tool for grid portals. Ninf-G directly accesses resources on grids managed by the Globus Toolkit. Ninf-C does not have any facility to access the grid directly, although it can access it indirectly via Condor-G. Ninf is a client-server RPC-based system for large-scale scientific computing.

2.3.1.6 PUNCH

The Purdue University Network Computing Hubs (PUNCH) [55] system is developed at Purdue University, USA. PUNCH13 is compatible with Windows 9x/XP and all flavors of Unix. PUNCH is a demand-based network-computing system that allows users to access and run existing software tools via standard world-wide web browsers. Tools do not have to be written in any particular language, and access to source and/or object code is not required.

2.3.1.7 WebCom

WebCom [69] is developed at the Centre for Unified Computing, Cork, Ireland. The WebCom project started in 1996. The grid-enabled version of WebCom is WebCom-G, where "G" stands for "Grid". It is available14 to authorized users. WebCom and WebCom-G are implemented in Java and are compatible with Windows 9x/XP and all flavors of Unix and Linux.

Analysis

The development of the various NESs began in the mid-1990s. These environments target a variety of application domains and employ multiple programming technologies. Earlier systems employed C and C++; in recent years Java has become a more popular implementation language.

10http://www.axceleon.com/
11http://www.csse.monash.edu.au/~davida/nimrod.html/downloads.htm
12http://ninf.apgrid.org/packages/welcome.shtml
13http://punch.purdue.edu/HubInfo/presentations/1999/iupui.html
14http://www.cuc.ucc.ie


2.3.2 Architecture

The necessary components of a basic NES environment are clients, servers, databases, monitors, and schedulers. Clients provide the user interface and submit requests to execute libraries or applications on servers. Servers receive requests from clients and execute libraries or applications on their behalf. The database contains the status (dynamic and static information) of the monitored resources. Monitors dynamically store and maintain the status of the available computational resources in the database. The scheduler selects a potential server from the list of servers maintained in the database through frequent monitoring and maps client requests to that server. Some NES systems have merged the functionalities of the basic NES components, while others have split them, giving rise to a different number of components and different component names in each NES environment.

2.3.2.1 DIET

The DIET architecture is a hierarchy of agents that provides greater scalability for scheduling client requests on the available servers. The collection of agents uses a broadcast/gather operation to find and select among available servers while taking into account the locations of any existing data (which could be pre-staged due to previous executions), the load on the available servers, and application-specific performance predictions. The main components of DIET [28] are clients, agents, and servers, shown in Figure 2.1.

[Figure: a client connected to a Master Agent (MA); below the MAs, Local Agents (LA) form a tree.]

Figure 2.1: Architecture of DIET.

The client is the application interface by which users can submit problems to DIET.Users can access DIET via different kinds of client interfaces: web portals, PSEs such


as Scilab, or programs written in C or C++. Many clients can connect to DIET at once, and each can request the same or different types of services by specifying the problem to be solved. Clients can use either the synchronous interface, where the client must wait for the response, or the asynchronous interface, where the client can continue with other work and be notified when the problem has been solved. Agents sort the servers according to the servers' status information. Agents come in two types: Master Agents and Local Agents. The agent that communicates directly with clients to receive tasks and sends the reference of a selected server back to the client is called the Master Agent (MA). Local Agents (LA) are included in the DIET hierarchy to provide scalability and adaptation to diverse network environments. Servers (SeDs) are the components that execute the tasks submitted to DIET on behalf of clients. While the top of the tree must be an MA, any number of LAs can be connected in an arbitrary tree, and servers can be attached to the MA or to any of the LAs.
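The structural constraints just stated can be captured in a minimal model (illustrative Python; the class and function names are ours, not DIET's API): the root must be an MA, the internal nodes below it are LAs, and SeDs are leaves that may hang off the MA or any LA.

```python
# Minimal model of the DIET hierarchy constraints (names are ours):
# MA at the root, LAs in an arbitrary tree below, SeDs as leaves.

class Node:
    def __init__(self, kind, children=()):
        assert kind in ("MA", "LA", "SeD")
        self.kind, self.children = kind, list(children)

def valid_hierarchy(root):
    """Check the tree against the constraints described in the text."""
    if root.kind != "MA":
        return False                      # the top of the tree must be an MA
    def ok(node):
        if node.kind == "SeD":
            return not node.children      # SeDs are always leaves
        # An MA or LA may have any number of LA or SeD children.
        return all(c.kind in ("LA", "SeD") and ok(c) for c in node.children)
    return ok(root)

# An MA with one SeD attached directly and one LA subtree holding two SeDs.
tree = Node("MA", [Node("SeD"),
                   Node("LA", [Node("SeD"), Node("SeD")])])
```

For instance, `valid_hierarchy(tree)` holds for the tree above, while a tree rooted at an LA, or one with a SeD as an internal node, is rejected.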

2.3.2.2 NeOS

NeOS [14] is an environment for solving optimization problems over the Internet. NeOS has three components: client, server, and solver. The NeOS client performs the same functionality as the basic NES client, while the NeOS server performs the functionality of the database, monitor, and scheduler. The NeOS solver performs the functionality of the basic NES server. Clients submit optimization problems to the NeOS server via e-mail, the World Wide Web, or the NeOS Submission Tool. The server locates the appropriate optimization solver and does all the additional work required by the solver before and after solving the submitted problem. The solver solves the submitted optimization task. The connectivity of the NeOS components is shown in Figure 2.2.

[Figure: Java, Web, and Mail clients reach the single NeOS server (via the Web or e-mail); the server engine forwards jobs to solver daemons.]

Figure 2.2: The NeOS architecture

2.3.2.3 NetSolve

NetSolve [31] has three main components: client libraries, an agent, and servers. The NetSolve client and server perform the same functionality as the basic NES client and basic


[Figure: applications and users use the NS client library to contact the NS agent (resource discovery, load balancing, resource allocation, fault tolerance), which directs them to NS servers.]

Figure 2.3: The NetSolve architecture

NES server, respectively, whereas the functionalities of the database, monitor, and scheduler are merged into the NetSolve agent. Using the client libraries, users write applications and submit them to NetSolve for execution. The agent gathers information about the servers and, depending on server capabilities, chooses the appropriate server to execute the client-submitted task. A server is a networked resource that serves up computational hardware and software resources. Figure 2.3 shows how these components are organized.

2.3.2.4 Nimrod

[Figure: client, Parametric Engine (PE), scheduler, dispatcher, and Job Wrapper (JW).]

Figure 2.4: Architecture of Nimrod-G

The architecture of Nimrod [25] is shown in Figure 2.4 with its five key components: clients, a Parametric Engine (PE), a scheduler, a dispatcher, and a Job Wrapper


(JW). The PE performs the same functionality as the basic NES database. The functionality of the monitor is performed jointly by the dispatcher and the JW. The JW also performs task execution, like the basic NES server. Clients act as a user interface for controlling and supervising the experiment under consideration. They also serve as a monitoring console and list the status of all jobs, which a user can view and control. The PE acts as a persistent job-control agent and is the central component from which the whole experiment is managed and maintained. It is responsible for the parametrization of the experiment, the creation of jobs, the maintenance of job status, and the interaction with the clients, the schedule adviser, and the dispatcher. The Nimrod scheduler is responsible for resource discovery, resource selection, and job assignment, while the dispatcher initiates the execution of a task on the selected resource as per the scheduler's instruction and periodically reports the task's execution status to the PE. The JW interprets a simple script containing instructions for file transfer and subtask execution. It is responsible for staging the application tasks and data, starting the execution of the task on the assigned resource, and sending the results back to the PE via the dispatcher. It basically mediates between the PE and the machine on which the task runs.

2.3.2.5 Ninf

[Figure: the IDL compiler generates a stub from the IDL file and the numerical library, producing the Remote Library Executable, which is registered with the server; the server forks the executable on behalf of the client.]

Figure 2.5: Architecture of Ninf

The basic Ninf [72] system has two main components: a client and a server (which performs the functionalities of the database manager, scheduler, and monitor of the basic NES environment). Three other elements support the server: the Remote Library Executable (RLE), the IDL compiler, and the register driver. The client facilitates users by providing an easy-to-use API. The RLE executes numerical operations. It contains a network stub that handles the communication between server and clients and marshals arguments. RLEs are implemented as executable programs with the stub routine as the main routine, managed by the server process. The IDL compiler compiles interface descriptions and generates the 'stub main' for the RLE, helping to link the executable. The register driver registers the remote library executable with the server. The server provides library interface information and invokes the RLE. The connectivity of the Ninf components is shown in Figure 2.5.


[Figure: the user reaches the client unit through a network interface; the management unit holds user-, tool-, and site-specific information; the execution unit manages the resources.]

Figure 2.6: Architecture of PUNCH

2.3.2.6 PUNCH

The core architecture of PUNCH [55] has three main units: a client unit, a management unit, and an execution unit. The management unit performs the work of database management, the scheduler, and the monitor. The execution unit performs task execution, and so works as the server. The hierarchical connectivity of the units is shown in Figure 2.6. The client unit acts as a channel for the commands, preferences, and directives specified by the user. It provides interface-related input and output to the system and is integrated into the local environment, giving transparent access to the network computing system. The management unit acts as a demand-driven scheduling engine for the associated software and hardware resources, introducing access control policies and matching resources to requirements. The execution unit provides the management unit with consistent access to the hardware resources.

2.3.2.7 WebCom

WebCom [67] consists of two types of components: a master and an arbitrary number of clients. The master initiates the computation. The WebCom master acts as an agent and the clients act as servers. WebCom has a special feature whereby a WebCom client (server) can be promoted to act as a WebCom master (agent).

A WebCom client may be a master or a Java-enabled web browser capable of executing atomic instructions. Clients connected via web browsers are only sent atomic instructions (instructions which cannot be further distributed), whereas client masters may be sent both atomic and graph instructions.

A WebCom deployment consists of a number of Abstract Machines (AM). When a computation is initiated, one AM acts as the root server and the other AMs act as clients or promotable clients. A promoted client acting as a server accepts connections and executes tasks. Promotion occurs when a task, representing a Condensed Graph (CG), passed to the client can be partitioned for further distribution. Hence in the WebCom environment clients are promoted to masters depending on the type of task they receive [70].

Analysis

It can be seen that most of the architectures surveyed employ a two-layered client/server model. The exceptions are DIET and WebCom, which are hierarchical in order to spread the computational load across multiple subnets. This facilitates the exploitation of the underlying shared resources. In WebCom the hierarchical structure is dynamically reconfigurable. NeOS is aimed at optimization problems, Nimrod at parametric computations. PUNCH is oriented toward WWW-browser-enabled NES computing, and NetSolve is multilayered but does not support a hierarchical structure.

NES environments should adopt a hierarchical structure because it provides scalability by removing the bottleneck of a central point and distributing the load across different levels.

2.3.3 Initialization

As shown in the previous section, each NES environment has a different number of components and each component has different functionality, so the order of component deployment also varies with each NES environment. In the following sections the initialization steps of the surveyed NES environments are outlined.

2.3.3.1 DIET

The DIET platform is constructed following the hierarchy of agents and SeDs. Initially the naming service should be deployed, so that all the launched elements can easily find each other. To find an element of interest, the searching element only needs to know the port at which the naming service can be found, the hostname, and a string-based name for the element of interest. As a benefit of this approach, multiple DIET deployments can be launched on the same group of machines without conflict, as long as the naming service of each deployment uses a different port and/or a different machine.

Figure 2.7 provides an overview of the deployment steps of DIET's elements. After the naming service, the MA is launched; the MA is the root of the DIET hierarchy and thus does not register with any other element. Then either an LA or a SeD can be deployed as a child of the MA. If an LA is connected to the MA, then another LA or a SeD can be connected to this LA. The bottom of the DIET hierarchy consists of SeDs. Once the DIET system is deployed, clients can connect to the MA and submit their tasks for execution.
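The launch ordering above can be sketched as follows (illustrative Python; the dictionary stands in for DIET's actual CORBA naming service, and all names are ours): elements register under string names, and every element except the MA must be able to resolve its parent before it starts.

```python
# Sketch of the DIET launch ordering (toy naming service, names ours):
# naming service first, then the MA, then LAs and SeDs top-down.

naming_service = {}   # element name -> record; stands in for (host, port) lookup

def launch(name, kind, parent=None):
    if kind != "MA":                      # only the MA has no parent
        assert parent in naming_service, f"parent {parent!r} is not up yet"
    naming_service[name] = {"kind": kind, "parent": parent}
    return name

# Deployment in the required order.
launch("MA1", "MA")
launch("LA1", "LA", parent="MA1")
launch("SeD1", "SeD", parent="LA1")
launch("SeD2", "SeD", parent="MA1")      # SeDs may attach to the MA directly
```

Launching an element before its parent (e.g. a SeD under a not-yet-started LA) fails the parent lookup, which mirrors why the hierarchy must be deployed from the root down.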

2.3.3.2 NeOS

Since NeOS is a static environment for solving optimization problems (NeOS servers are initialized by the NeOS administrator; users cannot initialize the NeOS server, they


[Figure: launch steps (1)–(5): the MA first, then LAs attached below it, and finally a client connecting to the MA.]

Figure 2.7: Launch of a DIET platform.

can only submit their jobs for optimization to the NeOS server). Thus there are no specific initialization steps to be mentioned here.

2.3.3.3 NetSolve

First the agent is launched. Then the servers register with it by sending the list of problems they are able to solve, the speed and workload of the machine on which they are running, and the network speed (latency and bandwidth) between themselves and the agent. Once this initialization step is performed, a client can call upon the agent to solve a problem. The NetSolve initialization steps [33] are shown in Figure 2.8.

[Figure: (1) the agent A is launched, (2) servers S register with the agent, (3) a client contacts the agent.]

Figure 2.8: NetSolve initialization steps
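The registration handshake described above can be sketched as follows (illustrative Python; the field names are ours, not NetSolve's wire format): each server advertises its problem list and its machine and network characteristics to the agent, which then answers client queries purely from those registrations.

```python
# Toy sketch of agent-side registration (field names are ours):
# servers advertise problems plus machine/network status; the agent
# later answers client queries from this registry alone.

class Agent:
    def __init__(self):
        self.registry = []

    def register(self, server_info):
        self.registry.append(server_info)

    def solvers_for(self, problem):
        # A client calls upon the agent; only registered servers qualify.
        return [s["host"] for s in self.registry if problem in s["problems"]]

agent = Agent()
agent.register({"host": "s1", "problems": ["dgemm", "fft"],
                "speed_mflops": 800, "workload": 0.2,
                "latency_ms": 3, "bandwidth_mbps": 100})
agent.register({"host": "s2", "problems": ["fft"],
                "speed_mflops": 1200, "workload": 0.7,
                "latency_ms": 10, "bandwidth_mbps": 10})
```

A real agent would also rank the returned hosts using the advertised speed, workload, and network figures; here only the lookup is shown.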

2.3.3.4 PUNCH

The initial step in the PUNCH [55] system is the creation of the input files for the relevant simulation (in the case of VLSI design and computer architecture), followed by the user's input parameters for the simulation program. Finally the simulation is started via a browser interface.


2.3.3.5 WebCom

A WebCom15 binary distribution file can be downloaded from the official site by authorized users and installed on the local machine. Depending on the requirements, one can configure it to act as a master server or as a client of one of the WebCom portals. Users can launch the WebCom IDE, build applications on the fly, and run them on the local machine or target them to run on a remote machine; alternatively they can use the command-line execution facility. The initialization takes place within the backplane, which checks for the modules to be loaded during startup (this is why the backplane is also referred to as the bootstrap program). Once the modules are loaded, communication is initiated, clients connect, and an instruction queue is created for each. Work is distributed to each client by the Condensed Graph engine, which is driven by the scheduler with the help of the load balancer.

Analysis

Components of NES environments can have different functionalities and each can be deployed independently. It can be seen that some systems use an agent for initialization (NetSolve), others are initialized upon submission of tasks via their clients or browsers (NeOS), while the WebCom system has to be configured initially but its state can be changed dynamically. Initialization reflects the type of work that each system performs.

2.3.4 Task execution

The granularity and type of tasks executed by NES environments vary, and each of these environments has its own model of task execution. The general working of a basic NES environment is described below. The monitors frequently check the status of the available resources (server, network latency, CPU load, etc.) and register the information in the database. Clients ask the scheduler to provide them with a suitable computational server on which their request can be executed, using the client APIs (e.g., C, Java, MATLAB, Fortran, Web, e-mail, etc.) or tools built with the client APIs. The scheduler queries the database, selects suitable computing resources based on certain algorithms, and returns the selection to the client. Clients then remotely invoke the libraries or applications on the selected server. The server does the computation and returns the resulting data to the client.
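This generic workflow can be sketched end-to-end (illustrative Python; all names and the selection policy are ours): monitors write resource status into the database, the scheduler picks among servers offering the requested service, and the client then contacts the chosen server directly.

```python
# Minimal end-to-end sketch of the generic NES workflow (names ours):
# monitor -> database -> scheduler selection -> client invocation.

database = {}                      # server name -> status record

def monitor_update(server, status):
    database[server] = status      # monitors refresh this periodically

def schedule(problem):
    """Pick a server offering `problem`; lowest CPU load wins here."""
    candidates = [(srv, st) for srv, st in database.items()
                  if problem in st["services"]]
    server, _ = min(candidates, key=lambda pair: pair[1]["cpu_load"])
    return server

monitor_update("s1", {"services": {"solve"}, "cpu_load": 0.9})
monitor_update("s2", {"services": {"solve"}, "cpu_load": 0.1})

chosen = schedule("solve")         # the client would now invoke `chosen`
```

The "certain algorithms" mentioned in the text correspond to the selection policy inside `schedule`; real systems weigh latency, bandwidth, and predicted execution time rather than CPU load alone.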

2.3.4.1 DIET

Task execution in DIET proceeds in two phases. In the scheduling phase the DIET client (1) submits the problem (request) to the MA. Agents maintain a simple list of the sub-trees containing a particular service; thus the MA (2) checks all the mandatory information of the task and (3) broadcasts the request on all sub-trees containing that service. (4) Other agents in the hierarchy perform the same operation, forwarding the request

15http://portal.webcom-g.org


on all eligible sub-trees. When the request reaches the server level, each server calls its local FAST [43] service to (5) predict the execution time of the request on that server. These predictions are then (6) passed back up the tree, where each level in the hierarchy (7) sorts the responses to minimize execution time and (8) passes a subset of responses to the next higher level. Finally, (9, 10) the MA returns one or several proposed servers to the client. In the service phase, the client (11) connects directly to the selected server and provides the data for task execution. Once the data is received, the connection between client and server is redirected. (12) After finishing the task execution, the server re-establishes the connection with the client and the result is sent back.
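Steps (6)-(8), in which each level of the hierarchy sorts its children's predictions and forwards only a subset upward, can be sketched as follows. The tree shape, the component names, and the subset size are assumptions for the example, not DIET's actual implementation.

```python
# Sketch of DIET's response aggregation (steps 6-8): each agent sorts the
# predicted execution times coming from its sub-trees and passes only the
# best few upward, so the MA ends up with a short list of proposed servers.
# Tree shape, names, and subset size are illustrative assumptions.

def aggregate(node, predictions, subset_size=2):
    """Return the best `subset_size` (server, time) pairs under `node`."""
    if node in predictions:                      # a SeD: its own FAST estimate
        return [(node, predictions[node])]
    responses = []
    for child in tree[node]:                     # gather from all sub-trees
        responses.extend(aggregate(child, predictions, subset_size))
    responses.sort(key=lambda r: r[1])           # minimize execution time
    return responses[:subset_size]               # forward only a subset

tree = {"MA": ["LA1", "LA2"], "LA1": ["SeD1", "SeD2"], "LA2": ["SeD3"]}
predicted = {"SeD1": 5.0, "SeD2": 2.0, "SeD3": 3.5}  # FAST-style predictions
print(aggregate("MA", predicted))  # [('SeD2', 2.0), ('SeD3', 3.5)]
```

The sort-and-truncate at every level is what keeps the volume of responses bounded as they travel up the hierarchy.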

Figure 2.9: DIET task execution steps.

This description of the agent/server architecture focuses on the common-case usage of DIET. An extension of this architecture is also available [29] that uses JXTA [11] peer-to-peer technology to allow the forwarding of client requests between MAs, and thus the sharing of work between otherwise independent DIET hierarchies.

2.3.4.2 NeOS

A client sends a job to the NeOS server. The server assigns each job to a solver script that executes application software via requests to a communications daemon assigned to the application and solver station. The server can then delegate to the solver machines the tasks of interpreting data and executing software. The NeOS Comms Tool is used to download the user files from the NEOS server to the solver, to invoke the solver application, and to return the solver's results to NEOS. The NEOS server, upon receipt of a job for a particular solver, connects to the machine running the Comms Tool daemon for that solver, uploads the user's data, and requests invocation of the remote solver. The Comms Tool daemon, upon receipt of this message, downloads the user's data and invokes the application software. Intermediate job results can be streamed back over the connection to the server and from there, in turn, streamed back to the user. The Comms Tool daemon sends final results to the NEOS server upon completion of the job, and the server formats and forwards the results to the user. A detailed


2.3. CHARACTERISTICS COMPARISON OF NES ENVIRONMENTS 23

working of NeOS, with the submission flow for three jobs arriving simultaneously, is illustrated in [44].

2.3.4.3 NetSolve

Figure 2.10: NetSolve task execution steps.

A NetSolve session works as shown in Figure 2.10. (1) A client sends its task to the agent. During a NetSolve deployment, the servers provide the agent with all the information regarding their status and the tasks that they can execute. Then (2) the agent selects the appropriate server for the client request and (3) gives the client a reference to the server. The agent then increases the occupation status of this server by some factor (NetSolve uses a mechanism [23] for this). (4) The client connects directly to the server and provides the data for task execution. When (5) the task is executed, (6) the result is sent back to the client by the server. (7) The server conveys its current status to the agent after finishing the task, so that the agent can remove the factor it added to the server's occupation status. A detailed technical explanation of the working of the NetSolve components is available in [23].
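The agent's bookkeeping in steps (2), (3), and (7) can be sketched as follows. The occupation increment and the data structures are hypothetical stand-ins for the mechanism of [23], used only to illustrate the add-then-correct pattern.

```python
# Sketch of the NetSolve agent's occupation bookkeeping: the agent bumps a
# server's occupation when it hands the server out (steps 2-3) and replaces
# the estimate when the server reports its real status (step 7). The factor
# value and data structures are illustrative, not NetSolve's actual code.

FACTOR = 10  # hypothetical occupation increment per assigned request

occupation = {"serverA": 30, "serverB": 5}

def assign(request):
    """Pick the least occupied server and raise its occupation status."""
    server = min(occupation, key=occupation.get)
    occupation[server] += FACTOR
    return server  # reference handed back to the client

def task_finished(server, reported_status):
    """Step 7: replace the estimated factor with the server's real status."""
    occupation[server] = reported_status

s = assign("dgesv")          # serverB (occupation 5 -> 15)
task_finished(s, 8)          # the server reports its true load afterwards
print(s, occupation[s])      # serverB 8
```

The provisional increment prevents the agent from piling many requests onto the same server before the server's next status report arrives.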

2.3.4.4 Nimrod

The working of Nimrod [2] is shown in Figure 2.11. (1) The client sends a request to a PE. (2) The PE prepares the data for the client's particular execution. (3) It asks the scheduler to order the request. (4) The scheduler arranges the requests according to some scheduling algorithm, and (5) sends the ordered requests back to the PE. (6) The PE forwards the request to the dispatcher, (7) which selects the appropriate Job Wrapper (JW). (8) The JW receives the request from the dispatcher, (9) executes the program for a given set of parameters, and (10) sends the result back to the dispatcher, which returns the result to the PE (11). Reduction of the executed data, if required, is done by the PE. Finally, the PE gives the result back (12) to the client. Steps (1) and (12) are performed once per experiment, while steps (2) to (11) are run for each distinct parameter set.


Figure 2.11: Nimrod: work procedure.

2.3.4.5 Ninf

Task execution by Ninf is shown in Figure 2.12. (1) The client sends an interface request to the server. (2) The server replies to the interface request. Then (3) the client calls the library to invoke the executable. (4) The server searches for the Ninf executable associated with the name that the client called and, (5) upon a successful search, executes the library and (6) sets up a communication link with the client, with the help of stubs, to return the result. In short, the client asks the server to execute its request, and the server executes the requested libraries and sends back the result.

Figure 2.12: Working of Ninf.

2.3.4.6 PUNCH

Figure 2.13 shows PUNCH's working. (1) Users access the PUNCH system via standard WWW browsers. (2) The Network Desktop processes and responds to all "non-application-invocation" requests. (3) Application invocation requests are forwarded to an appropriate management unit. (4) The management unit authenticates requests, determines resource requirements, and selects an execution unit. (5) The management unit sends the request to the execution unit. (6) The execution unit processes requests with the help of the available resources and (7) notifies the management unit upon completion.


Figure 2.13: Working of PUNCH.

2.3.4.7 WebCom

The working of WebCom is shown in Figure 2.14. (1) Computation begins with a 2-tier hierarchy; graph execution begins on the server. (2) The server sends a CG task to a client, causing it to be promoted. (3) The promoted client requests additional clients to be assigned to it by the server. This request might not be serviced; if not, the client continues executing the graph on its own. (4) The server issues a redirect directive to a number of its clients. (5) Clients redirected from the server connect as clients to the promoted client. (6) The promoted client directs its new clients to connect to each other, forming a local peer-to-peer network. (7) Finally, when the tasks are executed, results are sent back to the parent server.

Analysis
The major difference in task execution among NES environments lies in the work done by their components. In NetSolve the work is done by the server on behalf of its client, but in WebCom the work is done by the clients, and in NeOS by the solvers.

In NetSolve, agents search for potential servers, whereas in a system like WebCom, servers search for potential clients for faster execution. Even the granularity of the executed tasks differs among most of the surveyed systems. In some systems (NeOS, PUNCH) atomic tasks (single tasks) have to be submitted, whereas in others (WebCom) tasks can be submitted as condensed tasks (tasks within tasks).

2.3.5 Scheduling and dynamic load balancing

Scheduling is a necessity in all existing systems. A good scheduler is one that provides fairness in task distribution, enforces policies well, and balances the system when it is fully busy, while meeting the deadlines of the given tasks. NES environments also use scheduling to fully utilize the resources, but each of them has a different scheduling scheme.

When processor load cannot be determined statically, dynamic load balancing techniques can be used for effective speedups. In this section we present the scheduling techniques of each of the NES environments, and the dynamic load balancing of WebCom.

Figure 2.14: Working of WebCom.

2.3.5.1 DIET

By default, DIET uses a round-robin algorithm, based on the CPU capacity of the resources, to select the appropriate server. Users can also provide a plug-in scheduler for SeD selection. The DIET plug-in scheduler [84] works in two steps. First, an "evaluation table" is generated by each machine; this table contains the SeD status information as a set of parameters and their values (as shown in Table 2.1). In the second step, the best SeD is selected according to a "selection function". This function is supplied by the client, and the evaluation tables of the different SeDs are compared according to it. For example, if a client needs a large amount of memory and has a particular deadline for its request, the client's selection function will be based on two parameters: the predicted execution time (using FAST [43]) and the free memory space. The SeD ranked best by this selection function is selected to execute the client request.

    Parameter     Value
    Pre_Exe_Time  3.9
    Free_Mem      5000
    CPU_load      12
    ...           ...

Table 2.1: Example of a plug-in scheduler parametric table.

2.3.5.2 NetSolve

NetSolve uses a theoretical model [13] to estimate the performance of a machine, given its raw performance and CPU load. This model gives the estimated performance p as a function of the CPU load w, the raw performance P, and the number of processors n on the machine:

p = (P × n) / (w/100 + 1)

The hypothetical best machine is the one yielding the smallest execution time T for a given problem. NetSolve incorporates on-line scheduling heuristics that attempt to optimize service turnaround time. A server can service simultaneously as many requests as the operating system allows it to fork processes, but the administrator can specify the maximum number of requests the server is willing to service at one time; requests received once this limit is met will be rejected. Similarly, a server can specify its CPU load threshold, and requests received after this threshold is attained will be refused.
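The model and the resulting server choice can be illustrated directly. The machine figures below are made up for the example; only the formula itself comes from the text above.

```python
# Direct implementation of the NetSolve performance model above:
# p = (P * n) / (w/100 + 1). The machine figures are made-up examples.

def estimated_performance(P, n, w):
    """Estimated performance p from raw performance P (per processor),
    processor count n, and CPU load w (in percent)."""
    return (P * n) / (w / 100.0 + 1.0)

machines = {
    "m1": {"P": 100.0, "n": 1, "w": 0.0},    # idle single processor
    "m2": {"P": 100.0, "n": 2, "w": 150.0},  # heavily loaded dual processor
}

# For a fixed amount of work, the smallest execution time corresponds to
# the highest estimated performance, so the best machine maximizes p.
best = max(machines, key=lambda m: estimated_performance(**machines[m]))
print(best)  # m1 (p = 100.0, versus 200/2.5 = 80.0 for m2)
```

Note how the load term w/100 + 1 halves the effective performance of an otherwise faster dual-processor machine once its load exceeds 100%.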

2.3.5.3 Nimrod

Nimrod-G provides a persistent and programmable task-farming engine (TFE) that enables the "plugging in" of user-defined schedulers and customized applications or problem-solving environments (e.g., ActiveSheets) in place of default components. The TFE is a coordination point for processes performing resource trading, scheduling, data and executable staging, remote execution, and result collation. The Nimrod-G system automates the allocation of resources and application scheduling on the grid using economic principles, in order to provide some measurable quality of service to the end user.


Figure 2.15: Demand-based scheduling [55].

2.3.5.4 Ninf

Ninf (like NetSolve) makes scheduling decisions based on the dependencies of the input data: any group of remote calls is analyzed for dependency relationships, and those that may be executed without waiting for another call to finish are immediately sent off to a remote resource.

Although a programmer can define blocks of related function calls, the actual selection of a remote resource and the migration of data are handled by the Ninf Metaserver (like the NetSolve agent). The user provides a problem description to the Metaserver, which tries to match the correct resources as closely as possible to the submitted job, based on a database of resource loads and performance.
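The dependency-driven dispatch described above can be sketched as a ready-set computation. The call names and the dependency encoding are illustrative assumptions, not Ninf's actual representation.

```python
# Sketch of Ninf-style dependency analysis: among a group of remote calls,
# those whose inputs do not depend on another call's output can be sent to
# remote resources immediately. Call names and dependencies are examples.

calls = {
    # call name -> set of calls whose results it needs as input
    "mult1": set(),
    "mult2": set(),
    "add":   {"mult1", "mult2"},   # must wait for both multiplications
}

def ready_calls(calls, finished):
    """Calls that may be executed without waiting for another call."""
    return {name for name, deps in calls.items()
            if name not in finished and deps <= finished}

finished = set()
first_wave = ready_calls(calls, finished)   # {'mult1', 'mult2'}: sent off now
finished |= first_wave
second_wave = ready_calls(calls, finished)  # {'add'}: dispatched afterwards
print(first_wave, second_wave)
```

Here the two independent multiplications are dispatched concurrently, and the dependent addition only after both results are available.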

2.3.5.5 PUNCH

Client, management, and execution units are the three main units of the PUNCH system. Each unit has a specific task to accomplish, and the management unit acts as a demand-based scheduling engine (Figure 2.15) for the respective software and hardware resources. The management unit analyzes the requirements of the user requests and then matches them with related resources. Here the management unit acts as a matchmaker, mapping the requests to the appropriate resources.

2.3.5.6 WebCom

The scheduling module decides when tasks are to be run once they are ready to execute. Its functionality is fairly trivial, because as soon as tasks are ready they are executed. WebCom is a condensed-graph-based model providing many scheduling mechanisms (namely critical path analysis, speculative computation, reduced priorities, and others).

The load balancing module [69] is one of the core modules of WebCom; it is plugged into the Backplane. This module decides where tasks are to be executed.


It works using the round-robin method to distribute tasks to the clients. When a suitable client is known, this module requests the Backplane to send the instruction to the selected client; if the selected client is not found, it re-schedules the task for local execution. WebCom requires all variants of the load balancing module to communicate directly with the Backplane, and they depend on the status information of the network provided by the communication module. Load balancing modules developed by users may employ different strategies with respect to security, resource access, resource reliability, and others.
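The round-robin distribution with local fallback can be sketched as follows. This is a simplified stand-in for the module of [69]; the client names and reachability set are invented for the example.

```python
# Sketch of WebCom's load balancing behaviour described above: tasks are
# handed to clients in round-robin order, and a task whose selected client
# is no longer reachable is re-scheduled for local execution. A simplified
# stand-in for the module of [69]; all names are invented.

import itertools

clients = ["clientA", "clientB", "clientC"]
reachable = {"clientA", "clientC"}          # clientB has disappeared
rr = itertools.cycle(clients)               # round-robin iterator

def place_task(task):
    """Return where `task` runs: the next round-robin client if it is
    reachable, otherwise fall back to local execution."""
    client = next(rr)
    if client in reachable:
        return (task, client)               # Backplane sends the instruction
    return (task, "local")                  # re-scheduled locally

placements = [place_task(t) for t in ["t1", "t2", "t3"]]
print(placements)
# [('t1', 'clientA'), ('t2', 'local'), ('t3', 'clientC')]
```

The local fallback is what lets a WebCom node make progress even when the network reports no usable clients.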

Analysis
Most of the systems studied, like NetSolve, Ninf, and PUNCH, have a fixed scheduling scheme. Others, like DIET, Nimrod, and WebCom, allow plug-in scheduling, where users can add scheduling algorithms according to their requirements. All of the surveyed systems except WebCom employ a static load balancing scheme.

A plug-in scheduler is a good option to facilitate the implementation of user-specified schedulers. Plug-in scheduling facilitates application-specific definitions of appropriate performance metrics, enables an extensible measurement system, and eases the tuning of comparison/aggregation routines for scheduling.

2.3.6 Fault tolerance

Fault tolerance is a significant and complex issue in grid computing systems. As grid resources are gathered over WLANs or even through the Internet, any resource can leave or join the grid at any time. If a resource used by the NES environment crashes, a different level of computational loss occurs depending on the type of NES component launched on that resource. The following sections state the fault tolerance mechanisms used by the surveyed NES environments and the after-effects of each component crash.

2.3.6.1 DIET

Fault tolerance in DIET16 is implemented using the Chandra, Toueg, and Aguilera failure detector [34, 5]. This failure detector observes the network parameters and reconfigures itself according to changing network delay or message loss.

As mentioned earlier, DIET uses a decentralized hierarchical infrastructure; thus there is no centralized fault detector: each client is in charge of observing the servers running its own RPC calls. DIET fault tolerance is part of the DIET library, included in both the client and the server parts.

To save the large amount of computation time lost during a SeD crash, a checkpointing mechanism is used. Two types of checkpointing are available in DIET: automatic and service-based. In service-based checkpointing, the service includes its own checkpoint mechanism; it just has to register some files to be included in the DIET checkpoint and notify DIET when these files are ready to be stored. In automatic checkpointing, the service is linked with a checkpoint library (such as the Condor Standalone Checkpoint Library); the SeD then automatically checkpoints the service periodically. Checkpoint data are stored on computing resources using JuxMem [9], which uses data replication to ensure data persistence across failures.

16http://graal.ens-lyon.fr/DIET/fault.html

When a client detects a failure, it obtains a new compatible SeD from the MA and issues a restart command on this SeD. The SeD then retrieves the checkpoint data from persistent data storage and restarts.

To cover faults at the agent level in DIET, each agent keeps a list of its nearest ancestors in the hierarchy. If the parent of an agent fails, the agent tries to reconnect to the nearest alive ancestor. This algorithm ensures that every alive agent remains connected to the hierarchy.
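The ancestor-list reconnection can be sketched as follows. The hierarchy and the liveness data are invented for the example.

```python
# Sketch of DIET's agent-level recovery: each agent keeps its ancestors in
# order of proximity and, when its parent fails, reconnects to the nearest
# alive ancestor, keeping every alive agent attached to the hierarchy.
# The hierarchy and liveness data are invented for the example.

ancestors = {
    # agent -> ancestors ordered from its parent up to the MA
    "LA3": ["LA2", "LA1", "MA"],
}
alive = {"MA", "LA1", "LA3"}   # LA2 (LA3's parent) has crashed

def reconnect(agent):
    """Return the nearest alive ancestor to use as the new parent."""
    for candidate in ancestors[agent]:
        if candidate in alive:
            return candidate
    raise RuntimeError("no alive ancestor: hierarchy root lost")

print(reconnect("LA3"))  # LA1, the nearest alive ancestor
```

Walking the list in proximity order guarantees that an agent skips exactly the failed ancestors and no more, so the hierarchy contracts minimally around a crash.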

2.3.6.2 NetSolve

The NetSolve system ensures that a user request will be completed unless every single resource capable of servicing the request has failed [32]. When a client sends a request to a NetSolve agent, it receives a sorted list of computational servers to try. When one of these servers has been successfully contacted, the numerical computation starts. If this server fails during the computation, another server is contacted and the computation restarts. This whole process is transparent to the user. Each time a computational server malfunction (server unreachable, server stopped, failure during computation, etc.) is detected by a client, the client notifies the agent of the failure. The agent updates its tables and takes the necessary measures. If all the servers fail, the user is notified that the computation cannot be performed at that time. In NetSolve, if the agent crashes, the NetSolve system cannot be used.
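The client-side behaviour can be sketched as a simple retry loop over the agent's sorted list. The server names and the failing set are invented for the example.

```python
# Sketch of NetSolve's client-side fault tolerance: the client walks the
# agent's sorted server list, restarting the computation on the next server
# whenever one fails, and reports each failure back to the agent. Server
# names and the failing set are invented for the example.

failed_servers = {"fast1"}            # these will malfunction when tried
notified = []                         # failures reported to the agent

def run_request(sorted_servers):
    """Try servers in the agent's order; return the one that succeeds."""
    for server in sorted_servers:
        if server in failed_servers:
            notified.append(server)   # the client notifies the agent
            continue                  # the computation restarts elsewhere
        return server                 # the computation completed here
    raise RuntimeError("every capable resource has failed")

winner = run_request(["fast1", "fast2", "slow1"])
print(winner, notified)  # fast2 ['fast1']
```

Because the loop exhausts the whole list before giving up, the request only fails when every capable server has failed, exactly the guarantee stated above.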

2.3.6.3 Nimrod

The PE takes the experiment plan as input, described in a declarative parametric modeling language (the plan can also be created using the Cluster GUI), and manages the experiment under the direction of the schedule adviser. It then informs the dispatcher to map an application task to the selected resource. The PE maintains the state of the whole experiment and ensures that this state is recorded in persistent storage. This allows the experiment to be restarted if the node running Nimrod goes down [24]. A PE crash leads to unavailability of the Nimrod environment.

2.3.6.4 Ninf

When a server fails to complete a task, the task is simply rescheduled to a new resource. Ninf integrates with the Condor system for checkpointing, to enable fault tolerance for computations.


2.3.6.5 PUNCH

The PUNCH system's front end has one logical account per user on any given management unit. The management unit contains logical accounts for each authorized PUNCH user who has access to the front end. PUNCH can handle single points of failure with the help of management units, via scalability: fault tolerance is achieved by distributing the software resources among an appropriate number of management units. If a management unit crashes, requests cannot be executed by PUNCH.

2.3.6.6 WebCom

The fault tolerance module [69] is another core module of WebCom plugged into the Backplane. It employs numerous algorithms to checkpoint and re-schedule a task in case of failure or unavailability of the client/machine performing it. Whenever a failure occurs in the execution hierarchy of WebCom, the fault tolerance module kicks in and reschedules the task to a different client/machine with the help of the communication module.

Analysis
NetSolve uses a fault tolerance mechanism based on timers. DIET and the web-based systems employ a checkpoint-and-restart mechanism. The WebCom system includes a fault survival mechanism that enables it to re-incorporate results from the failed execution into the recovered process.

Fault tolerance is an important factor that should be considered by all NES environments: NES environments are used to run applications that require large computation times or large storage space, so if any disturbance occurs while executing an application, a large amount of time and resources is wasted.

2.3.7 Data management

Data management is an important component of grid computing because of the differing needs of applications: some applications are data-intensive and some are CPU-intensive, and there may further be data dependencies between the components of a system. In any of these cases, data or applications have to be moved to the respective locations to avoid bandwidth overhead.

2.3.7.1 DIET

Data management is provided to allow persistent data to stay within the system for future re-use. This feature avoids unnecessary communication when dependencies exist between different requests.

DIET has two approaches to data management, based on the integration of the Data Tree Manager (DTM) [42] and of Juxtaposed Memory (JuxMem) [8] with DIET.

DIET components have their own DTM. Initially, the data for a task are provided to the server. When the task is completed, the resulting data can be kept in the environment, thus implementing data persistence. Depending on the client's request, data results can be sent back to the client, kept in the server's DTM, or both. In the second case, a reference to the data is forwarded to the client. When a DIET server wants to access some data, it uses the data reference given by the client to request the data from the DTM.

With the JuxMem module, when a task is executed its data are sent to JuxMem and a data reference is sent to the client. The client or server can then directly access JuxMem to pull the required data using this reference.

DTM is integrated in DIET and is thus easy to use; JuxMem is based on P2P technology and is thus more reliable.

2.3.7.2 NetSolve

The NetSolve approach provides two services to manage data inside the platform: the Distributed Storage Infrastructure (DSI) and the Request Sequencing Infrastructure (RSI).

DSI is used to provide data transfer and storage; RSI is used to decrease network traffic between clients and servers [42]. Data references are managed inside RSI, or by file descriptors in the DSI solution. In NetSolve's request sequencing approach, the sequence of computations needs to be processed by a unique server. In this case, a client has to know the services provided by a server in order to use this approach.

Data is removed from the server's DSI depot after the request computation, to handle data persistency. To reduce the data transfer time, the DSI depot should be near the computational servers.

2.3.7.3 Nimrod

The Nimrod database contains routing information that is constructed, accessed, and acted upon by the routing functions. An endpoint association should be stored close to the endpoint, so that other entities in the vicinity can quickly locate the endpoint. Endpoint associations should also be stored at authoritative sources, so that distant entities can find the endpoint independently of its current location. Node representatives collect maps from component nodes and build detailed maps from this information. Abstract maps are built by abstraction of the service information in the detailed maps or by configuration of measurements of service information. Map updates are distributed to all node representatives and all route agents of the enclosing node. Route agents use maps to generate routes and may request lower-level maps based upon those received. Routes are expressed as sequences of node locators, together with the labels of the node service attributes used in route selection. Routes must include at least the source and destination node locators.

2.3.7.4 Ninf

Ninf uses Condor to avail itself of some of the facilities Condor provides. Ninf based on Condor is called Ninf-C, and Ninf-C uses Condor's data management techniques. In Condor, data placement is handled by Stork, a Data Placement Scheduler (DaPS).

Stork [60] maintains a library of pluggable "data placement" modules. These modules are executed by data placement job requests coming to Stork. They can perform inter-protocol translations, either using a memory buffer or third-party transfers whenever available. In order to transfer data between systems for which direct inter-protocol translation is not supported, two consecutive Stork jobs can be used instead. The first Stork job performs the transfer from the source storage system to Stork's local disk cache, and the second Stork job performs the transfer from the local disk cache to the destination storage system.
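The two-job chaining for unsupported protocol pairs can be sketched as follows. The protocol names and the translation table are invented for the example; only the "split through the local cache" rule comes from the description above.

```python
# Sketch of Stork's two-job chaining described above: when no module can
# translate directly between two protocols, the transfer is split into two
# consecutive jobs through Stork's local disk cache. The protocol pairs in
# the table are invented for the example.

direct = {("gridftp", "local"), ("local", "gridftp"),
          ("http", "local"), ("local", "http")}

def plan_transfer(src_proto, dst_proto):
    """Return the list of Stork jobs needed for the transfer."""
    if (src_proto, dst_proto) in direct:
        return [(src_proto, dst_proto)]          # one job suffices
    # no direct inter-protocol translation: go through the local cache
    return [(src_proto, "local"), ("local", dst_proto)]

print(plan_transfer("gridftp", "local"))   # [('gridftp', 'local')]
print(plan_transfer("http", "gridftp"))    # two consecutive Stork jobs
```

Routing every unsupported pair through the local cache means only O(n) protocol-to-local modules are needed instead of O(n²) pairwise translators.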

2.3.7.5 WebCom

Data management in WebCom is simple and dynamic. The kernel of the WebCom system is the Condensed Graph model. Condensed Graphs unify lazy, imperative, and data-driven evaluation of job execution. The evaluation used depends on the quality-of-service requirement: whether job execution has to be completed lazily (lazy computation), in a systematic manner (like a workflow), or on an on-demand, user-interactive basis (the user has to intervene at some point to give input or analyze the results). According to the required quality, the data flow is matched with the dynamic evaluation techniques. WebCom has a dynamic way of switching between evaluation techniques on a need basis.

Analysis
Some NES environments that need data management have their own data management models; for example, NetSolve has DSI. Others use third-party data management: Ninf uses Condor's data management techniques. Data management in WebCom is performed explicitly by the user in the form of workflow graphs.

2.3.8 Security

Applications are run within NES environments (on commodity resources and across multiple administrative domains) for users/clients who are not identified in person. Security is a must, both for the client code that runs in an NES environment and for the owner of the resource on which the code is running: the provider must be protected against tampering with the client's code, and the client must be able to trust that valid code runs on the provider's machine. Security is one of the main concerns in distributed systems.

2.3.8.1 DIET

DIET does not have any built-in security. DIET uses VPNs (Virtual Private Networks) and CORBA (Common Object Request Broker Architecture) for secure connections between its client and server applications.


2.3.8.2 NeOS

Security in NeOS is not built in. VPN security is used by NeOS for secure connections between the submitted application and the server.

2.3.8.3 NetSolve

Security in NetSolve17 is enabled via Kerberos support. NetSolve servers support two types of installation: Kerberized and non-Kerberized. In both cases clients send requests to the server; with a non-Kerberized installation the server returns a status code indicating acceptance, while a Kerberized installation returns an authentication error in response to the request. The clients then have to send their Kerberos credentials to the server before they proceed: this is how a NetSolve server authenticates its clients. The Kerberized server maintains a list of Kerberos principal names to implement access control, and the validity of the clients connecting to the server is checked against this access list. There is no encryption or integrity protection of the data stream exchanged between the components.

2.3.8.4 Nimrod

Nimrod does not have any built-in security. Nimrod-G, which is Globus-enabled, uses the security provided by Globus (the Grid Security Infrastructure).

2.3.8.5 Ninf

The Ninf system does not have any built-in security mechanisms. Ninf-G uses the security provided by the Grid Security Infrastructure (GSI).

2.3.8.6 PUNCH

There is no implicit security built into the PUNCH system, but the front end used to interface with the PUNCH environment provides the required security, as the PUNCH environment and its tools can only be accessed through WWW-enabled browsers. Users are not allowed to change, install, or run code in this environment; they are only allowed to use the tools provided by the PUNCH environment to do their experimental research.

2.3.8.7 WebCom

WebCom's distributed computing architecture is used to distribute application components for execution. WebCom uses KeyNote-based authorization credentials [47] to assert whether a WebCom server is authorized to schedule, and whether a WebCom client is authorized to execute, distributed application components. Secure WebCom provides a meta-language for bringing together the components of a distributed application in such a way that the components need not concern themselves with security issues. WebCom also supports Dynamic Administrative Coalitions (DAC) [46], wherein users holding administrative powers delegate administrative tasks constrained by some set of rules. This mechanism is used in the integration of security and workflow in a decentralized administrative architecture to provide fault survival against failures. There is no encryption or integrity protection of the data stream exchanged between the components.

17http://icl.cs.utk.edu/netsolvedev/documents/ug/html/additions.html

Analysis
Only WebCom has a built-in security system, whereas the others use third-party security, such as that provided by VPNs. Security for some of the environments is provided by GSI.

2.3.9 Deployment and visualization tools

It is difficult to see the status of a working production grid system without a customized monitoring system. The status includes many details of the system, such as running information, performance changes, system/software failures, security issues, and so forth. Most grid middleware provides a simple monitoring tool for its system, or simple tools to check the status of the system.

2.3.9.1 DIET

A deployment of DIET is made using the GoDIET tool [85]. GoDIET provides automatic configuration, staging, execution, and management of a DIET platform. An XML file describing the hierarchy and its requirements is used as input to GoDIET. For each component to be launched, a configuration file is written on the local disk, including the parent agent, naming service location, hostname, and/or port endpoint. The configuration file is staged to the remote disk (scp) and the remote command is launched (ssh). GoDIET18 is written in Java.
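The per-component launch sequence can be sketched as a dry run. The configuration field names, file paths, and command forms are modeled on the description above, not on GoDIET's actual Java implementation; only the plan is built here, nothing is executed.

```python
# Dry-run sketch of the launch sequence described above: for each component,
# write a configuration file, stage it to the remote host (scp), and launch
# the remote command (ssh). Field names, paths, and command forms are
# hypothetical, modeled on the description rather than on GoDIET itself.

def launch_plan(component):
    """Return (config text, scp command, ssh command) for one component."""
    config = (f"parent = {component['parent']}\n"
              f"naming_service = {component['naming_service']}\n"
              f"endpoint = {component['host']}:{component['port']}\n")
    local = f"/tmp/{component['name']}.cfg"
    remote_path = f"/tmp/{component['name']}.cfg"
    scp_cmd = ["scp", local, f"{component['host']}:{remote_path}"]
    ssh_cmd = ["ssh", component["host"], component["binary"], remote_path]
    return config, scp_cmd, ssh_cmd

cfg, scp_cmd, ssh_cmd = launch_plan({
    "name": "LA1", "parent": "MA", "naming_service": "host0:2809",
    "host": "node3", "port": 5000, "binary": "dietAgent",
})
print(scp_cmd)  # ['scp', '/tmp/LA1.cfg', 'node3:/tmp/LA1.cfg']
```

In a real deployment the two command lists would be handed to a process launcher; keeping plan construction separate from execution also makes the staging logic easy to test.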

VizDIET [20] is a tool that provides a graphical view of a DIET deployment and detailed statistical analysis of a variety of platform characteristics, such as the performance of request scheduling and servicing. VizDIET provides good scalability for the DIET platform. VizDIET shows the communication between agents, the state of each SeD, and the available services; the CPU, memory, and network load can also be seen for each node of the DIET platform. To provide real-time analysis and monitoring of a running DIET platform, VizDIET can register as a listener to LogService and thus receives all platform updates as log messages sent via CORBA. Alternatively, to perform visualization and processing post-mortem, VizDIET uses a static log message file that is generated at run time by LogService and set aside for later analysis. Figure 2.16 presents a screenshot of VizDIET.

18. http://graal.ens-lyon.fr/DIET/godiet.html
19. http://graal.ens-lyon.fr/DIET/vizdiet.html



Figure 2.16: Screenshot of VizDIET.

2.3.9.2 NetSolve

NetSolve has a visualization tool called VisPerf, written in Java. This tool is used to display the status of the resources interacting with the NetSolve system and to monitor the activity of the NetSolve grid. An overview of VisPerf is shown in Figure 2.17.

2.3.9.3 WebCom

The WebCom Integrated Development Environment (IDE) is both a deployment and a visualization tool. It is (1) a dynamic visualization tool to view the services hosted by the resources on which WebCom is installed, (2) an application development tool, and (3) an application launcher and targeting tool. The Menu Bar (Figure 2.18) is used for graph opening, editing, and running operations. The Palette hosts a list of the active services and libraries of the WebCom environment; these can be incorporated by simply dragging and dropping them onto the canvas to build a graph application. The Graph Canvas/Drawing Area is where graph applications are built on the fly. A Properties Panel hosts properties pertaining to the graphs, nodes, and edges present on the canvas.

Analysis: Not all of the surveyed NES environments have their own deployment and/or visualization tools. Only WebCom has both, using its IDE for deployment as well as visualization. NetSolve has only a visualization tool. The other systems surveyed have no visualization tools.

Figure 2.17: Overview of VisPerf Monitoring System [63]

2.4 Comparison of systems

Table 2.2 shows a summary of the environments surveyed and their associated characteristics. This summary can be used to identify the major driving forces behind the development of each system.

Only two of the surveyed environments provide deployment tools that facilitate the user's work of deploying the environment. Three of the environments surveyed are capable of dynamically visualizing task execution; in the other environments, tasks are processed in a "fire and forget" manner. Web service support is available in all environments except one.

It can be seen that five of the environments surveyed have a scheduling mechanism; one of the remaining environments deals with optimization, so scheduling is not that important a factor for it. Dynamic load balancing has not received much attention from the environments: static load balancing is provided by five environments, but dynamic load balancing is provided by only one of the surveyed environments.

Figure 2.18: WebCom Integrated Development Environment

Most of the environments use a basic timer mechanism for fault tolerance. Checkpointing is available in only one of the surveyed environments, which thus has in-built fault tolerance. Some of the remaining environments use the fault tolerance mechanisms provided by third parties such as Globus or Condor.

It can be seen, for example, that security has not been a major consideration. Only one of the seven environments surveyed has an integrated security mechanism; the others, however, do make use of the security features of their third-party component subsystems.

Three environments can potentially exploit the integration of loosely coupled resources. Of these, two require the use of the Globus middleware; the others, in contrast, are capable of acting as stand-alone middleware.

The environments that lack security or fault tolerance can still be attractive to users interested in easy deployment and simple task execution, because the added mechanisms make an environment heavier and somewhat more complicated for end users. Users can therefore select the appropriate NES environment according to their different needs.

2.5 Conclusion

We surveyed the most prevalent NES environments, so as to obtain in-depth knowledge of them. We have presented seven NES environments (DIET, NeOS, NetSolve, Nimrod, Ninf, PUNCH, and WebCom) under various headings, enabling an objective comparison to be made. By objectively comparing these systems, we attempt to enable potential NES users to choose the environment that best suits their needs. In addition to these NES environments, we also presented some other popular grid middleware and grid systems. These systems work in the same space as NES environments but differ from them: some are based on the cycle-stealing concept, unlike NES environments, which use dedicated servers; other systems that do use dedicated servers do not provide the scheduling, job management, and queueing mechanisms that NES environments do.

We have selected DIET to validate our deployment planning because it has a hierarchical structure, which provides more scalability to the system; a hierarchy is a simple and effective distribution approach that has been chosen by a variety of middleware environments as their primary distribution approach [28, 38, 51, 75].

DIET components are launched statically, meaning that once a component is launched on a selected resource, the functionality of the launched component cannot be changed, as can happen in WebCom. Another reason to select DIET is its easy-to-use deployment tool (GoDIET), which automatically performs various tasks, such as generating the configuration files for DIET components and performing remote clean-up.



| Characteristic | DIET | NeOS | NetSolve | Nimrod | Ninf-G | PUNCH | WebCom-G |
| Organization | LIP, INRIA, ENS Lyon, France | Argonne National Laboratory / MCS | University of Tennessee, Knoxville | Monash University | Collaborative work | Purdue University | Centre for Unified Computing |
| Version | DIET v1.1 | NeOS v5 | NetSolve v2.0 | Nimrod-G v3.0 | Ninf-G 4.0.0 | PUNCH v1.2 | Anyware, WebCom-G, WebCom (v0.1) |
| OS support | Unix/Linux, Mac OS X | Win 9x/XP, Unix/Linux, Mac OS, Solaris | Win XP, Unix/Linux | Win XP, Unix/Linux | Win XP, Unix/Linux, Mac | Win 9x/XP, Unix/Linux, Mac OS, Solaris | Win 9x/XP, Unix/Linux, Mac OS, Solaris |
| Requirements | OmniORB, gcc | WWW browser | gcc | gcc | gcc, Java-enabled machine | WWW browser | Java-enabled machine |
| Written in | C, C++ | Tcl/Tk, Java, Kestrel | C | C | C, C++ | no particular language | Java |
| Scheduling | Yes | N/A | Yes | Yes | Yes | Yes | Yes |
| Fault tolerance | Yes | N/A | No | Yes | Yes, Condor | No | Yes (checkpointing) |
| Dynamic job migration | No | N/A | No | No | No | No | Yes |
| Security | Yes, VPN and CORBA | N/A | No | No | Yes, Globus | No | Yes, trust management and Keynote |
| Visualization | Yes, VizDIET | No | Yes, VisPerf | No | No | No | Yes, WebCom-IDE |
| Deployment | Yes, GoDIET | No | No | No | No | No | Yes, WebCom-IDE |
| WebService support | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Project link | graal.ens-lyon.fr/DIET | www-neos.mcs.anl.gov/neos | www.cs.utk.edu/netsolve | www.csse.monash.edu.au/∼davida/nimrod | ninf.apgrid.org/packages | punch.purdue.edu | www.cuc.ucc.ie |

Table 2.2: Table of Comparison.


Chapter 3

Deadline Scheduling with Priority of Tasks on Network Enabled Server Environments

In the previous chapter, the seven most common Network Enabled Server (NES) environments were presented, so as to give a good knowledge of the most widely used NES systems. As mentioned earlier, NES environments endeavor to provide users easy access to distributed resources by hiding the complicated job distribution and resource allocation strategies from the user. Thus, the two important factors that an NES should handle efficiently are (a) job distribution and (b) resource allocation. This chapter presents job distribution algorithms for an NES environment. The presented algorithms provide non-preemptive scheduling of sequential tasks on an NES environment.

3.1 Introduction

Currently, NES systems use the Minimum Completion Time (MCT) [64] on-line scheduling algorithm, whereby all jobs are scheduled immediately or refused. This approach can overload interactive servers in high-load conditions and does not allow adaptation of the schedule to task dependencies. In this chapter we consider an alternative model based on the deadline and priority of the tasks submitted to the NES. The deadline is the expected response time of a task's execution, used as a scheduling parameter. The priority of a task indicates how important the task is: the higher the task's priority, the more important the task. We assume that each task, when it arrives at the scheduler, has a given static deadline (possibly given by the client). If a task completes its execution before its fixed deadline, it meets the deadline; otherwise the task fails. Importance is first given to the task's priority, and then the task is allocated to a server that can meet the task's deadline. So if a newly arrived task has higher priority than tasks waiting for execution, the waiting tasks are shifted back in the queue. This may cause some already allocated tasks, waiting for their execution turn, to miss their deadline. We augment the benefits of the scheduling algorithms with fallback mechanisms (if a task cannot be executed, it is resubmitted to the system by the client) and load measurements (calculation of the current load on a server before assigning it a new task).

A deadline scheduling strategy for the multi-client multi-server case on a grid platform is given in [79]. The algorithm presented there aims at minimizing deadline misses. It is also augmented with load correction and fallback mechanisms to improve its performance. The first optimization takes load changes (due to previous scheduling decisions) into account as soon as possible. The fallback mechanism makes corrections to the schedule at the server level: if the server finds that a task will not meet its deadline due to prediction errors, the task is resubmitted to the system. Our work is complementary, as we present scheduling algorithms that take into account both the deadline and the priority of a task. We have also tested our algorithm with the fallback mechanism.

3.2 Scheduling algorithms for NES environments

The presented scheduling algorithms aim at scheduling sequential tasks on an NES environment. This kind of environment is usually composed of an agent receiving tasks and finding the most efficient server able to execute a given task on behalf of the client. Servers can have different computing power. The client submits a request to an agent of the NES, which is responsible for allocating the resources and scheduling the tasks for execution. The scheduling of tasks is done based on different criteria, and servers execute the scheduled tasks one after the other. The scheduling algorithms presented below are based on the servers' load and on each task's deadline and priority.

3.2.1 Client-server scheduler with load measurements

As servers vary in computing power, each task can have a different execution time on each server. Let TaSi be the execution time of task Ta on server Si. This time includes the time to send the data to the server, the task execution time, and the time to send the result back to the client:

TaSi = Wsend/Psend + Wrecv/Precv + WaSi/PSi

where Wsend is the size of the data transmitted from the client to the server, Wrecv is the size of the data transmitted from the server to the client, WaSi is the number of floating point operations of the task, Psend is the predicted network throughput from the client to the server, Precv is the predicted network throughput from the server to the client, and PSi is the server performance (in floating point operations per second).

Psend should be replaced by the network throughput value measured just before the task; this value can be provided by a forecasting tool such as FAST [43]. Precv is estimated using previous measurements of the network throughput from the server to the client. The CPU performance is also dynamic and depends on the other tasks running on the target processors. Thus, a forecasting tool can be used to provide a forecast of CPU performance, so as to take the actual CPU workload into account.
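As a concrete illustration, the estimate TaSi can be computed directly from the quantities it involves. The sketch below is ours, not part of a DIET/NES API; the function name and the chosen units (data sizes in MB, throughputs in MB/s, work in Gflop, server speed in Gflop/s) are illustrative assumptions.

```python
def estimated_time(w_send, p_send, w_recv, p_recv, w_task, p_server):
    """TaSi = Wsend/Psend + Wrecv/Precv + WaSi/PSi:
    input-data transfer + result transfer + computation time."""
    return w_send / p_send + w_recv / p_recv + w_task / p_server

# Example: 10 MB in at 5 MB/s, 2 MB out at 4 MB/s, 30 Gflop on a 3 Gflop/s server
t = estimated_time(10, 5, 2, 4, 30, 3)
print(t)  # 2 + 0.5 + 10 = 12.5 seconds
```

In practice the three denominators would come from a forecasting tool rather than being constants, as the text explains.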

Algorithm 3.1 Straightforward algorithm: client-server scheduler with load measurements.

1: List = NULL
2: repeat
3:   for all servers Si do
4:     if can_do(Si, Ta) then
5:       TaSi = Wsend/Psend + Wrecv/Precv + WaSi/PSi
6:       List = sort_insert(List, TaSi, Ta, Si)
7:   num_submit = task_ack(List, 2 × Wsend/Psend)
8:   Tr = task_submit(List[num_submit])
9: until the end

Algorithm 3.1 gives a straightforward algorithm to obtain a sorted list of the servers that are able to compute the client's task. It assumes that the client takes the first available server, which is the most efficient, from the list. The actual execution time is denoted by Tr.

For the sake of simplicity, we define four functions for Algorithm 3.1:

can_do This function returns true if server Si has the resources required to compute task Ta. It takes into account the availability of memory and disk storage, the computational library, etc.

sort_insert This function sorts servers by efficiency. As input, it takes the current List of servers, the predicted time TaSi, the task name Ta, and the server name Si. Its output is the List of servers ordered with respect to TaSi.

task_ack This function sends the data and the computation task. To avoid a deadlock due to network problems, the function chooses the next server in the list if the time to send the data is greater than the time given as the second parameter. The output of this function is the server to which the data are sent (an index number into the array List).

task_submit This function performs the remote execution on the server List[num_submit].
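The core of Algorithm 3.1 — filter the capable servers, sort them by predicted time, take the fastest — can be sketched as follows. This is a minimal re-implementation under our own assumptions: `can_do` and the time estimate are supplied as callables, and the `task_ack`/`task_submit` transport steps are omitted.

```python
def schedule_task(task, servers, estimate, can_do):
    """One iteration of Algorithm 3.1: build the list of capable servers
    sorted by predicted completion time, then take the most efficient."""
    ranked = sorted(
        (s for s in servers if can_do(s, task)),
        key=lambda s: estimate(task, s),
    )
    if not ranked:
        return None  # no capable server: the task is refused
    return ranked[0]  # task_ack/task_submit would then target this server

# Toy run: three servers with different speeds, one unable to run the task
speeds = {"S1": 3.0, "S2": 1.0, "S3": 2.0}   # Gflop/s
best = schedule_task(
    task={"work": 30.0},                      # Gflop
    servers=speeds,
    estimate=lambda t, s: t["work"] / speeds[s],
    can_do=lambda s, t: s != "S3",            # pretend S3 lacks a library
)
print(best)  # "S1": 10 s beats S2's 30 s, and S3 is excluded by can_do
```

Keeping the whole ranked list, as the algorithm does, lets `task_ack` fall through to the next server on a network problem.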

3.2.2 Client-server scheduler with a forecast correction mechanism

The previous algorithm assumes that the forecasting tool always returns an accurate forecast. But in the real world there is always a gap, possibly very small, between the actual execution time and the time estimated by the forecasting tool. In this section, we present an algorithm that corrects the prediction of the forecasting tool by taking into account the gap between the performance prediction and the actual execution time of each task.

Algorithm 3.2 Scheduling algorithm with forecast correction mechanism.

1: CorrecPredic = 100
2: nb_exec = 0
3: for all servers Si do
4:   if can_do(Si, Ta) then
5:     TaSi = Wsend/Psend + Wrecv/Precv + WaSi/PSi
6:     TaSi = TaSi × CorrecPredic / 100
7:     List = sort_insert(List, TaSi, Ta, Si)
8: num_submit = task_ack(List, 2 × Wsend/Psend)
9: Tr = task_submit(List[num_submit])
10: CorrecPredic = (nb_exec × CorrecPredic + 100 × Tr/TaSi) / (nb_exec + 1)
11: nb_exec++

The function task_submit is upgraded to return the time of the remote execution. Thus, we can modify the next predictions from the monitoring system as follows to obtain the corrected load values:

TaSi = TaSi × CorrecPredic / 100

where CorrecPredic is the average error between the predicted time and the actual execution time. This value is updated at each execution as follows:

CorrecPredic = (nb_exec × CorrecPredic + 100 × Tr/TaSi) / (nb_exec + 1)

where Tr is the actual execution time, TaSi the predicted time, and nb_exec a counter of the number of executions of the algorithm.
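The correction mechanism can be exercised in isolation. The sketch below (class and method names are ours, for illustration) applies the two update rules above: predictions are scaled by CorrecPredic/100, and CorrecPredic is maintained as the running average of 100 × Tr/TaSi over executions.

```python
class ForecastCorrector:
    """Keeps CorrecPredic, the running average (in %) of the ratio
    between actual and predicted execution times (cf. Algorithm 3.2)."""

    def __init__(self):
        self.correc_predic = 100.0  # start by trusting the forecast exactly
        self.nb_exec = 0

    def correct(self, predicted):
        # TaSi <- TaSi * CorrecPredic / 100
        return predicted * self.correc_predic / 100.0

    def update(self, predicted, actual):
        # CorrecPredic <- (nb_exec*CorrecPredic + 100*Tr/TaSi) / (nb_exec + 1)
        ratio = 100.0 * actual / predicted
        self.correc_predic = (
            self.nb_exec * self.correc_predic + ratio
        ) / (self.nb_exec + 1)
        self.nb_exec += 1

fc = ForecastCorrector()
fc.update(predicted=10.0, actual=12.0)   # forecast was 20% too optimistic
print(fc.correct(10.0))   # the next 10 s prediction is corrected to 12.0 s
fc.update(predicted=10.0, actual=8.0)    # now 20% too pessimistic
print(fc.correc_predic)   # running average of 120 and 80 -> 100.0
```

Note the averaging over nb_exec makes the correction increasingly stable: a single outlier execution moves CorrecPredic less and less as history accumulates.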

3.2.3 Client-server scheduler with a priority mechanism

Until now, client-server systems have considered either deadline scheduling or priority-based scheduling [7, 21, 74]. Here we give an algorithm that uses both criteria to select a server for a submitted task.

In Algorithm 3.3, tasks have a predefined priority and deadline. A task is placed in a server's task queue according to its priority; if the task can meet its deadline, it is sent to that server for execution. For Algorithm 3.3, we define some new variables: TDa is the deadline and TPa the priority of task Ta. It may happen that a newly arrived task has higher priority than tasks already in the server queue; the execution times of the lower-priority tasks then change, because tasks are ordered in the queue according to their priority. Figure 3.1 shows an example of this case. Thus, we consider one more variable, TFaSi, which denotes the changed execution time of task Ta on server Si after placing a new task in the server's task queue. To simplify the explanation of the algorithm, we define five functions:

Algorithm 3.3 Client-server scheduler with priority mechanism.

1: repeat
2:   for all servers Si do
3:     if can_do(Si, Ta) then
4:       TaSi = Wsend/Psend + Wrecv/Precv + WaSi/PSi
5:       if TaSi < TDa then
6:         CFTSi = count_fallback_tasks(Ta, TaSi, TPa, TDa)
7:   if TFaSi < TDa then
8:     best_server_name = best_server(CFTSi, TFaSi)
9:     Tr = task_submit(best_server_name, Ta)
10:  re-submission(task_name)
11: until the end

can_do This function returns true if server Si has the resources required to compute task Ta. It takes into account the availability of memory and disk storage, the computational library, etc.

count_fallback_tasks This function counts the fall-backed tasks (CFTSi) that would result if the new task were allocated to server Si. Tasks that cannot meet their deadline after the insertion of the new task are called fall-backed tasks. Task Ta is placed according to its priority TPa in the server's task queue, which may change the execution times of the other tasks in the queue.

best_server This function selects the best server (best_server_name) among the servers that can execute the task within its deadline. The best server is selected by comparing the numbers of fall-backed tasks: the server with fewer fall-backed tasks is selected for the task execution. If servers have the same number of fall-backed tasks, the times to compute the task are compared and the server that takes less time is selected.

task_submit This function performs the remote execution on the server given by best_server. The argument of the function is a single server, not a list of servers as in Algorithm 3.1.

re-submission This function resubmits the fall-backed tasks to the servers, recomputing their execution times. If a server can meet a task's deadline, the task is allocated to that server.



| Task | Priority | Deadline | Exec. time on S1 | on S2 | on S3 |
| 1 | 3 | 15 | 3 | 5 | 6 |
| 2 | 5 | 10 | 5 | 12 | 9 |
| 3 | 2 | 30 | 11 | 20 | 15 |
| 4 | 4 | 20 | 10 | np | 17 |
| 5 | 5 | 15 | 12 | 14 | np |

Table 3.1: Priority, deadline and computation time (on servers S1, S2, S3) of each task.

Figure 3.1: Example for the priority scheduling algorithm with fallback mechanism. Task id and execution time are written diagonally in each box.

Figure 3.1 shows an example to explain the behavior of Algorithm 3.3. Let us consider 3 servers with different capacities and 5 tasks. The priority, deadline, and computation time of each task on each server are shown in Table 3.1. The execution time is the time taken by the dedicated server to compute the task when the server is free. The computation value np denotes that the task cannot be executed on the server, which may be due to the type of the task, its memory requirement, etc. A task is allocated to a server after checking its execution time on each server, its priority, and its deadline.

Server S1 takes the minimum time to execute task T1, so T1 is allocated to server S1. Task T2 can also be executed fastest on server S1, so it is allocated to S1; but as its priority is higher than T1's, it shifts task T1 back, and the execution time of T1 changes to 8 units. If task T3 were placed on server S1, it would take less raw execution time, but due to its low priority its execution time would change to 19 units; so T3 is placed on server S3, where its execution time is 15 units. Task T4 is placed on server S1, but in doing so task T1 is fall-backed, because the execution time of T1 changes to 18 units, which crosses its deadline of 15 units. Task T1 is therefore resubmitted and allocated to server S2. Task T5 is then placed on server S2; as its priority is higher than T1's, the execution time of T1 changes and T1 is fall-backed again. It is resubmitted once more, and task T1 is finally placed on server S3.
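The walkthrough above can be checked mechanically. The sketch below is our own minimal re-implementation (not the thesis code) of the logic of Algorithm 3.3 replayed on the data of Table 3.1: priority-ordered queue insertion, rejection of servers that cannot meet the new task's deadline, selection by fewest fall-backed tasks with ties broken by completion time, and re-submission of fall-backed tasks.

```python
PRIO = {1: 3, 2: 5, 3: 2, 4: 4, 5: 5}
DEADLINE = {1: 15, 2: 10, 3: 30, 4: 20, 5: 15}
EXEC = {  # np (not executable) encoded as None
    "S1": {1: 3, 2: 5, 3: 11, 4: 10, 5: 12},
    "S2": {1: 5, 2: 12, 3: 20, 4: None, 5: 14},
    "S3": {1: 6, 2: 9, 3: 15, 4: 17, 5: None},
}

queues = {s: [] for s in EXEC}  # per-server task queues, priority order

def completions(server, queue):
    """Cumulative completion time of each task in the queue."""
    t, out = 0, {}
    for task in queue:
        t += EXEC[server][task]
        out[task] = t
    return out

def insert_by_priority(queue, task):
    i = 0
    while i < len(queue) and PRIO[queue[i]] >= PRIO[task]:
        i += 1  # FIFO among equal priorities
    return queue[:i] + [task] + queue[i:]

def place(task):
    """Pick the server with the fewest fall-backed tasks (ties: fastest)."""
    best = None
    for s in EXEC:
        if EXEC[s][task] is None:
            continue
        q = insert_by_priority(queues[s], task)
        done = completions(s, q)
        if done[task] > DEADLINE[task]:
            continue  # this server cannot meet the new task's deadline
        fallbacks = [t for t in q if done[t] > DEADLINE[t]]
        key = (len(fallbacks), done[task])
        if best is None or key < best[0]:
            best = (key, s, q, fallbacks)
    assert best is not None, f"task {task} cannot be scheduled"
    _, s, q, fallbacks = best
    queues[s] = [t for t in q if t not in fallbacks]
    for t in fallbacks:  # re-submission of fall-backed tasks
        place(t)
    return s

for task in [1, 2, 3, 4, 5]:
    place(task)

print(queues)  # -> {'S1': [2, 4], 'S2': [5], 'S3': [1, 3]}
```

Running it ends with T2 and T4 on S1, T5 on S2, and T1 and T3 on S3, matching the narrative above (T1 falls back twice before landing on S3).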



Figure 3.2: Priority-based tasks executed without the fallback mechanism.

3.3 Simulation results

To simulate the deadline algorithm with priority, and the impact of the fallback mechanism with this model, we use a simulation toolkit called SimGrid [30]. SimGrid provides a framework for setting up a simulation where decisions are taken by a single scheduling process [26].

For the experiments, we took 100 homogeneous servers to execute the submitted tasks. All servers are directly connected to an agent, which submits the tasks to the servers. Each task is associated with a priority and a deadline. We randomly generated the priority between 1 and 10 and set each task's deadline to 5 times the computation amount needed by the task on a dedicated server. This factor of 5 was chosen after testing with different values: if the deadline is very close to the computation time, the number of fallback tasks is very high, and if the difference between the computation time and the deadline is large, tasks can be executed without any limit.

Algorithm 3.3 was implemented for the experiments, as this algorithm considers both the task's priority and the deadline of the task's execution. Experiments were done both by fixing the priority of the tasks and by varying the priority depending on task size. Figure 3.2 shows that when tasks with the same priority are submitted, the number of executed tasks is lower than when tasks are executed with random priorities. When the number of tasks is small, the impact of task priority is negligible; but as the number of tasks increases, task priority plays an important role in increasing the number of tasks executed.

To check the impact of the fallback mechanism on the number of tasks executed under different criteria, Algorithm 3.3 was used, and the results are presented in Figure 3.3 and Figure 3.4. Tasks of different sizes and priorities were submitted to the system. A task's size is measured by the execution time it takes when executed on a dedicated server. Figure 3.3(a) shows that the fallback mechanism has little effect if the tasks have the same priority. But in Figure 3.3(b), it can be seen that when tasks with different priorities are submitted, the fallback mechanism increases the number of executed tasks. To see the effect of the fallback mechanism for different task sizes, experiments with different task sizes were performed; the results are shown in Figures 3.4(a) and 3.4(b). It can be seen that the fallback mechanism is very useful for any task size and has increased the total number of executed tasks.

3.4 Conclusions

In this chapter we have presented the first step toward using NES systems efficiently, by providing algorithms for scheduling sequential tasks on NES systems.

We have shown that the load correction and fallback mechanisms can increase the number of executed tasks. We have presented and implemented an algorithm that considers both the priority and the deadline of the tasks to select a server. We showed through simulation that the number of tasks that meet their deadlines can be increased by (1) considering tasks' priorities and (2) using a fallback mechanism to reschedule tasks that were not able to meet their deadline on the selected servers.

Having efficient algorithms to schedule tasks on an NES system is an auspicious beginning toward obtaining the best throughput from NES systems. However, resource allocation for the NES's components is also a very important factor influencing the performance of any NES system. In the next chapter (Chapter 4) we present an overview of some existing tools that are used for resource allocation.



(a) Tasks without priority

(b) Task priorities vary between 1 and 10

Figure 3.3: Comparison of tasks executed with and without fallback, based on task priority.



(a) Tasks with execution time less than 15 minutes on dedicated servers

(b) Tasks with execution time greater than 4 hours on dedicated servers

Figure 3.4: Comparison of tasks executed with and without fallback, based on task execution duration.


Chapter 4

Deployment Tools

As mentioned earlier, the two main factors in using an NES system efficiently are (a) job distribution and (b) resource allocation. In the previous chapter (Chapter 3) we presented scheduling algorithms to schedule sequential tasks according to each task's priority and deadline. These algorithms [87] have to be implemented on an NES system well deployed on grid resources. But end users may be biologists, mathematicians, astrophysicists, etc.; for them, selecting an appropriate resource for their application (resource selection), specifying the type and location of the selected resource (resource localization), and mapping their application files (executables, libraries, etc.) onto the selected resources is a very complex task. To help end users get the most from a computational grid, an automatic process for deploying applications/middleware (NES components) on the grid is required.

Due to the scale of grid platforms, as well as the geographical localization of resources, middleware approaches should be distributed to provide scalability and adaptability. Much work has focused on the design and implementation of distributed middleware. To benefit most from such approaches, an appropriate mapping of middleware components to the distributed resource environment is needed. However, while middleware designers often note that this problem of deployment planning is important, only a few algorithms exist [22, 39, 52, 56, 59] for efficient and automatic deployment.

In this chapter we present some of the existing tools that are used to deploy specific software/middleware on a set of available resources.

4.1 Introduction

A deployment is the mapping of a platform and middleware (applications/software) across many resources. Deployment provides reliability by allowing greater control of the distribution process, and it increases security by ensuring that only authorized personnel have access to the grid. A tool that deploys middleware (or an application) onto a grid or cluster according to a deployment plan is referred to as a deployment tool. Deployment tools require a deployment plan as input. A deployment plan describes which resources should be used and how they should be connected (directly, hierarchically, etc.) with each other. Deployment is done in phases, and the order of the deployment phases is not rigid: if any problem occurs, such as a selected resource becoming unreachable or a change arising from the deployment scheme to the deployment plan, some phases are repeated. In general, however, the order of the deployment phases is as described below.

Initially, the Resource discovery phase discovers the resources that are compatible with the middleware (application) requirements. The Deployment planning phase generates the deployment plan based on the middleware (application) description; it decides which resource will host which middleware component, how many resources are needed in total, and how these resources should be connected to each other. In the Resource selection phase, the resources that will host the middleware components (or the execution of the application) are selected according to the deployment plan generated in the planning phase. In the Remote files installation phase, the different parts of each component are mapped onto some of the selected resources. The Pre-configuration phase checks the files and other requirements that should be installed and configured before launching the application. The Launch phase executes the deployment tasks while respecting the precedences described in the deployment plan. The Post-configuration phase generally launches scripts for cleaning up and killing the processes that were launched.
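The phase ordering described above can be sketched as a simple driver. This is purely illustrative (the phase names follow the text, but the driver, its retry policy, and all function names are our own assumptions, not any real deployment tool's API); it captures the point that phases run in order but may be repeated when a problem occurs.

```python
# Phase names taken from the text; everything else is an illustrative sketch.
PHASES = [
    "resource discovery",
    "deployment planning",
    "resource selection",
    "remote files installation",
    "pre-configuration",
    "launch",
    "post-configuration",
]

def run_deployment(execute_phase, max_attempts=3):
    """Run the phases in order; since the order is not rigid and phases
    may be repeated on problems, retry a failing phase a few times."""
    for phase in PHASES:
        for _ in range(max_attempts):
            if execute_phase(phase):
                break  # phase succeeded, move on to the next one
        else:
            raise RuntimeError(f"deployment failed in phase: {phase}")
    return "deployed"

# Toy driver: the launch phase fails once (e.g., a resource was unreachable)
failures = {"launch": 1}
def execute_phase(phase):
    if failures.get(phase, 0) > 0:
        failures[phase] -= 1
        return False
    return True

print(run_deployment(execute_phase))  # "deployed"
```

A real tool would of course thread data between phases (the plan produced by planning feeds selection, and so on) rather than treating them as independent steps.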

Some deployment tools [10, 22] take a user-defined deployment plan as input. Selecting a good deployment plan manually, according to the middleware (application) and resource descriptions, is a complex and very time-consuming task; only an expert can do it efficiently in reasonable time. To facilitate the generation of the deployment plan, a deployment planning tool is required. So far, very few deployment planning tools exist, and those are specific to a particular application (or middleware). The input to a deployment planning tool is a description of the middleware components (or application) and a description of the available resources. The tool generates a deployment plan by selecting compute nodes and mapping the middleware components (or application processes) onto the selected nodes.

In this chapter, the two deployment categories, software deployment and system deployment, are presented along with some of the existing tools. A comparison of the tools is presented in the conclusion section.

4.2 Software deployment

Software deployment maps and distributes a collection of software components on a set of resources. Software deployment includes activities such as releasing, configuring, installing, updating, adapting, de-installing, and even de-releasing a software system. The complexity of these tasks is increasing as more sophisticated architectural models, such as systems of systems and coordinated distributed systems, become more common. Many tools have been developed for software deployment, based on different approaches: deployment using mobile agents (e.g., Software Dock [52], JADE [22], SmartFrog [50]), deployment using AI planning techniques (e.g., Pegasus [39, 41],


Sekitei [57]), or deployment based on software architecture (e.g., JDF). The deployment software presented in this chapter appears in alphabetical order: ADAGE, GoDIET, JADE, JDF, Pegasus, Sekitei.

4.2.1 Automatic Deployment of Applications in a Grid Environment

Automatic Deployment of Applications in a Grid Environment (ADAGE) [62] is a prototype middleware. It currently deploys only static applications on the resources of a computational grid. The deployed applications can be distributed applications (like CORBA component assemblies), parallel applications (like MPICH-G2), or a combination of the two.

The middleware requires two pieces of information: an application description and a resource description. These descriptions are represented as XML documents. The resource description includes compute nodes and their characteristics (operating system, architecture, storage space, memory size, CPU speed and number, etc.) as well as network information (topology, performance characteristics). Using these two pieces of information, ADAGE automatically selects the resources that will run the application and maps the application processes onto them. ADAGE uses one of two basic scheduling algorithms, round-robin or random. The ADAGE deployment planner selects a job submission method (ssh or Globus2) based on its availability and control parameters. Finally, it automatically launches the application processes remotely, using the Globus Toolkit (version 2) as grid access middleware, and initiates the application execution.
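The two basic placement strategies mentioned (round-robin and random) amount to the following. This is only a sketch of the strategies themselves, not ADAGE's implementation, and the process and node names are hypothetical.

```python
# Sketch of the two basic placement strategies (round-robin and random);
# the data structures here are illustrative, not ADAGE's.
import random

def round_robin(processes, nodes):
    """Map process i to node i mod len(nodes)."""
    return {p: nodes[i % len(nodes)] for i, p in enumerate(processes)}

def random_placement(processes, nodes, seed=None):
    """Map each process to a uniformly chosen node."""
    rng = random.Random(seed)
    return {p: rng.choice(nodes) for p in processes}
```

For example, round_robin(["p0", "p1", "p2"], ["n0", "n1"]) maps p0 and p2 to n0 and p1 to n1. Neither strategy takes node capacities or the application structure into account, which is why such planners are described below as very basic.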

Analysis: The working phases of ADAGE are presented in Figure 4.4. ADAGE seems a very interesting tool for the deployment of grid applications. It contains a deployment model that specifies how a particular component can be installed, configured, and launched on a machine. However, ADAGE has no intelligent deployment planning algorithm. Deployment planners can be implemented in ADAGE as plug-ins; currently, only two very basic planners are implemented, round-robin and random. For remote installation, ADAGE uses scp or GridFTP. Adage_deploy launches the processes using the specified remote process creation method (e.g., ssh, Globus) and configures the application.

4.2.2 GoDIET

GoDIET1 [85] is an automated approach for launching and managing hierarchies of DIET [28] agents and servers across computational grids. Key goals of GoDIET include portability, the ability to integrate GoDIET in a graphically-based user tool for DIET management, and the ability to communicate in CORBA with LogService [28]. GoDIET is implemented in Java, as Java satisfies all of these requirements and provides

1 http://graal.ens-lyon.fr/DIET/godiet.html


rapid prototyping. The input to GoDIET is an XML file describing the deployment plan used to deploy the DIET platform.

Figure 4.1: Working steps of GoDIET. The figure shows GoDIET taking as input an XML file (DIET's platform description), generating a configuration file for each DIET element, launching services (name service, logging services) and the DIET elements to deploy the DIET platform, and finally destroying the deployed platform with a remote cleanup of the launched processes. The client-side steps are: the client submits a request to the deployed platform, gets a reference to the selected server, submits its job to that server, and the server executes the job and sends the result back to the client.

GoDIET automatically generates configuration files for each DIET element while taking into account user configuration preferences and the hierarchy defined by the user, launches complementary services (such as a name service and logging services), provides an ordered launch of components based on the dependencies defined by the hierarchy, and provides remote cleanup of launched processes when the deployed platform is to be destroyed. Figure 4.1 shows the working steps of GoDIET and the interaction steps of a client with the deployed DIET platform.

Analysis: GoDIET automatically provides an ordered launch of DIET components based on the dependencies defined by the user in the DIET hierarchy. However, the user must have write permission on the machines used by GoDIET for the DIET platform deployment, because GoDIET writes a configuration file for each DIET component and then removes it in the clean phase. GoDIET can deploy the platform only if DIET is already installed on the selected machines. Moreover, the paths of the binaries of the DIET components


have to be mentioned in the input XML file submitted to GoDIET.

4.2.3 JXTA Distributed Framework

JXTA Distributed Framework (JDF) [10] is a framework for the automated testing of JXTA-based systems from a single node referred to as the control node. JDF provides a generic framework allowing users to define custom tests, deploy all the required resources on a distributed testbed, and run the tests with various configurations of the JXTA [11] platform.

To use JDF, each node should have Java (1.4.x), JXTA 2.2.1 (or earlier), a Bourne shell, and ssh (preferably OpenSSH supporting ssh v2) or rsh. As JDF depends on ssh and the Bourne shell, it currently runs only on Unix platforms.

JDF requires all information about the distribution beforehand. A network description file defines the requested JXTA-based network and the notion of node profiles. A file called the "node file" lists the physical nodes and the path of the JVM on each physical node. File transfers and remote control are handled using either ssh/scp or rsh/rcp. One node (the control node) has to be selected to run JDF from; all other physical nodes should be visible from the control node. The user submits the JXTA platform to be tested and the paths to the files in the form of an XML file to JDF. JDF is run through a regular shell script which launches a distributed test. This script executes a series of elementary steps: (a) install all the needed files, (b) initialize the JXTA network, (c) run the specified test, (d) collect the generated log and result files, (e) analyze the overall results and remove the intermediate files. All remaining JXTA processes can also be killed using a kill script.
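The elementary steps (a)-(e) can be pictured as the command plan such a control-node driver would issue. Only the ssh/scp usage mirrors the description above; the remote command names (start-jxta-network, run-test) are hypothetical stand-ins, not JDF's actual scripts.

```python
# Hypothetical sketch of the control-node step sequence (a)-(e); only the
# ssh/scp usage mirrors the text, the remote command names are invented.
def jdf_plan(nodes, files, test):
    plan = []
    for n in nodes:                                   # (a) install files
        plan.append(("install", f"scp {' '.join(files)} {n}:"))
    plan.append(("init", f"ssh {nodes[0]} start-jxta-network"))  # (b)
    for n in nodes:                                   # (c) run the test
        plan.append(("run", f"ssh {n} run-test {test}"))
    for n in nodes:                                   # (d) collect logs
        plan.append(("collect", f"scp {n}:logs/'*' results/"))
    plan.append(("analyze", "analyze results; remove intermediate files"))  # (e)
    return plan
```

Building the plan first, then executing it, also makes the dry-run inspection of a distributed test possible before anything touches the testbed.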

Analysis: JDF is a deployment tool specific to JXTA. It is a very basic deployment tool: it has no deployment planning method, and the resource selection XML file has to be given by the user. JDF uses scp for remote file installation and ssh for job launch on the deployed platform.

4.2.4 Pegasus

Pegasus [41] is a workflow mapping system that maps an abstract workflow description onto grid resources. Pegasus uses an AI-based planner (the Prodigy planner [78]) to perform the mapping. An abstract workflow is composed of tasks (application components) and their dependencies (reflecting the data dependencies in the application). The abstract workflow is specified in XML in the form of a DAX (DAG XML description).

Once Pegasus [40] gets an abstract workflow, it consults various grid information services. The Monitoring and Discovery Service (MDS) provides information about the number and type of available resources, static characteristics such as the number of processors, and dynamic characteristics such as the amount of available memory. The Replica Location Service (RLS) maintains and provides access to mappings between the logical names of data items and their target names. The Transformation Catalog (TC) determines the locations where the computations can be executed.


Figure 4.2: Components of a workflow generation, mapping and execution system. The figure shows workflow editors and Chimera producing an abstract workflow; Pegasus, consulting the RLS, TC, and MDS services, maps it to a concrete workflow; DAGMan then submits the resulting jobs to Condor-G pools, individual hosts, TeraGrid hosts, and clusters managed by LSF or PBS.

Pegasus combines the information from the TC with the MDS information to make decisions about resource assignment. Pegasus generates the concrete workflow in the form of files which are submitted to DAGMan [49] for execution. The submitted files indicate the operations to be performed on a given remote system and the order in which the operations need to be performed. DAGMan is responsible for enforcing the dependencies between the jobs defined in the concrete workflow. Jobs can be mapped and executed on a variety of platforms: Condor pools, clusters managed by Load Sharing Facility (LSF) 2 or Portable Batch System (PBS) [53], TeraGrid 3 hosts, and individual hosts. Figure 4.2 outlines the above explanation.
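The way the catalog information is combined can be illustrated with a toy mapping function. The dictionaries standing in for TC, RLS, and MDS below are invented for illustration and do not reflect the real service interfaces; only the roles of the three services mirror the description above.

```python
# Toy illustration of abstract -> concrete workflow mapping using the
# three information sources named above (TC, RLS, MDS); the data
# structures are invented, only the roles of the services are real.
def concretize(abstract_wf, tc, rls, mds):
    concrete = {}
    for task, meta in abstract_wf.items():
        # TC: sites where the transformation is installed,
        # filtered by MDS availability information.
        sites = [s for s in tc.get(meta["transform"], []) if mds.get(s)]
        if not sites:
            raise LookupError(f"no site can execute {task}")
        concrete[task] = {
            "site": sites[0],
            # RLS: logical file name -> physical location
            "inputs": [rls[f] for f in meta["inputs"]],
        }
    return concrete
```

A real mapper would also weigh dynamic load and data movement costs; here the first available site is taken purely for illustration.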

Analysis: Pegasus uses the Prodigy planner [78] to perform the mapping, and grid services such as RLS, TC, and MDS for resource selection, remote file installation, and pre-configuration. DAGMan performs the launch according to the submitted concrete workflow files. Pegasus seems interesting, but its efficient utilization depends on the grid information service providers.

4.2.5 Sekitei

Sekitei [57] is an AI planner for constrained component deployment in wide-area networks. Sekitei was developed to solve the Component Placement Problem (CPP). The CPP is concerned with finding a valid component deployment, i.e., a set of components, linkages between them, and a mapping of the resulting DAG onto the links and nodes of a wide-area network, so that client requirements are satisfied.

The CPP can be viewed as an AI planning problem with resource constraints. The state of the system is described by the availability of interfaces on nodes and the placement

2 http://www.epcc.ed.ac.uk/DIRECT/grid/node45.html
3 http://www.teragrid.org/


of components on nodes. The component framework expresses a CPP in XML. This CPP is translated into a planning problem using a compiler. The planner solves the planning problem and emits a plan. The plan is then decompiled into a deployment plan and fed back to the component framework, which uses this deployment plan to deploy the actual components in the network.

Figure 4.3: Process flow graph for solving CPP [58].

Analysis: Sekitei provides a deployment planner to solve the CPP; it produces a deployment plan from the CPP and the derived planning problem. The Sekitei algorithm constructs and checks many logically correct plans that may fail during symbolic execution due to resource restrictions. Sekitei cannot support incremental replanning in case of a change in resource availability, nor can it support decentralized planning.

4.3 System deployment

System deployment involves two steps, physical and logical. In physical deployment, all hardware is assembled (network, CPU, power supply, etc.), whereas logical deployment organizes and names the cluster nodes as master, slave, etc. As deploying systems can be a time-consuming and cumbersome task, tools such as the Dell Deployment Toolkit [1], Kadeploy [65] and Warewulf 4 have been developed to facilitate this process.

4.3.1 Dell OpenManage Deployment Toolkit

The Dell OpenManage Deployment Toolkit (DTK) performs automated and scripted deployments of Dell PowerEdge servers. DTK comprises a set of MS-DOS utilities, batch scripts, and configuration files.

DTK's utilities can be used as stand-alone tools for configuring individual components such as the BIOS, or they can be integrated into scripts for one-to-many mass system

4 http://www.warewulf-cluster.org/cgi-bin/trac.cgi


deployments. These utilities need at least MS-DOS 6.22 or higher. The DTK scripts determine the overall execution flow: what executes when, where to find the required files, and what error codes to return when problems occur. These scripts are optimized for MS-DOS 7.1 or higher. DTK can currently deploy two operating system families: Windows (2000 Advanced Server and 2003 Enterprise Edition) and Red Hat (Linux 9, Enterprise Linux 2.1 and 3).

DTK deploys an operating system in eight steps. (a) The sysinfo.exe utility acquires system information (configuration and hardware information). (b) The information returned by sysinfo.exe is used by the scripts to perform system-specific tasks. (c) BIOS flash updates and configuration are performed using DTK's utilities: the biosflsh.exe utility updates the system BIOS to the version specified by a .hdr file, and the bioscfg.exe utility configures system BIOS settings and saves them to a file. (d) The administrator uses the saved file with the biosrep.bat utility to replicate the model BIOS configuration to multiple systems. (e) The racadm.exe and racconf.exe utilities configure Dell Remote Assistance Cards (RACs). (f) RAID controllers are configured using the raidcfg.exe utility; this utility also creates containers that are used by the pertcfg.exe utility to create disk partitions usable during the OS installation. (g) The mount utility makes the disk partition available to the MS-DOS deployment tools. (h) The rhinst.bat and wininst.bat scripts perform the scripted OS deployment.

Analysis: DTK is specific to deployments of Dell PowerEdge servers. DTK uses utilities to configure each component and scripts to control the overall execution flow. The utility corresponding to each phase of DTK's operation is shown in Figure 4.4.

4.3.2 Kadeploy

Kadeploy [65] provides a set of tools for the configuration, installation and deployment of an operating system on a set of nodes. It can deploy Linux, BSD, Windows, and Solaris.

To use Kadeploy, users should have the privileges to execute it; the nodes' network hardware must be PXE compliant, meaning that the BIOS/EFI can boot from the Ethernet card; and a DHCP server, a TFTP server that meets the PXE standard, and Perl are required. Kadeploy installs a database and uses it to maintain persistent information about the cluster composition and current state.

Kadeploy performs a complete deployment in five steps: (a) reboot on the deployment kernel via the PXE protocol; (b) pre-installation actions (partitioning and/or file system building if needed, etc.) using a pre-installation script, executed before sending the system image; (c) copy of the system environment (system image); (d) post-installation, using post-installation scripts executed after the system image has been sent; and (e) finally, reboot on the freshly installed environment. If the deployment fails on some nodes, these are rebooted on a default environment.
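The five steps and the documented fallback (nodes on which deployment fails are rebooted on a default environment) can be sketched as a per-node loop; the step callables below are hypothetical stand-ins, not Kadeploy's actual tools.

```python
# Sketch of the per-node step sequence with the documented fallback: any
# node on which a step fails is rebooted on a default environment.
# The step callables are hypothetical stand-ins, not Kadeploy commands.
def deploy(nodes, steps, reboot_default):
    deployed, failed = [], []
    for node in nodes:
        try:
            for step in steps:   # reboot via PXE, pre-install, copy image,
                step(node)       # post-install, reboot on new environment
            deployed.append(node)
        except RuntimeError:
            reboot_default(node)
            failed.append(node)
    return deployed, failed
```

The point of the fallback is that a partially failed deployment still leaves every node in a bootable, known state.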

Kadeploy uses different tools to facilitate the deployment. Initially, the current state of the cluster's software and hardware composition must be registered in the database. The karecordenv tool is used to register the cluster "software" composition: if an environment is already installed on some partitions of some nodes, it is necessary to register it by


giving the appropriate information (image name, image location, kernel path, etc.) to the karecordenv tool. kaaddnode is used to register the cluster "hardware" composition: it records information such as the names and addresses of the nodes, disk type and size, and partition number and size in the database. After gathering all information about the cluster nodes, the system is ready for deployments.

kaarchive creates an environment image, and kacreateenv registers the environment image in the deployment system and makes it available for deployment. The registered environment is then deployed on the nodes by Kadeploy. Finally, nodes can be rebooted according to the requested reboot type using kareboot.

Analysis: Kadeploy is used for the configuration, installation and deployment of an operating system. It is used to manage nodes on Grid'5000 5. The selection of resources for deployment is done by the user. The installation and configuration of the files, and the launch, are done by specific Kadeploy tools. The specific tools for each deployment phase are mentioned in Figure 4.4.

4.3.3 Warewulf

Warewulf is a cluster implementation toolkit that manages and distributes Linux to any number of nodes using a simple master/slave technique.

Warewulf facilitates the process of installing a cluster and its long-term administration from one master. Warewulf requires at least one network interface for cluster use. This interface resides on a private physical network that all nodes are connected to, which automates the distribution of the node file system during node boot. It allows a central administration model for all slave nodes and includes the tools needed to build configuration files and to monitor and control the nodes. Warewulf has several dependencies 6 (other software that is required on the system for Warewulf to run properly).

Before doing any Warewulf configuration, the master node should be fully configured. Everything should be working on the master node (hardware, software, the correct drivers added, all network addresses configured, etc.) before the configuration and installation of the slaves (the other nodes of the cluster) can be started. The single image is propagated to the nodes via network or CD-ROM booting.

Warewulf deploys Linux on a cluster in six steps: (a) install GNU/Linux on the master node; (b) create the network boot image using a standard chroot'able file system referred to as a Virtual Node File System (VNFS); (c) download and install the Warewulf packages; (d) run the Warewulf configuration tools: Masterconf configures the master node, Nodeconf configures the VNFS, and Dhcp-build builds the dhcpd configuration file; (e) start the Warewulfd process ("/etc/init.d/warewulfd start"); (f) the nodes use Etherboot to obtain a boot image via DHCP and TFTP.

Using Warewulf, High Performance Computing (HPC) packages such as LAM-MPI/MPICH, SGE, PVM, etc., can be deployed throughout the cluster. Warewulf is

5 https://www.grid5000.fr
6 http://www.warewulf-cluster.org/cgi-bin/trac.cgi/wiki/DepInstall


also used for Web clusters (both HA and load-balanced), management of workstations in a lab (similar to the LTSP project), disk I/O clusters, databases, security (active/inline intrusion detection and vulnerability scanning clusters), etc.

Analysis: Warewulf uses a master/slave technique to install an operating system on specific nodes. Using the concept of the VNFS, the network boot image is created, and then from one specific node the image is propagated to the other machines. Warewulf uses tools to configure the master node and to build the DHCP configuration file.

4.4 Conclusion

No general deployment tool exists; most deployment tools deploy specific middleware or application components. For all tools, the information about resources has to be given, as none of the deployment tools has an automatic resource discovery mechanism. Only two tools have a deployment planning mechanism. Resource selection is mostly user-defined, or all the discovered resources are selected to be part of the platform to be deployed. Among the presented software deployment tools, ADAGE seems interesting as it considers a general description of the input data: a general description of applications and a general description of resources. Still, ADAGE is in its initial phase of development, because it has no intelligent deployment planner; only two very basic planners (round-robin and random) are implemented.

System deployment tools have no automatic resource discovery or planning mechanism. Automatic resource discovery means automatically discovering new machines that join the pool used for the deployment; a planning mechanism generates the deployment plan that is submitted as input to the deployment tool. Even resource selection is predefined: the selected resources are defined manually, either by the user or by the administrator. The presented system deployment tools differ from each other in many ways. For example, Kadeploy can deploy many operating systems (Linux, Solaris, Windows), whereas Warewulf can deploy only Linux, and only in a master/slave fashion. In Warewulf, the master node provides interactive logins and job queuing to the slaves and acts as a gateway between the Internet and the cluster network; it provides central management for all nodes. Slave nodes are only available on the private cluster network and are optimized for computation. The main drawback of Warewulf is the single point of failure on the master node. Kadeploy and DTK, on the other hand, deploy the operating system on a cluster by copying the boot image onto each node of the cluster. Kadeploy deploys an operating system on any type of cluster node, whereas DTK deploys only on Dell server clusters.

From the information presented in this chapter, it is clear that a generic deployment tool and a deployment planning tool are needed to transparently harness the power of the computational grid and to let end users easily access its computing power.


Figure 4.4: Comparison of some deployment tools. The figure tabulates the mechanism each tool (ADAGE, GoDIET, JDF, Pegasus, Sekitei, DTK, Kadeploy, Warewulf) uses in each deployment phase: automatic resource discovery, planning, resource selection, remote file installation, pre-configuration, launch, and post-configuration.


Chapter 5

Heuristic to Structure Hierarchical Scheduler

To efficiently use the computing power of distributed resources, middleware systems have been developed. However, the placement of the middleware elements on the distributed resources is mostly defined by the user or by the administrator. The deployment tools presented in Chapter 4 deploy a given platform; none of them can select appropriate resources for deployment from a pool of resources. Selecting a resource among a set of resources for a particular middleware element requires knowledge about both the middleware and the resources, and the throughput of the deployed platform depends on the type of application submitted to it. For end users it is therefore a very cumbersome task to select a good middleware deployment according to the characteristics of the application and the power of the resources; only an expert may select a good deployment in reasonable time.

In this chapter we present a heuristic to deploy middleware on homogeneous resources. We call this process deployment planning: the planning needed to arrange the resources in such a manner that, once the middleware components are deployed, the maximum number of requests can be processed by this arrangement (platform). We validate our deployment planning heuristic by using it to structure a hierarchical middleware.

5.1 Introduction

The issue of deployment planning for arbitrary arrangements of distributed schedulers is too broad. We focus on hierarchical arrangements, because a hierarchy is a simple and effective distribution approach and has been chosen by a variety of middleware environments as their primary distribution approach [28, 38, 51, 75]. Before trying to optimize deployment planning on arbitrary, distributed resource sets, we target a subproblem: "what is the optimal hierarchical deployment on a cluster with hundreds to thousands of nodes?". This problem is not as simple as it may sound: one must decide "how many resources should be used in total?", "how many should


be dedicated to scheduling or computation?", and "which hierarchical arrangement of schedulers is most effective, i.e., more schedulers near the servers, near the root agent, or a balanced approach?".

In practice, running a throughput test on all deployments to find the optimal deployment is unmanageable; even testing a few samples is time-consuming. In this chapter we provide efficient and effective planning for identifying a good deployment that will perform well in homogeneous cluster environments. We consider that a good deployment is one that maximizes the steady-state throughput of the system, i.e., the number of requests that can be scheduled, launched, and completed by the servers in a given time unit. The deployment planning heuristic determines in which hierarchical organization the nodes should be arranged so that the goal of maximizing the steady-state throughput is achieved. We also provide algorithms to modify the obtained hierarchy so as to limit the number of resources used to the number available.
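As a first intuition for "steady-state throughput", the sustainable request rate of a scheduler-plus-servers platform is capped by its slowest stage. The one-liner below only illustrates this bottleneck view; it is not the model developed in this chapter.

```python
# Bottleneck illustration only: requests flow through the scheduler to the
# servers, so the sustainable rate is the minimum of what the scheduler
# can dispatch and what the servers can complete (requests/second).
def steady_state_throughput(scheduler_rate, server_rates):
    return min(scheduler_rate, sum(server_rates))
```

This already hints at the trade-off studied here: adding servers raises the second term but, past some point, the scheduling stage becomes the bottleneck, which is why the shape of the scheduler hierarchy matters.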

The deployment planning heuristic used to deploy an NES on a homogeneous cluster is based on the constraint of maximizing the throughput of each agent with respect to its children. We use DIET to validate our deployment heuristic: first, we develop detailed performance models for the DIET system and validate them in a real-world environment; we then present real-world experiments demonstrating that the deployments automatically derived by our approach are in practice nearly optimal.

5.1.1 Operating models

The architectural model that we consider is a tree-shaped platform. Processors in the platform can perform three operations in a parallel or serial manner. Depending upon how the processor operates, the authors of [17] define six different architectural models, shown in Figure 5.1, where "r" stands for receive, "s" stands for send, and "w" stands for work, i.e., compute. In Figure 5.1, when two squares are placed next to each other horizontally, only one of the corresponding operations can be performed at a time, while vertical placement indicates that concurrent operation is possible. "‖" (respectively ",") is used to indicate parallel (respectively sequential) execution of the operations in the models.

M(r∗‖s∗‖w): Full overlap, multiple-port - In this first model, a processor node can simultaneously receive data from its parent, perform some (independent) computation, and send data to all of its children. This model is not realistic if the number of children is large.

M(r‖s‖w): Full overlap, single-port - In this second model, a processor node can simultaneously receive data from its parent, perform some (independent) computation, and send data to one of its children. At any given time-step, there are at most two communications taking place, one from the parent and/or one to a single child.

M(r‖s, w): Receive-in-Parallel, single-port - In this third model, as in the next two, a processor node has one single level of parallelism: it can perform two actions


Figure 5.1: Classification of the operating models. The figure depicts the six models M(r∗‖s∗‖w), M(r‖s‖w), M(r‖s, w), M(s‖r, w), M(w‖r, s), and M(r, s, w), showing for each which of the receive (r), send (s), and work (w) operations may overlap.

simultaneously. In the M(r‖s, w) model, a processor can simultaneously receive data from its parent and either perform some (independent) computation or send data to one of its children. The only parallelism inside the node is the possibility to receive from the parent while doing something else (either computing or sending to one child).

M(s‖r, w): Send-in-Parallel, single-port - In this fourth model, a processor node can simultaneously send data to one of its children and either perform some (independent) computation or receive data from its parent. The only parallelism inside the node is the possibility to send to one child while doing something else (either computing or receiving from the parent).

M(w‖r, s): Work-in-Parallel, single-port - In this fifth model, a processor node can simultaneously compute and execute a single communication, either sending data to one of its children or receiving data from its parent.

M(r, s, w): No internal parallelism - In this sixth and last model, a processor node can only do one thing at a time: either receive from its parent, compute, or send data to one of its children.

We use the sixth model, "No internal parallelism", to present our heuristic. We selected this model because we consider that each NES component can perform only one operation at a time. This restriction may stem from the machine's network card, which sends and receives data sequentially; any computing time that overlaps with sending or receiving is assumed to be negligible.

5.2 Architectural model

The target architectural framework is represented by a weighted graph G = (V, E, w, B). Each P ∈ V represents a computing resource with computing power w measured in MFlop/second. One or more client nodes Pc generate requests for the hierarchy. Each link between two nodes is labelled by its bandwidth value B in Mb/second.

Platform nodes: The available nodes are divided into two types, agents and servers. A is the set of agents; agents are the scheduler nodes that coordinate incoming requests and forward them to the appropriate servers. S is the set of servers; servers provide services to clients. Each node P ∈ V is assigned to be either one server or one agent. When responses are generated by the servers, the agents may coordinate the responses. Client nodes are not counted among the nodes available for constructing the hierarchy. Thus, |A| + |S| = |V|.

Task processing: Sin is the size of the request forwarded by an agent to its children, and Sout is the size of the response (the reply to the request) produced by each node; both are measured in Mb/request. Win denotes the amount of computation needed by P ∈ A to process one incoming request, and Wout denotes the amount of computation needed by P ∈ A to merge the responses from its children (in short, Wout is the time needed for sorting the servers). WXser is the amount of computation needed by P ∈ S to execute the service Xser, and Wpred is the amount of computation needed by P ∈ S to predict the computation time of an incoming request.

5.3 Deployment constraints

We adopt the steady-state scheduling concept [19] for our model because we are interested in maximizing steady-state throughput. In steady-state techniques, the performance of the startup and shutdown phases is not considered, and the initial integer formulation is replaced by a continuous or rational one. The main idea is to generate a hierarchy in which each resource performs all of its tasks (sending, receiving and computing) during each time-unit.

The overall throughput of the system depends on the bandwidth of the links, the message size of requests, the time each node spends computing on each request, and the computing power of the nodes. We therefore have the following constraints:

Computation constraint for an agent: As an agent uses its computing power to treat two types of jobs (requests and responses), the throughput of an agent according to its computing power is given by Equation (5.1).

∀P ∈ A : w / (Win + Wout)    (5.1)

We re-organize Equation (5.1) to express the number of requests computed by an agent, denoted Nodecomp, in Equation (5.2). This equation states that the sub-tree throughput cannot be larger than the throughput of the agent.

Nodecomp ≤ w / (Win + Wout), ∀P ∈ A    (5.2)

Communication constraint for an agent: Depending on the physical network, the bandwidth of a link is either shared by incoming and outgoing data or fully duplex. Depending on the type of bandwidth utilization, one of the two expressions in Equation (5.3) is used to calculate the number of requests transmitted per time-unit along each link.

Commreq ≤ B / (Sin + Sout)  ‖  min(B / Sin, B / Sout), ∀ links    (5.3)

Computation constraint for a server: As a server also uses its computing power to treat two jobs (prediction and actual execution), the number of requests that can be processed by a node P ∈ S in a time step is given by Equation (5.4).

Servercomp ≤ w / (WXser + Wpred), ∀P ∈ S    (5.4)

Servers are the leaves of the tree and have no children to communicate with. The agent communication constraint covers communication with both children and parent; for a server, only communication with its parent agent needs to be considered.
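The three throughput bounds above can be sketched as small helper functions. This is a minimal illustration: the function and parameter names are our own, but each formula follows Equations (5.1)-(5.4) of this section.

```python
# Steady-state throughput bounds of Section 5.3 (illustrative sketch).

def node_comp(w, w_in, w_out):
    """Max requests per time-unit an agent can schedule (Eq. 5.2)."""
    return w / (w_in + w_out)

def comm_req(B, s_in, s_out, duplex=False):
    """Max requests per time-unit a link can carry (Eq. 5.3).
    Shared bandwidth: B / (Sin + Sout); full duplex: min(B/Sin, B/Sout)."""
    if duplex:
        return min(B / s_in, B / s_out)
    return B / (s_in + s_out)

def server_comp(w, w_xser, w_pred):
    """Max requests per time-unit a server can execute (Eq. 5.4)."""
    return w / (w_xser + w_pred)
```

For example, with w = 100 MFlop/s, Win = 2 and Wout = 3 MFlop/request, an agent is bounded at 20 requests per time-unit.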

5.4 Deployment construction

For hierarchical deployment construction, we consider that when the responses sent by an agent and by a server have the same format and size, an agent can support the same number of children irrespective of the children's type (server or agent). But if the format and size of the replies from agents and servers differ, the number of servers supported by an agent will differ from the number of agents supported by an agent. For example, if a server replies to its parent agent with a task's predicted execution time and its status, while an agent replies only with the list of sorted servers, then the number of agent children supported by the parent agent will differ from the number of server children it supports. The maximum number of nodes that can be connected to an agent without making it a bottleneck can be calculated from the constraints of Section 5.3, as shown below.

Number of servers supported by an agent: MSPA (Maximum Servers Per Agent) is the maximum number of servers that a node P ∈ A can support.

MSPA = Nodecomp / min(Servercomp, Commreq)    (5.5)

Number of agents supported by another agent: The maximum number of agents that can be attached to another agent is given by MAPA (Maximum Agents Per Agent).

MAPA = Nodecomp_i / min(Nodecomp_j, Commreq), Pi, Pj ∈ A    (5.6)

We distinguish Pi and Pj as two different nodes, even though they belong to the same set, because for some NES such as DIET this equation can also be used to calculate the number of intermediate nodes supported by other nodes by drawing Pi and Pj from different sets.
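Equations (5.5) and (5.6) can be sketched directly in code; the arguments are the throughput bounds defined in Section 5.3, and the function names are our own illustrative shorthand.

```python
def mspa(node_comp, server_comp, comm_req):
    """Eq. 5.5: max servers one agent can support without becoming a bottleneck."""
    return node_comp / min(server_comp, comm_req)

def mapa(node_comp_i, node_comp_j, comm_req):
    """Eq. 5.6: max agents (throughput Nodecomp_j) one agent (Nodecomp_i) can support."""
    return node_comp_i / min(node_comp_j, comm_req)
```

For instance, an agent able to schedule 20 requests per time-unit, with servers bounded at 5 requests and links at 4 requests, supports MSPA = 20/4 = 5 servers.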


As the throughput of the platform can be rational, the values of MAPA and MSPA are not necessarily integers; they can be rational too. To make these values integers we round either up or down, using a simple rule based on the number of available nodes: if enough nodes are available we round up, otherwise we round down.
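The rounding rule just described amounts to a one-line helper (a sketch; the function name and boolean flag are hypothetical):

```python
import math

def integerize(value, enough_nodes_available):
    """Round a rational MAPA/MSPA value: up when spare nodes exist, down otherwise."""
    return math.ceil(value) if enough_nodes_available else math.floor(value)
```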

The level of the hierarchy is denoted by l. The total number of nodes required for the hierarchy, n_req, is calculated as follows.

n_req = Σ_{k=0}^{l} MAPA^k + MAPA^l × MSPA    (5.7)

The hierarchy is constructed using Algorithm 5.1. This algorithm calls Algorithms 5.2 and 5.3 so that no more than the available number of nodes is used.

Algorithm 5.1 Construction Algorithm
1: calculate MSPA using Equation (5.5)
2: calculate MAPA using Equation (5.6)
3: l = 0, n_req = 0
4: while n_req < n do
5:     n_reqlow = n_req
6:     calculate n_req using Equation (5.7)
7:     l++
8: l−−, n_reqhigh = n_req
9: if (n − n_reqlow) > (n_reqhigh − n) then
10:     call Make_Hierarchy(l, MAPA, MSPA)
11:     call Node Removal Algorithm 5.2 with extra_nodes = n_reqhigh − n
12: else
13:     l−−
14:     call Make_Hierarchy(l, MAPA, MSPA)
15:     call Node Addition Algorithm 5.3 with extra_nodes = n − n_reqlow

In Algorithm 5.1 a few variables and a procedure are introduced to simplify its presentation. n is the total number of nodes available for hierarchy construction. n_reqlow is the node count of the last hierarchy whose requirement stayed below n. n_reqhigh is the number of nodes required to construct a complete hierarchy, where complete means that every agent has the maximum number of children it can support; this total may exceed the number of available nodes n. extra_nodes denotes the difference between the nodes counted for the construction of the hierarchy and the n available nodes. Algorithms 5.2 and 5.3 use some of the variables of Algorithm 5.1 in addition to their own variables. The number of agents to be removed, computed in Algorithm 5.2, is stored in the variable remove_agent; if some nodes still remain to be removed afterwards, their count is held in the variable left_nodes. In Algorithm 5.3, the variable add_agent represents the computed number of agents that should be added to the hierarchy.

Procedure: Make_Hierarchy(l, MAPA, MSPA)
1: add a node
2: while l > 0 do
3:     add MAPA agents to each (lowest-level) agent without children
4:     l−−
5: add MSPA servers to each (lowest-level) agent without children
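The procedure builds a complete MAPA-ary tree of agents of depth l, with MSPA servers under each lowest-level agent. A recursive sketch (our own representation: each agent is a dict holding its server count and child agents):

```python
def make_hierarchy(level, mapa, mspa):
    """Build a complete hierarchy as nested dicts (sketch of Make_Hierarchy)."""
    if level == 0:
        return {"servers": mspa, "children": []}      # lowest-level agent
    return {"servers": 0,
            "children": [make_hierarchy(level - 1, mapa, mspa)
                         for _ in range(mapa)]}

def count_nodes(node):
    """Count the agent itself, its servers, and everything below it."""
    return 1 + node["servers"] + sum(count_nodes(c) for c in node["children"])
```

With MAPA = MSPA = 3 and l = 1, count_nodes yields 13, the value of n_req used in the example below.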

This algorithm is illustrated with a small example. Let n = 10, and suppose the request replies from agents and servers are of the same size; then by steps 1 and 2, MSPA equals MAPA. We take the arbitrary value 3 for both MAPA and MSPA. Step 3 initializes the level l and the number of required nodes n_req to zero, so n_reqlow = 0. Following step 6, we calculate n_req = 4. As n_req is less than n, we repeat steps 5, 6 and 7; with l = 1 the number of required nodes, n_req, is 13. Having calculated the maximum number of nodes required, we build the hierarchy using the procedure Make_Hierarchy. If more nodes were counted than are available, we then apply Node Removal Algorithm 5.2, otherwise Node Addition Algorithm 5.3; step 9 ensures that the smaller number of nodes is added or removed. So, following step 9, we call the procedure: first a node acting as an agent is added, then 3 (the value of MAPA) nodes are attached to this agent. The level is reduced to 0, so 3 (the value of MSPA) server nodes are attached to each lowest-level agent. In total 13 nodes are used in this hierarchy, and Node Removal Algorithm 5.2 is then applied (step 11) to remove the 3 extra nodes.

Node Removal Algorithm 5.2 removes the extra nodes that were counted while constructing the hierarchy; in the above example extra_nodes equals three. As extra_nodes equals MSPA, step 5 of the algorithm is executed: we remove one bottom agent together with its servers, because an agent cannot exist without children. In this example the removed agent node could instead have been kept as a server, since here an agent supports the same number of agent and server children; but if the replies of agents and servers differed, this change would not be possible. We therefore prefer to remove an agent that has no children, irrespective of the values of MAPA and MSPA. The final hierarchy uses 9 nodes.

5.5 Model implementation for DIET

We make some assumptions in order to use DIET for heuristic validation. The MA and LAs are considered to have the same performance. We assume that an agent can have either agents or servers as children, but not both. The root of the tree is always an MA; thus all clients submit their requests to the DIET hierarchy through one MA. When client requests are sent to the agent hierarchy, DIET is optimized such that large data items like matrices are not included in the problem-parameter descriptions (only their sizes are included); these large data items are included only in the final computation request from client to server. Since we are primarily interested in modeling throughput for the agent hierarchy as an example of a hierarchical middleware system, we assume that large data items are already in place on the server.

Algorithm 5.2 Node Removal Algorithm
1: while (extra_nodes > 0) do
2:     if extra_nodes < MSPA then
3:         remove extra_nodes servers from one of the bottom agents. Exit
4:     else if (extra_nodes == MSPA || extra_nodes == MSPA + 1) then
5:         remove one bottom agent with all its servers. Exit
6:     else
7:         remove_agent = extra_nodes - mod (MSPA)
8:         if remove_agent > (MAPA × MSPA) + MAPA then
9:             remove MAPA agents with all their servers
10:            add MSPA servers to the agent that now has neither agents nor servers connected to it
11:            left_nodes = extra_nodes − ((MAPA × MSPA) + MAPA) + MSPA
12:            extra_nodes = left_nodes
13:        else
14:            remove remove_agent agents with all their servers
15:            left_nodes = extra_nodes − ((remove_agent × MSPA) + remove_agent)
16:            extra_nodes = left_nodes

Algorithm 5.3 Node Addition Algorithm
1: while (extra_nodes > 1) do
2:     if all bottom-level agents have the maximum number of servers then
3:         add one node to a bottom agent
4:         add the parent's servers as its servers
5:         extra_nodes −−
6:     else
7:         if extra_nodes ≤ MSPA then
8:             add one agent to an agent having fewer than MAPA child agents
9:             add extra_nodes − 1 servers to this newly added agent. Exit
10:        else
11:            add_agent = extra_nodes - mod (MSPA)
12:            if extra_nodes ≥ add_agent + add_agent × MSPA then
13:                while (add_agent > 0) do
14:                    add one agent to an agent having fewer than MAPA child agents
15:                    add MSPA servers to this newly added agent
16:                    add_agent −−
17:                extra_nodes = extra_nodes − (add_agent + add_agent × MSPA)
18:            else
19:                add_agent −−
20:                go to line 13

5.6 Experimental results

In this section we present experiments designed to test the ability of the deployment heuristic to correctly identify good real-world deployments appropriate to a given resource environment and workload. We present the experimental design, the corresponding parametrization, and experiments testing the accuracy of our deployment performance model; this accuracy is key to providing confidence in the deployment algorithms. The algorithms are then tested in experiments comparing the performance of the best deployment identified by our approach against the performance of other intuitive deployments.

5.6.1 Experimental design

Software: DIET is used for all deployed agents and servers. The deployment arrangements are defined according to the experimental goals and are described in the following sections. Once a deployment has been chosen, GoDIET [85] is used to perform the actual software deployment.

Job types: Since our performance model and deployment approach focus on maximizing steady-state throughput, our experiments test the maximum sustained throughput provided by different deployments. As an initial study we consider DGEMM, a simple matrix multiplication provided as part of the Basic Linear Algebra Subprograms (BLAS) package [35].

Workload: Measuring the maximum throughput of a system is non-trivial: if too little load is introduced the maximum performance may not be reached, while too much load may also degrade performance. We therefore introduce small quantities of steady-state load in the form of a client script that, in a continual loop, launches one request, waits for the response, and then sleeps 0.05 seconds. With one client script there is at most a single request in the system. We then increase the load gradually by launching one client script every five seconds; four scripts are launched on each of 35 client machines. A full throughput test thus takes 700 seconds.
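The load-injection schedule above can be checked with simple arithmetic (the constant names are ours; the values come from the text):

```python
# One new client script every 5 seconds, 4 scripts on each of 35 machines.
SCRIPTS_PER_MACHINE = 4
MACHINES = 35
LAUNCH_INTERVAL_S = 5

total_scripts = SCRIPTS_PER_MACHINE * MACHINES    # 140 concurrent scripts at full load
ramp_up_time = total_scripts * LAUNCH_INTERVAL_S  # 700 seconds for a full test
```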

Resources: For these experiments we used a 55-node cluster at the École Normale Supérieure in Lyon, France. Each node provided dual AMD Opteron 246 processors @ 2 GHz, a cache size of 1024 KB, and 2 GB of memory. All nodes are connected by both a Gigabit Ethernet and a 100 Mb/s Ethernet; unless otherwise noted, all communications were sent over the Gigabit network. As measured with the Network Weather Service [82], the available bandwidth on this network is 909.5 Mb/s and the latency is 0.08 msec.


Model parametrization: To collect the values needed to calculate MSPA and MAPA, we measure the performance of a benchmark task (Xser) using a small hierarchy of one agent and one server. We assume that the maximum throughput of an agent or a server can be calculated as the inverse of the time required by that element to treat a single request. Thus we measured the time required to treat a request at the server level and at the agent level. At the agent level the processing time depends on the number of children attached to the agent, so we collected benchmarks for a variety of sizes of star-based hierarchies. We then used a linear fit of the results to generate a simple model of agent-level throughput as a function of the number of children. To measure data transfer within the hierarchy, we used tcpdump to monitor all data transferred between elements and ethereal to analyze the data. With these benchmarks we obtained predictions of all the unknown variables, such as WXser, Win and Wout. The benefit of this estimation approach is that measurements of the time required to treat a request are fairly easy to collect, and each benchmark can be run in a few minutes with only a single client script. However, this approach assumes that there is no overhead to running requests in parallel and that all resources can be perfectly shared. Estimates generated in this fashion will therefore tend to overestimate the throughput of servers and agents.

Figure 5.2: Star hierarchies with one or two servers for DGEMM 150x150 requests. (a) Real-world platform throughput (requests/second) at different load levels (number of clients), for 1 SeD and 2 SeDs. (b) Comparison of predicted and measured maximum throughput.

5.6.2 Performance model validation

The usefulness of our deployment approach depends heavily on the ability to predict the maximum steady-state throughput of each element in a deployment. This section presents experiments designed to evaluate this ability.


Figure 5.3: Star hierarchies with one or two servers for DGEMM 10x10 requests. (a) Throughput at different load levels. (b) Comparison of predicted and actual maximum throughput.

Figure 5.4: Star hierarchies with two servers using a Gb/s or a 100 Mb/s network; the workload was DGEMM 10x10. (a) Throughput at different load levels. (b) Comparison of predicted and actual maximum throughput.


The first test, shown in Figure 5.2, uses a workload of DGEMM 150x150 to compare the performance of two hierarchies: an agent with one server versus an agent with two servers. For this scenario, the model predicts that both hierarchies are limited by server performance, and that performance will therefore roughly double with the addition of the second server. These predictions are accurate: the model correctly predicts the absolute performance of these deployments and also that the two-server deployment will be the better choice.

Figure 5.3 uses a workload of DGEMM 10x10 to compare the performance of the same one- and two-server hierarchies. The model correctly predicts that both deployments are limited by agent performance and that the addition of the second server will in fact hurt performance. The error in the magnitude of the performance prediction is about 20-30%; we believe this error arises from overhead introduced by running so many requests in parallel on the agent machine. Figure 5.4 compares performance using a Gb/s network versus a 100 Mb/s network; two-server hierarchies are used in both cases with a workload of DGEMM 10x10. The model correctly predicts that the network is not the limiting factor on steady-state performance.

In summary, our deployment performance model is able to predict server throughput and the impact of adding servers to a server-limited or agent-limited deployment. Greater accuracy could be obtained by adjusting our model parametrization to account for the results of this section; however, such an adjustment phase would greatly complicate parametrization, and we prefer to maintain an approach that can be applied rapidly to other software and resource environments.

Figure 5.5: Comparison of the automatically-generated hierarchy with hierarchies containing twice as many and half as many servers (throughput versus number of clients). Maximum throughput: Automatic 689 req/sec, Larger 631 req/sec, Smaller 540 req/sec.


Figure 5.6: Types of platforms compared: the hierarchy automatically defined by the model (two mid-level agents with 7 servers each), the star graph (one agent with 14 servers), and the balanced platform (four mid-level agents with 4, 4, 3 and 3 servers).

Figure 5.7: Comparison of the automatically-generated hierarchy with intuitive alternative hierarchies. Maximum throughput: Automatic 689 req/sec, Balanced 673 req/sec, Star 620 req/sec.

5.6.3 Deployment selection validation

In this section we test the ability of the deployment algorithm to select an appropriate arrangement of agents and servers for deployment. We were not able to find alternative deployment algorithms applicable to our scenario for comparison; in our experience, given the lack of automatic tools, users typically define deployments by hand using intuition about the scenario. We therefore compare the model-defined deployment against several alternatives that, in our experience, would be reasonable intuitive choices for users.

We wish to find the best deployment on our 50-machine cluster for a workload of DGEMM 150x150. Our deployment algorithm predicts a deployment with a top-level agent and two middle-level agents, where each middle-level agent supports 7 servers; this deployment thus contains 14 servers and 17 machines in total. To check whether the algorithm selected the correct number of servers, Figure 5.5 compares this Automatic deployment against deployments with the same two-level hierarchy of agents but with different numbers of servers. The Larger deployment contains twice as many servers (14 on each middle-level agent, so 28 servers in total), while the Smaller deployment contains roughly half as many (4 on each middle-level agent, so 8 servers in total). The Automatic deployment provides a significantly higher maximum throughput than the others.

Although it is important that the deployment approach selects an appropriate number of resources, it is also important that it selects an appropriate hierarchy. We therefore design two alternative deployments that we have found to be typical hierarchy styles among users. To remove the effect of the number of servers, these deployments use exactly 14 servers, the number chosen by our approach. The Star deployment attaches all 14 servers directly to the top-level agent. The Balanced deployment tries to balance work amongst the agents by using the same branching factor at all levels of the hierarchy; for 14 servers the most balanced hierarchy uses a top-level agent with 4 middle-level agents and 3 or 4 servers attached to each mid-level agent. Figure 5.6 shows these three platforms.
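The "3 or 4 servers per mid-level agent" split is simply the most even partition of 14 servers over 4 agents, which can be computed as follows (a sketch; the function name is ours):

```python
def balanced_split(n_servers, n_agents):
    """Spread servers as evenly as possible over mid-level agents."""
    base, extra = divmod(n_servers, n_agents)
    return [base + 1] * extra + [base] * (n_agents - extra)
```

For 14 servers and 4 agents this yields the [4, 4, 3, 3] arrangement shown in Figure 5.6.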

Figure 5.7 compares the performance achieved by these three deployments. The Automatic approach performs best and provides a significant advantage over the Star topology. The Balanced approach, however, performs almost as well as the Automatic approach; this result is not surprising, as the two hierarchies are fairly similar in structure.

5.7 Conclusion

In this chapter we have presented a heuristic to determine the structure of a hierarchical scheduler for a homogeneous resource platform. The heuristic determines how many nodes should be used, and in what hierarchical organization, with the goal of maximizing steady-state throughput.

The main focus of the heuristic is to construct a hierarchy that maximizes the throughput of each node, where this throughput depends on the number of children a node is connected to in the hierarchy. The number of children supported by a node depends on the type of those children (server or agent); thus an exact calculation of MAPA and MSPA is required to structure hierarchical schedulers correctly. Models for the calculation of MSPA and MAPA are given for the DIET hierarchical scheduling system, from which the exact number of nodes required to construct the hierarchy is calculated. Algorithms that use only the available number of nodes in hierarchy construction are also given.

This chapter provides the first step towards automatic middleware deployment planning on homogeneous clusters. The next chapter presents an optimal algorithm for automatic middleware deployment on homogeneous clusters.


Chapter 6

Automatic Middleware Deployment Planning on Clusters

In Chapter 5 we presented a heuristic for hierarchical middleware deployment on homogeneous resources. We experimentally validated the heuristic deployments, which were implemented under limiting conditions; for example, an agent could have either servers or agents as children, but not both. Deployments based on the heuristic performed well compared to other intuitive deployments.

While implementing the heuristic of Chapter 5 on the hierarchical NES environment DIET, we observed that the request replies of an agent and of a server always have the same format. Since the resources are homogeneous, by considering the constraints of the middleware's deployment phases we were able to derive an algorithm for optimal middleware deployment on homogeneous resources. We therefore divided the operation of the middleware into two phases, generated the constraints from them, and found an optimal deployment on homogeneous resources.

This chapter explains our optimal middleware deployment planning approach in detail. We show that the optimal arrangement of schedulers is a complete spanning d-ary tree; this result agrees with existing results in load-balancing and routing from the scheduling and networking literature. Moreover, we can automatically derive the optimal theoretical degree d for the tree.

The optimal algorithm considers the throughput of each execution phase of the middleware: it is based on the throughput achieved by the scheduling phase and the servicing phase. To test the algorithm we use the DIET middleware. First we develop detailed performance models for the DIET system and validate them in a real-world environment. We then present real-world experiments demonstrating that the deployments automatically derived by our approach are in practice nearly optimal and perform significantly better than other reasonable deployments.


6.1 Platform deployment

6.1.1 Platform architecture

This section defines our target platform architecture; Figure 6.1 provides a useful reference for these definitions.

Figure 6.1: Platform deployment architecture and execution phases. Scheduling phase: 1) scheduling request from client; 2) request forwarded down the agent hierarchy; 3) request prediction and response generation at the servers; 4) responses sorted and forwarded up; 5) scheduling response (reference of the selected server) returned to the client. Service phase: 6) service request from client to server; 7) the server runs the application and generates a response; 8) service response returned to the client.

Software system architecture - We consider a service-provider software system composed of three types of elements: a set of client nodes C that require computations, a set of server nodes S that provide computations, and a set of agent nodes A that coordinate client requests with service offerings via service localization, scheduling, and persistent data management. The arrangement of these elements is shown in Figure 6.1. We consider only hierarchical arrangements of agents, composed of a single top-level root agent and any number of agents arranged in a tree below it. Server nodes are leaves of the tree, but may be attached to any agent in the hierarchy, even one whose other children are agents.

Since multiple agents are used to distribute the cost of services such as scheduling, there is no performance advantage to an agent having a single child. The only exception to this rule is a root-level agent with a single server child; this "chain" cannot be reduced.

We do not consider clients to be part of the hierarchy or of the deployment, because at deployment time we do not know where the clients will be located. A hierarchy can thus be described as follows: a server s ∈ S has exactly one parent, which is always an agent a ∈ A; a root agent a ∈ A has no parent and one or more child agents and/or servers; and a non-root agent a ∈ A has exactly one parent and two or more child agents and/or servers.
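The validity rules above (servers are leaves; non-root agents have at least two children; a root agent may have a single child only when that child is a server) can be sketched as a recursive check. The node representation is our own: a (kind, children) pair.

```python
# Hypothetical validity check for the hierarchy rules described above.

def valid(node, is_root=True):
    kind, children = node                   # kind is "agent" or "server"
    if kind == "server":
        return not children                 # servers must be leaves
    if len(children) == 1:
        # the only allowed single-child "chain": root agent -> one server
        ok = is_root and children[0][0] == "server"
        return ok and valid(children[0], False)
    if len(children) < 2:
        return False                        # agents cannot be childless
    return all(valid(c, False) for c in children)
```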

Request definition - We consider a system that processes requests as follows. A client c ∈ C first generates a scheduling request, which contains information about the service required by the client and meta-information about any input data sets, but not the actual input data. The scheduling request is submitted to the root agent, which checks it and forwards it to its children. The other agents in the hierarchy perform the same operation until the scheduling request reaches the servers. We assume that the scheduling request is forwarded to all servers, though this is a worst-case scenario, as agents may filter requests by type. Servers may or may not make predictions about the performance of satisfying the request, depending on the exact system.

Servers that can perform the service then generate a scheduling response. The scheduling response is returned back up the hierarchy, and the agents sort and select amongst the various scheduling responses. It is assumed that the time required by an agent to select amongst scheduling responses increases with the number of children it has, but is independent of whether the children are servers or agents. Finally, the root agent forwards the chosen scheduling response (i.e., the selected server) to the client.

The client then generates a service request, which is very similar to the scheduling request but includes the full input data set, if any is needed. The service request is submitted by the client to the chosen server. The server performs the requested service and generates a service response, which is then sent back to the client. A completed request is one that has completed both the scheduling and service request phases and for which a response has been returned to the client.

Resource architecture - The target resource architectural framework is represented by a weighted graph G = (V, E, w, B). Each vertex v in the set of vertices V represents a computing resource with computing power w in MFlop/second. Each edge e in the set of edges E represents a resource link between two resources with edge cost B given by the bandwidth between the two nodes in Mb/second. We do not consider latency in data transfer costs because our model is based on steady-state scheduling techniques [19]. Usually, the latency is paid once for each of the communications that take place. In steady-state scheduling, however, as a flow of messages takes place between two nodes, the latency is paid only once (when the flow is initially established) for the whole set of messages. Therefore latency has an insignificant impact on our model and so we do not take it into account.

Deployment assumptions - We consider that at the time of deployment we do not know the client locations or the characteristics of the client resources. Thus clients are not considered in the deployment process and, in particular, we assume that the set of computational resources used by clients is disjoint from V.

A valid deployment thus consists of a mapping of a hierarchical arrangement of agents and servers onto the set of resources V. Any server or agent in a deployment must be connected to at least one other element; thus a deployment can have only connected nodes. A valid deployment will always include at least the root-level agent and one server. Each node v ∈ V can be assigned to either exactly one server s, exactly one agent a, or the node can be left idle. Thus if the total number of agents is |A|, the total number of servers is |S|, and the total number of resources is |V|, then |A| + |S| ≤ |V|.

6.1.2 Optimal deployment

Our objective is to find an optimal deployment of agents and servers for a set of resources V. We consider an optimal deployment to be a deployment that provides the maximum throughput ρ of completed requests per second. When the maximum throughput can be achieved by multiple distinct deployments, the preferred deployment is the one using the least resources.

As described in Section 6.1.1, we assume that at the time of deployment we do not know the locations of clients or the rate at which they will send requests. Thus it is impossible to generate an optimized, complete schedule. Instead, we seek a deployment that maximizes the steady-state throughput, i.e., the main goal is to characterize the average activities and capacities of each resource during each time unit.

We define the scheduling request throughput in requests per second, ρsched, as the rate at which requests are processed by the scheduling phase (see Section 6.1.1). Likewise, we define the service throughput in requests per second, ρservice, as the rate at which the servers produce the services required by the clients. The following lemmas lead to a proof of an optimal deployment shape of the platform.

Lemma 1. The completed request throughput ρ of a deployment is given by the minimum of the scheduling request throughput ρsched and the service request throughput ρservice.

ρ = min(ρsched, ρservice)

Proof. A completed request has, by definition, completed both the scheduling request and the service request phases.

Case 1: ρsched ≥ ρservice. In this case requests are sent to the servers at least as fast as they can be serviced by the servers, so the overall rate is limited by ρservice.

Case 2: ρsched < ρservice. In this case the servers are left idle waiting for requests and new requests are processed by the servers faster than they arrive. The overall throughput is thus limited by ρsched. □

The degree of an agent is the number of children directly attached to it, regardless of whether the children are servers or agents.

Lemma 2. The scheduling throughput ρsched is limited by the throughput of the agent with the highest degree.

Proof. As described in Section 6.1.1, we assume that the time required by an agent to manage a request increases with the number of children it has. Thus, agent throughput decreases with increasing agent degree and the agent with the highest degree provides the lowest throughput. Since we assume that scheduling requests are forwarded to all agents and servers, a scheduling request is not finished until all agents have responded. Thus ρsched is limited by the agent providing the lowest throughput, which is the agent with the highest degree. □

Lemma 3. The service request throughput ρservice increases as the number of servers included in a deployment increases.

Proof. The service request throughput is a measure of the rate at which servers in a deployment can generate responses to client service requests. Since agents do not participate in this process, ρservice is independent of the agent hierarchy. The computational power of the servers is used for both (1) generating responses to scheduling queries from the agents and (2) providing computational services for clients. For a given value of ρsched the work performed by a server for activity (1) is independent of the number of servers. The work performed by each server for activity (2) is thus also independent of the number of servers. Thus the work performed by the servers as a group for activity (2) increases as the number of servers in the deployment increases. □

For the rest of this section, we need some precise definitions. A complete d-ary tree is a tree for which all internal nodes, except perhaps one, have exactly d children. If n is the depth of the tree, the potential internal node with strictly less than d children is at depth n − 1 and may have any degree from 1 to d − 1. A spanning tree is a connected, acyclic subgraph containing all the vertices of a graph. We introduce the following definition to aid later discussions.

Definition 1. A Complete Spanning d-ary tree (CSD tree) is a tree that is both a complete d-ary tree and a spanning tree.

For deployment, leaves are servers and all other nodes are agents. A degree d of one is useful only for a deployment of a single root agent and a single server. Note that for a set of resources V and degree d, a large number of CSD trees can be constructed. However, since we consider only homogeneous resources, all such CSD trees are equivalent in that they provide exactly the same performance.
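As an illustration of these definitions, the following sketch (our encoding, not from the thesis) tests whether a tree given as a child map is a complete d-ary tree:

```python
def is_complete_dary(children, root, d):
    """Return True if the tree is a complete d-ary tree: every internal node
    has exactly d children, except perhaps one, which must sit one level
    above the deepest leaves (illustrative sketch, not thesis code).

    children: dict node -> list of child nodes (leaves may be omitted).
    """
    depth, order = {root: 0}, [root]
    for node in order:                      # breadth-first traversal
        for c in children.get(node, []):
            depth[c] = depth[node] + 1
            order.append(c)
    height = max(depth.values())
    internal = [n for n in order if children.get(n)]
    short = [n for n in internal if len(children[n]) != d]
    return len(short) <= 1 and all(depth[n] == height - 1 for n in short)
```

The CSD trees of Definition 1 additionally span the whole resource set V.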

Definition 2. A dMax set is the set of all trees that can be built using |V| resources and for which the maximum degree is equal to dMax.

Figure 6.2 shows some examples of trees from the dMax 4 and dMax 6 sets.


Figure 6.2: Deployment trees of dMax sets 4 and 6.

Theorem 1. In a dMax set, all dMax CSD trees have optimal throughput.

Algorithm 6.1 Algorithm to construct an optimal tree.
 1: calculate best_d using Theorem 2
 2: add the root node
 3: if best_d == 1 then
 4:     add one server to the root node
 5:     Exit
 6: available_nodes = |V| − 1
 7: level = 0
 8: while available_nodes > 0 do
 9:     if ∃ agent at depth level with degree 0 then
10:         toAdd = min(best_d, available_nodes)
11:         add toAdd children to the agent
12:         available_nodes = available_nodes − toAdd
13:     else
14:         level = level + 1
15: if ∃ agent with degree 1 then
16:     remove the child
17: convert all leaf nodes to servers

Proof. We know by Lemma 1 that the throughput ρ of any tree is limited by either its scheduling request throughput ρsched or its service request throughput ρservice. As Lemma 2 states that the scheduling request throughput ρsched is limited only by the throughput of the agent with the highest degree, all trees in a dMax set have the same scheduling request throughput ρsched. Thus, to show that any dMax CSD tree is an optimal solution among the trees in the dMax set, we must prove that the service request throughput ρservice of any CSD tree is at least as large as the ρservice of any other tree in the dMax set.

By Lemma 3, we know that the service request throughput ρservice increases with the number of servers included in a deployment. Given the limited resource set size |V|, the number of servers (leaf nodes) is largest for deployments with the smallest number of agents (internal nodes). Therefore, to prove the desired result we just have to show that dMax CSD trees have the minimal number of internal nodes among the trees in a dMax set.

Let us consider any optimal tree T in a dMax set. We have three cases to consider.

1. T is a CSD tree. As all CSD trees (in the dMax set) have the same number of internal nodes, all CSD trees are optimal. QED.

2. T is not a CSD tree but all its internal nodes have a degree equal to dMax, except possibly one. Then we build a CSD tree having at most as many internal nodes as T. Using case 1, this will prove the result.

Let h be the height of T. As T is not a CSD tree, given our hypothesis on the degree of its internal nodes, T must have a node at a height h′ < h − 1 which has strictly less than dMax children. Let h′ be the smallest height at which there is such a node, let N be such a node, and let d be the degree of N. Then remove dMax − d children from any internal node N′ at height h − 1 and add them as children of N. Note that, whatever the value of d, among N and N′ there is the same number of internal nodes whose degree is strictly less than dMax before and after this transformation. Through this transformation, we obtain a new tree T′ whose internal nodes all have a degree equal to dMax, except possibly one. Also, T and T′ have the same number of leaf nodes. Therefore, T′ is also optimal. Furthermore, if T′ is not a CSD tree, it has one less node at level h′ whose degree is strictly less than dMax; if no node at level h′ has degree strictly less than dMax after the transformation, then the value of h′ increases. Repeating this transformation recursively, we eventually end up with a CSD tree.

3. T has at least two internal nodes whose degree is strictly less than dMax. Then let us take two such nodes N and N′ of degrees d and d′. If d + d′ ≤ dMax, then we remove all the children of N′ and make them children of N. Otherwise, we remove dMax − d children of N′ and make them children of N. In either case, the number of leaf nodes in the tree is non-decreasing under the transformation, and the number of internal nodes whose degree is strictly less than dMax is strictly decreasing. So, as for the original tree, the new tree is optimal. Repeating this transformation recursively, we eventually end up with a tree dealt with by case 2. Hence, we can conclude. □

We select the dMax CSD tree because, among all optimal trees in the dMax set, it alone has minimum height.

Theorem 2. A complete spanning d-ary tree with degree d ∈ [1, |V| − 1] that maximizes the minimum of the scheduling request and service request throughputs is an optimal deployment.

Proof. This theorem is fundamentally a corollary of Theorem 1. The optimal degree is not known a priori; it suffices to test all possible degrees d ∈ [1, |V| − 1] and to select the degree that provides the maximum completed request throughput. □
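Theorem 2 suggests a direct search over all degrees. In the sketch below (ours, not from the thesis), we use the fact that, one can check, a CSD tree of internal degree d on |V| homogeneous nodes has |V| − ⌈(|V| − 1)/d⌉ leaves, i.e., servers; the throughput model rho is left as a caller-supplied function.

```python
from math import ceil

def csd_servers(num_nodes, d):
    """Servers (leaves) of the CSD tree built on num_nodes homogeneous
    resources with internal degree d; agents are the remaining used nodes.
    (Illustrative sketch; the closed form covers the removed-chain case.)"""
    return num_nodes - ceil((num_nodes - 1) / d)

def optimal_degree(num_nodes, rho):
    """Theorem 2: test every degree d in [1, |V| - 1] and keep the one whose
    CSD tree maximizes the completed-request throughput.
    rho(d, num_servers) is a user-supplied throughput model."""
    return max(range(1, num_nodes),
               key=lambda d: rho(d, csd_servers(num_nodes, d)))
```

With a toy model such as min(20/d, s), the search picks a middle degree that balances agent load against server count.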

6.1.3 Deployment construction

Once an optimal degree best_d has been calculated using Theorem 2, we can use Algorithm 6.1 to construct the optimal CSD tree. In the algorithm, best_d denotes the selected best degree for the CSD tree, and available_nodes denotes the number of nodes still available for hierarchy construction.

A few examples will help clarify the results of our deployment planning approach. Let us consider that we have 10 available nodes (|V| = 10). Suppose best_d = 1. Algorithm 6.1 will construct the corresponding best platform: a root agent with a single server attached. Now suppose best_d = 4. Then Algorithm 6.1 will construct the corresponding best deployment: the root agent with four children, one of which also has four children; the deployment has two agents, seven servers, and one unused node, because the latter could only be attached as a chain.
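The steps of Algorithm 6.1 can be sketched in executable form as follows; this is our encoding (node identifiers and return types are illustrative), not the implementation used in the thesis.

```python
def build_csd_tree(num_nodes, best_d):
    """Algorithm 6.1: build an optimal CSD tree on num_nodes resources.
    Returns (children, servers): children maps node id -> list of child ids,
    servers is the set of nodes converted to servers. Node 0 is the root."""
    children, depth = {0: []}, {0: 0}
    if best_d == 1:                       # root agent with a single server
        children = {0: [1], 1: []}
        return children, {1}
    next_id, available, level = 1, num_nodes - 1, 0
    while available > 0:
        # an agent at the current depth with degree 0, if any
        empty = [n for n in children if depth[n] == level and not children[n]]
        if empty:
            to_add = min(best_d, available)
            for _ in range(to_add):
                children[empty[0]].append(next_id)
                children[next_id], depth[next_id] = [], level + 1
                next_id += 1
            available -= to_add
        else:
            level += 1
    for n, ch in list(children.items()):
        if n != 0 and len(ch) == 1:       # a chain brings no benefit (Sec. 6.1.1)
            lone = ch.pop()               # drop the single child...
            del children[lone]            # ...its node is left unused
    servers = {n for n, ch in children.items() if not ch}
    return children, servers
```

For |V| = 10 and best_d = 4 this reproduces the deployment described above: two agents, seven servers, and one unused node.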

6.1.4 Request performance modeling

In order to apply the model defined in Section 6.1 to DIET, we must have models for the scheduling throughput and the service throughput in DIET. In this section we define performance models to estimate the time required for various phases of request treatment in DIET. These models will be used in the following section to create the needed throughput models.

We make the following assumptions about DIET for performance modeling. The Master Agent (MA) and Local Agents (LA) are considered as having the same performance because their activities are almost identical and in practice we observe only negligible differences in their performance.

We assume that the work required for an agent to treat responses from SeD-type children and from agent-type children is the same. DIET allows configuration of the number of responses forwarded by agents; here we assume that only the best server is forwarded to the parent.

When client requests are sent to the agent hierarchy, DIET is optimized such that large data items like matrices are not included in the problem parameter descriptions (only their sizes are included). These large data items are included only in the final request for computation from client to server. As stated earlier, we assume that we do not have a priori knowledge of client locations and request submission patterns. Thus, we assume that needed data is already in place on the servers and we do not consider data transfer times.

The following variables will be of use in our model definitions. Sreq is the size in Mb of the message forwarded down the agent hierarchy for a scheduling request. This message includes only parameters and not large input data sets. Srep is the size in Mb of the reply to a scheduling request forwarded back up the agent hierarchy. Since we assume that only the best server response is forwarded by agents, the size of the reply does not increase as the response moves up the tree.

Wreq is the amount of computation in MFlop needed by an agent to process one incoming request. Wrep(d) is the amount of computation in MFlop needed by an agent to merge the replies from its d children. Wpre is the amount of computation in MFlop needed for a server to predict its own performance for a request. Wapp is the amount of computation in MFlop needed by a server to complete a service request for the app service. The provision of this computation is the main goal of the DIET system.

Agent communication model: To treat a request, an agent receives the request from its parent, sends the request to each of its children, receives a reply from each of its children, and sends one reply to its parent. By Lemma 2, we are concerned only with the performance of the agent with the highest degree, d. The time in seconds required by an agent for receiving all messages associated with a request from its parent and children is as follows.

agent_receive_time = (Sreq + d · Srep) / B    (6.1)

Similarly, the time in seconds required by an agent for sending all messages associated with a request to its children and parent is as follows.

agent_send_time = (d · Sreq + Srep) / B    (6.2)

Server communication model: Servers have only one parent and no children, so the time in seconds required by a server for receiving messages associated with a scheduling request is as follows.

server_receive_time = Sreq / B    (6.3)

The time in seconds required by a server for sending messages associated with a request to its parent is as follows.

server_send_time = Srep / B    (6.4)

Agent computation model: Agents perform two activities involving computation: the processing of incoming requests and the selection of the best server amongst the replies returned from the agent's children.

There are two activities in the treatment of replies: (i) a fixed cost Wfix, the amount of computation in MFlop needed to process a reply independently of the number of servers supported; and (ii) a cost Wsel, the amount of computation in MFlop needed per child to process the server replies and select the best server. Thus the computation associated with the treatment of replies is given by

Wrep(d) = Wfix + Wsel · d

The time in seconds required by the agent for these two activities is given by the following equation.

agent_comp_time = (Wreq + Wrep(d)) / w    (6.5)

Server computation model: Servers also perform two activities involving computation: performance prediction as part of the scheduling phase and provision of application services as part of the service phase.

Let us consider a deployment with a set of servers S and the activities involved in completing |S| requests at the server level. All servers complete |S| prediction requests and each server will complete one service request phase, on average. As a whole, the servers as a group require (Wpre · |S| + Wapp)/w seconds to complete the |S| requests. We divide by the number of requests |S| to obtain the average time required per request by the servers as a group.

server_comp_time = (Wpre + Wapp / |S|) / w    (6.6)

Now we have all the constraints for agent and server communication and computation that we use to calculate the throughput of each phase of the middleware deployment.
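Equations (6.1)-(6.6) translate directly into code. The helper functions below are our sketch (the names are ours, not from the thesis); sizes are in Mb, work in MFlop, B in Mb/s, and w in MFlop/s.

```python
def agent_request_times(d, S_req, S_rep, W_req, W_fix, W_sel, B, w):
    """Agent costs per request for an agent of degree d (sketch)."""
    receive = (S_req + d * S_rep) / B            # Eq. (6.1)
    send    = (d * S_req + S_rep) / B            # Eq. (6.2)
    compute = (W_req + W_fix + W_sel * d) / w    # Eq. (6.5), Wrep(d) = Wfix + Wsel*d
    return receive, send, compute

def server_request_times(num_servers, S_req, S_rep, W_pre, W_app, B, w):
    """Server costs per request, averaged over |S| servers (sketch)."""
    receive = S_req / B                          # Eq. (6.3)
    send    = S_rep / B                          # Eq. (6.4)
    compute = (W_pre + W_app / num_servers) / w  # Eq. (6.6)
    return receive, send, compute
```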

6.2 Steady-state throughput modeling

In this section we present models for scheduling and service throughput in DIET. According to Lemma 1 we know that ρ = min(ρsched, ρservice). From the request performance models defined above we can calculate the throughput of each phase and then the throughput of the platform:

ρsched = 1 / (time spent on agent and server functionality in the scheduling phase)

and

ρservice = 1 / (time spent on server functionality in the service phase)

Depending upon the operating model, the agent and server operations can be performed in parallel or sequentially. We consider two different theoretical models for the capability of a computing resource to do computation and communication in parallel.

Send or receive or compute, single port: In this model, a computing resource has no capability for parallelism: it can either send a message, receive a message, or compute. Only a single port is assumed: messages must be sent serially and received serially. This model may be reasonable for systems with small messages, as these messages are often quite CPU intensive.

As shown in the following equation, the scheduling throughput is given by the minimum of the throughput provided by the servers for prediction and by the agents for scheduling; combined with the service throughput, the completed request throughput in requests per second is as follows.

ρ = min( 1 / (Wpre/w + Sreq/B + Srep/B),
         1 / ((Sreq + d·Srep)/B + (d·Sreq + Srep)/B + (Wreq + Wrep(d))/w),
         1 / (Sreq/B + Srep/B + (Wpre + Wapp/|S|)/w) )

Send || receive || compute, single port: In this model, it is assumed that a computing resource can send messages, receive messages, and do computation in parallel. We still assume only a single port: messages must be sent serially and they must be received serially. Thus, for this model the throughput can be calculated as follows.

ρ = min( 1 / max(Wpre/w, Sreq/B, Srep/B),
         1 / max((Sreq + d·Srep)/B, (d·Sreq + Srep)/B, (Wreq + Wrep(d))/w),
         1 / max(Sreq/B, Srep/B, (Wpre + Wapp/|S|)/w) )

We will use these models in our experiments to validate that, using deployment models, we can find an optimal platform for a hierarchical middleware.
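Both single-port models can be sketched as a single function: the serial model sums the per-request terms of each bottleneck, while the parallel model keeps only their maximum. This is our encoding, not code from the thesis; Wrep(d) is passed in pre-evaluated.

```python
def completed_request_throughput(d, num_servers, S_req, S_rep,
                                 W_req, W_rep, W_pre, W_app, B, w,
                                 parallel=False):
    """rho = min over the three bottlenecks (server prediction, agent,
    server service) of the inverse per-request time (sketch)."""
    prediction = [W_pre / w, S_req / B, S_rep / B]
    agent      = [(S_req + d * S_rep) / B,
                  (d * S_req + S_rep) / B,
                  (W_req + W_rep) / w]
    service    = [S_req / B, S_rep / B, (W_pre + W_app / num_servers) / w]
    combine = max if parallel else sum   # serial: sum the terms; parallel: overlap them
    return min(1.0 / combine(phase) for phase in (prediction, agent, service))
```

Since the maximum of non-negative terms never exceeds their sum, the parallel estimate is always at least the serial one.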

6.3 Experimental results

In this section we present experiments designed to test the ability of our deployment model to correctly identify optimal real-world deployments. Since our performance model and deployment approach focus on maximizing steady-state throughput, our experiments focus on testing the maximum sustained throughput provided by different deployments. The following section describes the experimental design, Section 6.3.2 describes how we obtained the parameters needed for the model, Section 6.3.3 presents experiments testing the accuracy of our throughput performance models, and Section 6.3.4 presents experiments testing whether the deployment selected by our approach provides good throughput as compared to other reasonable deployments. Finally, Section 6.4 provides some forecasts of good deployments for a range of problem sizes and resource sets.

6.3.1 Experimental design

The experimental design is the same as in Chapter 5, except for the resources that we now use for the experiments. We used the same hierarchical middleware, DIET, and its deployment tool, GoDIET, to deploy DIET. The submitted job type is also the same, i.e., DGEMM problems.

The experiments were performed on two similar clusters. The first is a 55-node cluster at the École Normale Supérieure in Lyon, France. Each node includes dual AMD Opteron 246 processors at 2 GHz, a cache size of 1024 KB, and 2 GB of memory. We used GCC 3.3.5 for all compilations and the Linux kernel version was 2.6.8. All nodes are connected by Gigabit Ethernet. We measured network bandwidth using the Network Weather Service [82]. Using the default NWS message size of 256 KB we obtain a bandwidth of 909.5 Mb/s; using the message size sent in DIET of 850 bytes we obtain a bandwidth of 20.0 Mb/s. We therefore performed a small experiment (presented in Section 6.3.3) to select the bandwidth for our final experiments.

The second cluster is a 140-node cluster at Sophia Antipolis in France. The nodes are physically identical to the ones at Lyon but are running Linux kernel version 2.4.21, and all compilations were done with GCC 3.2.3. The machines at Sophia are linked by 6 different Cisco Gigabit Ethernet switches connected with a 32 Gbps bus.

We selected the serial model among the theoretical models presented in Section 6.2 on the basis of the experiment presented in Section 6.3.3.

6.3.2 Model parametrization

Table 6.1 presents the parameter values we use for DIET in the models for ρsched and ρservice. Our goal is to parametrize the model using only easy-to-collect micro-benchmarks. In particular, we seek to use only values that can be measured using a few client executions. The alternative is to base the model on actual measurements of the maximum throughput of various system elements; while we have these measurements for DIET, we feel that the experiments required to obtain such measurements are difficult to design and run, and their use would be an obstacle to the application of our model to other systems.

To measure message sizes Sreq and Srep we deployed a Master Agent (MA) and a single DGEMM server (SeD) on the Lyon cluster and then launched 100 clients serially. We collected all network traffic between the MA and the SeD machines using tcpdump and analyzed the traffic to measure message sizes using the Ethereal Network Protocol analyzer (http://www.ethereal.com). This approach provides a measurement of the entire message size, including headers. Using the same MA-SeD deployment, 100 client repetitions, and the statistics collection functionality in DIET [28], we then collected detailed measurements of the time required to process each message at the MA and SeD level. The parameter Wrep depends on the number of children attached to an agent. We measured the time required to process responses for a variety of star deployments including an MA and different numbers of SeDs. A linear data fit provided a very accurate model for the time required to process responses versus the degree of the agent, with a correlation coefficient of 0.997. We thus use this linear model for the parameter Wrep. Finally, we measured the capacity of our test machines in MFlops using a mini-benchmark extracted from Linpack and used this value to convert all measured times to estimates of the MFlops required.

Components   Wreq (MFlop)   Wrep (MFlop)               Wpre (MFlop)   Srep (Mb)    Sreq (Mb)
Agent        3.2×10^-2      1.3×10^-3 + 9.6×10^-4·d    -              6.4×10^-3    5.3×10^-3
SeD          -              -                          6.4×10^-3      6.4×10^-5    5.3×10^-5

Table 6.1: Parameter values for middleware deployment on cluster.
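To illustrate how these parameters feed the serial model, the sketch below (ours, not from the thesis) evaluates the scheduling throughput of an agent as a function of its degree, using the 190 Mb/s bandwidth retained in Section 6.3.3; the machine power w is a hypothetical placeholder of ours, not a measured value.

```python
# Table 6.1 parameters (agent row), serial single-port model
S_req, S_rep = 5.3e-3, 6.4e-3   # Mb
W_req = 3.2e-2                  # MFlop
W_fix, W_sel = 1.3e-3, 9.6e-4   # MFlop; Wrep(d) = W_fix + W_sel * d
B = 190.0                       # Mb/s (the fit selected in Section 6.3.3)
w = 1000.0                      # MFlop/s -- hypothetical machine power, not a measurement

def agent_sched_throughput(d):
    """1 / (receive + send + compute) for an agent of degree d (serial model)."""
    t = ((S_req + d * S_rep) / B + (d * S_req + S_rep) / B
         + (W_req + W_fix + W_sel * d) / w)
    return 1.0 / t

rates = {d: agent_sched_throughput(d) for d in (1, 2, 5, 10, 20)}
```

Consistent with Lemma 2, the agent throughput strictly decreases as its degree grows.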

6.3.3 Throughput model validation

This section presents experiments testing the accuracy of the DIET agent and server throughput models presented in Section 6.2.

First, we examine the ability of the models to predict agent throughput and, in particular, to predict the effect of an agent's degree on its performance. To test agent performance, the test scenario must be clearly agent-limited. Thus we selected a very small problem size of DGEMM 10. To test a given agent degree d, we deployed an MA and attached d SeDs to that MA; we then ran a throughput test as described in Section 6.3.1. The results are presented in Figure 6.3. We verify that these deployments are all agent-limited by noting that the throughput is lower for a degree of two than for a degree of one, despite the fact that the degree-two deployment has twice as many SeDs.

Figures 6.3 (a) and (b) present model predictions for the serial and parallel models, respectively. In each case three predictions are shown, using different values for the network bandwidth. Comparison of predicted and measured values leads us to believe that these measurements of the network bandwidth are not representative of what DIET actually obtains. This is not surprising given that DIET uses very small messages, and network performance for this message size is highly sensitive to the communication layers used. The third bandwidth in each graph is chosen to provide a good fit of the measured and predicted values. For the purposes of the rest of this chapter we will use the serial model with a bandwidth of 190 Mb/s because it provides a better fit than the parallel model. In the future we plan to investigate other measurement techniques for bandwidth that may better represent the bandwidth achieved when sending many very small messages, as is done by DIET.

Next, we test the accuracy of throughput prediction for the servers. To test server performance, the test scenario must be clearly SeD-limited. Thus we selected a relatively large


[(a) Serial model: measured throughput and model predictions with bandwidths of 20, 190, and 909 Mb/s, versus agent degree. (b) Parallel model: measured throughput and model predictions with bandwidths of 20, 80, and 909 Mb/s, versus agent degree.]

Figure 6.3: Measured and predicted platform throughput for DGEMM size 10; predictions are shown for several bandwidths.


Figure 6.4: Measured and predicted platform throughput for DGEMM size 1000; predictions are shown for the serial model with bandwidth 190 Mb/s.

problem size of DGEMM 1000. To test whether performance scales as the number of servers increases, we deployed an MA and attached different numbers of SeDs to the MA. The results are presented in Figure 6.4. Only the serial model with a bandwidth of 190 Mb/s is shown; in fact, the results with the parallel model and with different bandwidths are all within 1% of this model, since the communication is overwhelmed by the solve phase itself.

6.3.4 Deployment selection validation

In this section we present experiments that test the effectiveness of our deployment approach in selecting a good deployment. For each experiment, we select a cluster, define the total number of resources available, and define a DGEMM problem size. We then apply our deployment algorithms to predict which CSD tree will provide the best throughput, and we measure the throughput of this CSD tree in a real-world deployment. We then identify and test a suitable range of other CSD trees, including the star, the most popular middleware deployment arrangement.

Figure 6.5 shows the predicted and actual throughput for a DGEMM size of 200 where 25 nodes in the Lyon cluster are available for the deployment. Our model predicts that the best throughput is provided by CSD trees with degrees of 12, 13, and 14. These trees have the same predicted throughput because they have the same number of SeDs and the throughput is limited by the SeDs. Experiments show that the CSD tree with degree 12 does indeed provide the best throughput. The model prediction overestimates the throughput; we believe that there is some cost associated with having multiple levels in a hierarchy that is not accounted for in our model. However, it is more important that the model correctly predicts the shape of the graph and identifies the best degree than that it correctly predicts absolute throughput.



Figure 6.5: Predicted and measured throughput for different CSD trees for DGEMM 200 with 25 available nodes in the Lyon cluster. (Axes: degree of CSD tree versus throughput in requests/sec; curves: model prediction and measured throughput.)

For the next experiment, we use the same DGEMM size of 200 but change the number of available nodes to 45 and the cluster to Sophia. We use the same problem size (200) to demonstrate that the best deployment is dependent on the number of resources available, rather than just the type of problem. The results are shown in Figure 6.6. The model predicts that the best deployment will be a degree eight CSD tree while experiments reveal that the best degree is three. The model does however correctly predict the shape of the curve and selects a deployment that achieves a throughput that is 87.1% of the optimal. By comparison, the popular star deployment (degree 44) obtains only 40.0% of the optimal performance.

For the last experiment, we again use a total of 45 nodes from the Sophia cluster but we increase the problem size to 310; we use the same resource set size to show that the best deployment is also dependent on the type of workload expected. The results are shown in Figure 6.7. In this test case, the model predictions are generally much more accurate than in the previous two cases; this is because ρ_service is the limiting factor over a greater range of degrees due to the larger problem size used here. Our model predicts that the best deployment is a 22 degree CSD tree while in experimentation the best degree is 15. However, the deployment chosen by our model achieves a throughput that is 98.5% of that achieved by the optimal 15 degree tree. By comparison, the star and tri-ary tree deployments achieve only 73.8% and 74.0% of the optimal throughput.

Table 6.2 summarizes the results of these three experiments by reporting the percentage of optimal achieved for the tree selected by our model, the star, and the tri-ary tree. The table also includes data for problem size 10, for which an MA with one SeD is correctly predicted to be optimal, and problem size 1000, for which a star deployment is correctly predicted to be optimal. These last two cases represent the usage of the model in clearly SeD-limited or clearly agent-limited conditions.



Figure 6.6: Predicted and measured throughput for different CSD trees for DGEMM 200 with 45 available nodes in the Sophia cluster. (Axes: degree of CSD tree versus throughput in requests/sec; curves: model prediction and measured throughput.)

Figure 6.7: Predicted and measured throughput for different CSD trees for DGEMM 310 with 45 available nodes in the Sophia cluster. (Axes: degree of CSD tree versus throughput in requests/sec; curves: model prediction and measured throughput.)



DGEMM Size   Nodes |V|   Optimal Degree   Selected Degree   Model     Star      Tri-ary
10           21          1                1                 100.0%    22.4%     50.5%
100          25          2                2                 100.0%    84.4%     84.6%
200          45          3                8                 87.1%     40.0%     100.0%
310          45          15               22                98.5%     73.8%     74.0%
1000         21          20               20                100.0%    100.0%    65.3%

Table 6.2: A summary of the percentage of optimal achieved by the deployment selected by our model, a star deployment, and a tri-ary tree deployment.

|V| \ DGEMM Size      10             100            500              1000
                    d  |A| |S|     d  |A| |S|     d   |A| |S|      d    |A| |S|
 25                 1  1   1       2  11  12      24  1   24       24   1   24
 50                 1  1   1       2  11  12      49  1   49       49   1   49
100                 1  1   1       2  11  12      50  2   98       99   1   99
200                 1  1   1       2  11  12      40  5   195      199  1   199
500                 1  1   1       2  11  12      15  34  466      125  4   496

Table 6.3: Predictions for the best degree d, number of agents used |A|, and number of servers used |S| for different DGEMM problem sizes and platform sizes |V|. The platforms are assumed to be larger clusters with the same machine and network characteristics as the Lyon cluster.

6.3.5 Validation of model for mixed workloads

In this section we test whether it is better to deploy one DIET hierarchy on all available nodes and submit all (heterogeneous) requests to that hierarchy, or whether it is better to divide the available nodes into partitions with one partition per problem size. For example, if we have 75 nodes and three different problems, should we deploy one CSD tree using all 75 nodes to solve the three types of problems, or should we deploy three CSD trees of 25 nodes each and submit only uniform problems to each CSD tree?

For this experiment we use a total of 75 nodes at the Orsay site and DGEMM problem sizes 10, 100, and 1000. First, we divided the available nodes into three sets of 25 nodes each. Then we used our optimal deployment planning algorithm to predict the best value of d to construct a CSD tree with 25 nodes for each problem size. Next, we used the planning algorithm to predict the best value of d for a CSD tree using 75 nodes and each problem size.

For a deployment of size 25 nodes, for sizes 10 and 100 our algorithm predicts a best degree of 2, while for size 1000 the best degree is predicted to be 24. For a deployment of size 75 nodes, degree 2 is again predicted to be optimal for sizes 10 and 100 while degree 74 is predicted to be optimal for size 1000. Next, we tested the makespan of each set of tasks on the three separate deployments, sending only the appropriate problem size to each deployment. Then we tested the makespan of the combined set of tasks including all three problem sizes on the 75-node deployment. The results are shown in Fig. 6.8.

Fig. 6.8 shows that if we deploy three small CSD trees in parallel it takes 560.36 seconds to execute 3000 requests (1000 requests by each CSD tree in parallel) but if we deploy one big CSD tree



Figure 6.8: Makespan for a group of tasks partitioned to three deployments or sent to a single joint deployment. (Bars, in seconds: three 25-node deployments with degree 2 for problem size 10, degree 2 for problem size 100, and degree 24 for problem size 1000; and two 75-node deployments with degree 2 and degree 74 handling problem sizes 10, 100, and 1000 together.)

but with degree 74, we can save time, as it took only 466 seconds to execute the 3000 requests. Thus, it is better to deploy one platform for mixed jobs than to divide the available nodes into sets according to the number of types of jobs to be submitted.
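As a quick arithmetic check of the reported numbers (560.36 s for the three partitioned trees, 466 s for the joint degree-74 tree, 3000 requests in both cases):

```python
requests = 3000
partitioned = requests / 560.36   # throughput of three 25-node trees, req/s
joint = requests / 466.0          # throughput of one 75-node, degree-74 tree
saving = 1 - 466.0 / 560.36       # fraction of makespan saved by the joint tree

print(round(partitioned, 2), round(joint, 2), round(100 * saving, 1))
# -> 5.35 6.44 16.8
```

So the joint deployment improves throughput from about 5.35 to about 6.44 requests per second, a makespan reduction of roughly 17%.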

6.4 Model forecasts

In the previous section we presented experiments demonstrating that our model is able to automatically identify a deployment that is optimal. In this section we use our model to forecast deployments for a variety of scenarios. These forecasts can then be used to guide future deployments at a larger scale than we were able to test in these experiments. Table 6.3 summarizes model results for a variety of problem sizes and a variety of platform sizes for a larger cluster with the characteristics of the Lyon cluster.

6.5 Conclusion

In this chapter we have presented an approach for determining a hierarchical middleware deployment on a cluster of homogeneous resources. With the optimal middleware deployment planning approach, we have shown that an optimal deployment for hierarchical middleware systems on clusters is provided by a CSD tree. We presented request performance models followed by throughput models for agents and servers in DIET. We presented experiments validating the


DIET throughput performance models and demonstrating that our approach can effectively and automatically build a tree for deployment which is nearly optimal and which performs significantly better than other reasonable deployments.

Nowadays application sizes are growing, so executing applications requires high computing power and large data storage. It is not always possible to solve an application on one cluster of homogeneous resources; users therefore access distributed resources or clusters to execute their applications in reasonable time. The middleware deployment planning presented in this chapter is for homogeneous resources and cannot be applied to heterogeneous resources. However, finding the best deployment among heterogeneous resources is similar to finding the best broadcast tree on a general graph, which is known to be NP-complete [18], because the deployed platform acts as a broadcast tree to send and receive data from its nodes. Chapter 7 therefore provides a heuristic for middleware deployment on heterogeneous resources.


Chapter 7

Automatic Middleware Deployment Planning for Grid

In the previous chapter, we presented an optimal middleware deployment planning for homogeneous resources. Homogeneous deployment planning cannot be applied to heterogeneous resources because in the homogeneous case we did not apply any logic to select the parent nodes, i.e., the nodes that act as schedulers. In the heterogeneous case we have to calculate the number of children supported by each node, as in the homogeneous case, and also select the nodes that will act as parent nodes in the deployment so as to maximize the throughput of the platform. However, finding the best deployment among heterogeneous resources is a hard problem since it amounts to finding the best broadcast tree on a general graph, which is known to be NP-complete [18]. As our goal is to find a deployment on a grid, in this chapter we present a heuristic for middleware deployment on heterogeneous resources.

Our objective is to generate the best platform from the available nodes so as to fulfill the client demand, provided the client demand is at most equal to the maximum throughput (under certain conditions, such as homogeneous communication between nodes) that can be achieved using the available nodes in a time step. Throughput is the maximum number of requests that can be completed (whose real execution has finished) in a time step.

7.1 Platform Deployment

We target the same platform architecture, shown in Figure 6.1 of Chapter 6. The software system architecture, client request execution, request definition and deployment assumptions are the same as defined in Section 6.1 of Chapter 6. Even the resource architecture is similar to the one defined in Chapter 6, except that nodes are now heterogeneous, so the computing power of each resource is different. Variables are listed in Table 7.1.

7.2 Heuristic for middleware deployment on heterogeneous resources

Our objective is to find a deployment that provides the maximum throughput ρ of completed requests per second. A completed request is one that has completed both the scheduling and service request phases and for which a response has been returned to the client. When the maximum throughput can be achieved by multiple distinct deployments, the preferred deployment is the one using the least resources.

Variable   Representation
V          set of computing resources
i          computing resource
w_i        computing power of resource i
E          set of edges
e          resource link between two resources
S          set of servers
s          server
A          set of agents
a          agent

Table 7.1: A summary of notations used to define platform deployment.

We define the scheduling request throughput of node i in requests per second, ρ_sched_i, as the rate at which requests are processed by the scheduling phase (see Section 6.1.1). Likewise, we define the service throughput of node i in requests per second, ρ_service_i, as the rate at which the servers produce the services required by the clients.

Lemma 4. The completed request throughput ρ of a deployment is given by the minimum of the scheduling request throughput ρ_sched_i among all resources i and the service request throughput ρ_service of the deployment.

    ρ = min( min_{∀ v_i} ρ_sched_i , ρ_service )
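Lemma 4 amounts to a one-line computation. A minimal sketch (the function and argument names are ours, not from the thesis):

```python
def completed_request_throughput(sched_throughputs, service_throughput):
    """rho = min over all resources of rho_sched_i, capped by rho_service."""
    return min(min(sched_throughputs), service_throughput)

# The slowest stage bounds the platform: here a node scheduling at 95 req/s
# limits throughput even though the servers could service 300 req/s.
print(completed_request_throughput([120.0, 95.0, 140.0], 300.0))  # -> 95.0
```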

From Lemma 4 it is clear that the throughput of the deployment platform depends on the scheduling throughput of each node and on the servicing throughput of the deployment. But as the scheduling and servicing throughputs depend on the computing power of the nodes, the placement of the nodes in the hierarchy depends indirectly on their computing power. As explained in Section 7.1, the resources all have different computing power, so the selection of resources for mapping the appropriate middleware element is crucial for heterogeneous resources; even for homogeneous resources it was not easy, as shown by the experimental results of Chapter 6.

For the sake of simplicity we have defined some procedures used in the middleware deployment heuristic (Heuristic 7.1).

• add_servers_list is a function to add the ids of the nodes that will be servers in the hierarchy.

• calc_hier_ser_pow is a function to calculate the servicing power provided by the hierarchy when the load is equally divided among the servers of the hierarchy.

• calc_min_val is a function that calculates the minimum among the scheduling throughput, the servicing throughput of the hierarchy under construction, and the client volume.

• calc_sch_pow is a function to calculate the scheduling power of each node according to the computing power of the node and the number of its children.

• count_child is a function that counts the number of children that a node can support without its scheduling power decreasing below the demanded throughput.

• empty_array is a function to remove all the node ids from the agent and server arrays.

• plot_hierarchy is a function to fill the adjacency matrix. The adjacency matrix is filled according to the number of children that each agent (from the agent array) can support.

• shift_nodes is a function to shift up the node ids in the server array when a server is converted into an agent.

• sort_nodes is a function to sort the available nodes according to their scheduling power, calculated by the function calc_sch_pow with the maximum number of children.

• write_xml is a function to generate an XML file according to the adjacency matrix; this file is given as an input to GoDIET to deploy the hierarchical platform.

Variable            Representation
acount, d, i, ncount,
scount, select_d    counters
add_child[i]        number of children supported by node i
agents              array to store the ids of nodes that are agents in the hierarchy
agent_binary_name   name of the agent binary
child_added         keeps track of how many children have been added for a node
client_volume       throughput demanded by the client
diff                stores the minimum among ρ_sched, ρ_service calculated in the
                    previous step, and the client demand
input_values[i]     stores parametric values (like Wrep, Sreq, etc.) for node i
min_ser_cv          stores the minimum among the servicing throughput of the
                    hierarchy and the client demand
nb_sed              total number of servers in the hierarchy
node_used           keeps track of how many nodes are used in the hierarchy construction
n_nodes             total nodes available for hierarchy construction
nodes_hierarchy     nodes used in hierarchy construction
throughput_diff     stores the minimum among ρ_sched, ρ_service, and the client demand
vir_max_sch_pow     scheduling power of the top node in the sorted_nodes array
vir_max_ser_pow     servicing power of the hierarchy
sch_pow[i]          scheduling power of node i
sed_binary_name     name of the server binary
servers             array to store the ids of nodes that are servers in the hierarchy
sorted_nodes        array to store the ids of the sorted nodes
xml_file_name       name under which the generated XML file is saved

Table 7.2: Variables used in Heuristic 7.1.

Variables used in Heuristic 7.1 are presented in Table 7.2. The heuristic is based on the exact calculation of the number of children supported by each node. As the scheduling power of any


agent is limited by the number of children that it can support, to select a node that can act as an agent we calculate the scheduling power (using Equation 7.13) of each node with a number of children equal to the number of available nodes, n_nodes. In fact, it is not only the number of children that affects the scheduling power of the parent node but also the computing power of those children; since at this point we have no idea which nodes will be agents (and should thus be removed from the candidate set), we take the number of children equal to n_nodes.

Then we sort the nodes in descending order of their scheduling power with n_nodes children. The top node in the sorted list is the most suitable to be an agent. Next we calculate the scheduling power of each node with only one child, so as to obtain the maximum scheduling power possible for the node. In Step 11, we calculate the servicing power using Equation 7.14. Step 12 checks whether the scheduling power of the node is less than the client demand; if so, a one-level hierarchy is deployed with one agent and one server (the top two nodes of the sorted list), because adding more servers to the node would decrease its scheduling power. Otherwise, if the scheduling power is higher than the client demand, the servicing power is increased by adding new nodes; for this, new agents may be added by following Steps 15 to 53.

In Step 22, we calculate the scheduling throughput of the top node in the sorted list after incrementing the number of its children (Step 18). Based on this value, the number of children that each node can support is calculated. If a node can support more than one child then that node is added to the hierarchy, along with the children it can support. The servicing power is then recalculated considering the total number of servers in the hierarchy. Steps 15 to 53 are repeated until all the nodes are used, the client demand is fulfilled, or the throughput of the hierarchy starts decreasing.
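The degree-growing loop can be sketched in a much-simplified form (single agent only; `calc_sch_pow(node, d)` and `calc_ser_pow(servers)` are hypothetical callbacks standing in for Equations 7.13 and 7.14):

```python
def plan_hierarchy(nodes, calc_sch_pow, calc_ser_pow, client_volume):
    """Sort candidate nodes by scheduling power, then grow the degree d of
    the top agent until the gap between scheduling power and the smaller of
    servicing power and client demand stops shrinking."""
    order = sorted(nodes, key=lambda n: calc_sch_pow(n, len(nodes)),
                   reverse=True)
    agent, pool = order[0], order[1:]
    best_d, best_gap = 1, float("inf")
    for d in range(1, len(pool) + 1):
        sch = calc_sch_pow(agent, d)          # agent with d children
        ser = calc_ser_pow(pool[:d])          # the d best remaining nodes serve
        gap = abs(sch - min(ser, client_volume))
        if gap >= best_gap:                   # throughput stopped improving
            break
        best_d, best_gap = d, gap
    return agent, pool[:best_d]
```

The full heuristic additionally promotes servers to agents when a node can support more than one child (Steps 28 to 44), which this sketch omits.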

The connections between the nodes are then represented in the form of an adjacency matrix in Step 54. In Step 55 the hierarchy is written to an XML file, which is used by the deployment tool.

7.3 Request performance modeling

In order to apply the heuristic presented in Section 7.2 to DIET, we must have models for the scheduling throughput and the service throughput in DIET. The variables used in the performance models that estimate the time required for the various phases of request treatment are the same as defined in Chapter 6, except for the variables that depend on the computing power of the node. For simplicity, the variables are listed and defined in Table 7.3.

Agent communication model: To treat a request, an agent receives the request from its parent, sends the request to each of its children, receives a reply from each of its children, and sends one reply to its parent. The time in seconds required by an agent i for receiving all messages associated with a request from its parent and d_i children is as follows:

    agent_receive_time = (Sreq + d_i · Srep) / B        (7.1)

Similarly, the time in seconds required by an agent for sending all messages associated with a request to its d_i children and parent is as follows:

    agent_send_time = (d_i · Sreq + Srep) / B        (7.2)
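Equations (7.1) and (7.2) translate directly into code (the variable names are ours):

```python
def agent_receive_time(s_req, s_rep, d, bandwidth):
    # Eq. (7.1): one request from the parent plus d replies from the children.
    return (s_req + d * s_rep) / bandwidth

def agent_send_time(s_req, s_rep, d, bandwidth):
    # Eq. (7.2): d forwarded requests plus one reply to the parent.
    return (d * s_req + s_rep) / bandwidth

# With the Table 7.4 agent message sizes (Mb), a 190 Mb/s link and degree 10:
print(agent_receive_time(5.3e-3, 5.4e-3, 10, 190.0))
```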

Server communication model: Servers have only one parent and no children, so the time in seconds required by a server for receiving messages associated with a scheduling request is


Algorithm 7.1 Heuristic: To find the best hierarchy
 1: for (i=0; i<n_nodes; i++) do
 2:    add_child[i] = n_nodes;
 3:    sch_pow[i] = calc_sch_pow(add_child[i], input_values[i]);
 4: sort_nodes(sch_pow, sorted_nodes);
 5: vir_max_sch_pow = calc_sch_pow(1, input_values[sorted_nodes[0]]);
 6: acount=0; ncount=0; scount=0;
 7: agents[acount++] = sorted_nodes[ncount++];
 8: servers[scount++] = sorted_nodes[ncount++];
 9: d=1; nb_sed=d; node_used=nb_sed+1; diff=1000000;
10: vir_max_sch_pow = calc_sch_pow(d, input_values[agents[0]]);
11: vir_max_ser_pow = calc_hier_ser_pow(input_values[], nb_sed, servers[]);
12: min_ser_cv = min(vir_max_ser_pow, client_volume);
13: if (vir_max_sch_pow < min_ser_cv) then
14:    select_d=1; nb_sed=d; nodes_hierarchy=2;
15: else
16:    throughput_diff = |vir_max_sch_pow - min_ser_cv|;
17:    while (diff > throughput_diff) do
18:       diff = throughput_diff;
19:       d++; nb_sed=d; node_used=nb_sed+1;
20:       empty_arrays(agents, servers);
21:       acount=0; ncount=0; scount=0; count=1;
22:       agents[acount]=sorted_nodes[ncount]; ncount++; acount++;
23:       vir_max_sch_pow = calc_sch_pow(d, input_values[agents[0]]);
24:       add_servers_list(sorted_nodes[], servers[], d, scount);
25:       vir_max_ser_pow = calc_hier_ser_pow(input_values[], nb_sed, servers[]);
26:       throughput_diff = calc_min_val(vir_max_ser_pow, client_volume, vir_max_sch_pow);
27:       count_child(vir_max_sch_pow, n_nodes, add_child[], input_values[]);
28:       while ((node_used < n_nodes) && (vir_max_sch_pow > vir_max_ser_pow)) do
29:          child_added=0;
30:          if (add_child[sorted_nodes[ncount]] > 1) then
31:             nb_sed--; ncount++; agents[acount]=servers[0]; acount++;
32:             scount=shift_nodes(scount, servers); scount--;
33:             servers[scount]=sorted_nodes[d+count];
34:             count++; scount++; nb_sed++; node_used++; child_added++;
35:             while (child_added < add_child[sorted_nodes[ncount]]) do
36:                vir_max_ser_pow = calc_hier_ser_pow(input_values[], nb_sed, servers[]);
37:                if ((vir_max_ser_pow < client_volume) && (node_used < n_nodes) && (vir_max_ser_pow < vir_max_sch_pow)) then
38:                   servers[scount]=sorted_nodes[d+count]; scount++;
39:                   count++; nb_sed++; child_added++; node_used++;
40:                   vir_max_ser_pow = calc_hier_ser_pow(input_values[], nb_sed, servers[]);
41:                else
42:                   child_added = add_child[sorted_nodes[ncount]];
43:          else
44:             node_used = n_nodes;
45:       throughput_diff = calc_min_val(vir_max_ser_pow, client_volume, vir_max_sch_pow);
46:       nodes_hierarchy = nb_sed + acount;
47:       if (vir_max_sch_pow < vir_max_ser_pow) then
48:          if (diff < throughput_diff) then
49:             selected_d = d-1;
50:          else
51:             selected_d = d; diff = throughput_diff;
52:       if (d == n_nodes-1) then
53:          diff = throughput_diff;
54: plot_hierarchy(hierarchy, n_nodes, add_child, sorted_nodes, nodes_hierarchy, selected_d);
55: write_xml(nodes_hierarchy, hierarchy, sed_binary_name, agent_binary_name, xml_file_name);


Notation     Representation
ρ            throughput of the platform
ρ_sched_i    scheduling request throughput of resource i
ρ_service    service request throughput
Sreq         size of an incoming request
Srep         size of the reply
Wpre_i       amount of computation of resource i to merge the replies from d_i children
Wsel_i       amount of computation of resource i needed to process the server replies
Wfix_i       fixed cost to process the server replies
Wreq_i       amount of computation needed by resource i to process one incoming request
Wapp_i       amount of computation needed by a server i to complete a service request
             for app service
d_i          children supported by resource i
B            bandwidth of the link between resources
N            number of requests completed by S in a time step
C            constant to denote time

Table 7.3: A summary of notations used to define the performance model.

as follows:

    server_receive_time = Sreq / B        (7.3)

The time in seconds required by a server for sending messages associated with a request to its parent is as follows:

    server_send_time = Srep / B        (7.4)

Agent computation model: Agents perform two activities involving computation: the processing of incoming requests and the selection of the best server amongst the replies returned by the agent's children.

There are two costs in the treatment of replies: a fixed cost Wfix_i in MFlops, and a cost Wsel_i, the amount of computation in MFlops needed by agent i to process the server replies, sort them, and select the best server. Thus the computation associated with the treatment of replies is given by

    Wrep_i(d_i) = Wfix_i + Wsel_i · d_i

The time in seconds required by the agent for the two activities is given by the following equation:

    agent_comp_time = (Wreq_i + Wrep_i(d_i)) / w_i        (7.5)

Server computation model: Servers also perform two activities involving computation: performance prediction as part of the scheduling phase and provision of application services as part of the service phase.


We suppose that a deployment with a set of servers S completes N requests in a given time step. Then each server i will complete Ni requests, a fraction of N such that:

    Σ_{i=1..N} Ni = N        (7.6)

On average each server i does the prediction for N requests and completes Ni service requests in a time step. Say the servers as a group require C seconds to complete the N requests; then

    C = (Wpre_i · N + Wapp_i · Ni) / w_i        (7.7)

From Equation (7.7), we can calculate the number of requests completed by each server i as:

    Ni = (C · w_i − Wpre_i · N) / Wapp_i        (7.8)

From Equations (7.6) and (7.8), we get the time taken by the servers to process N requests:

    C = N · (1 + Σ_{i=1..N} Wpre_i / Wapp_i) / (Σ_{i=1..N} w_i / Wapp_i)        (7.9)

so the time taken by the servers to process one request is

    server_comp_time = (1 + Σ_{i=1..N} Wpre_i / Wapp_i) / (Σ_{i=1..N} w_i / Wapp_i)        (7.10)
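Equations (7.9) and (7.10) can be checked with a small sketch; here the sums are taken over the servers in the pool, and all names are ours:

```python
def servers_completion_time(n_requests, w, w_pre, w_app):
    """Eq. (7.9): time C for the server pool to complete n_requests requests.
    w, w_pre, w_app are per-server lists (heterogeneous powers allowed)."""
    num = 1.0 + sum(p / a for p, a in zip(w_pre, w_app))
    den = sum(wi / a for wi, a in zip(w, w_app))
    return n_requests * num / den

def server_comp_time(w, w_pre, w_app):
    # Eq. (7.10): the per-request time is C / N, independent of N.
    return servers_completion_time(1, w, w_pre, w_app)

# Single-server sanity check against Eq. (7.7): (Wpre*1 + Wapp*1) / w.
print(server_comp_time([10.0], [1.0], [4.0]))  # -> 0.5
```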

Now we use these constraints to calculate the throughput of each phase of the middleware deployment.

7.4 Steady-state throughput modeling

In this section we present models for the scheduling and service throughput in DIET. We use M(r, s, w), the no-internal-parallelism model (Section 5.1.1), for the capability of a computing resource to do computation and communication.

In this model, a computing resource has no capability for parallelism: it can either send a message, receive a message, or compute. Only a single port is assumed: messages must be sent serially and received serially.

The servicing throughput of server i is:

    ρ_service_i = Ni / ( C + (Sreq + Srep)/B · N )        (7.11)

So according to Equations (7.6) and (7.11), the servicing throughput of the platform is:

    ρ_service = Σ_{i=1..N} Ni / ( C + (Sreq + Srep)/B · N )
              = 1 / [ (1 + Σ_{i=1..N} Wpre_i / Wapp_i) / (Σ_{i=1..N} w_i / Wapp_i) + (Sreq + Srep)/B ]        (7.12)


The scheduling throughput ρ_sched in requests per second is given by the minimum of the throughput provided by the servers for prediction and by the agents for scheduling, as shown below:

    ρ_sched = min( 1 / [ Wpre_i/w_i + Sreq/B + Srep/B ] ,
                   1 / [ (Wreq_i + Wrep_i(d_i))/w_i + (Sreq + d_i·Srep)/B + (d_i·Sreq + Srep)/B ] )   ∀ v_i        (7.13)

In the servicing phase only the servers take part, so ρ_service is calculated from the servers' computation constraint, as shown below:

    ρ_service = min( 1 / [ Sreq/B + Srep/B + (1 + Σ_{i=1..N} Wpre_i / Wapp_i) / (Σ_{i=1..N} w_i / Wapp_i) ] )   ∀ v_i        (7.14)

According to Lemma 4, the completed request throughput ρ of a deployment is given by the minimum of the scheduling request throughput ρ_sched and the service request throughput ρ_service; using Equations (7.13) and (7.14), we obtain the following expression for the throughput of the platform:

    ρ = min( 1 / [ Wpre_i/w_i + Sreq/B + Srep/B ] ,
             1 / [ (Wreq_i + Wrep_i(d_i))/w_i + (Sreq + d_i·Srep)/B + (d_i·Sreq + Srep)/B ] ,
             1 / [ Sreq/B + Srep/B + (1 + Σ_{i=1..N} Wpre_i / Wapp_i) / (Σ_{i=1..N} w_i / Wapp_i) ] )   ∀ v_i        (7.15)

We use this formula to calculate the throughput of the hierarchy in our experiments.

7.5 Experimental Results

In this section we present experiments designed to test the ability of our deployment model to correctly identify good real-world deployments. Since our performance model and deployment approach focus on maximizing steady-state throughput, our experiments focus on testing the maximum sustained throughput provided by different deployments.

7.5.1 Experimental Design

DIET 2.0 was installed on the available resources. We deployed DIET agents and servers using GoDIET 2.1.0. The job type is the same (DGEMM) as considered in Chapter 5. The workload on the hierarchy is increased gradually via a script that runs a single request at a time in a continual loop. We then introduce load gradually by launching one client script every second. We introduce


new clients until the throughput of the platform stops improving; we then let the platform run with no addition of clients for 10 minutes. Figure 7.1 shows the manner in which requests are submitted to the deployment by the clients launched on the machines.

Figure 7.1: Explanation of the workload introduced by submitting requests as the launch of clients increases on each machine. (Schematic: requests per client over time, for each machine.)

7.5.2 Model Parametrization

Table 7.4 presents the parameter values we use for DIET in the model. Our goal is to parametrize the model using only easy-to-collect micro-benchmarks. In particular, we seek to use only values that can be measured using a few client executions. The alternative is to base the model on actual measurements of the maximum throughput of various system elements; while we have these measurements for DIET, we feel that the experiments required to obtain such measurements are difficult to design and run, and their use would prove an obstruction to the application of our model to other systems.

To measure the message sizes Sreq and Srep we deployed a Master Agent (MA) and a single DGEMM server (SeD) on the Lyon cluster and then launched 100 clients serially from the Lyon cluster. We collected all network traffic between the MA and the SeD machines using tcpdump and analyzed the traffic to measure message sizes using the Ethereal Network Protocol analyzer1. This approach provides a measurement of the entire message size including headers. Using the same MA-SeD deployment, 100 client repetitions, and the statistics collection functionality in DIET, we then collected detailed measurements of the time required to process each message at the MA and SeD level. The parameter Wrep depends on the number of children attached to an agent. We measured the time required to process responses for a variety of star deployments including an MA and different numbers of SeDs. A linear data fit provided a very accurate model for the time required to process responses versus the degree of the agent, with a correlation coefficient of 0.97. We thus use this linear model for the parameter Wrep. Finally, we measured the capacity of our test machines in MFlops using a mini-benchmark extracted from Linpack and used this value to convert all measured times to estimates of the MFlops required.

1http://www.ethereal.com
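The linear fit of Wrep against agent degree mentioned above can be reproduced with an ordinary least-squares regression. The degree/time pairs below are illustrative placeholders generated from the Table 7.4 coefficients, not the actual measurements:

```python
# Least-squares fit of W_rep as a linear function of agent degree d,
# as done for the DIET parametrization. The sample data here is
# illustrative (generated from 4.0e-3 + 5.4e-3*d), not the real runs.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient
    return intercept, slope, r

# hypothetical per-degree response-processing costs (MFlop)
degrees = [1, 2, 4, 8, 12]
times = [9.4e-3, 14.8e-3, 25.6e-3, 47.2e-3, 68.8e-3]
a, b, r = linear_fit(degrees, times)  # intercept, slope, correlation
```

On real measurements the correlation coefficient r would be below 1 (0.97 in the thesis experiments); the linear model is kept as long as r stays close to 1.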

Page 117: Automatic Deployment for Application Service Provider ... · NP-complete. Finally, we gave a mathematical model that can analyze an existing deployment and can improve the performance

CHAPTER 7. AUTOMATIC MIDDLEWARE DEPLOYMENT PLANNING FOR GRID

DIET elements | Wreq (MFlop) | Wrep (MFlop)            | Wpre (MFlop) | Srep (Mb) | Sreq (Mb)
Agent         | 1.7e-1       | 4.0e-3 + 5.4e-3 * d     | -            | 5.4e-3    | 5.3e-3
SeD           | -            | -                       | 6.4e-3       | 6.4e-5    | 5.3e-5

Table 7.4: Parameter values for middleware deployment on Lyon site of Grid’5000

7.5.3 Performance model validation on homogeneous platform

The usefulness of our deployment heuristic depends heavily on the performance model of the middleware. This section presents experiments designed to show the correctness of the performance model presented in the previous section. These experiments are performed on the Lyon site.

[Figure 7.2(a) plots throughput (requests/second) versus the number of clients for the 1-SeD and 2-SeD hierarchies; (b) tabulates maximum throughput: 1 SeD measured 295, predicted 1460; 2 SeDs measured 283, predicted 1052.]

Figure 7.2: Star hierarchies with one or two servers for DGEMM 10x10 requests. (a) Measured throughput for different load levels. (b) Comparison of predicted and measured maximum throughput.

The experimental results shown in Figure 7.2 use a workload of DGEMM 10x10 to compare the performance of two hierarchies: an agent with one server versus an agent with two servers. The model correctly predicts that both deployments are limited by agent performance and that adding the second server will in fact hurt performance. What matters most is that the model correctly judges the effect of adding servers, rather than that it exactly predicts the absolute throughput of the platform.

The experimental results shown in Figure 7.3 use a workload of DGEMM 200x200 to compare the performance of the same one- and two-server hierarchies. For this scenario, the model predicts that both hierarchies are limited by server performance and that performance will therefore roughly double with the addition of the second server. The model correctly predicts that the two-server deployment is the better choice.
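The agent-limited versus server-limited distinction can be sketched as follows. This is a simplified stand-in for the full model, not the exact DIET equations: a star deployment is limited either by the agent, which must process every request and merge one reply per attached SeD, or by the aggregate power of the SeDs. The agent costs come from Table 7.4; the node power and per-request DGEMM costs are illustrative:

```python
# Simplified throughput sketch for a star with d SeDs: the agent bound
# shrinks as d grows (more replies to merge), while the server bound
# grows linearly with d. The binding constraint is the smaller one.
def predicted_throughput(d, w, W_req, W_rep0, W_rep_d, W_dgemm):
    agent_limit = w / (W_req + W_rep0 + W_rep_d * d)  # requests/s at the agent
    server_limit = d * w / W_dgemm                    # requests/s over d SeDs
    return min(agent_limit, server_limit)

W_REQ, W_REP0, W_REP_D = 1.7e-1, 4.0e-3, 5.4e-3  # MFlop, from Table 7.4
w = 530.0                # node power in MFlop/s (illustrative)
small, big = 0.02, 16.0  # MFlop per DGEMM request (illustrative)

# small DGEMM: agent-limited, a 2nd SeD hurts;
# large DGEMM: server-limited, a 2nd SeD roughly doubles throughput.
r_small = [predicted_throughput(d, w, W_REQ, W_REP0, W_REP_D, small) for d in (1, 2)]
r_big = [predicted_throughput(d, w, W_REQ, W_REP0, W_REP_D, big) for d in (1, 2)]
```

With these numbers, r_small decreases from one to two SeDs while r_big doubles, mirroring the qualitative predictions for DGEMM 10x10 and DGEMM 200x200.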

In summary, our deployment performance model is able to accurately predict the impact


[Figure 7.3(a) plots throughput (requests/second) versus the number of clients for the 1-SeD and 2-SeD hierarchies; (b) tabulates maximum throughput: 1 SeD measured 35, predicted 45; 2 SeDs measured 70, predicted 90.]

Figure 7.3: Star hierarchies with one or two servers for DGEMM 200x200 requests. (a) Measured throughput for different load levels. (b) Comparison of predicted and measured maximum throughput.

of adding servers to a server-limited or agent-limited deployment.

To verify our heuristic, we compared the deployment predicted by the heuristic with

the experimental results presented in Chapter 6. Table 7.5 presents the comparison by reporting the percentage of optimal achieved by the deployments selected by different means.

DGEMM Size | Nodes |V| | Optimal Degree | Homogeneous Degree | Heuristic Degree | Heuristic Performance
10         | 21        | 1              | 1                  | 1                | 100.0%
100        | 25        | 2              | 2                  | 2                | 100.0%
200        | 45        | 3              | 9                  | 10               | -
310        | 45        | 15             | 22                 | 33               | 89.0%
1000       | 21        | 20             | 20                 | 20               | 100.0%

Table 7.5: A summary of the percentage of optimal achieved by the deployment selected by our heterogeneous heuristic, optimal degree, and optimal homogeneous model.

7.5.4 Heuristic validation on heterogeneous cluster

To validate our heuristic, we performed experiments using two sites of Grid'5000, Lyon and Orsay, a set of distributed computational resources in France. We used 200 nodes of Orsay for the deployment of the middleware elements and 30 nodes of Lyon for submitting requests to the deployed platform.

To convert the homogeneous cluster into a heterogeneous one, we changed the workload of the reserved nodes by launching matrix multiplications of different sizes as a background


[Figure 7.4 plots throughput (requests/second) versus the number of clients for the Star, Balanced, and Automatic hierarchies.]

Figure 7.4: Comparison of the automatically-generated hierarchy for DGEMM 310x310 with intuitive alternative hierarchies.

program on some of the nodes. Matrices of different sizes utilize the computing power to varying degrees, so the available computing power of the nodes varies. We used the times measured on the Lyon nodes (Section 7.5.2) for the Orsay nodes because the machine configuration of the two sites is exactly the same. After launching the matrix programs in the background on the machines, we used the Linpack mini-benchmark to measure the capacity of the nodes in MFlops and used this value to convert all measured times from the Lyon nodes into MFlop estimates for each node.

We compared two different deployments with the deployment automatically generated by our heuristic. The first deployment is a simple star, where one node acts as the agent and all remaining nodes are directly connected to it as servers. In the second deployment we deployed a balanced graph: one top agent connected to 14 agents, each of which is connected to 14 servers, except one that could have only 3 servers.

Clients submitted DGEMM problems of two different sizes. First we tested the deployments with DGEMM 310x310. The deployment generated by the heuristic used only 156 nodes and is organized as follows: the top agent is connected to 9 agents, each of which is again connected to 9 agents; two of these agents are connected to 9 SeDs, six to 7 SeDs, and one to 5 SeDs. The automatically generated deployment performed better than the two competing deployments. The experimental results are shown in Figure 7.4.

The second experiment was done with DGEMM 1000x1000. The heuristic generated a star deployment for this problem size. The results in Figure 7.5 show that the star performed better than the second compared deployment.


[Figure 7.5 plots throughput (requests/second) versus the number of clients for the Automatic/Star and Balanced hierarchies.]

Figure 7.5: Comparison of the automatically-generated hierarchy for DGEMM 1000x1000 with an intuitive alternative hierarchy.

7.6 Conclusion

We presented a deployment heuristic that predicts the maximum throughput achievable using the available nodes. The heuristic predicts good deployments for both homogeneous and heterogeneous resources. To test the heuristic on homogeneous resources, we compared it against the homogeneous optimal algorithm presented in Chapter 6; the heuristic achieved at least 89% of the optimal (Table 7.5). To validate the heuristic, experiments were done on a site of Grid'5000. The experiments showed that the deployment automatically generated by the heuristic performs better than some intuitive deployments.


Chapter 8

Improve Throughput of a Deployed Hierarchical NES

It is not always practical to deploy the middleware platform from scratch according to the specifications of new jobs. It is sometimes cheaper and faster to modify the existing hierarchy than to deploy a new platform for the new requests. This chapter presents a model to analyze existing hierarchical deployments. The mathematical model identifies a bottleneck node and removes the bottleneck by adding resources in the appropriate area of the hierarchy. The solution is iterative and heuristic in nature. We validated our model by applying it to improve a user-defined deployment platform for the DIET middleware.

8.1 Hierarchical deployment model

We model a collection of heterogeneous resources (a processor, a cluster, etc.) and the communication links between them as the nodes and edges of an undirected hierarchical (tree-shaped) graph. Each node is a computing resource capable of computing and communicating with its neighbors at the same or different rates. We assume that one specific node, referred to as the client, initially generates requests. The client sends the requests to its neighbor node, which is the head of the hierarchy. This node checks whether the request is well-formed (having all the parameters that a request should have), and if so, the request is flooded to the neighboring nodes down the hierarchy. These nodes forward the requests down the hierarchy until they reach the connected servers. The servers send reply packets to their parent nodes; these packets contain the status (memory available, number of resources available, performance prediction, ...) of the server. A parent node compares the reply packets sent by each of its connected servers and selects the best server among them. The reply packet of the selected server is then sent by the node to its own parent. Finally, the reply packet reaches the head node, which informs the client of the best server (or a list of available servers, ranked in order of availability) among those selected. The client contacts the selected server and sends it the input data. Finally, the server executes the function on behalf of the client and returns the results.
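The request/reply scheme described above can be sketched as a recursive traversal of the hierarchy: a request is flooded down to the servers, each server replies with its status, and each agent forwards only the best reply upward. The dictionary layout and the "score" availability metric below are hypothetical illustrations, not DIET's actual data structures:

```python
# Sketch of request flooding and best-server selection over the tree.
# A leaf (server) carries a hypothetical availability "score" (higher
# is better); an agent floods the request to its children and keeps
# only the best reply to pass to its own parent.
def best_server(node):
    if "score" in node:  # leaf: a server replies with its status
        return node["name"], node["score"]
    replies = [best_server(child) for child in node["children"]]
    return max(replies, key=lambda r: r[1])

hierarchy = {"children": [                       # head node
    {"children": [                               # agent
        {"name": "sed1", "score": 0.4},
        {"name": "sed2", "score": 0.9}]},
    {"children": [                               # agent
        {"name": "sed3", "score": 0.7}]},
]}
winner = best_server(hierarchy)
```

In the real middleware the "score" would be a vector of status fields (memory, load, performance prediction) and the head node could return a ranked list rather than a single winner.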

The target architectural framework is represented by a weighted graph G = (V, E, w, B). Each Pi ∈ V represents a computing resource of computing power wi, meaning that node Pi


[Figure 8.1 depicts the architectural model: a client connected to the head node, agents below it, and servers at the leaves, with node powers wi and link bandwidths Bi,j labelling the nodes and links.]

Figure 8.1: Architectural model.

executes wi MFlop per second (so the bigger wi, the faster the computing resource Pi). There is a client node, i.e. a node Pc, which generates the requests that are passed to the following nodes1. Each link Pi -> Pj is labelled by the bandwidth value Bi,j, which represents the amount of data sent per second between Pi and Pj. The bandwidth of a link is measured in Mb/second.

The size of the request generated by the client is S_i^(in) and the size of the reply created by each node is S_i^(out); both quantities are measured in Mb/request. The amount of computation needed by Pi to process one incoming request is denoted by W_i^(in), and the amount of computation needed by Pi to merge the replies of its children is denoted by W_i^(out). We denote by W_i^(DGEMM) the amount of computation needed by Pi to process a generic problem (for example, the BLAS [35] matrix multiplication DGEMM).

The current architectural model does not consider data management; we focus on data-in-place applications (due to security constraints or parameter programming). Nevertheless, data management could easily be added as a new level of the modeling. The target applications, and the lack of evaluation tools to estimate data movement, explain why we do not consider this aspect. In the same spirit, we consider that the results, which are very small, can be neglected.

8.2 Throughput calculation of a hierarchical NES

Our objective is to compare the maximum number of requests answered per second by a specific type of architecture, so that the best architecture can be selected. alpha_i^(in) denotes the number of incoming requests (requests coming from the client) processed by Pi during one time-unit. Note that this number is not necessarily an integer and may be rational. In a similar way, alpha_i^(out) is the number of outgoing requests (selections of the best server based on the reply packets)

1We use only one client node for the sake of simplicity, but modeling many different clients with different problem types can be done easily.


computed during one time-unit by the node Pi. Servers are connected to the local agents at the last level of the graph. Therefore, alpha_i^(DGEMM) denotes the number of problems solved by the node Pi if Pi is a server.

8.2.1 Find a bottleneck

The number of requests replied to in a time step depends on the bandwidth of the links, the size of the requests, the fraction of each request computed by a processor in a time step, and the computing power of the processors. Therefore, we have the following constraints:

Computation constraint for an agent: for every agent Pi,

    (alpha_i^(in) * W_i^(in) + alpha_i^(out) * W_i^(out)) / w_i <= 1

Note that each incoming request must have a corresponding reply, and that each request is broadcast along the whole hierarchy. Thus there is no need to distinguish between the alpha_i^(in) and the alpha_i^(out): they are all equal to the maximum throughput of the platform, rho. The previous equation can thus be simplified to:

    forall Pi : rho * (W_i^(in) + W_i^(out)) / w_i <= 1        (8.1)

Communication constraint for an agent: rho requests for computations and rho replies to these requests are transmitted per time-unit along each link Pi -> Pj. Therefore:

    forall Pi -> Pj : rho * (S_i^(in) + S_j^(out)) / B_i,j <= 1        (8.2)

Server's computation constraint: each server Pi processes alpha_i^(DGEMM) problems per time-unit. Therefore:

    forall Pi s.t. Pi is a server : (alpha_i^(DGEMM) * W_i^(DGEMM)) / w_i <= 1

All these values are linked to rho by the equation rho = sum over servers Pi of alpha_i^(DGEMM). Therefore:

    rho <= sum over servers Pi of [ w_i / W_i^(DGEMM) ]        (8.3)

No internal parallelism: in this model, the computations and the other operations performed by a node are done sequentially, so the sum of all operations performed by an agent must fit within the time step. Therefore, for every agent Pi:

    rho * (S_parent(i)^(in) + S_i^(out)) / B_parent(i),i          [communications with the parent]
    + rho * sum over Pi -> Pj of [ (S_i^(in) + S_j^(out)) / B_i,j ]   [communications with the children]
    + rho * (W_i^(in) + W_i^(out)) / w_i                          [local computations]
    <= 1        (8.4)

Page 125: Automatic Deployment for Application Service Provider ... · NP-complete. Finally, we gave a mathematical model that can analyze an existing deployment and can improve the performance

114 CHAPTER 8. IMPROVE THROUGHPUT OF A DEPLOYED HIERARCHICAL NES

It is noteworthy that if, on Pi, computations can be performed in parallel with communications, the previous constraint should be replaced by:

    rho * (S_parent(i)^(in) + S_i^(out)) / B_parent(i),i          [communications with the parent]
    + rho * sum over Pi -> Pj of [ (S_i^(in) + S_j^(out)) / B_i,j ]   [communications with the children]
    <= 1        (8.5)

Theorem 3. The maximum number of requests that can be processed by the platform in steady state is obtained from constraints (8.1), (8.2), (8.3), (8.4) and (8.5) and is given by the following formula, where the minimum is taken over all agents Pi and all links Pi -> Pj:

    rho = min {
        w_i / (W_i^(in) + W_i^(out)),
        B_i,j / (S_i^(in) + S_j^(out)),
        sum over servers Pi of [ w_i / W_i^(DGEMM) ],
        1 / [ (S_parent(i)^(in) + S_i^(out)) / B_parent(i),i
              + sum over Pi -> Pj of [ (S_i^(in) + S_j^(out)) / B_i,j ]
              + (W_i^(in) + W_i^(out)) / w_i ]
    }        (8.6)

Theorem 4. When maximizing the throughput, at least one of the constraints (8.1), (8.2), (8.3), (8.4) and (8.5) is tight. This constraint represents the bottleneck of the platform.
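The bound of Theorem 3 can be evaluated mechanically: compute every per-node, per-link, and aggregate-server term and take the minimum; the term that attains the minimum is the tight constraint of Theorem 4. The sketch below evaluates a subset of the terms (the computation bound (8.1), the communication bound (8.2), and the aggregate server bound (8.3), omitting the sequential-node term (8.4)) on a toy star with one agent and two servers; all figures are illustrative:

```python
# Evaluate a subset of the throughput bounds of Eq. (8.6) on a toy
# platform and report which constraint is tight (the bottleneck).
def throughput(nodes, links, servers):
    bounds = []
    for i, n in nodes.items():
        if n["Win"] + n["Wout"] > 0:  # computation constraint (8.1)
            bounds.append(("comp", i, n["w"] / (n["Win"] + n["Wout"])))
    for (i, j), B in links.items():   # communication constraint (8.2)
        bounds.append(("link", (i, j), B / (nodes[i]["Sin"] + nodes[j]["Sout"])))
    # aggregate server constraint (8.3)
    bounds.append(("servers", None,
                   sum(nodes[s]["w"] / nodes[s]["Wdgemm"] for s in servers)))
    kind, _, rho = min(bounds, key=lambda b: b[2])
    return rho, kind

nodes = {  # illustrative: powers in MFlop/s, costs in MFlop, sizes in Mb
    0: dict(w=100.0, Win=0.2, Wout=0.05, Sin=0.01, Sout=0.005, Wdgemm=None),
    1: dict(w=100.0, Win=0.0, Wout=0.0, Sin=0.01, Sout=0.005, Wdgemm=10.0),
    2: dict(w=100.0, Win=0.0, Wout=0.0, Sin=0.01, Sout=0.005, Wdgemm=10.0),
}
links = {(0, 1): 100.0, (0, 2): 100.0}  # Mb/s
rho, bottleneck = throughput(nodes, links, servers=[1, 2])
```

Here the agent bound is 100/0.25 = 400 requests/s and each link bound is far higher, so the platform is limited by the servers at 20 requests/s, which is exactly the situation where Equation (8.3) is tight.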

8.2.2 Remove the bottleneck

Even when neglecting the server constraints, finding the best topology is a hard problem, since it amounts to finding the best broadcast tree on a general graph, which is known to be NP-complete [18]. Note that even when neglecting the request mechanism, as soon as the communication of the problem's data is taken into account, the problem of finding the best deployment also becomes NP-complete [16].

In real life, moreover, the topology of the underlying platform is particular and constrains some parts of the deployment. Therefore, we propose to improve the throughput of a given deployment by breaking its bottleneck. Using the previous theorems, we find the bottlenecks and get rid of them by adding a supporter agent to the parent of a loaded agent, so as to divide the load of that particular agent. We add new nodes according to the greedy Algorithm 8.1.

Algorithm 8.1 Algorithm to add LAs.
1: while (number of available nodes > 0) do
2:     Calculate the throughput ρ of the structure.
3:     Find a node whose constraint is tight and that can be split
4:     if no such node exists then
5:         The deployment cannot be improved. Exit
6:     Split the load by adding a new node (which will act as an agent) to its parent
7:     Decrease the number of available nodes

In Algorithm 8.1, line 3 checks whether it is possible to divide the load of a node. This condition may be false for several reasons; for example, a node Pi having only one child cannot divide its load.
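The greedy loop of Algorithm 8.1 can be sketched as follows. For brevity the throughput model is reduced to a simplistic per-agent bound, w divided by the per-child reply cost times the number of children; this is an illustrative stand-in for the full constraint set, not the thesis model:

```python
# Greedy sketch of Algorithm 8.1: while spare nodes remain, pick the
# agent with the lowest sustainable throughput (the tight constraint),
# and, if it has more than one child, move half its children to a newly
# added supporter agent.
def agent_load(children, w=100.0, cost=1.0):
    # simplistic bound: requests/s an agent with this many children sustains
    return w / (cost * children) if children else float("inf")

def improve(fanouts, spare):
    # fanouts: number of children of each bottom-level agent
    while spare > 0:
        i = min(range(len(fanouts)), key=lambda k: agent_load(fanouts[k]))
        if fanouts[i] <= 1:        # lines 3-5: no splittable bottleneck
            break
        half = fanouts[i] // 2     # line 6: add a supporter agent
        fanouts[i] -= half
        fanouts.append(half)
        spare -= 1                 # line 7
    return fanouts

# one agent with 8 children and 2 spare nodes: the load gets split twice
result = improve([8], 2)
```

Starting from a single agent with 8 children, the two spare nodes raise the weakest agent's bound from 12.5 to 25 requests/s in this toy metric; the real algorithm recomputes the full ρ of Theorem 3 at each iteration.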


Components | S_i^(in) (Mb) | W_i^(in) (MFlop) | W_i^(out) (MFlop)
Client     | 0.339         | 0.014            | 0
MA         | 0.010         | 0.159            | 0.78e-3
LA         | 0.012         | 0.079            | 0.19e-3

Table 8.1: Parameter values to calculate the throughput of a deployment.

8.3 Parameter measurement

We used the Distributed Interactive Engineering Toolbox (DIET) [28] to estimate the values of the different parameters. The names of DIET's elements, Master Agent (MA), Local Agent (LA), and server (SeD), are used in the explanation of the experiments. The experiments were done on a homogeneous cluster composed of 16 processors (bi-PIII 1.4GHz). To observe the effect on the computation time of each component when processing one request, we ran experiments varying the number of LAs and SeDs with one MA and one client. We focus on linear approximations, as these give good results for our modeling.

The effect of adding LAs is shown in Figure 8.2, in the form of the time taken by each component to process one request. The time taken by the MA to process a request increases with the addition of LAs. The MA takes approximately 10 ms for an incoming request, while the time to process an outgoing request is very small, varying between 0.1 and 0.18 ms. The time for an LA to process an incoming request is approximately between 6.6 and 7.6 ms, and for an outgoing request between 0.112 and 0.116 ms. The computation time of the SeD is barely affected by the increase in LAs.

Figure 8.3 shows the effect of the number of servers on the components' computation times. For both incoming and outgoing requests, the computation time taken by the MA, LA and SeD increases linearly. For an incoming request, the MA and LA take approximately 9 to 10 ms and 8 to 22 ms respectively. For an outgoing request the variation is very small: for the MA it is between 0.1 and 0.2 ms, and for the LA between 0.2 and 0.35 ms. The SeD computation time ranges between 0.18 and 0.19 ms.

The time needed by each component to reply is so small that it is very difficult to estimate which fraction should be attributed to computation and which to communication. Therefore, we attribute it entirely to computation (and hence set S_i^(out) = 0), as only a very small amount of data is exchanged here.

From these experimental results, we observe that although increasing the number of LAs increases the overall time needed to process one request on an MA or an LA, the fraction of that time spent on computation is almost constant. The behavior is similar when varying the number of SeDs.

From these measurements, we estimated the time taken by the components to communicate and to compute each request. S_i^(in) is then calculated by summing the communication time taken by each component and multiplying it by the bandwidth of the local link. Similarly, the values of W_i^(in) and W_i^(out) are calculated by multiplying the computation times of the incoming and outgoing requests by the processing power. The parameter values are summarized in Table 8.1.
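The unit conversion is a matter of dimensional analysis: a computation time in seconds times the node power in MFlop/s gives a cost in MFlop, and a communication time in seconds times the link bandwidth in Mb/s gives a message size in Mb. The node power and link bandwidth below are illustrative values chosen to reproduce Table 8.1, not measured ones:

```python
# Convert per-request timings into the model's units.
def to_mflop(time_s, power_mflops):
    # computation cost (MFlop) = time (s) * node power (MFlop/s)
    return time_s * power_mflops

def to_mb(time_s, bandwidth_mbs):
    # message size (Mb) = communication time (s) * bandwidth (Mb/s)
    return time_s * bandwidth_mbs

# illustrative: 10 ms of MA computation at a hypothetical 15.9 MFlop/s
W_in_MA = to_mflop(0.010, 15.9)       # ~0.159 MFlop, as in Table 8.1
# illustrative: 3.39 ms of transfer on a hypothetical 100 Mb/s link
S_in_client = to_mb(0.00339, 100.0)   # ~0.339 Mb, as in Table 8.1
```

The same conversion underlies the Linpack-based calibration of Section 7.5.2: measuring each node's power in MFlops lets per-request times measured on one machine be re-expressed as machine-independent MFlop costs.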


[Figure 8.2 plots the time taken per request (in seconds) versus the number of LAs, with linear approximations: (a) W^(in)/w for the MA, (b) W^(out)/w for the MA, (c) W^(in)/w for an LA, (d) W^(out)/w for an LA, (e) the time taken by a SeD to compute.]

Figure 8.2: Performance calculation by adding LAs.


[Figure 8.3 plots the time taken per request (in seconds) versus the number of SeDs, with linear approximations: (a) W^(in)/w for the MA, (b) W^(out)/w for the MA, (c) W^(in)/w for an LA, (d) W^(out)/w for an LA, (e) the time taken by a SeD to compute.]

Figure 8.3: Performance calculation by adding SeDs.


8.4 Simulation results

To validate our model, we considered a real heterogeneous network on which we assume the DIET middleware elements are already deployed. We ran simulations using a simple model of the real heterogeneous network shown in Figure 8.4. A diagrammatic view of the heterogeneous network as a DIET-deployed platform is shown in Figure 8.5.

8.4.1 Test-bed

Distributed networks are heterogeneous: every node has its own computing power (possibly different from the other nodes), and the bandwidths of the links between nodes also mostly differ. In our heterogeneous network, we have one client (veloce) and one MA at Rocquencourt. This MA is connected to two LAs: one at Rennes and another at Grenoble. 40 servers (the paraski cluster) are connected to the LA at Rennes. The LA at Grenoble is connected to two further LAs: the LA at Grenoble has 200 servers (the icluster cluster) and the LA at Sophia has 14 servers (the galere cluster). The power of the different nodes and the bandwidth of the different links between them are depicted in Figure 8.5.

Figure 8.4: High-speed network (2.5Gb/s) between INRIA research centers and several other research institutes.

[Figure 8.5 shows the testbed: the client Veloce (21.4), the MA at Rocquencourt, LAs at Rennes, Grenoble, and Sophia Antipolis, and the Paraski (40 SeDs), icluster (200 SeDs), and galere (14 SeDs) clusters, connected over VTHD and Fast Ethernet links; node powers shown include 21.4, 137.3, 292.3, 15.6, and 98.1 MFlop/s.]

Figure 8.5: Diagrammatic view of the testbed.

On this heterogeneous platform, the wi's range from 15.6 MFlop/s to 292 MFlop/s, and the bandwidths Bi,j range from 10 Mb/s to 2.5 Gb/s.

8.4.2 Computing a good deployment

The real heterogeneous network (shown in Figure 8.5) has one client node, one MA, four LAs and 255 servers. The client sends requests to the deployed platform through the MA. The 255 servers carry out the actual execution of the requests according to Equation (8.3), meaning that task execution is balanced among all servers of the platform. The overall throughput of the heterogeneous network is


calculated using the formula given in Section 8.2.1. The performance of the original deployment (the natural one depicted in Figure 8.5) is rather low, since it can process at most 4 requests per second. But the throughput of the network can be improved by breaking the bottleneck, for which we used Algorithm 8.1. The nodes required to break the bottleneck can be obtained in either of two ways: by assuming that enough extra nodes are available to improve the throughput of the hierarchy, or by using existing servers as the required nodes. Adding new nodes to remove the bottleneck is not always feasible, since extra nodes may not be available. But it is advantageous to use new nodes rather than converting servers, because converting servers into agents decreases the aggregate server power and thus the throughput of the platform.

8.4.2.1 Addition of new nodes

The bottleneck of the deployed platform shown in Figure 8.5 is located at icluster (see Figure 8.6). icluster has 200 servers, so broadcasting every request and gathering the answers is very time-consuming. Seven nodes have to be added before it is no longer the bottleneck of the platform. The eighth node has to be added at the Rennes cluster: its gateway is very slow and, even though its number of servers is not as large as on the icluster, it has become an issue. Eight more nodes are then added on the icluster, and then two more at Rennes. At this point, we have added a total of 18 nodes and the tight equation is Equation (8.3), which means that all servers are working at full speed and that there is no hope of further improving the throughput of the platform. The platform with 260 nodes was giving a throughput of 4 requests/second, which is improved to 35 requests/second by adding only 18 nodes.

[Figure 8.6 plots the number of requests per second versus the number of nodes added (0 to 18), with annotations marking where new nodes are added, where the bottleneck moves to Rennes, and where the throughput becomes limited by the servers' power.]

Figure 8.6: Throughput of the heterogeneous network as more LAs are added.


8.4.2.2 Rearrangement of platform nodes

As stated earlier, the first bottleneck of the deployed platform shown in Figure 8.5 is located at icluster (see Figure 8.6). Of the 200 servers, seven are converted into LAs to share the load of the LA at icluster, so that it is no longer the bottleneck of the platform. The next bottleneck occurs at the Rennes cluster, as the number of servers (40) at Rennes is larger than the number of servers attached to the other LAs in the platform. So a server at the Rennes cluster is converted into an eighth LA to share the load of the parent LA and remove the bottleneck from the Rennes cluster. Eight more servers are converted into LAs at icluster, and then one server is converted into an LA at Rennes. At this point, a total of 17 servers have been converted into LAs and the tight equation is Equation (8.3). The throughput of the deployed platform with 260 nodes is thus improved from 4 requests/second to 33 requests/second by rearranging the nodes.

[Figure 8.7 plots the number of requests per second versus the number of nodes replaced (0 to 18), with annotations marking where servers are replaced, where the bottleneck moves to Rennes, and where the throughput becomes limited by the servers' power.]

Figure 8.7: Throughput of the heterogeneous network as SeDs are converted into LAs.

8.5 Conclusion

An efficient deployment is very important for an NES environment. We have used the theoretical steady-state scheduling framework to propose a new model that finds bottlenecks in the organization of a hierarchical NES such as the DIET middleware. This model makes it possible to improve the overall performance of an NES by breaking bottlenecks, and therefore to perform automatic deployment or redeployment. The model can also analyze the effect on performance of changes to the hierarchy's configuration, such as the number of agents or servers in the hierarchy, the problem size, the bandwidths, the nodes' computing power, the parameter values, etc.

From Chapter 5 to Chapter 8, we have presented deployment planning for middleware on distributed resources, and a technique to improve an existing deployment. Using these deployment planning techniques and a launching tool, an automatic middleware deployment tool can be developed.


Chapter 9

Automatic Deployment Planning Tool

This chapter presents one of the main directions for future work: the initial steps to build a tool for automatic deployment planning based on our theoretical concepts. In the previous chapters, we presented theoretical concepts to generate the best deployment plan for a middleware. We plan to build this tool to make our work available to end users in a simple form.

9.1 Introduction

For simplicity, in this chapter we name the deployment planning tool whose initial building steps will be described the Automatic Deployment Planning Tool (ADePT). ADePT is a deployment planning tool that generates a deployment plan based on a resource description, an application description, and a middleware description (in the form of a performance model). Figure 9.1 outlines how ADePT works, with its inputs and output. ADePT provides an optimal solution for hierarchical middleware environments on a cluster and implements heuristics for hierarchical middleware deployment on a grid. ADePT can also determine the bottleneck in a deployed hierarchy and thus helps to establish whether the performance of a given deployment can be improved.

The resource description provides information about the resources, such as computing power, storage capacity, etc. Resource information can be provided to ADePT as an XML file. The resource description should define all compute hosts (resources) that a user wants to use. An example of a resource description XML file is shown in Figure 9.2. The “scratch” tag specifies a local pathname that the deployment tool can use as scratch space (e.g., temporary storage of configuration files); the user must have write access to the mentioned directory. The “storage” tag defines all the storage space that will be needed to run the application. The “bandwidth_link” tag gives the link between two resources. We have considered the links between resources to be homogeneous, so a single tag is enough; as ADePT is improved, the bandwidth links between resources could also be submitted as a separate XML file. If resources are homogeneous, a single “cluster” tag is enough to simplify the description of large numbers of resources. To define heterogeneous resources, a “compute” tag should be used for each individual resource. The XML file must include at least one “scratch” tag, one “storage” tag, one “bandwidth_link” tag, and one “compute” or one “cluster” tag.

The application description provides information about the application that will be submitted to the deployed plan, such as the size of the scheduling request, the size of the response, and the amount of computation needed for the prediction, execution, etc., of the application on the dedicated resource. An example of an application description XML file is shown in Figure 9.3.

The deployed plan is the deployed hierarchy of an NES environment whose performance has to be improved. This hierarchy should be submitted to ADePT as an XML file. ADePT then uses the iterative method presented in Chapter 8 to improve the hierarchy's throughput, and generates as output an XML file that defines the improved hierarchy.

The performance model defines the functionality of each NES component according to the NES working phases, and gives the method to calculate the throughput of each component. The formulation of a performance model for an NES environment is explained in detail in Section 9.2.

The techniques that ADePT uses to select the deployment plan are based on the theoretical results presented in Chapters 6, 7, and 8. The theoretical results of Chapter 6 have shown that a complete d-ary spanning tree provides an optimal deployment for hierarchical middleware environments on homogeneous clusters. Chapter 7 presented a heuristic to deploy hierarchical middleware on a grid, by selecting efficient resources from the grid according to the functional requirements of the middleware elements. A mathematical model was presented in Chapter 8 that can analyze the deployed platform and find a bottleneck node. Using Algorithm 8.1, the bottleneck can be removed by dividing the load of the node that causes it: a new node is added under its parent, and the children of the loaded node are divided between the loaded node and the newly added node.
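The bottleneck-removal step just described can be sketched as follows. This is only an illustrative sketch: the `Node` class and the load model in `throughput` are placeholders, not the exact formulas of Algorithm 8.1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    capacity: float                       # requests/second the node can handle
    children: list = field(default_factory=list)

def throughput(node):
    # Placeholder load model: an agent's throughput degrades with its
    # number of children. The thesis uses the middleware performance model.
    return node.capacity / max(1, len(node.children))

def find_bottleneck(root):
    """Return (parent, agent) where agent has the lowest throughput."""
    best = (None, None, float("inf"))
    stack = [root]
    while stack:
        n = stack.pop()
        for c in n.children:
            if c.children and throughput(c) < best[2]:
                best = (n, c, throughput(c))
            stack.append(c)
    return best[0], best[1]

def remove_bottleneck(root, spare_nodes):
    """One iteration: split the bottleneck agent's children with a spare node."""
    parent, loaded = find_bottleneck(root)
    if loaded is None or not spare_nodes:
        return False
    new = spare_nodes.pop()
    half = len(loaded.children) // 2
    new.children = loaded.children[half:]     # divide the children between
    loaded.children = loaded.children[:half]  # the loaded and the new node
    parent.children.append(new)               # new node goes under the parent
    return True
```

Iterating `remove_bottleneck` until it returns `False` mirrors the repeated improvement loop of Chapter 8.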

9.2 Formulation of performance model

ADePT takes the performance model of a middleware as an input. Performance models are based on the middleware description (the functionality of the middleware's components in the form of its working phases, the operating model of Section 5.1.1, etc.). In this section we give an idea of how to generate the performance model for a middleware.

Description of working phases of the middleware

Generally, NES environments process requests in two phases. First, a user submits an application, called a scheduling request, to the root node. The root node is the entry point of the deployment, to which all users submit their applications. The root node checks the scheduling request and forwards it down the hierarchy. The other parent nodes in the hierarchy perform the same operation until the scheduling request reaches the servers. We assume that the scheduling request is forwarded to all servers; this is a worst-case scenario, as filtering may be done by the parents based on the request type. Servers may or may not make performance predictions for satisfying the request, depending on the exact system.

Servers that can perform the service then generate a scheduling response. The scheduling response is forwarded back up the hierarchy, and the parents sort and select amongst the various scheduling responses. Finally, the root node forwards the chosen scheduling response (i.e., the selected server) to the user.
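The scheduling phase just described amounts to a tree traversal: the request goes down to every server, and the best response is selected on the way back up. A toy sketch follows; the `Element` class and its `predict` field are illustrative, not a DIET API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Element:
    name: str
    children: List["Element"] = field(default_factory=list)
    predict: Callable[[object], float] = lambda req: float("inf")

def schedule(node, request):
    """Forward the scheduling request down to every server, then propagate
    the best (predicted_time, server) response back up, each parent
    selecting among its children's responses."""
    if not node.children:               # leaf: a server makes a prediction
        return (node.predict(request), node.name)
    return min(schedule(child, request) for child in node.children)
```

Here every parent keeps only the minimum-time response, so the root ends up forwarding the selected server to the user.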

The user then generates a service request, which is very similar to the scheduling request but includes the full input data set, if any is needed. The service request is submitted by the user to the chosen server. The server performs the requested service and generates a service response, which is then sent back to the user.

[Figure 9.1 shows the ADePT workflow: from the resource description, application description, performance model, and deployed plan, ADePT calculates the performance of each resource, selects resources, calculates the exact number of resources required, generates the arrangement of resources, finds and removes bottleneck resources while this is possible, and finally generates the corresponding XML file, either as a new deployment plan or as an improvement of a given deployment plan.]

Figure 9.1: Working of Automatic Deployment Planning Tool (ADePT).

Description of performance model

Performance models can be defined depending on the middleware's working phases. We define performance models to estimate the time required for the treatment of requests in the various phases by each component of the middleware. According to each component's functionality, its performance model is derived from the working phases in which the component takes part. Below we give a general idea of how to generate the performance model for each middleware component.

Parent node's communication model: To treat a request, a node receives the request from its parent, sends the request to each of its children, receives a reply from each of its children, and sends one reply to its parent. Thus, the time required by a parent node depends on the size of the requests and responses exchanged with its parent and children, the bandwidth of the links, and the number of children.

Server node's communication model: Servers have only one parent and no children, so the time required by a server for receiving a request and sending a reply depends on the size of the request and/or response and the bandwidth of the link.



<?xml version="1.0" standalone="no"?>
<!DOCTYPE resource_description SYSTEM "../resource_descript.dtd">
<resource_description>
  <scratch dir="/homePath/user/scratch_direct"/>
  <storage label="g5kBordeauxDisk">
    <scratch dir="/homePath/user/scratch_runtime"/>
    <scp server="frontale.bordeaux.grid5000.fr" login="pkchouhan"/>
  </storage>
  ...
  <storage label="g5kOrsayDisk">
    <scratch dir="/homePath/user/scratch_runtime"/>
    <scp server="frontale.orsay.grid5000.fr" login="pkchouhan"/>
  </storage>
  <bandwidth_link B="190"/>
  <compute label="node1" disk="disk1">
    <ssh server="node1.site.fr" login="pkchouhan"/>
    <env path="/homePath/user/demo/bin" LD_LIBRARY_PATH="/homePath/user/demo/lib"/>
  </compute>
  ...
  <cluster label="g5kBordo" disk="g5kBordeauxDisk" login="pkchouhan" total_nodes="50">
    <env path="/homePath/user/demo/bin" LD_LIBRARY_PATH="/homePath/user/demo/lib"/>
    <node label="node-1.bordeaux.grid5000.fr">
      <ssh server="node-1.bordeaux.grid5000.fr"/>
    </node>
    ...
    <node label="node-47.bordeaux.grid5000.fr">
      <ssh server="node-47.bordeaux.grid5000.fr"/>
    </node>
  </cluster>
  ...
  <cluster label="g5kOrsay" ...>
    ...
  </cluster>
</resource_description>

Figure 9.2: An example of resource description XML file.



<?xml version="1.0" standalone="no"?>
<!DOCTYPE app_description SYSTEM "../application_descript.dtd">
<app_description>
  <compute label="node1" disk="disk1" login="pkchouhan">
    <wapp="0.0046"/> <wreq="0.0064"/> <wfix="0.001"/>
    <wsel="0.0009"/> <sres="0.0064"/> <srep="0.0053"/>
  </compute>
  <compute label="node2" disk="disk2" login="pkchouhan">
    <wapp="0.0044"/> <wreq="0.0054"/> <wfix="0.0009"/>
    <wsel="0.0008"/> <sres="0.0064"/> <srep="0.0053"/>
  </compute>
  ...
  <compute label="nodeX" ...>
    ...
  </compute>
  <cluster label="g5kBordo" disk="g5kBordeauxDisk" login="pkchouhan" total_nodes="50">
    <wapp="0.0046"/> <wreq="0.0064"/> <wfix="0.001"/>
    <wsel="0.0009"/> <sres="0.0064"/> <srep="0.0053"/>
  </cluster>
  ...
  <cluster label="g5kOrsay" ...>
    ...
  </cluster>
</app_description>

Figure 9.3: An example of application description XML file.



Parent node's computation model: A parent node performs two activities involving computation: the processing of scheduling requests and the selection of the best server amongst the replies returned by its children. The parent computation model is therefore based on the amount of computation needed for these two activities, the parent's computing power, and the number of its children.

Server node's computation model: Servers also perform two activities involving computation: performance prediction as part of the scheduling phase and provision of application services as part of the service phase. The server computation model therefore depends on the amount of computation needed for these two activities and on the server's computing power.

As an example, the performance models for DIET can be seen in Chapters 5, 6, and 7. Using the performance model, the throughput of any system can be calculated under the different operating models (mentioned in Section 5.1.1).
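A minimal sketch of how such a performance model can be evaluated is given below. The parameter names (s_req, s_rep, w_sched, w_sel, w_pred, w_app) are illustrative placeholders, not the exact DIET model of Chapters 5 to 7.

```python
def parent_time(s_req, s_rep, bw, n_children, w_sched, w_sel, power):
    """Seconds a parent spends per request: receive and forward the request,
    collect the replies and send one up, plus scheduling and selection work."""
    comm = (s_req + s_rep) * (n_children + 1) / bw
    comp = (w_sched + n_children * w_sel) / power
    return comm + comp

def server_time(s_req, s_rep, bw, w_pred, w_app, power):
    """Seconds a server spends per request: one request in, one reply out,
    plus performance prediction and application execution."""
    return (s_req + s_rep) / bw + (w_pred + w_app) / power

def platform_throughput(times):
    """In steady state, the slowest element limits the whole hierarchy."""
    return 1.0 / max(times)
```

With per-element times in hand, the deployment planner can compare hierarchies by the steady-state throughput they sustain.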

9.3 Working of ADePT

To generate a deployment plan for a middleware, ADePT needs as input the performance models (Section 9.2) of the middleware's components, the application description (the execution time of the application on dedicated servers, application parameters, etc.), and the resource description (computing power, bandwidth of the links between nodes, etc.). First, the throughput of each resource is calculated according to the given performance model; then the exact number of resources needed for the deployment is selected. Based on the type of the resources, the arrangement of the resources is defined. Finally, ADePT generates an XML file describing the middleware deployment plan.

To improve a given deployment, ADePT needs the performance models of the middleware components and an XML file representing the deployment plan. First, the throughput of each resource is calculated; then the load of the resource with minimum throughput is shared, either by adding a new resource to the hierarchy or by rearranging the hierarchy's resources. The process of removing the bottleneck is repeated until all servers in the hierarchy perform at full speed. The resulting hierarchy is then translated into an XML file.

The generated XML file is submitted to a deployment tool, which deploys the middleware hierarchy according to the deployment plan it describes.

9.4 ADePT as a deployment planner for ADAGE

In Chapter 4 we surveyed some deployment tools. Most of these tools use a user-defined deployment plan. Only two tools (Sekitei and Pegasus) use an AI-based planner to generate the deployment plan, and the planners used by these tools are specific to the application to be deployed.

Among the surveyed tools, ADAGE seems a very interesting tool for the deployment of grid applications. ADAGE contains a deployment model that specifies how a particular component can be installed, configured, and launched on a machine. However, ADAGE has no intelligent deployment planning algorithm. Currently, two very basic planners are implemented: round-robin and random. The ADAGE team notes a strong need for intelligent deployment planning algorithms [61]. As such, ADAGE is a framework into which a planner can be plugged; the two provided planners are just toy planners used to validate the proposition.

[Figure 9.4 outlines the integration: the general description of applications, the description of computing nodes and storage, the description of the network topology of the grid, and the controlling parameters are fed to the Automatic Deployment Planning Tool (ADePT), which produces a general deployment plan consumed by Automatic Deployment of Applications in a Grid Environment (ADAGE).]

Figure 9.4: ADePT as a deployment planner for ADAGE.

Figure 9.4 gives an outline of how ADePT can be merged into ADAGE. The basis of this figure is taken from Figure 12.2 of [61]. The ADAGE team considered a deployment planner difficult to conceive, so to minimize the number of planner implementations for each planning algorithm they introduced the concept of a general description. Each box in Figure 9.4 thus represents the general layout of the information. The “generic application description” represents an application that is converted to the general form of an application description by a simple converter; the latter is given as input to the deployment planner, here ADePT. The “description of computing nodes and storage” provides information about the computing power and storage capacity of the grid resources. The “description of the network topology of the grid” describes the network topology at a high level of abstraction, which makes it easy for the planner to exploit when placing the components of the applications. Since the “deployed plan” of Figure 9.1 submits a plan describing where the components of the middleware are launched, we assume that a deployment plan to be improved could be submitted to ADePT through the “description of network topology” of ADAGE, with some specific tag stating that the given description is an existing deployment. Alternatively, it may be better for ADAGE to add a new input format that accepts a deployed plan and can be used for the improvement of a given deployment or for redeployment. As the “performance model” provides indirect control over the performance of the deployment plan by controlling the throughput calculation in ADePT, we can similarly assume that the information required by ADePT could be transferred to it through the “controlling parameters” of ADAGE, with some modification if required. The “general deployment plan” of ADAGE is represented in XML, and ADePT also provides its deployment plan as an XML file.

9.5 Conclusion

In this chapter we have given our point of view on how to build an automatic deployment planning tool implementing our theoretical results.



ADePT is complementary to the deployment tools, as it generates the deployment plan automatically. ADePT saves end users' time by generating the best deployment, i.e., the one that executes the most user-submitted applications per time step. Moreover, the tool is not tied to any particular middleware, application, or resources, and can thus be used to generate a deployment plan for any middleware, based on any type of application, using different resources. As most deployment tools take an XML file as input, it is advantageous that ADePT generates its deployment plan in the form of an XML file.

We also gave an outline of how ADePT can be merged into ADAGE. These ideas are very preliminary, but we feel they are a strong step that will be very helpful for integrating our theoretical work into other grid tools.


Chapter 10

Conclusion

The work presented in this thesis concerns the efficient utilization of the resources provided by grid platforms through NES environments. As we saw in the first introductory chapter of this thesis, grid platforms are very promising but very challenging to use because of their intrinsic heterogeneity in terms of hardware capacities, software environments, and even system administration choices. End users therefore execute their applications on grid resources through NES environments, as these environments hide all the complexities of the grid platform from users by providing easy access to the grid resources.

We made a survey of the most prevalent NES environments, so as to obtain in-depth knowledge of existing NES environments. In Chapter 2, we presented seven NES environments (DIET, NeOS, NetSolve, Nimrod, Ninf, PUNCH, and WebCom) under various headings, enabling an objective comparison to be made. By objectively comparing these systems, we attempted to help potential NES users choose the environment that best suits their needs. In addition to these NES environments, we also presented some other popular grid middleware and grid systems. These systems work in the same space as NES environments but differ from them: some are based on the cycle-stealing concept, unlike NES environments, which use dedicated servers; others, which are dedicated, do not provide the scheduling system, job management services, and queueing mechanisms that NES environments do.

After obtaining this knowledge of NES environments, we analyzed the two main factors that affect their efficient utilization. Even if we assume that the NES environment selects the best server for the execution of user-submitted requests, these factors still affect its performance. The two factors are:

• the scheduling technique used to schedule tasks on the selected servers

• the deployment planning of the NES components on the distributed resources

Scheduling in client-server scenario

To meet the goal of the thesis, i.e., to use NES environments efficiently, the very first problem we are confronted with relates to application scheduling. It is possible to find the computing power and the storage capacity necessary for a task by using available performance forecasting tools like NWS [82], FAST [43], etc. However, even once the server best suited to solve the task is identified, it still remains to determine a schedule that offers the greatest possible effectiveness for the execution of tasks on the selected servers.

NES environments generally use the Minimum Completion Time (MCT) on-line scheduling algorithm, whereby all applications are scheduled immediately or refused. This approach can overload interactive servers in high-load conditions and does not allow adaptation of the schedule to task dependencies. Thus, we first studied the scheduling techniques that can be adopted for scheduling tasks in NES environments. As a result, we presented algorithms for the scheduling of sequential tasks in an NES environment in Chapter 3. We mainly discussed a deadline scheduling with priority strategy that is more appropriate for the multi-client, multi-server scenario. We showed experimentally that deadline scheduling with priority, together with a fallback mechanism, can increase the overall number of tasks executed by the NES.

Automatic deployment planning

Once we have good algorithms to schedule tasks on servers, the next important factor that influences the efficiency of NES environments is the deployment planning used to map the environment's components onto the available resources. Generally, the components of these environments are mapped onto the available resources as defined by the user or the environment's administrator. The mapping of the components onto the available resources is called “deployment”. There exist deployment tools that deploy NES components on selected resources; in Chapter 4 we presented a survey of some of them (ADAGE, JADE, JDF, Pegasus, Sekitei, SmartFrog, and Software Dock). However, these deployment tools do not allow non-expert users to simply deploy a middleware on grids according to their applications, because the input of these tools is a detailed deployment plan provided by the user. The deployment plan must state all the prerequisites required to use the grid resources efficiently.

An NES is composed of different components, each with a specific function to perform, and the appropriate resource should be selected from the pool of resources according to each component's functionality. As explained in the introductory chapter of this thesis, many deployments are possible depending on the number of available resources. A good deployment is one that can execute the maximum number of user-submitted applications per time step.

Thus, to remove the second obstacle to the efficient use of NES environments, it is important to design automatic deployment planning algorithms and heuristics. To generate a deployment plan for an NES environment, we divided the deployment planning problem into three parts. First, we tried to find an optimal deployment of middleware on a cluster (homogeneous resources); we considered clusters because finding a good deployment is time consuming even on homogeneous resources, as we showed with an experiment in Chapter 1. Second, we tried to find a deployment on heterogeneous resources, because grids have heterogeneous resources: even if we consider a cluster of clusters, the resources of two clusters may differ. Finally, we considered improving an existing deployment, because it is not always practical to deploy the middleware platform from scratch according to the specifications of newly submitted applications; it is sometimes preferable to modify the existing hierarchy, at lower cost and in less time than deploying a new platform for the new requests.



Optimal deployment planning on cluster

Finding a good deployment on homogeneous resources is not as simple as it may sound: one must decide “how many resources should be used in total?”, “how many should be dedicated to scheduling or computation?”, and “which hierarchical arrangement of schedulers is more effective, i.e., more schedulers near the servers, near the root node, or a balanced approach?”.

Initially, in Chapter 5, we presented a heuristic for hierarchical middleware deployment on homogeneous resources. This heuristic, however, is implemented under limited conditions; for example, a parent node can have either servers or other parent nodes as children, but not both. We later found that by considering the middleware working phases, we can obtain an algorithm for optimal middleware deployment on homogeneous resources.

In Chapter 6, we showed that the optimal deployment on a cluster is a Complete Spanning d-ary (CSD) tree, that is, a tree that is both a complete d-ary tree and a spanning tree; this result conforms with results from the scheduling literature. More importantly, we presented an approach for determining the optimal degree d for the tree.
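The degree search can be sketched under a deliberately simplified model: at degree d, a d-ary tree with n nodes needs about ceil((n-1)/d) agents and the rest are servers; each agent relays for d children, so its throughput is taken here as agent_rate/d, while the servers together sustain servers × server_rate. These toy rates are illustrative only; the real criterion of Chapter 6 is derived from the middleware performance model.

```python
import math

def csd_throughput(n, d, agent_rate, server_rate):
    """Toy throughput of a d-ary tree built from n homogeneous nodes."""
    internal = math.ceil((n - 1) / d)   # agents needed at degree d
    servers = n - internal              # remaining nodes compute
    return min(servers * server_rate, agent_rate / d)

def best_degree(n, agent_rate, server_rate):
    """Try every feasible degree and keep the one maximizing throughput."""
    return max(range(1, n),
               key=lambda d: csd_throughput(n, d, agent_rate, server_rate))
```

The tradeoff is visible even in this toy model: a larger d frees nodes to act as servers but loads each agent more, so an intermediate degree wins.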

Automatic deployment planning on grid

After finding an optimal solution for the first step of automatic deployment planning, we moved on to the second step. Finding the best deployment among heterogeneous resources is a hard problem, since it amounts to finding the best broadcast tree on a general graph, which is known to be NP-complete [18].

In Chapter 7, we therefore presented a deployment heuristic that predicts the maximum throughput that can be achieved using the available nodes. The main focus of the heuristic is to construct a hierarchy so as to maximize the throughput of each node, where this throughput depends on the number of children the node is connected to in the hierarchy. The heuristic provides a deployment that can meet the users' demand, as long as this demand is at most equal to the maximum throughput.
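A strongly simplified sketch of this greedy idea: attach servers to an agent, most powerful first, as long as the platform throughput (limited either by the servers' aggregate power or by the increasingly loaded agent) keeps improving. The per_child_cost load model is an illustrative assumption, not the Chapter 7 heuristic itself.

```python
def build_children(agent_capacity, candidates, per_child_cost=0.05):
    """candidates: (name, power) pairs. Returns (chosen names, throughput)."""
    chosen, best_tp, server_power = [], 0.0, 0.0
    for name, power in sorted(candidates, key=lambda c: -c[1]):
        k = len(chosen) + 1
        agent_tp = agent_capacity / (1 + k * per_child_cost)  # agent slows with k
        tp = min(server_power + power, agent_tp)
        if tp <= best_tp:        # one more child no longer helps: stop
            break
        chosen.append(name)
        server_power += power
        best_tp = tp
    return chosen, best_tp
```

The stopping rule captures the core of the approach: a node's throughput depends on its number of children, so children are added only while the predicted overall throughput grows.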

Improvement of existing deployment

Finally, in Chapter 8, we gave a mathematical model that can analyze an existing deployment and improve its performance by finding and then removing bottlenecks. This is a heuristic approach for improving deployments of NES environments on heterogeneous grids; it is used to improve the throughput of a deployment that has been defined by other means. The approach is iterative: in each iteration, mathematical models are used to analyze the existing deployment and identify the primary bottleneck, which is then removed by adding resources in the appropriate area of the hierarchy. Using this model, we can evaluate a virtual deployment before a real deployment, provide a decision-support tool (i.e., one designed to compare different hierarchies or to add new resources), and take the hierarchies' scalability into account.

Validation of deployment planning

The deployment planning algorithms and heuristics presented in Chapters 5, 6, and 7 were validated by using them to deploy the hierarchical middleware DIET on different sites of Grid'5000, a set of distributed computational resources in France. To deploy DIET for the experiments, we used the DIET deployment tool GoDIET [85]. The presented experiments were designed to test the ability of our deployment planning models to correctly identify good real-world deployments. To apply our deployment planning, we also presented performance models for the DIET middleware. Since our performance model and deployment approach focus on maximizing steady-state throughput, our experiments focused on the maximum sustained throughput provided by different deployments. We presented experiments testing the accuracy of our throughput performance models, and tested whether the deployment selected by our approach provides good throughput compared to other reasonable deployments.

A tool based on our automatic deployment planning models has been outlined in Chapter 9.

Future Work

This work has opened up further possibilities for improving the performance of NES environments and their ease of use.

In the near future, one of the principal objectives is the implementation of the theoretical deployment planning techniques as the Automatic Deployment Planning Tool (ADePT) presented in Chapter 9. In that chapter we presented the initial steps to implement the theoretical concepts of this thesis as a tool that will help users to easily deploy a middleware on available resources.

I would also like to compare the heuristic for middleware deployment on homogeneous nodes, presented in Chapter 5, with the theoretically proved optimal deployment planning on homogeneous nodes, presented in Chapter 6.

We then plan to test the deployment planning for mixed job types. The presented experiments were done by submitting only one type of job (DGEMM) at a time to the deployed hierarchy. A very small experiment was done in Chapter 6 to check whether the resources should be divided into sets of equal numbers of nodes, where the number of sets depends on the number of job types being submitted to the deployed hierarchy. In this small experiment we found that dividing the available resources into sets is not advantageous. But many things remain to be tested: in what ratio should we divide the available nodes? Would it be better to divide the available nodes into sets according to the problem load? If not, how should we calculate the parametric values used to compute the throughput of the deployment phases for mixed jobs? Should we take a weighted average over the mixed problems, or use some other criterion?

The heuristic presented for heterogeneous resources has one main constraint on the links between the available resources: we assumed that all nodes have the same bandwidth, which is not true in the real world. For heterogeneous resources, we plan to generate different heuristics by fixing one constraint at a time (like the communication link in the presented heuristic) and then compare these heuristics to find one that works efficiently for heterogeneous resources.

All the presented heuristics and algorithms for deployment planning consider the throughput of each deployment phase, which is calculated from the computation parametric values. For DIET, we calculate the parametric values (amount of computation needed for prediction by a server, size of an incoming request, size of a reply, etc.) by deploying a small hierarchy and sending it the tasks for which we would like to select the best deployment. For DIET it was possible to calculate the parametric values in reasonable time, but this may not be the case for other NES environments. A first solution to avoid this calculation is for NES environments to provide some other way to calculate the throughput of their different deployment phases. If that is not possible, it would be better to develop a tool that can calculate the parametric values for any NES environment.

It would be interesting to validate our theoretical concepts of deployment planning by experimenting with other hierarchical middleware like WebCom [68]. We would also like to implement deployment planning for arbitrary arrangements of distributed resources, such as P2P systems. As most NES environments have their own deployment tool, it would also be interesting to use deployment tools as plug-ins to ADePT. For the experiments, we deployed DIET manually according to the generated deployment plan: we manually submitted the generated hierarchy.xml to GoDIET and launched GoDIET to deploy the hierarchy. If deployment tools were used as plug-ins, then the corresponding deployment tool could be applied directly to the deployment plan generated for any NES environment.

For deployment plan generation, we always take a set of available resources as input. But since grid resources connect and disconnect dynamically, the generated deployment plan may not perform as well as estimated, because some selected resources may no longer be available when a user submits a request. It would therefore be interesting for the deployment planning tool to discover resources at plan-generation time. With resource discovery, new resources become known dynamically; it would then also be interesting to develop redeployment approaches that dynamically adapt the deployment to workload levels after the initial deployment. We also envision interactive deployment and reconfiguration, whereby users can dynamically reconfigure portions of their deployment. Resource discovery and redeployment would additionally provide fault tolerance for the deployed hierarchy.
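Under these assumptions, such a redeployment approach could follow a simple monitor-replan-redeploy cycle, sketched below. The function names, plan representation, and the throughput-gain threshold are placeholders for the sketch, not part of an existing tool:

```python
def redeployment_loop(discover, plan, deploy, measure, rounds,
                      gain_threshold=1.1):
    """Monitor-replan-redeploy cycle: each round, rediscover the currently
    available resources, replan, and redeploy only when the new plan's
    predicted throughput beats the measured one by gain_threshold."""
    deployment = None
    for _ in range(rounds):
        resources = discover()        # dynamic resource discovery
        candidate = plan(resources)   # deployment planning step (e.g. ADePT)
        if deployment is None or \
                candidate["predicted"] > gain_threshold * measure(deployment):
            deployment = deploy(candidate)   # (re)launch the hierarchy
    return deployment
```

The threshold avoids tearing down a working hierarchy for a marginal predicted gain; choosing it amounts to trading redeployment cost against throughput improvement.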

Another direction for future work is a tool for user convenience. While it is advantageous to use a deployment planning tool like ADePT to generate a deployment plan, users may prefer a visual deployment tool. Such a tool would let users build a graphical model of their desired deployment and export this model to a deployment tool for launch.


134 CHAPTER 10. CONCLUSION


List of Figures

1.1 Some of the possible deployments. 5
1.2 Comparison of requests completed by a centralized DIET scheduler versus a three agent distributed DIET scheduler. 6

2.1 Architecture of DIET. 14
2.2 The NeOS architecture 15
2.3 The NetSolve architecture 16
2.4 Architecture of Nimrod-G 16
2.5 Architecture of Ninf 17
2.6 Architecture of PUNCH 18
2.7 Launch of a DIET platform. 20
2.8 NetSolve initialization steps 20
2.9 DIET task execution steps. 22
2.10 NetSolve task execution steps 23
2.11 Nimrod: Work procedure 24
2.12 Working of Ninf 24
2.13 Working of PUNCH 25
2.14 Working of WebCom 26
2.15 Demand-based Scheduling [55] 28
2.16 Screen shot of VizDiet. 36
2.17 Overview of VisPerf Monitoring System [63] 37
2.18 Description: WebCom Integrated Development Environment 38

3.1 Example for priority scheduling algorithm with fallback mechanism. Task id and execution time is written diagonally in each box. 46
3.2 Priority based tasks are executed without fallback mechanism. 47
3.3 Comparison of tasks executed with and without fallback based on tasks priority. 49
3.4 Comparison of tasks executed with and without fallback based on tasks execution duration. 50

4.1 Working steps of GoDIET. 54
4.2 Components of a workflow generation, mapping and execution system. 56
4.3 Process flow graph for solving CPP [58]. 57
4.4 Comparison of some deployment tools. 61

5.1 Classification of the operating models. 65


5.2 Star hierarchies with one or two servers for DGEMM 150x150 requests. (a) Real-world platform throughput for different load levels. (b) Comparison of predicted and actual maximum throughput. 72
5.3 Star hierarchies with one or two servers for DGEMM 10x10 requests. (a) Throughput at different load levels. (b) Comparison of predicted and actual maximum throughput. 73
5.4 Star hierarchies with two servers using a Gb/s or a 100 Mb/s network. Workload was DGEMM 10x10. (a) Throughput at different load levels. (b) Comparison of predicted and actual maximum throughput. 73
5.5 Comparison of automatically-generated hierarchy with hierarchies containing twice as many and half as many servers. 74
5.6 Types of platforms compared. 75
5.7 Comparison of automatically-generated hierarchy with intuitive alternative hierarchies. 75

6.1 Platform deployment architecture and execution phases. 78
6.2 Deployment trees of dMax sets 4 and 6. 81
6.3 Measured and predicted platform throughput for DGEMM size 10; predictions are shown for several bandwidths. 89
6.4 Measured and predicted platform throughput for DGEMM size 1000; predictions are shown for the serial model with bandwidth 190 Mb/s. 90
6.5 Predicted and measured throughput for different CSD trees for DGEMM 200 with 25 available nodes in the Lyon cluster. 91
6.6 Predicted and measured throughput for different CSD trees for DGEMM 200 with 45 available nodes in the Sophia cluster. 92
6.7 Predicted and measured throughput for different CSD trees for DGEMM 310 with 45 available nodes in the Sophia cluster. 92
6.8 Makespan for a group of tasks partitioned to three deployments or sent to a single joint deployment. 94

7.1 Explanation of workload introduced by submitting requests with the increase in the launch of clients on each machine. 105
7.2 Star hierarchies with one or two servers for DGEMM 10x10 requests. (a) Measured throughput for different load levels. (b) Comparison of predicted and measured maximum throughput 106
7.3 Star hierarchies with one or two servers for DGEMM 200x200 requests. (a) Measured throughput for different load levels. (b) Comparison of predicted and measured maximum throughput 107
7.4 Comparison of automatically-generated hierarchy for DGEMM 310x310 with intuitive alternative hierarchies. 108
7.5 Comparison of automatically-generated hierarchy for DGEMM 1000x1000 with intuitive alternative hierarchy. 109

8.1 Architectural model. 112
8.2 Performance calculation by adding LAs. 116
8.3 Performance calculation by adding SeDs. 117


8.4 High speed network (2.5Gb/s) between INRIA research centers and several other research institutes. 118
8.5 Diagrammatic view of testbed. 118
8.6 Throughput of heterogeneous network as more number of LA are added. 119
8.7 Throughput of heterogeneous network as SeD are converted to LA. 120

9.1 Working of Automatic Deployment Planning Tool (ADePT). 123
9.2 An example of resource description XML file. 124
9.3 An example of application description XML file. 125
9.4 ADePT as a deployment planner for ADAGE. 127


List of Tables

2.1 Example of a plug in scheduler parametric table. 27
2.2 Table of Comparison. 40

3.1 Priority, deadline and computation time of each task. 46

6.1 Parameter values for middleware deployment on cluster. 88
6.2 A summary of the percentage of optimal achieved by the deployment selected by our model, a star deployment, and a tri-ary tree deployment. 93
6.3 Predictions for the best degree d, number of agents used |A|, and number of servers used |S| for different DGEMM problem sizes and platform sizes |V|. The platforms are assumed to be larger clusters with the same machine and network characteristics as the Lyon cluster. 93

7.1 A summary of notations used to define platform deployment. 98
7.2 Variables used in Heuristic 7.1. 99
7.3 A summary of notations used to define performance model. 102
7.4 Parameter values for middleware deployment on Lyon site of Grid'5000 106
7.5 A summary of the percentage of optimal achieved by the deployment selected by our heterogeneous heuristic, optimal degree, and optimal homogeneous model. 107

8.1 Parameter values to calculate the throughput of a deployment. 115


Appendix A

Bibliography

[1] Simplifying system deployment using the Dell OpenManage Deployment Toolkit, October 2004. Dell Power Solutions.

[2] D. Abramson, I. Foster, J. Giddy, A. Lewis, R. Sosic, R. Sutherst, and N. White. The Nimrod Computational Workbench: A Case Study in Desktop Metacomputing. In The 20th Australasian Computer Science Conference, Macquarie University, Sydney, Australia, February 1997.

[3] D. Abramson, R. Sosic, J. Giddy, and B. Hall. Nimrod: A tool for performing parametised simulations using distributed workstations. The 4th IEEE Symposium on High Performance Distributed Computing, Virginia, August 1995.

[4] D. Abramson, R. Sosic, J. Giddy, and B. Hall. Nimrod: A tool for performing parametised simulations using distributed workstations. The 4th IEEE Symposium on High Performance Distributed Computing, Virginia, 1995.

[5] M. K. Aguilera and S. Toueg. Randomization and failure detection: A hybrid approach to solve consensus. In Ö. Babaoglu and K. Marzullo, editors, Proceedings of the 10th International Workshop on Distributed Algorithms (WDAG96), volume 1151, pages 29–39, Bologna, Italy, September 1996. Springer-Verlag.

[6] K. Aida, A. Takefusa, H. Ogawa, O. Tatebe, H. Nakada, H. Takagi, Y. Tanaka, S. Matsuoka, M. Sato, S. Sekiguchi, and U. Nagashima. Ninf project. APAN Conference 2000, Aug. 2000.

[7] S. Angelopoulos and A. Borodin. The power of priority algorithms for facility location and set cover, 2002.

[8] G. Antoniu, L. Bougé, and M. Jan. JuxMem: Weaving together the P2P and DSM paradigms to enable a Grid Data-sharing Service. In Kluwer Journal of Supercomputing, 2004.

[9] G. Antoniu, L. Bougé, and M. Jan. JuxMem: An adaptive supportive platform for data sharing on the grid. In IEEE/ACM Workshop on Adaptive Grid Middleware, held in conjunction with 12th Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT 2003), New Orleans, 2003.


[10] G. Antoniu, L. Bougé, M. Jan, and S. Monnet. Large-scale Deployment in P2P Experiments Using the JXTA Distributed Framework. In Euro-Par 2004: Parallel Processing, number 3149 of Lect. Notes in Comp. Science, pages 1038–1047, Pisa, Italy, August 2004.

[11] G. Antoniu, M. Jan, and D. A. Noblet. Enabling the P2P JXTA platform for high-performance networking grid infrastructures. In Proc. of the first Intl. Conf. on High Performance Computing and Communications (HPCC '05), number 3726 in Lect. Notes in Comp. Science, pages 429–440, Sorrento, Italy, September 2005. Springer-Verlag.

[12] D. Arnold, S. Agrawal, S. Blackford, J. Dongarra, M. Miller, K. Sagi, Z. Shi, and S. Vadhiyar. Users' Guide to NetSolve V1.4. UTK Computer Science Dept. Technical Report CS-01-467, 2001.

[13] D. C. Arnold, H. Casanova, and J. Dongarra. Innovations of the NetSolve Grid Computing System. Concurrency and Computation: Practice and Experience, 14(13-15):1457–1479, 2002.

[14] A. S. Artelys, E. D. Dolan, J. P. Goux, R. Fourer, and T. S. Munson. Kestrel: An interface from modeling systems to the NEOS server. Technical report, 2003.

[15] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. Complexity and Approximation – Combinatorial optimization problems and their approximability properties. Springer, 1999.

[16] C. Banino, O. Beaumont, A. Legrand, and Y. Robert. Scheduling strategies for master-slave tasking on heterogeneous processor grids. In PARA'02: International Conference on Applied Parallel Computing, LNCS 2367. Springer Verlag, 2002.

[17] O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert. Bandwidth-centric allocation of independent tasks on heterogeneous platforms. In International Parallel and Distributed Processing Symposium IPDPS'2002. IEEE Computer Society Press, 2002.

[18] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert. Optimizing the steady-state throughput of broadcasts on heterogeneous platforms. Technical Report 2003-34, LIP, June 2003.

[19] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert. Steady-state scheduling on heterogeneous clusters: why and how? In 6th Workshop on Advances in Parallel and Distributed Computational Models, 2004.

[20] R. Bolze, E. Caron, and F. Desprez. A monitoring and visualization tool and its application for a network enabled server platform. In Parallel and Distributed Computing Workshop of ICCSA 2006, LNCS, Glasgow, UK, 8-11 May 2006.

[21] A. Borodin, J. Boyar, and K. S. Larsen. Priority algorithms for graph optimization problems.

[22] S. Bouchenak, F. Boyer, D. Hagimont, S. Krakowiak, A. Mos, N. Palma, V. Quéma, and J. B. Stefani. Architecture-Based Autonomous Repair Management: An Application to J2EE Clusters. In 24th IEEE Symposium on Reliable Distributed Systems (SRDS), Orlando, Florida, USA, October 2005.


[23] M. Brzezniak, T. Makiela, and N. Meyer. Integration of NetSolve with Globus-Based Grids. In GRID '04: Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04), pages 449–455, Washington, DC, USA, 2004. IEEE Computer Society.

[24] R. Buyya. Economic-based distributed resource management and scheduling for grid computing. CoRR, cs.DC/0204048, 2002.

[25] R. Buyya, D. Abramson, and J. Giddy. Nimrod/G: An architecture of a resource management and scheduling system in a global computational grid. In HPC Asia 2000, pages 283–289, Beijing, China, May 2000.

[26] Y. Caniou and E. Jeannot. Efficient scheduling heuristics for GridRPC systems. In IEEE QoS and Dynamic System workshop (QDS) of International Conference on Parallel and Distributed Systems (ICPADS), pages 621–630, Newport Beach, California, USA, 2004.

[27] F. Cappello, S. Djilali, G. Fedak, T. Hérault, F. Magniette, V. Néri, and O. Lodygensky. Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid. Future Generation Comp. Syst., 21(3):417–437, 2005.

[28] E. Caron and F. Desprez. DIET: A Scalable Toolbox to Build Network Enabled Servers on the Grid. International Journal of High Performance Computing Applications, 2006. To appear.

[29] E. Caron, F. Desprez, F. Petit, and C. Tedeschi. Resource Localization Using Peer-To-Peer Technology for Network Enabled Servers. Research report 2004-55, LIP, 2004.

[30] H. Casanova. SIMGRID: A toolkit for the simulation of application scheduling. In Proceedings of the IEEE Symposium on Cluster Computing and the Grid (CCGrid'01). IEEE Computer Society, May 2001.

[31] H. Casanova and J. Dongarra. NetSolve: a network server for solving computational science problems. In Supercomputing '96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), page 40, 1996.

[32] H. Casanova and J. Dongarra. NetSolve's network enabled server: Examples and applications. Technical Report UT-CS-97-392, 1997.

[33] H. Casanova and J. Dongarra. NetSolve version 1.2: Design and implementation. LAPACK Working Note 140, Department of Computer Science, University of Tennessee, Knoxville, TN 37996, USA, November 1998. UT-CS-98-406, Nov 1998.

[34] T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.

[35] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. V. de Geijn. Parallel implementation of BLAS: General techniques for level 3 BLAS. Technical Report CS-TR-95-40, University of Texas, Austin, October 1995.

[36] E. E. Cobb. TP monitors and ORBs: A superior client/server alternative. Object Magazine 4, February 1995.


[37] S. Dahan, J. M. Nicod, and L. Philippe. Scalability in a GRID server discovery mechanism. In 10th IEEE Int. Workshop on Future Trends of Distributed Computing Systems, pages 46–51, Suzhou, China, May 2004. IEEE Press.

[38] S. Dandamudi and S. Ayachi. Performance of Hierarchical Processor Scheduling in Shared-Memory Multiprocessor Systems. IEEE Trans. on Computers, 48(11):1202–1213, 1999.

[39] E. Deelman, J. Blythe, Y. Gil, and C. Kesselman. Pegasus: Planning for Execution in Grids. GriPhyN technical report 2002-20, 2005.

[40] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M. H. Su, K. Vahi, and M. Livny. Pegasus: Mapping Scientific Workflows onto the Grid. In European Across Grids Conference, pages 11–20, 2004.

[41] E. Deelman, G. Singh, M. H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, and K. Vahi. Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems. Scientific Programming Journal, 2006. To appear.

[42] B. Del-Fabbro, D. Laiymani, and L. Philippe. Data Management in Grid Applications Providers. In Procs of the 1st IEEE Int. Conf. on Distributed Frameworks for Multimedia Applications, DFMA'2005, pages 315–322, Besançon, France, February 2005.

[43] F. Desprez, M. Quinson, and F. Suter. Dynamic performance forecasting for network enabled servers in an heterogeneous environment. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001). CSREA Press, June 25-28 2001.

[44] E. Dolan, P. Hovland, J. Moré, B. Norris, and B. Smith. Remote Access to Mathematical Software. Technical Report ANL/MCS-P889-0601, June 2001.

[45] E. D. Dolan, R. Fourer, J. J. Moré, and T. S. Munson. Optimization on the NEOS Server. Technical report, Mathematics and Computer Science Division, Argonne National Laboratory, July 2002.

[46] S. N. Foley, B. P. Mulcahy, and T. B. Quillinan. Dynamic Administrative Coalitions with WebCom DAC. In WeB2004 The Third Workshop on e-Business, Washington D.C., USA, December 2004.

[47] S. N. Foley, T. B. Quillinan, J. P. Morrison, D. A. Power, and J. J. Kennedy. Exploiting KeyNote in WebCom: Architecture Neutral Glue for Trust Management. In Proceedings of the Nordic Workshop on Secure IT Systems Encouraging Co-operation, Reykjavik University, Reykjavik, Iceland, 2000.

[48] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128, Summer 1997.

[49] J. Frey. Condor DAGMan: Handling inter-job dependencies, 2002. http://www.cs.wisc.edu/condor/dagman/.


[50] P. Goldsack and P. Toft. SmartFrog: a framework for configuration. In Large Scale System Configuration Workshop. National e-Science Centre UK, 2001. http://www.hpl.hp.com/research/smartfrog/.

[51] A. Halderen, B. Overeinder, and P. Sloot. Hierarchical Resource Management in the Polder Metacomputing Initiative. Parallel Computing, 24:1807–1825, 1998.

[52] R. S. Hall, D. Heimbigner, and A. L. Wolf. A Cooperative Approach to Support Software Deployment Using the Software Dock. In Proceedings of the 21st International Conference on Software Engineering, pages 174–183. ACM Press, May 1999.

[53] R. Henderson and D. Tweten. Portable batch system: External reference specification. Technical report, NASA, Ames Research Center, 1996.

[54] N. H. Kapadia, J. Fortes, and C. E. Brodley. Application performance modelling in a computational grid environment. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing (HPDC), 1999.

[55] N. H. Kapadia and J. A. B. Fortes. PUNCH: An architecture for web-enabled wide-area network-computing. Cluster Computing, 2(2):153–164, 1999.

[56] T. Kichkaylo, A. Ivan, and V. Karamcheti. Constrained component deployment in wide area networks using AI planning techniques. In International Parallel and Distributed Processing Symposium, April 2003.

[57] T. Kichkaylo, A. Ivan, and V. Karamcheti. Sekitei: An AI planner for Constrained Component Deployment in Wide-Area Networks. Technical report 2002-851, 2004.

[58] T. Kichkaylo, A. A. Ivan, and V. Karamcheti. Constrained component deployment in wide-area networks using AI planning techniques. In 17th International Parallel and Distributed Processing Symposium (IPDPS-2003), pages 3–3, Los Alamitos, CA, 2003. IEEE Computer Society.

[59] T. Kichkaylo and V. Karamcheti. Optimal resource aware deployment planning for component based distributed applications. In The 13th High Performance Distributed Computing, June 2004.

[60] T. Kosar and M. Livny. Stork: Making data placement a first class citizen in the grid. In ICDCS, pages 342–349, 2004.

[61] S. Lacour. Contribution à l'automatisation du déploiement d'applications sur des grilles de calcul. PhD thesis, Université de Rennes 1, 2005.

[62] S. Lacour, C. Pérez, and T. Priol. Deploying CORBA components on a computational grid: General principles and early experiments using the Globus Toolkit. In W. Emmerich and A. L. Wolf, editors, Proceedings of the 2nd International Working Conference on Component Deployment (CD 2004), number 3083 in Lect. Notes in Comp. Science, pages 35–49, Edinburgh, Scotland, UK, May 2004. Springer-Verlag. Held in conjunction with the 26th International Conference on Software Engineering (ICSE 2004).


[63] D. Lee, J. Dongarra, and R. S. Ramakrishna. VisPerf: Monitoring tool for grid computing. In International Conference on Computational Science, pages 233–243, 2003.

[64] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund. Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. J. Parallel Distrib. Comput., 59(2):107–131, 1999.

[65] C. Martin and O. Richard. Parallel launcher for cluster of PC. In Parallel Computing, Proceedings of the International Conference, September 2001.

[66] S. Matsuoka, H. Nakada, M. Sato, and S. Sekiguchi. Design Issues of Network Enabled Server Systems for the Grid, 2000. Advanced Programming Models Working Group Whitepaper, Global Grid Forum.

[67] J. P. Morrison, J. J. Kennedy, and D. A. Power. WebCom: A Volunteer-Based Metacomputer. The Journal of Supercomputing, 18(1):47–61, January 2001.

[68] J. P. Morrison, J. J. Kennedy, and D. A. Power. WebCom: A Web-Based Distributed Computation Platform. Proceedings of Distributed Computing on the Web, Rostock, Germany, June 21-23, 1999.

[69] J. P. Morrison, J. J. Kennedy, and D. A. Power. WebCom: A Web-Based Distributed Computation Platform. Proceedings of Distributed Computing on the Web, Rostock, Germany, June 21-23, 1999.

[70] J. P. Morrison and D. A. Power. Master Promotion & Client Redirection in the WebCom System. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000), Las Vegas, Nevada, June 26-29, 2000.

[71] H. Nakada, M. Sato, and S. Sekiguchi. Design and Implementations of Ninf: towards a Global Computing Infrastructure. Future Generation Computing Systems, 15(5-6):649–658, 1999.

[72] H. Nakada, H. Takagi, S. Matsuoka, U. Nagashima, M. Sato, and S. Sekiguchi. Utilizing the Metaserver Architecture in the Ninf Global Computing System. In HPCN Europe 1998: Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking, pages 607–616, London, UK, 1998. Springer-Verlag.

[73] D. A. Reed, C. L. Mendes, C. da Lu, I. Foster, and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure - Application Tuning and Adaptation. Morgan Kaufman, San Francisco, CA, second edition, 2003. pp. 513–532.

[74] O. Regev. Priority algorithms for makespan minimization in the subset model. Inf. Process. Lett., 84(3):153–157, 2002.

[75] J. Santoso, G. van Albada, B. Nazief, and P. Sloot. Simulation of Hierarchical Job Management for Meta-Computing Systems. International Journal of Foundations of Computer Science, 12(5):629–643, 2001.

[76] K. Seymour, H. Nakada, S. Matsuoka, J. Dongarra, C. Lee, and H. Casanova. GridRPC: A remote procedure call API for grid computing.


[77] B. A. Shirazi, K. M. Kavi, and A. R. Hurson, editors. Scheduling and Load Balancing in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA, USA, 1995.

[78] G. Singh, E. Deelman, G. Mehta, K. Vahi, M. H. Su, G. B. Berriman, J. Good, J. C. Jacob, D. S. Katz, A. Lazzarini, K. Blackburn, and S. Koranda. The Pegasus portal: web based grid computing. In SAC, pages 680–686, 2005.

[79] A. Takefusa, H. Casanova, S. Matsuoka, and F. Berman. A study of deadline scheduling for client-server systems on the computational grid. In the 10th IEEE Symposium on High Performance and Distributed Computing (HPDC'01), San Francisco, California, 2001.

[80] D. Thain, T. Tannenbaum, and M. Livny. Condor and the Grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons Inc., December 2002.

[81] R. Veldema, R. van Nieuwport, J. Maassen, H. Bal, and A. Plaat. Efficient remote method invocation, 1998.

[82] R. Wolski, N. Spring, and J. Hayes. The Network Weather Service: A distributed resource performance forecasting service for metacomputing. The Journal of Future Generation Computing Systems, 15(5-6):757–768, 1999.


Appendix B

Publications

International Journal Articles

[83] P.-K. Chouhan, H. Dail, E. Caron, and F. Vivien. Automatic middleware deployment planning on clusters. International Journal of High Performance Computing Applications, 20(4):517–530, Nov. 2006.

Conference Articles

[84] A. Amar, R. Bolze, A. Bouteiller, P. K. Chouhan, A. Chis, Y. Caniou, E. Caron, H. Dail, B. Depardon, F. Desprez, J.-S. Gay, G. Le Mahec, and A. Su. DIET: New developments and recent results. In CoreGRID Workshop on Grid Middleware (in conjunction with EuroPar 2006), Dresden, Germany, August 28-29 2006. Springer.

[85] E. Caron, P. K. Chouhan, and H. Dail. GoDIET: A Deployment Tool for Distributed Middleware on Grid 5000. In EXPEGRID workshop at HPDC2006, Paris, June 2006.

[86] E. Caron, P. K. Chouhan, H. Dail, and F. Vivien. How should you Structure your Hierarchical Scheduler? In IEEE International Conference HPDC 2006, Paris, France, June 2006. Poster.

[87] E. Caron, P. K. Chouhan, and F. Desprez. Deadline Scheduling with Priority for Client-Server Systems on the Grid. In R. Buyya, editor, Grid Computing 2004. IEEE International Conference On Grid Computing. Super Computing 2004, Pittsburgh, Pennsylvania, October 2004. Short Paper.

[88] E. Caron, P. K. Chouhan, and A. Legrand. Automatic Deployment for Hierarchical Network Enabled Server. In The 13th Heterogeneous Computing Workshop (HCW 2004), Santa Fe, New Mexico, 2004.

Research Reports

[89] E. Caron, P. K. Chouhan, and H. Dail. Automatic middleware deployment planning on clusters. Technical Report 2005-26, Laboratoire de l'Informatique du Parallélisme (LIP), May 2005. Also available as INRIA Research Report RR-5573.


[90] E. Caron, P. K. Chouhan, and H. Dail. GoDIET: A deployment tool for distributed middleware on Grid 5000. Technical Report RR-5886, Institut National de Recherche en Informatique et en Automatique (INRIA), April 2006. Also available as LIP Research Report 2006-17.

[91] E. Caron, P. K. Chouhan, H. Dail, and F. Vivien. Automatic middleware deployment planning on clusters. Research report 2005-50, Laboratoire de l'Informatique du Parallélisme (LIP), 2005. Revised version of LIP Research Report 2005-26.

[92] E. Caron, P. K. Chouhan, and F. Desprez. Deadline scheduling with priority for client-server systems. Research report RR-5335, INRIA, October 2004. Also available as LIP Research Report 2004-33.

[93] E. Caron, P. K. Chouhan, and A. Legrand. Automatic deployment for hierarchical network enabled server. Research report 2003-51, Laboratoire de l'Informatique du Parallélisme (LIP), November 2003. Also available as INRIA Research Report RR-5146.