Windows Azure as a Platform as a Service (PaaS)

DEUTSCH-FRANZÖSISCHE SOMMERUNIVERSITÄT FÜR NACHWUCHSWISSENSCHAFTLER 2011
UNIVERSITÉ D'ÉTÉ FRANCO-ALLEMANDE POUR JEUNES CHERCHEURS 2011
CLOUD COMPUTING: DÉFIS ET OPPORTUNITÉS / CLOUD COMPUTING: HERAUSFORDERUNGEN UND MÖGLICHKEITEN
(German-French Summer University for Young Researchers 2011: Cloud Computing – Challenges and Opportunities)
17.7. – 22.7.2011

Jared Jackson, Microsoft Research



Before we begin – Some Results

Vanilla 23
Chocolate 13
Strawberry 10
Coffee 3
Banana 3
Pistachio 7
Mango 3
Amarena 3
Malaga 3
Cherry 3
Tiramisu 3
Stracciatella 10
Cheesecake 3
Cookies and Cream 3
Walnut 3
Cinnamon 3

Favorite Ice Cream

Vanilla 33
Chocolate 11
Strawberry 5
Coffee 2
Cherry 2
Cookies and Cream 4
Butter Pecan 7
Neapolitan 4
Chocolate Chip 4
Other 29

Ice Cream Consumption

Source: International Ice Cream Association (makeicecream.com)

Windows Azure Overview


Web Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model


Web Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances

Worker Role Instances

Azure Storage (Blob, Queue, Table)

SQL Azure

Windows Azure Application Model

Key Components

Fabric Controller
• Manages hardware and virtual machines for the service

Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM

Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud

Key Components – Fabric Controller

• Think of it as an automated IT department
• "Cloud Layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines

Key Components – Fabric Controller

• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

Key Components – Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers / switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's the perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web Front End
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue (a sketch of the pattern follows below):
• Decouple parts of the application – easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
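To make the suggested model concrete, here is a minimal sketch of the work-ticket flow with the StorageClient library. The queue name, message content, and ProcessWorkItem hook are illustrative; AccountInformation is the credentials helper shown later in this deck.

// Shared setup (web and worker roles).
CloudStorageAccount account = new CloudStorageAccount(AccountInformation.Credentials, false);
CloudQueueClient queueClient = account.CreateCloudQueueClient();
CloudQueue queue = queueClient.GetQueueReference("workitems");
queue.CreateIfNotExist();

// Web role: enqueue a work ticket that points at the data to process.
queue.AddMessage(new CloudQueueMessage("school/SummerSchoolAttendees.txt"));

// Worker role: poll for tickets, process, then delete.
while (true)
{
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(2)); // invisible for 2 minutes
    if (msg != null)
    {
        ProcessWorkItem(msg.AsString); // illustrative; must be idempotent (at-least-once delivery)
        queue.DeleteMessage(msg);      // delete only after successful processing
    }
    else
    {
        Thread.Sleep(1000);            // simple back-off when the queue is empty
    }
}

Because the message only becomes invisible (rather than deleted) on GetMessage, a worker that crashes mid-task does not lose the work: the message reappears after the timeout and another worker picks it up.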

Key Components – Compute: VM Roles

• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
• Azure runs your VM role using:
  • Base OS
  • Differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management: Provide code + service model

• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files (illustrative sketches follow below):

ServiceDefinition.csdef – Service Definition
ServiceConfiguration.cscfg – Service Configuration
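As a rough illustration, the two files might look like the following. This is a minimal sketch: the service name, role name, and setting are invented, and element details vary by SDK version.

<!-- ServiceDefinition.csdef - the "shape" of the service (illustrative) -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
  </WebRole>
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg - the settings for one deployment (illustrative) -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="3" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString"
               value="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>

Note the split: the definition declares what settings and endpoints exist (the "shape"), while the configuration supplies values and the instance count, so you can change scale without repackaging code.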

GUI: double-click on the role name in the Azure project

Deploying to the cloud

• We can deploy from the portal or from a script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API
  • Easy to roll your own
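A minimal "roll your own" sketch, listing hosted services over the management REST endpoint. The subscription ID and certificate file are placeholders, and the x-ms-version value is one of several valid API versions.

using System;
using System.IO;
using System.Net;
using System.Security.Cryptography.X509Certificates;

class ListHostedServices
{
    static void Main()
    {
        string subscriptionId = "<your-subscription-id>"; // placeholder

        var request = (HttpWebRequest)WebRequest.Create(
            "https://management.core.windows.net/" + subscriptionId + "/services/hostedservices");
        request.Headers.Add("x-ms-version", "2010-10-28");          // required API version header
        request.ClientCertificates.Add(
            new X509Certificate2("management.pfx", "password"));    // X509 cert for authentication

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd()); // XML list of your hosted services
        }
    }
}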

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage

Durable Storage, At Massive Scale

Blob
– Massive files, e.g. videos, logs

Drive
– Use standard file system APIs

Tables
– Non-relational, but with few scale limits
– Use SQL Azure for relational data

Queues
– Facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
    • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
    • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
    • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables
    • Billions of entities (rows) and TBs of data
    • Can use thousands of servers as traffic grows
  • Highly available & durable
    • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key
  • Different for each data type (blobs, entities, queues)
• The partition key is the unit of scale
  • Controls entity locality
  • A partition can be served by a single server
  • The system load balances partitions based on traffic patterns
• System load balancing
  • Load balancing can take a few minutes to kick in
  • It can take a couple of seconds for a partition to become available on a different server
• "Server Busy" means the limits of a single partition have been reached
  • Use exponential backoff on "Server Busy"
  • The system load balances to meet your traffic needs

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey
  • Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name
  • Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages – Queue name
  • All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff (a sketch follows below).
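A minimal retry sketch for the 503 case, assuming the StorageClient library; the operation, attempt limit, and delays are illustrative.

// Truncated exponential backoff for "503 Server Busy" (sketch).
TimeSpan delay = TimeSpan.FromMilliseconds(100);
TimeSpan maxDelay = TimeSpan.FromSeconds(30);
for (int attempt = 0; attempt < 10; attempt++)
{
    try
    {
        blob.UploadText(payload);   // any storage operation
        break;                      // success - stop retrying
    }
    catch (StorageServerException) // thrown for 5xx responses, including 503
    {
        Thread.Sleep(delay);
        // Double the delay, truncated at the maximum.
        delay = TimeSpan.FromMilliseconds(
            Math.Min(delay.TotalMilliseconds * 2, maxDelay.TotalMilliseconds));
    }
}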

Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

(The slides show the same table split into two partition ranges – Action–Animation and Comedy–War – served from different servers.)

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
  • Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

• Maximum of 1000 rows in a response
• Tokens are returned at the end of a partition range boundary
• Maximum of 5 seconds to execute the query
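A minimal sketch of handling continuation with the StorageClient library: wrapping the query with AsTableServiceQuery() makes enumeration follow continuation tokens transparently. The MovieEntity type and table name are illustrative.

// Query that transparently follows continuation tokens (sketch).
TableServiceContext ctx = tableClient.GetDataServiceContext();
CloudTableQuery<MovieEntity> query =
    (from m in ctx.CreateQuery<MovieEntity>("Movies")
     where m.PartitionKey == "Action"
     select m).AsTableServiceQuery();  // wraps the query with continuation handling

foreach (MovieEntity movie in query)   // enumeration issues follow-up requests as needed
{
    Console.WriteLine(movie.RowKey);
}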

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale
  • Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy means either:
    • Partitions are being load balanced to meet traffic needs
    • The load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
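A short sketch of these tips in code; the entity type, table name, and email are illustrative.

TableServiceContext ctx = tableClient.GetDataServiceContext(); // fresh context per logical operation
ctx.IgnoreResourceNotFoundException = true;                    // point query returns null instead of throwing

AttendeeEntity attendee =
    (from a in ctx.CreateQuery<AttendeeEntity>("Attendees")
     where a.PartitionKey == "SummerSchool" && a.RowKey == "someone@example.org"
     select a).FirstOrDefault();                               // null if the entity does not exist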

Queues: Their Unique Role in Building Reliable, Scalable Applications

• We want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology: Message Lifecycle

(Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue; Worker Roles call GetMessage with a visibility timeout to retrieve messages and RemoveMessage to delete them after processing.)

PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos-Queue-Service/1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1. A sketch follows below.
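A minimal sketch of this polling loop; the bounds and the Process hook are illustrative.

// Truncated exponential back-off polling of a queue (sketch).
int intervalSeconds = 1;
const int maxIntervalSeconds = 60;
while (true)
{
    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        Process(msg);                 // illustrative processing hook
        queue.DeleteMessage(msg);
        intervalSeconds = 1;          // successful poll: reset the interval
    }
    else
    {
        // Empty poll: double the interval, truncated at the maximum.
        intervalSeconds = Math.Min(intervalSeconds * 2, maxIntervalSeconds);
    }
    Thread.Sleep(TimeSpan.FromSeconds(intervalSeconds));
}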

Removing Poison Messages

(Walkthrough diagrams: producers P1 and P2 feed a queue; consumers C1 and C2 dequeue with a 30-second visibility timeout.)

Scenario 1 – normal dequeue:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 – a consumer crashes:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 – detecting a poison message by its dequeue count:
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)

A sketch of this dequeue-count check follows below.
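A minimal sketch of the dequeue-count check; the threshold and the logging/processing hooks are illustrative.

// Removing poison messages using DequeueCount (sketch).
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > 2)
    {
        // Message has repeatedly failed processing: treat it as poison.
        LogPoisonMessage(msg);        // illustrative - e.g. copy to a "dead letter" store
        queue.DeleteMessage(msg);
    }
    else
    {
        Process(msg);                 // illustrative processing hook
        queue.DeleteMessage(msg);
    }
}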

Queues Recap

• No need to deal with failures → make message processing idempotent
• Invisible messages result in out-of-order delivery → do not rely on order
• Enforce a threshold on a message's dequeue count → use the dequeue count to remove poison messages
• Messages > 8 KB → use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers → use the message count to scale

Windows Azure Storage Takeaways

Blobs
Drives
Tables
Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();

        // Make the container's blobs publicly readable.
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type
    // (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at
http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>
Or, in this case:
http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    ...
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table - or other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library (see the sketch below)
    • Data parallelism
    • Task parallelism
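A minimal Task Parallel Library sketch (.NET 4) showing both forms of parallelism; the work function is illustrative.

using System;
using System.Threading.Tasks;

class ConcurrencyDemo
{
    static void Main()
    {
        // Data parallelism: partition a loop across all available cores.
        int[] results = new int[1000];
        Parallel.For(0, results.Length, i =>
        {
            results[i] = DoCpuBoundWork(i);
        });

        // Task parallelism: run independent operations concurrently.
        Task<int> t1 = Task.Factory.StartNew(() => DoCpuBoundWork(1));
        Task<int> t2 = Task.Factory.StartNew(() => DoCpuBoundWork(2));
        Task.WaitAll(t1, t2);
        Console.WriteLine(t1.Result + t2.Result);
    }

    static int DoCpuBoundWork(int seed)
    {
        return seed * seed; // placeholder for real CPU-bound work
    }
}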

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories:
  • Memory intensive
  • CPU intensive
  • Network I/O intensive
  • Storage I/O intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure or poor user experience due to not having excess capacity, and the cost of having idling VMs

Performance vs. Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content is Gzipped and minified – JavaScript, CSS, images – to produce compressed content.)
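A minimal sketch of Gzip compression before upload, using GZipStream from the .NET base class library; the helper name is illustrative.

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] GzipString(string content)
{
    byte[] raw = Encoding.UTF8.GetBytes(content);
    using (MemoryStream output = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length); // compress the raw bytes into the stream
        }
        return output.ToArray();            // compressed bytes, ready to upload
    }
}

// When serving the compressed bytes, set the response header so browsers
// decompress on the fly:
//     Content-Encoding: gzip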

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School.

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result reduction processing

Large volume data
• A normal BLAST database can be as large as 10 GB
  • 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern

• Leverage the multi-core capacity of one instance
  • Argument "-a" of NCBI-BLAST
  • 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads
    • NCBI-BLAST overhead
    • Data-transfer overhead
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long waiting time in case of instance failure
• Best practice:
  • Estimate the value based on the number of pair-bases in the partition and test runs
  • Watch out for the 2-hour maximum limitation

(Diagram: a splitting task fans out into parallel BLAST tasks, whose outputs feed a merging task.)

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, tracking jobs in an Azure Table job registry; tasks flow through a global dispatch queue to Worker instances; Azure Blob storage holds the BLAST/NCBI databases and temporary data, maintained by a database-updating role. A splitting task fans out into parallel BLAST tasks whose outputs feed a merging task.)

• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

• Blasted ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
  • In total, 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• This scale of experiment is usually infeasible for most scientists

• Allocated a total of ~4,000 cores
  • 475 extra-large VMs (8 cores per VM), across four data centers: US (2), West and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually


• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in a group (~6 nodes per group, ~30 mins apart) – this is an update domain at work.

35 nodes experienced blob-writing failures at the same time.

West Europe Data Center: 30,976 tasks were completed, and the job was killed.

A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ Rn + ρa cp δq ga) / ((Δ + γ (1 + ga/gs)) λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J g⁻¹)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: the AzureMODIS Service Web Role Portal receives requests into a Request Queue; the Data Collection Stage pulls source imagery and metadata from download sites; tiles then flow through the Reprojection Queue, Reduction 1 Queue, and Reduction 2 Queue across the Reprojection, Derivation Reduction, and Analysis Reduction stages; scientists retrieve science results via the Download Queue.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes science or other executables
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the GenericWorker (Worker Role) pulls from the <PipelineStage> Task Queue, persists <PipelineStage>TaskStatus, and reads/writes <Input>Data Storage.)

(Reprojection diagram: a Reprojection Request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus and ReprojectionTaskStatus and dispatches through the Job and Task Queues; each job-queue entity specifies a single reprojection job request, and each task-queue entity specifies a single reprojection task, i.e. a single tile. GenericWorkers query the ScanTimeList table for the satellite scan times that cover a target tile and the SwathGranuleMeta table for geo-metadata (e.g. boundaries) of each swath tile, reading Swath Source Data Storage and writing Reprojection Data Storage.)

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Cost breakdown by stage (as read from the slide figures):
• Data Collection Stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Before we begin ndash Some Results

Vanilla 23

Chocolate 13

Strawberry 10

Coffee 3

Banana 3

Pistachio 7

Mango 3

Amarena 3

Malaga 3

Cherry 3

Tiramisu 3

Stratiatella 10

Cheescake 3

Cookies and Cream

3

Walnut 3

Cinamon 3

Favorite Ice Cream

Vanilla 33

Chocolate 11

Strawberry 5

Coffee 2

Cherry 2

Cookies and Cream 4

Butter Pecan 7

Neapolitan 4

Chocolate Chip 4

Other 29

Ice Cream Consumption

Source International Ice Cream Association (makeicecreamcom)

Windows Azure Overview

4

Web Application Model Comparison

Machines Running IIS ASPNET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

5

Web Application Model Comparison

Machines Running IIS ASPNET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances Worker Role

Instances

Azure Storage Blob Queue Table

SQL Azure

Windows Azure Application Model

Key Components Fabric Controller

bull Manages hardware and virtual machines for service

Compute bull Web Roles

bull Web application front end

bull Worker Roles bull Utility compute

bull VM Roles bull Custom compute role bull You own and customize the VM

Storage bull Blobs

bull Binary objects

bull Tables bull Entity storage

bull Queues bull Role coordination

bull SQL Azure bull SQL in the cloud

Key Components Fabric Controller

bull Think of it as an automated IT department

bull ldquoCloud Layerrdquo on top of

bull Windows Server 2008

bull A custom version of Hyper-V called the Windows Azure Hypervisor

bull Allows for automated management of virtual machines

Key Components Fabric Controller

bull Think of it as an automated IT department bull ldquoCloud Layerrdquo on top of

bull Windows Server 2008

bull A custom version of Hyper-V called the Windows Azure Hypervisor bull Allows for automated management of virtual machines

bull Itrsquos job is to provision deploy monitor and maintain applications in data centers

bull Applications have a ldquoshaperdquo and a ldquoconfigurationrdquo bull The configuration definition describes the shape of a service

bull Role types

bull Role VM sizes

bull External and internal endpoints

bull Local storage

bull The configuration settings configures a service bull Instance count

bull Storage keys

bull Application-specific settings

Key Components Fabric Controller

bull Manages ldquonodesrdquo and ldquoedgesrdquo in the ldquofabricrdquo (the hardware) bull Power-on automation devices bull Routers Switches bull Hardware load balancers bull Physical servers bull Virtual servers

bull State transitions bull Current State bull Goal State bull Does what is needed to reach and maintain the goal state

bull Itrsquos a perfect IT employee bull Never sleeps bull Doesnrsquot ever ask for raise bull Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components ndash Compute Web Roles

Web Front End

bull Cloud web server

bull Web pages

bull Web services

You can create the following types

bull ASPNET web roles

bull ASPNET MVC 2 web roles

bull WCF service web roles

bull Worker roles

bull CGI-based web roles

Key Components ndash Compute Worker Roles

bull Utility compute

bull Windows Server 2008

bull Background processing

bull Each role can define an amount of local storage

bull Protected space on the local drive considered volatile storage

bull May communicate with outside services

bull Azure Storage

bull SQL Azure

bull Other Web services

bull Can expose external and internal endpoints

Suggested Application Model Using queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application glue bull Decouple parts of application easier to scale independently bull Resource allocation different priority queues and backend servers bull Mask faults in worker roles (reliable messaging)

Key Components ndash Compute VM Roles

bull Customized Role

bull You own the box

bull How it works

bull Download ldquoGuest OSrdquo to Server 2008 Hyper-V

bull Customize the OS as you need to

bull Upload the differences VHD

bull Azure runs your VM role using

bull Base OS

bull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> — or, in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt
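Because InitContainer made the container public, the blob can be read back with plain HTTP and no storage credentials — a quick sketch:

    using System.Net;

    using (WebClient web = new WebClient())
    {
        string csv = web.DownloadString(
            "http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt");
        // csv now holds the attendee list written above.
    }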

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // …
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table – or other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.
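When the table does not fit in memory, a point query on PartitionKey + RowKey is the efficient alternative to caching everything. A sketch, assuming the same Context and tableName; GetAttendeeDirect is a hypothetical addition to the helper:

    public AttendeeEntity GetAttendeeDirect(string email)
    {
        var query = (from a in Context.CreateQuery<AttendeeEntity>(tableName)
                     where a.PartitionKey == "SummerSchool" && a.RowKey == email
                     select a).AsTableServiceQuery();
        foreach (AttendeeEntity a in query)
            return a;    // at most one entity can match a full key
        return null;
    }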

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs.
• The fundamental choice: fewer, larger VMs vs. many smaller instances.
• If you scale better than linearly across cores, larger VMs could save you money — but it is pretty rare to see linear scaling across 8 cores.
• More instances may provide better uptime and reliability (more failures are needed to take your service down).
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows.
• 1 role instance need not mean one specific task for your code.
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU.
• Balance using up the CPU against keeping free capacity for times of need.
• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency — though this may not be ideal if the number of active processes exceeds the number of cores.
• Use multithreading aggressively.
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads.
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below).
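A small sketch of both TPL styles on .NET 4; CompressFile, RenderThumbnails, and BuildSearchIndex are stand-ins:

    using System.Threading.Tasks;

    // Data parallelism: the same operation over every item.
    Parallel.ForEach(inputFiles, file => CompressFile(file));

    // Task parallelism: distinct operations running concurrently.
    Task render = Task.Factory.StartNew(() => RenderThumbnails());
    Task index  = Task.Factory.StartNew(() => BuildSearchIndex());
    Task.WaitAll(render, index);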

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive.
• Find pieces of code that are intensive with different resources to live together.
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled).
• Spinning VMs up and down automatically is good at large scale.
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running.
• Being too aggressive in spinning down VMs can result in poor user experience.
• It is a performance-vs.-cost trade-off: the risk of failure or poor user experience from lacking excess capacity against the cost of idling VMs.

Storage Costs

• Understand your application's storage profile and how storage billing works.
• Make service choices based on your app profile — e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction — so the service choice can make a big cost difference.
• Caching and compressing help a lot with storage costs.

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile.
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places.
• Sending fewer things also means your VM has time to do other tasks.
• All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (a sketch follows this list).
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.
2. Trade off compute costs for storage size.
3. Minimize image sizes.
   • Use Portable Network Graphics (PNGs).
   • Crush your PNGs, strip needless metadata, and make all PNGs palette PNGs.
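A minimal sketch of point 1 using System.IO.Compression from the .NET base class library:

    using System.IO;
    using System.IO.Compression;

    static byte[] Compress(byte[] raw)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

Remember to also set the Content-Encoding: gzip header on the response so browsers decompress on the fly.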

[Diagram: Uncompressed Content → Gzip → Compressed Content; minify JavaScript, CSS, and images along the way.]

Best Practices Summary

• Doing 'less' is the key to saving costs.
• Measure everything.
• Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides.

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web: a Hadoop implementation. Hadoop has a long history and has been improved for stability, but it was originally designed for cluster systems.
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure: designed from the start to use cloud primitives, with built-in fault tolerance and a REST-based interface for writing your own clients.

Project Daytona – Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics.
• Identifies similarity between bio-sequences.

Computationally intensive:
• Large number of pairwise alignment operations.
• A BLAST run can take 700–1,000 CPU hours.
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months.

It is easy to parallelize BLAST:
• Segment the input — segment processing (querying) is pleasingly parallel.
• Segment the database (e.g., mpiBLAST) — needs special result-reduction processing.

Large volume of data:
• A normal BLAST database can be as large as 10 GB.
• With 100 nodes, peak aggregate storage traffic could reach 1 TB.
• The output of BLAST is usually 10–100x larger than the input.

AzureBLAST: a parallel BLAST engine on Azure.
• Uses the query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge results together when done.
• Follows the generally suggested application model: Web Role + Queue + Worker.
• With three special considerations: batch job management, task parallelism on an elastic cloud, and the large data volumes involved.

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.

A simple Split/Join pattern:
• Leverage the multiple cores of one instance via NCBI-BLAST's "-a" argument: 1, 2, 4, or 8 for the small, medium, large, and extra-large instance sizes.
• Task granularity: a large partition causes load imbalance; a small partition adds unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead).
• Best practice: profile with test runs, and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time.
• Too small: repeated computation. Too large: an unnecessarily long wait in case of instance failure.
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs, and watch out for the 2-hour maximum limitation.
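A sketch of the estimation idea; secondsPerMegabase would come from your own test runs, and the 1.5 headroom factor is an assumption:

    TimeSpan EstimateVisibilityTimeout(long pairBasesInPartition, double secondsPerMegabase)
    {
        double estimated = (pairBasesInPartition / 1e6) * secondsPerMegabase;
        double padded = estimated * 1.5;  // headroom so slow tasks are not re-dispatched
        double cap = TimeSpan.FromHours(2).TotalSeconds - 60;  // stay under the 2-hour max
        return TimeSpan.FromSeconds(Math.Min(padded, cap));
    }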

[Diagram: a splitting task fans out to BLAST tasks running in parallel, followed by a merging task.]

Task size vs. performance:
• Benefit of the warm-cache effect.
• 100 sequences per partition is the best choice.

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to memory capacity.

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resource.

[Diagram: AzureBLAST architecture. A Web Role hosts the web portal and web service for job registration. A Job Management Role runs the job scheduler, the scaling engine, and the job registry (kept in an Azure Table). Worker instances pull BLAST tasks from a global dispatch queue. Azure Blob storage holds the NCBI/BLAST databases and temporary data, refreshed by a database-updating role.]


The job portal (Web Role):
• An ASP.NET program hosted by a web role instance; used to submit jobs and to track a job's status and logs.
• Authentication/authorization is based on Live ID.
• An accepted job is stored in the job registry table — for fault tolerance, avoid in-memory state.

[Diagram: the web portal and web service hand job registrations to the job scheduler; the job portal, scaling engine, and job registry make up the job management role.]

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• against all NCBI non-redundant proteins: completed in 30 min;
• against ~5,000 proteins from another strain: completed in less than 30 sec.

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences.
• An "all against all" query: the database is also the input query.
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total — theoretically 100 billion sequence comparisons.
• Performance estimate, based on sample runs on one extra-large Azure instance: 3,216,731 minutes (6.1 years) on one desktop.
• One of the biggest BLAST jobs as far as we know — experiments at this scale are usually infeasible for most scientists.

How it ran:
• Allocated ~4,000 cores in total: 475 extra-large VMs (8 cores per VM) across four datacenters — two in the US, plus West and North Europe.
• 8 deployments of AzureBLAST, each with its own co-located storage service.
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution; each segment consists of smaller partitions.
• When load imbalances appeared, the load was redistributed manually.


• The total size of the output result is ~230 GB; the number of total hits is 1,764,579,487.
• Started on March 25th; the last task completed on April 8th (10 days of compute). But, based on our estimates, real working instance time should be 6–8 days.
• Look into the log data to analyze what took place…


A normal log record should look like this — otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

(Note the missing completion record for task 251774.)

North Europe data center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group — this is an update domain (~30 mins, ~6 nodes per group).
• 35 nodes experienced blob-writing failures at the same time.

West Europe data center: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Penman-Monteith (1964):

$$ET = \frac{\Delta R_n + \rho_a\, c_p\, \delta q\, g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v}$$

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J g⁻¹)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants. Estimating resistance/conductivity across a catchment can be tricky: there are lots of inputs and a big data reduction, and some of the inputs are not so simple.
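A direct transcription of the formula above into code — a sketch, with arguments in the SI units listed:

    static double PenmanMonteithET(double delta, double Rn, double rhoA, double cp,
                                   double deltaQ, double gA, double gS,
                                   double gamma, double lambdaV)
    {
        return (delta * Rn + rhoA * cp * deltaQ * gA)
             / ((delta + gamma * (1.0 + gA / gS)) * lambdaV);
    }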

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year (of processing).

The pipeline has four stages:

• Data collection (map) stage: downloads requested input tiles from NASA FTP sites; includes a geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.
• Reprojection (map) stage: converts source tile(s) to intermediate-result sinusoidal tiles, using simple nearest-neighbor or spline algorithms.
• Derivation reduction stage: the first stage visible to the scientist; computes ET in our initial use.
• Analysis reduction stage: an optional second stage visible to the scientist; enables production of science analysis artifacts such as maps, tables, and virtual sensors.

[Diagram: the MODISAzure pipeline. User requests enter through the AzureMODIS Service Web Role portal and its request queue. The Data Collection Stage pulls source imagery from the download sites via the download queue; the Reprojection Stage, Derivation Reduction Stage (Reduction 1 queue), and Analysis Reduction Stage (Reduction 2 queue) follow, with source metadata kept alongside; scientists download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each to the appropriate Download, Reprojection, or Reduction job queue.
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks — recoverable units of work.
• The execution status of all jobs and tasks is persisted in Tables.

[Diagram: a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues work on the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches tasks onto the <PipelineStage>Task Queue.]

All work is actually done by a Worker Role (GenericWorker):
• Sandboxes science or other executables.
• Marshals all storage from/to Azure blob storage to/from local Azure Worker instance files.
• Dequeues tasks created by the Service Monitor.
• Retries failed tasks 3 times.
• Maintains all task status.

[Diagram: GenericWorker (Worker Role) instances are dispatched from the <PipelineStage>Task Queue and read <Input>Data Storage, while the Service Monitor parses and persists <PipelineStage>TaskStatus.]

[Diagram: reprojection dataflow. A Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus — each entity specifies a single reprojection job request — and parses and persists ReprojectionTaskStatus — each entity specifies a single reprojection task (i.e., a single tile). Tasks are dispatched via the job and task queues to GenericWorker (Worker Role) instances. Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile. Swath source data storage and reprojection data storage hold the imagery.]

• Computational costs are driven by data scale and the need to run reductions multiple times.
• Storage costs are driven by data scale and the 6-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

Costs by stage (the AzureMODIS Service Web Role portal fronts the whole pipeline):

• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers — $50 upload, $450 storage.
• Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20-100 workers — $420 CPU, $60 download.
• Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers — $216 CPU, $1 download, $6 storage.
• Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers — $216 CPU, $2 download, $9 storage.

Total: $1,420.

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault-tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premise compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.




ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate student rates

Cost breakdown by stage (same architecture as above: AzureMODIS Service Web Role Portal with Request, Reprojection, Reduction 1, Reduction 2, and Download queues):

• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications


Key Components: Fabric Controller
• It's a perfect IT employee: it never sleeps, doesn't ever ask for a raise, and always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web Front End
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other Web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• They decouple parts of the application, making each easier to scale independently
• Resource allocation: different priority queues and backend servers
• They mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram, written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management: Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files (sketched below):
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

[Screenshots: the Service Definition and Service Configuration files, and the GUI editor, reached by double-clicking the role name in the Azure project.]

Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage

Durable Storage at Massive Scale

Blob
- Massive files, e.g. videos, logs

Drive
- Use standard file system APIs

Tables
- Non-relational, but with few scale limits
- Use SQL Azure for relational data

Queues
- Facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood
• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
• The drive persists even when not mounted as a Page Blob (a mounting sketch follows)

Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST (basic lifecycle sketched below)

Storage Partitioning

Understanding partitioning is key to understanding performance:
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

The system load balances:
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• "Server Busy" means the limits of a single partition have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind) | Name         | CreditCardNumber    | OrderTotal
1                         | Customer         | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order-1          |              |                     | $35.12
2                         | Customer         | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order-3          |              |                     | $10.00

Blobs – Container name + Blob name: every blob and its snapshots are in a single partition.

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

Messages – Queue Name: all messages for a single queue belong to the same partition.

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets

Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Example: a movie table with PartitionKey = Category and RowKey = Title, split into two partition ranges served by different servers.

First partition range:
PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
Action                  | Fast & Furious       | …         | 2009
Action                  | The Bourne Ultimatum | …         | 2007
…                       | …                    | …         | …
Animation               | Open Season 2        | …         | 2009
Animation               | The Ant Bully        | …         | 2006

Second partition range:
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Together these ranges make up the full table; the system splits a table across servers at partition boundaries as load requires.

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• After a maximum of 5 seconds of query execution

Tables Recap
• Efficient for frequently used queries; supports batch transactions; distributes load
• Select a PartitionKey and RowKey that help scale: avoid "append only" patterns; distribute by using a hash, etc., as a prefix
• Always handle continuation tokens: expect them for range queries (see the sketch after this list)
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server busy" means the load on a single partition has exceeded the limits, and the system will load-balance partitions to meet traffic needs

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• We want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Messages are limited to 8 KB in size
  • Commonly we use the work ticket pattern (sketched below)
• Why not simply use a table?

Queue Terminology

Message Lifecycle:
[Diagram: a Web Role calls PutMessage to add messages (Msg 1, Msg 2, Msg 3, Msg 4) to the queue; Worker Roles call GetMessage (with a visibility timeout) to receive Msg 1 and Msg 2, and call RemoveMessage to delete a message once it has been processed.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1.

Removing Poison Messages

Walkthrough with producers P1, P2 and consumers C1, C2 on queue Q:

1. C1: GetMessage(Q, 30 s) returns msg 1
2. C2: GetMessage(Q, 30 s) returns msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) returns msg 1 (its dequeue count is now 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) returns msg 1
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1: DeleteMessage(Q, msg 1)

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message (batch messages; garbage-collect orphaned blobs)
• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
        {
            client = new CloudStorageAccount(AccountInformation.Credentials, false)
                         .CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type
    // (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available
        return null;
    }
}

Application Code - Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at
http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>
or, in this case,
http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code - TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

Code - TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table, or some other exception
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• The fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts, measure, and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance = one specific task for your code
• You're paying for the entire VM, so why not use it?
• A common mistake is to split code into multiple roles, each not using up its CPU
• Balance using up the CPU against keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library (see the sketch below)
  • Data parallelism
  • Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip plus JavaScript, CSS, and image minification to become compressed content.]

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure
• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs in the cloud
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern: a Splitting task fans out to BLAST tasks that run in parallel, and a Merging task joins their results.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• Set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions cause load imbalance
• Small partitions cause unnecessary overheads
  • NCBI-BLAST startup overhead
  • Data-transfer overhead
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure
• Best practice:
  • Estimate the value based on the number of pair-bases in the partition and on test runs
  • Watch out for the 2-hour maximum limitation

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilize the resource

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine over a global dispatch queue; Worker instances consume tasks; Azure Tables hold the Job Registry; Azure Blobs hold the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases. The pipeline follows the Split/Join pattern: Splitting task, parallel BLAST tasks, Merging task.]

Job portal:
• An ASP.NET program hosted by a Web Role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the Job Registry table
  • Fault tolerance: avoid in-memory state

Case: R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
  • 9,865,668 sequences to be queried in total
  • Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• This scale of experiment is usually infeasible for most scientists
• Allocated a total of ~4000 cores: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load became imbalanced, the load was redistributed manually


• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place…


A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe Data Center: a total of 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins per group; ~6 nodes in one group
• 35 nodes experienced blob-writing failure at the same time

West Europe Data Center: 30,976 tasks were completed, and the job was killed
• A reasonable guess: the fault domain was at work

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

5

Web Application Model Comparison

Machines Running IIS ASPNET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances Worker Role

Instances

Azure Storage Blob Queue Table

SQL Azure

Windows Azure Application Model

Key Components Fabric Controller

bull Manages hardware and virtual machines for service

Compute bull Web Roles

bull Web application front end

bull Worker Roles bull Utility compute

bull VM Roles bull Custom compute role bull You own and customize the VM

Storage bull Blobs

bull Binary objects

bull Tables bull Entity storage

bull Queues bull Role coordination

bull SQL Azure bull SQL in the cloud

Key Components Fabric Controller

bull Think of it as an automated IT department

bull ldquoCloud Layerrdquo on top of

bull Windows Server 2008

bull A custom version of Hyper-V called the Windows Azure Hypervisor

bull Allows for automated management of virtual machines

Key Components Fabric Controller

bull Think of it as an automated IT department bull ldquoCloud Layerrdquo on top of

bull Windows Server 2008

bull A custom version of Hyper-V called the Windows Azure Hypervisor bull Allows for automated management of virtual machines

bull Itrsquos job is to provision deploy monitor and maintain applications in data centers

bull Applications have a ldquoshaperdquo and a ldquoconfigurationrdquo bull The configuration definition describes the shape of a service

bull Role types

bull Role VM sizes

bull External and internal endpoints

bull Local storage

bull The configuration settings configures a service bull Instance count

bull Storage keys

bull Application-specific settings

Key Components Fabric Controller

bull Manages ldquonodesrdquo and ldquoedgesrdquo in the ldquofabricrdquo (the hardware) bull Power-on automation devices bull Routers Switches bull Hardware load balancers bull Physical servers bull Virtual servers

bull State transitions bull Current State bull Goal State bull Does what is needed to reach and maintain the goal state

bull Itrsquos a perfect IT employee bull Never sleeps bull Doesnrsquot ever ask for raise bull Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components ndash Compute Web Roles

Web Front End

bull Cloud web server

bull Web pages

bull Web services

You can create the following types

bull ASPNET web roles

bull ASPNET MVC 2 web roles

bull WCF service web roles

bull Worker roles

bull CGI-based web roles

Key Components ndash Compute Worker Roles

bull Utility compute

bull Windows Server 2008

bull Background processing

bull Each role can define an amount of local storage

bull Protected space on the local drive considered volatile storage

bull May communicate with outside services

bull Azure Storage

bull SQL Azure

bull Other Web services

bull Can expose external and internal endpoints

Suggested Application Model Using queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application glue bull Decouple parts of application easier to scale independently bull Resource allocation different priority queues and backend servers bull Mask faults in worker roles (reliable messaging)

Key Components ndash Compute VM Roles

bull Customized Role

bull You own the box

bull How it works

bull Download ldquoGuest OSrdquo to Server 2008 Hyper-V

bull Customize the OS as you need to

bull Upload the differences VHD

bull Azure runs your VM role using

bull Base OS

bull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage," +
                    "FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://&lt;AccountName&gt;.blob.core.windows.net/&lt;ContainerName&gt;/&lt;BlobName&gt; — or, in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt
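Since the container was made public in InitContainer(), anyone can read that URL with plain HTTP; for example, a hypothetical client-side check:

using (WebClient web = new WebClient())
{
    // Download the CSV that SaveToCloudButton_Click wrote above.
    string csv = web.DownloadString(
        "http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt");
    Console.WriteLine(csv);
}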

55

Code - TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    …
}

56

Code - TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

57

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                    .CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

58

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

59

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;

    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

60

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

61

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        // Update the tracked entity so the context can send the change.
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

62

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.
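For example, a point query against the Attendees table is an HTTP GET against the table endpoint (the row key below is a hypothetical attendee email; real requests must also carry a Date header and a SharedKey/SharedKeyLite Authorization header computed from the account key):

GET http://jjstore.table.core.windows.net/Attendees(PartitionKey='SummerSchool',RowKey='someone@example.com')

and a filtered scan looks like:

GET http://jjstore.table.core.windows.net/Attendees()?$filter=Graduated%20eq%20true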

63

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – larger, fewer VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:

• 1 role instance == 1 VM running Windows

• 1 role instance ≠ one specific task for your code

• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU

• Balance between using up CPU vs. having free capacity in times of need

• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library (see the sketch below)

• Data parallelism

• Task parallelism
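A minimal .NET 4 sketch of both styles (workItems, Transform, FetchInput, and WriteStatus are placeholders):

using System.Threading.Tasks;

// Data parallelism: the same operation applied across a collection,
// scheduled over the available cores by the TPL.
Parallel.ForEach(workItems, item => Transform(item));

// Task parallelism: independent operations run concurrently.
Task fetch  = Task.Factory.StartNew(() => FetchInput());
Task status = Task.Factory.StartNew(() => WriteStatus());
Task.WaitAll(fetch, status);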

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive

• Find code that is intensive with different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience due to not having excess capacity and the costs of having idling VMs (a minimal scaling sketch follows)
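One hedged sketch of the monitoring half, using the queue length as the scale signal as suggested in the queues recap (the thresholds are assumptions; actually changing the instance count goes through the Service Management API):

int backlog = queue.RetrieveApproximateMessageCount();        // StorageClient call
int targetWorkers = Math.Max(2, Math.Min(20, backlog / 100)); // assumed bounds
// Compare targetWorkers with the current instance count and adjust the
// deployment configuration through the Service Management API (not shown).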

[Figure: Performance vs. Cost trade-off]

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile

• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction

• Service choice can make a big cost difference based on your app profile

• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (a sketch follows the figure below)

• All modern browsers can decompress on the fly

• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes

• Use Portable Network Graphics (PNGs)

• Crush your PNGs

• Strip needless metadata

• Make all PNGs palette PNGs

[Figure: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content]
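A minimal sketch of point 1 with System.IO.Compression (the blob usage at the end assumes the BlobHelper pattern shown earlier):

using System.IO;
using System.IO.Compression;
using System.Text;

public static byte[] GzipString(string text)
{
    byte[] raw = Encoding.UTF8.GetBytes(text);
    using (MemoryStream output = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            gzip.Write(raw, 0, raw.Length);
        return output.ToArray();
    }
}

// Usage: mark the blob so browsers decompress on the fly.
// CloudBlob blob = container.GetBlobReference("data.csv");
// blob.Properties.ContentEncoding = "gzip";
// blob.UploadByteArray(GzipString(csvText));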

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web

• Hadoop implementation

• Hadoop has a long history and has been improved for stability

• Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure

• Designed from the start to use cloud primitives

• Built-in fault tolerance

• REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics

• Identifies similarity between bio-sequences

Computationally intensive

• Large number of pairwise alignment operations

• A BLAST run can take 700–1000 CPU hours

• Sequence databases are growing exponentially

• GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

• Segment the input

• Segment processing (querying) is pleasingly parallel

• Segment the database (e.g., mpiBLAST)

• Needs special result reduction processing

Large volume of data

• A normal BLAST database can be as large as 10 GB

• 100 nodes means the peak storage bandwidth could reach 1 TB

• The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern: split the input sequences; query partitions in parallel; merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker (a work-ticket sketch follows the citation below)

• With three special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
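A hedged sketch of that model's producer side: the payload goes to a blob, and only a small work ticket (the blob name) goes on the queue (container, queue, and variable names here are illustrative, not AzureBLAST's actual ones):

CloudStorageAccount account =
    new CloudStorageAccount(AccountInformation.Credentials, false);
CloudBlobContainer container =
    account.CreateCloudBlobClient().GetContainerReference("partitions");
container.CreateIfNotExist();
CloudQueue tasks = account.CreateCloudQueueClient().GetQueueReference("blasttasks");
tasks.CreateIfNotExist();

string blobName = Guid.NewGuid() + ".fasta";
container.GetBlobReference(blobName).UploadText(sequencePartition); // payload in blob
tasks.AddMessage(new CloudQueueMessage(blobName));                  // <8KB ticket in queue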

A simple Split/Join pattern

Leverage the multi-core capability of one instance

• the "–a" argument of NCBI-BLAST

• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity

• Large partition: load imbalance

• Small partition: unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)

• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task

• Essentially an estimate of the task run time

• too small: repeated computation

• too large: unnecessarily long waiting time in case of instance failure

Best practice (a sketch follows):

• Estimate the value based on the number of pair-bases in the partition and test runs

• Watch out for the 2-hour maximum limitation
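For instance, one hypothetical way to pick the timeout from a partition's size (the rate and padding factor are assumptions, calibrated by test runs):

static TimeSpan EstimateVisibilityTimeout(long pairBases, double minsPerMillionPairBases)
{
    double estimatedMins = (pairBases / 1e6) * minsPerMillionPairBases;
    double padded = estimatedMins * 1.5;                 // headroom against repeats
    return TimeSpan.FromMinutes(Math.Min(padded, 120));  // respect the 2-hour cap
}
// CloudQueueMessage msg = queue.GetMessage(EstimateVisibilityTimeout(n, rate));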

[Figure: a Splitting task fans out to BLAST tasks (…), whose outputs feed a Merging task]

Task size vs. performance

• Benefit of the warm cache effect

• 100 sequences per partition is the best choice

Instance size vs. performance

• Super-linear speedup with larger worker instances

• Primarily due to the memory capability

Task size/instance size vs. cost

• The extra-large instance generated the best and the most economical throughput

• Fully utilizes the resource

[Architecture diagram: a Web Portal and Web Service in a Web Role handle job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, recording jobs in a Job Registry (Azure Table); tasks flow through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI/BLAST databases and temporary data, refreshed by a database-updating role; the task graph is the same splitting → BLAST → merging pattern shown above]

An ASP.NET program hosted by a web role instance

• Submit jobs

• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table

• Fault tolerance: avoid in-memory states

[Diagram: the Web Portal and Web Service hand job registrations to the Job Scheduler; the Job Portal, Scaling Engine, and Job Registry complete the loop]

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences)

• Against all NCBI non-redundant proteins: completed in 30 min

• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs

• Discover the interrelationships of known protein sequences

"All against All" query

• The database is also the input query

• The protein database is large (4.2 GB)

• 9,865,668 sequences to be queried in total

• Theoretically 100 billion sequence comparisons

Performance estimation

• Based on the sampling run on one extra-large Azure instance

• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know

• This scale of experiment is usually infeasible for most scientists

• Allocated a total of ~4000 instances: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe

• 8 deployments of AzureBLAST; each deployment has its own co-located storage service

• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution

• Each segment consists of smaller partitions

• When load imbalances occur, redistribute the load manually

[Figure: map of the eight AzureBLAST deployments across the four datacenters, with roughly 50–62 worker instances each]

• Total size of the output result is ~230 GB

• The number of total hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, real working instance time should be 6–8 days

• Look into the log data to analyze what took place…


A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe datacenter: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group; this is an update domain

• ~30 mins

• ~6 nodes in one group

35 nodes experienced blob writing failure at the same time

West Europe datacenter: 30,976 tasks were completed, and the job was killed

A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)

Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)

λv = latent heat of vaporization (J/g)

Rn = net radiation (W m⁻²)

cp = specific heat capacity of air (J kg⁻¹ K⁻¹)

ρa = dry air density (kg m⁻³)

δq = vapor pressure deficit (Pa)

ga = conductivity of air (inverse of ra) (m s⁻¹)

gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)

γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky

• Lots of inputs: big data reduction

• Some of the inputs are not so simple

ET = \frac{\Delta R_n + \rho_a c_p \, \delta q \, g_a}{(\Delta + \gamma (1 + g_a / g_s)) \, \lambda_v}

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
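Transcribed directly into code (units as in the variable list above; purely illustrative):

static double Evapotranspiration(
    double delta,    // Δ,  Pa K⁻¹
    double rn,       // Rn, W m⁻²
    double rhoA,     // ρa, kg m⁻³
    double cp,       // cp, J kg⁻¹ K⁻¹
    double dq,       // δq, Pa
    double ga,       // ga, m s⁻¹
    double gs,       // gs, m s⁻¹
    double lambdaV,  // λv, J/g
    double gamma)    // γ ≈ 66 Pa K⁻¹
{
    return (delta * rn + rhoA * cp * dq * ga)
         / ((delta + gamma * (1 + ga / gs)) * lambdaV);
}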

• NASA MODIS imagery source archives: 5 TB (600K files)

• FLUXNET curated sensor dataset: 30 GB (960 files)

• FLUXNET curated field dataset: 2 KB (1 file)

• NCEP/NCAR: ~100 MB (4K files)

• Vegetative clumping: ~5 MB (1 file)

• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage

• Downloads requested input tiles from NASA FTP sites

• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage

• Converts source tile(s) to intermediate result sinusoidal tiles

• Simple nearest neighbor or spline algorithms

Derivation reduction stage

• First stage visible to the scientist

• Computes ET in our initial use

Analysis reduction stage

• Optional second stage visible to the scientist

• Enables production of science analysis artifacts such as maps, tables, virtual sensors

[Pipeline diagram: the AzureMODIS Service Web Role Portal receives requests on a Request Queue; the Data Collection stage pulls source imagery from download sites and records source metadata; Reprojection, Derivation Reduction (Reduction 1), Analysis Reduction (Reduction 2), and Download queues chain the stages; scientists download the scientific results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door

• Receives all user requests

• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role

• Parses all job requests into tasks: recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a &lt;PipelineStage&gt; request arrives at the MODISAzure Service (Web Role), which persists &lt;PipelineStage&gt;JobStatus and enqueues to the &lt;PipelineStage&gt;Job Queue; the Service Monitor (Worker Role) parses the job, persists &lt;PipelineStage&gt;TaskStatus, and dispatches work to the &lt;PipelineStage&gt;Task Queue]

All work is actually done by a Worker Role (the GenericWorker)

• Sandboxes the science or other executable

• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files

• Dequeues tasks created by the Service Monitor

• Retries failed tasks 3 times

• Maintains all task status

[Diagram: the Service Monitor persists &lt;PipelineStage&gt;TaskStatus and dispatches to the &lt;PipelineStage&gt;Task Queue, from which GenericWorker (Worker Role) instances dequeue tasks and read &lt;Input&gt;Data Storage]

[Diagram: a Reprojection Request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus and ReprojectionTaskStatus and dispatches via the Job and Task Queues to GenericWorker instances; workers consult Swath Source Data Storage and write Reprojection Data Storage]

Each ReprojectionJobStatus entity specifies a single reprojection job request.

Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile).

Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile.

Query the ScanTimeList table to get the list of satellite scan times that cover a target tile.

• Computational costs driven by data scale and the need to run reduction multiple times

• Storage costs driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate student rates

[Pipeline cost summary (figure), per stage of the AzureMODIS Service Web Role Portal pipeline:

Data Collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage

Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download

Derivation Reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage

Analysis Reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420]

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year



Key Components: Fabric Controller

• Think of it as an automated IT department: a "cloud layer" on top of
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor, that allows for automated management of virtual machines

• Its job is to provision, deploy, monitor, and maintain applications in data centers

• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware): power-on automation devices, routers/switches, hardware load balancers, physical servers, virtual servers

• State transitions: knows the current state and the goal state, and does what is needed to reach and maintain the goal state

• It's a perfect IT employee: never sleeps, doesn't ever ask for a raise, and always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web Front End
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management: Provide code + service model

• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files (sketched below):
  • ServiceDefinition.csdef (the service definition)
  • ServiceConfiguration.cscfg (the service configuration)
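As an illustration, a minimal pair of files might look like the following. The service name, role name, endpoint, and setting are placeholders, trimmed to the essentials of the SDK 1.x schema:

<ServiceDefinition name="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd">
    <InputEndpoints>
      <!-- external endpoint exposed through the load balancer -->
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <!-- the definition declares the setting's existence only -->
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
  </WebRole>
</ServiceDefinition>

<ServiceConfiguration serviceName="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <!-- the settings file supplies values for the shape declared above -->
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="UseDevelopmentStorage=true" />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>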

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• Visual Studio builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• A REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own (see the sketch below)
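As a hedged sketch of calling the API from C#, where the subscription ID, certificate thumbprint, and version header value are placeholders you would substitute with your own:

// Illustrative: list hosted services via the Service Management API
var request = (HttpWebRequest)WebRequest.Create(
    "https://management.core.windows.net/<subscription-id>/services/hostedservices");
request.Headers.Add("x-ms-version", "2011-02-25");     // API version header

// Authenticate with a management certificate previously uploaded to the portal
var store = new X509Store(StoreName.My, StoreLocation.CurrentUser);
store.Open(OpenFlags.ReadOnly);
request.ClientCertificates.Add(
    store.Certificates.Find(X509FindType.FindByThumbprint, "<thumbprint>", false)[0]);

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
    Console.WriteLine(reader.ReadToEnd());              // XML description of your services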

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob
– Massive files, e.g. videos, logs

Drive
– Use standard file system APIs

Tables
– Non-relational, but with few scale limits
– Use SQL Azure for relational data

Queues
– Facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob or overwrites an existing blob
  • GetBlob: gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID (see the upload sketch below)
  • Size limit: 200 GB per blob

• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
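A sketch of working with blocks explicitly through the StorageClient library; the blob name and the caller-prepared chunk list are illustrative, and block IDs must be Base64 strings of equal length:

// Upload a block blob as individually staged blocks, then commit the list
CloudBlockBlob blob = container.GetBlockBlobReference("bigfile.dat");
var blockIds = new List<string>();
for (int i = 0; i < chunks.Count; i++)   // chunks: a List<byte[]> prepared by the caller
{
    string blockId = Convert.ToBase64String(BitConverter.GetBytes(i));
    blob.PutBlock(blockId, new MemoryStream(chunks[i]), null);   // stage one block
    blockIds.Add(blockId);
}
blob.PutBlockList(blockIds);   // the blob becomes visible only after this commit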

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access the durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:\ (see the sketch below)
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The data persists in the Page Blob even when the drive is not mounted
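A sketch of the drive workflow using the CloudDrive class from the 1.x SDK; the blob path, local cache resource name, and sizes are illustrative:

// Mount a durable NTFS volume backed by a page blob
CloudStorageAccount account = CloudStorageAccount.FromConfigurationSetting("DataConnectionString");
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd");
drive.Create(64);                                             // 64 MB page blob formatted as NTFS
string drivePath = drive.Mount(25, DriveMountOptions.None);   // returns e.g. "X:\"
File.WriteAllText(Path.Combine(drivePath, "hello.txt"), "hi");  // standard NTFS APIs
drive.Unmount();                                              // data survives in the page blob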

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key, different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

System load balancing

• Our system load balances to meet your traffic needs
• "Server Busy" means the single-partition limits have been reached
• Use exponential backoff on "Server Busy"

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets

Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff (see the sketch below).
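A minimal retry sketch, assuming the StorageClient library's StorageServerException surfaces the 503; the operation, delay values, and attempt cap are illustrative, and the library also ships configurable RetryPolicies:

TimeSpan delay = TimeSpan.FromMilliseconds(100);
for (int attempt = 0; attempt < 5; attempt++)
{
    try
    {
        blob.UploadText(payload);   // any storage operation
        break;                      // success: stop retrying
    }
    catch (StorageServerException e)
    {
        if (e.StatusCode != HttpStatusCode.ServiceUnavailable)
            throw;                  // only back off on 503 Server Busy
        Thread.Sleep(delay);
        delay = TimeSpan.FromMilliseconds(delay.TotalMilliseconds * 2);   // exponential backoff
    }
}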

Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

The system can serve this table as a single partition range on one server, or split it into ranges (e.g. Action through Animation, and Comedy through War) served by different servers as load grows.

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query may stop and hand back a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• After a maximum of 5 seconds of query execution
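A sketch of following continuation tokens by hand with WCF Data Services; AsTableServiceQuery() in the StorageClient library does this transparently, the header names are the Table service's continuation headers, and the entity type is the AttendeeEntity used in the code samples later:

string nextPartition = null, nextRow = null;
do
{
    DataServiceQuery<AttendeeEntity> query =
        (DataServiceQuery<AttendeeEntity>)context.CreateQuery<AttendeeEntity>("Attendees");
    if (nextPartition != null)
        query = query.AddQueryOption("NextPartitionKey", nextPartition);
    if (nextRow != null)
        query = query.AddQueryOption("NextRowKey", nextRow);

    var response = (QueryOperationResponse<AttendeeEntity>)query.Execute();
    foreach (AttendeeEntity attendee in response) { /* process */ }

    // Absent headers mean the result set is complete
    response.Headers.TryGetValue("x-ms-continuation-NextPartitionKey", out nextPartition);
    response.Headers.TryGetValue("x-ms-continuation-NextRowKey", out nextRow);
} while (nextPartition != null);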

Tables Recap

• Select a PartitionKey and RowKey that help scale → efficient for frequently used queries, supports batch transactions, distributes load
• Avoid "append only" patterns → distribute by using a hash etc. as a prefix
• Always handle continuation tokens → expect them for range queries
• "OR" predicates are not optimized → execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries → "server busy" means load on a single partition has exceeded the limits, and the system load balances partitions to meet traffic needs

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together: tight coupling leads to brittleness
• This decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern (see the sketch below)
• Why not simply use a table?
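A sketch of the work ticket pattern, reusing the AccountInformation helper shown later in the code samples; the queue name, blob name, and handler are illustrative. The message carries only a small "ticket" pointing at the real data in blob storage:

// Producer (e.g. a web role): enqueue a work ticket naming the blob to process
CloudQueueClient queueClient =
    new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
CloudQueue queue = queueClient.GetQueueReference("workitems");
queue.CreateIfNotExist();
queue.AddMessage(new CloudQueueMessage("school/SummerSchoolAttendees.txt"));

// Consumer (e.g. a worker role): dequeue, process, delete only on success
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    ProcessBlob(msg.AsString);   // hypothetical handler that fetches and processes the blob
    queue.DeleteMessage(msg);
}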

Queue Terminology

Message Lifecycle (diagram): a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue; a Worker Role calls GetMessage with a visibility timeout to retrieve the next message, and RemoveMessage to delete it once it has been processed.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1. (Diagram: two consumers, C1 and C2, polling a queue with growing intervals; a sketch follows.)
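A minimal sketch of the polling loop; the queue and handler are illustrative, and the truncation cap is an assumption:

int intervalSeconds = 1;
const int maxIntervalSeconds = 60;        // truncation point (assumed)
while (true)
{
    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        Process(msg);                     // hypothetical handler
        queue.DeleteMessage(msg);
        intervalSeconds = 1;              // a successful poll resets the interval
    }
    else
    {
        Thread.Sleep(TimeSpan.FromSeconds(intervalSeconds));
        intervalSeconds = Math.Min(intervalSeconds * 2, maxIntervalSeconds);   // empty poll doubles it
    }
}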

Removing Poison Messages

Walkthrough (producers P1 and P2, consumers C1 and C2; the diagram shows the queue contents at each step):

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1): the poison message is removed rather than retried forever

Queues Recap

• Make message processing idempotent → no need to deal with failures
• Do not rely on order → invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages → enforce a threshold on each message's dequeue count
• For messages > 8 KB, use a blob to store the message data, with a reference in the message → batch messages; garbage-collect orphaned blobs
• Use the message count to scale → dynamically increase/reduce workers

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            // Lazily create the shared credentials object
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();

        // Make the container publicly readable
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    …
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    // All attendees share one partition; the email address is unique per attendee
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))   // nothing to delete
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for data parallelism and task parallelism (see the sketch below)
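A minimal TPL sketch; the worker methods are hypothetical placeholders:

// Data parallelism: one operation applied across a collection, scheduled over all cores
Parallel.ForEach(inputTiles, tile => Reproject(tile));

// Task parallelism: independent operations running concurrently
Task download  = Task.Factory.StartNew(() => DownloadInputs());
Task transform = Task.Factory.StartNew(() => TransformCache());
Task.WaitAll(download, transform);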

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

Performance / Cost

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference, based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content (see the sketch below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
   (Diagram: uncompressed content passes through Gzip to become compressed content)
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Also: minify JavaScript, minify CSS, minify images.
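A common ASP.NET pattern for the gzip step, as a sketch; this is standard HTTP content negotiation rather than anything Azure-specific, typically wired up in Global.asax:

// In Application_BeginRequest: compress the response when the client supports it
string acceptEncoding = Request.Headers["Accept-Encoding"] ?? "";
if (acceptEncoding.Contains("gzip"))
{
    Response.Filter = new GZipStream(Response.Filter, CompressionMode.Compress);
    Response.AppendHeader("Content-Encoding", "gzip");   // tell the browser to decompress
}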

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs in the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project, code-named Daytona, for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple split/join pattern

Leverage the multi-core capability of one instance
• Argument "-a" of NCBI-BLAST sets the thread count: 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long waiting in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and test runs; watch out for the 2-hour maximum limitation

(Diagram: a splitting task fans the input out to many BLAST tasks running in parallel, followed by a merging task that joins their results.)

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

(Architecture diagram: a Web Role hosts the web portal and web service, which handle job registration; a Job Management Role hosts the job scheduler and scaling engine, dispatching work through a global dispatch queue to worker instances; Azure Tables hold the job registry, and Azure Blobs hold the BLAST/NCBI databases and temporary data; a database-updating role keeps the NCBI databases fresh. The BLAST stage itself follows the split/join pattern: a splitting task, parallel BLAST tasks, and a merging task.)

Web portal and web service
• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory state

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences

An "all against all" query
• The database is also the input query
• The protein database is large (42 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• This scale of experiment is usually infeasible for most scientists
• Allocated a total of ~4000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually


• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like this:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe Data Center: in total, 34,256 tasks processed

All 62 compute nodes lost tasks and then came back in a group: this is an update domain rolling through (~30 mins, ~6 nodes in one group)

35 nodes experienced blob-writing failures at the same time

West Europe Data Center: 30,976 tasks were completed, and the job was killed

A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Penman-Monteith (1964):

    ET = (Δ Rn + ρa cp δq ga) / ((Δ + γ (1 + ga / gs)) λv)

where
    ET = water volume evapotranspired (m3 s-1 m-2)
    Δ  = rate of change of saturation specific humidity with air temperature (Pa K-1)
    λv = latent heat of vaporization (J/g)
    Rn = net radiation (W m-2)
    cp = specific heat capacity of air (J kg-1 K-1)
    ρa = dry air density (kg m-3)
    δq = vapor pressure deficit (Pa)
    ga = conductivity of air (inverse of ra) (m s-1)
    gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
    γ  = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs, big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests at the AzureMODIS Service Web Role Portal; a request queue feeds the Data Collection Stage, which pulls source imagery from download sites; a reprojection queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 queues feed the Derivation and Analysis Reduction Stages; scientific results flow back to scientists through a download queue, with source metadata kept alongside.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks, which are recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues the job on the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses the job, persists <PipelineStage>TaskStatus, and dispatches tasks onto the <PipelineStage> Task Queue.)

All work is actually done by a Worker Role:
• Sandboxes the science or other executable
• Marshals all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage> Task Queue, read <Input>Data Storage, and report status back through <PipelineStage>TaskStatus.)

(Reprojection example: a reprojection request flows to the Service Monitor, which persists ReprojectionJobStatus, where each entity specifies a single reprojection job request; parses and persists ReprojectionTaskStatus, where each entity specifies a single reprojection task, i.e. a single tile; and dispatches tasks via the Job and Task Queues to GenericWorker instances. Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile. Tasks read Swath Source Data Storage and write Reprojection Data Storage.)

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Costs by pipeline stage:

Stage                | Data       | Files     | Compute             | Workers        | Costs
Data Collection      | 400-500 GB | 60K files | 10 MB/sec, 11 hours | <10 workers    | $50 upload, $450 storage
Reprojection         | 400 GB     | 45K files | 3500 hours          | 20-100 workers | $420 CPU, $60 download
Derivation Reduction | 5-7 GB     | 55K files | 1800 hours          | 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis Reduction   | <10 GB     | ~1K files | 1800 hours          | 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


Key Components Fabric Controller

bull Think of it as an automated IT department bull ldquoCloud Layerrdquo on top of

bull Windows Server 2008

bull A custom version of Hyper-V called the Windows Azure Hypervisor bull Allows for automated management of virtual machines

bull Itrsquos job is to provision deploy monitor and maintain applications in data centers

bull Applications have a ldquoshaperdquo and a ldquoconfigurationrdquo bull The configuration definition describes the shape of a service

bull Role types

bull Role VM sizes

bull External and internal endpoints

bull Local storage

bull The configuration settings configures a service bull Instance count

bull Storage keys

bull Application-specific settings

Key Components Fabric Controller

bull Manages ldquonodesrdquo and ldquoedgesrdquo in the ldquofabricrdquo (the hardware) bull Power-on automation devices bull Routers Switches bull Hardware load balancers bull Physical servers bull Virtual servers

bull State transitions bull Current State bull Goal State bull Does what is needed to reach and maintain the goal state

bull Itrsquos a perfect IT employee bull Never sleeps bull Doesnrsquot ever ask for raise bull Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components ndash Compute Web Roles

Web Front End

bull Cloud web server

bull Web pages

bull Web services

You can create the following types

bull ASPNET web roles

bull ASPNET MVC 2 web roles

bull WCF service web roles

bull Worker roles

bull CGI-based web roles

Key Components ndash Compute Worker Roles

bull Utility compute

bull Windows Server 2008

bull Background processing

bull Each role can define an amount of local storage

bull Protected space on the local drive considered volatile storage

bull May communicate with outside services

bull Azure Storage

bull SQL Azure

bull Other Web services

bull Can expose external and internal endpoints

Suggested Application Model Using queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application glue bull Decouple parts of application easier to scale independently bull Resource allocation different priority queues and backend servers bull Mask faults in worker roles (reliable messaging)

Key Components ndash Compute VM Roles

bull Customized Role

bull You own the box

bull How it works

bull Download ldquoGuest OSrdquo to Server 2008 Hyper-V

bull Customize the OS as you need to

bull Upload the differences VHD

bull Azure runs your VM role using

bull Base OS

bull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for data parallelism and task parallelism (see the sketch below)
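A minimal sketch of both TPL styles, not from the original deck; WorkItem, Process, CompressResults, and UpdateIndex are hypothetical placeholders:

using System.Collections.Generic;
using System.Threading.Tasks;

class ConcurrencySketch
{
    // Data parallelism: partition one collection across all available cores.
    static void ProcessAll(List<WorkItem> items)
    {
        Parallel.ForEach(items, item => Process(item));
    }

    // Task parallelism: run independent jobs side by side, then wait for both.
    static void RunIndependentJobs()
    {
        Task compress = Task.Factory.StartNew(() => CompressResults());
        Task index = Task.Factory.StartNew(() => UpdateIndex());
        Task.WaitAll(compress, index);
    }

    // Hypothetical placeholders, not part of the original slides.
    class WorkItem { }
    static void Process(WorkItem item) { }
    static void CompressResults() { }
    static void UpdateIndex() { }
}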

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs – performance vs. cost (a toy scaling heuristic is sketched below)
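One way to make the monitoring bullet concrete – purely illustrative, not from the deck – is a threshold rule on queue backlog per worker; the thresholds are made-up numbers to tune from your own measurements, and the returned count would be applied through the Service Management API:

using System;

class ScalingHeuristicSketch
{
    // Hypothetical thresholds: calibrate them against your own application.
    const int MessagesPerWorkerHigh = 100;  // scale out above this backlog
    const int MessagesPerWorkerLow = 10;    // scale in below this backlog
    const int MinWorkers = 2;               // keep headroom for failures

    // Returns the worker count to request from the Service Management API.
    static int DesiredWorkers(int queueLength, int currentWorkers)
    {
        int backlogPerWorker = queueLength / Math.Max(currentWorkers, 1);
        if (backlogPerWorker > MessagesPerWorkerHigh)
            return currentWorkers + 1;   // scale out; VMs take minutes to boot
        if (backlogPerWorker < MessagesPerWorkerLow && currentWorkers > MinWorkers)
            return currentWorkers - 1;   // scale in cautiously to protect user experience
        return currentWorkers;
    }
}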

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• The service choice can make a big cost difference based on your app profile
• Caching and compressing – they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places. Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (see the gzip sketch after this list)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
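A minimal sketch of item 1, not from the deck, using .NET's built-in GZipStream to compress a response body before it goes to storage or over the wire:

using System.IO;
using System.IO.Compression;
using System.Text;

class GzipSketch
{
    // Compress a string (e.g. a page or a CSV blob) to gzip bytes.
    static byte[] GzipText(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }   // closing the GZipStream flushes the gzip trailer
            // MemoryStream.ToArray still works after the stream is closed.
            return buffer.ToArray();
        }
    }
}

Remember to advertise the encoding (the Content-Encoding: gzip HTTP header, or the blob's ContentEncoding property when serving from blob storage) so browsers decompress on the fly.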

[Diagram: uncompressed content becomes compressed content via Gzip, minified JavaScript, minified CSS, and minified images.]

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing

Large data volumes:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
(a sketch of the split/query/merge flow follows)
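A rough sketch of that split/query/merge flow – illustrative only; the partition size, queue name, and the SubmitTask/SubmitMergeTask helpers are hypothetical, not AzureBLAST's actual code:

using System.Collections.Generic;

class SplitJoinSketch
{
    const int PartitionSize = 100;  // sequences per task (cf. the tuning discussion below)

    // Split the input sequences into fixed-size partitions, one work ticket each.
    static List<List<string>> Split(List<string> sequences)
    {
        var partitions = new List<List<string>>();
        for (int i = 0; i < sequences.Count; i += PartitionSize)
        {
            int count = System.Math.Min(PartitionSize, sequences.Count - i);
            partitions.Add(sequences.GetRange(i, count));
        }
        return partitions;
    }

    // Each partition becomes one queued BLAST task; a final task merges results.
    static void Run(List<string> sequences)
    {
        foreach (var partition in Split(sequences))
            SubmitTask("blastqueue", partition);    // hypothetical queue helper
        SubmitMergeTask("blastqueue");              // runs once all partitions finish
    }

    static void SubmitTask(string queue, List<string> partition) { }
    static void SubmitMergeTask(string queue) { }
}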

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST (number of threads)
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data transfer overhead)

Best practice: do test runs to profile, and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting time in case of an instance failure

Best practice:
• Estimate the value based on the number of pair-bases in the partition and on test runs
• Watch out for the 2-hour maximum limitation
(see the sketch below)
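A minimal sketch of how such an estimate might be applied; EstimateTimeout and its calibration constant are hypothetical placeholders, not AzureBLAST's actual formula, while the GetMessage overload taking a visibility timeout is from the Microsoft.WindowsAzure.StorageClient library used elsewhere in this deck:

using System;
using Microsoft.WindowsAzure.StorageClient;

class VisibilityTimeoutSketch
{
    // Hypothetical: minutes per million pair-bases, calibrated from test runs.
    const double MinutesPerMegaPairBase = 2.0;

    static TimeSpan EstimateTimeout(long pairBasesInPartition)
    {
        double minutes = MinutesPerMegaPairBase * (pairBasesInPartition / 1e6);
        minutes *= 1.5;                          // safety margin against repeats
        // Watch out for the 2-hour maximum visibility timeout.
        return TimeSpan.FromMinutes(Math.Min(minutes, 120));
    }

    static CloudQueueMessage NextTask(CloudQueue queue, long pairBases)
    {
        // The message stays invisible for the estimated run time; if the
        // instance dies, the task reappears for another worker afterwards.
        return queue.GetMessage(EstimateTimeout(pairBases));
    }
}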

[Diagram: a splitting task fans out into many parallel BLAST tasks, followed by a merging task.]

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, the scaling engine, and a database-updating role; a global dispatch queue feeds the worker instances; an Azure Table holds the job registry; Azure Blob storage holds the NCBI/BLAST databases, temporary data, etc.]


An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the job registry table:
• Fault tolerance – avoid in-memory state


R. palustris as a platform for H2 production – Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• Experiments at this scale are usually infeasible for most scientists
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM), across four datacenters – US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually


• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place… (a parsing sketch follows the sample log below)


A normal log record should look like the following; otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

(Note that task 251774 never logs completion – a lost task.)
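A small sketch of the kind of log analysis meant here – matching "Executing" lines against "is done" lines to flag lost tasks; the line format is taken from the sample above, everything else is illustrative:

using System.Collections.Generic;

class LogAnalysisSketch
{
    // Flags task IDs that started but never logged completion.
    static IEnumerable<string> FindLostTasks(IEnumerable<string> logLines)
    {
        var started = new HashSet<string>();
        foreach (string line in logLines)
        {
            // e.g. "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523"
            int i = line.IndexOf("Executing the task ");
            if (i >= 0)
            {
                started.Add(TaskId(line, i + "Executing the task ".Length));
                continue;
            }
            // e.g. "... Execution of task 251523 is done, it took 10.9 mins"
            i = line.IndexOf("Execution of task ");
            if (i >= 0)
                started.Remove(TaskId(line, i + "Execution of task ".Length));
        }
        return started;   // whatever remains never completed
    }

    static string TaskId(string line, int from)
    {
        int end = from;
        while (end < line.Length && char.IsDigit(line[end])) end++;
        return line.Substring(from, end - from);
    }
}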

North Europe data center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, about 30 minutes apart – this is an update domain at work.

35 nodes experienced blob-writing failures at the same time.

West Europe data center: 30,976 tasks were completed before the job was killed. A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

ET = Water volume evapotranspired (m3 s-1 m-2)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = Latent heat of vaporization (J/g)
Rn = Net radiation (W m-2)
cp = Specific heat capacity of air (J kg-1 K-1)
ρa = Dry air density (kg m-3)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s-1)
gs = Conductivity of plant stoma (inverse of rs) (m s-1)
γ = Psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs, big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

– Penman-Monteith (1964)
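A direct transcription of the formula as code – a sketch only; the parameter names follow the symbol list above, and nothing here comes from the MODISAzure implementation itself:

class PenmanMonteithSketch
{
    // ET from the Penman-Monteith form above; arguments use the units listed.
    static double ComputeET(
        double delta,    // Pa/K, rate of change of saturation specific humidity
        double rn,       // W/m^2, net radiation
        double rhoA,     // kg/m^3, dry air density
        double cp,       // J/(kg K), specific heat capacity of air
        double deltaQ,   // Pa, vapor pressure deficit
        double gA,       // m/s, conductivity of air
        double gS,       // m/s, conductivity of plant stoma
        double lambdaV,  // J/g, latent heat of vaporization
        double gamma)    // Pa/K, psychrometric constant (~66)
    {
        double numerator = delta * rn + rhoA * cp * deltaQ * gA;
        double denominator = (delta + gamma * (1 + gA / gS)) * lambdaV;
        return numerator / denominator;
    }
}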

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

Compute scale: 20 US years = 1 global year

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: the AzureMODIS Service Web Role Portal receives requests on a request queue; the data collection stage pulls source imagery and metadata from the download sites; the reprojection, reduction 1, reduction 2, and download queues feed the reprojection, derivation reduction, and analysis reduction stages; scientists download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a &lt;PipelineStage&gt;Request arrives at the MODISAzure Service (Web Role), which persists &lt;PipelineStage&gt;JobStatus and enqueues the job on the &lt;PipelineStage&gt;Job Queue; the Service Monitor (Worker Role) parses jobs into tasks, persists &lt;PipelineStage&gt;TaskStatus, and dispatches them to the &lt;PipelineStage&gt;Task Queue.]

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes the science or other executable
• Marshals all storage from/to Azure blob storage, to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor parses and persists &lt;PipelineStage&gt;TaskStatus and dispatches to the &lt;PipelineStage&gt;Task Queue; GenericWorker instances dequeue the tasks and read their inputs from &lt;Input&gt;Data Storage.]

[Diagram: a reprojection request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), then dispatches via the job and task queues to GenericWorker instances, which read swath source data storage and write reprojection data storage.]

Supporting metadata tables:
• SwathGranuleMeta: query this table to get geo-metadata (e.g. boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Approximate costs by stage (reconstructed from the cost diagram):
• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers



Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Creating a New Project

Windows Azure Compute

Key Components ndash Compute Web Roles

Web Front End

bull Cloud web server

bull Web pages

bull Web services

You can create the following types

bull ASPNET web roles

bull ASPNET MVC 2 web roles

bull WCF service web roles

bull Worker roles

bull CGI-based web roles

Key Components ndash Compute Worker Roles

bull Utility compute

bull Windows Server 2008

bull Background processing

bull Each role can define an amount of local storage

bull Protected space on the local drive considered volatile storage

bull May communicate with outside services

bull Azure Storage

bull SQL Azure

bull Other Web services

bull Can expose external and internal endpoints

Suggested Application Model Using queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application glue bull Decouple parts of application easier to scale independently bull Resource allocation different priority queues and backend servers bull Mask faults in worker roles (reliable messaging)

Key Components ndash Compute VM Roles

bull Customized Role

bull You own the box

bull How it works

bull Download ldquoGuest OSrdquo to Server 2008 Hyper-V

bull Customize the OS as you need to

bull Upload the differences VHD

bull Azure runs your VM role using

bull Base OS

bull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob
– Massive files, e.g. videos, logs

Drive
– Use standard file system APIs

Tables
– Non-relational, but with few scale limits
– Use SQL Azure for relational data

Queues
– Facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
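As a rough illustration, most of these REST operations surface through the 1.x StorageClient wrapper as sketched below; the blob names are illustrative, and LeaseBlob has no direct 1.x wrapper method, so it is omitted here.

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class BlobVerbsDemo
{
    public static void Run(CloudBlobContainer container)
    {
        CloudBlob blob = container.GetBlobReference("logs/day1.txt");
        blob.UploadText("hello");                   // PutBlob (overwrites if present)
        string text = blob.DownloadText();          // GetBlob (whole blob)
        CloudBlob copy = container.GetBlobReference("logs/day1-copy.txt");
        copy.CopyFromBlob(blob);                    // CopyBlob
        CloudBlob snapshot = blob.CreateSnapshot(); // SnapshotBlob
        blob.Delete();                              // DeleteBlob (the snapshot keeps its data)
    }
}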

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive; durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
• The drive, as a Page Blob, persists even when not mounted
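A minimal sketch of the create/mount workflow, assuming the CloudDrive API from the 1.x SDK (the Microsoft.WindowsAzure.CloudDrive assembly); the VHD path, drive size, and zero cache size are illustrative choices.

using System.IO;
using Microsoft.WindowsAzure;

public static class DriveDemo
{
    public static void Run(CloudStorageAccount account)
    {
        // The drive is backed by a page blob at this (illustrative) location.
        CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd");
        drive.Create(64);                                      // size in MB; one-time setup
        string root = drive.Mount(0, DriveMountOptions.None);  // 0 = no local read cache
        File.WriteAllText(Path.Combine(root, "readme.txt"), "hello"); // plain NTFS I/O
        drive.Unmount();                                       // the data persists in the page blob
    }
}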

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
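A small sketch of the .NET/LINQ surface; the "Movies" table and MovieEntity type are illustrative (they echo the partitioning examples a few slides below).

using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class MovieEntity : TableServiceEntity
{
    public DateTime ReleaseDate { get; set; }
}

public static class TableQueryDemo
{
    public static void Run(CloudStorageAccount account)
    {
        TableServiceContext ctx = account.CreateCloudTableClient().GetDataServiceContext();
        var actionMovies =
            from m in ctx.CreateQuery<MovieEntity>("Movies")
            where m.PartitionKey == "Action"   // an efficient key-based query
            select m;
        foreach (MovieEntity m in actionMovies)
            Console.WriteLine(m.RowKey);
    }
}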

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• The programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

• Every data object has a partition key; it is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

System load balancing
• The system load balances to meet your traffic needs
• "Server Busy" means the limits of a single partition have been reached
• Use exponential backoff on "Server Busy"

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
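A minimal sketch of such a backoff wrapper around any storage call; the delay bounds and attempt count are illustrative choices, not prescribed values.

using System;
using System.Threading;

public static class RetryHelper
{
    public static T WithBackoff<T>(Func<T> storageCall)
    {
        int delayMs = 250;            // initial delay
        const int maxDelayMs = 30000; // truncation cap
        const int maxAttempts = 6;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return storageCall();
            }
            catch (Exception) // e.g. a 503 "Server Busy" surfacing from the client library
            {
                if (attempt >= maxAttempts)
                    throw;
                Thread.Sleep(delayMs);
                delayMs = Math.Min(delayMs * 2, maxDelayMs); // double, up to the cap
            }
        }
    }
}

Usage would look like: string text = RetryHelper.WithBackoff(() => blob.DownloadText());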

Partitions and Partition Ranges

A single table, with PartitionKey = Category and RowKey = Title:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

The system may serve the whole table as one partition range, or split it into ranges (e.g. Action through Animation on one server and Comedy through War on another) as traffic grows.

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query may return a continuation token rather than all of its results when it hits:
• The maximum of 1,000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query
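A minimal sketch using the same 1.x StorageClient types as the TableHelper code later in this deck; AsTableServiceQuery() wraps the query so that enumeration transparently follows continuation tokens until the full result set has been returned.

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class ContinuationDemo
{
    public static void ListAll(CloudStorageAccount account)
    {
        TableServiceContext context = account.CreateCloudTableClient().GetDataServiceContext();
        // CloudTableQuery re-issues the request with each continuation token
        // (1,000-row pages, partition boundaries, 5-second execution windows).
        CloudTableQuery<AttendeeEntity> query =
            context.CreateQuery<AttendeeEntity>("Attendees").AsTableServiceQuery();
        foreach (AttendeeEntity a in query)
            Console.WriteLine(a.Email);
    }
}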

Tables Recap
• Select a PartitionKey and RowKey that help scale: this keeps frequently used queries efficient, supports batch transactions, and distributes load
• Avoid "append only" patterns: distribute writes by using a hash or similar as a key prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server Busy" means the load on a single partition has exceeded the limits, and the system load balances partitions to meet traffic needs

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• We want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a web role calls PutMessage to add Msg 1-4 to the queue; worker roles call GetMessage (with a visibility timeout) to receive Msg 1 and Msg 2, and call RemoveMessage once processing succeeds.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the polling interval by 2x, and a successful poll resets the interval back to 1.
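A minimal sketch of such a polling loop with the 1.x StorageClient queue API; the 1-second floor and 2-minute cap are illustrative choices.

using System;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

public static class PollingLoop
{
    public static void Run(CloudQueue queue, Action<CloudQueueMessage> process)
    {
        TimeSpan interval = TimeSpan.FromSeconds(1);
        TimeSpan max = TimeSpan.FromMinutes(2);
        while (true)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg == null)
            {
                Thread.Sleep(interval);   // empty poll: back off
                interval = TimeSpan.FromSeconds(
                    Math.Min(interval.TotalSeconds * 2, max.TotalSeconds));
            }
            else
            {
                interval = TimeSpan.FromSeconds(1);  // success: reset to 1 s
                process(msg);
                queue.DeleteMessage(msg);
            }
        }
    }
}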

Removing Poison Messages

[Animation across three slides: producers P1 and P2 enqueue onto the queue; consumers C1 and C2 dequeue. The numbered steps:]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – the repeatedly failing ("poison") message is removed
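A worker can enforce this threshold directly. A minimal sketch with the 1.x StorageClient; the threshold of 2 and the decision to simply delete the message are illustrative (production code might first park the message in a blob or table for inspection).

using System;
using Microsoft.WindowsAzure.StorageClient;

public static class PoisonAwareConsumer
{
    public static void HandleOne(CloudQueue queue, Action<string> process)
    {
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
        if (msg == null)
            return;
        if (msg.DequeueCount > 2)
        {
            // This message has already failed repeatedly: treat it as poison
            // and remove it so it stops taking workers down.
            queue.DeleteMessage(msg);
            return;
        }
        process(msg.AsString);    // idempotent work
        queue.DeleteMessage(msg); // delete only after success
    }
}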

Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type
    // (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code - Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code - TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // …
}

Code - TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table - or some other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))   // nothing to delete
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for data parallelism and task parallelism
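A minimal sketch of both TPL styles; the helper names are illustrative placeholders.

using System.Threading.Tasks;

public static class ConcurrencyDemo
{
    public static void Run(int[] items)
    {
        // Data parallelism: spread one loop body across all available cores.
        Parallel.For(0, items.Length, i => ProcessItem(items[i]));

        // Task parallelism: run independent units of work concurrently.
        Task render = Task.Factory.StartNew(() => RenderThumbnails());
        Task index = Task.Factory.StartNew(() => BuildSearchIndex());
        Task.WaitAll(render, index);
    }

    static void ProcessItem(int item) { /* per-item work */ }
    static void RenderThumbnails() { /* independent job 1 */ }
    static void BuildSearchIndex() { /* independent job 2 */ }
}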

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, or storage I/O intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile: e.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places. Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip to become compressed content; also minify JavaScript, CSS, and images.]
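A minimal sketch of the first tip in code; the UTF-8 choice and helper name are illustrative, and in ASP.NET you would also set the Content-Encoding: gzip response header.

using System.IO;
using System.IO.Compression;
using System.Text;

public static class CompressionDemo
{
    // Gzip-compress text content before storing or serving it.
    public static byte[] GzipString(string content)
    {
        byte[] raw = Encoding.UTF8.GetBytes(content);
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return output.ToArray(); // typically far smaller for text/HTML/JSON
        }
    }
}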

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • A Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – MapReduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs
• Watch out for the 2-hour maximum limitation

[Diagram: a splitting task fans the input out to many BLAST tasks that run in parallel, followed by a merging task that joins the results.]

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

[Architecture diagram: a web role hosts the web portal and web service for job registration; a job management role runs the job scheduler and scaling engine against the job registry (an Azure Table); worker instances pull tasks from a global dispatch queue; Azure Blob storage holds the BLAST/NCBI databases, temporary data, etc.; a database-updating role keeps the NCBI databases fresh.]


An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table (fault tolerance: avoid in-memory state)


R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences

An "All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB); in total 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• This scale of experiment is usually infeasible for most scientists
• Allocated a total of ~4000 instances: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load is imbalanced, redistribute it manually


• The total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group; this is an update domain (~30 mins, ~6 nodes in one group)
• 35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed, and the job was killed
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants. It is estimated with the Penman-Monteith (1964) equation:

$ET = \dfrac{\Delta R_n + \rho_a c_p \,\delta q \, g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right)\lambda_v}$

where
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

Inputs:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

Processing 20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: the AzureMODIS service web role portal receives requests into a request queue; the data collection stage pulls source imagery from the download sites (guided by source metadata); reprojection, reduction 1, reduction 2, and download queues feed the reprojection, derivation reduction, and analysis reduction stages; scientists download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure service is the web role front door:
  • Receives all user requests
  • Queues each request to the appropriate download, reprojection, or reduction job queue
• The service monitor is a dedicated worker role:
  • Parses all job requests into tasks – recoverable units of work
• The execution status of all jobs and tasks is persisted in tables

[Diagram: a <PipelineStage> request arrives at the MODISAzure service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the service monitor (worker role) parses the job, persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> task queue.]

All work is actually done by a GenericWorker (worker role), which:
• Sandboxes the science or other executable
• Marshals all storage from/to Azure blob storage to/from local files on the Azure worker instance
• Dequeues tasks created by the service monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the service monitor parses and persists <PipelineStage>TaskStatus; GenericWorker instances dispatch from the <PipelineStage> task queue and read from <Input>Data storage.]

Reprojection in detail:

[Diagram: a reprojection request reaches the service monitor, which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) for the job queue, then parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile) for the task queue served by GenericWorker instances. The workers query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) of each swath tile, query the ScanTimeList table for the list of satellite scan times that cover a target tile, read from swath source data storage, and write to reprojection data storage.]

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Costs by stage (via the AzureMODIS service web role portal):

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 compute, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 compute, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 compute, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Windows Azure Compute

Key Components ndash Compute Web Roles

Web Front End

bull Cloud web server

bull Web pages

bull Web services

You can create the following types

bull ASPNET web roles

bull ASPNET MVC 2 web roles

bull WCF service web roles

bull Worker roles

bull CGI-based web roles

Key Components ndash Compute Worker Roles

bull Utility compute

bull Windows Server 2008

bull Background processing

bull Each role can define an amount of local storage

bull Protected space on the local drive considered volatile storage

bull May communicate with outside services

bull Azure Storage

bull SQL Azure

bull Other Web services

bull Can expose external and internal endpoints

Suggested Application Model Using queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application glue bull Decouple parts of application easier to scale independently bull Resource allocation different priority queues and backend servers bull Mask faults in worker roles (reliable messaging)

Key Components ndash Compute VM Roles

bull Customized Role

bull You own the box

bull How it works

bull Download ldquoGuest OSrdquo to Server 2008 Hyper-V

bull Customize the OS as you need to

bull Upload the differences VHD

bull Azure runs your VM role using

bull Base OS

bull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure.
• Query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done.
• Follows the generally suggested application model: Web Role + Queue + Worker.
• With three special considerations:
  • Batch job management.
  • Task parallelism on an elastic cloud.

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern (see the diagram below).

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST.
• 1/2/4/8 for small, medium, large, and extra-large instance sizes.

Task granularity:
• Too large a partition: load imbalance.
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead).
• Best practice: profile with test runs and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time.
• Too small: repeated computation; too large: an unnecessarily long wait in case of instance failure.
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs.
• Watch out for the 2-hour maximum limitation.

[Diagram: a splitting task fans out to BLAST tasks running in parallel, followed by a merging task]

Task size vs. performance:
• Benefit of the warm-cache effect.
• 100 sequences per partition is the best choice.

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to the memory capacity.

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resource.

[Architecture diagram: a Web Role hosts the Web Portal and Web Service (job registration, job scheduler); a Job Management Role containing the scaling engine dispatches work through a global dispatch queue to Worker instances (splitting task → parallel BLAST tasks → … → merging task); an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role refreshes the NCBI databases]

Web portal:
• An ASP.NET program hosted by a web role instance.
• Submit jobs; track a job's status and logs.
• Authentication/authorization based on Live ID.
• The accepted job is stored in the job registry table (fault tolerance: avoid in-memory state).

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min.
• Against ~5,000 proteins from another strain: completed in less than 30 sec.

AzureBLAST significantly saved computing time…

Discovering homologs: discover the interrelationships of known protein sequences.

"All against all" query:
• The database is also the input query.
• The protein database is large (4.2 GB); 9,865,668 sequences to be queried in total.
• Theoretically, 100 billion sequence comparisons.

Performance estimation:
• Based on sample runs on one extra-large Azure instance, the full job would require 3,216,731 minutes (6.1 years) on one desktop.

One of the biggest BLAST jobs as far as we know:
• Experiments at this scale are usually infeasible for most scientists.
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe.
• 8 deployments of AzureBLAST, each with its own co-located storage service.
• The 10 million sequences were divided into multiple segments, each submitted to one deployment as one job for execution; each segment consists of smaller partitions.
• When load imbalances appeared, the load was redistributed manually.

[Chart: extra-large VMs allocated per deployment – six deployments of 62 and two of 50]

• Total size of the output result is ~230 GB.
• The total number of hits is 1,764,579,487.
• Started on March 25th; the last task completed on April 8th (10 days of compute).
• But, based on our estimates, the real working instance time should be 6–8 days.
• Look into the log data to analyze what took place…


A normal log record should look like:

    3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
    3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
    3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
    3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

    3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
    3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
    3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each gone for ~30 mins: this is an update domain at work.
• 35 nodes experienced blob-writing failures at the same time.

West Europe datacenter: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain mechanism was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

ET is estimated with the Penman-Monteith equation (1964):

    ET = (Δ Rn + ρa cp δq ga) / ((Δ + γ (1 + ga/gs)) λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction.
• Some of the inputs are not so simple.

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites.
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles.
• Simple nearest-neighbor or spline algorithms.

Derivation reduction stage:
• First stage visible to the scientist.
• Computes ET in our initial use.

Analysis reduction stage:
• Optional second stage visible to the scientist.
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors.

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal into a Request Queue; the Data Collection Stage pulls source imagery and source metadata from download sites via a Download Queue; results flow through the Reprojection Stage (Reprojection Queue), the Derivation Reduction Stage (Reduction 1 Queue), and the Analysis Reduction Stage (Reduction 2 Queue); scientific results are then available for download]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction Job Queue.
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks – recoverable units of work.
• The execution status of all jobs and tasks is persisted in Tables.

[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue]

All work is actually done by a Worker Role (GenericWorker):
• Sandboxes the science or other executable.
• Marshals all storage to/from Azure blob storage and to/from local Azure Worker instance files.
• Dequeues tasks created by the Service Monitor.
• Retries failed tasks 3 times.
• Maintains all task status.

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue, from which GenericWorker (Worker Role) instances dequeue tasks and read <Input>Data Storage]

[Diagram: a reprojection request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus, dispatching via the Job Queue and Task Queue to GenericWorker (Worker Role) instances that read Swath Source Data Storage and write Reprojection Data Storage. Each ReprojectionJobStatus entity specifies a single reprojection job request; each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile). Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile]

• Computational costs were driven by data scale and the need to run reductions multiple times.
• Storage costs were driven by data scale and the 6-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

[Pipeline diagram repeated, annotated with per-stage volumes and costs:]

    Stage                 Data        Files      Compute             Workers  Costs
    Data collection       400-500 GB  60K files  10 MB/sec; 11 hrs   <10      $50 upload, $450 storage
    Reprojection          400 GB      45K files  3,500 CPU hours     20-100   $420 CPU, $60 download
    Derivation reduction  5-7 GB      55K files  1,800 CPU hours     20-100   $216 CPU, $1 download, $6 storage
    Analysis reduction    <10 GB      ~1K files  1,800 CPU hours     20-100   $216 CPU, $2 download, $9 storage

    Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault-tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premise compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute on Windows Server 2008.
• Background processing.
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage.
• May communicate with outside services: Azure Storage, SQL Azure, other web services.
• Can expose external and internal endpoints.

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, fault-tolerant applications – queues are the application glue:
• Decouple parts of the application, making them easier to scale independently.
• Resource allocation: different priority queues and backend servers.
• Mask faults in worker roles (reliable messaging).

Key Components – Compute: VM Roles

• Customized role: you own the box.
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V.
  • Customize the OS as you need to.
  • Upload the differences VHD.
  • Azure runs your VM role using the base OS plus the differences VHD.

Application Hosting

'Grokking' the service model:
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate.
• The service model is the same diagram written down in a declarative format.
• You give the Fabric the service model and the binaries that go with each of those nodes.
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home.
  • Copy and launch your app binaries.
  • Monitor your app and the hardware.
  • In case of failure, take action – perhaps even relocate your app.
• At all times, the 'diagram' stays whole.

Automated Service Management: provide code + service model.
• The platform identifies and allocates resources, deploys the service, and manages service health.
• Configuration is handled by two files:
  • ServiceDefinition.csdef (the service definition)
  • ServiceConfiguration.cscfg (the service configuration)
• GUI: double-click on the role name in the Azure project.
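As a sketch of what these two files look like – with illustrative role names, sizes, endpoints, and instance counts, not taken from the talk:

    <!-- ServiceDefinition.csdef: the "shape" of the service -->
    <ServiceDefinition name="MyService"
        xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="WebFrontEnd" vmsize="Small">
        <Endpoints>
          <InputEndpoint name="HttpIn" protocol="http" port="80" />
        </Endpoints>
      </WebRole>
      <WorkerRole name="BackgroundWorker" />
    </ServiceDefinition>

    <!-- ServiceConfiguration.cscfg: configures that shape (e.g., instance counts) -->
    <ServiceConfiguration serviceName="MyService"
        xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="WebFrontEnd">
        <Instances count="2" />
      </Role>
      <Role name="BackgroundWorker">
        <Instances count="4" />
      </Role>
    </ServiceConfiguration>

Note the division of labor: the definition describes role types and endpoints (the shape), while the configuration holds values you expect to change at runtime, such as instance counts.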

Deploying to the cloud:
• We can deploy from the portal or from script.
• Visual Studio builds two files: an encrypted package of your code, and your config file.
• You must create an Azure account, then a service, and then you deploy your code.
• Deployment can take up to 20 minutes (which is better than six months).

Service Management API

• REST-based API to manage your services.
• X.509 certs for authentication.
• Lets you create, delete, change, upgrade, swap…
• Lots of community- and MSFT-built tools around the API – easy to roll your own.

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model:
   a. Determine resource requirements.
   b. Create role images.
2. Allocate resources.
3. Prepare nodes:
   a. Place role images on nodes.
   b. Configure settings.
   c. Start roles.
4. Configure load balancers.
5. Maintain service health:
   a. If a role fails, restart the role based on policy.
   b. If a node fails, migrate the role based on policy.

Storage: Durable Storage at Massive Scale

• Blob – massive files, e.g., videos, logs.
• Drive – use standard file system APIs.
• Tables – non-relational, but with few scale limits (use SQL Azure for relational data).
• Queues – facilitate loosely coupled, reliable systems.

Blob Features and Functions

• Store large objects (up to 1 TB in size).
• Can be served through the Windows Azure CDN service.
• Standard REST interface:
  • PutBlob – inserts a new blob, overwriting any existing blob.
  • GetBlob – gets the whole blob or a specific range.
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block blob:
  • Targeted at streaming workloads.
  • Each blob consists of a sequence of blocks, each identified by a Block ID.
  • Size limit: 200 GB per blob.
• Page blob:
  • Targeted at random read/write workloads.
  • Each blob consists of an array of pages, each identified by its offset from the start of the blob.
  • Size limit: 1 TB per blob.

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use.
• Use existing NTFS APIs to access a durable drive: durability and survival of data on application failover.
• Enables migrating existing NTFS applications to the cloud.
• A Windows Azure Drive is a Page Blob:
  • Example: mount the Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:.
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication.
  • The drive persists as a Page Blob even when not mounted.

Windows Azure Tables

• Provides structured storage.
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows.
• Highly available & durable: data is replicated several times.
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, or REST with any platform or language.

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery.
• Simple, asynchronous work dispatch.
• Programming semantics ensure that a message can be processed at least once.
• Access is provided via REST.

Storage Partitioning

Understanding partitioning is key to understanding performance:
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key.
• A partition can be served by a single server; the system load-balances partitions based on traffic patterns.
• The partition key controls entity locality and is the unit of scale.
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server.
• The system load-balances to meet your traffic needs, but "Server Busy" means single-partition limits have been reached: use exponential backoff on "Server Busy".

Partition Keys in Each Abstraction

Entities – TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition:

    PartitionKey (CustomerId)  RowKey (RowKind)       Name          CreditCardNumber     OrderTotal
    1                          Customer-John Smith    John Smith    xxxx-xxxx-xxxx-xxxx
    1                          Order - 1                                                 $35.12
    2                          Customer-Bill Johnson  Bill Johnson  xxxx-xxxx-xxxx-xxxx
    2                          Order - 3                                                 $10.00

Blobs – Container name + Blob name. Every blob and its snapshots are in a single partition:

    Container Name  Blob Name
    image           annarborbighouse.jpg
    image           foxboroughgillette.jpg
    video           annarborbighouse.jpg

Messages – Queue name. All messages for a single queue belong to the same partition:

    Queue     Message
    jobs      Message 1
    jobs      Message 2
    workflow  Message 1

Scalability Targets

Storage account:
• Capacity – up to 100 TB.
• Transactions – up to a few thousand requests per second.
• Bandwidth – up to a few hundred megabytes per second.

Single queue/table partition:
• Up to 500 transactions per second.

Single blob partition:
• Throughput up to 60 MB/s.

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
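A minimal sketch of such a backoff, assuming the StorageClient library used elsewhere in this deck; ExecuteStorageOperation and the retry limit are placeholders:

    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.StorageClient;

    // Retry a storage call with truncated exponential backoff on server errors.
    TimeSpan delay = TimeSpan.FromMilliseconds(100);
    for (int attempt = 0; attempt < 5; attempt++)
    {
        try
        {
            ExecuteStorageOperation();   // placeholder for your blob/table/queue call
            break;                       // success: stop retrying
        }
        catch (StorageServerException)   // server-side error, e.g., 503 Server Busy
        {
            Thread.Sleep(delay);
            delay = TimeSpan.FromMilliseconds(delay.TotalMilliseconds * 2);  // double each time
        }
    }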

Partitions and Partition Ranges

A movies table with PartitionKey = Category and RowKey = Title:

    PartitionKey (Category)  RowKey (Title)             Timestamp  ReleaseDate
    Action                   Fast & Furious             …          2009
    Action                   The Bourne Ultimatum       …          2007
    …                        …                          …          …
    Animation                Open Season 2              …          2009
    Animation                The Ant Bully              …          2006
    …                        …                          …          …
    Comedy                   Office Space               …          1999
    …                        …                          …          …
    SciFi                    X-Men Origins: Wolverine   …          2009
    …                        …                          …          …
    War                      Defiance                   …          2008

The same table may be served as a single partition range, or split across servers into multiple partition ranges (e.g., Action–Animation on one server and Comedy–War on another).

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible.
• Hot partitions can be load balanced.
• PartitionKey is critical for scalability.

Query efficiency & speed:
• Avoid frequent large scans.
• Parallelize queries.
• Point queries are most efficient.

Entity group transactions:
• Transactions across a single partition.
• Transaction semantics & reduced round trips.

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query response may carry a continuation token whenever:
• a maximum of 1,000 rows is returned in a response;
• the query reaches the end of a partition range boundary; or
• the query hits the maximum of 5 seconds to execute.
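A sketch of handling those tokens explicitly with WCF Data Services, reusing the AttendeeEntity type and TableServiceContext from the samples later in this deck; Process is a placeholder:

    using System.Data.Services.Client;

    var query = (DataServiceQuery<AttendeeEntity>)
        context.CreateQuery<AttendeeEntity>("Attendees");

    // Execute the first page of results.
    var response = (QueryOperationResponse<AttendeeEntity>)query.Execute();
    foreach (AttendeeEntity a in response)
        Process(a);                       // placeholder for real work

    // Keep requesting pages while the service hands back a continuation token.
    DataServiceQueryContinuation<AttendeeEntity> token;
    while ((token = response.GetContinuation()) != null)
    {
        response = context.Execute(token);
        foreach (AttendeeEntity a in response)
            Process(a);
    }

(Note that the CloudTableQuery wrapper produced by AsTableServiceQuery(), used in the TableHelper sample, follows continuation tokens for you when enumerated.)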

Tables Recap

Tables are efficient for frequently used queries, support batch (entity group) transactions, and distribute load.

• Select a PartitionKey and RowKey that help scale: distribute load by using a hash, etc., as a prefix.
• Avoid "append only" patterns, which concentrate load on one partition.
• Always handle continuation tokens: expect them for range queries.
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries.
• Implement a back-off strategy for retries: "Server Busy" means either that the system is load-balancing partitions to meet traffic needs, or that the load on a single partition has exceeded the limits.

WCF Data Services

• Use a new context for each logical operation.
• AddObject/AttachTo can throw an exception if the entity is already being tracked.
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException.

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together: tight coupling leads to brittleness.
• This can aid in scaling and performance.
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8 KB in size.
• Commonly used with the work-ticket pattern.
• Why not simply use a table?

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1–4) to a queue; Worker Roles call GetMessage with a timeout, which makes the retrieved message invisible, and then RemoveMessage once processing succeeds]

    POST http://myaccount.queue.core.windows.net/myqueue/messages

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml
    Date: Tue, 09 Dec 2008 21:04:30 GMT
    Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

    <?xml version="1.0" encoding="utf-8"?>
    <QueueMessagesList>
      <QueueMessage>
        <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
        <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
        <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
        <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
        <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
        <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
      </QueueMessage>
    </QueueMessagesList>

    DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
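The same lifecycle through the StorageClient library might look like this sketch; the queue name, message text, and Process helper are illustrative:

    using System;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    CloudQueueClient queueClient =
        new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
    CloudQueue queue = queueClient.GetQueueReference("jobs");
    queue.CreateIfNotExist();

    // Producer (e.g., a Web Role): PutMessage.
    queue.AddMessage(new CloudQueueMessage("work-ticket-42"));

    // Consumer (e.g., a Worker Role): GetMessage with a visibility timeout,
    // then RemoveMessage only after the work has succeeded.
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        Process(msg.AsString);      // placeholder for real work
        queue.DeleteMessage(msg);
    }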

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1.
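A sketch of that polling loop, reusing the queue and Process placeholders from the previous sketch; the 64-second truncation ceiling is an assumed value:

    using System;
    using System.Threading;

    // Poll the queue, backing off while it stays empty.
    TimeSpan interval = TimeSpan.FromSeconds(1);
    TimeSpan maxInterval = TimeSpan.FromSeconds(64);   // truncation point (assumed)
    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage();
        if (msg != null)
        {
            Process(msg.AsString);
            queue.DeleteMessage(msg);
            interval = TimeSpan.FromSeconds(1);        // success: reset to 1 s
        }
        else
        {
            Thread.Sleep(interval);                    // empty poll: wait...
            if (interval < maxInterval)                // ...then double the interval
                interval = TimeSpan.FromSeconds(interval.TotalSeconds * 2);
        }
    }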

Removing Poison Messages

[Diagram: producers P1 and P2 feed queue Q; consumers C1 and C2 process messages. Consolidated across the animation builds, the sequence of events is:]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
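Step 12 above corresponds to checking CloudQueueMessage.DequeueCount before processing. A sketch, with the threshold of 2 taken from the diagram and the logging helper as a placeholder:

    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        if (msg.DequeueCount > 2)
        {
            // Poison message: it has already failed processing repeatedly.
            // Log it (or park it in a table/blob for inspection), then drop it.
            LogPoisonMessage(msg.AsString);   // placeholder
            queue.DeleteMessage(msg);
        }
        else
        {
            Process(msg.AsString);
            queue.DeleteMessage(msg);
        }
    }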

Queues Recap

• No need to deal with failures: make message processing idempotent.
• Invisible messages result in out-of-order delivery: do not rely on order.
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages.
• Messages > 8 KB: use a blob to store the message data, with a reference in the message (and garbage-collect orphaned blobs).
• Batch messages where possible.
• Dynamically increase/reduce workers: use the message count to scale.

Windows Azure Storage Takeaways: Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

    public class AccountInformation
    {
        private static string storageKey = "tHiSiSnOtMyKeY";
        private static string accountName = "jjstore";
        private static StorageCredentialsAccountAndKey credentials;

        internal static StorageCredentialsAccountAndKey Credentials
        {
            get
            {
                if (credentials == null)
                    credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
                return credentials;
            }
        }
    }

Code – BlobHelper.cs

    public class BlobHelper
    {
        private static string defaultContainerName = "school";
        private CloudBlobClient client = null;
        private CloudBlobContainer container = null;

        private void InitContainer()
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                    .CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }

Code – BlobHelper.cs (continued)

    public void WriteFileToBlob(string filePath)
    {
        if (client == null || container == null)
            InitContainer();
        FileInfo file = new FileInfo(filePath);
        CloudBlob blob = container.GetBlobReference(file.Name);
        blob.Properties.ContentType = GetContentType(file.Extension);
        blob.UploadFile(file.FullName);

        // Or, if you want to write a string, replace the last line with:
        //     blob.UploadText(someString);
        // and make sure you set the content type to the appropriate
        // MIME type (e.g., "text/plain").
    }

    public string GetBlobText(string blobName)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        try
        {
            return blob.DownloadText();
        }
        catch (Exception)
        {
            // The blob probably does not exist, or there is no connection available.
            return null;
        }
    }

Application Code – Blobs

    private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
    {
        StringBuilder buff = new StringBuilder();
        buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage," +
                        "FavoriteIceCream,YearsInPhD,Graduated");
        foreach (AttendeeEntity attendee in attendees)
            buff.AppendLine(attendee.ToCsvString());
        blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
    }

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> – or, in this case, at http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

    using Microsoft.WindowsAzure.StorageClient;

    public class AttendeeEntity : TableServiceEntity
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Email { get; set; }
        public DateTime Birthday { get; set; }
        public string FavoriteIceCream { get; set; }
        public int YearsInPhD { get; set; }
        public bool Graduated { get; set; }
        // ...

        public void UpdateFrom(AttendeeEntity other)
        {
            FirstName = other.FirstName;
            LastName = other.LastName;
            Email = other.Email;
            Birthday = other.Birthday;
            FavoriteIceCream = other.FavoriteIceCream;
            YearsInPhD = other.YearsInPhD;
            Graduated = other.Graduated;
            UpdateKeys();
        }

        public void UpdateKeys()
        {
            PartitionKey = "SummerSchool";
            RowKey = Email;
        }
    }

Code – TableHelper.cs

    public class TableHelper
    {
        private CloudTableClient client = null;
        private TableServiceContext context = null;
        private Dictionary<string, AttendeeEntity> allAttendees = null;
        private string tableName = "Attendees";

        private CloudTableClient Client
        {
            get
            {
                if (client == null)
                    client = new CloudStorageAccount(AccountInformation.Credentials, false)
                        .CreateCloudTableClient();
                return client;
            }
        }

        private TableServiceContext Context
        {
            get
            {
                if (context == null)
                    context = Client.GetDataServiceContext();
                return context;
            }
        }

        private void ReadAllAttendees()
        {
            allAttendees = new Dictionary<string, AttendeeEntity>();
            CloudTableQuery<AttendeeEntity> query =
                Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
            try
            {
                foreach (AttendeeEntity attendee in query)
                    allAttendees[attendee.Email] = attendee;
            }
            catch (Exception)
            {
                // No entries in table - or other exception.
            }
        }

        public void DeleteAttendee(string email)
        {
            if (allAttendees == null)
                ReadAllAttendees();
            if (!allAttendees.ContainsKey(email))
                return;
            AttendeeEntity attendee = allAttendees[email];

            // Delete from the cloud table.
            Context.DeleteObject(attendee);
            Context.SaveChanges();

            // Delete from the memory cache.
            allAttendees.Remove(email);
        }
    }

Code – TableHelper.cs (continued)

    public AttendeeEntity GetAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (allAttendees.ContainsKey(email))
            return allAttendees[email];
        return null;
    }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

    public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
    {
        foreach (AttendeeEntity attendee in updatedAttendees)
            UpdateAttendee(attendee, false);
        Context.SaveChanges(SaveChangesOptions.Batch);
    }

    public void UpdateAttendee(AttendeeEntity attendee)
    {
        UpdateAttendee(attendee, true);
    }

    private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
    {
        if (allAttendees.ContainsKey(attendee.Email))
        {
            AttendeeEntity existingAttendee = allAttendees[attendee.Email];
            existingAttendee.UpdateFrom(attendee);
            Context.UpdateObject(existingAttendee);
        }
        else
        {
            Context.AddObject(tableName, attendee);
        }
        if (saveChanges)
            Context.SaveChanges();
    }

Application Code – Cloud Tables

    private void SaveButton_Click(object sender, RoutedEventArgs e)
    {
        // Write to table.
        tableHelper.UpdateAttendees(attendees);
    }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs.
• Fundamental choice: larger, fewer VMs vs. many smaller instances.
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores.
• More instances may provide better uptime and reliability (more failures are needed to take your service down).
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows.
• 1 role instance == one specific task for your code.
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU.
• Balance between using up CPU and having free capacity in times of need.
• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores.
• Use multithreading aggressively:
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads.
  • In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below).
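A minimal .NET 4 sketch of both styles; the collections and work methods are placeholders:

    using System.Threading.Tasks;

    // Data parallelism: spread a CPU-bound loop across all cores of the VM.
    Parallel.For(0, workItems.Length, i => ProcessWorkItem(workItems[i]));

    // Task parallelism: run independent operations concurrently.
    Task compress = Task.Factory.StartNew(() => CompressOutput());
    Task upload = Task.Factory.StartNew(() => UploadLogsToBlob());
    Task.WaitAll(compress, upload);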

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive.
• Find code that is intensive with different resources to live together.
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Key Components ndash Compute Worker Roles

bull Utility compute

bull Windows Server 2008

bull Background processing

bull Each role can define an amount of local storage

bull Protected space on the local drive considered volatile storage

bull May communicate with outside services

bull Azure Storage

bull SQL Azure

bull Other Web services

bull Can expose external and internal endpoints

Suggested Application Model Using queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application glue bull Decouple parts of application easier to scale independently bull Resource allocation different priority queues and backend servers bull Mask faults in worker roles (reliable messaging)

Key Components ndash Compute VM Roles

bull Customized Role

bull You own the box

bull How it works

bull Download ldquoGuest OSrdquo to Server 2008 Hyper-V

bull Customize the OS as you need to

bull Upload the differences VHD

bull Azure runs your VM role using

bull Base OS

bull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology and Message Lifecycle

[Diagram: a Web Role calls PutMessage to enqueue Msg 1–4; a Worker Role calls GetMessage (with a visibility timeout) to receive a message and RemoveMessage to delete it once processed. While one worker holds Msg 1 invisibly, another worker can dequeue Msg 2.]

Enqueue a message:

    POST http://myaccount.queue.core.windows.net/myqueue/messages

Retrieve a message (response shown):

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml
    Date: Tue, 09 Dec 2008 21:04:30 GMT
    Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

    <?xml version="1.0" encoding="utf-8"?>
    <QueueMessagesList>
      <QueueMessage>
        <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
        <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
        <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
        <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
        <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
        <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
      </QueueMessage>
    </QueueMessagesList>

Delete the message using the pop receipt:

    DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
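The same lifecycle through the v1.x StorageClient API, as a hedged sketch; the queue name, message body, and 30-second visibility timeout are illustrative:

    CloudQueueClient queueClient =
        new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
    CloudQueue queue = queueClient.GetQueueReference("myqueue");
    queue.CreateIfNotExist();

    queue.AddMessage(new CloudQueueMessage("<test>test</test>"));       // PutMessage

    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30)); // invisible for 30 s
    if (msg != null)
    {
        // ... process the work ticket ...
        queue.DeleteMessage(msg);                                       // RemoveMessage
    }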

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1
• Cap the interval at some maximum (hence "truncated")
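A minimal sketch of such a polling loop for a worker role; the interval bounds and the ProcessMessage handler are assumptions, not recommendations from the deck:

    TimeSpan interval = TimeSpan.FromSeconds(1);
    TimeSpan maxInterval = TimeSpan.FromSeconds(64);    // truncation point (assumption)

    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage();
        if (msg != null)
        {
            ProcessMessage(msg);                        // hypothetical handler
            queue.DeleteMessage(msg);
            interval = TimeSpan.FromSeconds(1);         // success: reset the interval
        }
        else
        {
            Thread.Sleep(interval);                     // empty poll: wait...
            interval = TimeSpan.FromTicks(
                Math.Min(interval.Ticks * 2, maxInterval.Ticks));   // ...and double, capped
        }
    }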

Removing Poison Messages

A walkthrough with two producers (P1, P2) and two consumers (C1, C2) sharing queue Q:

1. C1: GetMessage(Q, 30 s) returns msg 1
2. C2: GetMessage(Q, 30 s) returns msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) returns msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) returns msg 1
12. msg 1's DequeueCount is now > 2, so C1 treats it as a poison message
13. C1: DeleteMessage(Q, msg 1)

The visibility timeout guarantees a message is never lost when a consumer crashes; the dequeue count is what lets you detect and remove a message that repeatedly kills its consumers (a poison message).
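A hedged sketch of that dequeue-count check in consumer code; the threshold of 3 mirrors the walkthrough above, and ProcessMessage is a hypothetical handler:

    const int MaxDequeueCount = 3;      // threshold (assumption, per the walkthrough)

    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        if (msg.DequeueCount >= MaxDequeueCount)
        {
            // Poison: it has already killed previous consumers. Remove it,
            // optionally parking it in a dead-letter blob or table first.
            queue.DeleteMessage(msg);
        }
        else
        {
            ProcessMessage(msg);        // may crash; the message will reappear
            queue.DeleteMessage(msg);
        }
    }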

Queues Recap

• Make message processing idempotent, so there is no need to deal with partial failures
• Do not rely on order: messages that go invisible and reappear arrive out of order
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages larger than 8 KB: use a blob to store the message data, with a reference in the message
• Batch messages where possible
• Garbage collect orphaned blobs
• Use the message count to scale: dynamically increase or reduce the number of workers

Windows Azure Storage Takeaways

• Blobs
• Drives
• Tables
• Queues

More information:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net

A Quick Exercise

...Then let's look at some code and some tools.

Code – AccountInformation.cs

    public class AccountInformation
    {
        private static string storageKey = "tHiSiSnOtMyKeY";
        private static string accountName = "jjstore";
        private static StorageCredentialsAccountAndKey credentials;

        internal static StorageCredentialsAccountAndKey Credentials
        {
            get
            {
                if (credentials == null)
                    credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
                return credentials;
            }
        }
    }

Code – BlobHelper.cs

    public class BlobHelper
    {
        private static string defaultContainerName = "school";
        private CloudBlobClient client = null;
        private CloudBlobContainer container = null;

        private void InitContainer()
        {
            if (client == null)
            {
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudBlobClient();
                container = client.GetContainerReference(defaultContainerName);
                container.CreateIfNotExist();
                BlobContainerPermissions permissions = container.GetPermissions();
                permissions.PublicAccess = BlobContainerPublicAccessType.Container;
                container.SetPermissions(permissions);
            }
        }
    }

Code – BlobHelper.cs

    public void WriteFileToBlob(string filePath)
    {
        if (client == null || container == null)
            InitContainer();
        FileInfo file = new FileInfo(filePath);
        CloudBlob blob = container.GetBlobReference(file.Name);
        blob.Properties.ContentType = GetContentType(file.Extension);
        blob.UploadFile(file.FullName);

        // Or, if you want to write a string, replace the last line with:
        //     blob.UploadText(someString);
        // and make sure you set the content type to the appropriate
        // MIME type (e.g. "text/plain").
    }

Code – BlobHelper.cs

    public string GetBlobText(string blobName)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        try
        {
            return blob.DownloadText();
        }
        catch (Exception)
        {
            // The blob probably does not exist, or there is no connection available.
            return null;
        }
    }

Application Code – Blobs

    private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
    {
        StringBuilder buff = new StringBuilder();
        buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage," +
                        "FavoriteIceCream,YearsInPhD,Graduated");
        foreach (AttendeeEntity attendee in attendees)
            buff.AppendLine(attendee.ToCsvString());
        blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
    }

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

    using Microsoft.WindowsAzure.StorageClient;

    public class AttendeeEntity : TableServiceEntity
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Email { get; set; }
        public DateTime Birthday { get; set; }
        public string FavoriteIceCream { get; set; }
        public int YearsInPhD { get; set; }
        public bool Graduated { get; set; }
        // ...
    }

Code – TableEntities

    public void UpdateFrom(AttendeeEntity other)
    {
        FirstName = other.FirstName;
        LastName = other.LastName;
        Email = other.Email;
        Birthday = other.Birthday;
        FavoriteIceCream = other.FavoriteIceCream;
        YearsInPhD = other.YearsInPhD;
        Graduated = other.Graduated;
        UpdateKeys();
    }

    public void UpdateKeys()
    {
        PartitionKey = "SummerSchool";
        RowKey = Email;
    }

Code – TableHelper.cs

    public class TableHelper
    {
        private CloudTableClient client = null;
        private TableServiceContext context = null;
        private Dictionary<string, AttendeeEntity> allAttendees = null;
        private string tableName = "Attendees";

        private CloudTableClient Client
        {
            get
            {
                if (client == null)
                    client = new CloudStorageAccount(AccountInformation.Credentials, false)
                                 .CreateCloudTableClient();
                return client;
            }
        }

        private TableServiceContext Context
        {
            get
            {
                if (context == null)
                    context = Client.GetDataServiceContext();
                return context;
            }
        }
    }

Code – TableHelper.cs

    private void ReadAllAttendees()
    {
        allAttendees = new Dictionary<string, AttendeeEntity>();
        CloudTableQuery<AttendeeEntity> query =
            Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
        try
        {
            foreach (AttendeeEntity attendee in query)
                allAttendees[attendee.Email] = attendee;
        }
        catch (Exception)
        {
            // No entries in table - or other exception.
        }
    }

Code – TableHelper.cs

    public void DeleteAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (!allAttendees.ContainsKey(email))   // nothing to delete
            return;
        AttendeeEntity attendee = allAttendees[email];

        // Delete from the cloud table
        Context.DeleteObject(attendee);
        Context.SaveChanges();

        // Delete from the memory cache
        allAttendees.Remove(email);
    }

Code – TableHelper.cs

    public AttendeeEntity GetAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (allAttendees.ContainsKey(email))
            return allAttendees[email];
        return null;
    }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

    public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
    {
        foreach (AttendeeEntity attendee in updatedAttendees)
            UpdateAttendee(attendee, false);
        Context.SaveChanges(SaveChangesOptions.Batch);
    }

    public void UpdateAttendee(AttendeeEntity attendee)
    {
        UpdateAttendee(attendee, true);
    }

    private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
    {
        if (allAttendees.ContainsKey(attendee.Email))
        {
            AttendeeEntity existingAttendee = allAttendees[attendee.Email];
            existingAttendee.UpdateFrom(attendee);
            Context.UpdateObject(existingAttendee);
        }
        else
        {
            Context.AddObject(tableName, attendee);
        }
        if (saveChanges)
            Context.SaveChanges();
    }

Application Code – Cloud Tables

    private void SaveButton_Click(object sender, RoutedEventArgs e)
    {
        // Write to table
        tableHelper.UpdateAttendees(attendees);
    }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts, measure, and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance need not equal one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
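A minimal sketch of both TPL styles on a worker role; the work items and the worker methods are hypothetical:

    using System.Threading.Tasks;

    // Data parallelism: the same operation over a collection, scheduled across cores.
    Parallel.ForEach(workItems, item => ProcessWorkItem(item));

    // Task parallelism: different operations running concurrently.
    Task download = Task.Factory.StartNew(() => DownloadPartition("partition-01"));
    Task index    = Task.Factory.StartNew(() => BuildIndex());
    Task.WaitAll(download, index);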

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g. SQL Azure has a flat fee while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content (see the sketch after this list)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
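A hedged sketch of item 1, gzip-compressing an ASP.NET response when the client advertises support; in practice you might hook this into Application_BeginRequest or simply enable IIS dynamic compression instead:

    using System.IO.Compression;
    using System.Web;

    void CompressResponse(HttpContext context)
    {
        string acceptEncoding = context.Request.Headers["Accept-Encoding"] ?? "";
        if (acceptEncoding.Contains("gzip"))
        {
            // Wrap the output stream so everything written downstream is gzipped.
            context.Response.Filter =
                new GZipStream(context.Response.Filter, CompressionMode.Compress);
            context.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }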

[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

...on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion...

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A single BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST); this needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage traffic could reach 1 TB (each node pulling its own copy of the database)
• The output of BLAST is usually 10-100x larger than the input

Parallel BLAST engine on Azure:

• Query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations, including batch job management and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern (see the diagram below).

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST sets the thread count
• Use 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs; watch out for the 2-hour maximum limitation

[Diagram: a splitting task fans the input out to parallel BLAST tasks; a merging task joins their results.]

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput; it fully utilizes the resource

[Diagram: AzureBLAST architecture. A Web Role hosts the web portal and web service for job registration. A Job Management Role runs the job scheduler and scaling engine, tracking state in an Azure Table job registry. Worker instances pull BLAST tasks from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role keeps the NCBI databases current.]


Web portal and web service (job registration):
• An ASP.NET program hosted by a web role instance; used to submit jobs and to track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table; for fault tolerance, avoid in-memory state

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time...

Discovering Homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs we know of; this scale of experiment is usually infeasible for most scientists:
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually


• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place...


A normal log record should look like the first block below; otherwise something is wrong (e.g. the task failed to complete):

    3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
    3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
    3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
    3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

    3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
    3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
    3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

Note that task 251774 never logs completion; that is the anomaly.

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work
• ~30 mins per group; ~6 nodes in one group
• 35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed before the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.

Penman-Monteith (1964):

$$ET = \frac{\Delta R_n + \rho_a c_p \,\delta q\, g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v}$$

where
• ET = water volume evapotranspired (m^3 s^-1 m^-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m^-2)
• cp = specific heat capacity of air (J kg^-1 K^-1)
• ρa = dry air density (kg m^-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air, the inverse of ra (m s^-1)
• gs = conductivity of plant stoma to air, the inverse of rs (m s^-1)
• γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: a big data reduction
• Some of the inputs are not so simple

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years ≈ 1 global year in processing terms.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the MODISAzure pipeline. Scientists submit requests through the AzureMODIS Service Web Role portal; a request queue feeds the data collection stage, which pulls source imagery from download sites guided by source metadata. The reprojection stage, derivation reduction stage (Reduction 1 queue), and analysis reduction stage (Reduction 2 queue) each have their own queue, and a download queue delivers scientific results back to the scientists.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks, which are recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches work to the <PipelineStage> task queue.]

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus into the <PipelineStage> task queue; GenericWorker instances dequeue tasks and read from <Input>Data Storage.]

Example: a reprojection request.

[Diagram: a reprojection request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches via the job and task queues to GenericWorker instances; those read swath source data storage and write reprojection data storage.]

The supporting Azure Tables:
• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e. a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g. boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate student rates

Approximate scale and cost per stage:

Stage                | Data       | Files     | Effort               | Workers | Cost
Data collection      | 400-500 GB | 60K files | 10 MB/sec, 11 hours  | <10     | $50 upload, $450 storage
Reprojection         | 400 GB     | 45K files | 3500 hours           | 20-100  | $420 cpu, $60 download
Derivation reduction | 5-7 GB     | 55K files | 1800 hours           | 20-100  | $216 cpu, $1 download, $6 storage
Analysis reduction   | <10 GB     | ~1K files | 1800 hours           | 20-100  | $216 cpu, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, so each is easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus your differences VHD

Application Hosting

'Grokking' the service model:
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home for your app
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action, perhaps even relocating your app
• At all times, the 'diagram' stays whole

Automated Service Management: provide code + a service model

• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
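A minimal sketch of the two files, assuming a single small web role named "WebRole1"; the service name, setting, and values are illustrative, not from the original deck:

    <!-- ServiceDefinition.csdef: the shape of the service -->
    <ServiceDefinition name="MyService"
        xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="WebRole1" vmsize="Small">
        <Endpoints>
          <InputEndpoint name="HttpIn" protocol="http" port="80" />
        </Endpoints>
        <ConfigurationSettings>
          <Setting name="DataConnectionString" />
        </ConfigurationSettings>
      </WebRole>
    </ServiceDefinition>

    <!-- ServiceConfiguration.cscfg: the settings for one deployment -->
    <ServiceConfiguration serviceName="MyService"
        xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="WebRole1">
        <Instances count="2" />
        <ConfigurationSettings>
          <Setting name="DataConnectionString" value="UseDevelopmentStorage=true" />
        </ConfigurationSettings>
      </Role>
    </ServiceConfiguration>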

A GUI for the service definition and service configuration is also available: double-click on the role name in the Azure project in Visual Studio.

Deploying to the cloud

• We can deploy from the portal or from script
• Visual Studio builds two files: an encrypted package of your code, and your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy

Storage: Durable Storage At Massive Scale

• Blob: massive files, e.g. videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues: facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob, overwriting any existing blob
  • GetBlob: get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
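A hedged sketch of the two blob types through the v1.x StorageClient API; the container, file names, and sizes are illustrative:

    CloudBlobClient client =
        new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
    CloudBlobContainer container = client.GetContainerReference("school");
    container.CreateIfNotExist();

    // Block blob: streaming-friendly; the upload is split into blocks for you.
    CloudBlockBlob blockBlob = container.GetBlockBlobReference("lecture.mp4");
    blockBlob.UploadFile(@"C:\lectures\lecture.mp4");

    // Page blob: fixed-size, 512-byte-aligned pages for random read/write
    // (this is what Windows Azure Drive mounts).
    CloudPageBlob pageBlob = container.GetPageBlobReference("disk.vhd");
    pageBlob.Create(1024 * 1024);   // size must be a multiple of 512 bytes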

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive; data is durable and survives application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Tables

• Provides structured storage: massively scalable tables with billions of entities (rows) and TBs of data, using thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST, with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:
• Every data object has a partition key, and the key is different for each data type (blobs, entities, queues)
• The partition key controls entity locality and is the unit of scale
• A partition can be served by a single server; the system load balances partitions based on traffic pattern
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• "Server Busy" means the load on a single partition has reached its limits: use exponential backoff, and the system will load balance to meet your traffic needs

Partition Keys In Each Abstraction

• Entities (TableName + PartitionKey): entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

• Blobs (container name + blob name): every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

• Messages (queue name): all messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work is actually done by a Worker Role:
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus entries and dispatches to the <PipelineStage>Task Queue; GenericWorker (Worker Role) instances pull tasks from the queue and read from <Input>Data Storage.]

The GenericWorker:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
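A minimal sketch of this dequeue/retry pattern with the era's StorageClient library; the queue name is hypothetical and the real service also records task status in Tables, which is omitted here:

    using System;
    using System.Threading;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    // Assumed setup: connectionString points at the service's storage account.
    var account = CloudStorageAccount.Parse(connectionString);
    CloudQueue taskQueue = account.CreateCloudQueueClient().GetQueueReference("reprojectiontasks");

    while (true)
    {
        // Hide the message for 30 minutes while this worker processes it.
        CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
        if (msg == null) { Thread.Sleep(TimeSpan.FromSeconds(5)); continue; }

        if (msg.DequeueCount > 3)
        {
            taskQueue.DeleteMessage(msg); // already failed 3 times: drop the task
            continue;                     // (and mark it failed in the status table)
        }

        ExecuteTask(msg.AsString);        // hypothetical: run the sandboxed executable
        taskQueue.DeleteMessage(msg);     // delete only after success
    }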

[Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches via the Job Queue and Task Queue to GenericWorker (Worker Role) instances; task entities point into Reprojection Data Storage, SwathGranuleMeta, ScanTimeList, and Swath Source Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request.
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile).
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile.
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile.

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate student rates

[Pipeline diagram, AzureMODIS Service Web Role Portal, with per-stage data volumes and costs:]

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 compute hours, 20-100 workers; $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 compute hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 compute hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premise compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Scalable Fault Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)


Key Components – Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
• Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate.
• The service model is the same diagram written down in a declarative format.
• You give the Fabric the service model and the binaries that go with each of those nodes.
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole.

Automated Service Management: provide code + service model

• The platform identifies and allocates resources, deploys the service, and manages service health.
• Configuration is handled by two files:
  • ServiceDefinition.csdef (the Service Definition)
  • ServiceConfiguration.cscfg (the Service Configuration)
• A GUI is also available: double-click on the role name in the Azure project.
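As a rough illustration of how the two files pair up (the service, role, endpoint, and setting names here are invented, and element names vary slightly by SDK version):

    <!-- ServiceDefinition.csdef: the "shape" of the service -->
    <ServiceDefinition name="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="WebFrontEnd">
        <Endpoints>
          <InputEndpoint name="HttpIn" protocol="http" port="80" />
        </Endpoints>
        <ConfigurationSettings>
          <Setting name="StorageAccount" />
        </ConfigurationSettings>
      </WebRole>
    </ServiceDefinition>

    <!-- ServiceConfiguration.cscfg: the values that configure a deployment -->
    <ServiceConfiguration serviceName="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="WebFrontEnd">
        <Instances count="2" />
        <ConfigurationSettings>
          <Setting name="StorageAccount" value="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=..." />
        </ConfigurationSettings>
      </Role>
    </ServiceConfiguration>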

Deploying to the cloud

• We can deploy from the portal or from script.
• Visual Studio builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code.
• Deployment can take up to 20 minutes (which is better than six months).

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own
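A minimal sketch of calling the API from C#; the subscription ID and certificate path are placeholders, and the x-ms-version value shown is one of several the service accepted at the time:

    using System;
    using System.IO;
    using System.Net;
    using System.Security.Cryptography.X509Certificates;

    class ListHostedServices
    {
        static void Main()
        {
            // Placeholders: substitute your own subscription ID and management cert.
            string subscriptionId = "your-subscription-id";
            var cert = new X509Certificate2(@"C:\certs\management.pfx", "password");

            var request = (HttpWebRequest)WebRequest.Create(
                "https://management.core.windows.net/" + subscriptionId + "/services/hostedservices");
            request.Headers.Add("x-ms-version", "2011-02-25"); // required API version header
            request.ClientCertificates.Add(cert);              // the X509 cert authenticates the call

            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd()); // XML list of hosted services
            }
        }
    }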

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage

Durable storage at massive scale:
• Blobs: massive files, e.g., videos, logs
• Drives: use standard file system APIs
• Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The drive persists, as a Page Blob, even when not mounted
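A rough sketch of mounting a drive with the era's Microsoft.WindowsAzure.CloudDrive library; the blob name, sizes, and configuration setting are illustrative, and exact signatures should be checked against the SDK:

    using System.IO;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    // Assumes the CloudDrive assembly and a configured "DataConnectionString" setting.
    CloudStorageAccount account = CloudStorageAccount.FromConfigurationSetting("DataConnectionString");
    CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd"); // page blob backing the drive
    drive.Create(1024);                                   // size in MB; throws if the VHD already exists
    string path = drive.Mount(256, DriveMountOptions.None); // local read cache size in MB; returns e.g. "X:\"
    File.WriteAllText(Path.Combine(path, "hello.txt"), "stored on a durable NTFS volume");
    drive.Unmount();                                      // the data persists in the page blob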

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData; .NET classes and LINQ
  • REST, with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.
• Every data object has a partition key; it is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• "Server Busy" means the limits of a single partition have been reached: use exponential backoff, and the system will load balance to meet your traffic needs

Partition Keys in Each Abstraction

• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

• Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition.

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

• Messages: Queue name. All messages for a single queue belong to the same partition.

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets

Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.

Partitions and Partition Ranges

A table sorted by (PartitionKey, RowKey) can be split into partition ranges served by different servers. One range:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Another range:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

The full table (Action through War) is the concatenation of the two ranges.

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

You will get a continuation token at:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds of query execution time
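A sketch of draining a query by following continuation tokens over the WCF Data Services interface; the x-ms-continuation-* headers and NextPartitionKey/NextRowKey query options are the service's real continuation mechanism, while the context and entity type are borrowed from the TableHelper example later in this deck:

    var query = (DataServiceQuery<AttendeeEntity>)context.CreateQuery<AttendeeEntity>("Attendees");
    var response = (QueryOperationResponse<AttendeeEntity>)query.Execute();
    var results = new List<AttendeeEntity>(response);

    // Keep following the continuation headers until the server stops sending them.
    while (response.Headers.ContainsKey("x-ms-continuation-NextPartitionKey"))
    {
        var continued = context.CreateQuery<AttendeeEntity>("Attendees")
            .AddQueryOption("NextPartitionKey", response.Headers["x-ms-continuation-NextPartitionKey"]);
        if (response.Headers.ContainsKey("x-ms-continuation-NextRowKey"))
            continued = continued.AddQueryOption("NextRowKey", response.Headers["x-ms-continuation-NextRowKey"]);

        response = (QueryOperationResponse<AttendeeEntity>)continued.Execute();
        results.AddRange(response);
    }

Note that the StorageClient's AsTableServiceQuery() wrapper, used in TableHelper.cs below, follows these tokens for you.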

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale:
• Avoid "append only" patterns: distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means either partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• We want roles that work closely together, but are not bound together; tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML and are limited to 8 KB in size
  • Commonly we use the work ticket pattern (see the sketch below)
• Why not simply use a table?
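A sketch of the work ticket pattern, reusing the BlobHelper from the code walkthrough later in this deck; the queue object is assumed to be set up elsewhere, and the blob name is illustrative:

    // Producer (e.g., a web role): store the payload in a blob, enqueue a small "ticket".
    blobHelper.WriteStringToBlob("job-4711-input.txt", largePayload); // payload may exceed 8 KB
    queue.AddMessage(new CloudQueueMessage("job-4711-input.txt"));    // ticket: just a blob name

    // Consumer (a worker role): the ticket says where to find the real work item.
    CloudQueueMessage ticket = queue.GetMessage();
    if (ticket != null)
    {
        string payload = blobHelper.GetBlobText(ticket.AsString);
        Process(payload);               // hypothetical work
        queue.DeleteMessage(ticket);    // remove only after successful processing
    }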

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1 … Msg 4 to a queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages, process them, and then call RemoveMessage to delete them.]

PutMessage request:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage request:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1

[Diagram: consumers C1 and C2 polling a queue at increasing intervals.]
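A minimal sketch of the truncated back-off loop; the one-second floor and 64-second ceiling are illustrative, and Process is a hypothetical handler:

    TimeSpan interval = TimeSpan.FromSeconds(1);      // reset value
    TimeSpan maxInterval = TimeSpan.FromSeconds(64);  // truncation point
    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage();
        if (msg != null)
        {
            Process(msg);                             // hypothetical handler
            queue.DeleteMessage(msg);
            interval = TimeSpan.FromSeconds(1);       // success: poll eagerly again
        }
        else
        {
            Thread.Sleep(interval);                   // empty poll: wait, then double
            interval = TimeSpan.FromTicks(Math.Min(interval.Ticks * 2, maxInterval.Ticks));
        }
    }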

Removing Poison Messages

[Diagram: producers P1 and P2 feed a queue consumed by C1 and C2. The walkthrough:]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2: treat it as a poison message
13. C1: DeleteMessage(Q, msg 1)
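The same logic in code, using the message's DequeueCount; the threshold and the dead-letter handling are illustrative choices, not mandated by the service:

    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30)); // visibility timeout
    if (msg != null)
    {
        if (msg.DequeueCount > 2)
        {
            // Poison: it has already failed twice. Park it for diagnosis, then delete.
            deadLetterContainer.GetBlobReference(msg.Id).UploadText(msg.AsString);
            queue.DeleteMessage(msg);
        }
        else
        {
            Process(msg);               // hypothetical handler; may crash mid-work
            queue.DeleteMessage(msg);   // delete only on success
        }
    }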

Queues Recap

• No need to deal with failures yourself: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways

Blobs, Drives, Tables, and Queues. For more:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

… Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();

        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g., "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // …
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))   // nothing to delete (the transcript dropped the negation)
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• The fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money (it is pretty rare to see linear scaling across 8 cores)
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• A common mistake is to split up code into multiple roles, each not using up its CPU
• Balance using up the CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
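A small Task Parallel Library sketch showing both flavors; the work items (tiles, ReprojectTile, and the two background jobs) are placeholders:

    using System.Threading.Tasks;

    // Data parallelism: spread one loop across all cores.
    Parallel.For(0, tiles.Length, i =>
    {
        ReprojectTile(tiles[i]);   // hypothetical per-item work
    });

    // Task parallelism: run independent units of work concurrently.
    Task download = Task.Factory.StartNew(() => DownloadSourceImagery());
    Task cleanup  = Task.Factory.StartNew(() => GarbageCollectOrphanedBlobs());
    Task.WaitAll(download, cleanup);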

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a performance/cost trade-off: risk of failure or poor user experience from not having excess capacity vs. the cost of idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Saving bandwidth costs often leads to savings in other places: sending fewer things over the wire often means getting fewer things from storage, and sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content shrinks to compressed content via Gzip, minified JavaScript, minified CSS, and minified images.)

A sketch of point 1 follows.
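One possible way to gzip all output in ASP.NET, assuming IIS compression has not already been configured for you (Global.asax.cs):

    using System;
    using System.IO.Compression;
    using System.Web;

    public class Global : HttpApplication
    {
        protected void Application_BeginRequest(object sender, EventArgs e)
        {
            HttpContext context = HttpContext.Current;
            string acceptEncoding = context.Request.Headers["Accept-Encoding"];
            if (acceptEncoding != null && acceptEncoding.Contains("gzip"))
            {
                // Wrap the output stream so everything written to the
                // response is gzipped on the fly.
                context.Response.Filter =
                    new GZipStream(context.Response.Filter, CompressionMode.Compress);
                context.Response.AppendHeader("Content-Encoding", "gzip");
            }
        }
    }

In practice you would also skip already-compressed content types (images, video), since recompressing them wastes CPU for little size gain.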

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for Map Reduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means peak storage traffic could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • split the input sequences
  • query partitions in parallel
  • merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern.

Leverage the multiple cores of one instance:
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs (a sketch follows)
• Watch out for the 2-hour maximum limitation
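In the StorageClient library this estimate is the visibility timeout passed to GetMessage. A sketch, with the queue name and the 30-minute value purely illustrative:

    CloudQueueClient queueClient =
        new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
    CloudQueue tasks = queueClient.GetQueueReference("blasttasks");   // hypothetical queue

    // The visibility timeout should slightly exceed the estimated task
    // run time (and must stay under the 2-hour maximum).
    CloudQueueMessage msg = tasks.GetMessage(TimeSpan.FromMinutes(30));
    if (msg != null)
    {
        // ...run the BLAST task for this partition...
        tasks.DeleteMessage(msg);   // only after the work is durably complete
    }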

(Diagram: a splitting task fans the input out to BLAST tasks running in parallel; a merging task joins their results.)

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, recording jobs in a job registry kept in Azure Tables; a global dispatch queue feeds the worker instances; Azure Blob storage holds the NCBI/BLAST databases and temporary data, maintained by a database-updating role.)


ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory state

(Diagram: the job portal's web service hands registered jobs to the job scheduler and scaling engine, which record them in the job registry.)

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• This scale of experiment is usually infeasible for most scientists

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each will be submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

(Diagram: VM allocation across the eight deployments.)

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

    3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
    3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
    3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
    3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

    3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
    3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
    3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe Data Center: in total 34,256 tasks processed.
• All 62 compute nodes lost tasks and then came back in a group (~6 nodes per group, ~30 mins apart) — this is an update domain at work
• 35 nodes experienced blob-writing failures at the same time

West Europe Data Center: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = Water volume evapotranspired (m3 s-1 m-2)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = Latent heat of vaporization (J/g)
Rn = Net radiation (W m-2)
cp = Specific heat capacity of air (J kg-1 K-1)
ρa = Dry air density (kg m-3)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s-1)
gs = Conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = Psychrometric constant (γ ≈ 66 Pa K-1)

Penman–Monteith (1964):

    ET = (Δ · Rn + ρa · cp · δq · ga) / ((Δ + γ · (1 + ga / gs)) · λv)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: the AzureMODIS web-role portal feeds a request queue; source-imagery download sites feed the data collection stage via a download queue; the reprojection, reduction 1, and reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages over source metadata; scientists download the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request is persisted as <PipelineStage>JobStatus by the MODISAzure Service (Web Role); the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus entries and dispatches work from the <PipelineStage> job queue to the <PipelineStage> task queue.)

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes science or other executables
• Marshals all storage from/to Azure blob storage to/from local Azure worker instance files
• Dequeues tasks created by the Service Monitor from the <PipelineStage> task queue
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: a reprojection request is persisted as ReprojectionJobStatus, parsed into ReprojectionTaskStatus entries, and dispatched via the job and task queues to GenericWorkers, which read swath source data storage and write reprojection data storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

(Pipeline diagram as above, annotated with per-stage scale and cost; AzureMODIS web-role portal; total cost $1420.)

Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3,500 hours, 20–100 workers; $420 CPU, $60 download
Derivation reduction stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers; $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers; $216 CPU, $2 download, $9 storage

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action. Perhaps even relocate your app
  • At all times the 'diagram' stays whole

Automated Service Management: provide code + service model

• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef (the service definition)
  • ServiceConfiguration.cscfg (the service configuration)
• GUI: double-click on the role name in the Azure project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community- and MSFT-built tools around the API; easy to roll your own (a sketch follows)
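A minimal sketch of calling the API directly from C#. The subscription ID and certificate file are placeholders, and the x-ms-version value is assumed to be one the service accepted at the time:

    using System;
    using System.IO;
    using System.Net;
    using System.Security.Cryptography.X509Certificates;

    class ListHostedServices
    {
        static void Main()
        {
            // Placeholders - substitute your own subscription ID and management cert.
            string subscriptionId = "<subscription-id>";
            X509Certificate2 cert = new X509Certificate2("management.pfx", "password");

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
                "https://management.core.windows.net/" + subscriptionId +
                "/services/hostedservices");
            request.Headers.Add("x-ms-version", "2011-02-25");  // required API version header
            request.ClientCertificates.Add(cert);               // the X509 cert authenticates the call

            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd());          // XML list of hosted services
            }
        }
    }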

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   a. Determine resource requirements
   b. Create role images
2. Allocate resources
3. Prepare nodes
   a. Place role images on nodes
   b. Configure settings
   c. Start roles
4. Configure load balancers
5. Maintain service health
   a. If a role fails, restart the role based on policy
   b. If a node fails, migrate the role based on policy

Storage

Durable storage at massive scale:
• Blob: massive files, e.g., videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface (a sketch of using it from C# follows):
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
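A sketch of two of these operations through the StorageClient library, reusing the BlobHelper's container from the exercise; the blob names are illustrative:

    // CopyBlob: server-side copy, no download/upload through the client.
    CloudBlob source = container.GetBlobReference("SummerSchoolAttendees.txt");
    CloudBlob backup = container.GetBlobReference("SummerSchoolAttendees-backup.txt");
    backup.CopyFromBlob(source);

    // SnapshotBlob: a read-only, point-in-time version of the blob.
    CloudBlob snapshot = source.CreateSnapshot();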

Two Types of Blobs Under the Hood

• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive; durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a page blob
  • Example: mount the page blob http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
• All writes to the drive are made durable to the page blob; the drive is made durable through standard page blob replication
• The drive persists, as a page blob, even when not mounted
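A sketch of mounting a drive with the CloudDrive API (Microsoft.WindowsAzure.CloudDrive assembly); the container and VHD name are made up, and the sizes are illustrative:

    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;   // CloudDrive lives here

    CloudStorageAccount account =
        new CloudStorageAccount(AccountInformation.Credentials, false);
    CloudDrive drive = account.CreateCloudDrive(
        account.BlobEndpoint.AbsoluteUri + "/drives/mydata.vhd");  // page-blob URI
    drive.Create(64);   // size in MB; throws if the drive already exists

    // In a real role, CloudDrive.InitializeCache(...) must be called before Mount.
    string driveLetter = drive.Mount(25, DriveMountOptions.None); // 25 MB local read cache
    // driveLetter is something like "X:\" - use ordinary NTFS file I/O from here on.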

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

The system load balances to meet your traffic needs:
• "Server Busy" means the single-partition limits have been reached
• Use exponential backoff on "Server Busy"

Partition Keys in Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

    PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
    1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
    1                         | Order – 1             |              |                     | $35.12
    2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
    2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

    Container Name | Blob Name
    image          | annarbor/bighouse.jpg
    image          | foxborough/gillette.jpg
    video          | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

    Queue    | Message
    jobs     | Message1
    jobs     | Message2
    workflow | Message1

Scalability Targets

Storage account:
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

(Illustration: a movie table with PartitionKey (Category) and RowKey (Title), plus Timestamp and ReleaseDate columns, holding rows from Action "Fast & Furious" (2009) through War "Defiance" (2008). The single key range can be split into partition ranges, e.g., Action–Animation served by one server and Comedy–War by another.)

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token (a sketch of handling them follows):
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• After a maximum of 5 seconds of query execution
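The AsTableServiceQuery() wrapper used in ReadAllAttendees follows continuation tokens transparently as you enumerate. When you want each page to be visible to the caller, a sketch using the StorageClient's segmented execution (inside TableHelper, so Context, tableName, and allAttendees are as defined earlier):

    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();

    // Fetch the first segment, then follow continuation tokens page by page.
    ResultSegment<AttendeeEntity> segment =
        query.EndExecuteSegmented(query.BeginExecuteSegmented(null, null));
    while (true)
    {
        foreach (AttendeeEntity attendee in segment.Results)
            allAttendees[attendee.Email] = attendee;
        if (!segment.HasMoreResults)
            break;              // no continuation token returned - we are done
        segment = segment.GetNext();
    }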

Tables Recap

• Select a PartitionKey and RowKey that help scale: efficient for frequently used queries
• Use batch (entity group) transactions: they span a single partition and reduce round trips
• Distribute load: avoid "append only" patterns; distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server busy" means either the system is load-balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology: Message Lifecycle

(Diagram: a web role calls PutMessage to add Msg 1–4 to the queue; worker roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed.)

PutMessage:

    POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml
    Date: Tue, 09 Dec 2008 21:04:30 GMT
    Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

    <?xml version="1.0" encoding="utf-8"?>
    <QueueMessagesList>
      <QueueMessage>
        <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
        <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
        <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
        <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
        <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
        <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
      </QueueMessage>
    </QueueMessagesList>

RemoveMessage:

    DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
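The same lifecycle through the StorageClient library, as a sketch (the queue name and message text are illustrative):

    CloudQueueClient queueClient =
        new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
    CloudQueue queue = queueClient.GetQueueReference("myqueue");
    queue.CreateIfNotExist();

    queue.AddMessage(new CloudQueueMessage("work ticket 42"));   // PutMessage

    CloudQueueMessage msg = queue.GetMessage();   // GetMessage: invisible for 30 s by default
    if (msg != null)
    {
        // ...process the work ticket...
        queue.DeleteMessage(msg);                 // RemoveMessage, via the pop receipt
    }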

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to some maximum; a successful poll sets the interval back to 1. A sketch follows.
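A sketch of the loop, given a CloudQueue queue as above; the cap of 64 seconds is illustrative:

    int intervalSeconds = 1;
    const int maxIntervalSeconds = 64;   // truncation cap
    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage();
        if (msg != null)
        {
            // ...process the message...
            queue.DeleteMessage(msg);
            intervalSeconds = 1;         // success: poll quickly again
        }
        else
        {
            // Empty poll: back off, doubling the interval up to the cap.
            System.Threading.Thread.Sleep(TimeSpan.FromSeconds(intervalSeconds));
            intervalSeconds = Math.Min(intervalSeconds * 2, maxIntervalSeconds);
        }
    }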

Removing Poison Messages

(Diagram: producers P1 and P2 feed the queue; consumers C1 and C2 dequeue. Each message carries a dequeue count.)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count now 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1 (dequeue count now 3)
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – the poison message is removed instead of being processed yet again
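A sketch of the threshold check; the limit of 2 mirrors the walkthrough above, and ProcessMessage is a hypothetical handler:

    const int maxDequeueCount = 2;
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        if (msg.DequeueCount > maxDequeueCount)
        {
            // Poison message: it has already failed processing repeatedly.
            // Log it (or park it in a dead-letter blob/table), then remove it.
            queue.DeleteMessage(msg);
        }
        else
        {
            ProcessMessage(msg);        // hypothetical handler
            queue.DeleteMessage(msg);
        }
    }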

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages can result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message (and garbage-collect orphaned blobs); batch messages where possible
• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Blobs, drives, tables, and queues each fill a distinct role. For more information see:

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

    public class AccountInformation
    {
        private static string storageKey = "tHiSiSnOtMyKeY";
        private static string accountName = "jjstore";
        private static StorageCredentialsAccountAndKey credentials;

        internal static StorageCredentialsAccountAndKey Credentials
        {
            get
            {
                if (credentials == null)
                    credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
                return credentials;
            }
        }
    }

Code – BlobHelper.cs

    public class BlobHelper
    {
        private static string defaultContainerName = "school";
        private CloudBlobClient client = null;
        private CloudBlobContainer container = null;

        private void InitContainer()
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }

Code – BlobHelper.cs

    public void WriteFileToBlob(string filePath)
    {
        if (client == null || container == null)
            InitContainer();
        FileInfo file = new FileInfo(filePath);
        CloudBlob blob = container.GetBlobReference(file.Name);
        blob.Properties.ContentType = GetContentType(file.Extension);
        blob.UploadFile(file.FullName);
        // Or, if you want to write a string, replace the last line with:
        //     blob.UploadText(someString);
        // and make sure you set the content type to the appropriate
        // MIME type (e.g., "text/plain").
    }

Code – BlobHelper.cs

    public string GetBlobText(string blobName)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        try
        {
            return blob.DownloadText();
        }
        catch (Exception)
        {
            // The blob probably does not exist, or there is no connection available.
            return null;
        }
    }

Application Code – Blobs

    private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
    {
        StringBuilder buff = new StringBuilder();
        buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage," +
                        "FavoriteIceCream,YearsInPhD,Graduated");
        foreach (AttendeeEntity attendee in attendees)
            buff.AppendLine(attendee.ToCsvString());
        blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
    }

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>. In this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

    using Microsoft.WindowsAzure.StorageClient;

    public class AttendeeEntity : TableServiceEntity
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Email { get; set; }
        public DateTime Birthday { get; set; }
        public string FavoriteIceCream { get; set; }
        public int YearsInPhD { get; set; }
        public bool Graduated { get; set; }
        ...

Code – TableEntities

    public void UpdateFrom(AttendeeEntity other)
    {
        FirstName = other.FirstName;
        LastName = other.LastName;
        Email = other.Email;
        Birthday = other.Birthday;
        FavoriteIceCream = other.FavoriteIceCream;
        YearsInPhD = other.YearsInPhD;
        Graduated = other.Graduated;
        UpdateKeys();
    }

    public void UpdateKeys()
    {
        PartitionKey = "SummerSchool";
        RowKey = Email;
    }

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

lsquoGrokkingrsquo the service model

bull Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management Provide code + service model

bull Platform identifies and allocates resources deploys the service manages service health

bull Configuration is handled by two files

ServiceDefinitioncsdef

ServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query returns a continuation token whenever it hits one of these limits:
• Maximum of 1000 rows in a response
• The end of a partition range boundary
• Maximum of 5 seconds to execute the query
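A sketch of draining a query across continuation tokens with the WCF Data Services client; the MovieEntity type and table name are illustrative, and `context` is a TableServiceContext as in the code slides later in the deck. (CloudTableQuery's AsTableServiceQuery() can follow tokens for you automatically.)

    var query = (DataServiceQuery<MovieEntity>)context.CreateQuery<MovieEntity>("Movies");
    var response = (QueryOperationResponse<MovieEntity>)query.Execute();
    DataServiceQueryContinuation<MovieEntity> token;
    do
    {
        foreach (MovieEntity movie in response)
            Console.WriteLine(movie.RowKey);
        token = response.GetContinuation();     // null when there are no more pages
        if (token != null)
            response = context.Execute(token);  // fetch the next page
    } while (token != null);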

Tables Recap

• Select PartitionKey and RowKey that help scale: efficient for frequently used queries, and distributes load
• Avoid "append only" patterns: distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Supports batch transactions
• Implement a back-off strategy for retries: "Server busy" means the load on a single partition has exceeded the limits, and partitions are load balanced to meet traffic needs

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException (a sketch follows below)
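A sketch of that point-query guidance; the table name and row key are illustrative, and `tableClient` is a CloudTableClient as in the code slides later in the deck:

    var context = tableClient.GetDataServiceContext();
    context.IgnoreResourceNotFoundException = true;   // point query yields no rows instead of throwing
    AttendeeEntity hit = (from a in context.CreateQuery<AttendeeEntity>("Attendees")
                          where a.PartitionKey == "SummerSchool"
                             && a.RowKey == "jane@contoso.com"
                          select a).AsEnumerable().FirstOrDefault();   // null if not found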

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve a message and RemoveMessage to delete it once processed.]

PutMessage:

    POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml
    Date: Tue, 09 Dec 2008 21:04:30 GMT
    Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

    <?xml version="1.0" encoding="utf-8"?>
    <QueueMessagesList>
      <QueueMessage>
        <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
        <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
        <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
        <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
        <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
        <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
      </QueueMessage>
    </QueueMessagesList>

RemoveMessage:

    DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
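The same lifecycle through the StorageClient library, as a sketch; the queue name, message text, and 30-second visibility timeout are illustrative:

    CloudQueueClient queueClient =
        new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
    CloudQueue queue = queueClient.GetQueueReference("jobs");
    queue.CreateIfNotExist();

    queue.AddMessage(new CloudQueueMessage("work ticket 42"));           // PutMessage
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));  // invisible for 30 s
    if (msg != null)
    {
        // ... process the message ...
        queue.DeleteMessage(msg);                                        // RemoveMessage
    }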

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the polling interval by 2x, and a successful poll resets the interval back to 1.
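A sketch of that polling loop, reusing the `queue` object from the sketch above; the 1 s floor and 64 s cap are illustrative:

    TimeSpan interval = TimeSpan.FromSeconds(1);
    TimeSpan maxInterval = TimeSpan.FromSeconds(64);
    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
        if (msg == null)
        {
            Thread.Sleep(interval);                                      // empty poll: wait ...
            if (interval < maxInterval)
                interval = TimeSpan.FromTicks(interval.Ticks * 2);       // ... then double
        }
        else
        {
            // ... process the message, then delete it ...
            queue.DeleteMessage(msg);
            interval = TimeSpan.FromSeconds(1);                          // successful poll: reset
        }
    }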

Removing Poison Messages

Producers (P1, P2) enqueue messages; consumers (C1, C2) dequeue them:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

The scenario continues:
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

And concludes:
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2, so it is treated as a poison message
13. C1: DeleteMessage(Q, msg 1)
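A sketch of that dequeue-count check; the threshold of 3 deliveries and the dead-letter queue name are illustrative:

    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        if (msg.DequeueCount > 3)
        {
            // Poison message: park it for offline inspection instead of retrying forever
            CloudQueue deadLetter = queueClient.GetQueueReference("jobs-poison");
            deadLetter.CreateIfNotExist();
            deadLetter.AddMessage(new CloudQueueMessage(msg.AsString));
            queue.DeleteMessage(msg);
        }
        else
        {
            // ... normal processing, then DeleteMessage ...
        }
    }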

Queues Recap

• Make message processing idempotent: then there is no need to deal with delivery failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8KB: use a blob to store the message data, with a reference in the message, and garbage collect orphaned blobs (a sketch follows below)
• Batch messages where possible
• Use the message count to scale: dynamically increase/reduce workers
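A sketch of the blob-reference pattern for payloads over 8KB, reusing the queue above and the BlobHelper from the code slides below; `largePayload` and the helper method names are illustrative:

    string blobName = Guid.NewGuid().ToString();
    blobHelper.WriteStringToBlob(blobName, largePayload);        // payload goes to a blob
    queue.AddMessage(new CloudQueueMessage(blobName));           // the message is just a ticket

    CloudQueueMessage ticket = queue.GetMessage(TimeSpan.FromSeconds(30));
    string payload = blobHelper.GetBlobText(ticket.AsString);    // dereference the ticket
    // ... process, DeleteMessage(ticket), then delete the blob ...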

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

49

A Quick Exercise

…Then let's look at some code and some tools

50

Code – AccountInformation.cs

    public class AccountInformation
    {
        private static string storageKey = "tHiSiSnOtMyKeY";
        private static string accountName = "jjstore";
        private static StorageCredentialsAccountAndKey credentials;

        internal static StorageCredentialsAccountAndKey Credentials
        {
            get
            {
                if (credentials == null)
                    credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
                return credentials;
            }
        }
    }

51

Code – BlobHelper.cs

    public class BlobHelper
    {
        private static string defaultContainerName = "school";
        private CloudBlobClient client = null;
        private CloudBlobContainer container = null;

        private void InitContainer()
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }

52

Code – BlobHelper.cs

    public void WriteFileToBlob(string filePath)
    {
        if (client == null || container == null)
            InitContainer();
        FileInfo file = new FileInfo(filePath);
        CloudBlob blob = container.GetBlobReference(file.Name);
        blob.Properties.ContentType = GetContentType(file.Extension);
        blob.UploadFile(file.FullName);
    }

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").

53

Code – BlobHelper.cs

    public string GetBlobText(string blobName)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        try
        {
            return blob.DownloadText();
        }
        catch (Exception)
        {
            // The blob probably does not exist, or there is no connection available
            return null;
        }
    }

54

Application Code – Blobs

    private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
    {
        StringBuilder buff = new StringBuilder();
        buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
        foreach (AttendeeEntity attendee in attendees)
            buff.AppendLine(attendee.ToCsvString());
        blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
    }

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> — or, in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

55

Code – TableEntities

    using Microsoft.WindowsAzure.StorageClient;

    public class AttendeeEntity : TableServiceEntity
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Email { get; set; }
        public DateTime Birthday { get; set; }
        public string FavoriteIceCream { get; set; }
        public int YearsInPhD { get; set; }
        public bool Graduated { get; set; }
        ...
    }

56

Code – TableEntities

    public void UpdateFrom(AttendeeEntity other)
    {
        FirstName = other.FirstName;
        LastName = other.LastName;
        Email = other.Email;
        Birthday = other.Birthday;
        FavoriteIceCream = other.FavoriteIceCream;
        YearsInPhD = other.YearsInPhD;
        Graduated = other.Graduated;
        UpdateKeys();
    }

    public void UpdateKeys()
    {
        PartitionKey = "SummerSchool";
        RowKey = Email;
    }

57

Code – TableHelper.cs

    public class TableHelper
    {
        private CloudTableClient client = null;
        private TableServiceContext context = null;
        private Dictionary<string, AttendeeEntity> allAttendees = null;
        private string tableName = "Attendees";

        private CloudTableClient Client
        {
            get
            {
                if (client == null)
                    client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
                return client;
            }
        }

        private TableServiceContext Context
        {
            get
            {
                if (context == null)
                    context = Client.GetDataServiceContext();
                return context;
            }
        }
    }

58

Code – TableHelper.cs

    private void ReadAllAttendees()
    {
        allAttendees = new Dictionary<string, AttendeeEntity>();
        CloudTableQuery<AttendeeEntity> query =
            Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
        try
        {
            foreach (AttendeeEntity attendee in query)
                allAttendees[attendee.Email] = attendee;
        }
        catch (Exception)
        {
            // No entries in table - or other exception
        }
    }

59

Code – TableHelper.cs

    public void DeleteAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (!allAttendees.ContainsKey(email))   // nothing to delete
            return;
        AttendeeEntity attendee = allAttendees[email];

        // Delete from the cloud table
        Context.DeleteObject(attendee);
        Context.SaveChanges();

        // Delete from the memory cache
        allAttendees.Remove(email);
    }

60

Code – TableHelper.cs

    public AttendeeEntity GetAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (allAttendees.ContainsKey(email))
            return allAttendees[email];
        return null;
    }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

61

Pseudo Code – TableHelper.cs

    public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
    {
        foreach (AttendeeEntity attendee in updatedAttendees)
            UpdateAttendee(attendee, false);
        Context.SaveChanges(SaveChangesOptions.Batch);
    }

    public void UpdateAttendee(AttendeeEntity attendee)
    {
        UpdateAttendee(attendee, true);
    }

    private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
    {
        if (allAttendees.ContainsKey(attendee.Email))
        {
            AttendeeEntity existingAttendee = allAttendees[attendee.Email];
            existingAttendee.UpdateFrom(attendee);
            Context.UpdateObject(existingAttendee);
        }
        else
        {
            Context.AddObject(tableName, attendee);
        }
        if (saveChanges)
            Context.SaveChanges();
    }

62

Application Code – Cloud Tables

    private void SaveButton_Click(object sender, RoutedEventArgs e)
    {
        // Write to table
        tableHelper.UpdateAttendees(attendees);
    }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

63

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• The fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
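A minimal Task Parallel Library sketch (assuming .NET 4); the work items and method names are illustrative:

    using System.Threading.Tasks;

    // Data parallelism: process independent work items across all cores
    Parallel.ForEach(workItems, item => ProcessItem(item));

    // Task parallelism: run distinct operations concurrently
    Task compress = Task.Factory.StartNew(() => CompressOutput());
    Task upload   = Task.Factory.StartNew(() => UploadLogs());
    Task.WaitAll(compress, upload);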

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs

Storage Costs

• Understand the application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g. SQL Azure has a flat fee while Windows Azure Tables charges per transaction, so service choice can make a big cost difference
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile:
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content (a sketch follows below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
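A minimal sketch of gzipping content before writing it to blob storage (System.IO.Compression); the helper name is illustrative. For HTTP responses, IIS dynamic compression achieves the same effect without custom code:

    using System.IO;
    using System.IO.Compression;

    static byte[] GzipBytes(byte[] raw)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);        // compress the whole buffer
            return output.ToArray();
        }
    }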

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs on the web: a Hadoop implementation; Hadoop has a long history and has been improved for stability, but was originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for Map Reduce jobs on Azure: designed from the start to use cloud primitives, with built-in fault tolerance and a REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result reduction processing

Large volume data:
• A normal BLAST database can be as large as 10GB
• 100 nodes means peak storage bandwidth could reach 1TB
• The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern:

• Leverage the multiple cores of one instance: the "-a" argument of NCBI-BLAST, set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes
• Task granularity: large partitions cause load imbalance; small partitions cause unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:

• Essentially an estimate of the task run time: too small causes repeated computation; too large causes an unnecessarily long wait in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs
• Watch out for the 2-hour maximum limitation

[Diagram: a splitting task fans the input out to parallel BLAST tasks, whose outputs feed a merging task.]

Task size vs. performance:
• Benefit of the warm cache effect: 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to memory capacity

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resource

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, scaling engine, and database-updating role; worker instances pull BLAST tasks (splitting task → parallel BLAST tasks → merging task) from a global dispatch queue; Azure Tables hold the job registry, and Azure Blob storage holds the NCBI BLAST databases and temporary data.]

• An ASP.NET program hosted by a web role instance lets users submit jobs and track a job's status and logs
• Authentication/authorization is based on Live ID
• The accepted job is stored in the job registry table: for fault tolerance, avoid in-memory state

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences with an "all against all" query: the database is also the input query
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance, it would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know: experiments at this scale are usually infeasible for most scientists.

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually


• Total size of the output result is ~230GB; the total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

    3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
    3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
    3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
    3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

    3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
    3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
    3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain (~30 mins, ~6 nodes in one group)
• 35 nodes experienced blob writing failure at the same time

West Europe datacenter: 30,976 tasks were completed before the job was killed
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

    ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100MB (4K files)
• Vegetative clumping: ~5MB (1 file)
• Climate classification: ~1MB (1 file)
• 20 US years = 1 global year

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: source imagery download sites feed the data collection stage via a download queue; a reprojection queue feeds the reprojection stage; Reduction 1 and Reduction 2 queues feed the derivation and analysis reduction stages; scientists submit requests through the AzureMODIS Service Web Role portal (request queue, source metadata) and download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches work to the <PipelineStage> task queue.]

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor from the <PipelineStage> task queue
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: a reprojection request flows through the Service Monitor (Worker Role), the job queue, and the task queue to GenericWorker (Worker Role) instances, with ReprojectionJobStatus and ReprojectionTaskStatus persisted in tables, and swath source data and reprojection data held in blob storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate student rates

  Stage                | Data                             | Compute                    | Cost
  Data collection      | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers      | $50 upload, $450 storage
  Reprojection         | 400 GB, 45K files                | 3500 hours, 20-100 workers | $420 cpu, $60 download
  Derivation reduction | 5-7 GB, 55K files                | 1800 hours, 20-100 workers | $216 cpu, $1 download, $6 storage
  Analysis reduction   | <10 GB, ~1K files                | 1800 hours, 20-100 workers | $216 cpu, $2 download, $9 storage

  Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Automated Service Management: provide code + service model

• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files: ServiceDefinition.csdef and ServiceConfiguration.cscfg (a sketch of both follows below)
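A minimal sketch of the two files for a service with one web role; the role name, endpoint, setting, and instance count are illustrative, and exact schema details vary with the SDK version:

ServiceDefinition.csdef:

    <ServiceDefinition name="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="WebFrontEnd" vmsize="Small">
        <Endpoints>
          <InputEndpoint name="HttpIn" protocol="http" port="80" />
        </Endpoints>
        <ConfigurationSettings>
          <Setting name="DataConnectionString" />
        </ConfigurationSettings>
      </WebRole>
    </ServiceDefinition>

ServiceConfiguration.cscfg:

    <ServiceConfiguration serviceName="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="WebFrontEnd">
        <Instances count="2" />
        <ConfigurationSettings>
          <Setting name="DataConnectionString" value="DefaultEndpointsProtocol=http;AccountName=...;AccountKey=..." />
        </ConfigurationSettings>
      </Role>
    </ServiceConfiguration>

Note that the definition describes the "shape" of the service, while the configuration (instance counts, keys, settings) can be changed without redeploying the package.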

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files: an encrypted package of your code, and your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   a. Determine resource requirements
   b. Create role images
2. Allocate resources
3. Prepare nodes
   a. Place role images on nodes
   b. Configure settings
   c. Start roles
4. Configure load balancers
5. Maintain service health
   a. If a role fails, restart the role based on policy
   b. If a node fails, migrate the role based on policy

Storage

Durable Storage At Massive Scale
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  – PutBlob: inserts a new blob or overwrites the existing blob
  – GetBlob: gets the whole blob or a specific range
  – DeleteBlob
  – CopyBlob
  – SnapshotBlob
  – LeaseBlob

Two Types of Blobs Under the Hood

• Block Blob: targeted at streaming workloads; each blob consists of a sequence of blocks, each identified by a Block ID; size limit 200GB per blob (a sketch follows below)
• Page Blob: targeted at random read/write workloads; each blob consists of an array of pages, each identified by its offset from the start of the blob; size limit 1TB per blob
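A sketch of the block-blob upload path with the StorageClient library; the 4 MB block size, file path, and blob name are illustrative, and `container` is a CloudBlobContainer as in the earlier code slides. In the common case, CloudBlob.UploadFile does this chunking for you:

    CloudBlockBlob blockBlob = container.GetBlockBlobReference("bigvideo.wmv");
    List<string> blockIds = new List<string>();
    byte[] buffer = new byte[4 * 1024 * 1024];   // 4 MB blocks
    int read, index = 0;
    using (FileStream fs = File.OpenRead(@"C:\bigvideo.wmv"))
    {
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Block IDs must be base64 strings of equal length
            string blockId = Convert.ToBase64String(BitConverter.GetBytes(index++));
            blockBlob.PutBlock(blockId, new MemoryStream(buffer, 0, read), null);
            blockIds.Add(blockId);
        }
    }
    blockBlob.PutBlockList(blockIds);   // commit the uploaded blocks in order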

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive: durability and survival of data on application failover; enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  – Example: mount the Page Blob http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
  – All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  – The Page Blob persists even when the drive is not mounted

Windows Azure Tables

• Provides structured storage: massively scalable tables with billions of entities (rows) and TBs of data, using thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST – with any platform or language

Windows Azure Queues

• Queues are performant, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message is processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role); the Service Monitor (Worker Role) persists <PipelineStage>JobStatus, parses and persists <PipelineStage>TaskStatus, and dispatches work onto the <PipelineStage>Task Queue.]

All work is actually done by a Worker Role:

• Sandboxes science or other executables
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches onto the <PipelineStage>Task Queue, from which GenericWorker (Worker Role) instances pull tasks and marshal <Input>Data Storage.]

The GenericWorker:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: a Reprojection Request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches from the Job Queue onto the Task Queue for GenericWorker (Worker Role) instances; tasks point into Reprojection Data Storage, ScanTimeList, SwathGranuleMeta, and Swath Source Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e. a single tile)
• Query SwathGranuleMeta to get geo-metadata (e.g. boundaries) for each swath tile
• Query ScanTimeList to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate student rates

Stage-by-stage scale and cost (annotated on the same pipeline diagram as above, with the AzureMODIS Service Web Role Portal at the front):

Stage                | Data                  | Compute                          | Cost
Data Collection      | 400-500 GB, 60K files | 10 MB/sec, 11 hours, <10 workers | $50 upload + $450 storage
Reprojection         | 400 GB, 45K files     | 3500 hours, 20-100 workers       | $420 cpu + $60 download
Derivation Reduction | 5-7 GB, 55K files     | 1800 hours, 20-100 workers       | $216 cpu + $1 download + $6 storage
Analysis Reduction   | <10 GB, ~1K files     | 1800 hours, 20-100 workers       | $216 cpu + $2 download + $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Service Definition and Service Configuration GUI: double-click on the Role Name in the Azure Project.

Deploying to the cloud

• We can deploy from the portal or from a script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API – easy to roll your own (see the sketch below)
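As a taste of rolling your own, here is a minimal C# sketch that lists the hosted services in a subscription over the management endpoint; the subscription ID, certificate file, and version-header value are illustrative assumptions, and error handling is omitted:

using System;
using System.IO;
using System.Net;
using System.Security.Cryptography.X509Certificates;

class ListHostedServices
{
    static void Main()
    {
        string subscriptionId = "<your-subscription-id>";
        var request = (HttpWebRequest)WebRequest.Create(
            "https://management.core.windows.net/" + subscriptionId + "/services/hostedservices");
        request.Headers.Add("x-ms-version", "2010-10-28");            // required API version header
        request.ClientCertificates.Add(
            new X509Certificate2("management.pfx", "password"));      // cert registered with the subscription
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd());                    // XML list of hosted services
    }
}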

The Secret Sauce – The Fabric

The Fabric is the "brain" behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Durable Storage At Massive Scale

• Blobs – massive files, e.g. videos, logs
• Drives – use standard file system APIs
• Tables – non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites an existing blob
  • GetBlob – gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block blob (a StorageClient sketch of both types follows)
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
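To make the distinction concrete, a hedged sketch against the StorageClient library of the era; 'account' is an assumed CloudStorageAccount, and the container and blob names are illustrative:

// using System.IO; using Microsoft.WindowsAzure; using Microsoft.WindowsAzure.StorageClient;
CloudBlobClient blobClient = account.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("demo");
container.CreateIfNotExist();

// Block blob: streaming-friendly; the library splits the upload into blocks for you.
CloudBlockBlob blockBlob = container.GetBlockBlobReference("lecture.wmv");
blockBlob.UploadFile("lecture.wmv");

// Page blob: fixed maximum size, random read/write in 512-byte pages at explicit offsets.
CloudPageBlob pageBlob = container.GetPageBlobReference("scratch.vhd");
pageBlob.Create(1024 * 1024);                          // 1 MB page blob
using (MemoryStream onePage = new MemoryStream(new byte[512]))
    pageBlob.WritePages(onePage, 0);                   // write a single page at offset 0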

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access the durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud (a mounting sketch follows)
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:\
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
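A minimal mounting sketch, assuming the 1.x-era CloudDrive API from memory; the connection-string name, local-resource name, blob name, sizes, and cache size are all illustrative:

// using Microsoft.WindowsAzure; using Microsoft.WindowsAzure.ServiceRuntime; using Microsoft.WindowsAzure.StorageClient;
// Inside a role: initialize the drive cache from a local storage resource first.
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

CloudStorageAccount account = CloudStorageAccount.FromConfigurationSetting("DataConnectionString");
CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd");  // a Page Blob under the hood
drive.Create(64);                                  // new 64 MB drive; skip if it already exists
string path = drive.Mount(25, DriveMountOptions.None);  // returns e.g. "X:\" - use normal NTFS APIs on it
// ... read and write files under 'path' ...
drive.Unmount();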

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData; .NET classes and LINQ
  • REST – with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message is processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:

• Every data object has a partition key, different for each data type (blobs, entities, queues)
• A partition can be served by a single server; the system load-balances partitions based on traffic pattern
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• Use exponential backoff on "Server Busy": the system load-balances to meet your traffic needs, and "Server Busy" means single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition:

PartitionKey (CustomerId) | RowKey (RowKind) | Name         | CreditCardNumber    | OrderTotal
1                         | Customer         | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1        |              |                     | $35.12
2                         | Customer         | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3        |              |                     | $10.00

Blobs – Container name + Blob name; every blob and its snapshots are in a single partition:

Container Name | Blob Name
image          | annarbor-bighouse.jpg
image          | foxborough-gillette.jpg
video          | annarbor-bighouse.jpg

Queues – Queue name; all messages for a single queue belong to the same partition:

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.

Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

As traffic grows, the system can split the table into partition ranges served by different servers – for example [Action … Animation] on one server and [Comedy … War] on another.

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can stop short and hand you a continuation token (handled explicitly in the sketch below):

• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
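One way to handle tokens explicitly with WCF Data Services – a sketch reusing the attendee entity and table names from the code slides later in this deck:

// using System.Data.Services.Client;  'context' is an assumed TableServiceContext.
DataServiceQueryContinuation<AttendeeEntity> token = null;
do
{
    QueryOperationResponse<AttendeeEntity> response;
    if (token == null)
        response = (QueryOperationResponse<AttendeeEntity>)
            context.CreateQuery<AttendeeEntity>("Attendees").Execute();
    else
        response = context.Execute(token);          // follow the continuation token

    foreach (AttendeeEntity attendee in response)
        Process(attendee);                          // placeholder for your work

    token = response.GetContinuation();             // null once the result set is complete
} while (token != null);

Alternatively, a CloudTableQuery obtained via AsTableServiceQuery() follows continuation tokens for you when enumerated.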

Tables Recap

Select a PartitionKey and RowKey that help scale; tables are then efficient for frequently used queries, support batch transactions, and distribute load.

• Avoid "append only" patterns: distribute load by using a hash etc. as a key prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server Busy" means either the system is load-balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to enqueue Msg 1–4; Worker Roles call GetMessage (with a visibility timeout) to dequeue, then RemoveMessage once processing succeeds.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
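The same lifecycle through the StorageClient library rather than raw REST – a hedged sketch, where 'account' is an assumed CloudStorageAccount and Process is a placeholder for your work:

CloudQueueClient queueClient = account.CreateCloudQueueClient();
CloudQueue queue = queueClient.GetQueueReference("myqueue");
queue.CreateIfNotExist();

queue.AddMessage(new CloudQueueMessage("<test>test</test>"));        // PutMessage

CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));  // 30 s visibility timeout
if (msg != null)
{
    Process(msg.AsString);     // if we crash here, the message reappears after the timeout
    queue.DeleteMessage(msg);  // RemoveMessage; uses the pop receipt under the hood
}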

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll resets the interval back to 1. A sketch follows.
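A minimal sketch of the idea; the interval bounds are illustrative, and 'queue' and Process are assumed as in the previous sketch:

// using System; using System.Threading; using Microsoft.WindowsAzure.StorageClient;
TimeSpan interval = TimeSpan.FromSeconds(1);
const int maxSeconds = 64;                       // truncation point (assumption)

while (true)
{
    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        Process(msg.AsString);
        queue.DeleteMessage(msg);
        interval = TimeSpan.FromSeconds(1);      // successful poll resets the interval
    }
    else
    {
        Thread.Sleep(interval);                  // empty poll: back off, doubling up to the cap
        interval = TimeSpan.FromSeconds(Math.Min(interval.TotalSeconds * 2, maxSeconds));
    }
}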

Removing Poison Messages

[Diagram: producers P1 and P2 enqueue; consumers C1 and C2 dequeue; the number alongside each message is its dequeue count.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumes msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashes
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarts
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2 – treat it as poison
13. C1: DeleteMessage(Q, msg 1)

Queues Recap

• No need to deal with failures yourself: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages (a sketch follows)
• Messages > 8 KB: use a blob to store the message data with a reference in the message, batch messages, and garbage-collect orphaned blobs
• Dynamically increase or reduce workers: use the message count to scale
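A dequeue-count sketch of the recap's poison-message rule; the threshold and the dead-letter handling are illustrative assumptions:

CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > 2)
    {
        // The message has already failed at least twice: treat it as poison.
        LogPoisonMessage(msg.AsString);   // placeholder: dead-letter it somewhere inspectable
        queue.DeleteMessage(msg);
    }
    else
    {
        Process(msg.AsString);            // idempotent by design
        queue.DeleteMessage(msg);
    }
}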

Windows Azure Storage Takeaways

Blobs, Drives, Tables, and Queues: durable storage at massive scale.

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs (continued)

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
}

// Or, if you want to write a string, replace the last line with
//     blob.UploadText(someString);
// and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").

Code – BlobHelper.cs (continued)

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> – or in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

Code – TableEntities (continued)

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs (continued)

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table - or other exception.
    }
}

Code – TableHelper.cs (continued)

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))   // nothing to delete
        return;
    AttendeeEntity attendee = allAttendees[email];
    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();
    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs (continued)

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money (it is pretty rare to see linear scaling across 8 cores)
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
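A minimal .NET 4 sketch of both TPL flavors; the work items and tasks are illustrative:

using System;
using System.Threading.Tasks;

class ConcurrencyDemo
{
    static void Main()
    {
        int[] workItems = { 1, 2, 3, 4, 5, 6, 7, 8 };

        // Data parallelism: spread independent iterations across all cores.
        Parallel.ForEach(workItems, item => Console.WriteLine("processed " + item));

        // Task parallelism: run distinct units of work concurrently.
        Task render = Task.Factory.StartNew(() => Console.WriteLine("rendering thumbnails"));
        Task index  = Task.Factory.StartNew(() => Console.WriteLine("updating search index"));
        Task.WaitAll(render, index);
    }
}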

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs – performance vs. cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile: e.g. SQL Azure has a flat fee while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content (a module sketch follows)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]
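One common way to apply rule 1 in an ASP.NET web role is a response filter – a hedged sketch (module registration in web.config is omitted, and the module name is illustrative):

using System.IO.Compression;
using System.Web;

public class GzipModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            HttpContext ctx = ((HttpApplication)sender).Context;
            string accepted = ctx.Request.Headers["Accept-Encoding"] ?? "";
            if (accepted.Contains("gzip"))
            {
                // Wrap the output stream so everything written is gzip-compressed on the fly.
                ctx.Response.Filter = new GZipStream(ctx.Response.Filter, CompressionMode.Compress);
                ctx.Response.AppendHeader("Content-Encoding", "gzip");
            }
        };
    }

    public void Dispose() { }
}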

Best Practices Summary

• Doing "less" is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs on the web
  • A Hadoop implementation; Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for Map Reduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – Map Reduce on Azure: http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School.

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) – needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST:

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, task parallelism on an elastic cloud, and the simple split/join pattern described next

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple Split/Join pattern:

• Leverage the multiple cores of one instance: the "–a" argument of NCBI-BLAST, set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes
• Task granularity: a large partition causes load imbalance; a small partition causes unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: do test runs to profile, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task: essentially an estimate of the task run time
  • Too small: repeated computation; too large: an unnecessarily long wait in case of instance failure
  • Best practice: estimate the value based on the number of pair-bases in the partition and on test runs (see the sketch below), and watch out for the 2-hour maximum limitation
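A sketch of that best practice; the throughput constant and safety margin are assumptions to be calibrated from your own test runs:

// Estimate a visibility timeout from the partition's size, staying under the 2-hour cap.
static TimeSpan EstimateVisibilityTimeout(long pairBases, double pairBasesPerMinute)
{
    double expectedMinutes = pairBases / pairBasesPerMinute;
    double padded = expectedMinutes * 1.5;              // 50% safety margin (assumption)
    return TimeSpan.FromMinutes(Math.Min(padded, 115)); // stay safely below the 2-hour maximum
}

// Usage: hand the estimate to GetMessage when a worker claims a BLAST task.
CloudQueueMessage task = queue.GetMessage(
    EstimateVisibilityTimeout(partitionPairBases, measuredPairBasesPerMinute));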

[Diagram: a Splitting task fans out into many parallel BLAST tasks, which a Merging task joins.]

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances, primarily due to the memory capability

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resource

[Architecture diagram: a Web Portal and Web Service (Web Role) handle job registration into a Job Registry (Azure Table); a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data, maintained by a database-updating Role. Each job is split by a Splitting task into parallel BLAST tasks whose results a Merging task joins.]

• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored in the job registry table – for fault tolerance, avoid in-memory state

[Diagram: the Web Portal and Web Service feed job registration into the Job Scheduler; the Job Portal, Scaling Engine, and Job Registry coordinate around them.]

R. palustris as a platform for H2 production (Eric Schadt, Sage; Sam Phattarasukol, Harwood Lab, UW)

• Blasted ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec
• AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences via an "all against all" query: the database is also the input query
• The protein database is large (42 GB); 9,865,668 sequences in total to be queried; theoretically 100 billion sequence comparisons
• Performance estimation, based on sample runs on one extra-large Azure instance: this would require 3,216,731 minutes (6.1 years) on one desktop
• One of the biggest BLAST jobs as far as we know – this scale of experiment is usually infeasible for most scientists

Setup:
• Allocated ~4,000 cores in total: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances appeared, the load was redistributed manually


Results:
• Total size of the output is ~230 GB; the total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place (see the log excerpts and datacenter notes above)…

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from script

bull VS builds two files

bull Encrypted package of your code

bull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes

bull (which is better than six months)

Service Management API

bullREST based API to manage your services

bullX509-certs for authentication

bullLets you create delete change upgrade swaphellip

bullLots of community and MSFT-built tools around the API

- Easy to roll your own

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – MapReduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School.

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics

• Identifies similarity between bio-sequences

Computationally intensive:

• Large number of pairwise alignment operations

• A BLAST run can take 700–1000 CPU hours

• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:

• Segment the input
  • Segment processing (querying) is pleasingly parallel

• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:

• A normal BLAST database can be as large as 10 GB

• With 100 nodes, the peak storage bandwidth could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the generally suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple split/join pattern

Leverage the multiple cores of one instance:
• The "–a" argument of NCBI-BLAST
• Set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads
  • NCBI-BLAST startup overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait before another instance picks the task up after an instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs
• Watch out for the 2-hour maximum limitation

[Diagram: a splitting task fans the input out to many parallel BLAST tasks, whose results feed a merging task.]

Task size vs. performance:

• Benefit of the warm-cache effect

• 100 sequences per partition is the best choice

Instance size vs. performance:

• Super-linear speedup with larger worker instances

• Primarily due to memory capacity

Task size/instance size vs. cost:

• The extra-large instance generated the best and most economical throughput

• It fully utilizes the resources

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine and dispatches work through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the BLAST databases, temporary data, etc.; a database-updating role maintains the NCBI databases.]


Web Portal:

• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs

• Authentication/authorization based on Live ID

• An accepted job is stored in the Job Registry table
  • Fault tolerance: avoid in-memory state

R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min

• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query

• The protein database is large (4.2 GB)
  • In total, 9,865,668 sequences to be queried

• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance

• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• Experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe

• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service

• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution

• Each segment consists of smaller partitions

• When the load becomes imbalanced, redistribute it manually


• The total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, the real working instance time should be 6–8 days

• Look into the log data to analyze what took place…


A normal log record should look like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe Data Center: 34,256 tasks processed in total.

• All 62 compute nodes lost tasks and then came back in a group: this is an update domain (~30 mins, ~6 nodes per group)

• 35 nodes experienced blob-writing failures at the same time

West Europe Datacenter: 30,976 tasks were completed, and the job was killed.

• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ Rn + ρa cp δq ga) / ((Δ + γ (1 + ga/gs)) λv)

where
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:

• Lots of inputs, big data reduction

• Some of the inputs are not so simple

Data inputs:

• NASA MODIS imagery source archives: 5 TB (600K files)

• FLUXNET curated sensor dataset: 30 GB (960 files)

• FLUXNET curated field dataset: 2 KB (1 file)

• NCEP/NCAR: ~100 MB (4K files)

• Vegetative clumping: ~5 MB (1 file)

• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage:

• Downloads requested input tiles from NASA FTP sites

• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:

• Converts source tile(s) to intermediate-result sinusoidal tiles

• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:

• First stage visible to the scientist

• Computes ET in our initial use

Analysis reduction stage:

• Optional second stage visible to the scientist

• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: the AzureMODIS Service Web Role Portal places user requests on a Request Queue; the Data Collection Stage pulls source imagery from the download sites via a Download Queue; tiles then flow through the Reprojection Stage (Reprojection Queue), the Derivation Reduction Stage (Reduction 1 Queue), and the Analysis Reduction Stage (Reduction 2 Queue); scientists download the scientific results. Source metadata is kept alongside the queues.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues the job on the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses the job into tasks, persists <PipelineStage>TaskStatus, and dispatches them to the <PipelineStage>Task Queue.]

All work is actually done by a GenericWorker (Worker Role):

• Sandboxes the science or other executable

• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files

• Dequeues tasks created by the Service Monitor

• Retries failed tasks 3 times

• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue, from which GenericWorker (Worker Role) instances pull tasks and read <Input>Data Storage.]

[Diagram: a Reprojection Request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches through the Job and Task Queues to GenericWorker (Worker Role) instances that read Swath Source Data Storage and write Reprojection Data Storage.]

The backing tables:

• ReprojectionJobStatus: each entity specifies a single reprojection job request

• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile)

• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile

• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and the need to run reductions multiple times

• Storage costs are driven by the data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate-student rates

[Diagram: the same pipeline as above (AzureMODIS Service Web Role Portal, Request, Download, Reprojection, and Reduction 1/2 Queues, scientists downloading results), annotated with the per-stage figures below.]

Stage                 | Data       | Files | Compute              | Workers | Costs
Data Collection       | 400-500 GB | 60K   | 11 hours @ 10 MB/sec | <10     | $50 upload, $450 storage
Reprojection          | 400 GB     | 45K   | 3,500 hours          | 20-100  | $420 CPU, $60 download
Derivation Reduction  | 5-7 GB     | 55K   | 1,800 hours          | 20-100  | $216 CPU, $1 download, $6 storage
Analysis Reduction    | <10 GB     | ~1K   | 1,800 hours          | 20-100  | $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

GUI

• Double-click on the Role Name in the Azure Project

Deploying to the cloud

• We can deploy from the portal or from a script

• VS builds two files:
  • An encrypted package of your code
  • Your config file

• You must create an Azure account, then a service, and then you deploy your code

• Deployment can take up to 20 minutes
  • (which is better than six months)

Service Management API

• REST-based API to manage your services

• X.509 certificates for authentication

• Lets you create, delete, change, upgrade, swap…

• Lots of community- and MSFT-built tools around the API; easy to roll your own (a minimal sketch follows)
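A minimal sketch of one call against the API, assuming a management certificate has already been uploaded for the subscription; the subscription ID, certificate file, and password are placeholders:

using System;
using System.IO;
using System.Net;
using System.Security.Cryptography.X509Certificates;

class ListHostedServices
{
    static void Main()
    {
        string subscriptionId = "<subscription-id>"; // placeholder
        X509Certificate2 cert = new X509Certificate2("management.pfx", "password");

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
            "https://management.core.windows.net/" + subscriptionId + "/services/hostedservices");
        request.Headers.Add("x-ms-version", "2010-10-28"); // required API version header
        request.ClientCertificates.Add(cert);              // the X.509 cert authenticates the call

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd()); // XML listing of hosted services
        }
    }
}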

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   a. Determine resource requirements
   b. Create role images

2. Allocate resources

3. Prepare nodes
   a. Place role images on the nodes
   b. Configure settings
   c. Start roles

4. Configure load balancers

5. Maintain service health
   a. If a role fails, restart it based on policy
   b. If a node fails, migrate the role based on policy

Storage

Durable Storage at Massive Scale:

• Blob – massive files, e.g., videos, logs

• Drive – use standard file-system APIs

• Tables – non-relational, but with few scale limits (use SQL Azure for relational data)

• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)

• Can be served through the Windows Azure CDN service

• Standard REST interface:
  • PutBlob – inserts a new blob, overwriting any existing blob
  • GetBlob – gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use

• Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>

• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted (a mount sketch follows)
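A minimal mount sketch, assuming the Microsoft.WindowsAzure.CloudDrive library is referenced and the role's local drive cache has been initialized elsewhere; the blob name, drive size, and cache size are illustrative:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

static class DriveExample
{
    // Returns the local NTFS path (e.g., "X:\") of the mounted drive.
    public static string MountDrive(CloudStorageAccount account)
    {
        // The drive is backed by a page blob; create it on first run.
        CloudDrive drive = account.CreateCloudDrive(
            account.BlobEndpoint + "/drives/mydrive.vhd");
        try { drive.Create(1024); }                  // size in MB
        catch (CloudDriveException) { /* drive already exists */ }

        return drive.Mount(25, DriveMountOptions.None); // 25 MB local read cache
    }
}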

Windows Azure Tables

• Provides structured storage

• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows

• Highly available & durable
  • Data is replicated several times

• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery

• Simple asynchronous work dispatch

• Programming semantics ensure that a message can be processed at least once

• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:

• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key

• A partition can be served by a single server

• The system load-balances partitions based on traffic patterns

• The partition key controls entity locality and is the unit of scale

• Load balancing can take a few minutes to kick in

• It can take a couple of seconds for a partition to become available on a different server

The system load-balances:

• Use exponential backoff on "Server Busy"

• The system load-balances to meet your traffic needs

• "Server Busy" means the single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey:
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name:
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name:
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff (a sketch follows).
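One way to get this behavior with the StorageClient library is its built-in retry policies; a sketch using the AccountInformation class shown later in these slides (the retry count and delta are illustrative):

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class RetryConfig
{
    static CloudTableClient CreateClient()
    {
        CloudTableClient tableClient =
            new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
        // Retry up to 5 times, spacing attempts exponentially around a
        // 2-second delta, instead of hammering a busy partition.
        tableClient.RetryPolicy = RetryPolicies.RetryExponential(5, TimeSpan.FromSeconds(2));
        return tableClient;
    }
}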

Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

Each PartitionKey value (e.g., Action) forms a partition; the system can serve ranges of partitions (e.g., Action–Animation and Comedy–War) from different servers as load grows.

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:

• At a maximum of 1,000 rows in a response

• At the end of a partition range boundary

• At a maximum of 5 seconds of query execution

Tables Recap

• Efficient for frequently used queries

• Supports batch transactions

• Distributes load

Best practices:

• Select a PartitionKey and RowKey that help scale; distribute load by using a hash etc. as a prefix

• Avoid "append only" patterns

• Always handle continuation tokens; expect them for range queries

• "OR" predicates are not optimized; execute the queries that form the "OR" predicates as separate queries

• Implement a back-off strategy for retries; "Server Busy" means the load on a single partition has exceeded the limits, and the system load-balances partitions to meet traffic needs

WCF Data Services

• Use a new context for each logical operation

• AddObject/AttachTo can throw an exception if the entity is already being tracked

• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException (a sketch follows)
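A minimal sketch of the last point, reusing the AttendeeEntity type defined later in these slides (the table name and keys follow that example):

using System.Linq;
using Microsoft.WindowsAzure.StorageClient;

static class PointQueryExample
{
    // Returns null instead of throwing when the entity does not exist.
    public static AttendeeEntity Find(CloudTableClient client, string email)
    {
        TableServiceContext context = client.GetDataServiceContext();
        context.IgnoreResourceNotFoundException = true;
        return (from a in context.CreateQuery<AttendeeEntity>("Attendees")
                where a.PartitionKey == "SummerSchool" && a.RowKey == email
                select a).FirstOrDefault();
    }
}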

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness

• This can aid in scaling and performance

• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Messages are limited to 8 KB in size

• Commonly used with the work-ticket pattern (a sketch follows this list)

• Why not simply use a table?
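A sketch of the work-ticket idea under assumed names: the producer stores the payload as a blob and enqueues only a small ticket that references it, staying well under the message size limit:

using Microsoft.WindowsAzure.StorageClient;

static class WorkTicketExample
{
    // The payload lives in blob storage; the queue carries only a small ticket.
    public static void Enqueue(CloudBlobContainer container, CloudQueue queue,
                               string jobId, string payload)
    {
        CloudBlob blob = container.GetBlobReference("tickets/" + jobId); // hypothetical naming
        blob.UploadText(payload);
        queue.AddMessage(new CloudQueueMessage(jobId)); // well under the 8 KB limit
    }
}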

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages to the queue; Worker Roles call GetMessage (with a timeout) to receive them, and RemoveMessage once processing succeeds.]

PutMessage:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
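The same lifecycle through the StorageClient library, as a minimal worker-style sketch (the queue name is illustrative):

using System;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

static class QueueWorker
{
    public static void Run(CloudStorageAccount account)
    {
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("jobs");
        queue.CreateIfNotExist();

        while (true)
        {
            // The message is hidden from other consumers for 30 seconds.
            CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
            if (msg == null) { Thread.Sleep(1000); continue; }

            Console.WriteLine("Processing " + msg.AsString);
            queue.DeleteMessage(msg); // remove only after processing succeeds
        }
    }
}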

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:

• Each empty poll increases the polling interval by 2x

• A successful poll sets the interval back to 1
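A minimal sketch of that loop; the one-second base interval and the truncation cap are assumed values:

using System;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

static class BackoffPoller
{
    public static void Poll(CloudQueue queue)
    {
        int interval = 1;           // seconds
        const int maxInterval = 60; // truncation point (assumed cap)

        while (true)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg != null)
            {
                Console.WriteLine("Processing " + msg.AsString);
                queue.DeleteMessage(msg);
                interval = 1;       // success resets the interval
            }
            else
            {
                Thread.Sleep(TimeSpan.FromSeconds(interval));
                interval = Math.Min(interval * 2, maxInterval); // empty poll doubles it
            }
        }
    }
}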

Removing Poison Messages

[Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them; the queue tracks a dequeue count per message.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2: treat it as a poison message
13. C1: DeleteMessage(Q, msg 1)
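A sketch of the dequeue-count check from steps 12–13; the threshold of 2 follows the walkthrough, and the logging stands in for whatever poison-handling policy you choose:

using System;
using Microsoft.WindowsAzure.StorageClient;

static class PoisonMessageExample
{
    public static void ProcessOne(CloudQueue queue)
    {
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
        if (msg == null) return;

        // DequeueCount is maintained by the queue service; a message that keeps
        // reappearing has repeatedly failed its consumers.
        if (msg.DequeueCount > 2)
        {
            Console.WriteLine("Poison message discarded: " + msg.Id);
            queue.DeleteMessage(msg);
            return;
        }

        Console.WriteLine("Processing " + msg.AsString); // stand-in for real work
        queue.DeleteMessage(msg);
    }
}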

Queues Recap

• No need to deal with failures: make message processing idempotent

• Invisible messages result in out-of-order delivery: do not rely on order

• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages

• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs

• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways

Blobs, Drives, Tables, and Queues

More information:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
        {
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
}

// Or, if you want to write a string, replace the last line with:
//     blob.UploadText(someString);
// And make sure you set the content type to the appropriate MIME type (e.g., "text/plain").

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
    {
        buff.AppendLine(attendee.ToCsvString());
    }
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    ...
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
        {
            allAttendees[attendee.Email] = attendee;
        }
    }
    catch (Exception)
    {
        // No entries in the table - or some other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo-code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
    {
        UpdateAttendee(attendee, false);
    }
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it. Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2


58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob: massive files, e.g. videos, logs
Drive: use standard file system APIs
Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob or overwrites an existing blob
  • GetBlob: gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\ from http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, REST with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key, and it is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• "Server Busy" means the load on a single partition has exceeded its limits; use exponential backoff on "Server Busy" while the system load-balances to meet your traffic needs

Partition Keys In Each Abstraction

• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition:

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

• Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition:

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages: Queue name. All messages for a single queue belong to the same partition:

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Scalability Targets

Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff, as sketched below.
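A minimal retry sketch for the '503 Server Busy' case using the StorageClient library; the operation, retry bound, and base delay are illustrative assumptions, not prescribed values:

    // Assumes: using System; using System.Threading;
    // using Microsoft.WindowsAzure.StorageClient; and an existing CloudBlob 'blob'.
    int attempt = 0;
    while (true)
    {
        try
        {
            blob.UploadText(content);   // any storage operation that may be throttled
            break;                      // success: stop retrying
        }
        catch (StorageServerException)  // surfaces 5xx responses, including 503 Server Busy
        {
            if (++attempt > 5) throw;                            // give up after a few tries
            Thread.Sleep((int)(100 * Math.Pow(2, attempt)));     // 200, 400, 800, ... ms
        }
    }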

Partitions and Partition Ranges

A single movie table, keyed by PartitionKey (Category) and RowKey (Title):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

As the table grows, the system can split it into partition ranges served by different servers (e.g. Action through Animation on one server, Comedy through War on another).

Key Selection: Things to Consider

• Scalability: distribute load as much as possible; hot partitions can be load-balanced; PartitionKey is critical for scalability
• Query efficiency and speed: avoid frequent large scans; parallelize queries; point queries are most efficient
• Entity group transactions: transactions across a single partition; transaction semantics and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token at:
• A maximum of 1,000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds of query execution time

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute load by using a hash or similar as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server busy" means the load on a single partition has exceeded its limits, and partitions are load-balanced to meet traffic needs

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException (see the sketch below)
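A minimal sketch of the IgnoreResourceNotFoundException pattern; it borrows the AttendeeEntity type and table context from the code slides later in this deck, and the key values are illustrative:

    // Assumes: using System.Linq; using Microsoft.WindowsAzure.StorageClient;
    // 'Client' is a CloudTableClient like the one in TableHelper.cs below.
    TableServiceContext context = Client.GetDataServiceContext();
    context.IgnoreResourceNotFoundException = true;   // missing entity -> empty result, not an exception

    AttendeeEntity hit = context.CreateQuery<AttendeeEntity>("Attendees")
        .Where(a => a.PartitionKey == "SummerSchool" && a.RowKey == "someone@example.com")
        .AsTableServiceQuery()
        .FirstOrDefault();   // null when the entity does not exist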

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together: tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work-ticket pattern
• Why not simply use a table?

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 through Msg 4) to a queue; Worker Roles call GetMessage with a visibility timeout, which hides the returned message from other consumers, and RemoveMessage to delete it once processing succeeds.]

PutMessage:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
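The same lifecycle through the StorageClient library, as a sketch; the queue name and message body are illustrative, and development storage stands in for a real account:

    // Assumes: using System; using Microsoft.WindowsAzure;
    // using Microsoft.WindowsAzure.StorageClient;
    CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
    CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("myqueue");
    queue.CreateIfNotExist();

    queue.AddMessage(new CloudQueueMessage("<test>test</test>"));        // PutMessage
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));  // GetMessage, 30 s visibility timeout
    if (msg != null)
    {
        // ... process the message ...
        queue.DeleteMessage(msg);                                        // RemoveMessage (pop receipt handled internally)
    }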

Truncated Exponential Back Off Polling

Consider a truncated exponential back-off polling approach: each empty poll increases the polling interval by 2x (up to a maximum), and a successful poll sets the interval back to 1, as sketched below.
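A sketch of the truncated exponential back-off polling loop; the interval bounds and the ProcessMessage handler are illustrative assumptions:

    // Assumes: using System; using System.Threading;
    // using Microsoft.WindowsAzure.StorageClient; and an existing CloudQueue 'queue'.
    TimeSpan interval = TimeSpan.FromSeconds(1);
    TimeSpan maxInterval = TimeSpan.FromSeconds(64);   // truncation point, chosen for illustration

    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
        if (msg != null)
        {
            ProcessMessage(msg);                       // hypothetical per-message handler
            queue.DeleteMessage(msg);
            interval = TimeSpan.FromSeconds(1);        // success: reset the interval
        }
        else
        {
            Thread.Sleep(interval);                    // empty poll: wait, then double the interval
            interval = TimeSpan.FromTicks(Math.Min(interval.Ticks * 2, maxInterval.Ticks));
        }
    }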

Removing Poison Messages

[Diagram: producers P1 and P2 and consumers C1 and C2 working against a shared queue.]

First scenario, normal consumption:
1. C1: GetMessage(Q, 30 s) -> msg 1
2. C2: GetMessage(Q, 30 s) -> msg 2

Second scenario, a consumer crashes:
1. C1: GetMessage(Q, 30 s) -> msg 1
2. C2: GetMessage(Q, 30 s) -> msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after the dequeue
7. C2: GetMessage(Q, 30 s) -> msg 1

Third scenario, a poison message:
1. C1: Dequeue(Q, 30 s) -> msg 1
2. C2: Dequeue(Q, 30 s) -> msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible again 30 s after the dequeue
7. C2: Dequeue(Q, 30 s) -> msg 1
8. C2 crashed
9. msg 1 visible again 30 s after the dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) -> msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)

Each dequeue increments the message's DequeueCount; a message whose DequeueCount exceeds a threshold is treated as a poison message and deleted instead of processed (see the sketch below).
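A sketch of step 12's check in consumer code; the threshold of 2 comes from the walkthrough above, while the handler and any dead-letter bookkeeping are assumptions:

    // Assumes: using System; using Microsoft.WindowsAzure.StorageClient;
    // and an existing CloudQueue 'queue'.
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        if (msg.DequeueCount > 2)
        {
            // Poison message: it has already failed repeatedly, so remove it
            // (optionally log it or copy it to a dead-letter blob/table first).
            queue.DeleteMessage(msg);
        }
        else
        {
            ProcessMessage(msg);        // hypothetical per-message handler
            queue.DeleteMessage(msg);   // delete only after successful processing
        }
    }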

Queues Recap

• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages larger than 8 KB: use a blob to store the message data, with a reference in the message (sketched below); batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
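A sketch of the oversized-message mitigation: store the payload in a blob and enqueue only a reference to it. The container and queue wiring, and the payload variable, are assumptions:

    // Assumes: using System; using Microsoft.WindowsAzure.StorageClient;
    // an existing CloudBlobContainer 'container' and CloudQueue 'queue'.
    // Producer side:
    string blobName = Guid.NewGuid().ToString();
    container.GetBlobReference(blobName).UploadText(largePayload);   // 'largePayload' exceeds 8 KB
    queue.AddMessage(new CloudQueueMessage(blobName));               // the message is just a pointer

    // Consumer side:
    CloudQueueMessage m = queue.GetMessage();
    if (m != null)
    {
        CloudBlob payload = container.GetBlobReference(m.AsString);
        string data = payload.DownloadText();
        // ... process 'data' ...
        queue.DeleteMessage(m);
        payload.Delete();   // orphaned blobs still need separate garbage collection
    }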

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

http://blogs.msdn.com/windowsazurestorage

http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
        {
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null) InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null) InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null) context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table - or another exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (!allAttendees.ContainsKey(email)) return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (allAttendees.ContainsKey(email)) return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges) Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
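A minimal Task Parallel Library sketch (.NET 4) showing both styles; the work items and the Process handler are illustrative:

    // Assumes: using System.Collections.Generic; using System.Threading.Tasks;
    static void ProcessAll(IEnumerable<string> workItems)
    {
        // Data parallelism: spread independent items across all cores.
        Parallel.ForEach(workItems, item => Process(item));   // Process is a hypothetical handler

        // Task parallelism: run two independent operations concurrently.
        Task first = Task.Factory.StartNew(() => Process("a"));
        Task second = Task.Factory.StartNew(() => Process("b"));
        Task.WaitAll(first, second);
    }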

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled); this is a performance vs. cost trade-off
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade off the risk of failure or poor user experience from not having excess capacity against the cost of idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile. Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places, and sending fewer things means your VM has time to do other tasks. All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size (see the sketch below)
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]
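A small sketch of the compute-for-storage trade: gzip a string before uploading it with a blob helper like the one earlier in this deck. The helper method name is illustrative:

    // Assumes: using System.IO; using System.IO.Compression; using System.Text;
    static byte[] GzipString(string content)
    {
        byte[] raw = Encoding.UTF8.GetBytes(content);
        using (MemoryStream buffer = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);   // compress into the in-memory buffer
            return buffer.ToArray();              // gzip stream is closed, so the data is complete
        }
    }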

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage demand could reach 1 TB (each node pulling its own copy of the database)
• The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern (diagrammed below).

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST: 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task, essentially an estimate of the task run time:
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs (an estimation sketch follows the diagram)
• Watch out for the 2-hour maximum limit

[Diagram: a splitting task fans the input out into many parallel BLAST tasks, which are then combined by a merging task.]
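A hedged sketch of estimating a per-task visibility timeout from partition size; the throughput constant and safety margin are illustrative values that would come from test runs:

    // Assumes: using System;
    static TimeSpan EstimateVisibilityTimeout(long pairBases)
    {
        double seconds = pairBases / 50000.0;              // assumed pair-bases processed per second
        double padded = seconds * 1.5;                     // 50% safety margin for slow instances
        double capped = Math.Min(padded, 7200.0);          // respect the 2-hour queue maximum
        return TimeSpan.FromSeconds(Math.Max(capped, 30.0));
    }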

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resources

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine against a job registry kept in an Azure Table; worker instances pull tasks from a global dispatch queue; Azure Blob storage holds the NCBI/BLAST databases, temporary data, etc., maintained by a database-updating role.]


An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

[Diagram: the job portal's web service handles job registration and feeds the job scheduler and scaling engine through the job registry.]

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• Experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load becomes imbalanced, redistribute it manually


• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (~10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should look like this:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe Data Center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group (~6 nodes per group, ~30 mins): this is an update domain at work
• 35 nodes experienced blob-writing failures at the same time

West Europe Data Center: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants. The Penman-Monteith equation (1964) estimates it as

$$ET = \frac{\Delta R_n + \rho_a c_p \,\delta q\, g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right)\lambda_v}$$

where:
• ET = water volume evapotranspired (m^3 s^-1 m^-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m^-2)
• cp = specific heat capacity of air (J kg^-1 K^-1)
• ρa = dry air density (kg m^-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air, the inverse of ra (m s^-1)
• gs = conductivity of plant stoma to air, the inverse of rs (m s^-1)
• γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
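A direct transcription of the formula above into code, as a sketch; supplying values in the units listed is the caller's responsibility:

    // Penman-Monteith estimate of ET; parameter comments give the expected units.
    static double ComputeET(
        double delta,         // Δ  (Pa/K)
        double rn,            // Rn (W/m^2)
        double rhoA,          // ρa (kg/m^3)
        double cp,            // cp (J/(kg*K))
        double deltaQ,        // δq (Pa)
        double gA,            // ga (m/s)
        double gS,            // gs (m/s)
        double lambdaV,       // λv (J/g)
        double gamma = 66.0)  // γ  (Pa/K)
    {
        double numerator = delta * rn + rhoA * cp * deltaQ * gA;
        double denominator = (delta + gamma * (1.0 + gA / gS)) * lambdaV;
        return numerator / denominator;
    }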

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists' requests enter through the AzureMODIS Service Web Role Portal onto a request queue; the pipeline then runs the data collection stage (download queue, source imagery download sites, source metadata), the reprojection stage (reprojection queue), the derivation reduction stage (reduction 1 queue), and the analysis reduction stage (reduction 2 queue), finally delivering scientific results for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks, i.e. recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request is persisted as <PipelineStage>JobStatus and placed on the <PipelineStage> Job Queue by the MODISAzure Service (Web Role); the Service Monitor (Worker Role) parses it, persists <PipelineStage>TaskStatus entries, and dispatches tasks onto the <PipelineStage> Task Queue.]

All work is actually done by a GenericWorker (Worker Role), which:
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker instances pull from the <PipelineStage> Task Queue and read and write <Input>Data Storage.]

[Diagram: a reprojection request is persisted as ReprojectionJobStatus on the Job Queue; the Service Monitor (Worker Role) parses and persists ReprojectionTaskStatus entries and dispatches them onto the Task Queue, where GenericWorker instances read swath source data storage and write reprojection data storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Stage-by-stage scale and cost (requests flow from scientists through the AzureMODIS Service Web Role Portal):

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

The Secret Sauce ndash The Fabric

The Fabric is the lsquobrainrsquo behind Windows Azure

1 Process service model

1 Determine resource requirements

2 Create role images

2 Allocate resources

3 Prepare nodes

1 Place role images on nodes

2 Configure settings

3 Start roles

4 Configure load balancers

5 Maintain service health

1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:\
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
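A minimal mounting sketch, assuming the Microsoft.WindowsAzure.CloudDrive library, the AccountInformation helper shown later in this deck, and a local-storage resource named "DriveCache" declared in the service definition (all of these names are illustrative):

  // Initialize the local read cache required before mounting
  // (assumption: a LocalStorage resource named "DriveCache" exists).
  LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
  CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

  // A drive is just a Page Blob holding an NTFS-formatted VHD.
  CloudStorageAccount account =
      new CloudStorageAccount(AccountInformation.Credentials, false);
  CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd");
  try { drive.Create(1024); }                       // 1 GB volume
  catch (CloudDriveException) { /* already exists */ }

  string path = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
  File.WriteAllText(Path.Combine(path, "results.txt"), "hello");  // plain NTFS I/O
  drive.Unmount();                                  // flush and release the blob lease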

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData; .NET classes and LINQ
  • REST – with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message is processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key, different for each data type (blobs, entities, queues)
  • The partition key controls entity locality and is the unit of scale
• A partition can be served by a single server
  • The system load balances partitions based on traffic patterns
  • Load balancing can take a few minutes to kick in
  • It can take a couple of seconds for a partition to become available on a different server
• The system load balances to meet your traffic needs
  • When single-partition limits have been reached, requests fail with "Server Busy"
  • Use exponential backoff on "Server Busy"
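For example, a truncated exponential backoff around any storage call might look like this sketch (the initial and maximum delays are illustrative choices, and the exception surface assumes the 1.x StorageClient library):

  TimeSpan delay = TimeSpan.FromMilliseconds(100);
  TimeSpan maxDelay = TimeSpan.FromSeconds(30);
  for (int attempt = 0; attempt < 10; attempt++)
  {
      try
      {
          blob.UploadText(payload);   // any blob/table/queue call can be throttled
          break;                      // success - stop retrying
      }
      catch (StorageServerException e)
      {
          // 503 Server Busy: a single partition is over its limits right now
          if (e.StatusCode != HttpStatusCode.ServiceUnavailable) throw;
          Thread.Sleep(delay);
          delay = TimeSpan.FromMilliseconds(
              Math.Min(delay.TotalMilliseconds * 2, maxDelay.TotalMilliseconds));
      }
  }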

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Scalability Targets

Storage account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Because Category is the PartitionKey, the system can split this table into partition ranges (e.g. [Action – Animation] and [Comedy – War]) and serve them from different servers as traffic grows.

Key Selection: Things to Consider

• Scalability
  • Distribute load as much as possible
  • Hot partitions can be load balanced
  • PartitionKey is critical for scalability
• Query efficiency & speed
  • Avoid frequent large scans
  • Parallelize queries
  • Point queries are most efficient
• Entity group transactions
  • Transactions across a single partition
  • Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query response may stop short and hand back a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
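A sketch of the standard WCF Data Services paging loop (the entity and table names reuse the attendee example that appears later in this deck; Process is a hypothetical handler):

  var query = (DataServiceQuery<AttendeeEntity>)
      context.CreateQuery<AttendeeEntity>("Attendees");
  var response = (QueryOperationResponse<AttendeeEntity>)query.Execute();
  DataServiceQueryContinuation<AttendeeEntity> token = null;
  do
  {
      if (token != null)
          response = context.Execute(token);   // follow the continuation
      foreach (AttendeeEntity a in response)
          Process(a);
      token = response.GetContinuation();      // null when no more pages
  } while (token != null);

The CloudTableQuery.AsTableServiceQuery() wrapper used later in this deck handles continuation tokens for you.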

Tables Recap

• Select PartitionKey and RowKey that help scale – efficient for frequently used queries, supports batch transactions, and distributes load
• Avoid "append only" patterns – distribute by using a hash etc. as a prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server Busy" means the load on a single partition has exceeded the limits, and the system load balances partitions to meet traffic needs

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
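For instance, a point query that returns null instead of throwing – a sketch against the attendee table used later in this deck (the variable names are illustrative):

  context.IgnoreResourceNotFoundException = true;
  AttendeeEntity attendee = context.CreateQuery<AttendeeEntity>("Attendees")
      .Where(a => a.PartitionKey == "SummerSchool" && a.RowKey == email)
      .FirstOrDefault();      // null if the entity does not exist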

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together – tight coupling leads to brittleness
  • This decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML and are limited to 8 KB in size
  • Commonly used with the work ticket pattern (see the sketch below)
• Why not simply use a table?
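A minimal sketch of the work ticket pattern, reusing the AccountInformation helper shown later in this deck (the queue name, blob path, and ProcessBlob handler are illustrative):

  CloudQueueClient queueClient =
      new CloudStorageAccount(AccountInformation.Credentials, false)
          .CreateCloudQueueClient();
  CloudQueue queue = queueClient.GetQueueReference("workitems");
  queue.CreateIfNotExist();

  // Producer (e.g. in a web role): the message is a small "ticket"
  // holding a reference to the real data, not the data itself.
  queue.AddMessage(new CloudQueueMessage("school/SummerSchoolAttendees.txt"));

  // Consumer (worker role loop): hide the message for 30 s while processing.
  CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
  if (msg != null)
  {
      ProcessBlob(msg.AsString);   // hypothetical handler
      queue.DeleteMessage(msg);    // delete only after successful processing
  }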

Queue Terminology: Message Lifecycle

(Diagram: a web role calls PutMessage to add Msg 1–4 to the queue; worker roles call GetMessage with a visibility timeout, each receiving a message such as Msg 2, and call RemoveMessage to delete it once processed.)

PutMessage:

  POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

  HTTP/1.1 200 OK
  Transfer-Encoding: chunked
  Content-Type: application/xml
  Date: Tue, 09 Dec 2008 21:04:30 GMT
  Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

  <?xml version="1.0" encoding="utf-8"?>
  <QueueMessagesList>
    <QueueMessage>
      <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
      <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
      <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
      <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
      <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
      <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
    </QueueMessage>
  </QueueMessagesList>

RemoveMessage:

  DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x; a successful poll sets the interval back to 1.
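A sketch of that polling loop (the 64-second cap and the Process handler are illustrative assumptions):

  int intervalSeconds = 1;
  const int maxIntervalSeconds = 64;   // truncation point
  while (true)
  {
      CloudQueueMessage msg = queue.GetMessage();
      if (msg != null)
      {
          Process(msg);                 // hypothetical work handler
          queue.DeleteMessage(msg);
          intervalSeconds = 1;          // success: poll quickly again
      }
      else
      {
          Thread.Sleep(TimeSpan.FromSeconds(intervalSeconds));
          intervalSeconds = Math.Min(intervalSeconds * 2, maxIntervalSeconds);
      }
  }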


Removing Poison Messages

(Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them. The annotations below walk through the failure scenario.)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2: treat it as a poison message
13. DeleteMessage(Q, msg 1)
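A minimal sketch of this policy, assuming the StorageClient CloudQueueMessage.DequeueCount property (the handlers are illustrative):

  CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
  if (msg != null)
  {
      if (msg.DequeueCount > 2)
      {
          // Dequeued repeatedly without being deleted: poison message.
          LogPoisonMessage(msg.AsString);   // hypothetical dead-letter handler
          queue.DeleteMessage(msg);
      }
      else
      {
          Process(msg.AsString);            // hypothetical work handler
          queue.DeleteMessage(msg);         // delete only after success
      }
  }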

Queues Recap

• Make message processing idempotent – then there is no need to handle each failure individually
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• For messages > 8 KB, use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage/
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

  public class AccountInformation
  {
      private static string storageKey = "tHiSiSnOtMyKeY";
      private static string accountName = "jjstore";
      private static StorageCredentialsAccountAndKey credentials;

      internal static StorageCredentialsAccountAndKey Credentials
      {
          get
          {
              if (credentials == null)
                  credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
              return credentials;
          }
      }
  }

Code – BlobHelper.cs

  public class BlobHelper
  {
      private static string defaultContainerName = "school";
      private CloudBlobClient client = null;
      private CloudBlobContainer container = null;

      private void InitContainer()
      {
          if (client == null)
          {
              client = new CloudStorageAccount(AccountInformation.Credentials, false)
                  .CreateCloudBlobClient();
          }
          container = client.GetContainerReference(defaultContainerName);
          container.CreateIfNotExist();
          BlobContainerPermissions permissions = container.GetPermissions();
          permissions.PublicAccess = BlobContainerPublicAccessType.Container;
          container.SetPermissions(permissions);
      }
  }

Code – BlobHelper.cs

  public void WriteFileToBlob(string filePath)
  {
      if (client == null || container == null)
          InitContainer();
      FileInfo file = new FileInfo(filePath);
      CloudBlob blob = container.GetBlobReference(file.Name);
      blob.Properties.ContentType = GetContentType(file.Extension);  // helper mapping extension to MIME type (not shown)
      blob.UploadFile(file.FullName);

      // Or, if you want to write a string, replace the last line with:
      //     blob.UploadText(someString);
      // and make sure you set the content type to the appropriate
      // MIME type (e.g. "text/plain").
  }

Code – BlobHelper.cs

  public string GetBlobText(string blobName)
  {
      if (client == null || container == null)
          InitContainer();
      CloudBlob blob = container.GetBlobReference(blobName);
      try
      {
          return blob.DownloadText();
      }
      catch (Exception)
      {
          // The blob probably does not exist, or there is no connection available
          return null;
      }
  }

Application Code – Blobs

  private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
  {
      StringBuilder buff = new StringBuilder();
      buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage," +
                      "FavoriteIceCream,YearsInPhD,Graduated");
      foreach (AttendeeEntity attendee in attendees)
          buff.AppendLine(attendee.ToCsvString());
      blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
  }

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> – or, in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

  using Microsoft.WindowsAzure.StorageClient;

  public class AttendeeEntity : TableServiceEntity
  {
      public string FirstName { get; set; }
      public string LastName { get; set; }
      public string Email { get; set; }
      public DateTime Birthday { get; set; }
      public string FavoriteIceCream { get; set; }
      public int YearsInPhD { get; set; }
      public bool Graduated { get; set; }
      …
  }

Code – TableEntities

  public void UpdateFrom(AttendeeEntity other)
  {
      FirstName = other.FirstName;
      LastName = other.LastName;
      Email = other.Email;
      Birthday = other.Birthday;
      FavoriteIceCream = other.FavoriteIceCream;
      YearsInPhD = other.YearsInPhD;
      Graduated = other.Graduated;
      UpdateKeys();
  }

  public void UpdateKeys()
  {
      PartitionKey = "SummerSchool";
      RowKey = Email;
  }

Code – TableHelper.cs

  public class TableHelper
  {
      private CloudTableClient client = null;
      private TableServiceContext context = null;
      private Dictionary<string, AttendeeEntity> allAttendees = null;
      private string tableName = "Attendees";

      private CloudTableClient Client
      {
          get
          {
              if (client == null)
                  client = new CloudStorageAccount(AccountInformation.Credentials, false)
                      .CreateCloudTableClient();
              return client;
          }
      }

      private TableServiceContext Context
      {
          get
          {
              if (context == null)
                  context = Client.GetDataServiceContext();
              return context;
          }
      }
  }

Code – TableHelper.cs

  private void ReadAllAttendees()
  {
      allAttendees = new Dictionary<string, AttendeeEntity>();
      CloudTableQuery<AttendeeEntity> query =
          Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
      try
      {
          foreach (AttendeeEntity attendee in query)
              allAttendees[attendee.Email] = attendee;
      }
      catch (Exception)
      {
          // No entries in table - or other exception
      }
  }

Code – TableHelper.cs

  public void DeleteAttendee(string email)
  {
      if (allAttendees == null)
          ReadAllAttendees();
      if (!allAttendees.ContainsKey(email))
          return;
      AttendeeEntity attendee = allAttendees[email];

      // Delete from the cloud table
      Context.DeleteObject(attendee);
      Context.SaveChanges();

      // Delete from the memory cache
      allAttendees.Remove(email);
  }

Code – TableHelper.cs

  public AttendeeEntity GetAttendee(string email)
  {
      if (allAttendees == null)
          ReadAllAttendees();
      if (allAttendees.ContainsKey(email))
          return allAttendees[email];
      return null;
  }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo-code – TableHelper.cs

  public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
  {
      foreach (AttendeeEntity attendee in updatedAttendees)
          UpdateAttendee(attendee, false);
      Context.SaveChanges(SaveChangesOptions.Batch);
  }

  public void UpdateAttendee(AttendeeEntity attendee)
  {
      UpdateAttendee(attendee, true);
  }

  private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
  {
      if (allAttendees.ContainsKey(attendee.Email))
      {
          AttendeeEntity existingAttendee = allAttendees[attendee.Email];
          existingAttendee.UpdateFrom(attendee);
          Context.UpdateObject(existingAttendee);
      }
      else
      {
          Context.AddObject(tableName, attendee);
      }
      if (saveChanges)
          Context.SaveChanges();
  }

Application Code – Cloud Tables

  private void SaveButton_Click(object sender, RoutedEventArgs e)
  {
      // Write to table
      tableHelper.UpdateAttendees(attendees);
  }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
  • Common mistake – splitting code into multiple roles, each not using up its CPU
  • Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (a sketch follows)
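A small .NET 4 sketch of both flavors (the method and collection names are illustrative):

  // Data parallelism: partition a collection of work items across cores
  Parallel.ForEach(inputFiles, file => ProcessFile(file));

  // Task parallelism: run independent operations concurrently
  Task download = Task.Factory.StartNew(() => DownloadInput());
  Task cleanup  = Task.Factory.StartNew(() => CleanupOldResults());
  Task.WaitAll(download, cleanup);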

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
  • Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between performance and cost: the risk of failure or poor user experience from not having excess capacity vs. the cost of idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places
  • Sending fewer things over the wire often means getting fewer things from storage
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content passes through Gzip plus JavaScript, CSS, and image minification to become compressed content.)
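As a sketch of the compute-for-storage trade, compressing content with System.IO.Compression before uploading it via the blob container shown earlier (the path and blob name are placeholders):

  // Gzip a local file into memory, then store it with a ".gz" suffix.
  using (FileStream input = File.OpenRead(path))
  using (MemoryStream compressed = new MemoryStream())
  {
      using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Compress, true))
          input.CopyTo(gzip);                    // .NET 4 Stream.CopyTo

      compressed.Position = 0;                   // rewind before uploading
      CloudBlob blob = container.GetBlobReference(blobName + ".gz");
      blob.Properties.ContentEncoding = "gzip";  // lets browsers decompress on the fly
      blob.UploadFromStream(compressed);
  }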

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs in the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) – needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB (100 nodes × 10 GB)
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A Simple Split/Join Pattern

(Diagram: a splitting task fans the input out to parallel BLAST tasks, whose outputs feed a merging task.)

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• Set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI-BLAST start-up overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
  • Too small → repeated computation
  • Too large → unnecessarily long waiting in case of an instance failure
• Best practice: estimate the value from the number of pair-bases in the partition and test runs (a sketch follows)
• Watch out for the 2-hour maximum limitation
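A sketch of that estimate in C# (the partition size and throughput constants are illustrative assumptions, standing in for values derived from test runs):

  double pairBasesInPartition = 2.5e9;          // assumption: from the splitting stage
  double measuredPairBasesPerSecond = 4.0e5;    // assumption: from profiling test runs

  // Estimate the task run time, then cap it at the 2-hour visibility-timeout maximum.
  double estimatedSeconds = pairBasesInPartition / measuredPairBasesPerSecond;
  TimeSpan visibility = TimeSpan.FromSeconds(
      Math.Min(estimatedSeconds, TimeSpan.FromHours(2).TotalSeconds));

  CloudQueueMessage blastTask = dispatchQueue.GetMessage(visibility);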

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine against a Job Registry kept in Azure Tables; a global dispatch queue feeds the pool of Worker instances; Azure Blob storage holds the BLAST/NCBI databases and temporary data; a Database Updating Role keeps the NCBI databases fresh. Each job executes as a splitting task, parallel BLAST tasks, and a merging task.)

Web Portal and Web Service

• ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory state

R. palustris as a platform for H2 production (Eric Schadt, Sage; Sam Phattarasukol, Harwood Lab, UW)

• Blasted ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec
• AzureBLAST significantly saved computing time…

Discovering Homologs

• Discover the interrelationships of known protein sequences
• "All against all" query
  • The database is also the input query
  • The protein database is large (4.2 GB)
  • 9,865,668 sequences to be queried in total
  • Theoretically, 100 billion sequence comparisons
• Performance estimation
  • Based on sample runs on one extra-large Azure instance
  • Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• Experiments at this scale are usually infeasible for most scientists
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually


• The total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

  3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
  3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
  3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
  3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
  3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
  3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

  3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
  3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
  3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

(Note that task 251774 never reports completion.)

North Europe data center: 34,256 tasks processed in total

• All 62 compute nodes lost tasks and then came back in a group – this is an update domain
  • ~30 mins, ~6 nodes in one group
• 35 nodes experienced blob writing failure at the same time

West Europe data center: 30,976 tasks were completed, and then the job was killed

• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
  ET = water volume evapotranspired (m3 s-1 m-2)
  Δ  = rate of change of saturation specific humidity with air temperature (Pa K-1)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m-2)
  cp = specific heat capacity of air (J kg-1 K-1)
  ρa = dry air density (kg m-3)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s-1)
  gs = conductivity of plant stoma air (inverse of rs) (m s-1)
  γ  = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit work through the AzureMODIS Service Web Role Portal into a Request Queue; the Data Collection Stage pulls tiles from the source imagery download sites, guided by source metadata; the Reprojection Queue feeds the Reprojection Stage, the Reduction 1 Queue feeds the Derivation Reduction Stage, and the Reduction 2 Queue feeds the Analysis Reduction Stage; science results return to scientists via the Download Queue.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

All work is actually done by a GenericWorker (Worker Role)
• Sandboxes the science (or other) executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker instances pull from the <PipelineStage> Task Queue populated by the Service Monitor and read/write the <Input> data storage in blobs.)

(Diagram: a reprojection request is persisted as a ReprojectionJobStatus entity and queued in the Job Queue; the Service Monitor (Worker Role) parses it, persists ReprojectionTaskStatus entities, and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances read swath source data storage and write reprojection data storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Costs by pipeline stage:

  Stage                | Data       | Files     | Compute               | Workers | Cost
  Data collection      | 400-500 GB | 60K files | 11 hours @ 10 MB/sec  | <10     | $50 upload, $450 storage
  Reprojection         | 400 GB     | 45K files | 3500 hours            | 20-100  | $420 cpu, $60 download
  Derivation reduction | 5-7 GB     | 55K files | 1800 hours            | 20-100  | $216 cpu, $1 download, $6 storage
  Analysis reduction   | <10 GB     | ~1K files | 1800 hours            | 20-100  | $216 cpu, $2 download, $9 storage

  Total: $1420

Summary

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


Storage

Durable Storage At Massive Scale

Blob

- Massive files eg videos logs

Drive

- Use standard file system APIs

Tables - Non-relational but with few scale limits

- Use SQL Azure for relational data

Queues

- Facilitate loosely-coupled reliable systems

Blob Features and Functions

bull Store Large Objects (up to 1TB in size)

bull Can be served through Windows Azure CDN service

bull Standard REST Interface

bull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlob

bull Get whole blob or a specific range

bull DeleteBlob

bull CopyBlob

bull SnapshotBlob

bull LeaseBlob

Two Types of Blobs Under the Hood

bull Block Blob

bull Targeted at streaming workloads

bull Each blob consists of a sequence of blocks bull Each block is identified by a Block ID

bull Size limit 200GB per blob

bull Page Blob

bull Targeted at random readwrite workloads

bull Each blob consists of an array of pages bull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe data center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group gone for ~30 mins: this is an update domain at work.

35 nodes experienced blob-writing failures at the same time.

West Europe data center: 30,976 tasks were completed before the job was killed. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J g^-1)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stomata (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) from plants.
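As a minimal illustration of the per-pixel data reduction, here is a direct C# transcription of the Penman-Monteith formula above; the helper is purely illustrative and not part of the MODISAzure code:

static double ComputeET(
    double delta,    // Δ: d(saturation specific humidity)/dT (Pa/K)
    double rn,       // Rn: net radiation (W/m^2)
    double rhoA,     // ρa: dry air density (kg/m^3)
    double cp,       // cp: specific heat capacity of air (J/(kg K))
    double dq,       // δq: vapor pressure deficit (Pa)
    double ga,       // ga: conductivity of air (m/s)
    double gs,       // gs: conductivity of plant stomata (m/s)
    double lambdaV,  // λv: latent heat of vaporization (J/g)
    double gamma)    // γ: psychrometric constant, ≈ 66 Pa/K
{
    return (delta * rn + rhoA * cp * dq * ga)
         / ((delta + gamma * (1.0 + ga / gs)) * lambdaV);
}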

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Data Collection Stage, which pulls from the source imagery download sites via a Download Queue and records source metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; scientific results are then available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues the job to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses the job, persists <PipelineStage>TaskStatus, and dispatches tasks to the <PipelineStage> Task Queue.]

All work is actually done by a Worker Role
• Sandboxes the science or other executable
• Marshals all storage between Azure blob storage and local Azure Worker instance files

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read from <Input>Data Storage.]

The GenericWorker:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
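A minimal sketch of that dequeue-and-retry loop, using the v1.x StorageClient queue API; RunTask and the queue name are hypothetical:

static void WorkerLoop(CloudQueueClient queueClient)
{
    CloudQueue taskQueue = queueClient.GetQueueReference("reprojectiontasks"); // assumed name
    while (true)
    {
        CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(10));
        if (msg == null) { Thread.Sleep(1000); continue; }
        if (msg.DequeueCount > 3)
        {
            // Already retried 3 times: record the failure and drop the task.
            taskQueue.DeleteMessage(msg);
            continue;
        }
        RunTask(msg.AsString);        // hypothetical task runner
        taskQueue.DeleteMessage(msg); // delete only after the work succeeds
    }
}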

[Diagram: a Reprojection Request arrives at the Service Monitor (Worker Role) via the Job Queue; the monitor persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue consumed by GenericWorker (Worker Role) instances. Supporting tables and stores:
• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile
• Swath Source Data Storage and Reprojection Data Storage hold the input and output tiles]

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

[Cost summary, reconstructed from the pipeline diagram:

Stage                      | Data                  | Compute                          | Cost
Data Collection Stage      | 400-500 GB, 60K files | 10 MB/sec, 11 hours, <10 workers | $50 upload, $450 storage
Reprojection Stage         | 400 GB, 45K files     | 3500 hours, 20-100 workers       | $420 cpu, $60 download
Derivation Reduction Stage | 5-7 GB, 55K files     | 1800 hours, 20-100 workers       | $216 cpu, $1 download, $6 storage
Analysis Reduction Stage   | <10 GB, ~1K files     | 1800 hours, 20-100 workers       | $216 cpu, $2 download, $9 storage

Total: $1420]

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Durable Storage, At Massive Scale

Blob: massive files, e.g. videos and logs
Drive: use standard file system APIs
Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
Queues: facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• Can be served through the Windows Azure CDN service
• Standard REST interface:
  • PutBlob: inserts a new blob, overwriting any existing blob
  • GetBlob: gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access the durable drive; data is durable and survives application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
  • All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
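A minimal sketch of creating and mounting a drive with the CloudDrive library of that era; the API names are from memory and the blob path and sizes are assumptions, so treat this as illustrative only (it runs only inside a role or the dev fabric):

CloudStorageAccount account = CloudStorageAccount.Parse(connectionString); // assumed setup
CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd");          // page blob path (assumed)
drive.Create(1024);                                           // 1 GB VHD backed by a page blob
string drivePath = drive.Mount(25, DriveMountOptions.None);   // 25 MB local read cache
File.WriteAllText(Path.Combine(drivePath, "hello.txt"), "stored on a page blob");
drive.Unmount();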

Windows Azure Tables
• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, or REST with any platform or language

Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• The programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance.
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• Use exponential backoff on "Server Busy": the system load-balances to meet your traffic needs, and "Server Busy" means single-partition limits have been reached

Partition Keys In Each Abstraction

Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition.

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

Messages: Queue name. All messages for a single queue belong to the same partition.

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Scalability Targets
Storage account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff, as sketched below.
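A minimal backoff sketch for those 503s; the StorageClient library also ships pluggable retry policies, so treat this as illustrative:

static void WithBackoff(Action storageOp, int maxAttempts)
{
    TimeSpan delay = TimeSpan.FromMilliseconds(100);
    for (int attempt = 1; ; attempt++)
    {
        try { storageOp(); return; }
        catch (StorageServerException) // covers 503 Server Busy
        {
            if (attempt >= maxAttempts) throw;
            Thread.Sleep(delay);
            // Double the delay, truncated at 30 seconds.
            delay = TimeSpan.FromMilliseconds(Math.Min(delay.TotalMilliseconds * 2, 30000));
        }
    }
}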

Partitions and Partition Ranges

A movies table keyed by PartitionKey (Category) and RowKey (Title):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

Under load, the system can split the table into partition ranges served by different servers: for example, one range covering Action through Animation and another covering Comedy through War.

Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• The PartitionKey is critical for scalability
Query efficiency and speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens - Seriously
A query can return a continuation token:
• at a maximum of 1000 rows in a response
• at the end of a partition range boundary
• at a maximum of 5 seconds of query execution
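A sketch of draining a query across continuation tokens with WCF Data Services (the CloudTableQuery wrapper shown later can also follow tokens for you); Process is a placeholder:

var query = (DataServiceQuery<AttendeeEntity>)context.CreateQuery<AttendeeEntity>("Attendees");
var response = (QueryOperationResponse<AttendeeEntity>)query.Execute();
while (true)
{
    foreach (AttendeeEntity a in response)
        Process(a);                                  // placeholder per-row work
    DataServiceQueryContinuation<AttendeeEntity> token = response.GetContinuation();
    if (token == null)
        break;                                       // no more pages
    response = context.Execute(token);               // fetch the next page
}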

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix (sketched below)
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the load on a single partition has exceeded the limits, and partitions are being load-balanced to meet traffic needs
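One reading of the hash-prefix advice, as a sketch; the bucket count is an assumption, and a stable hash (e.g. MD5 of the key) is safer than GetHashCode in production:

// Prefix the PartitionKey with a hash bucket to spread an
// append-only key (e.g. a timestamp) across partitions.
static string MakePartitionKey(string naturalKey, int buckets)
{
    int bucket = Math.Abs(naturalKey.GetHashCode()) % buckets;
    return bucket.ToString("D2") + "-" + naturalKey; // e.g. "07-2011-07-19T10:00"
}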

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology: Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 ... Msg 4) to the queue; Worker Roles call GetMessage (with a visibility timeout), process the message, and then call RemoveMessage to delete it.]
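That lifecycle in code, as a minimal sketch against the v1.x StorageClient queue API (reusing the AccountInformation class from the code section below):

CloudQueueClient queueClient =
    new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
CloudQueue queue = queueClient.GetQueueReference("jobs");
queue.CreateIfNotExist();

queue.AddMessage(new CloudQueueMessage("work ticket: blob to process")); // PutMessage

CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));      // GetMessage (timeout)
if (msg != null)
{
    // ... process msg.AsString; the message stays invisible for 30 s ...
    queue.DeleteMessage(msg);                                            // RemoveMessage
}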

PutMessage:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll doubles the polling interval, and a successful poll sets the interval back to 1.
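A sketch of that polling loop; the interval cap and ProcessMessage are assumptions:

TimeSpan interval = TimeSpan.FromSeconds(1);
TimeSpan maxInterval = TimeSpan.FromSeconds(64);   // truncation point (assumed)
while (true)
{
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg == null)
    {
        Thread.Sleep(interval);
        // Empty poll: double the interval, up to the cap.
        interval = TimeSpan.FromSeconds(Math.Min(interval.TotalSeconds * 2, maxInterval.TotalSeconds));
        continue;
    }
    interval = TimeSpan.FromSeconds(1);            // successful poll: reset
    ProcessMessage(msg);                           // hypothetical handler
    queue.DeleteMessage(msg);
}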

Removing Poison Messages

[Diagram sequence, with producers P1 and P2 and consumers C1 and C2:]
1. C1: GetMessage(Q, 30 s) -> msg 1
2. C2: GetMessage(Q, 30 s) -> msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) -> msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) -> msg 1
12. msg 1's DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1): the poison message is removed
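A sketch of the guard in steps 12-13, parking the poison message for later inspection; the dead-letter queue and ProcessMessage are assumptions:

CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > 2)
    {
        // Poison message: park a copy for inspection, then remove it.
        deadLetterQueue.AddMessage(new CloudQueueMessage(msg.AsString));
        queue.DeleteMessage(msg);
    }
    else
    {
        ProcessMessage(msg);      // hypothetical handler
        queue.DeleteMessage(msg);
    }
}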

Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages come back out of order
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• For messages > 8 KB: use a blob to store the message data, with a reference in the message (sketched below); batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
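The blob-reference pattern from the recap, sketched; the container and naming scheme are assumptions:

// Producer: payloads over 8 KB go to a blob; the message carries only the blob name.
string blobName = Guid.NewGuid().ToString();
container.GetBlobReference(blobName).UploadText(largePayload);
queue.AddMessage(new CloudQueueMessage(blobName));

// Consumer: resolve the reference, process, then delete both.
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
string payload = container.GetBlobReference(msg.AsString).DownloadText();
// ... process payload ...
queue.DeleteMessage(msg);
container.GetBlobReference(msg.AsString).Delete();
// Blobs orphaned by crashes between the two deletes need periodic garbage collection.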

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code - AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code - BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code - BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").
}

Code - BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code - Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code - TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

Code - TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code - TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code - TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

Code - TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))   // nothing to delete
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code - TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code - TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code - Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools - Fiddler2

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• The fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• A common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up the CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
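A small Task Parallel Library sketch covering both flavors; the workload names are placeholders:

using System.Threading.Tasks;

// Data parallelism: partition a loop across cores.
Parallel.For(0, partitions.Length, i =>
{
    ProcessPartition(partitions[i]);   // placeholder per-partition work
});

// Task parallelism: run independent stages concurrently.
Task download = Task.Factory.StartNew(() => DownloadInputs());
Task cleanup  = Task.Factory.StartNew(() => GarbageCollectBlobs());
Task.WaitAll(download, cleanup);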

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs: performance vs. cost

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

[Diagram: Uncompressed Content -> Gzip -> Compressed Content; minify JavaScript, minify CSS, minify images]
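A sketch of point 1 using System.IO.Compression, compressing a string payload before it goes to storage or over the wire:

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] GzipString(string content)
{
    byte[] raw = Encoding.UTF8.GetBytes(content);
    using (MemoryStream output = new MemoryStream())
    {
        // GZipStream must be closed before the buffer is complete.
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            gzip.Write(raw, 0, raw.Length);
        return output.ToArray();
    }
}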

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud
…on another set of slides

Map Reduce on Azure
• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs in the web
  • A Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple Split/Join pattern

Leverage the multi-core capability of one instance
• The "-a" argument of NCBI-BLAST: 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time: too small causes repeated computation; too large causes an unnecessarily long wait in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition plus test runs (a sketch follows below)
• Watch out for the 2-hour maximum limitation
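A sketch of that estimate; the per-mega-pair-bases rate and safety factor are assumed calibration values, not numbers from the AzureBLAST work:

static TimeSpan EstimateVisibilityTimeout(long pairBasesInPartition)
{
    const double minutesPerMegaPairBases = 2.0; // assumed calibration constant from test runs
    const double safetyFactor = 1.5;            // head room for slow instances
    double minutes = (pairBasesInPartition / 1e6) * minutesPerMegaPairBases * safetyFactor;
    // Queue messages cannot stay invisible longer than the 2-hour maximum.
    return TimeSpan.FromMinutes(Math.Min(minutes, 120));
}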

[Diagram: Splitting task -> BLAST tasks in parallel -> Merging task]


BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob at http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
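As a rough sketch of how this looked in the 2011-era SDK (the CloudDrive type lived in the Microsoft.WindowsAzure.CloudDrive assembly; the local-storage resource name, blob path, and sizes below are assumptions, not from the deck):

// Sketch: mount a Page Blob as an NTFS drive from inside a role instance.
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public static class DriveMounter
{
    public static string MountDataDrive()
    {
        // "DriveCache" is an assumed local storage resource declared in the service definition.
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        CloudStorageAccount account =
            new CloudStorageAccount(AccountInformation.Credentials, false);
        // "drives/mydata.vhd" is a container/blob path; a full page-blob URI also works.
        CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd");

        try { drive.Create(1024); }                 // 1 GB backing Page Blob; throws if it exists
        catch (CloudDriveException) { /* already created */ }

        // Returns a path such as "X:\"; ordinary NTFS APIs work against it from here on.
        return drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
    }
}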

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• Controls entity locality

The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

The system load balances
• The system load balances to meet your traffic needs
• "Server Busy" means the limits of a single partition have been reached
• Use exponential backoff on "Server Busy"

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.

Partitions and Partition Ranges

Example – a movie table keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

The system can split this table into partition ranges (for example, Action–Animation and Comedy–War) and serve them from different servers.

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity Group Transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token rather than the full result set (a handling loop is sketched below):
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• After a maximum of 5 seconds of query execution
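A minimal handling loop, assuming the AttendeeEntity, Context, tableName, and allAttendees members from the code walkthrough later in this deck (CloudTableQuery<T> via AsTableServiceQuery() follows continuations transparently; the explicit version looks roughly like this):

// Sketch: drain a table query, following continuation tokens explicitly.
using System.Data.Services.Client;

DataServiceQuery<AttendeeEntity> query =
    (DataServiceQuery<AttendeeEntity>)Context.CreateQuery<AttendeeEntity>(tableName);
DataServiceQueryContinuation<AttendeeEntity> token = null;
do
{
    QueryOperationResponse<AttendeeEntity> response = (token == null)
        ? (QueryOperationResponse<AttendeeEntity>)query.Execute()
        : Context.Execute(token);
    foreach (AttendeeEntity attendee in response)
        allAttendees[attendee.Email] = attendee;
    token = response.GetContinuation();   // null once the final page is reached
} while (token != null);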

Tables Recap
• Select a PartitionKey and RowKey that help scale: efficient for frequently used queries, supports batch transactions, distributes load
• Avoid "append only" patterns: distribute load by using a hash or similar as a key prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server Busy" means either that partitions are being load balanced to meet traffic needs, or that the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException (see the sketch below)
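A small sketch of both points together, reusing the Client and tableName members from the walkthrough later in the deck (the email variable is a stand-in):

// Sketch: fresh context per logical operation; tolerant point query.
// Requires: using System.Linq;
TableServiceContext ctx = Client.GetDataServiceContext();
ctx.IgnoreResourceNotFoundException = true;   // a missing entity yields an empty result, not an exception

AttendeeEntity hit = (from a in ctx.CreateQuery<AttendeeEntity>(tableName)
                      where a.PartitionKey == "SummerSchool" && a.RowKey == email
                      select a).FirstOrDefault();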

Queues Their Unique Role in Building Reliable Scalable Applications

• You want roles that work closely together but are not bound together – tight coupling leads to brittleness
• Decoupling through a queue can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern (sketched below)
• Why not simply use a table?
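A minimal work-ticket sketch, reusing the BlobHelper from later in the deck (the blob name, queue name, and payload are illustrative): the queue carries only a pointer, and the payload lives in blob storage.

// Producer: store the payload in a blob, enqueue a reference ("work ticket").
blobHelper.WriteStringToBlob("job-42-input.txt", payload);
CloudQueue jobs = new CloudStorageAccount(AccountInformation.Credentials, false)
    .CreateCloudQueueClient().GetQueueReference("jobs");
jobs.CreateIfNotExist();
jobs.AddMessage(new CloudQueueMessage("job-42-input.txt"));

// Consumer (worker role loop): fetch the ticket, load the payload, delete on success.
CloudQueueMessage ticket = jobs.GetMessage();
if (ticket != null)
{
    string input = blobHelper.GetBlobText(ticket.AsString);
    // ... process input ...
    jobs.DeleteMessage(ticket);
}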

Queue Terminology

Message Lifecycle

[Diagram: a web role calls PutMessage to add messages (Msg 1–4) to the queue; worker roles call GetMessage with a timeout to dequeue a message and RemoveMessage once processing completes.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll resets the interval back to 1 (a sketch follows).
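A minimal sketch of the loop (the queue variable, the ProcessMessage handler, and the 64-second cap are assumptions):

// Sketch: truncated exponential back-off polling of a CloudQueue.
TimeSpan interval = TimeSpan.FromSeconds(1);
TimeSpan maxInterval = TimeSpan.FromSeconds(64);    // truncation point, chosen arbitrarily here
while (true)
{
    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        ProcessMessage(msg);                        // hypothetical idempotent handler
        queue.DeleteMessage(msg);
        interval = TimeSpan.FromSeconds(1);         // success resets the interval
    }
    else
    {
        System.Threading.Thread.Sleep(interval);    // empty poll: wait, then double
        double next = Math.Min(interval.TotalSeconds * 2, maxInterval.TotalSeconds);
        interval = TimeSpan.FromSeconds(next);
    }
}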

44

Removing Poison Messages

[Diagram residue from three slides: producers P1 and P2 enqueue; consumers C1 and C2 dequeue; per-message dequeue counts are shown alongside.]

The failure scenario plays out as follows:

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – the poison message is removed rather than retried forever
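In code, the threshold check from step 12 might look like this sketch (the dead-letter blob naming and the Process handler are assumptions):

// Sketch: park a message in blob storage once DequeueCount passes a threshold.
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > 2)
    {
        blobHelper.WriteStringToBlob("poison-" + msg.Id, msg.AsString); // keep it for inspection
        queue.DeleteMessage(msg);                                       // stop the retry loop
    }
    else
    {
        Process(msg);                                                   // hypothetical handler
        queue.DeleteMessage(msg);
    }
}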

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale (sketched below)
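A scaling trigger based on queue depth could be as simple as this sketch (the thresholds and the ScaleOut/ScaleIn helpers are assumptions; actually adding or removing instances goes through the Service Management API):

// Sketch: use approximate queue length to drive a scale decision.
int depth = queue.RetrieveApproximateMessageCount();
if (depth > 500)
    ScaleOut();        // hypothetical: request more worker instances
else if (depth == 0)
    ScaleIn();         // hypothetical: trim idle workers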

Windows Azure Storage Takeaways

Blobs, Drives, Tables, and Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

49

A Quick Exercise

…Then let's look at some code and some tools.

50

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

51

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null) InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").
}

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName)
{
    if (client == null || container == null) InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

54
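The application code on the next slide calls blobHelper.WriteStringToBlob, which is not shown on these slides; a minimal version consistent with the UploadText comment above might be:

// Sketch: string counterpart to WriteFileToBlob (assumed helper, not from the deck).
public void WriteStringToBlob(string blobName, string content)
{
    if (client == null || container == null) InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    blob.Properties.ContentType = "text/plain";   // adjust the MIME type to the content
    blob.UploadText(content);
}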

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> – in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

55

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

56

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

57

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null) context = Client.GetDataServiceContext();
            return context;
        }
    }
}

58

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table - or some other exception.
    }
}

59

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (!allAttendees.ContainsKey(email)) return;   // nothing to delete

    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the in-memory cache.
    allAttendees.Remove(email);
}

60

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (allAttendees.ContainsKey(email)) return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

61

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges) Context.SaveChanges();
}

62

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

63

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts, measure, and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library (a sketch follows)
  • Data parallelism
  • Task parallelism
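A rough illustration of both styles (the workload here is made up):

// Sketch: data vs. task parallelism with the .NET 4 Task Parallel Library.
using System;
using System.Threading.Tasks;

class ConcurrencySketch
{
    static void Main()
    {
        double[] samples = new double[1000000];

        // Data parallelism: the runtime partitions the index range across cores.
        Parallel.For(0, samples.Length, i => samples[i] = Math.Sqrt(i));

        // Task parallelism: independent units of work scheduled onto the thread pool.
        Task render = Task.Factory.StartNew(() => Console.WriteLine("render"));
        Task upload = Task.Factory.StartNew(() => Console.WriteLine("upload"));
        Task.WaitAll(render, upload);
    }
}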

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference depending on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile, and saving them often leads to savings in other places:
• Sending fewer things over the wire often means getting fewer things from storage
• Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (see the sketch below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip to become compressed content; also minify JavaScript, CSS, and images.]
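For an ASP.NET web role, one common way to gzip responses is a response filter; this is a sketch, not from the deck (register it as an IHttpModule in web.config):

// Sketch: gzip ASP.NET output for clients that advertise support.
using System.IO.Compression;
using System.Web;

public class GzipModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            HttpApplication ctx = (HttpApplication)sender;
            string accept = ctx.Request.Headers["Accept-Encoding"];
            if (accept != null && accept.Contains("gzip"))
            {
                ctx.Response.Filter = new GZipStream(ctx.Response.Filter, CompressionMode.Compress);
                ctx.Response.AppendHeader("Content-Encoding", "gzip");
            }
        };
    }

    public void Dispose() { }
}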

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:

• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST: 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)

Best practice: profile with test runs and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: an unnecessarily long wait if the instance fails

Best practice:
• Estimate the value based on the number of pair-bases in the partition and on test runs (see the sketch below)
• Watch out for the 2-hour maximum limitation
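A sketch of the dequeue call with an estimated timeout (EstimateMinutes, partitionPairBases, and the queue name are placeholders, not from the paper):

// Sketch: size the visibility timeout from an estimated task run time, capped at 2 hours.
CloudQueue tasks = new CloudStorageAccount(AccountInformation.Credentials, false)
    .CreateCloudQueueClient().GetQueueReference("blasttasks");
double minutes = Math.Min(EstimateMinutes(partitionPairBases), 120);   // hypothetical estimator
CloudQueueMessage task = tasks.GetMessage(TimeSpan.FromMinutes(minutes));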

[Diagram: a splitting task fans out into many parallel BLAST tasks, which feed a merging task.]

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity

Task size/instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, scaling engine, and database-updating role; worker instances pull from a global dispatch queue; an Azure Table holds the job registry; Azure Blobs hold the NCBI/BLAST databases, temporary data, etc.]


An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

Accepted jobs are stored in the job registry table
• Fault tolerance: avoid in-memory state


Case study: R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

BLASTed ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• The full job would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs that we know of
• Experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four data centers: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
  • When load imbalances, redistribute the load manually


• The total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should look like the following; otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe data center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, roughly 30 minutes apart – this is an update domain at work
• 35 nodes experienced blob-writing failures at the same time

West Europe data center: 30,976 tasks were completed before the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J g-1)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

$$ET = \frac{\Delta R_n + \rho_a \, c_p \, \delta q \, g_a}{(\Delta + \gamma (1 + g_a / g_s)) \, \lambda_v}$$

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes.

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: the AzureMODIS service web role portal accepts requests onto a request queue; the data collection stage pulls source imagery from download sites via a download queue and records source metadata; reprojection, derivation (Reduction 1), and analysis (Reduction 2) queues feed the corresponding stages; scientists download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues work onto the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.]

All work is actually done by a Worker Role (the GenericWorker):
• Sandboxes the science or other executable
• Marshals all storage to/from Azure blob storage and local Azure worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus; GenericWorker roles dequeue from the <PipelineStage> task queue and read/write <Input>Data storage.]

[Diagram: a reprojection request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus and ReprojectionTaskStatus and dispatches via the job and task queues to GenericWorker roles, which read swath source data storage and write reprojection data storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and by the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Approximate costs by stage (AzureMODIS service web role portal; total $1,420):

  Stage                | Data                   | Compute                           | Cost
  Data collection      | 400-500 GB, 60K files  | 10 MB/sec, 11 hours, <10 workers  | $50 upload, $450 storage
  Reprojection         | 400 GB, 45K files      | 3500 hours, 20-100 workers        | $420 CPU, $60 download
  Derivation reduction | 5-7 GB, 55K files      | 1800 hours, 20-100 workers        | $216 CPU, $1 download, $6 storage
  Analysis reduction   | <10 GB, ~1K files      | 1800 hours, 20-100 workers        | $216 CPU, $2 download, $9 storage

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to use

bull Use existing NTFS APIs to access a durable drive bull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications to the cloud

bull A Windows Azure Drive is a Page Blob

bull Example mount Page Blob as X bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blob bull Drive made durable through standard Page Blob replication

bull Drive persists even when not mounted as a Page Blob

Windows Azure Tables

bull Provides Structured Storage

bull Massively Scalable Tables bull Billions of entities (rows) and TBs of data

bull Can use thousands of servers as traffic grows

bull Highly Available amp Durable bull Data is replicated several times

bull Familiar and Easy to use API

bull WCF Data Services and OData bull NET classes and LINQ

bull REST ndash with any platform or language

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code – TableEntities

    public void UpdateFrom(AttendeeEntity other)
    {
        FirstName = other.FirstName;
        LastName = other.LastName;
        Email = other.Email;
        Birthday = other.Birthday;
        FavoriteIceCream = other.FavoriteIceCream;
        YearsInPhD = other.YearsInPhD;
        Graduated = other.Graduated;
        UpdateKeys();
    }

    public void UpdateKeys()
    {
        PartitionKey = "SummerSchool";
        RowKey = Email;
    }

Code – TableHelper.cs

    public class TableHelper
    {
        private CloudTableClient client = null;
        private TableServiceContext context = null;
        private Dictionary<string, AttendeeEntity> allAttendees = null;
        private string tableName = "Attendees";

        private CloudTableClient Client
        {
            get
            {
                if (client == null)
                    client = new CloudStorageAccount(AccountInformation.Credentials, false)
                        .CreateCloudTableClient();
                return client;
            }
        }

        private TableServiceContext Context
        {
            get
            {
                if (context == null)
                    context = Client.GetDataServiceContext();
                return context;
            }
        }
    }

Code – TableHelper.cs

    private void ReadAllAttendees()
    {
        allAttendees = new Dictionary<string, AttendeeEntity>();
        CloudTableQuery<AttendeeEntity> query =
            Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
        try
        {
            foreach (AttendeeEntity attendee in query)
                allAttendees[attendee.Email] = attendee;
        }
        catch (Exception)
        {
            // No entries in table - or other exception.
        }
    }

Code – TableHelper.cs

    public void DeleteAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (!allAttendees.ContainsKey(email))
            return;
        AttendeeEntity attendee = allAttendees[email];

        // Delete from the cloud table.
        Context.DeleteObject(attendee);
        Context.SaveChanges();

        // Delete from the memory cache.
        allAttendees.Remove(email);
    }

Code – TableHelper.cs

    public AttendeeEntity GetAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (allAttendees.ContainsKey(email))
            return allAttendees[email];
        return null;
    }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

    public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
    {
        foreach (AttendeeEntity attendee in updatedAttendees)
            UpdateAttendee(attendee, false);
        Context.SaveChanges(SaveChangesOptions.Batch);
    }

    public void UpdateAttendee(AttendeeEntity attendee)
    {
        UpdateAttendee(attendee, true);
    }

    private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
    {
        if (allAttendees.ContainsKey(attendee.Email))
        {
            AttendeeEntity existingAttendee = allAttendees[attendee.Email];
            existingAttendee.UpdateFrom(attendee);
            Context.UpdateObject(existingAttendee);
        }
        else
        {
            Context.AddObject(tableName, attendee);
        }
        if (saveChanges)
            Context.SaveChanges();
    }

Application Code – Cloud Tables

    private void SaveButton_Click(object sender, RoutedEventArgs e)
    {
        // Write to the table.
        tableHelper.UpdateAttendees(attendees);
    }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs.
• The fundamental choice: fewer, larger VMs vs. many smaller instances.
• If you scale better than linearly across cores, larger VMs could save you money.
• It is pretty rare to see linear scaling across 8 cores.
• More instances may provide better uptime and reliability (more failures are needed to take your service down).
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows.
• 1 role instance ≠ one specific task for your code.
• You're paying for the entire VM, so why not use it?
• A common mistake: splitting code into multiple roles, each not using up its CPU.
• Balance using up the CPU against having free capacity in times of need.
• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency.
• This may not be ideal if the number of active processes exceeds the number of cores.
• Use multithreading aggressively.
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads.
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below).
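As a short, hedged illustration of both TPL styles (workItems, ProcessItem, FetchFromStorage, and TransformLocalFiles are placeholder names, not part of any sample in this deck):

    using System.Threading.Tasks;

    // Data parallelism: let the TPL spread independent iterations across all cores of the VM.
    Parallel.ForEach(workItems, item => ProcessItem(item));

    // Task parallelism: run two unrelated operations concurrently.
    Task<string> download = Task.Factory.StartNew(() => FetchFromStorage());
    Task transform = Task.Factory.StartNew(() => TransformLocalFiles());
    Task.WaitAll(download, transform);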

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive.
• Find pieces of code that are intensive in different resources to live together.
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled).
• Spinning VMs up and down automatically is good at large scale.
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running.
• Being too aggressive in spinning down VMs can result in poor user experience.
• It is a performance-vs.-cost trade-off: the risk of failure or poor user experience from not having excess capacity, against the cost of idling VMs.
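A sketch of what the scaling signal might look like, assuming the same StorageClient library as the code earlier; SetInstanceCount is a hypothetical wrapper (in 2011, changing the instance count means calling the Service Management API or editing the service configuration):

    // Poll the approximate queue length and derive a target worker count.
    CloudQueue queue = new CloudStorageAccount(AccountInformation.Credentials, false)
        .CreateCloudQueueClient()
        .GetQueueReference("tasks");

    int backlog = queue.RetrieveApproximateMessageCount();
    int target = Math.Min(100, Math.Max(2, backlog / 500)); // e.g. one worker per 500 messages
    SetInstanceCount("WorkerRole", target); // hypothetical Service Management API wrapper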

Storage Costs

• Understand your application's storage profile and how storage billing works.
• Make service choices based on your app profile: e.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction. The service choice can make a big cost difference depending on your profile.
• Caching and compressing help a lot with storage costs.

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile. Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places, and sending fewer things means your VM has time to do other tasks. All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content.
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.
2. Trade off compute costs for storage size.
3. Minimize image sizes.
   • Use Portable Network Graphics (PNGs).
   • Crush your PNGs.
   • Strip needless metadata.
   • Make all PNGs palette PNGs.

[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]
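A minimal sketch of the compute-for-storage trade-off, using only the standard GZipStream class (the helper name is ours):

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    // Gzip a string before storing it, trading a little CPU for smaller blobs
    // and less bandwidth on every later read.
    static byte[] GzipText(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }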

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides.

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web:
  • a Hadoop implementation;
  • Hadoop has a long history and has been improved for stability;
  • originally designed for cluster systems.
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure:
  • designed from the start to use cloud primitives;
  • built-in fault tolerance;
  • a REST-based interface for writing your own clients.


Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx


Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics.
• Identifies similarity between bio-sequences.

Computationally intensive:
• A large number of pairwise alignment operations.
• A BLAST run can take 700–1,000 CPU hours.
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months.

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel.
• Segment the database (e.g. mpiBLAST); this needs special result-reduction processing.

Large volume of data:
• A normal BLAST database can be as large as 10 GB.
• 100 nodes means the peak storage bandwidth could reach 1 TB.
• The output of BLAST is usually 10–100x larger than the input.

• A parallel BLAST engine on Azure.
• Query-segmentation, data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done.
• Follows the generally suggested application model: Web Role + Queue + Worker (see the enqueue sketch after the citation below).
• With three special considerations, among them batch job management and task parallelism on an elastic cloud.

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
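A hedged sketch of the enqueue side of that Web Role + Queue + Worker pattern (the "dispatch" queue name and SplitInputIntoPartitions are illustrative, not AzureBLAST's actual code):

    // Split the input sequences into partitions and enqueue one work ticket per partition.
    CloudQueue dispatchQueue = new CloudStorageAccount(AccountInformation.Credentials, false)
        .CreateCloudQueueClient()
        .GetQueueReference("dispatch");
    dispatchQueue.CreateIfNotExist();

    foreach (string partitionBlobName in SplitInputIntoPartitions(inputSequences, 100))
    {
        // The message carries only a blob reference; the sequences themselves live in blob storage.
        dispatchQueue.AddMessage(new CloudQueueMessage(partitionBlobName));
    }
    // Worker roles dequeue and run BLAST on each partition; a merging task combines the results.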

A simple split/join pattern.

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST;
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes.

Task granularity:
• Too large a partition → load imbalance.
• Too small a partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead).
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task – essentially an estimate of the task's run time:
• Too small → repeated computation.
• Too large → an unnecessarily long wait before another instance picks the task up if the original instance fails.
Best practice:
• Estimate the value based on the number of pair-bases in the partition and on test runs.
• Watch out for the 2-hour maximum limitation.
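In code, that estimate simply becomes the visibility timeout passed to GetMessage (EstimateRunTime is a hypothetical helper standing in for the pair-base heuristic above):

    // Hide each BLAST work ticket for roughly the estimated task run time,
    // capped safely below the 2-hour maximum visibility timeout.
    TimeSpan estimate = EstimateRunTime(partition); // from pair-base counts and test runs
    if (estimate > TimeSpan.FromMinutes(115))
        estimate = TimeSpan.FromMinutes(115);
    CloudQueueMessage task = dispatchQueue.GetMessage(estimate);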

[Diagram: a splitting task fans out to many parallel BLAST tasks, whose outputs feed a merging task.]

Task size vs. performance:
• Benefit from the warm-cache effect.
• 100 sequences per partition is the best choice.

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to memory capacity.

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput by fully utilizing the resource.

[Architecture diagram: a web role exposes the web portal and web service for job registration; a job management role runs the job scheduler and scaling engine, tracking state in a job registry (Azure table); a global dispatch queue feeds a pool of worker instances; Azure blobs hold the NCBI databases, BLAST databases, and temporary data, maintained by a database-updating role. Within the workers, a splitting task fans out BLAST tasks whose results feed a merging task.]

• An ASP.NET program hosted by a web role instance: submit jobs, and track each job's status and logs.
• Authentication/authorization is based on Live ID.
• An accepted job is stored in the job registry table (fault tolerance: avoid in-memory state).


R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW).

Blasted ~5,000 proteins (700K sequences):
• against all NCBI non-redundant proteins, completed in 30 min;
• against ~5,000 proteins from another strain, completed in less than 30 sec.

AzureBLAST significantly saved computing time…

Discovering homologs: discover the interrelationships of known protein sequences.

An "all against all" query: the database is also the input query.
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total.
• Theoretically, 100 billion sequence comparisons.

Performance estimation, based on sample runs on one extra-large Azure instance: the job would require 3,216,731 minutes (6.1 years) on one desktop.

This is one of the biggest BLAST jobs as far as we know; experiments at this scale are usually infeasible for most scientists.

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), West Europe, and North Europe.
• 8 deployments of AzureBLAST, each with its own co-located storage service.
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution, and each segment consists of smaller partitions.
• When loads become imbalanced, redistribute the load manually.

[Map: per-deployment allocation of roughly 50–62 extra-large instances each, across the four datacenters.]

• The total size of the output result is ~230 GB; the total number of hits is 1,764,579,487.
• Started on March 25th; the last task completed on April 8th (10 days of compute).
• Based on our estimates, the real working instance time should be 6–8 days.
• Look into the log data to analyze what took place…


A normal log record is a matched "Executing"/"done" pair; otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins.
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins.
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins.
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins.

Note that task 251774 never logs completion.
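A small sketch of the kind of log analysis involved: collect task IDs that logged "Executing" but never a matching "done" line (the method and regexes are ours, not the AzureBLAST tooling):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Flag tasks that started but never logged completion.
    static IEnumerable<string> FindIncompleteTasks(IEnumerable<string> logLines)
    {
        HashSet<string> started = new HashSet<string>();
        foreach (string line in logLines)
        {
            Match m = Regex.Match(line, @"Executing the task (\d+)");
            if (m.Success) { started.Add(m.Groups[1].Value); continue; }
            m = Regex.Match(line, @"Execution of task (\d+) is done");
            if (m.Success) started.Remove(m.Groups[1].Value);
        }
        return started; // e.g. task 251774 in the excerpt above
    }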

North Europe datacenter: a total of 34,256 tasks processed.
• All 62 compute nodes lost tasks and then came back in a group – this is an update domain (~30 mins, ~6 nodes in one group).
• 35 nodes experienced blob-writing failures at the same time.

West Europe datacenter: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, the inverse of ra (m s⁻¹)
gs = conductivity of plant stoma air, the inverse of rs (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• lots of inputs, big data reduction;
• some of the inputs are not so simple.

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (or evaporation through plant membranes) by plants.
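As a worked example, the formula transcribes directly into a small function; this is a plain transcription under the unit assumptions listed above, not MODISAzure's actual implementation:

    // Penman-Monteith ET; inputs are assumed to be in the units listed above.
    static double PenmanMonteithET(
        double delta,        // Δ, Pa/K
        double rn,           // Rn, W/m^2
        double rhoA,         // ρa, kg/m^3
        double cp,           // cp, J/(kg K)
        double dq,           // δq, Pa
        double ga,           // ga, m/s
        double gs,           // gs, m/s
        double lambdaV,      // λv, J/g
        double gamma = 66.0) // γ, Pa/K
    {
        return (delta * rn + rhoA * cp * dq * ga)
             / ((delta + gamma * (1.0 + ga / gs)) * lambdaV);
    }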

Inputs:
• NASA MODIS imagery source archives: 5 TB (600K files).
• FLUXNET curated sensor dataset: 30 GB (960 files).
• FLUXNET curated field dataset: 2 KB (1 file).
• NCEP/NCAR: ~100 MB (4K files).
• Vegetative clumping: ~5 MB (1 file).
• Climate classification: ~1 MB (1 file).

20 US years = 1 global year.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites.
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles.
• Simple nearest-neighbor or spline algorithms.

Derivation reduction stage:
• The first stage visible to the scientist.
• Computes ET in our initial use.

Analysis reduction stage:
• An optional second stage visible to the scientist.
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors.

[Pipeline diagram: the AzureMODIS service web role portal receives requests on a request queue; the data collection stage pulls source imagery from download sites and records source metadata, feeding the reprojection queue and stage, then the reduction 1 and reduction 2 queues (derivation and analysis reduction stages), and finally a download queue through which scientists retrieve the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure service is the web role front door: it receives all user requests and queues each request to the appropriate download, reprojection, or reduction job queue.
• The service monitor is a dedicated worker role: it parses all job requests into tasks – recoverable units of work.
• The execution status of all jobs and tasks is persisted in tables.

[Diagram: the MODISAzure service (web role) persists each &lt;PipelineStage&gt; request as a &lt;PipelineStage&gt;JobStatus entry and enqueues it on the &lt;PipelineStage&gt;Job queue; the service monitor (worker role) parses it, persists &lt;PipelineStage&gt;TaskStatus entries, and dispatches work to the &lt;PipelineStage&gt;Task queue.]

All the work is actually done by a GenericWorker (worker role):
• Sandboxes the science or other executable.
• Marshalls all storage from/to Azure blob storage to/from local files on the Azure worker instance.
• Dequeues tasks created by the service monitor.
• Retries failed tasks 3 times.
• Maintains all task status.

[Diagram: GenericWorker instances pull from the &lt;PipelineStage&gt;Task queue, read &lt;Input&gt;Data storage, and report &lt;PipelineStage&gt;TaskStatus.]

[Diagram: a reprojection request flows through the service monitor, which persists a ReprojectionJobStatus entity (each specifying a single reprojection job request) and ReprojectionTaskStatus entities (each specifying a single reprojection task, i.e. a single tile), then dispatches to the task queue consumed by GenericWorkers. Query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile; swath source data storage and reprojection data storage hold the tile data.]

• Computational costs are driven by the data scale and the need to run reductions multiple times.
• Storage costs are driven by the data scale and the 6-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

Approximate costs by stage:
• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage.
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20–100 workers – $420 compute, $60 download.
• Derivation reduction stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers – $216 compute, $1 download, $6 storage.
• Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers – $216 compute, $2 download, $9 storage.

Total: $1,420.

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• They provide valuable fault-tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premise compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.



Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message delivery

bull Simple asynchronous work dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

bull Different for each data type (blobs entities queues) Every data object has a partition key

bull A partition can be served by a single server

bull System load balances partitions based on traffic pattern

bull Controls entity locality Partition key is unit of scale

bull Load balancing can take a few minutes to kick in

bull Can take a couple of seconds for partition to be available on a different server

System load balances

bull Use exponential backoff on ldquoServer Busyrdquo

bull Our system load balances to meet your traffic needs

bull Single partition limits have been reached Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order processing: do not rely on message order
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8KB: use a blob to store the message data, with a reference in the message, and garbage collect orphaned blobs (see the sketch below)
• Batch messages where possible
• Use the message count to scale: dynamically increase/reduce workers
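A minimal sketch of the blob-reference pattern for oversized messages; the class and parameter names are assumptions for illustration:

using System;
using Microsoft.WindowsAzure.StorageClient;

public class LargeMessageSender
{
    // 'container' and 'queue' are assumed to be initialized elsewhere.
    public static void Send(CloudBlobContainer container, CloudQueue queue, string largePayload)
    {
        string blobName = Guid.NewGuid().ToString();
        CloudBlob blob = container.GetBlobReference(blobName);
        blob.UploadText(largePayload);                      // the payload may exceed 8KB
        queue.AddMessage(new CloudQueueMessage(blobName));  // work ticket: just the reference
        // A separate sweeper should garbage collect blobs whose messages were lost.
    }
}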

Windows Azure Storage Takeaways
Blobs, Drives, Tables, and Queues
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    // GetContentType maps a file extension to a MIME type (defined elsewhere in BlobHelper).
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // And make sure you set the content type to the appropriate MIME type
    // (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in the table - or some other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up the CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
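A minimal sketch of both Task Parallel Library styles on .NET 4; the class name and workload are assumptions for illustration:

using System;
using System.Threading.Tasks;

public class TplExamples
{
    public static void Main()
    {
        int[] squares = new int[1000];

        // Data parallelism: the same operation over each element, in parallel.
        Parallel.For(0, squares.Length, i => { squares[i] = i * i; });

        // Task parallelism: independent operations running concurrently.
        Task first = Task.Factory.StartNew(() => Console.WriteLine("task one"));
        Task second = Task.Factory.StartNew(() => Console.WriteLine("task two"));
        Task.WaitAll(first, second);
    }
}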

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• The service choice can make a big cost difference depending on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content (a sketch follows below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
[Diagram: uncompressed content passes through Gzip plus JavaScript, CSS, and image minification to become compressed content.]
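A minimal sketch of Gzip compression with the .NET base class library (the class name is an assumption; callers would also set the Content-Encoding: gzip response header):

using System.IO;
using System.IO.Compression;

public static class ContentCompressor
{
    // Gzip-compresses a byte buffer for transfer or storage.
    public static byte[] Gzip(byte[] raw)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return buffer.ToArray();
        }
    }
}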

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure
• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • A Hadoop implementation; Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

It is computationally intensive:
• A large number of pairwise alignment operations
• A single BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

It involves large volumes of data:
• A normal BLAST database can be as large as 10GB, so with 100 nodes the peak storage bandwidth demand could reach 1TB
• The output of BLAST is usually 10-100x larger than the input

• AzureBLAST: a parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations, including batch job management and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple Split/Join pattern
• Leverage the multiple cores of one instance: the "-a" argument of NCBI-BLAST (1/2/4/8 for the small, medium, large, and extra-large instance sizes)
• Task granularity: a large partition causes load imbalance; a small partition causes unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: use test runs for profiling, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task: essentially an estimate of the task run time
  • Too small: repeated computation; too large: an unnecessarily long wait in case of an instance failure
  • Best practice: estimate the value based on the number of pair-bases in the partition and on test runs (a sketch follows below), and watch out for the 2-hour maximum limitation
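A hypothetical helper sketching this estimate; the method name, the 1.5x safety margin, and the profiled-rate argument are assumptions for illustration, not from AzureBLAST itself:

public static TimeSpan EstimateVisibilityTimeout(long pairBases, double pairBasesPerSecond)
{
    // Profiled rate from test runs, padded with a safety margin (assumption).
    double seconds = (pairBases / pairBasesPerSecond) * 1.5;
    // The queue imposes a 2-hour maximum on the visibility timeout.
    double maxSeconds = TimeSpan.FromHours(2).TotalSeconds;
    return TimeSpan.FromSeconds(Math.Min(seconds, maxSeconds));
}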

[Diagram: a splitting task fans out into parallel BLAST tasks, which feed a merging task.]

Task Size vs. Performance
• Benefit of the warm cache effect: 100 sequences per partition is the best choice
Instance Size vs. Performance
• Super-linear speedup with larger worker instances, primarily due to the memory capability
Task Size/Instance Size vs. Cost
• The extra-large instance generated the best and most economical throughput by fully utilizing the resource

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler, Scaling Engine, and a database-updating role; a global dispatch queue feeds the worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI/BLAST databases and temporary data. A splitting task fans BLAST tasks out to the workers, and a merging task joins their results.]

The job portal is an ASP.NET program hosted by a web role instance:
• Submit jobs; track a job's status and logs
• Authentication/authorization is based on Live ID
• An accepted job is stored in the job registry table (for fault tolerance: avoid in-memory state)
[Diagram: the Web Portal and Web Service handle job registration, feeding the Job Scheduler, Scaling Engine, and Job Registry.]

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)
• Blasted ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences with an "all against all" query: the database is also the input query
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total: theoretically 100 billion sequence comparisons
• Performance estimation, based on sample runs on one extra-large Azure instance: the full job would require 3,216,731 minutes (6.1 years) on one desktop
• One of the biggest BLAST jobs as far as we know; this scale of experiment is usually infeasible for most scientists
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• The 10 million sequences were divided into multiple segments, each submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances occurred, the load was redistributed manually


• The total size of the output result is ~230GB; the number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But, based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like the following; otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

(Note the task started at 8:22 with no matching completion record.)

North Europe Data Center: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group; this is an update domain at work (~30 mins, ~6 nodes in one group)
• 35 nodes experienced blob-writing failures at the same time

West Europe Data Center: 30,976 tasks were completed, and then the job was killed
• A reasonable guess: the fault domain is at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish Proverb)

ET = Water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = Latent heat of vaporization (J g⁻¹)
Rn = Net radiation (W m⁻²)
cp = Specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = Dry air density (kg m⁻³)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s⁻¹)
gs = Conductivity of plant stoma (inverse of rs) (m s⁻¹)
γ = Psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

The Penman-Monteith equation (1964):

    ET = (Δ Rn + ρa cp δq ga) / ((Δ + γ (1 + ga/gs)) λv)
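A minimal sketch directly evaluating the equation above; the method and parameter names are assumptions, and argument units follow the definitions in the text:

public static double ComputeET(double delta, double rn, double rhoA, double cp,
                               double deltaQ, double gA, double gS,
                               double gamma, double lambdaV)
{
    double numerator = delta * rn + rhoA * cp * deltaQ * gA;
    double denominator = (delta + gamma * (1.0 + gA / gS)) * lambdaV;
    return numerator / denominator;
}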

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Inputs:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100MB (4K files)
• Vegetative clumping: ~5MB (1 file)
• Climate classification: ~1MB (1 file)
Scale: 20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: the AzureMODIS Service Web Role portal receives requests on a request queue; the data collection stage downloads source imagery from download sites and records source metadata; the reprojection queue feeds the reprojection stage; the Reduction 1 and Reduction 2 queues feed the derivation and analysis reduction stages; scientists retrieve scientific results through the download queue.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks, which are recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches work to the <PipelineStage> Task Queue.]

All work is actually done by a Worker Role (the GenericWorker):
• Sandboxes science or other executables
• Marshals all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage> Task Queue, whose entries are parsed and persisted as <PipelineStage>TaskStatus by the Service Monitor, and read <Input>Data Storage.]

[Diagram, reprojection example: a reprojection request reaches the Service Monitor (Worker Role), which persists a ReprojectionJobStatus entity (each entity specifies a single reprojection job request) for the Job Queue, then parses and persists ReprojectionTaskStatus entities (each entity specifies a single reprojection task, i.e. a single tile) dispatched to the Task Queue consumed by GenericWorkers. The SwathGranuleMeta table can be queried for geo-metadata (e.g., boundaries) of each swath tile; the ScanTimeList table can be queried for the list of satellite scan times that cover a target tile. Workers read Swath Source Data Storage and write Reprojection Data Storage.]

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Stage               | Data                  | Compute                          | Approximate cost
Data collection     | 400-500 GB, 60K files | 10 MB/sec, 11 hours, <10 workers | $50 upload + $450 storage
Reprojection        | 400 GB, 45K files     | 3,500 hours, 20-100 workers      | $420 CPU + $60 download
Derivation reduction| 5-7 GB, 55K files     | 1,800 hours, 20-100 workers      | $216 CPU + $1 download + $6 storage
Analysis reduction  | <10 GB, ~1K files     | 1,800 hours, 20-100 workers      | $216 CPU + $2 download + $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


Storage Partitioning

Understanding partitioning is key to understanding performance.
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server, and the system load balances partitions based on traffic pattern
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• Use exponential backoff on "Server Busy": it means the single-partition limits have been reached, and the system load balances partitions to meet your traffic needs
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages – Queue Name: all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1
Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially the whole table can be served as a single partition range; under load, the system can split it into ranges (here, [Action … Animation] and [Comedy … War]) served by different partition servers.

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible; hot partitions can be load balanced; the PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans; parallelize queries; point queries are most efficient

Entity group transactions
• Transactions work only across a single partition; transaction semantics reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partition Entities ndash TableName + PartitionKey

PartitionKey (CustomerId) RowKey (RowKind)

Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partition Blobs ndash Container name + Blob name

bullAll messages for a single queue belong to the same partition Messages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• We want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern (sketched below)

• Why not simply use a table?
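A minimal sketch of the work ticket pattern just mentioned, under the assumption of the StorageClient blob and queue types used later; the container, queue, and jobId names are illustrative:

    // Producer (e.g. a web role): store the large payload in a blob,
    // then enqueue a small "work ticket" that merely points at it.
    string jobId = Guid.NewGuid().ToString();                      // illustrative ticket id
    container.GetBlobReference("jobs/" + jobId).UploadText(largePayload);
    queue.AddMessage(new CloudQueueMessage(jobId));                // well under the 8 KB limit

    // Consumer (a worker role): dequeue the ticket, fetch the payload by reference.
    CloudQueueMessage ticket = queue.GetMessage();
    if (ticket != null)
    {
        string payload = container.GetBlobReference("jobs/" + ticket.AsString).DownloadText();
        // ... process, then delete the message (and garbage collect the blob)
        queue.DeleteMessage(ticket);
    }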

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1–4 to a queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage to delete them once processed.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1 (see the sketch below).
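A sketch of that polling loop in a worker role, assuming the StorageClient queue API; the 1 s floor, 60 s ceiling, and ProcessMessage handler are illustrative:

    TimeSpan interval = TimeSpan.FromSeconds(1);
    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage();
        if (msg != null)
        {
            ProcessMessage(msg);                     // hypothetical handler
            queue.DeleteMessage(msg);
            interval = TimeSpan.FromSeconds(1);      // successful poll: reset the interval
        }
        else
        {
            System.Threading.Thread.Sleep(interval);
            // Empty poll: double the interval, truncated at the ceiling.
            interval = TimeSpan.FromSeconds(Math.Min(60, interval.TotalSeconds * 2));
        }
    }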

Removing Poison Messages

[Diagram sequence: producers P1 and P2 feed a queue consumed by C1 and C2; the paired numbers in the original figure track each message and its dequeue count.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumes msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashes
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarts
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – the poison message is removed
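The dequeue-count check from step 12, sketched with the StorageClient queue API; the threshold of 2 follows the figure:

    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));   // 30 s visibility timeout
    if (msg != null)
    {
        if (msg.DequeueCount > 2)
        {
            // Poison message: it has failed repeatedly, so stop retrying.
            // Optionally log it or park it in a table/blob for inspection.
            queue.DeleteMessage(msg);
        }
        else
        {
            ProcessMessage(msg);          // must be idempotent - it may run more than once
            queue.DeleteMessage(msg);
        }
    }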

Queues Recap

• No need to deal with failures – make message processing idempotent
• Invisible messages result in out-of-order delivery – do not rely on order
• Enforce a threshold on a message's dequeue count – use the dequeue count to remove poison messages
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers – use the message count to scale (see the sketch below)
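A sketch of using the message count to scale, per the last recap item; workerCapacity and maxWorkers are illustrative knobs, and the actual instance-count change would go through the Service Management API rather than the comment shown here:

    int backlog = queue.RetrieveApproximateMessageCount();
    const int workerCapacity = 100;   // illustrative: messages one worker clears per scaling period
    const int maxWorkers = 20;        // illustrative ceiling to bound cost
    int desired = Math.Max(1, Math.Min(maxWorkers, (backlog + workerCapacity - 1) / workerCapacity));
    // ... request 'desired' worker role instances via the Service Management API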

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

http://blogs.msdn.com/windowsazurestorage

http://azurescope.cloudapp.net

A Quick Exercise

…Then let's look at some code and some tools

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
            client = new CloudStorageAccount(AccountInformation.Credentials, false)
                         .CreateCloudBlobClient();
        container = client.GetContainerReference(defaultContainerName);
        container.CreateIfNotExist();
        BlobContainerPermissions permissions = container.GetPermissions();
        permissions.PublicAccess = BlobContainerPublicAccessType.Container;
        container.SetPermissions(permissions);
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> – or in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    …
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];
    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();
    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code

• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU

• Balance between using up CPU and having free capacity in times of need

• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library (a short sketch follows)
    • Data parallelism
    • Task parallelism
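A minimal .NET 4 TPL sketch of both styles; inputFiles, ProcessFile, CompressImages, and BuildIndex are illustrative placeholders:

    using System.Threading.Tasks;

    // Data parallelism: partition one collection across the available cores.
    Parallel.ForEach(inputFiles, file => ProcessFile(file));

    // Task parallelism: run independent units of work concurrently.
    Task compress = Task.Factory.StartNew(() => CompressImages());
    Task index = Task.Factory.StartNew(() => BuildIndex());
    Task.WaitAll(compress, index);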

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive

• Find pieces of code that are intensive in different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

[Diagram: balancing Performance against Cost.]

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing – they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Saving bandwidth costs often leads to savings in other places:
• Sending fewer things over the wire often means getting fewer things from storage
• Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (sketched below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]
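One common ASP.NET way to apply the first point, sketched under the assumption that responses flow through the HttpApplication pipeline (e.g. wired up in Global.asax); it is an illustration, not the deck's own sample:

    using System.IO.Compression;

    // Wrap the response stream in a GZipStream when the client advertises gzip support.
    void Application_BeginRequest(object sender, EventArgs e)
    {
        HttpApplication app = (HttpApplication)sender;
        string acceptEncoding = app.Request.Headers["Accept-Encoding"];
        if (!string.IsNullOrEmpty(acceptEncoding) && acceptEncoding.Contains("gzip"))
        {
            app.Response.Filter = new GZipStream(app.Response.Filter, CompressionMode.Compress);
            app.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }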

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for Map Reduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients


Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics

• Identifies similarity between bio-sequences

Computationally intensive

• Large number of pairwise alignment operations

• A BLAST run can take 700–1000 CPU hours

• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

• Segment the input
  • Segment processing (querying) is pleasingly parallel

• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing

Large volume of data

• A normal BLAST database can be as large as 10 GB

• With 100 nodes, peak storage bandwidth could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the generally suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• Set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task's run time
  • Too small: repeated computation
  • Too large: an unnecessarily long wait in case of instance failure
Best practice:
• Estimate the value based on the number of pair-bases in the partition and on test runs
• Watch out for the 2-hour maximum limitation (see the sketch below)
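A sketch of applying such an estimate when dequeuing a BLAST task, assuming the StorageClient queue API; EstimateRunTime and partition.PairBaseCount are hypothetical names for the estimator the slide describes:

    // Estimate the visibility timeout from the partition size, capped at the 2-hour queue maximum.
    TimeSpan timeout = EstimateRunTime(partition.PairBaseCount);   // hypothetical estimator
    if (timeout > TimeSpan.FromHours(2))
        timeout = TimeSpan.FromHours(2);
    CloudQueueMessage task = taskQueue.GetMessage(timeout);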

[Diagram: a splitting task fans the input out into parallel BLAST tasks, whose results feed a merging task.]

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler, the Scaling Engine, and a database-updating role; worker instances pull splitting, BLAST, and merging tasks from a global dispatch queue; an Azure Table holds the Job Registry, and Azure Blob storage holds the NCBI BLAST databases, temporary data, etc.]

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states


R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• Experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe

• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service

• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions

• When the load is imbalanced, redistribute it manually


• The total size of the output result is ~230 GB

• The number of total hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, the real working instance time should be 6–8 days

• Look into the log data to analyze what took place…


A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group – this is an update domain (~30 mins, ~6 nodes in one group). 35 nodes experienced blob-writing failure at the same time.

West Europe Data Center: 30,976 tasks were completed before the job was killed.

A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS pipeline – an AzureMODIS Service Web Role portal accepts requests into a request queue; the data collection stage pulls source imagery from download sites (tracked by source metadata) and feeds the reprojection queue; the Reduction 1 and Reduction 2 queues drive the derivation and analysis reduction stages; scientists retrieve science results via a download queue.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and places the job on the <PipelineStage>Job queue; the Service Monitor (Worker Role) parses it, persists <PipelineStage>TaskStatus, and dispatches work onto the <PipelineStage>Task queue.]

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes science or other executables
• Marshalls all storage from/to Azure blob storage, to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker instances pull from the <PipelineStage>Task queue, read <Input>Data storage, and report <PipelineStage>TaskStatus alongside the Service Monitor.]

[Diagram: a reprojection request flows through the Service Monitor, which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) via the job queue, then parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile) and dispatches to the task queue worked by GenericWorkers. The workers consult SwathGranuleMeta (query this table to get geo-metadata, e.g. boundaries, for each swath tile) and ScanTimeList (query this table to get the list of satellite scan times that cover a target tile), reading swath source data storage and writing reprojection data storage.]

• Computational costs are driven by data scale and the need to run reductions multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate student rates

[Diagram: the AzureMODIS pipeline stages annotated with data sizes, compute, and costs:]

Stage                 | Data                   | Compute                          | Cost
Data collection       | 400-500 GB, 60K files  | 10 MB/sec, 11 hours, <10 workers | $50 upload, $450 storage
Reprojection          | 400 GB, 45K files      | 3500 hours, 20-100 workers       | $420 cpu, $60 download
Derivation reduction  | 5-7 GB, 55K files      | 1800 hours, 20-100 workers       | $216 cpu, $1 download, $6 storage
Analysis reduction    | <10 GB, ~1K files      | 1800 hours, 20-100 workers       | $216 cpu, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Scalability Targets

Storage Account

bull Capacity ndash Up to 100 TBs

bull Transactions ndash Up to a few thousand requests per second

bull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey (Category)

RowKey (Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
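For comparison, here is roughly the same lifecycle through the managed StorageClient API rather than raw REST; this is a sketch assuming the AccountInformation helper defined later in this deck:

CloudQueueClient queueClient =
    new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
CloudQueue queue = queueClient.GetQueueReference("myqueue");
queue.CreateIfNotExist();

queue.AddMessage(new CloudQueueMessage("<test>test</test>"));       // PutMessage

CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30)); // GetMessage with a 30 s visibility timeout
if (msg != null)
{
    // ... process the work ticket ...
    queue.DeleteMessage(msg);                                       // RemoveMessage
}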

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x, and a successful poll sets the interval back to 1 (a sketch follows below).

44
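A minimal sketch of that policy, assuming the CloudQueue handle from the earlier example; ProcessAndDelete is a hypothetical helper, and the 1-second floor and 60-second cap are illustrative:

TimeSpan interval = TimeSpan.FromSeconds(1);
TimeSpan maxInterval = TimeSpan.FromSeconds(60);
while (true)
{
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
    if (msg != null)
    {
        ProcessAndDelete(queue, msg);        // hypothetical: do the work, then DeleteMessage
        interval = TimeSpan.FromSeconds(1);  // success: reset the interval back to 1
    }
    else
    {
        Thread.Sleep(interval);              // empty poll: wait, then double (truncated at the cap)
        interval = TimeSpan.FromTicks(Math.Min(interval.Ticks * 2, maxInterval.Ticks));
    }
}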

Removing Poison Messages

[Diagram: producers P1 and P2 feed a queue consumed by C1 and C2.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1)

46
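A sketch of the dequeue-count guard from step 12 above; the threshold of 2 and the separate poison queue for later inspection are illustrative choices:

CloudQueue poisonQueue = queueClient.GetQueueReference("myqueue-poison"); // hypothetical holding queue
poisonQueue.CreateIfNotExist();

CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > 2)
    {
        // Poison: it has already failed twice. Park it for inspection instead of retrying forever.
        poisonQueue.AddMessage(new CloudQueueMessage(msg.AsString));
        queue.DeleteMessage(msg);
    }
    else
    {
        Process(msg);              // hypothetical handler
        queue.DeleteMessage(msg);
    }
}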

Queues Recap

• No need to deal with failures → make message processing idempotent
• Invisible messages result in out-of-order delivery → do not rely on order
• Enforce a threshold on a message's dequeue count → use the dequeue count to remove poison messages
• Messages > 8 KB → use a blob to store the message data, with a reference in the message; garbage collect orphaned blobs
• Batch messages where possible
• Dynamically increase/reduce workers → use the message count to scale (see the sketch below)
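A sketch of the "use message count to scale" tip; the thresholds are illustrative, and actually changing the instance count (via the Service Management API, or manually in the portal) is omitted here:

int backlog = queue.RetrieveApproximateMessageCount();        // cheap, approximate queue length
int targetWorkers = Math.Max(1, Math.Min(20, backlog / 100)); // e.g. ~1 worker per 100 queued messages
// Compare targetWorkers to the current instance count and adjust through the Service Management API.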

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

http://blogs.msdn.com/windowsazurestorage

http://azurescope.cloudapp.net

49

A Quick Exercise

…Then let's look at some code and some tools

50

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

51

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
        {
            client = new CloudStorageAccount(AccountInformation.Credentials, false)
                         .CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }
}

52

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate
    // MIME type (e.g. "text/plain").
}

53

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

54

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> — or, in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

55

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    …
}

56

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

57

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

58

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

59

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))  // nothing to delete if the attendee is unknown
        return;
    AttendeeEntity attendee = allAttendees[email];
    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();
    // Delete from the memory cache.
    allAttendees.Remove(email);
}

60

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

61

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

62

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

63

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code up into multiple roles, each not using much CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
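A minimal .NET 4 sketch of the two TPL styles named above; the workloads are placeholders:

using System;
using System.Threading.Tasks;

class ConcurrencyDemo
{
    static void ProcessRecord(int i) { /* CPU-bound work per item (placeholder) */ }

    static void Main()
    {
        // Data parallelism: the same operation over many items,
        // partitioned across the available cores by the runtime.
        Parallel.For(0, 1000, i => ProcessRecord(i));

        // Task parallelism: independent operations running concurrently.
        Task t1 = Task.Factory.StartNew(() => Console.WriteLine("compress output"));
        Task t2 = Task.Factory.StartNew(() => Console.WriteLine("write log batch"));
        Task.WaitAll(t1, t2);
    }
}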

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity (performance) and the cost of having idling VMs (cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
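As a rough worked example — using 2011 list prices purely as an illustration, since prices change: Windows Azure storage billed about $0.01 per 10,000 transactions (plus roughly $0.15 per GB-month), while SQL Azure's 1 GB Web edition was a flat $9.99 per month. An app issuing 100 million table transactions a month would pay about $100 in transaction fees alone, so at that profile the flat fee wins; at 1 million transactions (about $1), tables win.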

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content (see the sketch below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip to become compressed content; minify JavaScript, CSS, and images.]
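One common ASP.NET wiring for tip 1 — a hedged sketch, not the only approach — compresses responses in Global.asax when the client advertises gzip support (uses System.Web and System.IO.Compression):

protected void Application_BeginRequest(object sender, EventArgs e)
{
    HttpResponse response = HttpContext.Current.Response;
    string acceptEncoding = HttpContext.Current.Request.Headers["Accept-Encoding"];
    if (!string.IsNullOrEmpty(acceptEncoding) && acceptEncoding.Contains("gzip"))
    {
        // Wrap the output stream so everything written is gzipped on the fly.
        response.Filter = new GZipStream(response.Filter, CompressionMode.Compress);
        response.AppendHeader("Content-Encoding", "gzip");
    }
}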

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations, including:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• Set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Too large a partition → load imbalance
• Too small a partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → an unnecessarily long wait in case of instance failure
• Best practice: estimate the value based on the number of pair-bases in the partition and on test runs (see the sketch below)
• Watch out for the 2-hour maximum limitation
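An illustrative helper for that estimate; the 60-second base and the per-megabase rate are invented placeholders you would calibrate from test runs, and the cap reflects the 2-hour limit noted above:

static TimeSpan EstimateVisibilityTimeout(long pairBases, double secondsPerMegabase)
{
    double seconds = 60 + (pairBases / 1e6) * secondsPerMegabase; // base cost + size-proportional term
    double capped = Math.Min(seconds, TimeSpan.FromHours(2).TotalSeconds);
    return TimeSpan.FromSeconds(capped);
}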

[Diagram: a splitting task fans the input out to many parallel BLAST tasks, whose outputs feed a merging task.]

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

[Diagram: AzureBLAST architecture. A Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine and dispatches work through a global dispatch queue to worker instances. An Azure Table holds the job registry; Azure Blob storage holds the NCBI/BLAST databases and temporary data, kept current by a database-updating role. Each job runs as a splitting task, many parallel BLAST tasks, and a merging task.]

Job portal: an ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored in the job registry table (fault tolerance: avoid in-memory state)

[Diagram: the web portal and web service hand job registrations to the job scheduler, which works with the scaling engine and the job registry.]

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

BLASTed ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• Experiments at this scale are usually infeasible for most scientists
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters — US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually


• The total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group — this is an update domain (~30 mins, ~6 nodes in one group)
• 35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed, and the job was killed
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·R_n + ρ_a·c_p·δq·g_a) / ((Δ + γ·(1 + g_a/g_s))·λ_v)

where:
ET  = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ   = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λ_v = latent heat of vaporization (J/g)
R_n = net radiation (W m⁻²)
c_p = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρ_a = dry air density (kg m⁻³)
δq  = vapor pressure deficit (Pa)
g_a = conductivity of air (inverse of r_a) (m s⁻¹)
g_s = conductivity of plant stoma, air (inverse of r_s) (m s⁻¹)
γ   = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
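To make the grouping and units concrete, here is the formula above transcribed directly into C# (a sketch; the reduction stage would supply the inputs):

static double Evapotranspiration(
    double delta,    // Δ: Pa/K
    double rn,       // Rn: W/m^2
    double rhoA,     // ρa: kg/m^3
    double cp,       // cp: J/(kg·K)
    double dq,       // δq: Pa
    double ga,       // ga: m/s
    double gs,       // gs: m/s
    double lambdaV,  // λv: J/g
    double gamma)    // γ: Pa/K, ≈ 66
{
    return (delta * rn + rhoA * cp * dq * ga)
         / ((delta + gamma * (1 + ga / gs)) * lambdaV);
}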

Input dataset                      | Size
NASA MODIS imagery source archives | 5 TB (600K files)
FLUXNET curated sensor dataset     | 30 GB (960 files)
FLUXNET curated field dataset      | 2 KB (1 file)
NCEP/NCAR                          | ~100 MB (4K files)
Vegetative clumping                | ~5 MB (1 file)
Climate classification             | ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: MODISAzure pipeline. The AzureMODIS Service Web Role portal places user requests on a request queue; the data collection stage pulls source imagery from the NASA download sites via a download queue; the reprojection, derivation reduction (Reduction 1), and analysis reduction (Reduction 2) queues feed the later stages; scientists download the scientific results. Source metadata accompanies each stage.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks — recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues it on the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches work to the <PipelineStage> task queue.]

All work is actually done by a GenericWorker (Worker Role)
• Sandboxes the science or other executable
• Marshals all storage from/to Azure blob storage to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker instances pull from the <PipelineStage> task queue dispatched by the Service Monitor and read from <Input>Data Storage.]

[Diagram: reprojection dataflow. A reprojection request is persisted as ReprojectionJobStatus (each entity specifies a single reprojection job request) and parsed into ReprojectionTaskStatus entries (each entity specifies a single reprojection task, i.e. a single tile), which are dispatched via the job and task queues to GenericWorker instances; these read swath source data storage and write reprojection data storage. Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile.]

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Stage                              | Data       | Files | Compute             | Workers | Cost
Data collection                    | 400-500 GB | 60K   | 10 MB/sec, 11 hours | <10     | $50 upload, $450 storage
Reprojection                       | 400 GB     | 45K   | 3,500 hours         | 20-100  | $420 CPU, $60 download
Derivation reduction (Reduction 1) | 5-7 GB     | 55K   | 1,800 hours         | 20-100  | $216 CPU, $1 download, $6 storage
Analysis reduction (Reduction 2)   | <10 GB     | ~1K   | 1,800 hours         | 20-100  | $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Key Selection Things to Consider

bullDistribute load as much as possible bullHot partitions can be load balanced bullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scans bull Parallelize queries bull Point queries are most efficient

bullTransactions across a single partition bullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recap bullEfficient for frequently used queries

bullSupports batch transactions

bullDistributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handle continuation tokens

ldquoORrdquo predicates are not optimized

Implement back-off strategy for retries

bullDistribute by using a hash etc as prefix

bullExpect continuation tokens for range queries

bullExecute the queries that form the ldquoORrdquo predicates as separate queries

bullServer busy

bullLoad balance partitions to meet traffic needs

bullLoad on single partition has exceeded the limits

WCF Data Services

bullUse a new context for each logical operation

bullAddObjectAttachTo can throw exception if entity is already being tracked

bullPoint query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();

    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);  // helper (not shown) mapping extension to MIME type
    blob.UploadFile(file.FullName);

    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();

    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName> — in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // …
}

Code – TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;

    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs.

• The fundamental choice: fewer, larger VMs vs. many smaller instances.

• If you scale better than linearly across cores, larger VMs could save you money — but it is pretty rare to see linear scaling across 8 cores.

• More instances may provide better uptime and reliability (more failures are needed to take your service down).

• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows.
• 1 role instance need not mean one specific task for your code.

• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using much CPU.

• Balance using up the CPU vs. having free capacity in times of need.

• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores.

• Use multithreading aggressively.

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads.

• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below).
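A minimal Task Parallel Library sketch; the workItems collection and the ProcessItem/CountRecords/UploadResults helpers are illustrative assumptions:

// using System.Threading.Tasks;

// Data parallelism: partition a collection across the available cores.
Parallel.ForEach(workItems, item => ProcessItem(item));

// Task parallelism: run independent operations concurrently.
Task<int> countTask = Task.Factory.StartNew(() => CountRecords());
Task uploadTask = Task.Factory.StartNew(() => UploadResults());
Task.WaitAll(countTask, uploadTask);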

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive.

• Find code that is intensive with different resources to live together.

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled).

• Spinning VMs up and down automatically is good at large scale.

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running.

• Being too aggressive in spinning down VMs can result in poor user experience.

• It is a performance-versus-cost trade-off: the risk of failure or poor user experience from not having excess capacity, against the cost of having idling VMs.

Storage Costs

• Understand an application's storage profile and how storage billing works.

• Make service choices based on your app profile: e.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference.

• Caching and compressing help a lot with storage costs.

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile.

• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places.

• Sending fewer things also means your VM has time to do other tasks.

• All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content.
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.

2. Trade off compute costs for storage size.

3. Minimize image sizes.
   • Use Portable Network Graphics (PNGs).
   • Crush your PNGs.
   • Strip needless metadata.
   • Make all PNGs palette PNGs.

(Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.)
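A minimal sketch of the compute-for-storage trade-off, gzip-compressing a string before uploading it with the blob types from the earlier slides; the blob name, content type, and someString are illustrative. Setting Content-Encoding lets browsers decompress on the fly:

// using System.IO; using System.IO.Compression; using System.Text;
byte[] raw = Encoding.UTF8.GetBytes(someString);
using (MemoryStream compressed = new MemoryStream())
{
    using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Compress, true))
        gzip.Write(raw, 0, raw.Length);
    compressed.Position = 0;

    CloudBlob blob = container.GetBlobReference("data.txt");
    blob.Properties.ContentType = "text/plain";
    blob.Properties.ContentEncoding = "gzip";   // browsers decompress on the fly
    blob.UploadFromStream(compressed);
}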

Best Practices Summary

• Doing 'less' is the key to saving costs.
• Measure everything.
• Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs on the web:
  • a Hadoop implementation;
  • Hadoop has a long history and has been improved for stability;
  • originally designed for cluster systems.

• Microsoft Research this week is announcing a project code-named Daytona for Map Reduce jobs on Azure:
  • designed from the start to use cloud primitives;
  • built-in fault tolerance;
  • REST-based interface for writing your own clients.

Project Daytona – Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics.
• Identifies similarity between bio-sequences.

Computationally intensive:
• Large number of pairwise alignment operations.
• A BLAST run can take 700–1,000 CPU hours.
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months.

It is easy to parallelize BLAST:
• Segment the input — segment processing (querying) is pleasingly parallel.
• Segment the database (e.g. mpiBLAST) — needs special result-reduction processing.

Large volume of data:
• A normal BLAST database can be as large as 10 GB.
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB.
• The output of BLAST is usually 10-100x larger than the input.

AzureBLAST:
• A parallel BLAST engine on Azure.
• Query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done.
• Follows the generally suggested application model: Web Role + Queue + Worker.
• With three special considerations, among them batch job management and task parallelism on an elastic cloud.

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple Split/Join pattern.

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST;
• set to 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes.

Task granularity:
• Large partitions → load imbalance.
• Small partitions → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead).
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time.
• Too small → repeated computation.
• Too large → an unnecessarily long waiting period in case of instance failure.
• Best practice: estimate the value from the number of pair-bases in the partition and from test runs (see the sketch below), and watch out for the 2-hour maximum limitation.
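A sketch of that estimate; the per-pair-base rate and the safety margin are assumptions that would be calibrated by test runs, not values from the original study:

static TimeSpan EstimateVisibilityTimeout(long pairBasesInPartition, double minutesPerPairBase)
{
    double estimated = pairBasesInPartition * minutesPerPairBase;  // rate calibrated via test runs
    double padded = estimated * 1.5;                               // safety margin (assumed)
    return TimeSpan.FromMinutes(Math.Min(padded, 115));            // stay under the 2-hour cap
}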

(Diagram: a splitting task fans the input out to parallel BLAST tasks, whose outputs feed a merging task.)

Task size vs. performance:
• Benefit of the warm-cache effect.
• 100 sequences per partition is the best choice.

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to the memory capacity.

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resource.

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler, the Scaling Engine, and a database-updating role; worker instances pull work from a global dispatch queue; Azure Tables hold the Job Registry; Azure Blobs hold the NCBI/BLAST databases and temporary data. The work itself is the split/join pipeline of a splitting task, parallel BLAST tasks, and a merging task.)

Web Portal and Web Service:
• An ASP.NET program hosted by a web role instance.
• Submit jobs; track a job's status and logs.
• Authentication/authorization based on Live ID.
• An accepted job is stored in the Job Registry table (for fault tolerance: avoid in-memory state), where the Job Scheduler and Scaling Engine pick it up.

Case study: R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW).

BLASTed ~5,000 proteins (700K sequences):
• against all NCBI non-redundant proteins: completed in 30 min;
• against ~5,000 proteins from another strain: completed in less than 30 sec.

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences with an "all against all" query — the database is also the input query.
• The protein database is large (4.2 GB); 9,865,668 sequences to be queried in total — theoretically 100 billion sequence comparisons.

Performance estimation:
• Based on sample runs on one extra-large Azure instance, this would require 3,216,731 minutes (6.1 years) on one desktop.

One of the biggest BLAST jobs as far as we know — experiments at this scale are usually infeasible for most scientists:
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters — US (2), Western and North Europe.
• 8 deployments of AzureBLAST, each with its own co-located storage service.
• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution, and each segment consists of smaller partitions.
• When the load imbalances, redistribute it manually.


• The total size of the output result is ~230 GB.
• The number of total hits is 1,764,579,487.
• Started on March 25th; the last task completed on April 8th (10 days of compute).
• But based on our estimates, the real working instance time should be 6~8 days.
• Look into the log data to analyze what took place…


A normal log record should look like this:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. a task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe Data Center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group — this is an update domain at work (~30 mins, ~6 nodes in one group).
• 35 nodes experienced blob-writing failures at the same time.

West Europe Data Center: 30,976 tasks were completed and the job was killed.
• A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

ET = water volume evapotranspired (m3 s-1 m-2)
Δ  = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ  = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• lots of inputs, big data reduction;
• some of the inputs are not so simple.

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)    — Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.
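As a worked illustration only, here is a direct C# transcription of the equation above, with all inputs assumed to be in the units listed (no unit conversion or validation is attempted):

static double Evapotranspiration(
    double delta,         // Δ:  Pa/K
    double rn,            // Rn: W/m^2
    double rhoA,          // ρa: kg/m^3
    double cp,            // cp: J/(kg·K)
    double dq,            // δq: Pa
    double ga,            // ga: m/s
    double gs,            // gs: m/s
    double lambdaV,       // λv: J/g
    double gamma = 66.0)  // γ:  Pa/K (psychrometric constant)
{
    return (delta * rn + rhoA * cp * dq * ga)
         / ((delta + gamma * (1.0 + ga / gs)) * lambdaV);
}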

Data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

Data collection (map) stage:
• Downloads the requested input tiles from NASA FTP sites.
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles.
• Simple nearest-neighbor or spline algorithms.

Derivation reduction stage:
• First stage visible to the scientist.
• Computes ET in our initial use.

Analysis reduction stage:
• Optional second stage visible to the scientist.
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors.

(Pipeline diagram: source imagery download sites feed the Data Collection Stage via a Download Queue; the AzureMODIS Service Web Role Portal receives requests through a Request Queue and tracks source metadata; the Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage are connected by the Reprojection, Reduction 1, and Reduction 2 queues; scientists download the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each one to the appropriate Download, Reprojection, or Reduction job queue.

• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks — recoverable units of work.

• The execution status of all jobs and tasks is persisted in Tables.

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

All work is actually done by a Worker Role (the "GenericWorker"):
• Sandboxes the science or other executable.
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files.
• Dequeues tasks created by the Service Monitor.
• Retries failed tasks 3 times.
• Maintains all task status.

(Diagram: a reprojection request flows through the Service Monitor, which persists ReprojectionJobStatus and ReprojectionTaskStatus and dispatches via the Job and Task Queues to GenericWorker instances; the workers read swath source data storage and write reprojection data storage. Each ReprojectionJobStatus entity specifies a single reprojection job request; each ReprojectionTaskStatus entity specifies a single reprojection task, i.e. a single tile. Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile.)

• Computational costs are driven by the data scale and the need to run the reduction multiple times.
• Storage costs are driven by the data scale and the 6-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

Approximate profile and costs per stage of the AzureMODIS pipeline:

Data Collection Stage:      400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers — $50 upload, $450 storage
Reprojection Stage:         400 GB, 45K files, 3500 hours, 20-100 workers — $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers — $216 CPU, $1 download, $6 storage
Analysis Reduction Stage:   <10 GB, ~1K files, 1800 hours, 20-100 workers — $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access.

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.

• Clouds provide valuable fault tolerance and scalability abstractions.

• Clouds act as an amplifier for familiar client tools and on-premise compute.

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.


Tables Recap

• Efficient for frequently used queries; supports batch transactions; distributes load.
• Select a PartitionKey and RowKey that help scale — distribute load by using a hash etc. as a prefix, and avoid "append only" patterns.
• Always handle continuation tokens — expect them for range queries (see the sketch after this list).
• "OR" predicates are not optimized — execute the queries that form the "OR" predicates as separate queries.
• Implement a back-off strategy for retries — "server busy" can mean the load on a single partition has exceeded its limits while the system load-balances partitions to meet traffic needs.
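A minimal sketch of explicit continuation-token handling with the WCF Data Services client, reusing the Context, tableName, and allAttendees members from the earlier TableHelper slides (the paging loop itself is the point; error handling is omitted):

// using System.Data.Services.Client;
DataServiceQuery<AttendeeEntity> query = Context.CreateQuery<AttendeeEntity>(tableName);
QueryOperationResponse<AttendeeEntity> response =
    (QueryOperationResponse<AttendeeEntity>)query.Execute();
while (true)
{
    foreach (AttendeeEntity attendee in response)
        allAttendees[attendee.Email] = attendee;            // process the current page

    DataServiceQueryContinuation<AttendeeEntity> token = response.GetContinuation();
    if (token == null)
        break;                                              // no more pages
    response = Context.Execute(token);                      // fetch the next page
}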

WCF Data Services

• Use a new context for each logical operation.
• AddObject/AttachTo can throw an exception if the entity is already being tracked.
• A point query throws an exception if the resource does not exist — use IgnoreResourceNotFoundException.

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Queues Their Unique Role in Building Reliable Scalable Applications

bull Want roles that work closely together but are not bound together bull Tight coupling leads to brittleness

bull This can aid in scaling and performance

bull A queue can hold an unlimited number of messages bull Messages must be serializable as XML

bull Limited to 8KB in size

bull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library (a sketch follows below)
  • Data parallelism
  • Task parallelism
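A minimal sketch of both TPL styles inside a worker role (the workloads and names here are hypothetical, not from the slides):

using System.Collections.Generic;
using System.Threading.Tasks;

public static class ConcurrencyExamples
{
    public static void ProcessWorkItems(IEnumerable<string> workItems)
    {
        // Data parallelism: partition the items across all available cores.
        Parallel.ForEach(workItems, item => ProcessSingleItem(item));

        // Task parallelism: run independent pieces of work concurrently.
        Task download = Task.Factory.StartNew(() => { /* fetch input blobs */ });
        Task compute = Task.Factory.StartNew(() => { /* crunch numbers */ });
        Task.WaitAll(download, compute);
    }

    private static void ProcessSingleItem(string item)
    {
        // Application-specific work goes here.
    }
}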

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs


Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app's profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• The service choice can make a big cost difference depending on your app's profile (worked example below)
• Caching and compressing help a lot with storage costs
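A worked illustration (the per-transaction rate here is the 2011-era price of about $0.01 per 10,000 storage transactions; treat it as illustrative): an app issuing 100 million table transactions a month pays roughly 100,000,000 ÷ 10,000 × $0.01 = $100 in transaction charges alone, while the same chatter against a flat-fee SQL Azure database costs nothing extra. A read-mostly app with heavy caching flips the comparison.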

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size (see the sketch below)
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.)
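A minimal sketch of trading compute for storage and bandwidth, written as an extra method on the BlobHelper class shown earlier; it assumes the 2011-era StorageClient blob API plus System.IO, System.IO.Compression, and System.Text:

// Compress a string and store it as a blob whose Content-Encoding is gzip,
// so browsers fetching it can decompress on the fly.
public void WriteCompressedTextToBlob(string blobName, string text)
{
    if (client == null || container == null) InitContainer();
    byte[] raw = Encoding.UTF8.GetBytes(text);
    using (MemoryStream output = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        CloudBlob blob = container.GetBlobReference(blobName);
        blob.Properties.ContentType = "text/plain";
        blob.Properties.ContentEncoding = "gzip";
        blob.UploadByteArray(output.ToArray());
    }
}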

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large numbers of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences; query the partitions in parallel; merge the results together when done (a split sketch follows the reference below)
• Follows the generally suggested application model: Web Role + Queue + Worker
• With special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
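A minimal sketch of the split step of that pattern (a sketch only – names such as SplitAndUploadFasta and dispatchQueue are hypothetical, not AzureBLAST's actual code):

// Split the input sequences into partitions, upload each partition as a
// blob, and enqueue one task message per partition for the workers.
public void SubmitBlastJob(string inputFastaPath, int sequencesPerPartition)
{
    List<string> partitionBlobs =
        SplitAndUploadFasta(inputFastaPath, sequencesPerPartition);
    foreach (string blobName in partitionBlobs)
        dispatchQueue.AddMessage(new CloudQueueMessage(blobName));
    // A merging task combines the per-partition results when all finish.
}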

A simple Split/Join pattern.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)

Best practice: use test runs to profile, and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task – essentially an estimate of the task run time:
• Too small: repeated computation
• Too large: unnecessarily long waiting in case of an instance failure

Best practice:
• Estimate the value based on the number of pair-bases in the partition and on test runs
• Watch out for the 2-hour maximum limitation (a dequeue sketch follows below)
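A minimal sketch of the matching worker side, assuming the 2011-era StorageClient queue API (RunBlastPartition is hypothetical):

CloudQueueClient queueClient =
    new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
CloudQueue taskQueue = queueClient.GetQueueReference("blasttasks");
// Visibility timeout = estimated task run time (from test runs, capped at 2 hours)
CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(20));
if (msg != null)
{
    RunBlastPartition(msg.AsString);  // run NCBI-BLAST on the named partition
    taskQueue.DeleteMessage(msg);     // delete only after the work succeeded
}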

(Diagram: a splitting task fans out into many parallel BLAST tasks, whose outputs a merging task joins.)

Task size vs. performance:
• Benefit from the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources

(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, scaling engine, and database-updating role; a global dispatch queue feeds the worker instances; an Azure Table holds the job registry, and Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc.)


An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the job registry table – fault tolerance: avoid in-memory state. (A registry-entity sketch follows the diagram note below.)

(Diagram: the job portal's web portal and web service feed job registrations to the job scheduler, which works with the job registry and the scaling engine.)
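A minimal sketch of such a registry entity, reusing the TableServiceEntity pattern from the earlier code slides (the fields are illustrative, not AzureBLAST's actual schema):

public class JobEntity : TableServiceEntity
{
    public string Status { get; set; }        // e.g. "Queued", "Running", "Done"
    public DateTime SubmittedAt { get; set; }
    public int PartitionCount { get; set; }

    public JobEntity() { }                    // required by the table serializer

    public JobEntity(string owner, string jobId)
    {
        PartitionKey = owner;   // one user's jobs share a partition
        RowKey = jobId;
    }
}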

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW).

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs – discover the interrelationships of known protein sequences.

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know – experiments at this scale are usually infeasible for most scientists.

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually


• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group (~6 nodes per group, ~30 mins) – this is an update domain at work
• 35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes.
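As a worked example, the equation translates directly into code. A minimal sketch using the symbols defined above (purely illustrative; the real pipeline computes this per pixel over reprojected tiles):

// Penman-Monteith: ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
public static double ComputeET(
    double delta,    // Δ: d(saturation specific humidity)/dT (Pa/K)
    double Rn,       // net radiation (W/m2)
    double rhoA,     // dry air density (kg/m3)
    double cp,       // specific heat capacity of air (J/(kg·K))
    double deltaQ,   // vapor pressure deficit (Pa)
    double gA,       // conductivity of air (m/s)
    double gS,       // conductivity of plant stoma (m/s)
    double lambdaV,  // latent heat of vaporization (J/g)
    double gamma)    // psychrometric constant, ~66 Pa/K
{
    return (delta * Rn + rhoA * cp * deltaQ * gA)
         / ((delta + gamma * (1.0 + gA / gS)) * lambdaV);
}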

Inputs:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

Scale note: 20 US years = 1 global year.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: the AzureMODIS Service Web Role Portal takes requests onto a request queue; a download queue drives the data collection stage against the source imagery download sites and source metadata; reprojection, reduction 1, and reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages; scientists download the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure Service is the Web Role front door: it receives all user requests and queues each one to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks – recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue.)

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue; GenericWorker (Worker Role) instances consume the queue, reading from <Input>Data Storage.)

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times (a dequeue-count sketch follows below)
• Maintains all task status
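A minimal sketch of that retry policy, assuming the 2011-era StorageClient API (MarkTaskFailed and RunSandboxedExecutable are hypothetical helpers):

CloudQueue taskQueue = queueClient.GetQueueReference("reprojectiontasks");
CloudQueueMessage task = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
if (task != null)
{
    if (task.DequeueCount > 3)
    {
        taskQueue.DeleteMessage(task);    // poison task: stop retrying
        MarkTaskFailed(task.AsString);    // persist the failure to the status table
    }
    else
    {
        RunSandboxedExecutable(task.AsString);
        taskQueue.DeleteMessage(task);    // delete only on success
    }
}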

(Diagram: a reprojection request flows through the job queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue; GenericWorker (Worker Role) instances read swath source data storage and write reprojection data storage.)

The tables involved:
• ReprojectionJobStatus – each entity specifies a single reprojection job request
• ReprojectionTaskStatus – each entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta – query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList – query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Costs per stage (same pipeline as above, fronted by the AzureMODIS Service Web Role Portal):

• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 CPU hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5–7 GB, 55K files, 1,800 CPU hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1,800 CPU hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) RemoveMessage

Msg 2 Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETE httpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Case study: discovering homologs, i.e., the interrelationships of known protein sequences.

"All against All" query: the database is also the input query.
• The protein database is large (42 GB): 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation, based on sample runs on one extra-large Azure instance:
• would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs we know of; experiments at this scale are usually infeasible for most scientists.

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) in four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When loads become imbalanced, redistribute the load manually


• Total size of the output result: ~230 GB
• Total number of hits: 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should look like the following; otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe data center: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes: this is an update domain at work.
35 nodes experienced blob-writing failures at the same time.

West Europe data center: 30,976 tasks were completed before the job was killed.
A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ  = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J g^-1)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s^-1)
γ  = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• lots of inputs, big data reduction
• some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)    (Penman-Monteith, 1964)

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

Processing 20 US years = 1 global year.

Data collection (map) stage:
• downloads requested input tiles from NASA FTP sites
• includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• converts source tile(s) to intermediate-result sinusoidal tiles
• simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• the first stage visible to the scientist
• computes ET in our initial use

Analysis reduction stage:
• an optional second stage visible to the scientist
• enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Figure: the MODISAzure pipeline. User requests enter through the AzureMODIS Service Web Role Portal and flow through the Request, Reprojection, Reduction 1, Reduction 2, and Download queues, driving the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages. Source imagery comes from NASA download sites, source metadata sits alongside the queues, and scientists download the resulting science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue.
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks, i.e., recoverable units of work.
• The execution status of all jobs and tasks is persisted in Tables.

(Figure: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists a <PipelineStage>JobStatus entry and enqueues to the <PipelineStage>Job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus entries and dispatches work to the <PipelineStage>Task queue.)

All work is actually done by a Worker Role, which:
• sandboxes the science or other executable
• marshals all storage from/to Azure blob storage to/from local Azure Worker instance files
• dequeues tasks created by the Service Monitor
• retries failed tasks 3 times
• maintains all task status
(Figure: GenericWorker (Worker Role) instances pull from the <PipelineStage>Task queue and read <Input>Data storage; a sketch of the worker loop follows.)
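A sketch of such a worker loop under the bullets above. The method names (MarkTaskFailed, DownloadInputsFromBlob, RunScienceExecutable, UploadOutputsToBlob, PersistTaskStatus), the 30-minute visibility timeout, and pollInterval are illustrative assumptions; only the dequeue/retry/delete shape mirrors the described design.

while (true)
{
    CloudQueueMessage task = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
    if (task == null) { Thread.Sleep(pollInterval); continue; }

    if (task.DequeueCount > 3)
    {
        MarkTaskFailed(task);            // exceeded three retries: record and drop
        taskQueue.DeleteMessage(task);
        continue;
    }
    DownloadInputsFromBlob(task);        // marshal blobs to local instance files
    RunScienceExecutable(task);          // sandboxed external process
    UploadOutputsToBlob(task);           // marshal results back to blob storage
    PersistTaskStatus(task, "Done");     // status goes to an Azure table
    taskQueue.DeleteMessage(task);
}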

Example: a reprojection request flows through the Service Monitor, which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches via the Job and Task queues to GenericWorker instances.
• ReprojectionJobStatus: each entity specifies a single reprojection job request.
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile).
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile.
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile.
Task entities point into the Reprojection Data storage and the Swath Source Data storage.

• Computational costs are driven by the data scale and the need to run reductions multiple times.
• Storage costs are driven by the data scale and the 6-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

Costs by stage (AzureMODIS Service Web Role Portal; total ≈ $1,420):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3,500 compute hours, 20-100 workers; $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 compute hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1,800 compute hours, 20-100 workers; $216 CPU, $2 download, $9 storage

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault-tolerance and scalability abstractions.
• Clouds can act as an amplifier for familiar client tools and on-premise compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll doubles the polling interval, and a successful poll resets the interval back to 1 (see the sketch below).
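A minimal sketch of this policy with the StorageClient queue API. queue, running, and ProcessMessage are assumed to exist; the 1-second floor and 64-second cap are illustrative choices.

// Truncated exponential back-off polling against a CloudQueue.
TimeSpan minInterval = TimeSpan.FromSeconds(1);    // reset point after a hit
TimeSpan maxInterval = TimeSpan.FromSeconds(64);   // truncation point
TimeSpan interval = minInterval;
while (running)
{
    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        ProcessMessage(msg);
        queue.DeleteMessage(msg);
        interval = minInterval;                    // successful poll: reset to 1 s
    }
    else
    {
        Thread.Sleep(interval);                    // empty poll: wait, then double
        interval = TimeSpan.FromSeconds(
            Math.Min(interval.TotalSeconds * 2, maxInterval.TotalSeconds));
    }
}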


Removing Poison Messages (1 of 3)

Producers (P1, P2) enqueue messages; consumers (C1, C2) dequeue them with a 30-second visibility timeout:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
(Original diagram of the queue state omitted.)

Removing Poison Messages (2 of 3)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
(Original diagram of the queue state omitted.)

Removing Poison Messages (3 of 3)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1): the poison message is removed
(Original diagram of the queue state omitted.)
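The walkthrough above boils down to a few lines of consumer code. A sketch: the threshold of 3 and the LogPoisonMessage/ProcessMessage helpers are illustrative; CloudQueueMessage.DequeueCount is the real StorageClient property.

// Poison-message removal via DequeueCount.
const int MaxDequeueCount = 3;
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > MaxDequeueCount)
    {
        LogPoisonMessage(msg);        // failed repeatedly: record it, then drop it
        queue.DeleteMessage(msg);
    }
    else
    {
        ProcessMessage(msg);          // must be idempotent: it may run more than once
        queue.DeleteMessage(msg);
    }
}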

Queues Recap
• No need to deal with failures directly: make message processing idempotent.
• Invisible messages result in out-of-order delivery: do not rely on message order.
• Use the dequeue count to remove poison messages: enforce a threshold on each message's dequeue count.
• Messages larger than 8 KB: use a blob to store the message data, with a reference in the message, and garbage-collect orphaned blobs; batch messages where possible (see the sketch below).
• Dynamically increase/reduce workers: use the message count to scale.
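For the over-8 KB case, a sketch of the blob-reference pattern, assuming a container and queue initialized as in the code slides below; largePayload and the cleanup policy are illustrative.

// Producer: store the payload in a blob and enqueue only a reference to it.
string blobName = Guid.NewGuid().ToString();
container.GetBlobReference(blobName).UploadText(largePayload);
queue.AddMessage(new CloudQueueMessage(blobName));      // small reference message

// Consumer: resolve the reference, then clean up the blob.
// (Orphaned blobs still need periodic garbage collection.)
CloudQueueMessage msg = queue.GetMessage();
CloudBlob payloadBlob = container.GetBlobReference(msg.AsString);
string payload = payloadBlob.DownloadText();
queue.DeleteMessage(msg);
payloadBlob.Delete();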

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

A Quick Exercise

…then let's look at some code and some tools.

Code – AccountInformation.cs

public class AccountInformation
{
    private static string storageKey = "tHiSiSnOtMyKeY";
    private static string accountName = "jjstore";
    private static StorageCredentialsAccountAndKey credentials;

    internal static StorageCredentialsAccountAndKey Credentials
    {
        get
        {
            if (credentials == null)
                credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
            return credentials;
        }
    }
}

Code – BlobHelper.cs

public class BlobHelper
{
    private static string defaultContainerName = "school";
    private CloudBlobClient client = null;
    private CloudBlobContainer container = null;

    private void InitContainer()
    {
        if (client == null)
        {
            client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }
}

Code – BlobHelper.cs

public void WriteFileToBlob(string filePath)
{
    if (client == null || container == null)
        InitContainer();
    FileInfo file = new FileInfo(filePath);
    CloudBlob blob = container.GetBlobReference(file.Name);
    blob.Properties.ContentType = GetContentType(file.Extension);
    blob.UploadFile(file.FullName);
    // Or, if you want to write a string, replace the last line with:
    //     blob.UploadText(someString);
    // and make sure you set the content type to the appropriate MIME type
    // (e.g. "text/plain").
}

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available.
        return null;
    }
}

Application Code – Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
        buff.AppendLine(attendee.ToCsvString());
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

Code – TableEntities.cs

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

Code – TableEntities.cs

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception.
    }
}

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];
    // Delete from the cloud table.
    Context.DeleteObject(attendee);
    Context.SaveChanges();
    // Delete from the memory cache.
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to the table.
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs.
• The fundamental choice: fewer, larger VMs vs. many smaller instances.
• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores.
• More instances may provide better uptime and reliability (more failures are needed to take your service down).
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance need not mean one specific task for your code
• You're paying for the entire VM, so why not use it?
• A common mistake: splitting code into multiple roles, each not using up its CPU.
• Balance using up the CPU against keeping free capacity for times of need.
• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores.
• Use multithreading aggressively.
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads.
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below).
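A small sketch of both TPL styles in .NET 4; partitions, ProcessPartition, DownloadInputs, and CleanUpScratchSpace are illustrative names.

using System.Threading.Tasks;

// Data parallelism: process independent items across all cores.
Parallel.For(0, partitions.Count, i =>
{
    ProcessPartition(partitions[i]);
});

// Task parallelism: run distinct pieces of work concurrently.
Task download = Task.Factory.StartNew(() => DownloadInputs());
Task cleanup  = Task.Factory.StartNew(() => CleanUpScratchSpace());
Task.WaitAll(download, cleanup);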

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive.
• Find pieces of code that are intensive in different resources to live together.
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled).
• Spinning VMs up and down automatically is good at large scale.
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running.
• Being too aggressive in spinning down VMs can result in a poor user experience.
• It is a performance/cost trade-off: the risk of failure or a poor user experience from not having excess capacity, versus the cost of idling VMs.

Storage Costs
• Understand your application's storage profile and how storage billing works.
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction.
• The service choice can make a big cost difference depending on your app profile (a rough example follows).
• Caching and compressing: they help a lot with storage costs.
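An illustrative comparison (made-up volumes; check current pricing before deciding): at roughly $0.01 per 10,000 table transactions, an app issuing 30 million transactions a month pays about $30 in transaction fees before storage, while a small flat-fee SQL Azure database at roughly $10/month would be cheaper for that profile; the conclusion flips for a low-transaction, large-data profile.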

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile.
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places.
• Sending fewer things also means your VM has time to do other tasks.
• All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content
1. Gzip all output content.
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.
2. Trade off compute costs for storage size.
3. Minimize image sizes.
   • Use Portable Network Graphics (PNGs).
   • Crush your PNGs.
   • Strip needless metadata.
   • Make all PNGs palette PNGs.

(Figure: uncompressed content passes through Gzip and JavaScript/CSS/image minification to become compressed content.)
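A minimal sketch of gzipping a payload before serving or storing it, using the standard .NET GZipStream; in ASP.NET you would also set the Content-Encoding: gzip response header so browsers decompress on the fly.

using System.IO;
using System.IO.Compression;

public static byte[] GzipBytes(byte[] raw)
{
    using (MemoryStream output = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length);   // compress the whole buffer
        }                                     // disposing flushes the gzip footer
        return output.ToArray();              // valid even after the stream closes
    }
}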

Best Practices Summary
• Doing 'less' is the key to saving costs.
• Measure everything.
• Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides.

Map Reduce on Azure
• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web: a Hadoop implementation. Hadoop has a long history, has been hardened for stability, and was originally designed for cluster systems.
• Microsoft Research is announcing this week a project code-named Daytona for MapReduce jobs on Azure: designed from the start to use cloud primitives, with built-in fault tolerance and a REST-based interface for writing your own clients.

Project Daytona – Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

44

2 1

1 1

C1

C2

Removing Poison Messages

1 1

2 1

3 4 0

Producers Consumers

P2

P1

3 0

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

1 1

2 1

1 0

2 0

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

45

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 1

2 1

2 GetMessage(Q 30 s) msg 2 3 C2 consumed msg 2 4 DeleteMessage(Q msg 2) 7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 1 5 C1 crashed

1 1

2 1

6 msg1 visible 30 s after Dequeue 3 0

1 2

1 1

1 2

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

• No need to deal with failures: make message processing idempotent.
• Invisible messages result in out-of-order delivery: do not rely on message order.
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages.
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages where possible; garbage-collect orphaned blobs (see the sketch below).
• Dynamically increase/reduce workers: use the message count to scale.
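A minimal sketch of the blob-reference pattern from the recap, assuming an already initialized CloudBlobContainer and CloudQueue (the method names here are illustrative, not from the deck):

    // Producer side: payloads over the 8 KB message limit go into a blob,
    // and the queue message carries only the blob name.
    void EnqueueLargePayload(CloudBlobContainer container, CloudQueue queue, string largePayload)
    {
        string blobName = Guid.NewGuid().ToString();
        container.GetBlobReference(blobName).UploadText(largePayload);
        queue.AddMessage(new CloudQueueMessage(blobName));
    }

    // Consumer side: resolve the reference, process, then garbage-collect the
    // blob before deleting the message.
    void DequeueLargePayload(CloudBlobContainer container, CloudQueue queue)
    {
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
        if (msg == null)
            return;
        CloudBlob referenced = container.GetBlobReference(msg.AsString);
        string payload = referenced.DownloadText();
        // ... process payload idempotently ...
        referenced.DeleteIfExists(); // garbage-collect so the blob is not orphaned
        queue.DeleteMessage(msg);
    }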

Windows Azure Storage Takeaways

Blobs, Drives, Tables, Queues

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

49

A Quick Exercise

…then let's look at some code and some tools.

50

Code – AccountInformation.cs

    using Microsoft.WindowsAzure;

    public class AccountInformation
    {
        // Placeholder credentials from the talk; substitute your own account.
        private static string storageKey = "tHiSiSnOtMyKeY";
        private static string accountName = "jjstore";
        private static StorageCredentialsAccountAndKey credentials;

        internal static StorageCredentialsAccountAndKey Credentials
        {
            get
            {
                // Lazily create the credentials object on first use.
                if (credentials == null)
                    credentials = new StorageCredentialsAccountAndKey(accountName, storageKey);
                return credentials;
            }
        }
    }

51

Code – BlobHelper.cs

    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    public class BlobHelper
    {
        private static string defaultContainerName = "school";
        private CloudBlobClient client = null;
        private CloudBlobContainer container = null;

        private void InitContainer()
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();

            // Make all blobs in the container publicly readable.
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }

52

Code – BlobHelper.cs

    public void WriteFileToBlob(string filePath)
    {
        if (client == null || container == null)
            InitContainer();
        FileInfo file = new FileInfo(filePath);
        CloudBlob blob = container.GetBlobReference(file.Name);
        // GetContentType maps an extension to a MIME type (defined elsewhere in the class).
        blob.Properties.ContentType = GetContentType(file.Extension);
        blob.UploadFile(file.FullName);
    }

Or, if you want to write a string, replace the last line with

    blob.UploadText(someString);

and make sure you set the content type to the appropriate MIME type (e.g. "text/plain").

53

Code – BlobHelper.cs

    public string GetBlobText(string blobName)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        try
        {
            return blob.DownloadText();
        }
        catch (Exception)
        {
            // The blob probably does not exist, or there is no connection available.
            return null;
        }
    }

54

Application Code – Blobs

    private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
    {
        StringBuilder buff = new StringBuilder();
        buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage," +
                        "FavoriteIceCream,YearsInPhD,Graduated");
        foreach (AttendeeEntity attendee in attendees)
            buff.AppendLine(attendee.ToCsvString());
        blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
    }

The blob is now available at http://<AccountName>.blob.core.windows.net/<ContainerName>/<BlobName>, or in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt
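The click handler calls blobHelper.WriteStringToBlob, which the deck never shows. A plausible implementation in the same style as WriteFileToBlob above (the method body and MIME choice are assumptions):

    public void WriteStringToBlob(string blobName, string content)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        blob.Properties.ContentType = "text/plain"; // assumed MIME type for the CSV-style text
        blob.UploadText(content);
    }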

55

Code – TableEntities

    using Microsoft.WindowsAzure.StorageClient;

    public class AttendeeEntity : TableServiceEntity
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Email { get; set; }
        public DateTime Birthday { get; set; }
        public string FavoriteIceCream { get; set; }
        public int YearsInPhD { get; set; }
        public bool Graduated { get; set; }
        …
    }

56

Code – TableEntities

    public void UpdateFrom(AttendeeEntity other)
    {
        FirstName = other.FirstName;
        LastName = other.LastName;
        Email = other.Email;
        Birthday = other.Birthday;
        FavoriteIceCream = other.FavoriteIceCream;
        YearsInPhD = other.YearsInPhD;
        Graduated = other.Graduated;
        UpdateKeys();
    }

    public void UpdateKeys()
    {
        // All attendees share one partition; the e-mail address is the unique row key.
        PartitionKey = "SummerSchool";
        RowKey = Email;
    }

57

Code – TableHelper.cs

    public class TableHelper
    {
        private CloudTableClient client = null;
        private TableServiceContext context = null;
        private Dictionary<string, AttendeeEntity> allAttendees = null;
        private string tableName = "Attendees";

        private CloudTableClient Client
        {
            get
            {
                if (client == null)
                    client = new CloudStorageAccount(AccountInformation.Credentials, false)
                                 .CreateCloudTableClient();
                return client;
            }
        }

        private TableServiceContext Context
        {
            get
            {
                if (context == null)
                    context = Client.GetDataServiceContext();
                return context;
            }
        }
    }

58

Code – TableHelper.cs

    private void ReadAllAttendees()
    {
        allAttendees = new Dictionary<string, AttendeeEntity>();
        CloudTableQuery<AttendeeEntity> query =
            Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
        try
        {
            foreach (AttendeeEntity attendee in query)
                allAttendees[attendee.Email] = attendee;
        }
        catch (Exception)
        {
            // No entries in the table, or some other exception.
        }
    }

59

Code – TableHelper.cs

    public void DeleteAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (!allAttendees.ContainsKey(email))
            return;
        AttendeeEntity attendee = allAttendees[email];

        // Delete from the cloud table.
        Context.DeleteObject(attendee);
        Context.SaveChanges();

        // Delete from the memory cache.
        allAttendees.Remove(email);
    }

60

Code – TableHelper.cs

    public AttendeeEntity GetAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (allAttendees.ContainsKey(email))
            return allAttendees[email];
        return null;
    }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.
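If the table does not fit in memory, one alternative (a sketch, not from the deck) is a server-side point query on PartitionKey + RowKey, which reads at most one entity instead of the whole table:

    public AttendeeEntity GetAttendeeDirect(string email)
    {
        var query = (from a in Context.CreateQuery<AttendeeEntity>(tableName)
                     where a.PartitionKey == "SummerSchool" && a.RowKey == email
                     select a).AsTableServiceQuery();
        foreach (AttendeeEntity attendee in query)
            return attendee; // a full-key match returns at most one entity
        return null;
    }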

61

Pseudo Code – TableHelper.cs

    public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
    {
        foreach (AttendeeEntity attendee in updatedAttendees)
            UpdateAttendee(attendee, false);
        Context.SaveChanges(SaveChangesOptions.Batch);
    }

    public void UpdateAttendee(AttendeeEntity attendee)
    {
        UpdateAttendee(attendee, true);
    }

    private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
    {
        if (allAttendees.ContainsKey(attendee.Email))
        {
            // Update the tracked copy so the context sees the changes.
            AttendeeEntity existingAttendee = allAttendees[attendee.Email];
            existingAttendee.UpdateFrom(attendee);
            Context.UpdateObject(existingAttendee);
        }
        else
        {
            Context.AddObject(tableName, attendee);
        }
        if (saveChanges)
            Context.SaveChanges();
    }

62

Application Code – Cloud Tables

    private void SaveButton_Click(object sender, RoutedEventArgs e)
    {
        // Write to the table.
        tableHelper.UpdateAttendees(attendees);
    }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

63

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs.

• The fundamental choice: fewer, larger VMs vs. many smaller instances.

• If you scale better than linearly across cores, larger VMs could save you money; it is pretty rare to see linear scaling across 8 cores.

• More instances may provide better uptime and reliability (more failures are needed to take your service down).

• The only real right answer: experiment with multiple sizes and instance counts, measure, and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code

• You're paying for the entire VM, so why not use it?

• A common mistake: splitting code into multiple roles, each not using up its CPU.

• Balance using up the CPU against keeping free capacity for times of need.

• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores.

• Use multithreading aggressively:
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads.
  • In .NET 4, use the Task Parallel Library for data parallelism and task parallelism (see the sketch below).
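For illustration, a minimal sketch of the two TPL styles just mentioned (.NET 4; Reproject, DownloadSourceData, and DeleteExpiredBlobs are hypothetical worker methods):

    using System.Collections.Generic;
    using System.Threading.Tasks;

    void ProcessTiles(IEnumerable<string> inputTiles)
    {
        // Data parallelism: the same operation over a collection, spread across cores.
        Parallel.ForEach(inputTiles, tile => Reproject(tile));

        // Task parallelism: independent operations running concurrently.
        Task download = Task.Factory.StartNew(() => DownloadSourceData());
        Task cleanup = Task.Factory.StartNew(() => DeleteExpiredBlobs());
        Task.WaitAll(download, cleanup);
    }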

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network-I/O-intensive, or storage-I/O-intensive.

• Find pieces of code that are intensive in different resources to live together.

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage-I/O-intensive code.

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled).

• Spinning VMs up and down automatically is good at large scale.

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running.

• Being too aggressive in spinning down VMs can result in a poor user experience.

• It is a performance-vs.-cost trade-off: the risk of failure or a poor user experience from having no excess capacity, against the cost of idling VMs (see the sketch below).
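Scaling decisions usually start from a backlog measurement. A sketch of the measuring half, assuming minWorkers, maxWorkers, and tasksPerWorker are tuning constants you choose; actually changing the instance count goes through the Service Management API and is beyond this sketch:

    // Returns the instance count suggested by the current queue backlog.
    int DesiredWorkerCount(CloudQueue queue, int minWorkers, int maxWorkers, int tasksPerWorker)
    {
        int backlog = queue.RetrieveApproximateMessageCount();
        return Math.Max(minWorkers, Math.Min(maxWorkers, backlog / tasksPerWorker));
    }

Compare the result with the current instance count and, only if they differ persistently, issue a configuration change; remember that new VMs take minutes to come up.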

Storage Costs

• Understand your application's storage profile and how storage billing works.

• Make service choices based on your app's profile: e.g., SQL Azure has a flat fee while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference.

• Caching and compressing help a lot with storage costs.

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's bill. Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places; it also means your VM has time to do other tasks. All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (see the sketch below):
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.

2. Trade off compute costs for storage size.

3. Minimize image sizes:
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip to become compressed content; also minify JavaScript, CSS, and images.]
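For point 1, a common ASP.NET idiom (a sketch, not shown in the deck) is to wrap the response stream in a GZipStream whenever the client advertises gzip support:

    using System.IO.Compression;
    using System.Web;

    void CompressResponse(HttpContext ctx)
    {
        string acceptEncoding = ctx.Request.Headers["Accept-Encoding"];
        if (acceptEncoding != null && acceptEncoding.Contains("gzip"))
        {
            // Everything written to the response from here on is gzipped on the fly.
            ctx.Response.Filter = new GZipStream(ctx.Response.Filter, CompressionMode.Compress);
            ctx.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }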

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web:
  • A Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure:
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A single BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences; query the partitions in parallel; merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations, among them batch job management and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple split/join pattern.

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Too-large partitions: load imbalance
• Too-small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait before retry if the instance fails
• Best practice: estimate the value from the number of pair-bases in the partition and from test runs (see the sketch below)
• Watch out for the 2-hour maximum limitation
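A sketch of that estimate; the per-megabase rate and the 50% safety margin are assumptions you would calibrate in your own test runs:

    TimeSpan EstimateVisibilityTimeout(long pairBases, double secondsPerMegabase)
    {
        // Scale the measured rate to this partition, add a safety margin,
        // and clamp to the queue's 2-hour maximum visibility timeout.
        double seconds = (pairBases / 1e6) * secondsPerMegabase * 1.5;
        return TimeSpan.FromSeconds(Math.Min(seconds, TimeSpan.FromHours(2).TotalSeconds));
    }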

[Diagram: a Splitting task fans the input out to many BLAST tasks that run in parallel; a Merging task joins their results.]

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilize the resource

[Diagram: AzureBLAST architecture. A Web Role hosts the web portal and web service for job registration. A Job Management Role runs the job scheduler, the scaling engine, and the job registry (an Azure Table). Worker instances take work from a global dispatch queue. Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc., and a database-updating role keeps the NCBI databases current.]


The job portal is an ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization is based on Live ID
• An accepted job is stored in the job registry table; for fault tolerance, avoid in-memory state

[Diagram: the web portal and web service hand job registrations to the job scheduler and scaling engine through the job registry.]

Case study: R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW).

BLASTed ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• Experiments at this scale are usually infeasible for most scientists

Execution:
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• The 10 million sequences were divided into multiple segments, each submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load became imbalanced, it was redistributed manually

[Chart: extra-large instances per deployment; the eight deployments ran roughly 50-62 instances each.]

• Total size of the output result: ~230 GB
• Total number of hits: 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should look like:

    3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
    3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
    3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
    3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

    3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
    3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
    3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
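Finding the failures in logs like these amounts to matching "Executing" lines against "is done" lines. A sketch; the regular expressions assume exactly the format shown above:

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    // Returns the "Executing" lines that never got a matching "is done" line.
    static IEnumerable<string> FindUnfinishedTasks(string logPath)
    {
        var unfinished = new Dictionary<string, string>(); // task id -> log line
        foreach (string line in File.ReadLines(logPath))
        {
            Match started = Regex.Match(line, @"Executing the task (\d+)");
            if (started.Success) { unfinished[started.Groups[1].Value] = line; continue; }
            Match done = Regex.Match(line, @"Execution of task (\d+) is done");
            if (done.Success) unfinished.Remove(done.Groups[1].Value);
        }
        return unfinished.Values;
    }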

North Europe data center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain rolling through (~30 mins, ~6 nodes in one group)
• 35 nodes experienced blob-writing failures at the same time

West Europe data center: 30,976 tasks were completed, and then the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Penman-Monteith (1964):

    ET = \frac{\Delta R_n + \rho_a c_p \, \delta q \, g_a}{(\Delta + \gamma (1 + g_a / g_s)) \, \lambda_v}

where
• ET = water volume evapotranspired (m^3 s^-1 m^-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
• λ_v = latent heat of vaporization (J g^-1)
• R_n = net radiation (W m^-2)
• c_p = specific heat capacity of air (J kg^-1 K^-1)
• ρ_a = dry air density (kg m^-3)
• δq = vapor pressure deficit (Pa)
• g_a = conductivity of air (inverse of r_a) (m s^-1)
• g_s = conductivity of plant stomata (inverse of r_s) (m s^-1)
• γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere: by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.
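As a worked transcription of the formula above (a sketch; the arguments are in the SI units listed, and the symbol-to-name mapping is the only assumption):

    // ET in m^3 s^-1 m^-2, per the Penman-Monteith equation above.
    static double ComputeET(double delta, double rn, double rhoA, double cp,
                            double deltaQ, double ga, double gs,
                            double gamma, double lambdaV)
    {
        return (delta * rn + rhoA * cp * deltaQ * ga)
             / ((delta + gamma * (1.0 + ga / gs)) * lambdaV);
    }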

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

Scale: 20 US years = 1 global year.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS pipeline. The AzureMODIS Service Web Role portal receives requests from scientists. A Request Queue feeds the Data Collection stage, which pulls imagery from the source imagery download sites and records source metadata; a Reprojection Queue feeds the Reprojection stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction stages; a Download Queue delivers the scientific results back to the scientists.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks, which are recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses the job, persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> Task Queue.]

• All work is actually done by a GenericWorker (Worker Role):
  • Sandboxes the science or other executable
  • Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances (Worker Roles) dequeue the tasks and stage <Input>Data Storage locally.]

[Diagram: a Reprojection Request flows into the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker instances (Worker Roles) consume the tasks, reading Swath Source Data Storage and writing Reprojection Data Storage, guided by the ScanTimeList and SwathGranuleMeta tables.]

• Each entity in the job status table specifies a single reprojection job request.
• Each entity in the task status table specifies a single reprojection task (i.e., a single tile).
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile.
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile.

• Computational costs are driven by the data scale and the need to run the reductions multiple times.

• Storage costs are driven by the data scale and the 6-month project duration.

• Both are small with respect to the people costs, even at graduate-student rates.

[Diagram: the same AzureMODIS pipeline as above, annotated with per-stage data volumes and costs.]

    Data Collection stage:       400-500 GB, 60K files; 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
    Reprojection stage:          400 GB, 45K files; 3500 hours, 20-100 workers; $420 CPU, $60 download
    Derivation Reduction stage:  5-7 GB, 55K files; 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
    Analysis Reduction stage:    <10 GB, ~1K files; 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

    Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems.

• Equally important, they can increase participation in research by providing needed resources to users and communities without ready access.

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.

• Clouds provide valuable fault-tolerance and scalability abstractions.

• Clouds act as an amplifier for familiar client tools and on-premises compute.

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

46

C1

C2

Removing Poison Messages

3 4 0

Producers Consumers

P2

P1

1 2

2 Dequeue(Q 30 sec) msg 2 3 C2 consumed msg 2 4 Delete(Q msg 2) 7 Dequeue(Q 30 sec) msg 1 8 C2 crashed

1 Dequeue(Q 30 sec) msg 1 5 C1 crashed 10 C1 restarted 11 Dequeue(Q 30 sec) msg 1 12 DequeueCount gt 2 13 Delete (Q msg1) 1

2

6 msg1 visible 30s after Dequeue 9 msg1 visible 30s after Dequeue

3 0

1 3

1 2

1 3

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Queues Recap

bullNo need to deal with failures Make message

processing idempotent

bullInvisible messages result in out of order Do not rely on order

bullEnforce threshold on messagersquos dequeue count Use Dequeue count to remove

poison messages

bullMessages gt 8KB

bullBatch messages

bullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to store message data with

reference in message

Use message count to scale

bullNo need to deal with failures

bullInvisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage Takeaways

Blobs

Drives

Tables

Queues

httpblogsmsdncomwindowsazurestorage

httpazurescopecloudappnet

49

A Quick Exercise

hellipThen letrsquos look at some code and some tools

50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx


Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics

• Identifies similarity between bio-sequences

Computationally intensive

• Large number of pairwise alignment operations

• A single BLAST run can take 700–1,000 CPU hours

• Sequence databases are growing exponentially

• GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

• Segment the input

• Segment processing (querying) is pleasingly parallel

• Segment the database (e.g., mpiBLAST)

• This needs special result-reduction processing

Large data volumes

• A normal BLAST database can be as large as 10 GB

• With 100 nodes, peak storage demand could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern
• Split the input sequences

• Query partitions in parallel

• Merge results together when done

• Follows the generally suggested application model: Web Role + Queue + Worker

• With three special considerations:
• Batch job management

• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
• the "-a" argument of NCBI-BLAST

• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance

• Small partitions: unnecessary overheads
• NCBI-BLAST startup overhead

• Data-transfer overhead

Best practice: use test runs to profile, then set the partition size to mitigate the overhead (a splitting sketch follows)
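A minimal sketch of query segmentation at a fixed partition size; the types and the partition size of 100 (the sweet spot reported later in these slides) are illustrative:

using System.Collections.Generic;
using System.Linq;

class QuerySegmentation
{
    // Split input sequences into fixed-size partitions for parallel BLAST tasks.
    static List<List<string>> Partition(List<string> sequences, int partitionSize)
    {
        List<List<string>> partitions = new List<List<string>>();
        for (int i = 0; i < sequences.Count; i += partitionSize)
        {
            partitions.Add(sequences.Skip(i).Take(partitionSize).ToList());
        }
        return partitions;
    }
}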

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time

• Too small: repeated computation

• Too large: an unnecessarily long wait before retry if the instance fails

Best practice:
• Estimate the value from the number of base pairs in the partition and from test runs

• Watch out for the 2-hour maximum limit (a queue sketch follows)
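A hedged sketch of the dequeue pattern using the 2011 StorageClient API; the queue name, the timeout estimate, and the reuse of the deck's AccountInformation helper are assumptions:

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlastTaskLoop
{
    static void RunOneTask()
    {
        CloudQueueClient queueClient =
            new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
        CloudQueue queue = queueClient.GetQueueReference("blasttasks");  // illustrative name

        // Hide the message for the estimated run time; if this worker dies
        // before DeleteMessage, the task reappears for another worker.
        TimeSpan estimatedRunTime = TimeSpan.FromMinutes(20);  // from test runs
        CloudQueueMessage message = queue.GetMessage(estimatedRunTime);
        if (message != null)
        {
            // ... run the BLAST partition here ...
            queue.DeleteMessage(message);  // only after the task really finished
        }
    }
}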

[Diagram: a splitting task fans out to BLAST tasks running in parallel, followed by a merging task]

Task size vs. performance

• Benefits from the warm-cache effect

• 100 sequences per partition is the best choice

Instance size vs. performance

• Super-linear speedup with larger worker instances

• Primarily due to memory capacity

Task size/instance size vs. cost

• Extra-large instances generated the best and most economical throughput

• Fully utilizing the resources

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, with the Job Registry kept in an Azure Table; tasks flow through a global dispatch queue to Worker instances; a database-updating Role refreshes the NCBI databases; Azure Blob storage holds the BLAST databases, temporary data, etc.]

[Diagram: splitting task, parallel BLAST tasks, merging task, as above]

ASP.NET program hosted by a web role instance
• Submit jobs

• Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: Job Portal components: Web Portal, Web Service, job registration, Job Scheduler, Scaling Engine, Job Registry]

R. palustris as a platform for H2 production: Eric Schadt (Sage), Sam Phattarasukol (Harwood Lab, UW)

BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min

• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST saved significant computing time…

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query

• The protein database is large (4.2 GB)
• 9,865,668 sequences in total to be queried

• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance

• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• Experiments at this scale are usually infeasible for most scientists

• Allocated ~4,000 cores in total
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe

• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service

• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution

• Each segment consists of smaller partitions

• When load imbalances appear, redistribute the load manually

[Diagram: the 8 deployments, of roughly 50–62 extra-large instances each]

• Total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, the real working instance time should be 6–8 days

• Look into the log data to analyze what took place…


A normal log record should look like this; otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

(Note: task 251774 starts at 8:22 but never reports completion.)
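A hedged sketch of the kind of scan used to flag tasks that started but never finished; the record format is taken from the lines above, and the file path is illustrative:

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class LogAudit
{
    // Collect task IDs with an "Executing" record but no matching "is done" record.
    static IEnumerable<string> UnfinishedTasks(string logPath)
    {
        HashSet<string> started = new HashSet<string>();
        foreach (string line in File.ReadLines(logPath))
        {
            Match begin = Regex.Match(line, @"Executing the task (\d+)");
            if (begin.Success) { started.Add(begin.Groups[1].Value); continue; }
            Match end = Regex.Match(line, @"Execution of task (\d+) is done");
            if (end.Success) started.Remove(end.Groups[1].Value);
        }
        return started;
    }
}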

North Europe datacenter: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes: this is an update domain at work

35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks completed, then the job was killed

A reasonable guess: the fault domain mechanism was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky

• Lots of inputs: big data reduction

• Some of the inputs are not so simple

$ET = \dfrac{\Delta R_n + \rho_a c_p \,\delta q\, g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v}$

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) from plants
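A direct transcription of the reconstructed formula into code, assuming inputs in the units listed above; a sketch for clarity, not MODISAzure's actual implementation:

class PenmanMonteith
{
    // ET from the Penman-Monteith (1964) formula; parameter units as defined above.
    static double Evapotranspiration(
        double delta,     // Δ (Pa/K)
        double rn,        // Rn (W/m²)
        double rhoA,      // ρa (kg/m³)
        double cp,        // cp (J/(kg·K))
        double dq,        // δq (Pa)
        double ga,        // ga (m/s)
        double gs,        // gs (m/s)
        double lambdaV,   // λv (J/g)
        double gamma)     // γ ≈ 66 Pa/K
    {
        return (delta * rn + rhoA * cp * dq * ga)
             / ((delta + gamma * (1.0 + ga / gs)) * lambdaV);
    }
}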

Inputs:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Architecture diagram: the AzureMODIS Service Web Role Portal feeds a request queue; the data collection stage pulls source imagery from download sites via a download queue; reprojection, Reduction 1, and Reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages; source metadata is kept alongside; scientists download the scientific results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks: recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue]

All work is actually done by a Worker Role

• Sandboxes the science (or other) executable

• Marshals all storage between Azure blob storage and local files on the Azure Worker instance (a simplified loop is sketched below)
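A heavily simplified sketch of the GenericWorker pattern just described: dequeue a task, stage the blob input to local disk, run the sandboxed executable, upload the result. All names (taskQueue, inputContainer, outputContainer, scienceExe, localScratch) are assumed to be initialized elsewhere and are illustrative, not MODISAzure source:

using System;
using System.Diagnostics;
using System.IO;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

class GenericWorkerSketch
{
    static void WorkLoop(CloudQueue taskQueue, CloudBlobContainer inputContainer,
                         CloudBlobContainer outputContainer, string scienceExe, string localScratch)
    {
        while (true)
        {
            CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
            if (msg == null) { Thread.Sleep(1000); continue; }

            string blobName = msg.AsString;                      // task = input blob name
            string localIn = Path.Combine(localScratch, blobName);
            inputContainer.GetBlobReference(blobName).DownloadToFile(localIn);

            string localOut = localIn + ".out";
            Process.Start(scienceExe, localIn + " " + localOut).WaitForExit();

            outputContainer.GetBlobReference(blobName + ".out").UploadFile(localOut);
            taskQueue.DeleteMessage(msg);                        // only after success
        }
    }
}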

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus; GenericWorker (Worker Role) instances are dispatched from the <PipelineStage>Task Queue and read from <Input>Data Storage]

• Dequeues tasks created by the Service Monitor

• Retries failed tasks 3 times

• Maintains all task status

[Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue, then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue consumed by GenericWorker instances; tasks point into Swath Source Data Storage and Reprojection Data Storage]

• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reduction multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate-student rates

[Cost breakdown by pipeline stage, AzureMODIS Service Web Role Portal:]

Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage

Reprojection stage: 400 GB, 45K files, 3,500 CPU hours, 20–100 workers; $420 CPU, $60 download

Derivation reduction stage: 5–7 GB, 55K files, 1,800 CPU hours, 20–100 workers; $216 CPU, $1 download, $6 storage

Analysis reduction stage: <10 GB, ~1K files, 1,800 CPU hours, 20–100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420
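As a quick sanity check on the CPU figures, assuming the then-current rate of roughly $0.12 per instance-hour (an assumption; the rate is not stated on the slide):

3,500 hours × $0.12/hour = $420 and 1,800 hours × $0.12/hour = $216, matching the reprojection and reduction CPU costs above.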

• Clouds are the largest-scale computing centers ever constructed, and they have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services that support research provide considerable leverage for both individual researchers and entire communities of researchers





50

Code ndash AccountInformationcs public class AccountInformation private static string storageKey = ldquotHiSiSnOtMyKeY private static string accountName = jjstore private static StorageCredentialsAccountAndKey credentials internal static StorageCredentialsAccountAndKey Credentials get if (credentials == null) credentials = new StorageCredentialsAccountAndKey(accountName storageKey) return credentials

51

Code – BlobHelper.cs

    public class BlobHelper
    {
        private static string defaultContainerName = "school";
        private CloudBlobClient client = null;
        private CloudBlobContainer container = null;

        private void InitContainer()
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudBlobClient();
            container = client.GetContainerReference(defaultContainerName);
            container.CreateIfNotExist();
            // Make the container's blobs publicly readable.
            BlobContainerPermissions permissions = container.GetPermissions();
            permissions.PublicAccess = BlobContainerPublicAccessType.Container;
            container.SetPermissions(permissions);
        }
    }

52

Code – BlobHelper.cs

    public void WriteFileToBlob(string filePath)
    {
        if (client == null || container == null)
            InitContainer();
        FileInfo file = new FileInfo(filePath);
        CloudBlob blob = container.GetBlobReference(file.Name);
        blob.Properties.ContentType = GetContentType(file.Extension);
        blob.UploadFile(file.FullName);
        // Or, if you want to write a string, replace the last line with:
        //     blob.UploadText(someString);
        // and make sure you set the content type to the appropriate MIME type
        // (e.g. "text/plain").
    }
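The application code later calls a WriteStringToBlob helper that the slides only describe in the comment above; a minimal sketch of what such a helper might look like, following the same pattern (the method body is an assumption, not slide source):

    public void WriteStringToBlob(string blobName, string content)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        blob.Properties.ContentType = "text/plain";  // set the appropriate MIME type
        blob.UploadText(content);                    // UploadText instead of UploadFile
    }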

53

Code – BlobHelper.cs

    public string GetBlobText(string blobName)
    {
        if (client == null || container == null)
            InitContainer();
        CloudBlob blob = container.GetBlobReference(blobName);
        try
        {
            return blob.DownloadText();
        }
        catch (Exception)
        {
            // The blob probably does not exist, or there is no connection available.
            return null;
        }
    }

54

Application Code – Blobs

    private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
    {
        StringBuilder buff = new StringBuilder();
        buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
        foreach (AttendeeEntity attendee in attendees)
            buff.AppendLine(attendee.ToCsvString());
        blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
    }

The blob is now available at http://&lt;AccountName&gt;.blob.core.windows.net/&lt;ContainerName&gt;/&lt;BlobName&gt;. Or, in this case: http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

55

Code – TableEntities

    using Microsoft.WindowsAzure.StorageClient;

    public class AttendeeEntity : TableServiceEntity
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Email { get; set; }
        public DateTime Birthday { get; set; }
        public string FavoriteIceCream { get; set; }
        public int YearsInPhD { get; set; }
        public bool Graduated { get; set; }
        …
    }

56

Code – TableEntities

    public void UpdateFrom(AttendeeEntity other)
    {
        FirstName = other.FirstName;
        LastName = other.LastName;
        Email = other.Email;
        Birthday = other.Birthday;
        FavoriteIceCream = other.FavoriteIceCream;
        YearsInPhD = other.YearsInPhD;
        Graduated = other.Graduated;
        UpdateKeys();
    }

    public void UpdateKeys()
    {
        PartitionKey = "SummerSchool";
        RowKey = Email;   // the email address serves as the unique row key
    }

57

Code – TableHelper.cs

    public class TableHelper
    {
        private CloudTableClient client = null;
        private TableServiceContext context = null;
        private Dictionary<string, AttendeeEntity> allAttendees = null;
        private string tableName = "Attendees";

        private CloudTableClient Client
        {
            get
            {
                if (client == null)
                    client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
                return client;
            }
        }

        private TableServiceContext Context
        {
            get
            {
                if (context == null)
                    context = Client.GetDataServiceContext();
                return context;
            }
        }
    }

58

Code – TableHelper.cs

    private void ReadAllAttendees()
    {
        allAttendees = new Dictionary<string, AttendeeEntity>();
        CloudTableQuery<AttendeeEntity> query =
            Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
        try
        {
            foreach (AttendeeEntity attendee in query)
                allAttendees[attendee.Email] = attendee;
        }
        catch (Exception)
        {
            // No entries in table - or other exception.
        }
    }

59

Code – TableHelper.cs

    public void DeleteAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (!allAttendees.ContainsKey(email))
            return;
        AttendeeEntity attendee = allAttendees[email];
        // Delete from the cloud table.
        Context.DeleteObject(attendee);
        Context.SaveChanges();
        // Delete from the memory cache.
        allAttendees.Remove(email);
    }

60

Code – TableHelper.cs

    public AttendeeEntity GetAttendee(string email)
    {
        if (allAttendees == null)
            ReadAllAttendees();
        if (allAttendees.ContainsKey(email))
            return allAttendees[email];
        return null;
    }

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

61

Pseudo Code – TableHelper.cs

    public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
    {
        foreach (AttendeeEntity attendee in updatedAttendees)
            UpdateAttendee(attendee, false);
        // Send all pending inserts/updates to the table service in one batch.
        Context.SaveChanges(SaveChangesOptions.Batch);
    }

    public void UpdateAttendee(AttendeeEntity attendee)
    {
        UpdateAttendee(attendee, true);
    }

    private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
    {
        if (allAttendees.ContainsKey(attendee.Email))
        {
            AttendeeEntity existingAttendee = allAttendees[attendee.Email];
            existingAttendee.UpdateFrom(attendee);
            Context.UpdateObject(existingAttendee);
        }
        else
        {
            Context.AddObject(tableName, attendee);
        }
        if (saveChanges)
            Context.SaveChanges();
    }

62

Application Code – Cloud Tables

    private void SaveButton_Click(object sender, RoutedEventArgs e)
    {
        // Write to table.
        tableHelper.UpdateAttendees(attendees);
    }

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.
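For example, assuming the account and table names used above, a GET request to http://jjstore.table.core.windows.net/Attendees() would list the table's entities. Unlike the public blob container earlier, Table storage requests must be signed with the account key.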

63

Tools – Fiddler2 (a web debugging proxy, handy for inspecting the REST calls the storage client issues)

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts, measure, and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
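A minimal sketch of those two TPL styles, with illustrative stand-in workloads (not from the slides):

    using System;
    using System.Threading.Tasks;

    class ConcurrencyDemo
    {
        static void Main()
        {
            // Data parallelism: one operation applied across a range,
            // scheduled by the TPL over all available cores.
            double[] results = new double[1000];
            Parallel.For(0, results.Length, i =>
            {
                results[i] = Math.Sqrt(i);  // stand-in for real per-item work
            });

            // Task parallelism: independent operations run concurrently.
            Task<int> count = Task.Factory.StartNew(() => results.Length);
            Task other = Task.Factory.StartNew(() => Console.WriteLine("doing other work"));
            Task.WaitAll(count, other);
            Console.WriteLine("Processed " + count.Result + " items");
        }
    }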

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
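As a rough worked example at 2011-era rates (illustrative numbers, roughly $0.01 per 10,000 table transactions): an app issuing 100 million table transactions a month pays about $100 in transaction fees alone, while a small SQL Azure database is a flat ~$10 a month. A small but very hot dataset can therefore be cheaper in SQL Azure, while large, rarely-touched data favors tables.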

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places; sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (a sketch follows below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs against storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip to become compressed content; also minify JavaScript, CSS, and images.]
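A minimal sketch of point 1 using System.IO.Compression, reusing the BlobHelper pattern above (the gzip wiring is illustrative, not from the slides):

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static byte[] GzipBytes(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();  // valid after the streams are closed
        }
    }

    // Upload with blob.UploadByteArray(GzipBytes(someString)) and set
    // blob.Properties.ContentEncoding = "gzip" so browsers decompress on the fly.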

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for Map Reduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

76

Project Daytona – Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large data volumes:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, staging a database copy to every node means peak storage traffic can reach ~1 TB
• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern (a sketch of the split step follows below)
  • Split the input sequences
  • Query the partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
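A minimal sketch of the split step of that pattern over Azure queues (illustrative names, not the AzureBLAST source):

    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    static void SplitAndEnqueue(int sequenceCount, int partitionSize)
    {
        CloudQueueClient queueClient =
            new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudQueueClient();
        CloudQueue taskQueue = queueClient.GetQueueReference("blast-tasks");
        taskQueue.CreateIfNotExist();

        // One message per input partition; each worker downloads its partition
        // from blob storage, runs NCBI-BLAST over it, and writes results back.
        for (int p = 0; p * partitionSize < sequenceCount; p++)
            taskQueue.AddMessage(new CloudQueueMessage("partition-" + p));
        // A merging task joins the per-partition results when all report done.
    }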

A simple Split/Join pattern

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST
• set to 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST start-up overhead, data-transfer overhead)
• Best practice: profile with test runs, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait before another worker picks the task up after an instance failure
• Best practice: estimate the value from the number of pair-bases in the partition plus test runs (see the sketch below)
• Watch out for the 2-hour maximum limitation
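In code, the timeout is supplied when a worker dequeues a task; a minimal sketch, reusing the taskQueue from the splitter sketch above (RunBlastOnPartition is a hypothetical worker routine):

    // Hide the message from other workers for 90 minutes - sized to the
    // expected task run time and kept under the 2-hour maximum.
    CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(90));
    if (msg != null)
    {
        RunBlastOnPartition(msg.AsString);  // hypothetical worker routine
        taskQueue.DeleteMessage(msg);       // delete only after success; otherwise it reappears
    }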

[Diagram: a splitting task fans the input out to many parallel BLAST tasks, followed by a merging task that joins the results.]

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, dispatching work through a global dispatch queue to worker instances; an Azure Table holds the job registry; Azure Blob storage holds the NCBI/BLAST databases and temporary data; a database-updating role keeps the BLAST databases current.]


The job portal (Web Portal / Web Service):
• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization is based on Live ID
• An accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory states

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
  • 9,865,668 sequences to be queried in total
  • Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know:
• Experiments at this scale are usually infeasible for most scientists
• Allocated ~4,000 cores in total: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
  • When load imbalances appear, redistribute the load manually

[Diagram: per-deployment instance counts of roughly 50–62 extra-large VMs across the 8 deployments.]

• The total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record looks like this:

    3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
    3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
    3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
    3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

    3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
    3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
    3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

North Europe Data Center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain being rebooted (~6 nodes per group, out for ~30 mins)
• 35 nodes experienced blob-writing failures at the same time

West Europe Data Center: 30,976 tasks were completed, and the job was killed.
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stomata (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

The Penman-Monteith (1964) equation:

    ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
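A direct transcription of the formula above into code (illustrative; unit handling is left to the caller):

    static double ComputeET(double delta, double Rn, double rhoA, double cp,
                            double dq, double gA, double gS, double lambdaV,
                            double gamma = 66.0)  // psychrometric constant, Pa/K
    {
        double energyTerm = delta * Rn + rhoA * cp * dq * gA;
        return energyTerm / ((delta + gamma * (1.0 + gA / gS)) * lambdaV);
    }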

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years of data ≈ 1 global year.

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role portal; the request queue feeds the data collection stage, which pulls from the source imagery download sites and source metadata; the reprojection queue feeds the reprojection stage; the Reduction 1 and Reduction 2 queues feed the derivation and analysis reduction stages; the download queue delivers scientific results back to the scientists.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a &lt;PipelineStage&gt; request arrives at the MODISAzure Service (Web Role), which persists &lt;PipelineStage&gt;JobStatus and enqueues the job to the &lt;PipelineStage&gt; job queue; the Service Monitor (Worker Role) parses and persists &lt;PipelineStage&gt;TaskStatus and dispatches work to the &lt;PipelineStage&gt; task queue.]

All work is actually done by a GenericWorker (Worker Role):
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage, to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor from the &lt;PipelineStage&gt; task queue
• Retries failed tasks 3 times
• Maintains all task status

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Per-stage scale and cost:

    Stage                  Data                               Compute                       Cost
    Data collection        400-500 GB, 60K files, 10 MB/sec   11 hours, <10 workers         $50 upload + $450 storage
    Reprojection           400 GB, 45K files                  3,500 hours, 20-100 workers   $420 CPU + $60 download
    Derivation reduction   5-7 GB, 55K files                  1,800 hours, 20-100 workers   $216 CPU + $1 download + $6 storage
    Analysis reduction     <10 GB, ~1K files                  1,800 hours, 20-100 workers   $216 CPU + $2 download + $9 storage

    Total: $1,420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

51

Code ndash BlobHelpercs public class BlobHelper private static string defaultContainerName = school private CloudBlobClient client = null private CloudBlobContainer container = null private void InitContainer() if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudBlobClient() container = clientGetContainerReference(defaultContainerName) containerCreateIfNotExist() BlobContainerPermissions permissions = containerGetPermissions() permissionsPublicAccess = BlobContainerPublicAccessTypeContainer containerSetPermissions(permissions)

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

52

Code ndash BlobHelpercs

public void WriteFileToBlob(string filePath) if (client == null || container == null) InitContainer() FileInfo file = new FileInfo(filePath) CloudBlob blob = containerGetBlobReference(fileName) blobPropertiesContentType = GetContentType(fileExtension) blobUploadFile(fileFullName) Or if you want to write a string replace the last line with blobUploadText(someString) And make sure you set the content type to the appropriate MIME type (eg ldquotextplainrdquo)

53

Code ndash BlobHelpercs

public string GetBlobText(string blobName) if (client == null || container == null) InitContainer() CloudBlob blob = containerGetBlobReference(blobName) try return blobDownloadText() catch (Exception) The blob probably does not exist or there is no connection available return null

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs we know of; experiments at this scale are usually infeasible for most scientists:
• Allocated ~4,000 cores in total: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments:
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually

[Figure: extra-large instances per deployment: roughly six deployments of 62 and two of 50]

• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
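The anomaly above (task 251774 starts but never logs a completion) is exactly what a simple log scan can surface; a sketch, assuming one log file in the format shown:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class LogScanner
{
    static void FindUnfinishedTasks(string logPath)
    {
        var started = new HashSet<string>();
        var startRe = new Regex(@"Executing the task (\d+)");
        var doneRe = new Regex(@"Execution of task (\d+) is done");

        foreach (string line in File.ReadLines(logPath))
        {
            Match m;
            if ((m = startRe.Match(line)).Success) started.Add(m.Groups[1].Value);
            else if ((m = doneRe.Match(line)).Success) started.Remove(m.Groups[1].Value);
        }
        // Whatever remains was dispatched but never completed
        foreach (string task in started)
            Console.WriteLine("Task {0} was dispatched but never completed", task);
    }
}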

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, roughly 30 minutes apart; this is an update domain at work.

35 nodes experienced a blob-writing failure at the same time.

West Europe datacenter: 30,976 tasks were completed, and the job was killed.

A reasonable guess: the fault domain is at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

$ET = \dfrac{\Delta R_n + \rho_a c_p \, \delta q \, g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v}$

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) from plants.

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: the AzureMODIS Service Web Role Portal receives requests into a Request Queue; Download, Reprojection, Reduction 1, and Reduction 2 Queues drive the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; source imagery is pulled from download sites, source metadata is recorded, and scientists download the science results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks (recoverable units of work)
• Execution status of all jobs and tasks is persisted in Tables (a sketch of one such entity follows)
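The deck does not show the schema of those status tables; a plausible sketch of a task-status entity, with all field names as assumptions:

using System;
using Microsoft.WindowsAzure.StorageClient;

// Hypothetical <PipelineStage>TaskStatus entity; the fields are illustrative guesses
public class ReprojectionTaskStatus : TableServiceEntity
{
    public string JobId { get; set; }        // parent job this task belongs to
    public string TileId { get; set; }       // the single tile this task covers
    public string State { get; set; }        // e.g., Queued / Running / Done / Failed
    public int RetryCount { get; set; }      // how many times the task has been retried
    public DateTime LastUpdated { get; set; }
}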

[Diagram: a <PipelineStage> Request is persisted as <PipelineStage>JobStatus by the MODISAzure Service (Web Role); the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches work onto the <PipelineStage>Task Queue]

All work is actually done by a Worker Role (the GenericWorker):
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage, to/from local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage>Task Queue dispatched by the Service Monitor (Worker Role), which parses and persists <PipelineStage>TaskStatus; workers read from <Input>Data Storage]
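A sketch of that dequeue-and-retry loop against the 2011 StorageClient API; the queue contents and the DoTask call are illustrative:

using System;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

public class GenericWorkerLoop
{
    public void Run(CloudQueue taskQueue)
    {
        while (true)
        {
            CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
            if (msg == null) { Thread.Sleep(1000); continue; }

            // DequeueCount tells us how many times this task has been handed out;
            // after 3 failed attempts, drop it rather than retry forever.
            if (msg.DequeueCount > 3)
            {
                taskQueue.DeleteMessage(msg);   // give up; task status stays marked failed
                continue;
            }
            try
            {
                DoTask(msg.AsString);           // run the sandboxed executable for this task
                taskQueue.DeleteMessage(msg);   // success: remove so it is not redelivered
            }
            catch (Exception)
            {
                // Do nothing: the message reappears after the visibility timeout (a retry)
            }
        }
    }

    void DoTask(string taskDescription) { /* illustrative */ }
}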

Example: a reprojection request.

[Diagram: a Reprojection Request is persisted as ReprojectionJobStatus by the Service Monitor (Worker Role), parsed and persisted as ReprojectionTaskStatus, and dispatched through the Job Queue and Task Queue to GenericWorker (Worker Role) instances, which read Swath Source Data Storage and write Reprojection Data Storage]

Supporting tables:
• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures:
• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 CPU hours, 20–100 workers; $420 CPU, $60 download
• Derivation reduction stage: 5–7 GB, 55K files, 1,800 CPU hours, 20–100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1,800 CPU hours, 20–100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


53

Code – BlobHelper.cs

public string GetBlobText(string blobName)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    try
    {
        return blob.DownloadText();
    }
    catch (Exception)
    {
        // The blob probably does not exist, or there is no connection available
        return null;
    }
}
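The next slide calls blobHelper.WriteStringToBlob, which the deck never shows; a minimal companion sketch, assuming the same client/container fields as GetBlobText:

public void WriteStringToBlob(string blobName, string text)
{
    if (client == null || container == null)
        InitContainer();
    CloudBlob blob = container.GetBlobReference(blobName);
    blob.UploadText(text);   // uploads the string, overwriting any existing blob of that name
}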

54

Application Code - Blobs

private void SaveToCloudButton_Click(object sender, RoutedEventArgs e)
{
    StringBuilder buff = new StringBuilder();
    buff.AppendLine("LastName,FirstName,Email,Birthday,NativeLanguage,FavoriteIceCream,YearsInPhD,Graduated");
    foreach (AttendeeEntity attendee in attendees)
    {
        buff.AppendLine(attendee.ToCsvString());
    }
    blobHelper.WriteStringToBlob("SummerSchoolAttendees.txt", buff.ToString());
}

The blob is now available at http://&lt;AccountName&gt;.blob.core.windows.net/&lt;ContainerName&gt;/&lt;BlobName&gt;; in this case, http://jjstore.blob.core.windows.net/school/SummerSchoolAttendees.txt

55

Code - TableEntities

using Microsoft.WindowsAzure.StorageClient;

public class AttendeeEntity : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // ...
}

56

Code - TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    FirstName = other.FirstName;
    LastName = other.LastName;
    Email = other.Email;
    Birthday = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD = other.YearsInPhD;
    Graduated = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    PartitionKey = "SummerSchool";
    RowKey = Email;
}

57

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false).CreateCloudTableClient();
            return client;
        }
    }

    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }
}

58

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception
    }
}

59

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (!allAttendees.ContainsKey(email))
        return;
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache
    allAttendees.Remove(email);
}

60

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null)
        ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

61

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges)
        Context.SaveChanges();
}

62

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

63

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code

You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, none of which uses up its CPU
• Balance using up the CPU against keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
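A minimal sketch of both TPL styles in .NET 4; the partition count and the work methods are illustrative:

using System;
using System.Linq;
using System.Threading.Tasks;

class ConcurrencyDemo
{
    static void Main()
    {
        int[] partitions = Enumerable.Range(0, 100).ToArray();

        // Data parallelism: the same operation applied over many items across cores
        Parallel.ForEach(partitions, p => ProcessPartition(p));

        // Task parallelism: independent operations running concurrently
        Task download = Task.Factory.StartNew(() => Console.WriteLine("fetching input"));
        Task cleanup = Task.Factory.StartNew(() => Console.WriteLine("cleaning temp files"));
        Task.WaitAll(download, cleanup);
    }

    static void ProcessPartition(int p)
    {
        // CPU-bound work for one partition would go here
    }
}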

Finding Good Code Neighbors

• Typically code is intensive in one or more of these resources: memory, CPU, network I/O, storage I/O
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage-I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app's profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference depending on your app's profile (see the illustration below)
• Caching and compressing: they help a lot with storage costs
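As a back-of-the-envelope illustration of that choice, with 2011-era list prices assumed (roughly $0.01 per 10,000 table transactions and ~$10/month for a 1 GB SQL Azure Web database; assumptions, not quotes):

using System;

class StorageCostSketch
{
    static void Main()
    {
        double transactionsPerMonth = 50e6;                       // hypothetical workload
        double tableCost = (transactionsPerMonth / 10000) * 0.01; // ≈ $50/month on Azure Tables
        double sqlAzureCost = 9.99;                               // assumed flat fee, 1 GB database
        Console.WriteLine(tableCost > sqlAzureCost
            ? "This profile is cheaper on SQL Azure"
            : "This profile is cheaper on Azure Tables");
    }
}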

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places. Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (see the sketch below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: Uncompressed Content, passed through Gzip / Minify JavaScript / Minify CSS / Minify Images, becomes Compressed Content]
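A common way to gzip all output from an ASP.NET web role is a response filter registered in Global.asax; a sketch of that standard pattern:

using System;
using System.IO.Compression;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        HttpContext ctx = HttpContext.Current;
        string acceptEncoding = ctx.Request.Headers["Accept-Encoding"];
        if (!string.IsNullOrEmpty(acceptEncoding) && acceptEncoding.Contains("gzip"))
        {
            // Wrap the output stream so everything written is compressed on the way out
            ctx.Response.Filter = new GZipStream(ctx.Response.Filter, CompressionMode.Compress);
            ctx.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }
}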

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web:
  • A Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure:
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx

77

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

54

Application Code - Blobs private void SaveToCloudButton_Click(object sender RoutedEventArgs e) StringBuilder buff = new StringBuilder() buffAppendLine(LastNameFirstNameEmailBirthdayNativeLanguageFavoriteIceCreamYearsInPhDGraduated) foreach (AttendeeEntity attendee in attendees) buffAppendLine(attendeeToCsvString()) blobHelperWriteStringToBlob(SummerSchoolAttendeestxt buffToString())

The blob is now available at httpltAccountNamegtblobcorewindowsnetltContainerNamegtltBlobNamegt Or in this case httpjjstoreblobcorewindowsnetschoolSummerSchoolAttendeestxt

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

55

Code - TableEntities using MicrosoftWindowsAzureStorageClient public class AttendeeEntity TableServiceEntity public string FirstName get set public string LastName get set public string Email get set public DateTime Birthday get set public string FavoriteIceCream get set public int YearsInPhD get set public bool Graduated get set hellip

56

Code - TableEntities public void UpdateFrom(AttendeeEntity other) FirstName = otherFirstName LastName = otherLastName Email = otherEmail Birthday = otherBirthday FavoriteIceCream = otherFavoriteIceCream YearsInPhD = otherYearsInPhD Graduated = otherGraduated UpdateKeys() public void UpdateKeys() PartitionKey = SummerSchool RowKey = Email

57

Code ndash TableHelpercs public class TableHelper private CloudTableClient client = null private TableServiceContext context = null private DictionaryltstringAttendeeEntitygt allAttendees = null private string tableName = Attendees private CloudTableClient Client get if (client == null) client = new CloudStorageAccount(AccountInformationCredentials false)CreateCloudTableClient() return client private TableServiceContext Context get if (context == null) context = ClientGetDataServiceContext() return context

58

Code ndash TableHelpercs private void ReadAllAttendees() allAttendees = new Dictionaryltstring AttendeeEntitygt() CloudTableQueryltAttendeeEntitygt query = ContextCreateQueryltAttendeeEntitygt(tableName)AsTableServiceQuery() try foreach (AttendeeEntity attendee in query) allAttendees[attendeeEmail] = attendee catch (Exception) No entries in table - or other exception

59

Code ndash TableHelpercs public void DeleteAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return AttendeeEntity attendee = allAttendees[email] Delete from the cloud table ContextDeleteObject(attendee) ContextSaveChanges() Delete from the memory cache allAttendeesRemove(email)

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

• Segment the input

• Segment processing (querying) is pleasingly parallel

• Segment the database (e.g., mpiBLAST)

• Needs special result-reduction processing

Large volume of data

• A normal BLAST database can be as large as 10 GB

• With 100 nodes, peak storage bandwidth could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the generally suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
  • The "-a" argument of NCBI-BLAST
  • 1, 2, 4, and 8 for small, medium, large, and extra-large instance sizes

Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)

Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task: essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: an unnecessarily long wait when an instance fails

Best practice:
  • Estimate the value based on the number of pair-bases in the partition and on test runs
  • Watch out for the 2-hour maximum limitation
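A hedged sketch of how a worker might apply this, using the 2011-era StorageClient queue API; EstimateRunTime and RunBlastTask are hypothetical helpers, and the queue name is illustrative:

CloudQueue taskQueue = queueClient.GetQueueReference("blasttasks");
TimeSpan visibilityTimeout = EstimateRunTime(partition);   // from pair-bases and test runs; keep under 2 hours
CloudQueueMessage msg = taskQueue.GetMessage(visibilityTimeout);
if (msg != null)
{
    RunBlastTask(msg.AsString);     // run NCBI-BLAST on the named partition
    taskQueue.DeleteMessage(msg);   // delete only after success; otherwise the message reappears for another worker
}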

(Diagram: a splitting task fans out to BLAST tasks that run in parallel, followed by a merging task.)

Task size vs. performance

• Benefit of the warm-cache effect

• 100 sequences per partition is the best choice (a splitting sketch follows below)
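As a sketch of what "100 sequences per partition" looks like in the splitting task; SplitFasta is an illustrative helper, and FASTA sequence headers start with '>':

using System.Collections.Generic;

static IEnumerable<List<string>> SplitFasta(IEnumerable<string> lines, int seqsPerPartition)
{
    var partition = new List<string>();
    int seqCount = 0;
    foreach (string line in lines)
    {
        if (line.StartsWith(">"))                  // a new sequence begins
        {
            if (seqCount == seqsPerPartition)      // partition full: emit it and start a new one
            {
                yield return partition;
                partition = new List<string>();
                seqCount = 0;
            }
            seqCount++;
        }
        partition.Add(line);
    }
    if (partition.Count > 0) yield return partition;
}

Each emitted partition would then become one queue message, i.e., one BLAST task.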

Instance size vs. performance

• Super-linear speedup with larger worker instances

• Primarily due to memory capacity

Task size/instance size vs. cost

• Extra-large instances generated the best and most economical throughput

• Fully utilize the resources

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler, Scaling Engine, and Job Registry (an Azure Table), dispatching splitting, BLAST, and merging tasks to Worker roles through a global dispatch queue. Azure Blob storage holds the NCBI/BLAST databases and temporary data; a database-updating Role keeps the NCBI databases fresh.)

ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

(Diagram: the Web Portal / Web Service pass job registrations to the Job Scheduler, backed by the Job Portal, Scaling Engine, and Job Registry.)

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

BLASTed ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs
  • Discover the interrelationships of known protein sequences

"All against All" query
  • The database is also the input query
  • The protein database is large (4.2 GB)
  • In total, 9,865,668 sequences to be queried
  • Theoretically 100 billion sequence comparisons

Performance estimation
  • Based on sample runs on one extra-large Azure instance
  • Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs we know of
  • Experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe

• 8 deployments of AzureBLAST, each with its own co-located storage service

• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution

• Each segment consists of smaller partitions

• When the load is imbalanced, redistribute it manually

(Figure: the 8 deployments, of 50–62 extra-large instances each.)

• The total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, the real working instance time should be 6–8 days

• Look into the log data to analyze what took place…


A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
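A minimal sketch of such a log scan; "worker.log" is an illustrative path, the patterns follow the sample records above, and tasks left in the dictionary started but never finished:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class LogScan
{
    static void Main()
    {
        var started = new Dictionary<string, string>();  // task id -> "Executing" record
        foreach (string line in File.ReadLines("worker.log"))
        {
            Match m = Regex.Match(line, @"Executing the task (\d+)");
            if (m.Success) { started[m.Groups[1].Value] = line; continue; }
            m = Regex.Match(line, @"Execution of task (\d+) is done");
            if (m.Success) started.Remove(m.Groups[1].Value);
        }
        foreach (var kv in started)
            Console.WriteLine("Lost task: " + kv.Value);  // dispatched, never completed
    }
}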

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group (~6 nodes per group, ~30 mins): this is an update domain

35 nodes experienced blob-writing failures at the same time

West Europe Data Center: 30,976 tasks were completed before the job was killed

A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stomata, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky

• Lots of inputs: big data reduction

• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)    Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.
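Read directly from the formula, the per-pixel computation is tiny; a sketch assuming the units defined above:

static double Evapotranspiration(
    double delta,    // Δ: rate of change of saturation specific humidity with temperature (Pa/K)
    double rn,       // Rn: net radiation (W/m²)
    double rhoA,     // ρa: dry air density (kg/m³)
    double cp,       // cp: specific heat capacity of air (J/(kg·K))
    double dq,       // δq: vapor pressure deficit (Pa)
    double ga,       // ga: conductivity of air (m/s)
    double gs,       // gs: conductivity of plant stomata (m/s)
    double gamma,    // γ: psychrometric constant (~66 Pa/K)
    double lambdaV)  // λv: latent heat of vaporization (J/g)
{
    return (delta * rn + rhoA * cp * dq * ga)
         / ((delta + gamma * (1 + ga / gs)) * lambdaV);
}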

Input datasets:

• NASA MODIS imagery source archives: 5 TB (600K files)

• FLUXNET curated sensor dataset: 30 GB (960 files)

• FLUXNET curated field dataset: 2 KB (1 file)

• NCEP/NCAR: ~100 MB (4K files)

• Vegetative clumping: ~5 MB (1 file)

• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage

• Downloads requested input tiles from NASA FTP sites

• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage

• Converts source tile(s) to intermediate-result sinusoidal tiles

• Simple nearest-neighbor or spline algorithms

Derivation reduction stage

• First stage visible to the scientist

• Computes ET in our initial use

Analysis reduction stage

• Optional second stage visible to the scientist

• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal into a Request Queue. The Data Collection Stage pulls source imagery from the download sites and records source metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation and Analysis Reduction Stages; a Download Queue delivers scientific results back to the scientists.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, i.e., recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue.)

All work is actually done by a Worker Role

• Sandboxes the science or other executable

• Marshals all storage from/to Azure blob storage to/from local Azure Worker instance files

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue; GenericWorker (Worker Role) instances pull tasks from the queue and read from <Input>Data Storage.)

The GenericWorker:

• Dequeues tasks created by the Service Monitor

• Retries failed tasks 3 times (a dequeue-loop sketch follows below)

• Maintains all task status
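A hedged sketch of that dequeue loop, using CloudQueueMessage.DequeueCount from the StorageClient library; MarkTaskFailed and RunTask are hypothetical helpers, and the timeout is illustrative:

while (true)
{
    CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(30)); // illustrative timeout
    if (msg == null) { System.Threading.Thread.Sleep(1000); continue; }
    if (msg.DequeueCount > 3)
    {
        MarkTaskFailed(msg);           // persist failure to the TaskStatus table
        taskQueue.DeleteMessage(msg);  // poison message: stop retrying
        continue;
    }
    RunTask(msg);                      // marshal blobs locally, run the sandboxed executable
    taskQueue.DeleteMessage(msg);      // success: remove so it is not retried
}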

(Diagram: a Reprojection Request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches via the Job and Task Queues to GenericWorker (Worker Role) instances operating on Swath Source Data Storage and Reprojection Data Storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request

• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)

• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile

• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate-student rates

Per-stage scale and cost summary (from the cost-annotated pipeline figure):

• Data Collection Stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage

• Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20–100 workers; $420 CPU, $60 download

• Derivation Reduction Stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers; $216 CPU, $1 download, $6 storage

• Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


56

Code - TableEntities

public void UpdateFrom(AttendeeEntity other)
{
    // Copy every property from the other entity
    FirstName        = other.FirstName;
    LastName         = other.LastName;
    Email            = other.Email;
    Birthday         = other.Birthday;
    FavoriteIceCream = other.FavoriteIceCream;
    YearsInPhD       = other.YearsInPhD;
    Graduated        = other.Graduated;
    UpdateKeys();
}

public void UpdateKeys()
{
    // All attendees share one partition; the e-mail address is the unique row key
    PartitionKey = "SummerSchool";
    RowKey = Email;
}
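For context, a minimal sketch of the entity class these methods belong to; it assumes the 2011-era StorageClient's TableServiceEntity base class, and the property list is inferred from UpdateFrom above:

public class AttendeeEntity : Microsoft.WindowsAzure.StorageClient.TableServiceEntity
{
    // TableServiceEntity supplies PartitionKey, RowKey, and Timestamp
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
    public DateTime Birthday { get; set; }
    public string FavoriteIceCream { get; set; }
    public int YearsInPhD { get; set; }
    public bool Graduated { get; set; }
    // UpdateFrom and UpdateKeys from the slide above live here
}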

57

Code – TableHelper.cs

public class TableHelper
{
    private CloudTableClient client = null;
    private TableServiceContext context = null;
    private Dictionary<string, AttendeeEntity> allAttendees = null;
    private string tableName = "Attendees";

    // Lazily create the table client from the account credentials
    private CloudTableClient Client
    {
        get
        {
            if (client == null)
                client = new CloudStorageAccount(AccountInformation.Credentials, false)
                             .CreateCloudTableClient();
            return client;
        }
    }

    // Lazily create the data service context used for queries and updates
    private TableServiceContext Context
    {
        get
        {
            if (context == null)
                context = Client.GetDataServiceContext();
            return context;
        }
    }

58

Code – TableHelper.cs

private void ReadAllAttendees()
{
    allAttendees = new Dictionary<string, AttendeeEntity>();
    CloudTableQuery<AttendeeEntity> query =
        Context.CreateQuery<AttendeeEntity>(tableName).AsTableServiceQuery();
    try
    {
        // Cache the whole table in memory, keyed by e-mail (the row key)
        foreach (AttendeeEntity attendee in query)
            allAttendees[attendee.Email] = attendee;
    }
    catch (Exception)
    {
        // No entries in table - or other exception
    }
}

59

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (!allAttendees.ContainsKey(email)) return;   // nothing to delete
    AttendeeEntity attendee = allAttendees[email];
    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();
    // Delete from the memory cache
    allAttendees.Remove(email);
}

60

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (allAttendees.ContainsKey(email))
        return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables (a contrasting sketch follows below).
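When a table is too large to cache, a key-filtered query pushes the work to the table service instead; a hedged sketch (GetAttendeeDirect is an illustrative name, and it assumes the same Context and tableName as above plus System.Linq):

public AttendeeEntity GetAttendeeDirect(string email)
{
    var query = (from a in Context.CreateQuery<AttendeeEntity>(tableName)
                 where a.PartitionKey == "SummerSchool" && a.RowKey == email
                 select a).AsTableServiceQuery();
    return query.FirstOrDefault();   // a point query on both keys is the cheapest table lookup
}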

61

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    // Queue up all the changes, then save them in a single batch
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        // Update the tracked copy so the context sends the changes
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges) Context.SaveChanges();
}

62

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

63

Tools – Fiddler2


bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Code – TableHelper.cs

public void DeleteAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (!allAttendees.ContainsKey(email)) return;  // nothing to delete
    AttendeeEntity attendee = allAttendees[email];

    // Delete from the cloud table
    Context.DeleteObject(attendee);
    Context.SaveChanges();

    // Delete from the memory cache
    allAttendees.Remove(email);
}

Code – TableHelper.cs

public AttendeeEntity GetAttendee(string email)
{
    if (allAttendees == null) ReadAllAttendees();
    if (allAttendees.ContainsKey(email)) return allAttendees[email];
    return null;
}

Remember that this only works for tables (or queries on tables) that easily fit in memory. This is one of many design patterns for working with tables.

Pseudo Code – TableHelper.cs

public void UpdateAttendees(List<AttendeeEntity> updatedAttendees)
{
    foreach (AttendeeEntity attendee in updatedAttendees)
        UpdateAttendee(attendee, false);
    Context.SaveChanges(SaveChangesOptions.Batch);
}

public void UpdateAttendee(AttendeeEntity attendee)
{
    UpdateAttendee(attendee, true);
}

private void UpdateAttendee(AttendeeEntity attendee, bool saveChanges)
{
    if (allAttendees.ContainsKey(attendee.Email))
    {
        // Update the existing entity in place
        AttendeeEntity existingAttendee = allAttendees[attendee.Email];
        existingAttendee.UpdateFrom(attendee);
        Context.UpdateObject(existingAttendee);
    }
    else
    {
        Context.AddObject(tableName, attendee);
    }
    if (saveChanges) Context.SaveChanges();
}

Application Code – Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.

Tools – Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – larger, fewer VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code

• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU

• Balance using up CPU vs. having free capacity in times of need

• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library (a sketch follows below)

• Data parallelism

• Task parallelism
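A minimal sketch of both styles in .NET 4, assuming nothing beyond the Task Parallel Library itself; the work items (a square-root loop and two console tasks) are hypothetical stand-ins for real workload code.

// Data parallelism with Parallel.For, task parallelism with Task (.NET 4).
using System;
using System.Threading.Tasks;

class ConcurrencyDemo
{
    static void Main()
    {
        // Data parallelism: the same operation over many items, spread across cores
        double[] results = new double[1000];
        Parallel.For(0, results.Length, i =>
        {
            results[i] = Math.Sqrt(i);   // stand-in for real per-item work
        });

        // Task parallelism: different operations running concurrently
        Task render = Task.Factory.StartNew(() => Console.WriteLine("render"));
        Task upload = Task.Factory.StartNew(() => Console.WriteLine("upload"));
        Task.WaitAll(render, upload);
    }
}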

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive

• Find pieces of code that are intensive in different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile

• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction (see the illustrative comparison below)

• Service choice can make a big cost difference based on your app profile

• Caching and compressing – they help a lot with storage costs
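A rough illustration of how the billing model interacts with an app profile. The prices here are assumptions for illustration only (approximate 2011-era list prices: ~$9.99 per GB-month for SQL Azure Web edition; ~$0.15 per GB-month plus ~$0.01 per 10,000 transactions for Windows Azure Tables), and the workload numbers are hypothetical.

// Illustrative cost model only – assumed prices, hypothetical workload.
using System;

class StorageCostSketch
{
    static void Main()
    {
        double gigabytes = 1.0;
        double monthlyTransactions = 50000000;  // a chatty app profile

        double sqlAzure = 9.99 * gigabytes;     // flat per-GB fee (assumed)
        double tables = 0.15 * gigabytes        // per-GB storage (assumed)
            + (monthlyTransactions / 10000.0) * 0.01;  // per-transaction (assumed)

        Console.WriteLine("SQL Azure: ${0:F2}/month", sqlAzure);
        Console.WriteLine("Tables:    ${0:F2}/month", tables);
        // Here Tables comes to roughly $50/month vs. ~$10/month for SQL Azure:
        // the per-transaction model dominates for chatty workloads.
    }
}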

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip to become compressed content; minify JavaScript, CSS, and images]
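A minimal sketch of point 1 for an ASP.NET web role, using the standard System.IO.Compression and System.Web APIs; the wiring shown (calling it from Application_BeginRequest in Global.asax) is an assumption, not the only way to hook it up.

// Gzip the response when the client advertises support for it.
using System.IO.Compression;
using System.Web;

public static class CompressionHelper
{
    public static void TryCompress(HttpContext context)
    {
        string acceptEncoding = context.Request.Headers["Accept-Encoding"];
        if (acceptEncoding != null && acceptEncoding.Contains("gzip"))
        {
            // Wrap the output stream so all content is compressed on the way out
            context.Response.Filter =
                new GZipStream(context.Response.Filter, CompressionMode.Compress);
            context.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }
}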

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile inside and out.

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • A Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona – Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx


Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics

• Identifies similarity between bio-sequences

Computationally intensive

• Large number of pairwise alignment operations

• A BLAST run can take 700–1,000 CPU hours

• Sequence databases are growing exponentially

• GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

• Segment the input
  • Segment processing (querying) is pleasingly parallel

• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume data

• A normal BLAST database can be as large as 10 GB

• With 100 nodes, peak storage bandwidth demand could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the generally suggested application model: Web Role + Queue + Worker

• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
• The "–a" argument of NCBI-BLAST
• Set to 1, 2, 4, or 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task's run time
• Too small: repeated computation
• Too large: an unnecessarily long wait before the task is retried if an instance fails
• Best practice: estimate the value from the number of base pairs in the partition plus test runs (a sketch follows below)
• Watch out for the 2-hour maximum limitation
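A minimal sketch of how that timeout is applied at dequeue time, assuming the 2010-era Microsoft.WindowsAzure.StorageClient library; the queue name and the 25-minute estimate are hypothetical.

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlastTaskDequeue
{
    static void ProcessOne(CloudStorageAccount account)
    {
        CloudQueueClient client = account.CreateCloudQueueClient();
        CloudQueue queue = client.GetQueueReference("blasttasks");  // hypothetical name

        // Estimated from the partition's base pairs plus test runs;
        // must stay under the 2-hour maximum.
        TimeSpan estimatedRunTime = TimeSpan.FromMinutes(25);

        CloudQueueMessage message = queue.GetMessage(estimatedRunTime);
        if (message != null)
        {
            // ... run the BLAST task for this partition ...
            queue.DeleteMessage(message);  // only after successful completion
        }
    }
}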

[Diagram: a splitting task fans the input out to many BLAST tasks that run in parallel; a merging task joins their results]

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilizes the resources

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler, Scaling Engine, and database-updating role; Worker instances pull BLAST tasks from a global dispatch queue; an Azure Table holds the Job Registry, and Azure Blob storage holds the NCBI/BLAST databases and temporary data. A splitting task fans BLAST tasks out to the workers, and a merging task combines the results.]

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state


R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences in total to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know
• Experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe

• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service

• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions

• When load imbalances occur, redistribute the load manually

[Figure: instance allocation across the eight AzureBLAST deployments]

• The total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, the real working instance time should be 6–8 days

• Look into the log data to analyze what took place…


A normal log record should look like the paired entries below; otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
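One way to do that analysis is to pair up "Executing" and "done" records and flag tasks that never completed (note task 251774 above has no completion record). A minimal sketch, assuming the log-line format shown above; the file path is hypothetical.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class LogScan
{
    static void Main()
    {
        var incomplete = new HashSet<string>();
        Regex started = new Regex(@"Executing the task (\S+)");
        Regex finished = new Regex(@"Execution of task (\S+) is done");

        foreach (string line in File.ReadLines("worker.log"))  // hypothetical path
        {
            Match m = started.Match(line);
            if (m.Success) incomplete.Add(m.Groups[1].Value);
            m = finished.Match(line);
            if (m.Success) incomplete.Remove(m.Groups[1].Value);
        }

        foreach (string task in incomplete)
            Console.WriteLine("Task {0} was started but never completed", task);
    }
}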

North Europe data center: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group (~6 nodes per group, ~30 mins apart) – this is an update domain at work
• 35 nodes experienced blob-writing failures at the same time

West Europe data center: 30,976 tasks were completed before the job was killed
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stomata (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) from plants.

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Architecture diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue; the Data Collection Stage pulls source imagery from NASA download sites using Source Metadata; the Reprojection, Reduction 1, and Reduction 2 Queues drive the Reprojection, Derivation Reduction, and Analysis Reduction Stages; a Download Queue delivers scientific results to scientists]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses the job, persists <PipelineStage>TaskStatus, and dispatches work to the <PipelineStage>Task Queue]

All work is actually done by a GenericWorker (Worker Role)
• Sandboxes the science or other executable
• Marshalls all storage between Azure blob storage and local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times (a sketch follows below)
• Maintains all task status

[Diagram: GenericWorker instances pull from the <PipelineStage>Task Queue dispatched by the Service Monitor and read from <Input>Data Storage]
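A minimal sketch of that dequeue/retry loop, again assuming the 2010-era Microsoft.WindowsAzure.StorageClient library; the 3-retry policy is applied via the message's DequeueCount, and the task handler is hypothetical.

using System;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

class GenericWorkerLoop
{
    static void Run(CloudQueue taskQueue)
    {
        while (true)
        {
            CloudQueueMessage message = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
            if (message == null) { Thread.Sleep(1000); continue; }

            if (message.DequeueCount > 3)
            {
                // Already failed three times: give up on this task
                taskQueue.DeleteMessage(message);
                continue;
            }

            ExecuteTask(message.AsString);     // sandbox the science executable
            taskQueue.DeleteMessage(message);  // remove only on success
        }
    }

    static void ExecuteTask(string task)
    {
        // Marshal blobs to local files, run the executable, upload results
    }
}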

[Diagram: a Reprojection Request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches from the Job Queue to the Task Queue; GenericWorkers read Swath Source Data Storage and write Reprojection Data Storage]

Table notes:
• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by data scale and the need to run reductions multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate-student rates

Approximate costs by stage:
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

60

Code ndash TableHelpercs public AttendeeEntity GetAttendee(string email) if (allAttendees == null) ReadAllAttendees() if (allAttendeesContainsKey(email)) return allAttendees[email] return null

Remember that this only works for tables (or queries on tables) that easily fit in memory This is one of many design patterns for working with tables

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

61

Pseudo Code ndash TableHelpercs public void UpdateAttendees(ListltAttendeeEntitygt updatedAttendees) foreach (AttendeeEntity attendee in updatedAttendees) UpdateAttendee(attendee false) ContextSaveChanges(SaveChangesOptionsBatch) public void UpdateAttendee(AttendeeEntity attendee) UpdateAttendee(attendee true) private void UpdateAttendee(AttendeeEntity attendee bool saveChanges) if (allAttendeesContainsKey(attendeeEmail)) AttendeeEntity existingAttendee = allAttendees[attendeeEmail] existingAttendeeUpdateFrom(attendee) ContextUpdateObject(existingAttendee) else ContextAddObject(tableName attendee) if (saveChanges) ContextSaveChanges()

62

Application Code ndash Cloud Tables private void SaveButton_Click(object sender RoutedEventArgs e) Write to table tableHelperUpdateAttendees(attendees)

Thatrsquos it Now your tables are accessible using REST service calls or any cloud storage tool

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
  • 9,865,668 sequences in total to be queried
  • Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs we know of
• Experiments at this scale are usually infeasible for most scientists
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When loads become imbalanced, redistribute the load manually


• Total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)

• But based on our estimates, the real working instance time should be 6~8 days

• Look into the log data to analyze what took place…


A normal log record should look like the following; otherwise something is wrong (e.g. a task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

Note that task 251774 is never reported as done.

North Europe datacenter: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group: this is an Update Domain
• ~30 mins per group
• ~6 nodes in one group

35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed before the job was killed

A reasonable guess: the Fault Domain is at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J g^-1)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma to air (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

$$ ET = \frac{\Delta R_n + \rho_a c_p \,\delta q\, g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v} $$

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) in plants.
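As a tiny worked example of evaluating the formula above (every numeric input below is a made-up placeholder, not a value from the deck):

// Sketch: direct evaluation of the Penman-Monteith formula.
using System;

class PenmanMonteithSketch
{
    // Returns ET given the quantities defined above (SI units as listed).
    static double Et(double delta, double rn, double rhoA, double cp,
                     double deltaQ, double ga, double gs,
                     double gamma, double lambdaV)
    {
        return (delta * rn + rhoA * cp * deltaQ * ga)
             / ((delta + gamma * (1.0 + ga / gs)) * lambdaV);
    }

    static void Main()
    {
        // Placeholder inputs, chosen only to exercise the arithmetic.
        double et = Et(delta: 145.0, rn: 400.0, rhoA: 1.2, cp: 1005.0,
                       deltaQ: 1000.0, ga: 0.02, gs: 0.01,
                       gamma: 66.0, lambdaV: 2450.0);
        Console.WriteLine("ET = " + et);
    }
}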

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes a geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Figure: MODISAzure pipeline: scientists submit requests through the AzureMODIS Service Web Role Portal into a request queue; the data collection stage pulls source imagery from download sites (download queue, source metadata), feeding the reprojection stage (reprojection queue), the derivation reduction stage (reduction 1 queue), and the analysis reduction stage (reduction 2 queue), after which scientific results are available for download]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work

• Execution status of all jobs and tasks is persisted in Tables

[Figure: request flow for a generic <PipelineStage>: the MODISAzure Service (Web Role) persists a <PipelineStage>JobStatus entry and enqueues the request on the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses it, persists <PipelineStage>TaskStatus entries, and dispatches work to the <PipelineStage>Task Queue]

All work is actually done by a Worker Role
• Sandboxes the science or other executable
• Marshals all storage between Azure blob storage and local Azure Worker instance files

[Figure: the GenericWorker (Worker Role) pulls from the <PipelineStage>Task Queue dispatched by the Service Monitor and reads its inputs from <Input>Data Storage]

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
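A rough sketch of that dequeue-and-retry loop, again assuming the v1.x StorageClient queue API (the queue name and the way failure is recorded are illustrative):

// Sketch: GenericWorker-style loop that gives up on a task after 3 attempts.
// DequeueCount tracks how many times a message has been handed out.
using System;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class GenericWorkerSketch
{
    static void Main()
    {
        CloudQueue tasks = CloudStorageAccount.DevelopmentStorageAccount
            .CreateCloudQueueClient().GetQueueReference("reprojectiontasks");
        tasks.CreateIfNotExist();

        while (true)
        {
            CloudQueueMessage msg = tasks.GetMessage(TimeSpan.FromMinutes(30));
            if (msg == null) { Thread.Sleep(5000); continue; }

            if (msg.DequeueCount > 3)
            {
                tasks.DeleteMessage(msg);  // poison task: stop retrying
                continue;                  // (real code would persist TaskStatus = Failed)
            }
            try
            {
                // ... run the sandboxed executable against blob-staged inputs ...
                tasks.DeleteMessage(msg);  // success: remove so it is not retried
            }
            catch (Exception)
            {
                // leave the message; it reappears after the visibility timeout
            }
        }
    }
}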

[Figure: reprojection dataflow: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches via the job and task queues to GenericWorker (Worker Role) instances, which read Swath Source Data Storage and write Reprojection Data Storage]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and the need to run reductions multiple times

• Storage costs are driven by the data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate-student rates

Approximate scale and cost by pipeline stage (total: $1,420):

Stage                  Data          Files   Compute       Workers   Costs
Data collection        400-500 GB    60K     11 hours      <10       $50 upload, $450 storage
                       (10 MB/sec ingest)
Reprojection           400 GB        45K     3,500 hours   20-100    $420 cpu, $60 download
Derivation reduction   5-7 GB        55K     1,800 hours   20-100    $216 cpu, $1 download, $6 storage
Analysis reduction     <10 GB        ~1K     1,800 hours   20-100    $216 cpu, $2 download, $9 storage

• Clouds are the largest-scale computing centers ever constructed, and they have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications


Application Code: Cloud Tables

private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    // Write to table
    tableHelper.UpdateAttendees(attendees);
}

That's it! Now your tables are accessible using REST service calls or any cloud storage tool.
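For instance, the same table could then be read over the Table service REST interface; a minimal sketch (the account, table, and partition-key values are illustrative, and a real request must also carry a SharedKey Authorization header):

// Sketch: query a table over REST; without the auth header this returns 403,
// it is shown only to make the wire-level shape of the call concrete.
using System;
using System.IO;
using System.Net;

class TableRestSketch
{
    static void Main()
    {
        // Query the Attendees table, filtering server-side on PartitionKey.
        string url = "http://myaccount.table.core.windows.net/Attendees()"
                   + "?$filter=PartitionKey%20eq%20'summerschool'";
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Headers["x-ms-version"] = "2009-09-19";   // Table service API version

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());        // Atom/XML entity feed
        }
    }
}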


Tools: Fiddler2

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice: fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code

• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using up its CPU

• Balance using up the CPU vs. having free capacity in times of need

• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library for both styles (see the sketch below)
  • Data parallelism
  • Task parallelism
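A minimal sketch of the two TPL styles; the Crunch function is a made-up CPU-bound placeholder for real per-item work:

// Sketch: data parallelism vs. task parallelism with the .NET 4 Task Parallel Library.
using System;
using System.Threading.Tasks;

class TplSketch
{
    static double Crunch(int i) { return Math.Sqrt(i) * Math.Sin(i); }

    static void Main()
    {
        // Data parallelism: the same operation over every element, spread across cores.
        double[] results = new double[1000000];
        Parallel.For(0, results.Length, i => results[i] = Crunch(i));

        // Task parallelism: different operations running concurrently.
        Task<double> sum = Task.Factory.StartNew(() =>
        {
            double s = 0; foreach (double r in results) s += r; return s;
        });
        Task<double> max = Task.Factory.StartNew(() =>
        {
            double m = double.MinValue; foreach (double r in results) if (r > m) m = r; return m;
        });
        Task.WaitAll(sum, max);
        Console.WriteLine("sum={0}, max={1}", sum.Result, max.Result);
    }
}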

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive

• Find pieces of code that are intensive in different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs


Storage Costs

• Understand your application's storage profile and how storage billing works

• Make service choices based on your app's profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference depending on your app's profile

• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places. Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (a sketch follows below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Figure: uncompressed content flows through Gzip, plus JavaScript, CSS, and image minification, to become compressed content]
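A minimal sketch of point 1 for ASP.NET, wrapping the response stream in a GZipStream when the client advertises gzip support (registering this module in web.config is assumed and not shown):

// Sketch: gzip-compress ASP.NET output when the browser accepts it.
using System;
using System.IO.Compression;
using System.Web;

public class CompressionModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            HttpApplication ctx = (HttpApplication)sender;
            string accept = ctx.Request.Headers["Accept-Encoding"];
            if (!string.IsNullOrEmpty(accept) && accept.Contains("gzip"))
            {
                // Compress everything written to the response from here on.
                ctx.Response.Filter = new GZipStream(ctx.Response.Filter,
                                                     CompressionMode.Compress);
                ctx.Response.AppendHeader("Content-Encoding", "gzip");
            }
        };
    }

    public void Dispose() { }
}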

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • A Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients

Project Daytona - Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx


Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

63

Tools ndash Fiddler2

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated


Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• The fundamental choice - fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money - but it is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer - experiment with multiple sizes and instance counts to measure and find what is ideal for you (a back-of-envelope sketch follows below)
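To make the size choice concrete, here is a minimal comparison sketch. The hourly prices and the scaling-efficiency figure are assumptions for illustration (roughly 2011-era list prices), not numbers from this deck; plug in your own measurements.

```csharp
// Back-of-envelope comparison: eight small instances vs. one extra-large.
using System;

class VmSizing
{
    static void Main()
    {
        double smallPricePerHour = 0.12;      // 1 core (assumed price)
        double xlPricePerHour    = 0.96;      // 8 cores (assumed price)

        double workPerSmallPerHour = 100.0;   // measured units of work, hypothetical
        double xlScalingEfficiency = 0.80;    // XL does 8 * 0.80 = 6.4x a small's work

        double costPerUnitSmall = smallPricePerHour / workPerSmallPerHour;
        double costPerUnitXl = xlPricePerHour / (8 * xlScalingEfficiency * workPerSmallPerHour);

        Console.WriteLine("small: ${0:F5}/unit, XL: ${1:F5}/unit",
                          costPerUnitSmall, costPerUnitXl);
        // With sub-linear scaling (efficiency < 1.0) many small instances win;
        // super-linear scaling (efficiency > 1.0, e.g. cache effects) favors XL.
    }
}
```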

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency - this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively - in networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism (see the sketch below)
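A minimal sketch of the two TPL styles mentioned above; the work items are placeholders, not part of the original deck.

```csharp
// Data parallelism (Parallel.For) and task parallelism (independent Tasks)
// with the .NET 4 Task Parallel Library.
using System;
using System.Threading.Tasks;

class TplSketch
{
    static void Main()
    {
        // Data parallelism: the same operation over a range, spread across cores.
        double[] results = new double[1000];
        Parallel.For(0, results.Length, i =>
        {
            results[i] = Math.Sqrt(i);   // stand-in for real per-item work
        });

        // Task parallelism: different operations running concurrently.
        Task compress = Task.Factory.StartNew(() => Console.WriteLine("compressing..."));
        Task upload   = Task.Factory.StartNew(() => Console.WriteLine("uploading..."));
        Task.WaitAll(compress, upload);
    }
}
```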

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (a toy scaling rule is sketched below)

[Figure: balancing performance against cost]
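As a sketch of that trade-off, the toy rule below scales up eagerly and down cautiously. The IScaler interface is hypothetical - it stands in for whatever management API you drive (the Service Management API in 2011).

```csharp
// A toy autoscaling rule for a queue-backed worker pool (interface assumed).
using System;

interface IScaler
{
    int CurrentInstanceCount { get; }
    int GetQueueDepth();                 // pending work items
    void SetInstanceCount(int count);    // asynchronous; takes minutes to act
}

static class SimpleAutoscale
{
    // Target roughly 100 queued items per worker, clamped to [2, 100].
    public static void Evaluate(IScaler scaler)
    {
        int depth = scaler.GetQueueDepth();
        int target = Math.Max(2, Math.Min(100, (int)Math.Ceiling(depth / 100.0)));

        // Scale up eagerly; scale down one instance at a time, since VMs take
        // minutes to start and removing too many at once hurts users.
        if (target > scaler.CurrentInstanceCount)
            scaler.SetInstanceCount(target);
        else if (target < scaler.CurrentInstanceCount - 1)
            scaler.SetInstanceCount(scaler.CurrentInstanceCount - 1);
    }
}
```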

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile - e.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference (see the comparison sketch below)
• Caching and compressing help a lot with storage costs
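A toy comparison of the two billing models. Both prices are assumptions in the spirit of 2011-era rates (SQL Azure Web edition around $9.99/month; Tables around $0.01 per 10,000 transactions) and should be replaced with current ones.

```csharp
// Flat monthly fee vs. per-transaction billing, driven by your app profile.
using System;

class StorageBilling
{
    static void Main()
    {
        double sqlAzureFlatFee = 9.99;          // per month, assumed
        double tablePricePer10K = 0.01;         // per 10,000 transactions, assumed

        long monthlyTransactions = 50000000;    // your measured app profile
        double tableCost = monthlyTransactions / 10000.0 * tablePricePer10K;

        Console.WriteLine("Tables: ${0:F2}/mo vs SQL Azure: ${1:F2}/mo",
                          tableCost, sqlAzureFlatFee);
        // 50M transactions/month -> $50 on Tables, so the flat fee wins here,
        // while a low-traffic app would be cheaper on Tables.
    }
}
```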

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Saving bandwidth often leads to savings in other places: sending fewer things over the wire often means getting fewer things from storage, and sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (a sketch follows the figure below)
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs against storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs, strip needless metadata, and make all PNGs palette PNGs

[Figure: content pipeline - uncompressed content is gzipped and minified (JavaScript, CSS, images) to produce compressed content]
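A minimal gzip sketch using System.IO.Compression; in ASP.NET you would typically wrap the response output stream or enable IIS dynamic compression instead.

```csharp
// Compress a string to gzip bytes with GZipStream.
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipExample
{
    public static byte[] Compress(string content)
    {
        byte[] raw = Encoding.UTF8.GetBytes(content);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }   // disposing the GZipStream flushes the trailing gzip footer
            return output.ToArray();
        }
    }

    static void Main()
    {
        byte[] gz = Compress("hello hello hello hello");
        Console.WriteLine("{0} bytes after compression", gz.Length);
    }
}
```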

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  - A Hadoop implementation; Hadoop has a long history and has been improved for stability, but it was originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure
  - Designed from the start to use cloud primitives, with built-in fault tolerance and a REST-based interface for writing your own clients (the map-reduce pattern itself is sketched below)

Project Daytona - Map Reduce on Azure
http://research.microsoft.com/en-us/projects/azure/daytona.aspx
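Daytona's client API is not shown in this deck, so the sketch below illustrates the map-reduce pattern itself: a word count in plain C# with explicit map, shuffle (group-by-key), and reduce phases.

```csharp
// Generic map-reduce shape: map -> group by key -> reduce per key.
using System;
using System.Collections.Generic;
using System.Linq;

class MapReduceSketch
{
    // Map: document -> (word, 1) pairs.
    static IEnumerable<KeyValuePair<string, int>> Map(string doc)
    {
        return doc.Split(' ').Select(w => new KeyValuePair<string, int>(w, 1));
    }

    static void Main()
    {
        string[] docs = { "the cloud", "the fabric controller", "the cloud fabric" };

        var counts = docs
            .SelectMany(Map)                                     // map phase
            .GroupBy(kv => kv.Key, kv => kv.Value)               // shuffle phase
            .Select(g => new { Word = g.Key, Count = g.Sum() }); // reduce phase

        foreach (var c in counts)
            Console.WriteLine("{0}: {1}", c.Word, c.Count);
    }
}
```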

Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A single BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially - GenBank doubled in size in about 15 months

It is easy to parallelize BLAST
• Segment the input - segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) - needs special result-reduction processing

Large volumes of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak aggregate storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST - a parallel BLAST engine on Azure

• Uses the query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  - Batch job management
  - Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.

A simple Split/Join pattern

Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST: 1, 2, 4, and 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Too large a partition causes load imbalance; too small a partition adds unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead (a splitting sketch follows below)
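A sketch of query segmentation at a fixed granularity: splitting a FASTA file into partitions of N sequences each (file layout and names are illustrative; the deck's experiments later suggest ~100 sequences per partition).

```csharp
// Split a FASTA file (">"-prefixed headers) into fixed-size partitions.
using System.Collections.Generic;
using System.IO;

static class FastaSplitter
{
    public static int Split(string fastaPath, string outDir, int seqsPerPartition)
    {
        var buffer = new List<string>();
        int seqCount = 0, partition = 0;

        foreach (string line in File.ReadLines(fastaPath))
        {
            if (line.StartsWith(">") && seqCount == seqsPerPartition)
            {
                WritePartition(outDir, partition++, buffer);   // flush full partition
                buffer.Clear();
                seqCount = 0;
            }
            if (line.StartsWith(">")) seqCount++;              // new sequence begins
            buffer.Add(line);
        }
        if (buffer.Count > 0) WritePartition(outDir, partition++, buffer);
        return partition;                                      // partitions produced
    }

    static void WritePartition(string dir, int index, List<string> lines)
    {
        File.WriteAllLines(Path.Combine(dir, "partition_" + index + ".fasta"), lines);
    }

    static void Main(string[] args)
    {
        int n = Split(args[0], args.Length > 1 ? args[1] : ".", 100);
        System.Console.WriteLine("{0} partitions written", n);
    }
}
```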

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time: too small causes repeated computation; too large causes an unnecessarily long wait before the task is retried after an instance failure
• Best practice: estimate the value from the number of base pairs in the partition plus test runs, and watch out for the 2-hour maximum limitation (see the dequeue sketch below)
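A sketch of how the visibilityTimeout estimate is used when dequeuing, written against the 2011-era Microsoft.WindowsAzure.StorageClient API (signatures assumed; the modern Azure.Storage.Queues API is analogous, and RunBlast is a hypothetical stand-in for invoking NCBI-BLAST).

```csharp
// Dequeue a BLAST task with a visibility timeout sized to its run-time estimate.
using System;
using Microsoft.WindowsAzure.StorageClient;

class BlastTaskLoop
{
    static void ProcessOne(CloudQueue taskQueue, TimeSpan estimatedRunTime)
    {
        // The message stays invisible to other workers for estimatedRunTime;
        // if this instance dies, it reappears after the timeout (max 2 hours).
        CloudQueueMessage msg = taskQueue.GetMessage(estimatedRunTime);
        if (msg == null) return;            // queue empty

        RunBlast(msg.AsString);             // hypothetical: launch the BLAST binary
        taskQueue.DeleteMessage(msg);       // delete only after success
    }

    static void RunBlast(string partition) { /* launch blastall here */ }
}
```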

[Figure: the Split/Join pattern - a splitting task fans out to many parallel BLAST tasks, which a merging task then joins]

Task size vs. performance
• Benefits from a warm cache effect; 100 sequences per partition was the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances, primarily due to memory capacity

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput by fully utilizing the resources

[Figure: AzureBLAST architecture - a Web Role hosts the web portal and the web service for job registration; a Job Management Role runs the job scheduler and the scaling engine; worker instances pull splitting, BLAST, and merging tasks from a global dispatch queue; Azure Tables hold the job registry, Azure Blob storage holds the NCBI BLAST databases and temporary data, and a database-updating role keeps the databases current]

Web portal and web service
• An ASP.NET program hosted by a web role instance: submit jobs, track each job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored into the job registry table - for fault tolerance, avoid in-memory state (a sketch of such an entry follows)
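A sketch of a job registry entry persisted to Azure Table storage so that no job state lives only in memory, assuming the StorageClient's TableServiceEntity base class; the property names are illustrative.

```csharp
// One row per accepted job, keyed so a user's jobs share a partition.
using System;
using Microsoft.WindowsAzure.StorageClient;

public class BlastJobEntry : TableServiceEntity
{
    public BlastJobEntry(string userId, string jobId)
    {
        PartitionKey = userId;      // query all jobs for a user cheaply
        RowKey = jobId;             // unique within the partition
        Status = "Accepted";
        SubmittedUtc = DateTime.UtcNow;
    }

    public BlastJobEntry() { }      // required for deserialization

    public string Status { get; set; }         // Accepted -> Running -> Done/Failed
    public DateTime SubmittedUtc { get; set; }
    public string InputBlobUri { get; set; }   // where the query sequences live
}
```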


R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

• Blasted ~5,000 proteins (700K sequences)
  - Against all NCBI non-redundant proteins: completed in 30 min
  - Against ~5,000 proteins from another strain: completed in less than 30 sec
• AzureBLAST significantly saved computing time…

Discovering homologs
• Discover the interrelationships of known protein sequences

An "All against All" query
• The database is also the input query
• The protein database is large (4.2 GB, 9,865,668 sequences in total to be queried)
• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance, the job would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs as far as we know - experiments at this scale are usually infeasible for most scientists
• Allocated ~4,000 cores in total: 475 extra-large VMs (8 cores per VM) across four datacenters - US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments, each submitted to one deployment as one job for execution
• Each segment consists of smaller partitions; when the load becomes imbalanced, redistribute it manually

[Figure: instances per deployment - six deployments of 62 nodes and two of 50, roughly 475 extra-large VMs in total]

Results
• The total size of the output is ~230 GB; the total number of hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6~8 days
• Look into the log data to analyze what took place…


A normal log record should look like the paired lines below; otherwise something is wrong (e.g. a task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

(Note that task 251774 starts but never completes.)
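A small audit sketch in the same spirit: pair each "Executing" line with its "done" line and flag tasks that never finished, such as task 251774 above. The log format is taken from the records shown.

```csharp
// Scan a worker log and report tasks that started but never completed.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class LogAudit
{
    static void Main(string[] args)
    {
        var started  = new Regex(@"Executing the task (\d+)");
        var finished = new Regex(@"Execution of task (\d+) is done");
        var open = new HashSet<string>();

        foreach (string line in File.ReadLines(args[0]))   // path to a worker log
        {
            Match s = started.Match(line);
            if (s.Success) open.Add(s.Groups[1].Value);
            Match f = finished.Match(line);
            if (f.Success) open.Remove(f.Groups[1].Value);
        }

        foreach (string task in open)
            Console.WriteLine("Task {0} started but never finished", task);
    }
}
```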

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, ~30 mins apart - this is an update domain at work
• 35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed before the job was killed
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." - Irish proverb

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

$$ET = \frac{\Delta R_n + \rho_a c_p\, \delta q\, g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right)\lambda_v}$$

where
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, the inverse of ra (m s⁻¹)
gs = conductivity of plant stoma to air, the inverse of rs (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs, a big data reduction
• Some of the inputs are not so simple

(A direct evaluation of the formula is sketched below.)
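The formula is straightforward to evaluate once the inputs are in hand; here is a direct translation, with units as in the variable list above. Argument names are illustrative.

```csharp
// Direct evaluation of the Penman-Monteith equation shown above.
static class PenmanMonteith
{
    public static double Et(
        double delta,         // Δ  (Pa/K)
        double rn,            // Rn (W/m^2)
        double rhoA,          // ρa (kg/m^3)
        double cp,            // cp (J/(kg·K))
        double deltaQ,        // δq (Pa)
        double gA,            // ga (m/s)
        double gS,            // gs (m/s)
        double lambdaV,       // λv (J/g)
        double gamma = 66.0)  // γ  (Pa/K), psychrometric constant
    {
        return (delta * rn + rhoA * cp * deltaQ * gA)
             / ((delta + gamma * (1.0 + gA / gS)) * lambdaV);
    }
}
```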

Input datasets:
• NASA MODIS imagery source archives - 5 TB (600K files)
• FLUXNET curated sensor dataset - 30 GB (960 files)
• FLUXNET curated field dataset - 2 KB (1 file)
• NCEP/NCAR - ~100 MB (4K files)
• Vegetative clumping - ~5 MB (1 file)
• Climate classification - ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes a geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• The first stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• An optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Figure: MODISAzure pipeline - scientists submit requests through the AzureMODIS Service Web Role Portal; work flows through the Request Queue (data collection stage, pulling source metadata and imagery from download sites), the Reprojection Queue, the Reduction 1 and Reduction 2 queues (derivation and analysis reduction stages), and the Download Queue, from which scientists download scientific results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each one to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks - recoverable units of work
• The execution status of all jobs and tasks is persisted in Azure Tables

[Figure: generic job flow - a <PipelineStage>Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues the job to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches tasks to the <PipelineStage>Task Queue]

All work is actually done by a GenericWorker (Worker Role)
• Sandboxes the science or other executable
• Marshals all storage from/to Azure blob storage to/from local Azure worker instance files

[Figure: the GenericWorker (Worker Role) pulls from the <PipelineStage>Task Queue dispatched by the Service Monitor and reads its inputs from <Input>Data Storage]

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(a dequeue-and-retry sketch follows below)
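A sketch of that dequeue-and-retry discipline, again assuming the 2011-era StorageClient API; the executable runner is a hypothetical stand-in for the sandboxed science code.

```csharp
// Dequeue, run, and retry up to 3 times using the message's DequeueCount.
using System;
using Microsoft.WindowsAzure.StorageClient;

class GenericWorkerLoop
{
    const int MaxAttempts = 3;

    static void PollOnce(CloudQueue taskQueue, CloudQueue deadLetterQueue)
    {
        CloudQueueMessage msg = taskQueue.GetMessage(TimeSpan.FromMinutes(30));
        if (msg == null) return;

        if (msg.DequeueCount > MaxAttempts)
        {
            // Give up after 3 failed attempts; record for later inspection.
            deadLetterQueue.AddMessage(new CloudQueueMessage(msg.AsString));
            taskQueue.DeleteMessage(msg);
            return;
        }

        try
        {
            RunSandboxedExecutable(msg.AsString);   // hypothetical task runner
            taskQueue.DeleteMessage(msg);           // success
        }
        catch (Exception)
        {
            // Leave the message; it becomes visible again after the timeout,
            // and DequeueCount increments on the next attempt.
        }
    }

    static void RunSandboxedExecutable(string task) { /* launch science exe */ }
}
```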

[Figure: the reprojection pipeline instantiated - a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorkers (Worker Role) read Swath Source Data Storage and write Reprojection Data Storage]

The Azure tables involved:
• ReprojectionJobStatus - each entity specifies a single reprojection job request
• ReprojectionTaskStatus - each entity specifies a single reprojection task (i.e. a single tile)
• SwathGranuleMeta - query this table for the geo-metadata (e.g. boundaries) of each swath tile
• ScanTimeList - query this table for the list of satellite scan times that cover a target tile

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

[Figure: per-stage data volumes and costs overlaid on the pipeline]

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers - $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers - $420 CPU, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers - $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers - $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Using Your VM to the Maximum

Remember bull 1 role instance == 1 VM running Windows

bull 1 role instance = one specific task for your code

bull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of need

bull Multiple ways to use your CPU to the fullest

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Exploiting Concurrency

bull Spin up additional processes each with a specific task or as a unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Finding Good Code Neighbors

bull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live together

bull Example distributed network caches are typically network- and memory-intensive they may be a good neighbor for storage IO-intensive code

Memory Intensive

CPU Intensive

Network IO Intensive

Storage IO Intensive

Scaling Appropriately

bull Monitor your application and make sure yoursquore scaled appropriately (not over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation • based on sample runs on one extra-large Azure instance • the job would require 3,216,731 minutes (6.1 years) on one desktop

One of the biggest BLAST jobs we know of • experiments at this scale are usually infeasible for most scientists

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) in four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually (see the arithmetic sketch after this list)
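A quick arithmetic check of the quoted scale (a sketch only; it ignores scheduling, data movement, and the manual rebalancing just mentioned):

```python
# Back-of-the-envelope check of the all-against-all estimate.
desktop_minutes = 3_216_731
years = desktop_minutes / (60 * 24 * 365)
print(f"{years:.1f} years on one desktop")   # -> 6.1 years

cores = 475 * 8                              # 475 extra-large VMs, 8 cores each
print(f"{cores} cores across 8 deployments") # -> 3800 cores
```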

[Chart: extra-large VMs allocated per deployment across the 8 deployments; mostly 62 nodes each, with two deployments of 50, totaling ~475 VMs]

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should be a matched "Executing"/"is done" pair; otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Note that task 251774 started but never logged completion.
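Failures like task 251774 can be found mechanically by scanning for "Executing" lines with no matching "is done" line. A minimal sketch, assuming the log format reconstructed above:

```python
# Sketch: find tasks that started but never logged completion.
import re

EXEC = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        if m := EXEC.search(line):
            started.add(m.group(1))
        if m := DONE.search(line):
            finished.add(m.group(1))
    return started - finished

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
print(unfinished_tasks(log))  # -> {'251774'}
```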

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, roughly 30 minutes apart; this is an update domain rolling through.

35 nodes experienced blob-writing failures at the same time.

West Europe Datacenter: 30,976 tasks were completed before the job was killed. A reasonable guess: the fault domain at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

$ET = \dfrac{\Delta R_n + \rho_a c_p\, \delta q\, g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right)\lambda_v}$

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) in plants.
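A direct transcription of the Penman-Monteith formula into code, handy for sanity-checking magnitudes; the default constants and the example inputs are illustrative placeholders, and the real pipeline derives these inputs per pixel from the datasets listed below.

```python
# Penman-Monteith ET, transcribed directly from the equation above.
# Arguments use the units given in the variable list. gamma defaults to the
# ~66 Pa/K from the slide; lambda_v ~ 2450 J/g near 20 C (approximate).

def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative call with made-up magnitudes:
print(penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                      dq=1000.0, g_a=0.02, g_s=0.01))
```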

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes a geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: requests from the AzureMODIS Service Web Role Portal enter a Request Queue; the Data Collection Stage pulls tiles from the source imagery download sites and records Source Metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation and Analysis Reduction Stages; a Download Queue delivers scientific results to scientists.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks, i.e., recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage>Request goes to the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues work on the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue.]

All work is actually done by a Worker Role, which:
• sandboxes the science (or other) executable
• marshals all storage from/to Azure blob storage to/from local Azure Worker instance files

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read <Input>Data Storage.]

The GenericWorker:
• dequeues tasks created by the Service Monitor
• retries failed tasks 3 times
• maintains all task status (see the worker-loop sketch below)
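A sketch of that dequeue-execute-retry contract; `queue` and `task_status` are hypothetical stand-ins for the Azure Queue and Table calls, not a real SDK surface.

```python
# Sketch: GenericWorker task loop with 3 retries and persisted status.
# `queue` and `task_status` are hypothetical stand-ins for Azure Queue/Table.

MAX_RETRIES = 3

def run_worker(queue, task_status, execute):
    while True:
        msg = queue.get()               # dequeue next task; None means empty
        if msg is None:
            break
        task_id, attempt = msg["task_id"], msg["attempt"]
        task_status[task_id] = "Running"
        try:
            execute(msg)                # sandboxed science executable
            task_status[task_id] = "Done"
            queue.delete(msg)           # only delete on success
        except Exception:
            queue.delete(msg)
            if attempt < MAX_RETRIES:
                queue.put({**msg, "attempt": attempt + 1})
                task_status[task_id] = "Retrying"
            else:
                task_status[task_id] = "Failed"
```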

[Reprojection example: a Reprojection Request flows through the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches from the Job Queue to the Task Queue for GenericWorker instances. Each ReprojectionJobStatus entity specifies a single reprojection job request; each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile). Query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) of each swath tile; query the ScanTimeList table for the list of satellite scan times that cover a target tile. Tasks point to Swath Source Data Storage and write to Reprojection Data Storage.]

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Cost summary across the pipeline stages (AzureMODIS Service Web Role Portal running throughout):

Stage                   Data         Files   Compute/transfer      Workers   Cost
Data Collection         400-500 GB   60K     10 MB/sec, 11 hours   <10       $50 upload, $450 storage
Reprojection            400 GB       45K     3500 hours            20-100    $420 CPU, $60 download
Derivation Reduction    5-7 GB       55K     1800 hours            20-100    $216 CPU, $1 download, $6 storage
Analysis Reduction      <10 GB       ~1K     1800 hours            20-100    $216 CPU, $2 download, $9 storage

Total: $1420

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds suit "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers


Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• There is a trade-off between the risk of failure or poor user experience from lacking excess capacity and the cost of idling VMs (see the sketch below)
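One simple way to act on that trade-off is a queue-depth policy that scales up promptly and down conservatively; the thresholds below are placeholders, with only the ~$3/day figure taken from the slide.

```python
# Sketch: a conservative autoscaling decision based on queue backlog.
TASKS_PER_WORKER = 20   # placeholder target backlog per instance
MIN_WORKERS = 2         # keep headroom; VMs take minutes to boot
VM_COST_PER_DAY = 3.0   # rough per-VM figure from the slide

def desired_workers(queue_length: int, current: int) -> int:
    target = max(MIN_WORKERS, -(-queue_length // TASKS_PER_WORKER))  # ceil div
    if target > current:
        return target                  # scale up promptly
    # Scale down slowly: drop at most one instance per decision period,
    # trading a few dollars of idle time for user experience.
    return max(target, current - 1)

print(desired_workers(queue_length=300, current=5))   # -> 15
print(desired_workers(queue_length=10, current=15))   # -> 14
```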

Performance and Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app's profile: e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference depending on your app's profile
• Caching and compressing help a lot with storage costs (see the comparison sketch below)
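A toy comparison of the two billing shapes; both prices are illustrative placeholders, not actual SQL Azure or Windows Azure Tables rates.

```python
# Sketch: flat-fee vs. per-transaction billing cross-over (illustrative prices).
FLAT_FEE_PER_MONTH = 10.0          # placeholder "flat fee" service
PRICE_PER_10K_TRANSACTIONS = 0.01  # placeholder "metered" service

def monthly_cost_metered(transactions: int) -> float:
    return transactions / 10_000 * PRICE_PER_10K_TRANSACTIONS

for txns in (1_000_000, 10_000_000, 100_000_000):
    metered = monthly_cost_metered(txns)
    better = "metered" if metered < FLAT_FEE_PER_MONTH else "flat fee"
    print(f"{txns:>11,} txns/month: metered ${metered:,.2f} -> {better} wins")
```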

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's bill.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places.

Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content (see the gzip sketch after the diagram below)
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: Uncompressed Content → Gzip → Compressed Content; also minify JavaScript, minify CSS, minify images]
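A minimal, standard-library-only demonstration of the size win from gzipping text output:

```python
# Sketch: gzip text content and compare sizes.
import gzip

html = ("<html><body>" + "<p>Windows Azure best practices</p>" * 200
        + "</body></html>").encode("utf-8")

compressed = gzip.compress(html)
print(len(html), "bytes uncompressed")
print(len(compressed), "bytes gzipped")
print(f"ratio: {len(compressed) / len(html):.1%}")
# Browsers advertising 'Accept-Encoding: gzip' decompress this on the fly.
```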

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile inside and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web: a Hadoop implementation; Hadoop has a long history and has been hardened for stability, but it was originally designed for cluster systems
• Microsoft Research this week is announcing a project code-named Daytona for MapReduce jobs on Azure: designed from the start to use cloud primitives, with built-in fault tolerance and a REST-based interface for writing your own clients


Project Daytona: Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx


Questions and Discussion…

Thank you for hosting me at the Summer School!

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Storage Costs

bull Understand an applicationrsquos storage profile and how storage billing works

bull Make service choices based on your app profile

bull Eg SQL Azure has a flat fee while Windows Azure Tables charges per transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Compressing Content

1 Gzip all output content

bull All modern browsers can decompress on the fly

bull Compared to Compress Gzip has much better compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizes

bull Use Portable Network Graphics (PNGs)

bull Crush your PNGs

bull Strip needless metadata

bull Make all PNGs palette PNGs

Uncompressed

Content

Compressed

Content

Gzip

Minify JavaScript

Minify CCS

Minify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
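This scan can be mechanized: pair each "Executing" record with its "is done" record and flag tasks that never completed. A minimal sketch, with the regexes following the cleaned-up record format above (the parsing itself is mine, not the team's actual tooling):

# Sketch: flag tasks whose "Executing" record has no matching "is done" record,
# i.e. the failure signature described above.
import re

EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        if m := EXEC_RE.search(line):
            started.add(m.group(1))
        elif m := DONE_RE.search(line):
            finished.add(m.group(1))
    return started - finished   # e.g. task 251774 in the sample above

sample = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
print(unfinished_tasks(sample))  # {'251774'}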

North Europe data center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group: this is an update domain (~30 mins per group, ~6 nodes in one group)

35 nodes experienced blob-writing failures at the same time

West Europe data center: 30,976 tasks were completed before the job was killed

A reasonable guess: the fault domain at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J g^-1)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma air (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

$$ET = \frac{\Delta R_n + \rho_a c_p\, \delta q\, g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right)\lambda_v}$$

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.
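For clarity, the equation above in executable form: a direct transcription of Penman-Monteith using the units in the variable list. The default constants and the sample inputs are purely illustrative, not values from the MODISAzure datasets.

# Sketch: Penman-Monteith ET as transcribed from the equation above.
def penman_monteith(delta, R_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    # ET = (Delta*Rn + rho_a*cp*dq*ga) / ((Delta + gamma*(1 + ga/gs)) * lambda_v)
    # Units as in the variable list; lambda_v ~ 2450 J/g near 20 C.
    numerator = delta * R_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative mid-latitude daytime values only:
print(penman_monteith(delta=145.0, R_n=400.0, rho_a=1.2, c_p=1004.0,
                      dq=1000.0, g_a=0.02, g_s=0.01))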

Input datasets:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal into a Request Queue; the Data Collection Stage pulls imagery from the Source Imagery Download Sites via the Download Queue and records Source Metadata; the Reprojection Queue feeds the Reprojection Stage, the Reduction 1 Queue feeds the Derivation Reduction Stage, and the Reduction 2 Queue feeds the Analysis Reduction Stage; scientists download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

All work is actually done by a Worker Role (the GenericWorker)
• Sandboxes the science or other executable
• Marshals all storage between Azure blob storage and local Azure Worker instance files
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(a sketch of this loop follows)

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read inputs from <Input>Data Storage.]
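A sketch of that GenericWorker loop, with the three-retry policy driven by the queue message's dequeue count. The queue name, executable, and record_status helper are hypothetical; this shows the pattern, not MODISAzure's actual code.

# Sketch: GenericWorker-style loop - dequeue a task, run the sandboxed
# executable, retry up to 3 times, and record task status.
import subprocess
from azure.storage.queue import QueueClient

CONN_STR = "<storage account connection string>"   # placeholder
MAX_ATTEMPTS = 3

def worker_loop():
    queue = QueueClient.from_connection_string(CONN_STR, "reprojection-tasks")
    for msg in queue.receive_messages(visibility_timeout=600):
        if msg.dequeue_count > MAX_ATTEMPTS:          # already retried 3 times
            record_status(msg.content, "Failed")      # give up on the poison message
            queue.delete_message(msg)
            continue
        try:
            # stage inputs from blob storage, run the science executable, upload outputs
            subprocess.run(["science.exe", msg.content], check=True)
            record_status(msg.content, "Done")
            queue.delete_message(msg)                 # success: remove from the queue
        except subprocess.CalledProcessError:
            record_status(msg.content, "Retrying")    # leave the message; it reappears
                                                      # when the visibility timeout lapses

def record_status(task_id, status):
    ...  # persist <PipelineStage>TaskStatus to an Azure Table (omitted for brevity)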

[Diagram: a Reprojection Request flows through the Job Queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), and dispatches to the Task Queue for GenericWorker (Worker Role) instances. The SwathGranuleMeta table is queried for geo-metadata (e.g. boundaries) of each swath tile; the ScanTimeList table is queried for the list of satellite scan times that cover a target tile; workers read from Swath Source Data Storage and write to Reprojection Data Storage.]

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

[Cost breakdown by pipeline stage:

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420]

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds suit "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Research Examples in the Cloud

…on another set of slides

Map Reduce on Azure

• Elastic MapReduce on Amazon Web Services has traditionally been the only option for MapReduce jobs on the web
  • Hadoop implementation
  • Hadoop has a long history and has been improved for stability
  • Originally designed for cluster systems

• Microsoft Research is this week announcing a project code-named Daytona for MapReduce jobs on Azure
  • Designed from the start to use cloud primitives
  • Built-in fault tolerance
  • REST-based interface for writing your own clients
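The deck does not show Daytona's client API, so as a generic, self-contained illustration of the map/reduce pattern itself (the classic word count; all names are mine, not Daytona's):

# Sketch: the bare map/reduce pattern. Generic illustration only.
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    return [(word, 1) for word in document.split()]      # emit (key, value) pairs

def reduce_phase(pairs):
    groups = defaultdict(int)
    for key, value in pairs:                             # group by key...
        groups[key] += value                             # ...then reduce each group
    return dict(groups)

docs = ["cloud computing on azure", "map reduce on azure"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts["azure"])  # 2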


Project Daytona - Map Reduce on Azure

http://research.microsoft.com/en-us/projects/azure/daytona.aspx


Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Research Examples in the Cloud

hellipon another set of slides

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Map Reduce on Azure

bull Elastic MapReduce on Amazon Web Services has traditionally been the only option for Map Reduce jobs in the web bull Hadoop implementation bull Hadoop has a long history and has been improved for stability bull Originally Designed for Cluster Systems

bull Microsoft Research this week is announcing a project code named Daytona for Map Reduce jobs on Azure bull Designed from the start to use cloud primitives bull Built-in fault tolerance bull REST based interface for writing your own clients

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

76

Project Daytona - Map Reduce on Azure

httpresearchmicrosoftcomen-usprojectsazuredaytonaaspx

77

Questions and Discussionhellip

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers



Questions and Discussion…

Thank you for hosting me at the Summer School

BLAST (Basic Local Alignment Search Tool)

• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A single BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large data volumes:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations, including batch job management and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

A simple split/join pattern.

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST is set to 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead (see the sketch below)
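For concreteness, a sketch of how a worker might pick the thread count and assemble the NCBI-BLAST command line. The "-a" flag and the 1/2/4/8 mapping come from the slide above; the program choice (blastp via the legacy blastall driver), database name, and file paths are illustrative assumptions.

```python
# Sketch: choose NCBI-BLAST's "-a" (processor count) from the instance size
# and build the command line. Paths and the blastp choice are assumptions.
import subprocess

CORES_BY_SIZE = {"small": 1, "medium": 2, "large": 4, "extralarge": 8}

def blast_command(instance_size: str, query_file: str, db_path: str) -> list:
    cores = CORES_BY_SIZE[instance_size]
    return ["blastall", "-p", "blastp",   # legacy NCBI-BLAST driver
            "-a", str(cores),             # threads = cores on this instance
            "-d", db_path, "-i", query_file, "-o", query_file + ".out"]

# Illustrative call:
# subprocess.run(blast_command("extralarge", "partition_007.fasta", "nr"), check=True)
```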

Value of visibilityTimeout for each BLAST task: essentially an estimate of the task run time.
• Too small: repeated computation
• Too large: an unnecessarily long wait before another instance picks up the task after an instance failure

Best practice:
• Estimate the value based on the number of base pairs in the partition, plus test runs
• Watch out for the 2-hour maximum limitation
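A sketch of that estimate under these constraints; the per-base rate and the safety factor are assumptions to be calibrated by the test runs mentioned above, and the 7200-second cap mirrors the 2-hour queue limit.

```python
# Sketch: derive a visibility timeout from partition size, capped at 2 hours.
MAX_VISIBILITY_SEC = 2 * 60 * 60   # the queue's 2-hour maximum
SAFETY_FACTOR = 1.5                # assumed headroom over the raw estimate

def visibility_timeout(num_base_pairs: int, secs_per_base: float) -> int:
    """secs_per_base comes from profiling test runs on the target instance."""
    estimate = num_base_pairs * secs_per_base * SAFETY_FACTOR
    return int(min(estimate, MAX_VISIBILITY_SEC))
```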

[Figure: a splitting task fans out into many parallel BLAST tasks, followed by a merging task.]
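A minimal sketch of the split/join pattern itself, using a local process pool as a stand-in for the queue-plus-worker-roles deployment; run_blast is a hypothetical per-partition step, and the 100-sequence default matches the partition size reported below.

```python
# Sketch: split input sequences into fixed-size partitions, query them in
# parallel, and merge the results (local stand-in for queue + worker roles).
from multiprocessing import Pool

def split(sequences: list, partition_size: int = 100) -> list:
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_blast(partition: list) -> list:
    """Hypothetical per-partition step; on Azure this is one queued task."""
    return [f"hits-for-{seq}" for seq in partition]

def split_join(sequences: list) -> list:
    with Pool() as pool:
        partial_results = pool.map(run_blast, split(sequences))  # parallel query
    return [hit for part in partial_results for hit in part]     # merge step
```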

Task size vs. performance:
• Benefit from the warm-cache effect
• 100 sequences per partition was the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the larger memory capacity

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources

[Figure: AzureBLAST architecture. A Web Portal and Web Service (Web Role) handle job registration; a Job Management Role runs the Job Scheduler and the Scaling Engine; a global dispatch queue feeds many Worker instances; Azure Tables hold the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating Role keeps the NCBI databases current. The splitting task / parallel BLAST tasks / merging task graph from above runs on the workers.]

The portal is an ASP.NET program hosted by a Web Role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

Each accepted job is stored in the Job Registry table:
• Fault tolerance: avoid in-memory state
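A sketch of that registration step: persist the job before acknowledging it, so a portal crash loses nothing. The entity shape and the register_job name are assumptions; a dict stands in for the Azure Table.

```python
# Sketch: accept a job only after it is durably recorded in the registry
# table (stand-in dict), so no job state lives only in memory.
import uuid
from datetime import datetime, timezone

job_registry: dict = {}  # stands in for the Job Registry Azure Table

def register_job(user: str, query_blob_url: str) -> str:
    job_id = str(uuid.uuid4())
    job_registry[job_id] = {                 # persist first ...
        "user": user,
        "query": query_blob_url,
        "status": "Registered",
        "submitted": datetime.now(timezone.utc).isoformat(),
    }
    return job_id                            # ... then acknowledge to the user
```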

[Figure: job flow from the Web Portal and Web Service through job registration into the Job Registry, with the Job Scheduler and Scaling Engine driving execution and a Job Portal reporting status.]

Case study: R. palustris as a platform for H2 production (Eric Schadt, Sage; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (~700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB), with 9,865,668 sequences in total to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (about 6.1 years, at 525,600 minutes per year) on one desktop

One of the biggest BLAST jobs we know of; experiments at this scale are usually infeasible for most scientists.
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually


• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should look like the following; otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
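A sketch of the kind of log check implied here: flag any "Executing" record with no matching "done" record (task 251774 above is exactly such a case). The regular expressions are written for the cleaned-up log format above.

```python
# Sketch: find tasks that started but never logged completion.
import re

EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        if m := EXEC_RE.search(line):
            started.add(m.group(1))
        if m := DONE_RE.search(line):
            finished.add(m.group(1))
    return started - finished  # e.g. {"251774"} for the excerpt above
```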

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain (~30 mins, ~6 nodes in one group)
• 35 nodes experienced a blob-writing failure at the same time

West Europe datacenter: 30,976 tasks were completed before the job was killed.
A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J g^-1)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

$$ET = \frac{\Delta R_n + \rho_a c_p \,\delta q \, g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v}$$

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) from plants.
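As a worked example, a direct transcription of the Penman-Monteith formula above into code; the default γ matches the slide, the default λv (~2450 J/g near 20 °C) and the sample inputs are illustrative assumptions.

```python
# Sketch: Penman-Monteith ET, transcribed directly from the equation above.
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """delta [Pa/K], r_n [W/m^2], rho_a [kg/m^3], c_p [J/(kg K)], dq [Pa],
    g_a, g_s [m/s], gamma [Pa/K], lambda_v [J/g]."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative call with made-up inputs:
# et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
#                      dq=800.0, g_a=0.02, g_s=0.01)
```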

Input data sources:
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year (processing 20 years of US tiles is roughly the cost of 1 year of global tiles)

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

BLAST (Basic Local Alignment Search Tool)

bull The most important software in bioinformatics

bull Identify similarity between bio-sequences

Computationally intensive

bull Large number of pairwise alignment operations

bull A BLAST running can take 700 ~ 1000 CPU hours

bull Sequence databases growing exponentially

bull GenBank doubled in size in about 15 months

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

It is easy to parallelize BLAST

bull Segment the input

bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)

bull Needs special result reduction processing

Large volume data

bull A normal Blast database can be as large as 10GB

bull 100 nodes means the peak storage bandwidth could reach to 1TB

bull The output of BLAST is usually 10-100x larger than the input

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel pattern bull split the input sequences

bull query partitions in parallel

bull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerations bull Batch job management

bull Task parallelism on an elastic Cloud

Wei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop

on Scientific Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

A simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLAST

bull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance

bull Small partition unnecessary overheads bull NCBI-BLAST overhead

bull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time

bull too small repeated computation

bull too large unnecessary long period of waiting time in case of the instance failure Best Practice

bull Estimate the value based on the number of pair-bases in the partition and test-runs

bull Watch out for the 2-hour maximum limitation

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Task size vs Performance

bull Benefit of the warm cache effect

bull 100 sequences per partition is the best choice

Instance size vs Performance

bull Super-linear speedup with larger size worker instances

bull Primarily due to the memory capability

Task SizeInstance Size vs Cost

bull Extra-large instance generated the best and the most economical throughput

bull Fully utilize the resource

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

[Figure: the MODISAzure architecture annotated with per-stage scale and cost.]

Stage | Data | Compute | Cost
Data Collection | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers | $50 upload, $450 storage
Reprojection | 400 GB, 45K files | 3500 hours, 20-100 workers | $420 CPU, $60 download
Derivation Reduction | 5-7 GB, 55K files | 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis Reduction | <10 GB, ~1K files | 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1420
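The CPU line items are consistent with a flat compute rate of $0.12 per instance-hour (Azure's published small-instance rate at the time; inferring the rate from these figures is an assumption, the slide does not state it):

\[
3500\ \text{h} \times \$0.12/\text{h} = \$420, \qquad 1800\ \text{h} \times \$0.12/\text{h} = \$216 .
\]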

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

[Figure: AzureBLAST architecture. A Web Portal and Web Service (Web Role) handle job registration; a Job Management Role runs the Job Scheduler, Job Registry, and Scaling Engine; worker instances pull work from a global dispatch queue; Azure Tables hold the Job Registry, Azure Blob storage holds the NCBI/BLAST databases and temporary data, and a database-updating Role keeps the NCBI databases current. Each job is split by a splitting task into parallel BLAST tasks whose outputs are combined by a merging task.]

The job portal is an ASP.NET program hosted by a Web Role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored into the job registry table (fault tolerance: avoid in-memory states).
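A tiny sketch of the "avoid in-memory states" point: persist the job to the registry before acknowledging the submission, so a Web Role restart loses nothing. The dict below stands in for the durable Azure Table, and the field names are hypothetical:

```python
import datetime
import uuid

job_registry: dict[str, dict] = {}  # stand-in for the durable job registry table

def accept_job(owner: str, query_blob: str) -> str:
    """Persist the job first; only then report success to the caller."""
    job_id = str(uuid.uuid4())
    job_registry[job_id] = {
        "owner": owner,
        "query_blob": query_blob,
        "state": "Accepted",
        "submitted": datetime.datetime.utcnow().isoformat(),
    }
    # With a real durable table, the job now survives a role restart.
    return job_id

print(accept_job("alice", "queries/batch-001.fasta"))
```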


R. palustris as a platform for H2 production (Eric Schadt, Sage; Sam Phattarasukol, Harwood Lab, UW)
• Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
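For the record, the conversion behind that estimate (525,600 = 60 × 24 × 365 minutes per year):

\[
\frac{3{,}216{,}731\ \text{min}}{525{,}600\ \text{min/yr}} \approx 6.1\ \text{years}.
\]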

One of the biggest BLAST jobs as far as we know
• Experiments at this scale are usually infeasible for most scientists
• Allocated a total of ~4000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution (sketched below)
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
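A minimal sketch of that two-level split, using the numbers from the slides (8 deployments, 100 sequences per partition); the helper name and sequence IDs are made up:

```python
def split(seq_ids: list[str], n_segments: int = 8, partition_size: int = 100):
    """Two-level split: segments (one per deployment, submitted as one job),
    then fixed-size partitions (one per BLAST task) within each segment."""
    seg_len = (len(seq_ids) + n_segments - 1) // n_segments  # ceiling division
    segments = [seq_ids[i:i + seg_len] for i in range(0, len(seq_ids), seg_len)]
    return [
        [seg[j:j + partition_size] for j in range(0, len(seg), partition_size)]
        for seg in segments
    ]

# ~10M sequences -> 8 segments of ~1.25M sequences -> ~12.5K partitions each;
# demo with 1000 IDs: 8 segments of 125, each split into partitions of <=100.
jobs = split([f"seq{i}" for i in range(1000)], n_segments=8, partition_size=100)
print(len(jobs), len(jobs[0]), len(jobs[0][0]))  # -> 8 2 100
```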


• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
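A sketch of the kind of log analysis this implies: pair each "Executing" line with its "done" line and flag tasks that never completed. The regexes match the record format shown above; the embedded log excerpt is abbreviated:

```python
import re

LOG = """3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
"""

started, finished = set(), set()
for line in LOG.splitlines():
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    elif m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

# Tasks that started but never finished indicate node loss or blob failures.
print("never completed:", started - finished)  # -> {'251774'}
```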

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group; this is an update domain at work
• ~30 mins per group; ~6 nodes in one group
• 35 nodes experienced blob-writing failures at the same time

West Europe datacenter: 30,976 tasks were completed, then the job was killed
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J g⁻¹)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

\[
ET = \frac{\Delta R_n + \rho_a c_p\, \delta q\, g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right) \lambda_v}
\]

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, i.e. evaporation through plant membranes, by plants.
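As a worked illustration of the formula (a minimal sketch, not the MODISAzure implementation; the example input values are made up but use the units defined above):

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET: (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).
    gamma in Pa/K (slide's ~66); lambda_v in J/g (~2450 near 20 °C)."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    )

# Illustrative (made-up) mid-latitude summer values:
print(penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1004.0,
                      dq=1000.0, g_a=0.02, g_s=0.01))
```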

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Web

Portal

Web

Service

Job registration

Job Scheduler

Worker

Worker

Worker

Global

dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database

updating Role

hellip

Scaling Engine

Blast databases

temporary data etc)

Job Registry NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

ASPNET program hosted by a web role instance bull Submit jobs

bull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry table bull Fault tolerance avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

R palustris as a platform for H2 production Eric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences) bull Against all NCBI non-redundant proteins completed in 30 min

bull Against ~5000 proteins from another strain completed in less than 30 sec

AzureBLAST significantly saved computing timehellip

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

Discovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo query bull The database is also the input query

bull The protein database is large (42 GB size) bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimation bull Based on the sampling-running on one extra-large Azure instance

bull Would require 3216731 minutes (61 years) on one desktop

One of biggest BLAST jobs as far as we know bull This scale of experiments usually are infeasible to most scientists

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

bull Allocated a total of ~4000 instances bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLAST bull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segments bull Each will be submitted to one deployment as one job for execution

bull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

5

0

62 6

2 6

2 6

2 6

2 5

0 62

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage

bull Downloads requested input tiles

from NASA ftp sites

bull Includes geospatial lookup for

non-sinusoidal tiles that will

contribute to a reprojected

sinusoidal tile

Reprojection (map) stage

bull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline

algorithms

Derivation reduction stage

bull First stage visible to scientist

bull Computes ET in our initial use

Analysis reduction stage

bull Optional second stage visible to

scientist

bull Enables production of science

analysis artifacts such as maps

tables virtual sensors

Reduction 1

Queue

Source

Metadata

AzureMODIS

Service Web Role Portal

Request

Queue

Scientific

Results

Download

Data Collection Stage

Source Imagery Download Sites

Reprojection

Queue

Reduction 2

Queue

Download

Queue

Scientists

Science results

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

bull ModisAzure Service is the Web Role front door bull Receives all user requests

bull Queues request to appropriate Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Role bull Parses all job requests into tasks ndash

recoverable units of work

bull Execution status of all jobs and tasks persisted in Tables

ltPipelineStagegt

Request

hellip ltPipelineStagegtJobStatus

Persist ltPipelineStagegtJob Queue

MODISAzure Service

(Web Role)

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

hellip

Dispatch

ltPipelineStagegtTask Queue

All work actually done by a Worker Role

bull Sandboxes science or other executable

bull Marshalls all storage fromto Azure blob storage tofrom local Azure Worker instance files

Service Monitor

(Worker Role)

Parse amp Persist ltPipelineStagegtTaskStatus

GenericWorker

(Worker Role)

hellip

hellip

Dispatch

ltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 times

bull Maintains all task status

Reprojection Request

hellip

Service Monitor

(Worker Role)

ReprojectionJobStatus Persist

Parse amp Persist ReprojectionTaskStatus

GenericWorker

(Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data

Storage

Each entity specifies a

single reprojection job

request

Each entity specifies a

single reprojection task (ie

a single tile)

Query this table to get

geo-metadata (eg

boundaries) for each swath

tile

Query this table to get the

list of satellite scan times

that cover a target tile

Swath Source

Data Storage

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download

Queue

Scientists

Analysis Reduction Stage Derivation Reduction Stage Reprojection Stage

400-500 GB

60K files

10 MBsec

11 hours

lt10 workers

$50 upload

$450 storage

400 GB

45K files

3500 hours

20-100

workers

5-7 GB

55K files

1800 hours

20-100

workers

lt10 GB

~1K files

1800 hours

20-100

workers

$420 cpu

$60 download

$216 cpu

$1 download

$6 storage

$216 cpu

$2 download

$9 storage

AzureMODIS

Service Web Role Portal

Total $1420

bull Clouds are the largest scale computer centers ever constructed and have the

potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed

resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can

support many interesting ldquoprogramming patternsrdquo but tightly coupled low-

latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both

individual researchers and entire communities of researchers

  • Day 2 - Azure as a PaaS
  • Day 2 - Applications

bull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)

bull But based our estimates real working instance time should be 6~8 day

bull Look into log data to analyze what took placehellip

5

0

62 6

2 6

2 6

2 6

2 5

0 62

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523

3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins

3312010 625 RD00155D3611B0 Executing the task 251553

3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins

3312010 644 RD00155D3611B0 Executing the task 251600

3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks

and then came back in a group

This is an Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob

writing failure at same

time

West Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the

Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in The Cloud

You never miss the water till the well has run dry Irish Proverb

ET = Water volume evapotranspired (m3 s-1 m-2)

Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1)

λv = Latent heat of vaporization (Jg)

Rn = Net radiation (W m-2)

cp = Specific heat capacity of air (J kg-1 K-1)

ρa = dry air density (kg m-3)

δq = vapor pressure deficit (Pa)

ga = Conductivity of air (inverse of ra) (m s-1)

gs = Conductivity of plant stoma air (inverse of rs) (m s-1)

γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a

catchment can be tricky

bull Lots of inputs big data reduction

bull Some of the inputs are not so simple

119864119879 = ∆119877119899 + 120588119886 119888119901 120575119902 119892119886

(∆ + 120574 1 + 119892119886 119892119904 )120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

NASA MODIS

imagery source

archives

5 TB (600K files)

FLUXNET curated

sensor dataset

(30GB 960 files)

FLUXNET curated

field dataset

2 KB (1 file)

NCEPNCAR

~100MB

(4K files)

Vegetative

clumping

~5MB (1file)

Climate

classification

~1MB (1file)

20 US year = 1 global year

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
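To make the staging concrete, here is a minimal sketch of map and reduce stages chained through named queues. The queue names follow the architecture described below; the code itself is an illustrative sketch, not the MODISAzure implementation:

from queue import Queue

# One queue per pipeline stage, as in the architecture diagram below.
queues = {name: Queue() for name in
          ("Download", "Reprojection", "Reduction 1", "Reduction 2")}

# Each stage drains its own queue and feeds the next one.
pipeline = [("Download", "Reprojection"),
            ("Reprojection", "Reduction 1"),
            ("Reduction 1", "Reduction 2")]

def run_stage(stage, next_stage, work):
    """Apply `work` to every task in `stage` and enqueue results downstream."""
    src, dst = queues[stage], queues[next_stage]
    while not src.empty():
        dst.put(work(src.get()))

# Seed the download queue with (hypothetical) tile requests and push them through.
for tile in ("h08v05", "h09v05"):
    queues["Download"].put(tile)
for stage, nxt in pipeline:
    run_stage(stage, nxt, work=lambda task: task)  # identity stand-in for real work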

[Architecture diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue; the Data Collection Stage pulls source imagery from the download sites and records Source Metadata; Reprojection, Reduction 1, Reduction 2, and Download Queues link the Reprojection, Derivation Reduction, and Analysis Reduction Stages; scientists download the scientific results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

• The ModisAzure Service is the Web Role front door:
  – Receives all user requests
  – Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  – Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: the MODISAzure Service (Web Role) places a <PipelineStage>Request on the <PipelineStage>Job Queue; the Service Monitor (Worker Role) persists <PipelineStage>JobStatus, parses and persists <PipelineStage>TaskStatus, and dispatches work to the <PipelineStage>Task Queue.]

All work is actually done by a Worker Role:
• Sandboxes the science or other executable
• Marshalls all storage from/to Azure blob storage to/from local Azure Worker instance files

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue; GenericWorker (Worker Role) instances pick up the tasks and read from <Input>Data Storage.]

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
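A minimal sketch of that dequeue-and-retry loop, with a plain in-memory queue standing in for the Azure Task Queue and a dict standing in for the TaskStatus table; names and structure are illustrative assumptions, not the actual GenericWorker code:

from queue import Queue, Empty

MAX_RETRIES = 3  # "Retries failed tasks 3 times"

def worker_loop(task_queue, execute, status):
    """Dequeue tasks, run them, re-enqueue failures up to MAX_RETRIES times."""
    while True:
        try:
            task_id, attempt = task_queue.get(timeout=1)
        except Empty:
            return  # queue drained
        try:
            execute(task_id)  # run the sandboxed science executable
            status[task_id] = "done"
        except Exception:
            if attempt < MAX_RETRIES:
                task_queue.put((task_id, attempt + 1))  # re-enqueue for retry
                status[task_id] = "retrying"
            else:
                status[task_id] = "failed"  # give up after 3 retries

status = {}
q = Queue()
q.put(("251774", 0))
worker_loop(q, execute=lambda task_id: None, status=status)
print(status)  # {'251774': 'done'}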

[Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue, parses and persists ReprojectionTaskStatus, and dispatches via the Task Queue to GenericWorker (Worker Role) instances, which read Swath Source Data Storage and write Reprojection Data Storage.]

The tables involved:
• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e. a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g. boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile
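A sketch of what such task entities might look like, written as plain Python dataclasses; every field name here is an assumption inferred from the descriptions above, not the actual MODISAzure schema:

from dataclasses import dataclass

@dataclass
class ReprojectionTaskStatus:
    """One entity per reprojection task, i.e. a single output tile (assumed fields)."""
    job_id: str        # parent ReprojectionJobStatus entity
    tile_id: str       # target sinusoidal tile
    state: str         # e.g. "queued", "running", "done", "failed"
    attempts: int = 0  # retry counter maintained by the GenericWorker

@dataclass
class SwathGranuleMeta:
    """Geo-metadata for one swath tile (assumed fields and encoding)."""
    granule_id: str
    boundary: tuple    # (min_lat, min_lon, max_lat, max_lon)
    scan_time: str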

• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

[Diagram: the same pipeline as above (AzureMODIS Service Web Role Portal, Request/Reprojection/Reduction 1/Reduction 2/Download Queues, and the four stages), annotated per stage with data volumes and costs.]

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage

Total: $1420
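As a quick arithmetic check, the quoted line items can be tallied; a sketch only: they sum to $1430 against the quoted total of $1420, presumably rounding in the per-stage figures:

# Per-stage cost line items as quoted above (USD).
costs = {
    "collection":   {"upload": 50, "storage": 450},
    "reprojection": {"cpu": 420, "download": 60},
    "derivation":   {"cpu": 216, "download": 1, "storage": 6},
    "analysis":     {"cpu": 216, "download": 2, "storage": 9},
}
total = sum(v for stage in costs.values() for v in stage.values())
print(total)  # 1430, vs. the slide's quoted total of $1420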

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault-tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premises compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

