
Live Migration of Virtual Machines

Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen†, Eric Jul†, Christian Limpach, Ian Pratt, Andrew Warfield

University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge, UK
† Department of Computer Science, University of Copenhagen, Denmark

firstname.lastname@cl.cam.ac.uk {jacobg,eric}@diku.dk

Abstract

Migrating operating system instances across distinct physical hosts is a useful tool for administrators of data centers and clusters: It allows a clean separation between hardware and software, and facilitates fault management, load balancing, and low-level system maintenance.

By carrying out the majority of migration while OSes continue to run, we achieve impressive performance with minimal service downtimes; we demonstrate the migration of entire OS instances on a commodity cluster, recording service downtimes as low as 60ms. We show that our performance is sufficient to make live migration a practical tool even for servers running interactive loads.

In this paper we consider the design options for migrating OSes running services with liveness constraints, focusing on data center and cluster environments. We introduce and analyze the concept of writable working set, and present the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM.

1 Introduction

Operating system virtualization has attracted considerable interest in recent years, particularly from the data center and cluster computing communities. It has previously been shown [1] that paravirtualization allows many OS instances to run concurrently on a single physical machine with high performance, providing better use of physical resources and isolating individual OS instances.

In this paper we explore a further benefit allowed by virtualization: that of live OS migration. Migrating an entire OS and all of its applications as one unit allows us to avoid many of the difficulties faced by process-level migration approaches. In particular the narrow interface between a virtualized OS and the virtual machine monitor (VMM) makes it easy to avoid the problem of ‘residual dependencies’ [2] in which the original host machine must remain available and network-accessible in order to service certain system calls or even memory accesses on behalf of migrated processes. With virtual machine migration, on the other hand, the original host may be decommissioned once migration has completed. This is particularly valuable when migration is occurring in order to allow maintenance of the original host.

Secondly, migrating at the level of an entire virtual machine means that in-memory state can be transferred in a consistent and (as will be shown) efficient fashion. This applies to kernel-internal state (e.g. the TCP control block for a currently active connection) as well as application-level state, even when this is shared between multiple cooperating processes. In practical terms, for example, this means that we can migrate an on-line game server or streaming media server without requiring clients to reconnect: something not possible with approaches which use application-level restart and layer 7 redirection.

Thirdly, live migration of virtual machines allows a separation of concerns between the users and operator of a data center or cluster. Users have ‘carte blanche’ regarding the software and services they run within their virtual machine, and need not provide the operator with any OS-level access at all (e.g. a root login to quiesce processes or I/O prior to migration). Similarly the operator need not be concerned with the details of what is occurring within the virtual machine; instead they can simply migrate the entire operating system and its attendant processes as a single unit.

Overall, live OS migration is an extremely powerful tool for cluster administrators, allowing separation of hardware and software considerations, and consolidating clustered hardware into a single coherent management domain. If a physical machine needs to be removed from service, an administrator may migrate OS instances, including the applications that they are running, to alternative machine(s), freeing the original machine for maintenance. Similarly, OS instances may be rearranged across machines in a cluster to relieve load on congested hosts. In these situations the combination of virtualization and migration significantly improves manageability.


We have implemented high-performance migration support for Xen [1], a freely available open source VMM for commodity hardware. Our design and implementation addresses the issues and tradeoffs involved in live local-area migration. Firstly, as we are targeting the migration of active OSes hosting live services, it is critically important to minimize the downtime during which services are entirely unavailable. Secondly, we must consider the total migration time, during which state on both machines is synchronized and which hence may affect reliability. Furthermore we must ensure that migration does not unnecessarily disrupt active services through resource contention (e.g., CPU, network bandwidth) with the migrating OS.

Our implementation addresses all of these concerns, allowing for example an OS running the SPECweb benchmark to migrate across two physical hosts with only 210ms unavailability, or an OS running a Quake 3 server to migrate with just 60ms downtime. Unlike application-level restart, we can maintain network connections and application state during this process, hence providing effectively seamless migration from a user’s point of view.

We achieve this by using a pre-copy approach in which pages of memory are iteratively copied from the source machine to the destination host, all without ever stopping the execution of the virtual machine being migrated. Page-level protection hardware is used to ensure a consistent snapshot is transferred, and a rate-adaptive algorithm is used to control the impact of migration traffic on running services. The final phase pauses the virtual machine, copies any remaining pages to the destination, and resumes execution there. We eschew a ‘pull’ approach which faults in missing pages across the network since this adds a residual dependency of arbitrarily long duration, as well as providing in general rather poor performance.

Our current implementation does not address migration across the wide area, nor does it include support for migrating local block devices, since neither of these is required for our target problem space. However we discuss ways in which such support can be provided in Section 7.

2 Related Work

The Collective project [3] has previously explored VM migration as a tool to provide mobility to users who work on different physical hosts at different times, citing as an example the transfer of an OS instance to a home computer while a user drives home from work. Their work aims to optimize for slow (e.g., ADSL) links and longer time spans, and so stops OS execution for the duration of the transfer, with a set of enhancements to reduce the transmitted image size. In contrast, our efforts are concerned with the migration of live, in-service OS instances on fast networks with only tens of milliseconds of downtime. Other projects that have explored migration over longer time spans by stopping and then transferring include Internet Suspend/Resume [4] and µDenali [5].

Zap [6] uses partial OS virtualization to allow the migration of process domains (pods), essentially process groups, using a modified Linux kernel. Their approach is to isolate all process-to-kernel interfaces, such as file handles and sockets, into a contained namespace that can be migrated. Their approach is considerably faster than results in the Collective work, largely due to the smaller units of migration. However, migration in their system is still on the order of seconds at best, and does not allow live migration; pods are entirely suspended, copied, and then resumed. Furthermore, they do not address the problem of maintaining open connections for existing services.

The live migration system presented here has considerable shared heritage with the previous work on NomadBIOS [7], a virtualization and migration system built on top of the L4 microkernel [8]. NomadBIOS uses pre-copy migration to achieve very short best-case migration downtimes, but makes no attempt at adapting to the writable working set behavior of the migrating OS.

VMware has recently added OS migration support, dubbed VMotion, to their VirtualCenter management software. As this is commercial software and strictly disallows the publication of third-party benchmarks, we are only able to infer its behavior through VMware’s own publications. These limitations make a thorough technical comparison impossible. However, based on the VirtualCenter User’s Manual [9], we believe their approach is generally similar to ours and would expect it to perform to a similar standard.

Process migration, a hot topic in systems research during the 1980s [10, 11, 12, 13, 14], has seen very little use for real-world applications. Milojicic et al. [2] give a thorough survey of possible reasons for this, including the problem of the residual dependencies that a migrated process retains on the machine from which it migrated. Examples of residual dependencies include open file descriptors, shared memory segments, and other local resources. These are undesirable because the original machine must remain available, and because they usually negatively impact the performance of migrated processes.

For example Sprite [15] processes executing on foreign nodes require some system calls to be forwarded to the home node for execution, leading to at best reduced performance and at worst widespread failure if the home node is unavailable. Although various efforts were made to ameliorate performance issues, the underlying reliance on the availability of the home node could not be avoided. A similar fragility occurs with MOSIX [14] where a deputy process on the home node must remain available to support remote execution.


We believe the residual dependency problem cannot easily be solved in any process migration scheme – even modern mobile run-times such as Java and .NET suffer from problems when network partition or machine crash causes class loaders to fail. The migration of entire operating systems inherently involves fewer or zero such dependencies, making it more resilient and robust.

3 Design

At a high level we can consider a virtual machine to encapsulate access to a set of physical resources. Providing live migration of these VMs in a clustered server environment leads us to focus on the physical resources used in such environments: specifically on memory, network and disk.

This section summarizes the design decisions that we have made in our approach to live VM migration. We start by describing how memory and then device access is moved across a set of physical hosts and then go on to a high-level description of how a migration progresses.

3.1 Migrating Memory

Moving the contents of a VM’s memory from one physical host to another can be approached in any number of ways. However, when a VM is running a live service it is important that this transfer occurs in a manner that balances the requirements of minimizing both downtime and total migration time. The former is the period during which the service is unavailable due to there being no currently executing instance of the VM; this period will be directly visible to clients of the VM as service interruption. The latter is the duration between when migration is initiated and when the original VM may be finally discarded and, hence, the source host may potentially be taken down for maintenance, upgrade or repair.

It is easiest to consider the trade-offs between these requirements by generalizing memory transfer into three phases:

Push phase The source VM continues running while certain pages are pushed across the network to the new destination. To ensure consistency, pages modified during this process must be re-sent.

Stop-and-copy phase The source VM is stopped, pages are copied across to the destination VM, then the new VM is started.

Pull phase The new VM executes and, if it accesses a page that has not yet been copied, this page is faulted in (“pulled”) across the network from the source VM.

Although one can imagine a scheme incorporating all three phases, most practical solutions select one or two of the three. For example, pure stop-and-copy [3, 4, 5] involves halting the original VM, copying all pages to the destination, and then starting the new VM. This has advantages in terms of simplicity but means that both downtime and total migration time are proportional to the amount of physical memory allocated to the VM. This can lead to an unacceptable outage if the VM is running a live service.

Another option is pure demand-migration [16] in which a short stop-and-copy phase transfers essential kernel data structures to the destination. The destination VM is then started, and other pages are transferred across the network on first use. This results in a much shorter downtime, but produces a much longer total migration time; and in practice, performance after migration is likely to be unacceptably degraded until a considerable set of pages have been faulted across. Until this time the VM will fault on a high proportion of its memory accesses, each of which initiates a synchronous transfer across the network.

The approach taken in this paper, pre-copy [11] migration, balances these concerns by combining a bounded iterative push phase with a typically very short stop-and-copy phase. By ‘iterative’ we mean that pre-copying occurs in rounds, in which the pages to be transferred during round n are those that are modified during round n−1 (all pages are transferred in the first round). Every VM will have some (hopefully small) set of pages that it updates very frequently and which are therefore poor candidates for pre-copy migration. Hence we bound the number of rounds of pre-copying, based on our analysis of the writable working set (WWS) behavior of typical server workloads, which we present in Section 4.
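To make the round structure concrete, the following sketch (our illustration, not code from the Xen implementation) simulates bounded iterative pre-copy over abstract page numbers; the dirtied_in_round callback stands in for the behaviour of the still-running VM and is an assumption of the example.

```python
# Illustrative sketch of bounded iterative pre-copy (not the Xen code):
# round 1 sends every page, round n sends the pages dirtied during round
# n-1, and whatever remains after the bound is left for stop-and-copy.

def iterative_precopy(all_pages, dirtied_in_round, max_rounds):
    """all_pages: iterable of page numbers owned by the VM.
    dirtied_in_round(n): hypothetical callback returning the pages the
    still-running VM dirtied while round n was being transferred.
    Returns (pages sent in each round, pages left for stop-and-copy)."""
    to_send = set(all_pages)                # round 1: the entire memory image
    sent_per_round = []
    for n in range(1, max_rounds + 1):
        sent_per_round.append(len(to_send))
        to_send = set(dirtied_in_round(n))  # work for the next round
        if not to_send:                     # quiescent VM: nothing left to re-send
            break
    return sent_per_round, to_send          # remainder is the hot WWS

# Example: a VM that keeps re-dirtying pages 0-9 every round.
# print(iterative_precopy(range(1000), lambda n: range(10), max_rounds=4))
```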

Finally, a crucial additional concern for live migration is the impact on active services. For instance, iteratively scanning and sending a VM’s memory image between two hosts in a cluster could easily consume the entire bandwidth available between them and hence starve the active services of resources. This service degradation will occur to some extent during any live migration scheme. We address this issue by carefully controlling the network and CPU resources used by the migration process, thereby ensuring that it does not interfere excessively with active traffic or processing.

3.2 Local Resources

A key challenge in managing the migration of OS instances is what to do about resources that are associated with the physical machine that they are migrating away from. While memory can be copied directly to the new host, connections to local devices such as disks and network interfaces demand additional consideration. The two key problems that we have encountered in this space concern what to do with network resources and local storage.


For network resources, we want a migrated OS to maintain all open network connections without relying on forwarding mechanisms on the original host (which may be shut down following migration), or on support from mobility or redirection mechanisms that are not already present (as in [6]). A migrating VM will include all protocol state (e.g. TCP PCBs), and will carry its IP address with it.

To address these requirements we observed that in a cluster environment, the network interfaces of the source and destination machines typically exist on a single switched LAN. Our solution for managing migration with respect to network in this environment is to generate an unsolicited ARP reply from the migrated host, advertising that the IP has moved to a new location. This will reconfigure peers to send packets to the new physical address, and while a very small number of in-flight packets may be lost, the migrated domain will be able to continue using open connections with almost no observable interference.
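An unsolicited ("gratuitous") ARP reply of this kind can be produced with standard tooling; the sketch below uses the scapy packet library, which is not part of the paper's implementation, and the interface name, IP and MAC values are placeholders.

```python
# Sketch of the unsolicited ARP advertisement described above, using the
# scapy packet library (an assumption; the implementation described in the
# text does not use Python). Interface, IP and MAC values are placeholders.
from scapy.all import Ether, ARP, sendp

def advertise_moved_ip(ip, mac, iface="eth0", count=3):
    # An ARP reply (op=2) claiming 'ip' is at 'mac', broadcast on the LAN so
    # that peers and switch forwarding tables are updated.
    pkt = Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=2, hwsrc=mac, psrc=ip, hwdst="ff:ff:ff:ff:ff:ff", pdst=ip)
    sendp(pkt, iface=iface, count=count, verbose=False)

# advertise_moved_ip("10.0.0.42", "00:16:3e:aa:bb:cc")
```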

Some routers are configured not to accept broadcast ARP replies (in order to prevent IP spoofing), so an unsolicited ARP may not work in all scenarios. If the operating system is aware of the migration, it can opt to send directed replies only to interfaces listed in its own ARP cache, to remove the need for a broadcast. Alternatively, on a switched network, the migrating OS can keep its original Ethernet MAC address, relying on the network switch to detect its move to a new port¹.

¹ Note that on most Ethernet controllers, hardware MAC filtering will have to be disabled if multiple addresses are in use (though some cards support filtering of multiple addresses in hardware) and so this technique is only practical for switched networks.

In the cluster, the migration of storage may be similarly addressed: Most modern data centers consolidate their storage requirements using a network-attached storage (NAS) device, in preference to using local disks in individual servers. NAS has many advantages in this environment, including simple centralised administration, widespread vendor support, and reliance on fewer spindles leading to a reduced failure rate. A further advantage for migration is that it obviates the need to migrate disk storage, as the NAS is uniformly accessible from all host machines in the cluster. We do not address the problem of migrating local-disk storage in this paper, although we suggest some possible strategies as part of our discussion of future work.

3.3 Design Overview

[Figure 1: Migration timeline, showing Stages 0-5 (Pre-Migration, Reservation, Iterative Pre-copy, Stop-and-copy, Commitment, Activation). The VM runs normally on Host A until it is suspended in Stage 3, is out of service during the stop-and-copy downtime, and runs normally on Host B after activation; iterative copying imposes overhead while the VM still runs on Host A.]

The logical steps that we execute when migrating an OS are summarized in Figure 1. We take a conservative approach to the management of migration with regard to safety and failure handling. Although the consequences of hardware failures can be severe, our basic principle is that safe migration should at no time leave a virtual OS more exposed to system failure than when it is running on the original single host. To achieve this, we view the migration process as a transactional interaction between the two hosts involved (a schematic code sketch follows the stage descriptions below):

Stage 0: Pre-Migration We begin with an active VM on physical host A. To speed any future migration, a target host may be preselected where the resources required to receive migration will be guaranteed.

Stage 1: Reservation A request is issued to migrate an OS from host A to host B. We initially confirm that the necessary resources are available on B and reserve a VM container of that size. Failure to secure resources here means that the VM simply continues to run on A unaffected.

Stage 2: Iterative Pre-Copy During the first iteration, all pages are transferred from A to B. Subsequent iterations copy only those pages dirtied during the previous transfer phase.

Stage 3: Stop-and-Copy We suspend the running OS instance at A and redirect its network traffic to B. As described earlier, CPU state and any remaining inconsistent memory pages are then transferred. At the end of this stage there is a consistent suspended copy of the VM at both A and B. The copy at A is still considered to be primary and is resumed in case of failure.

Stage 4: Commitment Host B indicates to A that it has successfully received a consistent OS image. Host A acknowledges this message as commitment of the migration transaction: host A may now discard the original VM, and host B becomes the primary host.

Stage 5: Activation The migrated VM on B is now activated. Post-migration code runs to reattach device drivers to the new machine and advertise moved IP addresses.
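The stage sequence above can also be read as a small piece of control flow. The sketch below is our schematic rendering, not the actual control software; each stage operation is passed in as a callable so that no particular VMM interface is assumed, and any failure before commitment leaves the VM running (or resumable) on host A.

```python
# Schematic sketch (ours) of the failure handling around the migration
# transaction. Each stage is an injected callable, so the skeleton makes no
# assumptions about the underlying VMM interface.

def migrate(reserve, precopy, stop_and_copy, commit,
            resume_on_a, activate_on_b, discard_on_a):
    try:
        reserve()            # Stage 1: Reservation on B
    except Exception:
        return "A"           # VM simply continues to run on A, unaffected
    try:
        precopy()            # Stage 2: iterative pre-copy while the VM runs on A
        stop_and_copy()      # Stage 3: suspend on A, send remaining state
        commit()             # Stage 4: B acknowledges a consistent image
    except Exception:
        resume_on_a()        # abort: the copy on A is still primary
        return "A"
    discard_on_a()           # A may now be decommissioned
    activate_on_b()          # Stage 5: Activation on B
    return "B"

# Example: all stages succeed trivially.
# print(migrate(*([lambda: None] * 7)))   # -> "B"
```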


[Figure 2: WWS curve for a complete run of SPEC CINT2000 (512MB VM). Number of 4KB pages dirtied per 8-second window (0–80,000) against elapsed time in seconds (0–12,000), annotated with the sub-benchmarks gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2 and twolf.]

This approach to failure management ensures that at least one host has a consistent VM image at all times during migration. It depends on the assumption that the original host remains stable until the migration commits, and that the VM may be suspended and resumed on that host with no risk of failure. Based on these assumptions, a migration request essentially attempts to move the VM to a new host, and on any sort of failure execution is resumed locally, aborting the migration.

4 Writable Working Sets

When migrating a live operating system, the most significant influence on service performance is the overhead of coherently transferring the virtual machine’s memory image. As mentioned previously, a simple stop-and-copy approach will achieve this in time proportional to the amount of memory allocated to the VM. Unfortunately, during this time any running services are completely unavailable.

A more attractive alternative is pre-copy migration, in which the memory image is transferred while the operating system (and hence all hosted services) continue to run. The drawback, however, is the wasted overhead of transferring memory pages that are subsequently modified, and hence must be transferred again. For many workloads there will be a small set of memory pages that are updated very frequently, and which it is not worth attempting to maintain coherently on the destination machine before stopping and copying the remainder of the VM.

The fundamental question for iterative pre-copy migration is: how does one determine when it is time to stop the pre-copy phase because too much time and resource is being wasted? Clearly if the VM being migrated never modifies memory, a single pre-copy of each memory page will suffice to transfer a consistent image to the destination. However, should the VM continuously dirty pages faster than the rate of copying, then all pre-copy work will be in vain and one should immediately stop and copy.

In practice, one would expect most workloads to lie somewhere between these extremes: a certain (possibly large) set of pages will seldom or never be modified and hence are good candidates for pre-copy, while the remainder will be written often and so should best be transferred via stop-and-copy – we dub this latter set of pages the writable working set (WWS) of the operating system by obvious extension of the original working set concept [17].

In this section we analyze the WWS of operating systems running a range of different workloads in an attempt to obtain some insight to allow us to build heuristics for an efficient and controllable pre-copy implementation.

4.1 Measuring Writable Working Sets

To trace the writable working set behaviour of a number of representative workloads we used Xen’s shadow page tables (see Section 5) to track dirtying statistics on all pages used by a particular executing operating system. This allows us to determine within any time period the set of pages written to by the virtual machine.

Using the above, we conducted a set of experiments to sample the writable working set size for a variety of benchmarks.


[Figure 3: Expected downtime due to last-round memory copy on traced page dirtying of a Linux kernel compile. Three panels, for migration throughputs of 128 Mbit/sec, 256 Mbit/sec and 512 Mbit/sec, plot the rate of page dirtying (pages/sec, up to 9000) and the expected downtime (sec, 0–4) against elapsed time (0–600 sec).]

[Figure 4: Expected downtime due to last-round memory copy on traced page dirtying of OLTP. Three panels, for migration throughputs of 128 Mbit/sec, 256 Mbit/sec and 512 Mbit/sec, plot the rate of page dirtying (pages/sec, up to 8000) and the expected downtime (sec, 0–4) against elapsed time (0–1200 sec).]

[Figure 5: Expected downtime due to last-round memory copy on traced page dirtying of a Quake 3 server. Three panels, for migration throughputs of 128 Mbit/sec, 256 Mbit/sec and 512 Mbit/sec, plot the rate of page dirtying (pages/sec, up to 600) and the expected downtime (sec, 0–0.5) against elapsed time (0–500 sec).]

[Figure 6: Expected downtime due to last-round memory copy on traced page dirtying of SPECweb. Three panels, for migration throughputs of 128 Mbit/sec, 256 Mbit/sec and 512 Mbit/sec, plot the rate of page dirtying (pages/sec, up to 14000) and the expected downtime (sec, 0–9) against elapsed time (0–700 sec).]


Xen was running on a dual processor Intel Xeon 2.4GHz machine, and the virtual machine being measured had a memory allocation of 512MB. In each case we started the relevant benchmark in one virtual machine and read the dirty bitmap every 50ms from another virtual machine, cleaning it every 8 seconds – in essence this allows us to compute the WWS with a (relatively long) 8 second window, but estimate it at a finer (50ms) granularity.
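The sampling scheme fits in a few lines of code. The sketch below is our reconstruction of the measurement loop rather than the harness actually used; read_dirty_bitmap and clear_dirty_bitmap are hypothetical callbacks standing in for the Xen dirty-bitmap interface, and the loop is assumed to be driven at a nominal 50ms period.

```python
# Sketch (our reconstruction) of WWS sampling: read the dirty bitmap every
# 50ms and clear it every 8 seconds, so each sample reports how many pages
# have been dirtied so far in the current 8-second window.

def sample_wws(read_dirty_bitmap, clear_dirty_bitmap, num_samples,
               sample_ms=50, window_ms=8000):
    """read_dirty_bitmap(): hypothetical callback returning the set of page
    numbers dirtied since the bitmap was last cleared.
    Returns a list of (nominal elapsed ms, pages dirty in current window)."""
    samples = []
    samples_per_window = window_ms // sample_ms
    for i in range(num_samples):
        dirty = read_dirty_bitmap()            # cumulative since last clear
        samples.append((i * sample_ms, len(dirty)))
        if (i + 1) % samples_per_window == 0:
            clear_dirty_bitmap()               # start a fresh 8-second window
    return samples
```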

The benchmarks we ran were SPEC CINT2000, a Linux kernel compile, the OSDB OLTP benchmark using PostgreSQL and SPECweb99 using Apache. We also measured a Quake 3 server as we are particularly interested in highly interactive workloads.

Figure 2 illustrates the writable working set curve produced for the SPEC CINT2000 benchmark run. This benchmark involves running a series of smaller programs in order and measuring the overall execution time. The x-axis measures elapsed time, and the y-axis shows the number of 4KB pages of memory dirtied within the corresponding 8 second interval; the graph is annotated with the names of the sub-benchmark programs.

From this data we observe that the writable working set varies significantly between the different sub-benchmarks. For programs such as ‘eon’ the WWS is a small fraction of the total working set and hence is an excellent candidate for migration. In contrast, ‘gap’ has a consistently high dirtying rate and would be problematic to migrate. The other benchmarks go through various phases but are generally amenable to live migration. Thus performing a migration of an operating system will give different results depending on the workload and the precise moment at which migration begins.

4.2 Estimating Migration Effectiveness

We observed that we could use the trace data acquired to estimate the effectiveness of iterative pre-copy migration for various workloads. In particular we can simulate a particular network bandwidth for page transfer, determine how many pages would be dirtied during a particular iteration, and then repeat for successive iterations. Since we know the approximate WWS behaviour at every point in time, we can estimate the overall amount of data transferred in the final stop-and-copy round and hence estimate the downtime.
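The estimation procedure can be sketched as a small simulation. The code below is our simplification, not the analysis scripts behind Figures 3–6: it assumes a constant dirtying rate in place of the measured trace, and its parameter values in the example are illustrative only.

```python
# Sketch (under our own simplifying assumptions) of estimating expected
# downtime: each pre-copy round's duration follows from the pages it must
# send and the link bandwidth, and the pages dirtied meanwhile become the
# next round's work; the leftover after the final round sets the downtime.

PAGE_BITS = 4096 * 8   # 4KB pages

def estimate_downtime(total_pages, dirty_rate_pps, bandwidth_bps, rounds):
    """dirty_rate_pps: assumed constant page-dirtying rate (pages/sec).
    Returns estimated stop-and-copy downtime (seconds) after 'rounds'
    pre-copy iterations."""
    pages_per_sec = bandwidth_bps / PAGE_BITS
    to_send = total_pages                      # round 1 sends everything
    for _ in range(rounds):
        duration = to_send / pages_per_sec
        # Pages dirtied while this round was in flight, capped at the VM size.
        to_send = min(total_pages, dirty_rate_pps * duration)
    return to_send / pages_per_sec             # final stop-and-copy transfer

# Example: 512MB VM (131072 pages), 2000 pages/sec dirtied, 256 Mbit/sec link.
# print(estimate_downtime(131072, 2000, 256e6, rounds=4))
```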

Figures 3–6 show our results for the four remaining workloads. Each figure comprises three graphs, each of which corresponds to a particular network bandwidth limit for page transfer; each individual graph shows the WWS histogram (in light gray) overlaid with four line plots estimating service downtime for up to four pre-copying rounds.

Looking at the topmost line (one pre-copy iteration), the first thing to observe is that pre-copy migration always performs considerably better than naive stop-and-copy. For a 512MB virtual machine this latter approach would require 32, 16, and 8 seconds downtime for the 128Mbit/sec, 256Mbit/sec and 512Mbit/sec bandwidths respectively. Even in the worst case (the starting phase of SPECweb), a single pre-copy iteration reduces downtime by a factor of four. In most cases we can expect to do considerably better – for example both the Linux kernel compile and the OLTP benchmark typically experience a reduction in downtime of at least a factor of sixteen.

The remaining three lines show, in order, the effect of performing a total of two, three or four pre-copy iterations prior to the final stop-and-copy round. In most cases we see an increased reduction in downtime from performing these additional iterations, although with somewhat diminishing returns, particularly in the higher bandwidth cases.

This is because all the observed workloads exhibit a small but extremely frequently updated set of ‘hot’ pages. In practice these pages will include the stack and local variables being accessed within the currently executing processes as well as pages being used for network and disk traffic. The hottest pages will be dirtied at least as fast as we can transfer them, and hence must be transferred in the final stop-and-copy phase. This puts a lower bound on the best possible service downtime for a particular benchmark, network bandwidth and migration start time.

This interesting tradeoff suggests that it may be worthwhile increasing the amount of bandwidth used for page transfer in later (and shorter) pre-copy iterations. We will describe our rate-adaptive algorithm based on this observation in Section 5, and demonstrate its effectiveness in Section 6.

5 Implementation Issues

We designed and implemented our pre-copying migration engine to integrate with the Xen virtual machine monitor [1]. Xen securely divides the resources of the host machine amongst a set of resource-isolated virtual machines each running a dedicated OS instance. In addition, there is one special management virtual machine used for the administration and control of the machine.

We considered two different methods for initiating and managing state transfer. These illustrate two extreme points in the design space: managed migration is performed largely outside the migratee, by a migration daemon running in the management VM; in contrast, self migration is implemented almost entirely within the migratee OS with only a small stub required on the destination machine.

In the following sections we describe some of the implementation details of these two approaches. We describe how we use dynamic network rate-limiting to effectively balance network contention against OS downtime. We then proceed to describe how we ameliorate the effects of rapid page dirtying, and describe some performance enhancements that become possible when the OS is aware of its migration — either through the use of self migration, or by adding explicit paravirtualization interfaces to the VMM.

5.1 Managed Migration

Managed migration is performed by migration daemons running in the management VMs of the source and destination hosts. These are responsible for creating a new VM on the destination machine, and coordinating transfer of live system state over the network.

When transferring the memory image of the still-running OS, the control software performs rounds of copying in which it performs a complete scan of the VM’s memory pages. Although in the first round all pages are transferred to the destination machine, in subsequent rounds this copying is restricted to pages that were dirtied during the previous round, as indicated by a dirty bitmap that is copied from Xen at the start of each round.

During normal operation the page tables managed by each guest OS are the ones that are walked by the processor’s MMU to fill the TLB. This is possible because guest OSes are exposed to real physical addresses and so the page tables they create do not need to be mapped to physical addresses by Xen.

To log pages that are dirtied, Xen inserts shadow page tables underneath the running OS. The shadow tables are populated on demand by translating sections of the guest page tables. Translation is very simple for dirty logging: all page-table entries (PTEs) are initially read-only mappings in the shadow tables, regardless of what is permitted by the guest tables. If the guest tries to modify a page of memory, the resulting page fault is trapped by Xen. If write access is permitted by the relevant guest PTE then this permission is extended to the shadow PTE. At the same time, we set the appropriate bit in the VM’s dirty bitmap.
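The write-fault path can be modelled in a few lines. The following is an illustrative model of the logic just described, not Xen's actual fault handler; page-table state is reduced to Python dictionaries purely for the sake of the sketch.

```python
# Illustrative model (not Xen code) of the dirty-logging write-fault path:
# shadow PTEs start read-only; a write fault promotes the shadow mapping only
# if the guest PTE permits writes, and records the page in the dirty bitmap.

def handle_write_fault(pfn, guest_pte_writable, shadow_writable, dirty_bitmap):
    """guest_pte_writable / shadow_writable: dicts mapping page frame number
    to a bool; dirty_bitmap: set of dirtied pfns. Returns True if the fault
    was a benign logging fault, False if it is a genuine protection fault
    that should be forwarded to the guest OS."""
    if not guest_pte_writable.get(pfn, False):
        return False                 # the guest itself forbids this write
    shadow_writable[pfn] = True      # extend write permission to the shadow PTE
    dirty_bitmap.add(pfn)            # log the page as dirty for this round
    return True
```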

When the bitmap is copied to the control software at the start of each pre-copying round, Xen’s bitmap is cleared and the shadow page tables are destroyed and recreated as the migratee OS continues to run. This causes all write permissions to be lost: all pages that are subsequently updated are then added to the now-clear dirty bitmap.

When it is determined that the pre-copy phase is no longer beneficial, using heuristics derived from the analysis in Section 4, the OS is sent a control message requesting that it suspend itself in a state suitable for migration. This causes the OS to prepare for resumption on the destination machine; Xen informs the control software once the OS has done this. The dirty bitmap is scanned one last time for remaining inconsistent memory pages, and these are transferred to the destination together with the VM’s checkpointed CPU-register state.

Once this final information is received at the destination, the VM state on the source machine can safely be discarded. Control software on the destination machine scans the memory map and rewrites the guest’s page tables to reflect the addresses of the memory pages that it has been allocated. Execution is then resumed by starting the new VM at the point that the old VM checkpointed itself. The OS then restarts its virtual device drivers and updates its notion of wallclock time.

Since the transfer of pages is OS agnostic, we can easily support any guest operating system – all that is required is a small paravirtualized stub to handle resumption. Our implementation currently supports Linux 2.4, Linux 2.6 and NetBSD 2.0.

5.2 Self Migration

In contrast to the managed method described above, self migration [18] places the majority of the implementation within the OS being migrated. In this design no modifications are required either to Xen or to the management software running on the source machine, although a migration stub must run on the destination machine to listen for incoming migration requests, create an appropriate empty VM, and receive the migrated system state.

The pre-copying scheme that we implemented for self migration is conceptually very similar to that for managed migration. At the start of each pre-copying round every page mapping in every virtual address space is write-protected. The OS maintains a dirty bitmap tracking dirtied physical pages, setting the appropriate bits as write faults occur. To discriminate migration faults from other possible causes (for example, copy-on-write faults, or access-permission faults) we reserve a spare bit in each PTE to indicate that it is write-protected only for dirty-logging purposes.

The major implementation difficulty of this scheme is to transfer a consistent OS checkpoint. In contrast with a managed migration, where we simply suspend the migratee to obtain a consistent checkpoint, self migration is far harder because the OS must continue to run in order to transfer its final state. We solve this difficulty by logically checkpointing the OS on entry to a final two-stage stop-and-copy phase. The first stage disables all OS activity except for migration and then performs a final scan of the dirty bitmap, clearing the appropriate bit as each page is transferred. Any pages that are dirtied during the final scan, and that are still marked as dirty in the bitmap, are copied to a shadow buffer. The second and final stage then transfers the contents of the shadow buffer — page updates are ignored during this transfer.
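The two-stage final copy can be sketched as follows; this is our rendering of the logic rather than the actual self-migration code, and read_page/send_page are hypothetical callbacks. The concurrent updates to the dirty set by write faults are only implied by the model.

```python
# Sketch (ours, hypothetical callbacks) of the two-stage final copy used by
# self migration: stage one scans and clears the dirty bitmap while the OS
# still runs minimally; pages re-dirtied during that scan are snapshotted
# into a shadow buffer, which stage two then sends with further updates
# ignored.

def final_two_stage_copy(dirty, read_page, send_page):
    """dirty: set of dirty pfns (conceptually updated by concurrent write
    faults); read_page(pfn) -> bytes; send_page(pfn, data) transmits."""
    for pfn in list(dirty):            # stage one: final scan of the bitmap
        dirty.discard(pfn)             # clear the bit as the page is sent
        send_page(pfn, read_page(pfn))
    shadow = {}
    for pfn in list(dirty):            # pages dirtied again during the scan
        shadow[pfn] = read_page(pfn)   # snapshot into the shadow buffer
    for pfn, data in shadow.items():
        send_page(pfn, data)           # stage two: updates now ignored
    return len(shadow)
```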


5.3 Dynamic Rate-Limiting

It is not always appropriate to select a single network bandwidth limit for migration traffic. Although a low limit avoids impacting the performance of running services, analysis in Section 4 showed that we must eventually pay in the form of an extended downtime because the hottest pages in the writable working set are not amenable to pre-copy migration. The downtime can be reduced by increasing the bandwidth limit, albeit at the cost of additional network contention.

Our solution to this impasse is to dynamically adapt the bandwidth limit during each pre-copying round. The administrator selects a minimum and a maximum bandwidth limit. The first pre-copy round transfers pages at the minimum bandwidth. Each subsequent round counts the number of pages dirtied in the previous round, and divides this by the duration of the previous round to calculate the dirtying rate. The bandwidth limit for the next round is then determined by adding a constant increment to the previous round’s dirtying rate — we have empirically determined that 50Mbit/sec is a suitable value. We terminate pre-copying when the calculated rate is greater than the administrator’s chosen maximum, or when less than 256KB remains to be transferred. During the final stop-and-copy phase we minimize service downtime by transferring memory at the maximum allowable rate.
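The per-round adaptation rule can be written in a few lines. In the sketch below the 50Mbit/sec increment comes from the text, while the minimum and maximum limits are placeholder values standing in for the administrator's configuration; the surrounding control loop is ours.

```python
# Sketch of the per-round bandwidth adaptation described above.

PAGE_BITS = 4096 * 8          # 4KB pages
MIN_RATE_MBPS = 100           # placeholder administrator-chosen minimum
MAX_RATE_MBPS = 500           # placeholder administrator-chosen maximum
INCREMENT_MBPS = 50           # constant increment from the text

def next_round_rate(pages_dirtied_last_round, last_round_secs):
    """Return (bandwidth limit for the next round in Mbit/sec, stop flag).
    The stop flag means pre-copy should terminate and the final stop-and-copy
    should run at the maximum allowable rate."""
    dirty_rate_mbps = pages_dirtied_last_round * PAGE_BITS / last_round_secs / 1e6
    rate = dirty_rate_mbps + INCREMENT_MBPS
    if rate > MAX_RATE_MBPS:
        return MAX_RATE_MBPS, True    # dirtying outpaces the allowed ceiling
    return max(rate, MIN_RATE_MBPS), False

# (Pre-copy also terminates when less than 256KB remains to be transferred.)
```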

As we will show in Section 6, using this adaptive scheme results in the bandwidth usage remaining low during the transfer of the majority of the pages, increasing only at the end of the migration to transfer the hottest pages in the WWS. This effectively balances short downtime with low average network contention and CPU usage.

5.4 Rapid Page Dirtying

Our working-set analysis in Section 4 shows that every OS workload has some set of pages that are updated extremely frequently, and which are therefore not good candidates for pre-copy migration even when using all available network bandwidth. We observed that rapidly-modified pages are very likely to be dirtied again by the time we attempt to transfer them in any particular pre-copying round. We therefore periodically ‘peek’ at the current round’s dirty bitmap and transfer only those pages dirtied in the previous round that have not been dirtied again at the time we scan them.

We further observed that page dirtying is often physically clustered — if a page is dirtied then it is disproportionally likely that a close neighbour will be dirtied soon after. This increases the likelihood that, if our peeking does not detect one page in a cluster, it will detect none. To avoid this unfortunate behaviour we scan the VM’s physical memory space in a pseudo-random order.

[Figure 7: Rogue-process detection during migration of a Linux kernel build. Transferred 4KB pages per iteration, over up to 17 iterations; after the twelfth iteration a maximum limit of forty write faults is imposed on every process, drastically reducing the total writable working set.]
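Both refinements of Section 5.4 (peeking at the current round's bitmap and the pseudo-random scan order) fit in a few lines. The sketch below is our illustration, not the implementation.

```python
# Sketch of the transfer-order refinements: skip pages that the peeked
# current-round bitmap shows are already dirty again, and visit page frames
# in a pseudo-random order so that clusters of dirtying are not missed wholesale.
import random

def pages_to_transfer(previous_round_dirty, current_round_dirty, seed=0):
    """Yield the page frame numbers worth sending in this round."""
    order = list(previous_round_dirty)
    random.Random(seed).shuffle(order)     # pseudo-random physical scan order
    for pfn in order:
        if pfn in current_round_dirty:
            continue                       # re-dirtied already; it will be sent later anyway
        yield pfn
```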

5.5 Paravirtualized Optimizations

One key benefit of paravirtualization is that operating systems can be made aware of certain important differences between the real and virtual environments. In terms of migration, this allows a number of optimizations by informing the operating system that it is about to be migrated – at this stage a migration stub handler within the OS could help improve performance in at least the following ways:

Stunning Rogue Processes. Pre-copy migration works best when memory pages can be copied to the destination host faster than they are dirtied by the migrating virtual machine. This may not always be the case – for example, a test program which writes one word in every page was able to dirty memory at a rate of 320 Gbit/sec, well ahead of the transfer rate of any Ethernet interface. This is a synthetic example, but there may well be cases in practice in which pre-copy migration is unable to keep up, or where migration is prolonged unnecessarily by one or more ‘rogue’ applications.

In both the managed and self migration cases, we can mitigate this risk by forking a monitoring thread within the OS kernel when migration begins. As it runs within the OS, this thread can monitor the WWS of individual processes and take action if required. We have implemented a simple version of this which simply limits each process to 40 write faults before being moved to a wait queue – in essence we ‘stun’ processes that make migration difficult. This technique works well, as shown in Figure 7, although one must be careful not to stun important interactive services.
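The stunning policy amounts to simple per-process bookkeeping. The sketch below models it in isolation with hypothetical data structures, not the actual kernel monitoring thread; the exemption set reflects the caution above about interactive services and is our own addition.

```python
# Sketch (hypothetical data structures) of the stunning policy: once a
# process exceeds a write-fault budget during migration, it is parked on a
# wait queue until migration completes.

WRITE_FAULT_LIMIT = 40        # limit used in the experiments described above

def on_write_fault(pid, fault_counts, stunned, interactive):
    """fault_counts: dict pid -> write faults seen during this migration;
    stunned: set of parked pids; interactive: pids exempted from stunning
    (our assumption, reflecting the caution about interactive services)."""
    fault_counts[pid] = fault_counts.get(pid, 0) + 1
    if fault_counts[pid] > WRITE_FAULT_LIMIT and pid not in interactive:
        stunned.add(pid)      # move to a wait queue until migration finishes
        return "stunned"
    return "running"
```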

Freeing Page Cache Pages. A typical operating system will have a number of ‘free’ pages at any time, ranging from truly free (page allocator) to cold buffer cache pages. When informed a migration is to begin, the OS can simply return some or all of these pages to Xen in the same way it would when using the ballooning mechanism described in [1]. This means that the time taken for the first “full pass” iteration of pre-copy migration can be reduced, sometimes drastically. However, should the contents of these pages be needed again, they will need to be faulted back in from disk, incurring greater overall cost.

6 Evaluation

In this section we present a thorough evaluation of our implementation on a wide variety of workloads. We begin by describing our test setup, and then go on to explore the migration of several workloads in detail. Note that none of the experiments in this section use the paravirtualized optimizations discussed above since we wished to measure the baseline performance of our system.

6.1 Test Setup

We perform test migrations between an identical pair of Dell PE-2650 server-class machines, each with dual Xeon 2GHz CPUs and 2GB memory. The machines have Broadcom TG3 network interfaces and are connected via switched Gigabit Ethernet. In these experiments only a single CPU was used, with HyperThreading enabled. Storage is accessed via the iSCSI protocol from a NetApp F840 network attached storage server except where noted otherwise. We used XenLinux 2.4.27 as the operating system in all cases.

6.2 Simple Web Server

We begin our evaluation by examining the migration of an Apache 1.3 web server serving static content at a high rate. Figure 8 illustrates the throughput achieved when continuously serving a single 512KB file to a set of one hundred concurrent clients. The web server virtual machine has a memory allocation of 800MB.

At the start of the trace, the server achieves a consistent throughput of approximately 870Mbit/sec. Migration starts twenty-seven seconds into the trace but is initially rate-limited to 100Mbit/sec (12% CPU), resulting in the server throughput dropping to 765Mbit/s. This initial low-rate pass transfers 776MB and lasts for 62 seconds, at which point the migration algorithm described in Section 5 increases its rate over several iterations and finally suspends the VM after a further 9.8 seconds. The final stop-and-copy phase then transfers the remaining pages and the web server resumes at full rate after a 165ms outage.

This simple example demonstrates that a highly loaded server can be migrated with both controlled impact on live services and a short downtime. However, the working set of the server in this case is rather small, and so this should be expected to be a relatively easy case for live migration.

6.3 Complex Web Workload: SPECweb99

A more challenging Apache workload is presented by SPECweb99, a complex application-level benchmark for evaluating web servers and the systems that host them. The workload is a complex mix of page requests: 30% require dynamic content generation, 16% are HTTP POST operations, and 0.5% execute a CGI script. As the server runs, it generates access and POST logs, contributing to disk (and therefore network) throughput.

A number of client machines are used to generate the load for the server under test, with each machine simulating a collection of users concurrently accessing the web site. SPECweb99 defines a minimum quality of service that each user must receive for it to count as ‘conformant’: an aggregate bandwidth in excess of 320Kbit/sec over a series of requests. The SPECweb score received is the number of conformant users that the server successfully maintains. The considerably more demanding workload of SPECweb represents a challenging candidate for migration.

We benchmarked a single VM running SPECweb and recorded a maximum score of 385 conformant clients — we used the RedHat gnbd network block device in place of iSCSI as the lighter-weight protocol achieves higher performance. Since at this point the server is effectively in overload, we then relaxed the offered load to 90% of maximum (350 conformant connections) to represent a more realistic scenario.

Using a virtual machine configured with 800MB of memory, we migrated a SPECweb99 run in the middle of its execution. Figure 9 shows a detailed analysis of this migration. The x-axis shows time elapsed since start of migration, while the y-axis shows the network bandwidth being used to transfer pages to the destination. Darker boxes illustrate the page transfer process while lighter boxes show the pages dirtied during each iteration. Our algorithm adjusts the transfer rate relative to the page dirty rate observed during the previous round (denoted by the height of the lighter boxes).

[Figure 8: Results of migrating a running web server VM. Effect of migration on web server transmission rate: throughput (Mbit/sec) of a server delivering 512KB files to 100 concurrent clients, sampled over 100ms and 500ms windows, against elapsed time (secs). Throughput falls from 870 Mbit/sec to 765 Mbit/sec during the 62-second first pre-copy pass and to 694 Mbit/sec during 9.8 seconds of further iterations; total downtime is 165ms.]

[Figure 9: Results of migrating a running SPECweb VM. Iterative progress of live migration: SPECweb99, 350 clients (90% of max load), 800MB VM; total data transmitted 960MB (x1.20). Transfer rate (Mbit/sec) against elapsed time (sec); the area of each bar shows the VM memory transferred and the memory dirtied during each iteration. The first iteration is a long, relatively low-rate transfer of 676.8 MB in 54.1 seconds, allowing non-writable working set data to be transferred with low impact on active services; successive rounds transfer 126.7 MB, 39.0 MB, 28.4 MB, 24.2 MB, 16.7 MB, 14.2 MB, 15.3 MB and finally 18.2 MB. In the final iteration the domain is suspended, the remaining 18.2 MB of dirty pages are sent in 201ms, and a further 9ms elapse while the VM starts up, for a total downtime of 210ms.]

As in the case of the static web server, migration begins with a long period of low-rate transmission as a first pass is made through the memory of the virtual machine. This first round takes 54.1 seconds and transmits 676.8MB of memory. Two more low-rate rounds follow, transmitting 126.7MB and 39.0MB respectively before the transmission rate is increased.

The remainder of the graph illustrates how the adaptive algorithm tracks the page dirty rate over successively shorter iterations before finally suspending the VM. When suspension takes place, 18.2MB of memory remains to be sent. This transmission takes 201ms, after which an additional 9ms is required for the domain to resume normal execution.

The total downtime of 210ms experienced by the SPECweb clients is sufficiently brief to maintain the 350 conformant clients. This result is an excellent validation of our approach: a heavily (90% of maximum) loaded server is migrated to a separate physical host with a total migration time of seventy-one seconds. Furthermore the migration does not interfere with the quality of service demanded by SPECweb’s workload. This illustrates the applicability of migration as a tool for administrators of demanding live services.

6.4 Low-Latency Server: Quake 3

Another representative application for hosting environments is a multiplayer on-line game server. To determine the effectiveness of our approach in this case we configured a virtual machine with 64MB of memory running a Quake 3 server. Six players joined the game and started to play within a shared arena, at which point we initiated a migration to another machine. A detailed analysis of this migration is shown in Figure 11.

[Figure 10: Effect on packet response time of migrating a running Quake 3 server VM. Packet interarrival time (secs, 0–0.12) against elapsed time (secs, 0–70); the two migrations appear as brief spikes, with downtimes of 50ms and 48ms respectively.]

[Figure 11: Results of migrating a running Quake 3 server VM. Iterative progress of live migration: Quake 3 server, 6 clients, 64MB VM; total data transmitted 88MB (x1.37). Transfer rate (Mbit/sec) against elapsed time (sec); the area of each bar shows the VM memory transferred and the memory dirtied during each iteration. Successive rounds transfer 56.3 MB, 20.4 MB, 4.6 MB, 1.6 MB, 1.2 MB, 0.9 MB, 1.2 MB, 1.1 MB, 0.8 MB, 0.2 MB and 0.1 MB. The final iteration leaves only 148KB to transmit; copying this last round takes 20ms and a further 40ms is spent on start-up overhead, giving a total downtime of 60ms.]

The trace illustrates a generally similar progression as for SPECweb, although in this case the amount of data to be transferred is significantly smaller. Once again the transfer rate increases as the trace progresses, although the final stop-and-copy phase transfers so little data (148KB) that the full bandwidth is not utilized.

Overall, we are able to perform the live migration with a total downtime of 60ms. To determine the effect of migration on the live players, we performed an additional experiment in which we migrated the running Quake 3 server twice and measured the inter-arrival time of packets received by clients. The results are shown in Figure 10. As can be seen, from the client point of view migration manifests itself as a transient increase in response time of 50ms. In neither case was this perceptible to the players.

6.5 A Diabolical Workload: MMuncher

As a final point in our evaluation, we consider the situation in which a virtual machine is writing to memory faster than can be transferred across the network. We test this diabolical case by running a 512MB host with a simple C program that writes constantly to a 256MB region of memory. The results of this migration are shown in Figure 12.

[Figure 12: Results of migrating a VM running a diabolical workload. Iterative progress of live migration: 512MB VM with constant writes to a 256MB region; total data transmitted 638MB (x1.25). Transfer rate (Mbit/sec, up to 1000) against elapsed time (sec, 0–25); the area of each bar shows the VM memory transferred and the memory dirtied during each iteration, with bars labelled 255.4 MB, 44.0 MB, 116.0 MB and 222.5 MB. In the first iteration the workload dirties half of memory while the other half is transmitted, so both bars are equal.]

In the first iteration of this workload, we see that half of the memory has been transmitted, while the other half is immediately marked dirty by our test program. Our algorithm attempts to adapt to this by scaling itself relative to the perceived initial rate of dirtying; this scaling proves insufficient, as the rate at which the memory is being written becomes apparent. In the third round, the transfer rate is scaled up to 500Mbit/s in a final attempt to outpace the memory writer. As this last attempt is still unsuccessful, the virtual machine is suspended, and the remaining dirty pages are copied, resulting in a downtime of 3.5 seconds. Fortunately such dirtying rates appear to be rare in real workloads.

7 Future Work

Although our solution is well-suited for the environment we have targeted – a well-connected data-center or cluster with network-accessed storage – there are a number of areas in which we hope to carry out future work. This would allow us to extend live migration to wide-area networks, and to environments that cannot rely solely on network-attached storage.

7.1 Cluster Management

In a cluster environment where a pool of virtual machines is hosted on a smaller set of physical servers, there are great opportunities for dynamic load balancing of processor, memory and networking resources. A key challenge is to develop cluster control software which can make informed decisions as to the placement and movement of virtual machines.

A special case of this is ‘evacuating’ VMs from a node that is to be taken down for scheduled maintenance. A sensible approach to achieving this is to migrate the VMs in increasing order of their observed WWS. Since each VM migrated frees resources on the node, additional CPU and network becomes available for those VMs which need it most. We are in the process of building a cluster controller for Xen systems.

7.2 Wide Area Network Redirection

Our layer 2 redirection scheme works efficiently and with remarkably low outage on modern gigabit networks. However, when migrating outside the local subnet this mechanism will not suffice. Instead, either the OS will have to obtain a new IP address which is within the destination subnet, or some kind of indirection layer, on top of IP, must exist. Since this problem is already familiar to laptop users, a number of different solutions have been suggested. One of the more prominent approaches is that of Mobile IP [19] where a node on the home network (the home agent) forwards packets destined for the client (mobile node) to a care-of address on the foreign network. As with all residual dependencies this can lead to both performance problems and additional failure modes.

Snoeren and Balakrishnan [20] suggest addressing the problem of connection migration at the TCP level, augmenting TCP with a secure token negotiated at connection time, to which a relocated host can refer in a special SYN packet requesting reconnection from a new IP address. Dynamic DNS updates are suggested as a means of locating hosts after a move.

7.3 Migrating Block Devices

Although NAS prevails in the modern data center, some environments may still make extensive use of local disks. These present a significant problem for migration as they are usually considerably larger than volatile memory. If the entire contents of a disk must be transferred to a new host before migration can complete, then total migration times may be intolerably extended.

This latency can be avoided at migration time by arranging to mirror the disk contents at one or more remote hosts. For example, we are investigating using the built-in software RAID and iSCSI functionality of Linux to implement disk mirroring before and during OS migration. We imagine a similar use of software RAID-5, in cases where data on disks requires a higher level of availability. Multiple hosts can act as storage targets for one another, increasing availability at the cost of some network traffic.

The effective management of local storage for clusters of virtual machines is an interesting problem that we hope to further explore in future work. As virtual machines will typically work from a small set of common system images (for instance a generic Fedora Linux installation) and make individual changes above this, there seems to be opportunity to manage copy-on-write system images across a cluster in a way that facilitates migration, allows replication, and makes efficient use of local disks.


8 Conclusion

By integrating live OS migration into the Xen virtual machine monitor we enable rapid movement of interactive workloads within clusters and data centers. Our dynamic network-bandwidth adaptation allows migration to proceed with minimal impact on running services, while reducing total downtime to below discernible thresholds.

Our comprehensive evaluation shows that realistic server workloads such as SPECweb99 can be migrated with just 210ms downtime, while a Quake 3 game server is migrated with an imperceptible 60ms outage.

References

[1] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the nineteenth ACM Symposium on Operating Systems Principles (SOSP-19), pages 164–177. ACM Press, 2003.

[2] D. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration. ACM Computing Surveys, 32(3):241–299, 2000.

[3] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proc. of the 5th Symposium on Operating Systems Design and Implementation (OSDI-02), December 2002.

[4] M. Kozuch and M. Satyanarayanan. Internet suspend/resume. In Proceedings of the IEEE Workshop on Mobile Computing Systems and Applications, 2002.

[5] Andrew Whitaker, Richard S. Cox, Marianne Shaw, and Steven D. Gribble. Constructing services with interposable virtual hardware. In Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI '04), 2004.

[6] S. Osman, D. Subhraveti, G. Su, and J. Nieh. The design and implementation of Zap: A system for migrating computing environments. In Proc. 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI-02), pages 361–376, December 2002.

[7] Jacob G. Hansen and Asger K. Henriksen. Nomadic operating systems. Master's thesis, Dept. of Computer Science, University of Copenhagen, Denmark, 2002.

[8] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, and Sebastian Schönberg. The performance of microkernel-based systems. In Proceedings of the sixteenth ACM Symposium on Operating System Principles, pages 66–77. ACM Press, 1997.

[9] VMware, Inc. VMware VirtualCenter Version 1.2 User's Manual. 2004.

[10] Michael L. Powell and Barton P. Miller. Process migration in DEMOS/MP. In Proceedings of the ninth ACM Symposium on Operating System Principles, pages 110–119. ACM Press, 1983.

[11] Marvin M. Theimer, Keith A. Lantz, and David R. Cheriton. Preemptable remote execution facilities for the V-System. In Proceedings of the tenth ACM Symposium on Operating System Principles, pages 2–12. ACM Press, 1985.

[12] Eric Jul, Henry Levy, Norman Hutchinson, and Andrew Black. Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1):109–133, 1988.

[13] Fred Douglis and John K. Ousterhout. Transparent process migration: Design alternatives and the Sprite implementation. Software – Practice and Experience, 21(8):757–785, 1991.

[14] A. Barak and O. La'adan. The MOSIX multicomputer operating system for high performance cluster computing. Journal of Future Generation Computer Systems, 13(4-5):361–372, March 1998.

[15] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch. The Sprite network operating system. IEEE Computer, 21(2), 1988.

[16] E. Zayas. Attacking the process migration bottleneck. In Proceedings of the eleventh ACM Symposium on Operating Systems Principles, pages 13–24. ACM Press, 1987.

[17] Peter J. Denning. Working sets past and present. IEEE Transactions on Software Engineering, SE-6(1):64–84, January 1980.

[18] Jacob G. Hansen and Eric Jul. Self-migration of operating systems. In Proceedings of the 11th ACM SIGOPS European Workshop (EW 2004), pages 126–130, 2004.

[19] C. E. Perkins and A. Myles. Mobile IP. In Proceedings of the International Telecommunications Symposium, pages 415–419, 1997.

[20] Alex C. Snoeren and Hari Balakrishnan. An end-to-end approach to host mobility. In Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, pages 155–166. ACM Press, 2000.
