

February 22, 2000. For submission to The 6th Workshop on Job Scheduling Strategies for Parallel Processing

System Utilization Benchmark on the Cray T3E and IBM SP
Adrian Wong, Leonid Oliker, William Kramer, Teresa Kaltz and David Bailey
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
One Cyclotron Rd, Berkeley, CA 94720, USA

Obtaining maximum utilization of parallel systems continues to be an active area of research and development. This article outlines a new benchmark, called the Effective System Performance (ESP) test, designed to provide a utilization metric that is transferable between systems and that illuminates the effects of various scheduling parameters. Results and discussion are presented for the Cray T3E and IBM SP systems, together with insights obtained from simulation.

1 Introduction

The overall value of a high performance computing system depends not only on raw computational speed but also on system management. System characteristics such as job scheduling efficiency, reboot and recovery times and the level of process management are important factors in the overall usability and effectiveness of the system. Common performance metrics such as the LINPACK and NAS Parallel Benchmarks [1, 2] are useful for measuring sustained computational performance, but give little or no insight into system-level efficiency issues.

In this article, we describe a new benchmark, the Effective System Performance (ESP) benchmark, which measures system utilization [3]. Our primary motivation in developing this benchmark is to aid the evaluation of high performance systems. Additionally, it will be used to monitor the impact of configuration changes and software upgrades in existing systems. We also hope that this benchmark will provide a focal point for future research and development activities in the high performance computing community, including both industry and academia.

The ESP test extends the idea of a throughput benchmark with additional constraints that encapsulate day-to-day operation. It yields an efficiency measurement based on the ratio of the actual elapsed time relative to the theoretical minimum time assuming perfect efficiency. This ratio is independent of the computational rate and is also relatively independent of the number of processors used, thus permitting comparisons between platforms. The product of this ratio with the measured sustained computational rate then gives a more accurate measure of overall usable system performance.

The test was run on the Cray T3E and the IBM SP. The T3E consists of 512 Alpha EV56 processors at 450 MHz with an aggregate peak of 614 GFlop/s. The SP consists of 512 Power3 processors at 200 MHz with an aggregate peak of 410 GFlop/s. The SP exhibits higher sustained performance for some applications; however, the T3E has better scalability for tightly coupled parallel applications.

2 Utilization and Scheduling

In this work, we are primarily concerned with parallel applications that require a static number of processors (or partition size). Scheduling several parallel applications concurrently is particularly problematic since time and partition constraints must be satisfied while system utilization is simultaneously kept at a maximum. Utilization, in this sense, is defined as the fraction of busy processors to available processors integrated over time. In day-to-day operation, other constraints and requirements create a situation that is more complex and subtle. For example, a scheduler may use a best-fit-first (BFF) strategy, but at the cost of starving jobs with large partitions. While this may reduce the elapsed time and achieve higher utilization compared to a first-come-first-serve (FCFS) strategy, it does not address the issues of turnaround, fairness and productivity.

One indirect measurement of utilization is the elapsed time required to process a workload consisting of a number of jobs. If the time for each job is constant, irrespective of whether it is run dedicated or concurrently with other jobs, then the variation in the elapsed time for the workload depends only on the scheduling strategy and system overhead. For a given workload with jobs of varying partition sizes and times, a utilization efficiency, η, can be defined as

\[
\eta = \frac{\sum_i p_i\, t_i}{P\, T} \tag{1}
\]

where p_i and t_i are the partition size and elapsed time, respectively, for the i-th job, T is the observed time for the workload and P is the number of available processors. In the ideal limit of perfect packing with no overhead, the efficiency η approaches unity. Here the numerator and denominator are in units of CPU-hours (or CPU-seconds), a unit that is useful for discussing parallel resources. Note that there is a distinction between perfect packing and optimal scheduling given a list of submitted jobs and run time parameters. Some of the possible parameters include preemption, backfilling, variable deadlines, multiple resource requirements, scheduling dependencies and job priorities. Not surprisingly, this problem is NP-hard, and many heuristic techniques have been developed to approximate a good solution. Examples of recent work on non-preemptive, preemptive and multiple resource schemes include [4, 5, 6, 7].

Particular scheduling strategies depend on the availability of certain key system functionality. As shown in Table I, the systems examined in this work differ considerably in available functionality. Checkpoint/restart is the ability to save a running job to disk and subsequently restore it. System-initiated checkpointing of long-running jobs affords flexibility in scheduling and maintenance. Without checkpointing, one is faced with the undesirable choice between waiting to idle the machine or prematurely terminating jobs. Swapping and gang-scheduling are the parallel analog of time-slicing on single-processor machines, but usually with a coarse granularity. This allows oversubscription (multiple processes per processor), increasing the apparent number of processors available and allowing greater flexibility in scheduling. Job migration/compaction refers to the capability to move a running job to a different set of processors. This aids in defragmentation and in scheduling large partitions.
Priority preemption is the ability to launch a higher priority job with the resulting effect of swapping out lower priority jobs. This is useful to improve turnaround and to prevent starvation of large partition jobs. Backfill is the insertion of smaller, shorter jobs ahead of earlier jobs in the queue to fill empty slots; more sophisticated backfill algorithms exploit a priori run time information. Systems with disjoint partitions do not require contiguous (in a topological sense) parallel partitions on the interconnect, and fragmentation is therefore not an issue.

Table I: System Functionality on T3E and SP

    Function                    T3E    SP
    Checkpoint/restart           X
    Swapping                     X
    Gang-scheduling              X
    Job migration/compaction     X
    Priority preemption          X
    Backfill                            X
    Disjoint partitions                 X

The first four functions mentioned, namely, checkpoint/restart, swapping, migration

and preemption, all depend on the fundamental kernel operation of preemptively changing process images between run and idle states and moving them between memory and disk. Unfortunately, this is either not implemented in stock operating systems or is not exposed to the global scheduling/queuing software. The Cray T3E at NERSC has demonstrated high utilization due to system management and the availability of checkpoint/restart and priority preemption [8, 9]. Furthermore, the Cray T3D at Lawrence Livermore National Laboratory, using gang-scheduling with priority preemption [10, 6], has also shown high utilization while simultaneously satisfying the demands of user productivity and turnaround. In that system, various job classes (interactive, debug, batch, benchmark, background) were assigned different priorities and characteristics. For example, benchmark jobs were marked non-swappable, while background jobs were given the lowest priorities to mop up idle processors.

The term scheduling is used to refer to both the launching of queued jobs and the dynamic management of running jobs. In practice, however, these are two separate subsystems that interact. On the T3E, NQS determines the order of job launch, while the global resource manager and psched implement swapping, gang-scheduling, priority preemption and migration/compaction of running jobs.
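The utilization efficiency of Eq. (1) is straightforward to compute from an accounting log. A minimal sketch in Python (the job list below is hypothetical, not the ESP mix; the optional reboot term anticipates the shutdown/reboot variant discussed later):

```python
def utilization_efficiency(jobs, nprocs, elapsed, reboot=0.0):
    """Ratio of consumed CPU-time to available CPU-time, per Eq. (1).

    jobs    -- iterable of (partition_size, runtime_seconds) pairs
    nprocs  -- number of available processors P
    elapsed -- observed wall-clock time T for the whole workload
    reboot  -- optional shutdown/reboot time added to the denominator
    """
    consumed = sum(p * t for p, t in jobs)
    return consumed / (nprocs * (elapsed + reboot))

# Hypothetical 4-processor workload, perfectly packed into 150 s:
jobs = [(2, 100.0), (2, 100.0), (4, 50.0)]
print(utilization_efficiency(jobs, nprocs=4, elapsed=150.0))  # 1.0
```

With perfect packing the 600 CPU-seconds of work exactly fill the 4 x 150 = 600 available CPU-seconds, so η = 1; any scheduling gap or overhead pushes the ratio below unity.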

3 Benchmark Design

A throughput benchmark was the starting point for the design of the ESP test. It consists of a set of jobs of varying partition sizes and times, with the objective of obtaining the shortest elapsed run time. By reporting the utilization efficiency, η, instead of the absolute time, the ESP test is independent of the computational rate. A theoretical minimum elapsed time of 4 hours (or 4 x 512 = 2048 CPU-hours) on the T3E was chosen for the benchmark. This choice was a compromise between a longer simulation more representative of actual production and a shorter time more amenable to benchmarking.

Table II: Workload Profile on the T3E at NERSC (%)

    Size (procs)   Time 300-900 s   Time 1000-3600 s   Time 3700-9600 s
    8                    0.5              0.5                2.0
    16                   0.7              1.4                4.2
    32                   1.9              3.4                7.6
    64                   4.5             10.2               35.1
    128                  0.5              2.8               17.3
    256                  1.3              4.0                2.4

The turnaround of larger partition jobs has always been a concern. In order to encapsulate this problem, the test includes two jobs with partition sizes equal to the number of available processors (full configuration jobs). The test stipulates that the first full configuration job must be run immediately upon submission, before any further jobs are launched. The first full configuration job can only be submitted after 10% of the theoretical minimum time has elapsed, while the second full configuration job must complete within 90% of the test time. Backfill is implicitly disallowed once a full configuration job has been submitted.

Outages, both scheduled and unscheduled, are common in these systems, and the time to shutdown and reboot has a significant impact on utilization over the lifetime of a system. The ESP test therefore includes a shutdown with reboot, which is required to start immediately after the completion of the first full configuration job. In practice, the shutdown and reboot cycle is difficult to implement without manual intervention during the test. If the shutdown/reboot time, t_r, is known in advance, the operation need not actually be performed. The utilization efficiency with shutdown/reboot is simply

\[
\eta_r = \frac{\sum_i p_i\, t_i}{P\,(T + t_r)} \tag{2}
\]

The jobs are grouped into three blocks, and the order of submission is determined from a reproducible pseudo-random sequence. The total number of CPUs requested in the first block is at least twice the available processors, and the number of CPUs in the second block is at least equal to the available processors. The remaining jobs constitute the third block. The first block is submitted at the start, with the second and third blocks submitted 10 and 20 minutes thereafter, respectively. This structure was designed to forestall artificially configured queues specific to this test and, at the same time, to provide sufficient queued work to allow flexibility in scheduling. No manual intervention is allowed once the test is initiated.

We consider it important that the job mix be representative of the user workload. Accounting records for three months on the T3E were distilled into a workload profile. User jobs were sorted into classes defined by run time and partition size. Table II shows the workload profile expressed as a percentage of the available CPU-hours. Using the 4 hour

elapsed time (2048 CPU-hours) for the test, Table II allows one to calculate the number of CPU-hours for each class. For example, the class of jobs with a partition size of 64 taking 1000-3600 seconds represents 10.2%, or 209 CPU-hours (that is, 209/64, roughly 3.3 hours of elapsed time on a 64-processor partition). It is important to note that while the profile reflects the application and problem-size constraints of users (for example, a minimum aggregate memory), there is also a secondary effect of the scheduling policy enforced on the system. That is, users naturally adapt their pattern of usage to maximize their productivity for a given queue structure.

The applications in the job mix originate from our user community and are used in production computing. Furthermore, the job mix profile was designed to span the diverse scientific areas of research amongst our users. Attention was also paid to diversifying computational characteristics such as the amount of disk I/O and memory usage. For each class, an application and problem set was selected to satisfy the time and partition size constraints. The number of instances (Count) of each application/problem was adjusted such that the aggregate CPU-hours reflected the workload profile. Table III lists the final job mix for the ESP benchmark with the elapsed times for each job on the T3E and SP. A priori run times are not provided to the schedulers; however, this rule may be relaxed in the future to examine backfill schedulers.
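The conversion from a Table II percentage to a per-class budget is simple arithmetic; the following sketch reproduces the worked example above (the 2048 CPU-hour total comes from the 4-hour, 512-processor theoretical minimum):

```python
TOTAL_CPU_HOURS = 4 * 512  # theoretical minimum: 4 hours on 512 processors

def class_budget(percent, partition_size):
    """CPU-hours allotted to one workload class, and the elapsed hours
    that budget represents when run on a single partition of that size."""
    cpu_hours = percent / 100.0 * TOTAL_CPU_HOURS
    return cpu_hours, cpu_hours / partition_size

# 64-processor jobs of 1000-3600 s hold 10.2% of the workload:
cpu_h, elapsed_h = class_budget(10.2, 64)
print(round(cpu_h), round(elapsed_h, 1))  # 209 3.3
```

The 209 CPU-hours quoted in the text is 10.2% of 2048 (208.9), and spread over a 64-processor partition it corresponds to about 3.3 hours of running time to be filled by one or more application instances.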

4 Results and Discussion

Two test runs were completed on the T3E. In both cases, a separate queue was created for full configuration jobs and was marked to preempt running jobs. The full configuration jobs could thus be launched immediately on submission, independent of the queue of general jobs. Process migration/compaction was also enabled for both runs. In the first run, labeled Swap, the system was oversubscribed by two and gang-scheduled with a time-slice of 20 minutes. A single NQS queue was used for the general job mix. In the second run, labeled NoSwap, the system was not oversubscribed and each job ran uninterrupted until completion. Six queues with different maximum partition sizes (256, 128, 64, 32, 16), in order of decreasing priority, were used.

A single run on the SP was completed. Two classes (queues) were created in Loadleveler: a general class for all jobs and a special high priority class for the full configuration jobs. A full configuration job is promoted to the head of the queue. However, the backfill mechanism would defer the full configuration job in favor of smaller, lower priority jobs, which would violate the stipulation that the full configuration job be run immediately. In order to circumvent this problem, all jobs were assigned very large wallclock time limits. Since the backfill algorithm exploits and respects time limits, Loadleveler was effectively reduced to a FCFS strategy.

The results of the ESP test for the T3E and the SP are summarized in Table IV. Two efficiency measurements, with and without the shutdown/reboot time factored in, are reported.


Table III: ESP Application Job Mix

    Application   Discipline          Size   Count   T3E Time(s)   SP Time(s)
    gfft          Large FFT            512       2          30.5        255.6
    md            Biology                8       4        1208.0       1144.9
    md            Biology               24       3         602.7        583.3
    nqclarge      Chemistry              8       2        8788.0       5274.9
    nqclarge      Chemistry             16       5        5879.6       2870.8
    paratec       Material Science     256       1         746.9       1371.0
    qcdsmall      Nuclear Physics      128       1        1155.0        503.3
    qcdsmall      Nuclear Physics      256       1         591.0        342.4
    scf           Chemistry             32       7        3461.1       1136.2
    scf           Chemistry             64      10        1751.9        646.4
    scfdirect     Chemistry             64       7        5768.9       1811.7
    scfdirect     Chemistry             81       2        4578.0       1589.1
    superlu       Linear Algebra         8      15         288.3        361.2
    tlbebig       Fusion                16       2        2684.5       2058.8
    tlbebig       Fusion                32       6        1358.3       1027.0
    tlbebig       Fusion                49       5         912.9        729.4
    tlbebig       Fusion                64       8         685.8        568.7
    tlbebig       Fusion               128       1         350.0        350.7
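Table III fixes the total work in the test: summing size x count x time over the T3E column recovers the job-mix work and the perfect-packing lower bound on the elapsed time. A sketch (values transcribed from the table, so the total differs from the quoted 7437860 CPU-seconds only by rounding of the per-job times; the bound lands slightly above the nominal 4 hours because the mix slightly overshoots the 2048 CPU-hour budget):

```python
# (application, size, count, t3e_seconds) rows transcribed from Table III
JOB_MIX = [
    ("gfft", 512, 2, 30.5), ("md", 8, 4, 1208.0), ("md", 24, 3, 602.7),
    ("nqclarge", 8, 2, 8788.0), ("nqclarge", 16, 5, 5879.6),
    ("paratec", 256, 1, 746.9), ("qcdsmall", 128, 1, 1155.0),
    ("qcdsmall", 256, 1, 591.0), ("scf", 32, 7, 3461.1),
    ("scf", 64, 10, 1751.9), ("scfdirect", 64, 7, 5768.9),
    ("scfdirect", 81, 2, 4578.0), ("superlu", 8, 15, 288.3),
    ("tlbebig", 16, 2, 2684.5), ("tlbebig", 32, 6, 1358.3),
    ("tlbebig", 49, 5, 912.9), ("tlbebig", 64, 8, 685.8),
    ("tlbebig", 128, 1, 350.0),
]

work = sum(size * count * t for _, size, count, t in JOB_MIX)
t_min = work / 512  # perfect-packing lower bound on 512 processors
print(round(work), round(t_min))  # about 7438090 CPU-s and 14528 s (4.04 h)
```

Dividing the observed elapsed time of a run by this t_min gives the reciprocal of the utilization efficiency reported in Table IV.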


Figures 1, 2 and 3 show the details of the three runs: the instantaneous utilization is plotted against time, with the time axis rescaled by the theoretical minimum time. Additionally, the start time of each job is indicated by an impulse whose height equals the partition size.

The obvious point is the significantly higher utilization efficiency of the T3E compared to the SP. This is due to the lack of a suitable mechanism to immediately launch full configuration jobs on the SP. On submission of the full configuration jobs, a considerable amount of time was spent waiting for running jobs to complete. This is evident in Figure 3, which shows two large regions where the instantaneous utilization drops very low. Without this drawback, it is likely that the utilization on the SP would be comparable to the T3E. The time lag to run preferential jobs is indicative of the difficulty of changing modes of operation on the SP. This is important for sites that routinely change system characteristics, for example, between interactive and batch or between small and large partitions. The most desirable remedy would be to either checkpoint or dynamically swap out running jobs. It is noteworthy that the SP completed the test in less time due to its higher computational rate.

As seen in Figure 1, the BFF mechanism on the T3E deferred large partition jobs (128 processors or more) until the end. Consequently, at the end of the test there were large gaps that could not be filled by small jobs. On the SP, a FCFS strategy was indirectly enforced, which is illustrated in Figure 3, where the distribution of job start times is unrelated to partition size. The distribution of start times is qualitatively similar between the Swap and NoSwap runs on the T3E, although the queue set-up was different. In the second run, we deliberately assigned increasingly higher priorities to larger partition queues in an attempt to mitigate starvation. However, shortly after the start it is unlikely that a large pool of idle processors would become coincidentally available. In this scenario, the pattern of job submission reverts to BFF and the queue set-up makes little or no difference. On the other hand, there is a considerable difference in efficiency between the two T3E runs. This is attributed to the overhead of swapping, which is significant when the oversubscribed processes cannot simultaneously fit in memory and process images must be written to disk.

To aid the evaluation of the performance and sensitivity of the ESP test, a simulator was developed to predict the runtime of this workload under various scheduling schemes. Several simple scheduling algorithms, such as FCFS and BFF, were implemented, together with the option of using backfill and priority preemption. The simulator was used in the definition of the ESP test in order to ensure that the test did not exhibit pathological behavior. We have also used the simulator to estimate results of the ESP test on various system configurations and compared them to observed results. Table V compares the observed efficiency against various simulated efficiencies without shutdown/reboot. The simulated efficiencies are somewhat optimistic since they do not account for system overheads such as I/O contention, swapping and processor fragmentation. A gang-scheduled simulation of the ESP test with a 1000 second time-slice and an oversubscription of two achieved a utilization efficiency of 0.86. However, this is overly optimistic

Table IV: ESP Results

                                T3E Swap   T3E NoSwap        SP
    Available processors             512          512       512
    Jobmix work (CPU-seconds)    7437860      7437860   3715861
    Elapsed time (seconds)         20736        17327     14999
    Shutdown/reboot (seconds)       2100         2100      5400
    Efficiency                      0.64         0.75      0.36
    Efficiency (w/o reboot)         0.70         0.84      0.48
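The efficiencies in Table IV follow directly from the work, elapsed and reboot figures via Eqs. (1) and (2); the sketch below recomputes them from the tabulated raw numbers:

```python
# (work CPU-s, elapsed s, reboot s) per run, transcribed from Table IV
RUNS = {
    "T3E Swap":   (7437860, 20736, 2100),
    "T3E NoSwap": (7437860, 17327, 2100),
    "SP":         (3715861, 14999, 5400),
}
P = 512  # available processors in every run

for name, (work, elapsed, reboot) in RUNS.items():
    eff_no_reboot = work / (P * elapsed)           # Eq. (1)
    eff_reboot = work / (P * (elapsed + reboot))   # Eq. (2)
    print(f"{name}: {eff_no_reboot:.2f} / {eff_reboot:.2f}")
# prints:
# T3E Swap: 0.70 / 0.64
# T3E NoSwap: 0.84 / 0.75
# SP: 0.48 / 0.36
```

The recomputed values match the table to two decimal places, confirming that the reported efficiencies are the plain work/availability ratios with no hidden correction terms.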

Table V: Observed and Simulation Results on the T3E

                                   Efficiency
    Observed (NoSwap)                    0.84
    Simulation, w/o preemption           0.49
    Simulation, with preemption          0.84
    Simulation, gang-scheduling          0.86


as the observed efficiency of 0.70 from the Swap run indicates that there is a significant overhead incurred with gang-scheduling. Furthermore, it is only a slight increase over the preemption simulation, which reached an efficiency of 0.84 without the use of gang-scheduling. The most notable result obtained from the simulations is the assessment of the preemption functionality: the simulated efficiency of the ESP test using a BFF algorithm increased from 0.49 to 0.84 when preemption was employed. These results indicate that preemption can have a very significant impact on system utilization in a real operational setting, which agrees with the conclusion drawn from the SP run.
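The NERSC simulator itself is not listed here, but the qualitative FCFS-versus-BFF behavior it explores is easy to reproduce with a toy list scheduler. The sketch below is our own minimal reconstruction under stated simplifications (no preemption, backfill, gang-scheduling or system overhead; every job fits on the machine), not the actual simulator used in the paper:

```python
import heapq

def simulate(jobs, nprocs, best_fit_first=False):
    """Wall-clock time to drain a queue of (size, runtime) jobs.

    best_fit_first=False gives strict FCFS: the head job blocks the
    queue until it fits.  True launches the largest queued job that
    fits whenever processors free up (a BFF-style policy).
    """
    queue = list(jobs)
    free, clock = nprocs, 0.0
    running = []  # min-heap of (finish_time, size)
    while queue or running:
        # Launch everything the policy allows with the processors on hand.
        while True:
            if best_fit_first:
                fitting = [j for j in queue if j[0] <= free]
                job = max(fitting, key=lambda j: j[0]) if fitting else None
            else:
                job = queue[0] if queue and queue[0][0] <= free else None
            if job is None:
                break
            queue.remove(job)
            free -= job[0]
            heapq.heappush(running, (clock + job[1], job[0]))
        # Advance to the next completion and reclaim its processors.
        finish, size = heapq.heappop(running)
        clock, free = finish, free + size
    return clock

# A wide job stuck at the FCFS head idles half the machine:
jobs = [(2, 10.0), (4, 5.0), (2, 10.0)]
print(simulate(jobs, 4), simulate(jobs, 4, best_fit_first=True))  # 25.0 15.0
```

Even on this three-job queue, BFF drains the same work in 15 time units against 25 for FCFS, while deferring the wide job: a small-scale version of the utilization gain and large-partition starvation trade-off discussed above.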

5 Conclusion

In this work we described a new utilization benchmark that we successfully ran on two parallel computer systems. It has provided quantitative data on the utilization and scheduling efficacy of these resources, and useful insights on how to manage these systems. The most important conclusion is that certain system functionality, including checkpoint/restart, swapping and migration, is critical for efficient scheduling strategies. Unfortunately, the scheduling of parallel systems has received little attention from operating system kernel developers, wherein such functionality must originate. We plan to work closely with vendors to increase utilization efficiency, with the ESP test as a guide.

Future work will include testing other parallel platforms with different scheduling and system functionality. We are particularly interested in quantifying the effects of different schedulers with Loadleveler. We also plan to introduce variations on the ESP test, for example, providing a priori runtimes to schedulers that may exploit such information. The scalability of the test to larger numbers of processors will have to be examined and, at some point, we hope to make the test freely available. Finally, we intend to conduct a theoretical analysis of the ESP test in the context of multiple resource requirements. We hope that this test will be of use to other sites and will spur both industry and research to improve system utilization.

6 Acknowledgments

This work was supported by the Office of Computational and Technology Research, Division of Mathematical, Information and Computational Sciences of the U.S. Department of Energy, under contract DE-AC03-76SF00098. The authors would like to acknowledge the assistance of the system administrators on the T3E and SP, Tina Butler, Nicholas Cardo and Jacqueline Scoggins, in setting up the ESP test. We would also like to thank the developers of the applications for allowing their software to be included in this test: the Environmental Molecular Science Laboratory (NWChem), George Vahala (tlbe), Gregory Kilcup (qcd), Xiaoye Li (SuperLU) and Andrew Canning (paratec).

References
[1] Jack Dongarra. Performance of various computers using standard linear algebra software in a Fortran environment. Available at http://www.netlib.org/benchmarks/performance.ps.

[2] David H. Bailey. The NAS parallel benchmarks. International Journal of Supercomputer Applications, 5:66, 1991.

[3] Adrian Wong, Leonid Oliker, William Kramer, Teresa Kaltz, and David Bailey. Evaluating system effectiveness in high performance computing systems. Available at http://www.nersc.gov/~dhbailey.

[4] D. Talby and D. G. Feitelson. Supporting priorities and improving utilization of the IBM SP2 scheduler using slack-based backfilling. In 13th International Parallel Processing Symposium, page 513. IEEE, 1999.

[5] Dmitry Zotkin and Peter J. Keleher. Job-length estimation and performance in backfilling schedulers. In 8th High Performance Distributed Computing Conference. IEEE, 1999.

[6] D. G. Feitelson and M. A. Jette. Improved utilization and responsiveness with gang scheduling. In D. G. Feitelson and L. Rudolph, editors, Workshop on Job Scheduling Strategies for Parallel Processing, page 238. Springer-Verlag, 1997.

[7] W. Leinberger, G. Karypis, and V. Kumar. Job scheduling in the presence of multiple resource requirements. TR 99-025, Computer Science Department, University of Minnesota, 1999.

[8] Horst Simon, William Kramer, and Robert Lucas. Building the teraflops/petabytes production supercomputing center. In Fifth International Euro-Par Conference Proceedings, page 61, August 1999.

[9] Jay Blakeborough and Mike Welcome. T3E scheduling update. In 41st Cray User Group Conference Proceedings, May 1999.

[10] Morris A. Jette. Performance characteristics of gang scheduling in multiprogrammed environments. In Supercomputing '97, 1997.


Figure 1: T3E Swap : Utilization Chronology. [Plot: instantaneous utilization (%) versus normalized time, with job starts shown as impulses whose height equals the partition size.]

Figure 2: T3E NoSwap : Utilization Chronology. [Plot: instantaneous utilization (%) versus normalized time, with job starts shown as impulses whose height equals the partition size.]

Figure 3: SP : Utilization Chronology. [Plot: instantaneous utilization (%) versus normalized time, with job starts shown as impulses whose height equals the partition size.]

