Xeduler
Linux's Witness Ministry
Xeduler is a simple job scheduler. It is designed to:
Xeduler schedules jobs. Because this is all it does, it is possible for an administrator to deploy and completely understand the scheduling system with very little effort. Also, by focusing on such a limited scope, a user can preserve his choice of tools for other things or choose not to do anything else.
There are many things that Xeduler does not do. It does not provide health monitoring services for resources. Why should it? Maybe you don't need that. Maybe your resources are very exotic and you want a custom monitoring system. Whatever your monitoring requirements are, Xeduler makes it easy to integrate them. Xeduler does not do load balancing. Again, requirements for that could be so varied that extreme complexity would be necessary to cover all possibilities. Xeduler is designed so that it would be quite simple to incorporate a load balancing system. Xeduler does not provide redundant systems to remotely execute jobs somewhere else. For that, use ssh or rsh or OpenMosix or whatever you like. Those facilities exist, work fine, and don't need replacing.
These seem like ok products from an outsider's perspective and if you can make use of them, great. For me, it was so much of an ordeal that I began to wonder if it might actually be easier to write my own scheduler. I use a lot of software and very, very rarely do I ever think, hmm, I should just reimplement this myself. First, there are usually dozens of possible solutions to the problem out there and second, it's usually a very serious undertaking to replicate the functionality of some capital piece of software.
As it was becoming clear to me what a substantial effort was going to be required just to use someone else's scheduler, I started to wonder why it was so difficult. I feel that there is possibly superfluous complexity in at least two areas. First, I think these packages do a lot more than I needed to have done. Second, I think that because of reason one, there is a lot of overhead in creating a lot of custom facilities (like the user interface commands, and database, etc).
My concept was to simplify the problem and break apart components that were not necessarily related so that they could be treated as smaller problems. I decided to focus exclusively on the job scheduling aspect for now. Issues like health monitoring, disaster recovery, and resource optimization are all interesting and possibly useful. On the other hand, they are possibly not. My system does none of that, but rather provides for subsequent software to do these things on an as needed basis.
For me, the problem is that there are a "large" number of jobs that must be run on a collection of computation resources. The jobs must be dispatched in a way that does not overwhelm the resources. That's it. An additional problem I wanted to solve was to eliminate a huge amount of the complexity of installation, deployment, and usage.
Considering the problem to be just one of simply and sensibly dispatching jobs, I was able to take advantage of some creative tactics. I realized that every day I use a job dispatching program. This program is natively installed in practically all Unix operating systems, and beyond. This job dispatcher is Bash. I like programming Bash and it seems to do a good job at doing exactly what it was designed to do - dispatch jobs. So while that may seem crazy, it does seem effective.
Using Bash as the platform to create my scheduler produces many excellent and interesting advantages. First, everyone has Bash and if one doesn't, it's generally easy to get. Certainly most Linux cluster users would be in a very unusual situation to not have immediate access to it. Second, Bash is well-known and reasonably understood by people having anything to do with Linux clusters. My system tries to leverage the existing untapped power of a native Linux environment.
Another advantage of using Bash is that platform problems are no longer a problem. One can use Bash on 64 bit architecture, under Cygwin, even on a Zaurus handheld, and basically anywhere Bash runs. That's a lot of flexibility.
Not only did I choose to use Bash, I also wanted to leverage the existing familiar facilities that a typical Unix system provides for dealing with various aspects of the system. Consequently there are not really any special commands that need to be learned in order to interact with the system. The scheduler is controlled by issuing normal commands that Unix users are already familiar with.
The scheduling problem is fundamentally one of assigning jobs to resources. This implies a list or queue of jobs and a list of resources. Both of these lists are implemented using an ordinary filesystem. Any filesystem will do. If you're particularly concerned about speed, the filesystem can be implemented in a ramdisk which most Linux distributions provide.
The job queue is simply an ordinary directory which contains files which contain jobs. There can be one job per job file in the queue or there can be gigabytes of jobs per file in the queue, one job per line. To submit a job, simply move a job file into the job queue directory, or you can create it on the fly (echo /bin/jobtorun > JOB--a_sample_job). Job files begin with the "JOB--" prefix; what follows that is up to you. The priority of the jobs is in the same order that the ls command will list them (this is locale dependent), which ensures predictable results. If a file contains multiple jobs, they are prioritized from first to last. To change a job file's priority in the queue, simply rename it to something higher in the ls order (using the standard mv, generally).
The resource list is also simply a directory with meaningful files in it. The resource list directory contains one file per resource. In this case a resource could be a particular machine or a CPU on a machine or even just an arbitrary share on a machine if, for example, you want to keep a machine loaded with 11 jobs at all times for some reason.
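The queue operations described above can be tried out with nothing but standard commands. The following is a minimal sketch, not Xeduler's own code: the Qdir used here is a throwaway temporary directory, the job names are made up, and pinning LC_ALL=C is my own assumption for getting a locale-independent listing order.

```shell
#!/usr/bin/env bash
# Sketch only: Qdir is a scratch directory standing in for a real queue.
Qdir=$(mktemp -d)

# Submit a single job by creating a job file on the fly.
echo "/bin/jobtorun" > "$Qdir/JOB--b_sample_job"

# Submit a batch: several jobs in one file, one job per line.
printf '%s\n' "/bin/job_one" "/bin/job_two" > "$Qdir/JOB--c_batch"

# Raise the batch's priority by renaming it so that it sorts first.
mv "$Qdir/JOB--c_batch" "$Qdir/JOB--a_batch"

# Show the queue in priority order (LC_ALL=C pins the sort order).
queue_listing=$(cd "$Qdir" && LC_ALL=C ls)
echo "$queue_listing"
```

After the rename, the batch file should be listed ahead of the single sample job, which is all that "priority" means here.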
You must have a QUEUE directory and a RESOURCES directory. These locations are set as variables (Qdir and Rdir) at the top of the program.
When the RESOURCES directory is listed, it could look something like this:
$ ls -l
-rw-r--r-- 1 xed users 751 Dec 19 16:35 BUSY--10.0.0.44-spare_server
-rw-r--r-- 1 xed users 751 Dec 19 16:35 BUSY--raven02
-rw-r--r-- 1 xed users   0 Dec 17 17:57 DOWN--10.0.0.13-archive_server
-rw-r--r-- 1 xed users   0 Dec 17 17:57 DOWN--raven13
-rw-r--r-- 1 xed users   0 Dec 17 17:57 DOWN--robin13
-rw-r--r-- 1 xed users 751 Dec 19 16:35 FREE--raven01
-rw-r--r-- 1 xed users 751 Dec 19 16:35 FREE--raven04
-rw-r--r-- 1 xed users 771 Dec 19 16:35 FREE--raven05-cpu1
-rw-r--r-- 1 xed users 771 Dec 19 16:35 FREE--raven05-cpu2
-rw-r--r-- 1 xed users 751 Dec 19 16:34 FREE--robin01
-rw-r--r-- 1 xed users 751 Dec 19 16:34 FREE--robin02
-rw-r--r-- 1 xed users   0 Dec 19 16:34 STOP--raven02
This shows many machines or nodes in the resources list in different states. The first state is DOWN. This indicates that a resource is not to be used. Perhaps there will be maintenance on that machine or just selective avoidance. The next status is BUSY. This means that a job has been dispatched to this machine and Xeduler is waiting to hear back from it. The third state is FREE and these are the machines to which jobs are sent as they arrive in the job queue. The FREE machines are chosen in the order they appear in the ls list.
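To make the FREE to BUSY transition concrete, here is a rough sketch of how the first FREE resource in listing order could be claimed. It is an illustration under my own assumptions, not the actual Xeduler source: Rdir is a scratch directory, and the sketch only renames the file, whereas the real BUSY file also carries the wrapper script.

```shell
#!/usr/bin/env bash
# Sketch only: Rdir is a scratch directory with a few sample resources.
Rdir=$(mktemp -d)
touch "$Rdir/BUSY--raven02" "$Rdir/DOWN--raven13" \
      "$Rdir/FREE--raven04" "$Rdir/FREE--raven01"

# Globs expand in sorted order, so the first match is the first FREE
# entry a listing would show.
first_free=""
for f in "$Rdir"/FREE--*; do
    [ -e "$f" ] || continue      # the glob matched nothing
    first_free=${f##*/}          # strip the directory part
    break
done

node=""
if [ -n "$first_free" ]; then
    node=${first_free#FREE--}    # drop the state prefix
    # Claim the resource by renaming it from FREE to BUSY.
    mv "$Rdir/$first_free" "$Rdir/BUSY--$node"
    echo "dispatching next job to $node"
else
    echo "no FREE resources"
fi
```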
The last state is STOP. In the sample listing, raven02 is both BUSY and it also has a STOP entry. The STOP feature allows the user to pass along the intention that a resource be taken DOWN after it finishes being BUSY. Basically the user just creates a file that matches a BUSY resource but with a stop in place (touch STOP--raven02) and when raven02 is finished with its job, it will go DOWN instead of taking more jobs. You could simply rename a FREE node to DOWN, but it's best to use the STOP since you may not know when a job might show up and go to BUSY just after you renamed it. The chance of this happening is very small, but the STOP feature precludes this problem.
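The STOP rule boils down to one check at the moment a job finishes. The sketch below shows that check with a hypothetical helper, release_resource, which is my own name and not something taken from Xeduler; Rdir is again a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch only: two busy resources, one of which has a STOP request.
Rdir=$(mktemp -d)
touch "$Rdir/BUSY--raven02" "$Rdir/STOP--raven02" "$Rdir/BUSY--raven03"

# Hypothetical helper: call it when the job on a resource has finished.
release_resource() {
    local name=$1
    if [ -e "$Rdir/STOP--$name" ]; then
        # A STOP request is pending: retire the resource instead of freeing it.
        rm -f "$Rdir/STOP--$name"
        mv "$Rdir/BUSY--$name" "$Rdir/DOWN--$name"
    else
        mv "$Rdir/BUSY--$name" "$Rdir/FREE--$name"
    fi
}

release_resource raven02
release_resource raven03
ls "$Rdir"
```

In this run raven02 should end up DOWN and raven03 should end up FREE, with no STOP file left behind.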
To restore a DOWN machine to FREE again, simply renaming it works fine. You can't confuse the scheduler by adding resources, just by taking them away.
The STOP feature could be used by a separate program which monitors the health of the nodes. If a node stops working the way you want (it stops responding, its CPU usage goes very high, whatever), you can have the monitor program touch a STOP-- file.
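A monitor of this kind could be very small. The following sketch assumes a hypothetical probe function, node_is_healthy, whose body you would replace with a real test (a ping, an ssh check, a load query); here it simply pretends that raven02 is misbehaving. Nothing in it comes from Xeduler itself.

```shell
#!/usr/bin/env bash
# Sketch only: Rdir is a scratch directory with two busy resources.
Rdir=$(mktemp -d)
touch "$Rdir/BUSY--raven02" "$Rdir/BUSY--raven03"

# Hypothetical health probe; swap in a real check for your own nodes.
# For the demonstration, raven02 is treated as unhealthy.
node_is_healthy() {
    [ "$1" != "raven02" ]
}

# Ask for every unhealthy busy resource to go DOWN once its job ends.
for f in "$Rdir"/BUSY--*; do
    [ -e "$f" ] || continue
    name=${f##*/BUSY--}          # resource name without path and prefix
    if ! node_is_healthy "$name"; then
        touch "$Rdir/STOP--$name"
        echo "requested STOP for $name"
    fi
done
```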
Another clever time to use STOP is in a cron job. Perhaps the office staff's desktop machines do nothing all night. One could set a cron job to do something like this:
30 19 * * 1,2,3,4,5 mv DOWN--10.0.0.124-conf_rm_mac FREE--10.0.0.124-conf_rm_mac
30 5 * * 1,2,3,4,5 touch STOP--10.0.0.124-conf_rm_mac
This should cause a node to be down between 05h30 and 19h30 on weekdays and available the rest of the time. You can see that by leveraging the existing power of typical Unix systems you can get a lot of very nice features for essentially no extra complexity.
The format of the node filenames is somewhat important. The first part of the name must be the state codes (which can all be customized) such as DOWN--, STOP--, etc. Then comes the machine name that your remote shell program will use. Since listing order is important, you may want to prioritize resources by having special listings in the /etc/hosts file to look up. Finally, anything after and including a dash is considered for user information only. Here you can say more descriptive things about the resource. So in the example, raven05 has 2 cpus that should be loaded. The two entries of raven05 mean that two jobs will be sent to the name raven05. The -cpuX part is informational.
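The naming rule can be illustrated with plain Bash parameter expansion. This is a sketch of my reading of the rule (state before the first "--", machine name up to the next dash, the rest informational); the function name parse_resource is hypothetical. Note that under this reading a machine name that itself contains a dash would be cut short.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a resource file name into its three parts.
# Results are left in the variables state, host and info.
parse_resource() {
    local file=$1 rest
    state=${file%%--*}           # text before the first "--"
    rest=${file#*--}             # text after the first "--"
    host=${rest%%-*}             # machine name: up to the next dash
    if [ "$rest" = "$host" ]; then
        info=""                  # no informational suffix present
    else
        info=${rest#*-}          # everything after that dash
    fi
}

for f in FREE--raven05-cpu1 BUSY--10.0.0.44-spare_server DOWN--raven13; do
    parse_resource "$f"
    printf '%s: state=%s host=%s info=%s\n' "$f" "$state" "$host" "$info"
done
```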
To stop Xeduler, you can kill it in the normal way, but this might be too abrupt if there are jobs running. If a file name with the word "quit" (case-insensitive) is found in the resource directory, then no new jobs are dispatched and the Xeduler is stopped. Currently running jobs continue on their own until they are complete.
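A check for such a quit file might look like the sketch below. The function name quit_requested and the file name Please-QUIT are made up for the example, and the real program may well test for the file differently.

```shell
#!/usr/bin/env bash
# Sketch only: Rdir is a scratch directory standing in for RESOURCES.
Rdir=$(mktemp -d)
touch "$Rdir/FREE--raven01"

# True if any file name in the resource directory contains "quit",
# in any mix of upper and lower case.
quit_requested() {
    ls "$Rdir" | grep -qi 'quit'
}

if quit_requested; then before=yes; else before=no; fi

# Ask for a graceful shutdown by dropping a file into the directory.
touch "$Rdir/Please-QUIT"

if quit_requested; then after=yes; else after=no; fi
echo "quit requested before: $before, after: $after"
```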
Xeduler tackles the problem of what to send where, that's it. By not meddling in the affairs of other machines, the complexity of the system is reduced to an absolute minimum. By being so focused in scope it is my hope that it will be possible to be very creative with this tool. Xeduler has no idea what a job is really doing. It doesn't really care where it goes. Unlike other systems, it is not in constant contact with the machines running jobs. This has its limitations for sure, but it reduces the complexity enormously. Other scheduling systems work like a horse race - dedicated jockeys follow and guide the real workhorses the whole way; Xeduler is like racing pigeons - you let the jobs go and you hope they come back.
The huge advantage of the "pigeon" system, of course, is that you can utilize any machine you have access to without specifically preparing that machine to work as part of a cluster. It's onerous enough to have to set up everything any scheduler requires; to repeat that feat for every node minimally doubles that effort. It also severely discourages heterogeneity. Xeduler, on the other hand, can take advantage of any resource you can log into without further configuration annoyance.
Every time Xeduler discovers a job that can and should be run on an available resource, it prepares a little wrapper script that should be able to accomplish that. This script is the actual contents of a BUSY file. You can read the contents of the BUSY files to understand what's going on, but it's probably unwise to modify them while the jobs are underway.
When there are no jobs and there are resources available, Xeduler does poll to see if any jobs were deposited. This is relatively low impact and only incurred when there are resources available. When the system determines that all the resources are full, it sends itself a SIGSTOP and effectively goes to sleep. Why this odd strategy of falling asleep when things are at their busiest? First of all, it is likely that the user can be using the scheduling machine as a compute node too and we don't want to delay the computation with scheduling polling. What if more jobs are deposited while the scheduler is sleeping? Turns out it doesn't matter, there is nowhere for them to go anyway.
The trick is that the wrapper program which dispatched the job is, upon completion, set up to check if Xeduler is sleeping. If so, it wakes it up. Now there is obviously a resource newly freed and the next job can be assigned. This means that the scheduler overhead is absolutely nothing during full-capacity operation and that newly freed resources are captured and utilized immediately.
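The sleep and wake mechanism can be demonstrated in isolation with standard signals. In the sketch below a small background Bash process stands in for the scheduler, and the helpers is_stopped and wake_scheduler are hypothetical names of mine. The stopped-state test reads /proc where available and falls back to ps, which is an assumption about the platform and not how Xeduler necessarily does it.

```shell
#!/usr/bin/env bash
log=$(mktemp)

# Stand-in for the scheduler: with all resources full it stops itself.
bash -c '
    echo "scheduler: resources full, going to sleep" >> "$1"
    kill -STOP $$
    echo "scheduler: woken up, dispatching next job" >> "$1"
' _ "$log" &
sched_pid=$!

# True if the given process is currently in the stopped state.
is_stopped() {
    local pid=$1
    if [ -r "/proc/$pid/status" ]; then
        grep -q '^State:[[:space:]]*T' "/proc/$pid/status"
    else
        case "$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ')" in
            T*) return 0 ;;
            *)  return 1 ;;
        esac
    fi
}

# What a wrapper could do when its job completes: wake a sleeping scheduler.
wake_scheduler() {
    if is_stopped "$1"; then
        kill -CONT "$1"
    fi
}

# Demo only: give the stand-in a moment to fall asleep (at most ~5 s).
tries=0
until is_stopped "$sched_pid" || [ "$tries" -ge 50 ]; do
    sleep 0.1
    tries=$((tries + 1))
done

wake_scheduler "$sched_pid"
wait "$sched_pid"
cat "$log"
```

While the stand-in is stopped it uses no CPU at all, which is the point of the design: the cost of waking it is paid once, by whichever job happens to finish.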
Because I was so bewildered by the other systems I tried to use, I intended to make Xeduler very easy to understand in its entirety. The source code that does the actual work is shockingly small, only 2951 bytes. The program as distributed adds 5kB of illustrative comments in the code with the intention that a normal administrator of a Linux cluster should be able to read and understand the entire source code in less time than it would take to read this page until here. You are encouraged to do that and empower yourself with the ability to utilize Xeduler's capabilities to the limits of your skills and imagination.
Xeduler was developed to solve a very specific problem which I felt was easier to solve from scratch than to install a more comprehensive package that did more than I needed. Consequently, there are plenty of missing things that would be nice to have.
Some of these missing things should be separate programs and perhaps I'll get around to implementing them. The two big ones are health monitoring and load balancing. By creating separate facilities to handle both of these tasks, Xeduler can still be used for what it's good at - job scheduling. The interface which these advanced features would use is as easy as it gets.
Another area that could be easily improved with separate software is fault recovery. If you send a job out and the node it's running on dies, then what happens? As far as Xeduler is concerned, that's not a scheduling problem strictly speaking. If the machine where Xeduler itself is running dies, then that is more of a scheduling problem and there are many possible tactics to deal with that. The fact that everything is naturally saved to a file system means that when the machine is rebooted, it would be fairly straightforward to reconstruct what was where and check up on that. Again, best left to some other program that specializes in such a thing.
Xeduler's main limitation right now which should be fixed is one of permissions. I created this to be run by one user and that's what it does. But it would be trivial to add the capabilities to have it run as root and dispatch jobs with an effective id of the owner of the job file. If that's interesting to you, let me know and I'll add it.
Another improvement which could be made to Xeduler itself is in configurability. Already there is a ton of flexibility in what you can do, especially for how easy it is to do simple things. However, if you had situations where you wanted certain jobs to be favored on certain machines and others on others, or where some users can only use certain nodes while others can only use the cluster at certain times, that kind of thing would take implementing a complex system to sort out. I feel it was correct to design and actually create the simple system first so that users with a simple problem can have a simple solution. Simplicity was the big objective here.
Chris X. Edwards ~ January 2006