Task Scheduling Improvements
I took some time to see how I could improve the sample implementation of the task scheduling that I posted yesterday.
You can check out the code here. What I wanted to check is how well the solution scales to many tasks and to IO bound tasks.
In a stress test that I made, the library held its own while scheduling 9,864,100 tasks; at one point we had ~3,500,000 concurrently queued tasks. Memory usage hovered in the ~500MB range.
The next stage was to see what happens when we have tasks that take a long time to execute, depend on IO, etc.
I wrote this piece of code:
public class GatherAllFiles : AbstractTask
{
    private readonly string root;

    public GatherAllFiles(string root)
    {
        this.root = root;
    }

    protected override IEnumerable<Condition> Execute()
    {
        List<string> results = new List<string>();
        results.Add(root);

        List<Future> futures = new List<Future>();
        foreach (string directory in Directory.GetDirectories(root))
        {
            Future spawn = Spawn(new GatherAllFiles(directory));
            futures.Add(spawn);
        }

        string[] files = Directory.GetFiles(root);

        yield return Done(futures);

        foreach (Future future in futures)
        {
            results.AddRange(future.GetValue<IEnumerable<string>>());
        }
        results.AddRange(files);
        SetResult(results);
    }
}
I then ran it on my OSS directory. This directory contains 133,108 directories and 206,298 files (ask me how I know... ).
The library just ate it up without even noticing. Very nice, even if I say so myself :-)
Comments
Using Lambda Expressions, I would simplify:
class AbstractTask:
public WaitForFuture Done(IEnumerable<Future> items)
{
    return new WaitForFuture(() => items.All(item => item.HasValue));
}
Task scheduling is a topic that has been researched a lot. In general, round-robin with a priority system is the approach that gets you the best results on average.
Your test case is a bit weird. The disk is a resource that can serve only one task at a time. This means that even if you create a fancy scheduling system, it doesn't matter: everything is done sequentially anyway. What's worse, the slowest thing in your system is the head in the hard disk: every time it steps, you lose a big chunk of performance.
The way to get optimal performance from a disk with a set of tasks is to execute all of them in sequential order, without any of them interfering with one another. The thing is, though, that modern disks are able to 'accept' more than one command at a time and switch between them, during atomic block reads for example. This causes a lot of stepping, which results in a degradation of performance.
So to illustrate/test task scheduling, I'd pick a topic that isn't limited by a sequential resource like a disk ;)
Frans,
Actually, the FS is not limited by the HD; there is quite a bit of caching and read-ahead going on.
In addition to that, we aren't executing everything at the same time. This is the beauty of this approach: we are only executing N tasks at a time, where N == count(CPU).
On a decent server, I would expect multiple CPUs and a RAID configuration, so a file system (FS) example is not totally out of the question.
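To make the "N == count(CPU)" idea concrete, here is a minimal sketch of a scheduler that never runs more than Environment.ProcessorCount tasks at once. The BoundedScheduler name and its API are my own illustration for this discussion, not the actual types from the library:

using System;
using System.Collections.Generic;
using System.Threading;

public class BoundedScheduler
{
    private readonly Queue<Action> pending = new Queue<Action>();
    private readonly object sync = new object();
    private readonly int maxConcurrency = Environment.ProcessorCount;
    private int running; // number of tasks currently executing

    public void Schedule(Action task)
    {
        lock (sync)
        {
            pending.Enqueue(task);
        }
        Pump();
    }

    // Start queued tasks, but only while we are under the CPU-count limit.
    private void Pump()
    {
        Action next = null;
        lock (sync)
        {
            if (running < maxConcurrency && pending.Count > 0)
            {
                next = pending.Dequeue();
                running++;
            }
        }
        if (next == null)
            return;

        ThreadPool.QueueUserWorkItem(delegate
        {
            try
            {
                next();
            }
            finally
            {
                lock (sync) { running--; }
                Pump(); // a slot freed up; start the next queued task, if any
            }
        });
    }
}

With this shape, queuing millions of tasks is cheap (they are just entries in a queue); only count(CPU) of them are ever actually running.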
How would you handle the scheduling part? For example, run one task every minute and another task every day at 05:00.
Thanks
Marco
Marco,
That is something that I usually wouldn't do using this approach.
Nevertheless:
public class PrintTimeTask : AbstractTask
{
    protected override IEnumerable<Condition> Execute()
    {
        while (true)
        {
            Console.WriteLine(DateTime.Now);
            DateTime nextSchedule = DateTime.Now.AddMinutes(1);
            // yield until the scheduled time arrives
            yield return delegate { return DateTime.Now >= nextSchedule; };
        }
    }
}
It works of course, but I can imagine it would get really tiresome having to write code in this stilted form all of the time.
It's not clear to me why you don't just compose all of the futures together more directly using lambdas to encapsulate dependent computations instead of all of this tedious "yield return" nonsense.
Jeff,
Can you show me the code of doing this with Lambda?
Probably something like
class AbstractTask
{
    public Condition Where(Condition condition)
    {
        return condition;
    }
}
And then
while (true)
{
    Console.WriteLine(DateTime.Now);
    yield return Where(() => DateTime.Now >= nextSchedule);
}
Francois,
I am using .NET 2.0 to write the code.
This approach is the same thing, just with lambda syntax.
The idea with lambdas is that, instead of making up a scheduling loop with yield return (which only works within the context of a single method), we make the tasks chain themselves around an asynchronous callback.
So basically, we have Task objects floating around. To start running some Task, we create a new Task object containing a delegate for the first block of code we want to run. We let the Task bubble up to the scheduler.
When the delegate eventually runs, it can begin a new Task which gets chained onto the previous one.
This idea works particularly well if we think of Tasks as spawning asynchronous operations. That is, we might run a little code in a delegate that starts the asynchronous operation. When the asynchronous operation finishes, we finally mark the Task finished and set its result (or an exception).
When the Task is marked done, it sends out a notification that enables other chained Tasks to start up using whatever result was previously computed.
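A minimal sketch of that chaining idea; the ChainedTask name and API here are my own illustration, not a real library. Each task collects continuation delegates, and setting its result fires whatever was chained onto it:

using System;
using System.Collections.Generic;

public class ChainedTask<T>
{
    private readonly List<Action<T>> continuations = new List<Action<T>>();
    private bool done;
    private T result;

    // Chain a follow-up computation onto this task.
    public ChainedTask<TNext> Then<TNext>(Func<T, TNext> next)
    {
        var chained = new ChainedTask<TNext>();
        Action<T> continuation = value => chained.SetResult(next(value));
        if (done)
            continuation(result);            // already finished: run immediately
        else
            continuations.Add(continuation); // otherwise run when SetResult fires
        return chained;
    }

    // Called when the (possibly asynchronous) operation finishes.
    public void SetResult(T value)
    {
        result = value;
        done = true;
        foreach (Action<T> continuation in continuations)
            continuation(value);
    }
}

Usage reads top to bottom even though everything runs via callbacks:

var first = new ChainedTask<int>();
first.Then(x => x * 2)
     .Then(x => { Console.WriteLine(x); return x; });
first.SetResult(21); // prints 42

Error propagation (an exception instead of a result) would hang off the same notification mechanism.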
I did something very much like this in ActionScript 3 because I had to manage a large number of asynchronous operations all sharing the only thread available. So everything ends up having to post back to the event loop. Basically the event loop is functioning as a kind of scheduler.
Here's the code for the ActionScript 3 stuff.
https://svn.castleproject.org/svn/castlecontrib/flexbridge/trunk/src/Castle.FlexBridge.Flash/src/castle/flexbridge/common/AsyncTask.as
Unfortunately most of the cool code using AsyncTask is in other proprietary projects that have yet to be OSS'd. However, typical usage looks like:
Now in .NET we have a lot more power available, so we will probably want to use a somewhat different abstraction. Moreover, with the C# 3.0 syntax, we can cook up a more fluent interface for creating and chaining these asynchronous tasks together.
And we can rearrange the control flow a bit to put some kind of TaskScheduler in charge of the scheduling, so that each new Task to run gets posted back to the scheduler instead of running immediately, if desired.
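One way that "post back to the scheduler" shape could look; again a hypothetical sketch, not the actual FlexBridge or .NET API. Continuations are enqueued rather than run inline, and a simple loop drains them, which is exactly the role the event loop plays in the single-threaded ActionScript case:

using System;
using System.Collections.Generic;

public class PostingScheduler
{
    private readonly Queue<Action> queue = new Queue<Action>();

    // Continuations call Post instead of executing immediately.
    public void Post(Action work)
    {
        queue.Enqueue(work);
    }

    // A simple event loop: drain tasks until none remain.
    // Tasks posted while running are picked up in the same pass.
    public void Run()
    {
        while (queue.Count > 0)
        {
            Action work = queue.Dequeue();
            work();
        }
    }
}

Because nothing runs re-entrantly, each task sees a consistent world when it starts, and the scheduler gets a chance to apply priorities or fairness between any two tasks.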
Lots of possibilities in this design space. So don't stop with "yield return" loops...
Jeff,
what you are talking about is a message passing architecture.
I have tried that with Retlang, and it works very nicely. For some things.
For others, this approach is far preferable; it is much easier to write code this way than to explicitly handle the state transfer issues.
No. It's not strictly message passing... it's quite different from passing data around using message ports.
What we're really doing is chaining Futures to create composite asynchronous operations.
I don't see the difference, can you elaborate?