Hi Everyone,
I've been thinking about this question of the Drake implementation vs. the ThreadPool implementation and I wanted to share my thoughts. I had no idea the resulting email would be so long. It's my hope to offer interesting points for discussion.
These are all ordered by importance so you can bail when you like :)
Please bear with me...
What Should -j mean? (Part 1.)
There are two features for which I've made pull requests:
1 - Limit the number of concurrent tasks executing.
2 - All tasks process their prerequisites in parallel.
Both of these features are activated with separate flags: -j and -m, respectively. Neither feature requires the other. They are complementary.
Drake uses one flag to specify both features but there is no technical reason why Rake couldn't also activate both features with a single -j.
I raise this to separate the issue of "what -j means" from the possibly larger issue of the advantages of the drake implementation.
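To make the first feature concrete, here is a minimal sketch in plain Ruby of capping the number of concurrently running jobs. It is not Rake's or drake's actual code; `run_with_limit` and the sample jobs are hypothetical names made up for illustration.

```ruby
require "thread"

# Hypothetical helper (not part of Rake): run every job, with at most
# `limit` of them executing at the same time.
def run_with_limit(jobs, limit)
  pending = Queue.new
  jobs.each { |job| pending << job }
  results = Queue.new

  workers = Array.new(limit) do
    Thread.new do
      loop do
        job = begin
          pending.pop(true)   # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          nil
        end
        break if job.nil?
        results << job.call
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Sample jobs that record how many of them were running at once.
running = 0
peak = 0
lock = Mutex.new
jobs = Array.new(8) do |i|
  lambda do
    lock.synchronize { running += 1; peak = [peak, running].max }
    sleep 0.01
    lock.synchronize { running -= 1 }
    i * i
  end
end

squares = run_with_limit(jobs, 3)
```

The cap says nothing about which jobs are allowed to run side by side; that is the second feature, and it can be decided separately.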
A Perk of the ThreadPool Implementation
The reason I ask whether the issue is really just about "what -j means" is that the drake implementation is documented as breaking the existing contract exposed by the Rake API. From the drake page ( http://quix.github.com/rake/files/doc/parallel_rdoc.html ):
Task#invoke inside Task#invoke
Parallelizing tasks means surrendering control over the micro-management
of their execution. Manually invoking tasks inside other tasks is rather
contrary to this notion, throwing a monkey wrench into the system. An
exception will be raised when this is attempted in -j mode.
The ThreadPool implementation does not share this limitation, nor does it restrict any other feature of the Rake API.
[A use case for this is below...]
What Should -j mean? (Part 2.)
As a Rakefile author, I have found a lot of utility in being able to incrementally parallelize my Rakefile. Allowing both task and multitask enables me to quickly activate parallelization for a section of my Rakefile. I like that if I've detected a parallelization bug, I can quickly fix it by simply removing the parallelization for that section, leaving the rest of the file to remain in parallel (which hopefully still maintains good performance). I've been grateful for those times when I can quickly fix the build by changing a multitask to a task.
Being able to choose between task and multitask has always seemed to me a gentler way to allow authors to parallelize their Rakefiles while retaining the power to really take advantage of the machine upon which it runs.
That's why I like the separation of the -m option.
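As a small illustration of that incremental approach, the sketch below declares one group of prerequisites under `multitask` and another under plain `task`. The task names are hypothetical, and `include Rake::DSL` is only there so the snippet runs outside a Rakefile.

```ruby
require "rake"
include Rake::DSL   # makes `task` and `multitask` available outside a Rakefile

done = Queue.new    # thread-safe log of the order in which tasks ran

# Hypothetical task names, for illustration only.
task(:docs)  { done << :docs }
task(:tests) { done << :tests }

# The prerequisites of a multitask are run in parallel threads.
multitask(:checks => [:docs, :tests]) { done << :checks }

task(:lint)   { done << :lint }
task(:format) { done << :format }

# A plain task runs its prerequisites one after another, in order.
# If :checks ever showed a parallelization bug, changing its `multitask`
# to `task` would give it the same serial behavior.
task(:tidy => [:lint, :format]) { done << :tidy }

Rake::Task[:checks].invoke
Rake::Task[:tidy].invoke

order = Array.new(done.size) { done.pop }
```

Only the relative order of :docs and :tests is left open; everything else in `order` is fixed by the dependencies.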
Use Case For Task#invoke inside Task#invoke
Being able to call and activate tasks on the fly is also important to me because the build system at my job uses Task#invoke from within another Task#invoke. It's possible that I'm misusing Rake (and if so, this is a great opportunity for me to get a better solution from the community).
Here's how we use Task#invoke:
Our build system has a packaging component which creates a deployable "package" containing variations of the product, and a collection of global items used by all variations. For each product variation, there is a binary of the build with its corresponding symbol files.
Package
-------
- variations
- debug
- product.exe
- product.pdb
- release
- ...
- debug-only-feature-A
- release-only-feature-B
- etc...
- global-items
- assets
- manifest
- etc...
We need to be able to specify at the rake command-line:
- Which variations will be included
- Overall options that affect every variation in the package
I tried to write a Rakefile that would take all those options and build a giant dependency tree. Inside an enumeration of variations would be a declaration of our :build task for the current variation. The :build task would be declared with a unique name based on the configuration, essentially creating a parametrized task (akin to C++ templates). A single :package task would depend on all of these. Each variation's :build task would depend on its own prerequisite task, and those would all depend on a single task, :preprocess_assets.
Here's pseudo-code:
multitask :preprocess_assets => asset_tasks do |t,args|
  # [code]
end

variations.each do |variation|
  task "build_prereq(#{variation})" => :preprocess_assets do |t,args|
    # [code]
  end

  task "build(#{variation})" => "build_prereq(#{variation})" do |t,args|
    # [use variation in build code]
  end

  # :package accumulates one prerequisite per variation
  task :package => "build(#{variation})"
end

task :package do |t,args|
  # [packaging code]
end
Here's an ascii diagram (note that there were many more variables than "conf" and "features"):
[asset,asset,...]   <-- (in parallel)
   |
:preprocess_assets
   |
   +-- "build_prereq(conf=release,features=A,B)" -- "build(conf=release,features=A,B)" --+
   +-- "build_prereq(conf=debug,features=A)"     -- "build(conf=debug,features=A)"     --+
   +-- "build_prereq(conf=debug,features=A,B)"   -- "build(conf=debug,features=A,B)"   --+-- :package
   +-- "build_prereq(conf=release,features=B)"   -- "build(conf=release,features=B)"   --+
It seemed very straightforward, but it was difficult to read and debug the Rakefile. All the task names were generated (making them hard to find in the code when referenced from rake output) and the tree was very large.
Using Task#invoke allowed me to get rid of all the parameterization and create a Rakefile that better matched the flow of the process and was simpler to read.
multitask :preprocess_assets => asset_tasks do |t,args|
  # [code]
end

task :build_prereq, [:conf, :features] => :preprocess_assets do |t,args|
  # [code]
end

task :build, [:conf, :features] => :build_prereq do |t,args|
  # [use args]
end

task :package do |t,args|
  variations.each do |variation|
    Rake::Task[:build].invoke(*variation)
    # [reenable :build and its prerequisites]
  end
  # [packaging code]
end
Here's an ascii diagram:
    [asset,...]   <-- (in parallel)
         |
:preprocess_assets
         |
   :build_prereq
         |
      :build   <--loops over-- :package
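For anyone who wants to try the pattern, here is a runnable sketch of the same flow with the placeholders filled in by stand-in code. The variation list and the `built` log are hypothetical, and it assumes that only :build and :build_prereq are re-enabled, so :preprocess_assets still runs once.

```ruby
require "rake"
include Rake::DSL   # makes `task` available outside a Rakefile

built = []          # stand-in for real build work: records what ran

task(:preprocess_assets) { built << :preprocess_assets }

task :build_prereq, [:conf, :features] => :preprocess_assets do |_t, args|
  built << [:build_prereq, args[:conf], args[:features]]
end

task :build, [:conf, :features] => :build_prereq do |_t, args|
  built << [:build, args[:conf], args[:features]]
end

# Hypothetical variations, each a [conf, features] pair.
variations = [["debug", "A"], ["release", "B"]]

task :package do
  variations.each do |variation|
    Rake::Task[:build].invoke(*variation)
    # A task runs only once unless it is re-enabled, so reset the two
    # parametrized tasks before the next variation (assumption: the asset
    # step is deliberately left alone).
    Rake::Task[:build].reenable
    Rake::Task[:build_prereq].reenable
  end
  built << :package
end

Rake::Task[:package].invoke
```

Arguments reach :build_prereq because it declares the same argument names as :build; Rake passes values to prerequisites by name.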
Keeping Rake Flexible
On a more general note, Rake has always been presented to me as an API to enable dependency-based programming, and the DSL is a (significant) perk that enables writing a dependency tree in a declarative style. But as far as I know, there has never been a formal boxing of the Rake system into a "declare tasks" mode and an "execute tasks" mode, which the drake implementation seems to encourage, if not require.
Thank you for making it this far. I look forward to the discussion generated by these points.
Sincerely,
_ michael bishop
Post by Jim Weirich
Conservative is one thing, but drake was written 2 years ago. There has been no response every time someone asks why drake was not merged.
My main problem with drake is that it adds a second task execution engine that is subtly different from the mainline rake engine. The difference isn't critical and most projects won't even notice the difference, but having two similar but different engines offends my sensibilities.
If drake were to be merged, I would want to either (a) discard the current engine and use drake's engine exclusively, or (b) make the parallelization mechanism work more closely with the current rake engine.
I know drake uses a dry-run pass to compute the dependency tree, but I'm not sure if the dry run pass uses the regular rake engine (which might impact option (a)) or if it does its own thing.
In any case, a drake merge won't happen in the 0.9.x series as I would like to work out the current bug list and hit some simple features. The Thread pool looked like an easy win and is really needed for the multitask stuff anyways. Michael has also proposed a -m option that implicitly turns tasks into multitasks, and I'm considering that instead of a drake integration.
However, if the -m flag is deemed inadequate, I will probably hold off on the thread pool as well and reconsider a drake move a bit farther down the line.
Thoughts are welcome.
(Postscript: I also have some concerns about turning on parallel execution in arbitrary Rakefiles. I suspect it will work fine in projects that mostly shell out to compilers and linkers, but Rakefiles that run mostly Ruby code will probably be broken in ways that are hard to detect and reproduce. If anyone has any ideas on addressing that issue, I would love to hear them.)
--
-- Jim Weirich
_______________________________________________
Rake-devel mailing list
http://rubyforge.org/mailman/listinfo/rake-devel