static code generation and open Accs
Stephen Blackheath [to Accelerate]
lug.agonised.stephen at blacksapphire.com
Mon Aug 23 10:13:09 EDT 2010
> What do you think?
I think that's all mighty fine! I understand everything you said, and I
see how it all fits together. The Template Haskell is a great idea!
One small detail is that on our cross-compiler, TH doesn't work. This
may have to do with how I compiled it, but it is possible that GHC might need
some persuading before it will run Template Haskell when compiled as a
cross compiler. We should use TH in the general case, but I can bypass
it for my application.
My priority is completing the game, so some time soon I am going to
decide whether to contribute a C back end to accelerate, and then use
that, or hand-code the C. I know which one would be more fun, but I
have to be hard-nosed about it. Time is not the only consideration.
Code quality is also important, so accelerate is very attractive indeed.
Later I may add support for VecLib like we discussed, but I'd start with
If and when I come to it, I feel like I've got enough background now to
start ripping into it and sending you patches, with plenty of questions
about the details of course, so thank you very much for your email.
On 23/08/10 11:31 PM, Manuel M T Chakravarty wrote:
> Stephen Blackheath [to Accelerate]:
>> I mentioned to you at AusHac2010 that I've got some Haskell code in
>> our video game that needs speeding up, and accelerate is looking
>> like an attractive option to avoid having to write some C.
>> Since our platforms are mobile ones (and therefore no compiler at
>> runtime), I'm considering writing a static C back end for
>> accelerate, so I took a look at what would be needed.
>> The generated code would need to be a C function that takes
>> arguments, and so the accelerate AST would have to represent a
>> lambda expression. Assuming I'm reading the code right, it looks
>> like I can do this easily by adding the arguments to the
>> environment, then binding an Avar to the argument names at the
>> Haskell level. Primitive values can be passed in as scalars.
>> However, the accelerate language code could not have the current
>> type of Acc a ~ OpenAcc () a, because that doesn't allow the
>> environment type required by Avar. The obvious way to fix this is
>> to change every function in the accelerate language to have type
>> OpenAcc aenv a instead of Acc a. But - I can't imagine you would
>> want to do this.
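To make the environment issue above concrete, here is a minimal sketch of a toy typed AST. It only borrows the names OpenAcc, Avar and Acc from the discussion; the constructors, the list-based "arrays" and the eval function are assumptions made for illustration and are not the Accelerate implementation. It shows that a closed term (environment `()`) cannot contain an Avar, while an open term can stand for a function body whose argument lives in the environment.

```haskell
{-# LANGUAGE GADTs #-}

-- Typed de Bruijn index: a position in environment 'env' holding a 't'.
data Idx env t where
  ZeroIdx :: Idx (env, t) t
  SuccIdx :: Idx env t -> Idx (env, s) t

-- Toy array computations, parameterised by an array environment.
-- Plain lists stand in for arrays (an assumption of this sketch).
data OpenAcc aenv a where
  Use  :: [Float] -> OpenAcc aenv [Float]
  Avar :: Idx aenv a -> OpenAcc aenv a
  Map  :: (Float -> Float) -> OpenAcc aenv [Float] -> OpenAcc aenv [Float]

-- A closed computation: no index type-checks against the empty
-- environment, so a value of this type cannot contain an Avar.
type Acc a = OpenAcc () a

-- Look up the value an index points at.
prj :: Idx env t -> env -> t
prj ZeroIdx      (_, v)   = v
prj (SuccIdx ix) (env, _) = prj ix env

-- Reference evaluator: runs a term against a concrete environment.
eval :: OpenAcc aenv a -> aenv -> a
eval (Use xs)    _   = xs
eval (Avar ix)   env = prj ix env
eval (Map f acc) env = map f (eval acc env)

-- An open term: the body of a one-argument function, with the
-- argument bound in the environment rather than embedded via Use.
doubleArg :: OpenAcc ((), [Float]) [Float]
doubleArg = Map (* 2) (Avar ZeroIdx)

-- A closed term of the kind the surface language produces.
closedExample :: Acc [Float]
closedExample = Map (+ 1) (Use [1, 2, 3])
```

With these definitions, `eval doubleArg ((), xs)` plays the role of calling the generated function on `xs`, whereas `closedExample` needs no environment at all.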
> I don't think that this is necessary.
> As far as the front-end of Accelerate is concerned, there is not
> really a need for array computations to be closed. Similar to the
> existing functions D.A.A.Smart.convertFun1 and
> D.A.A.Smart.convertFun2, we could have support for converting
> array-valued functions (over Acc) — and by using a type class, we
> could avoid having an awkward family of conversion functions, but
> just have one (overloaded) class method.
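The single overloaded class method suggested above could look roughly like the following sketch. Everything here is hypothetical: the class name Afunction, the method convertWith, and the string-based Acc and Fun types are stand-ins chosen for illustration, not the existing D.A.A.Smart interface. The point is only that one class with two instances covers functions of any number of array arguments.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Toy stand-in for an array computation: just a textual description.
newtype Acc a = Acc String

-- Result of conversion: number of array parameters and the body.
data Fun = Fun Int String
  deriving (Eq, Show)

-- One overloaded method instead of convertFun1, convertFun2, ...
class Afunction f where
  convertWith :: Int -> f -> Fun

-- Base case: no more arguments, so record the body.
instance Afunction (Acc a) where
  convertWith n (Acc body) = Fun n body

-- Inductive case: feed a fresh variable to the function and recurse.
instance Afunction r => Afunction (Acc a -> r) where
  convertWith n f = convertWith (n + 1) (f (Acc ("a" ++ show n)))

-- Entry point: start numbering the parameters at zero.
convertFun :: Afunction f => f -> Fun
convertFun = convertWith 0

-- Example array function of two arguments.
zipPlus :: Acc [Float] -> Acc [Float] -> Acc [Float]
zipPlus (Acc x) (Acc y) = Acc ("zipWith (+) " ++ x ++ " " ++ y)
```

Here `convertFun zipPlus` yields a Fun with two parameters, and the same `convertFun` works unchanged on a plain `Acc` value or on a function of three or more arguments.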
> The CUDA backend couldn't translate such functions in its current
> version. However, as you were planning to have a new backend anyway,
> that might not matter. Nevertheless, it might also make sense to
> find a more general solution that could be used with all backends.
> (Then, we might be able to use both the C backend and the CUDA
> backend dynamically as well as statically.) The latter might be
> achieved in either of two ways:
> (1) The actual CUDA backend code D.A.A.CUDA.Compile.compileAcc takes
> an OpenAcc anyway. So, we could think about changing the interface.
> One disadvantage is that using a type class to avoid a family of run
> functions complicates the interface.
> (2) Given a function 'foo :: Vector Float -> Acc (Vector Float)' and
> two invocations 'foo vec1' and 'foo vec2' in the same program, the
> CUDA backend doesn't really generate the code for 'foo' twice.
> Instead, it does actually abstract 'foo' out, compiles that once, and
> caches the result. On the second invocation of 'foo', the cached
> binary is used. We would only need to arrange for that cached code
> to be generated at compile time.
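The caching behaviour described in (2) can be modelled with a small sketch. This is a toy, not the CUDA backend: the Acc type, the string "kernels" and the names skeleton and compileCached are all assumptions made for illustration. The cache key is the expression with every 'use'd array abstracted away, so two invocations of 'foo' on different arrays share one compiled kernel.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Toy array expression; NOT the Accelerate AST.
data Acc
  = Use [Float]
  | Map String Acc
  | ZipWith String Acc Acc

-- Cache key: the expression with every Use payload replaced by a hole.
skeleton :: Acc -> String
skeleton (Use _)         = "use"
skeleton (Map f a)       = "map " ++ f ++ " (" ++ skeleton a ++ ")"
skeleton (ZipWith f a b) =
  "zipWith " ++ f ++ " (" ++ skeleton a ++ ") (" ++ skeleton b ++ ")"

-- Compiled kernels by skeleton, plus a count of real compilations.
data Cache = Cache
  { kernels  :: [(String, String)]
  , compiles :: Int
  }

emptyCache :: IO (IORef Cache)
emptyCache = newIORef (Cache [] 0)

-- Return the cached kernel if present; otherwise "compile" and store it.
compileCached :: IORef Cache -> Acc -> IO String
compileCached ref acc = do
  cache <- readIORef ref
  let key = skeleton acc
  case lookup key (kernels cache) of
    Just kernel -> return kernel
    Nothing -> do
      let kernel = "kernel<" ++ key ++ ">"
      writeIORef ref (Cache ((key, kernel) : kernels cache) (compiles cache + 1))
      return kernel

-- The function from the text, in toy form.
foo :: [Float] -> Acc
foo xs = Map "(*2)" (Use xs)
```

Calling `compileCached` on `foo vec1` and then on `foo vec2` with the same cache leaves the compile count at one, which is the property that static generation would rely on.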
> I'm in favour of Alternative (2). The caching works by identifying
> all occurrences of the 'use' function in the array expression. The
> compilation process abstracts over all of them, generating binary
> code that is effectively a function over all 'use'd arrays. We could
> exploit that by introducing, in addition to 'run', a 'precompile'
> function for each backend. The function 'precompile' would have the
> same signature as 'run' and it would behave in the same way with two
> exceptions: (1) it wouldn't attempt to transfer the 'use'd arrays
> from the host to the device (in the case of CUDA) and (2) it wouldn't
> actually invoke the generated code.
> Even in applications that don't have to statically generate code,
> 'precompile' will be useful to populate the cache (e.g., in a game,
> you don't want to take the hit of the initial compile during the
> actual game play, but you want to move that into the startup phase).
> To precompile, we don't even need to make up an unused array argument.
> We can issue 'precompile (foo undefined)'; the 'undefined' array
> value won't be touched, thanks to laziness. Finally, to generate
> code during *compile* time, rather than at start up, we can use
> Template Haskell. (This should amount to a rather simple use of TH
> if I'm not overlooking anything.)
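The laziness argument can be checked on a toy model. In the sketch below, the Acc type and both functions are assumptions for illustration and do not reflect any backend's real signatures: 'precompile' walks the expression structure but never inspects the payload of Use, so passing 'undefined' is safe, whereas 'run' does force the array.

```haskell
-- Toy array expression; the Use field is lazy (no strictness annotation).
data Acc
  = Use [Float]
  | Map String (Float -> Float) Acc

-- Code generation only: looks at the structure, never at the array data.
precompile :: Acc -> String
precompile (Use _)        = "use"
precompile (Map name _ a) = "map " ++ name ++ " (" ++ precompile a ++ ")"

-- Execution: this one does force the 'use'd array.
run :: Acc -> [Float]
run (Use xs)    = xs
run (Map _ f a) = map f (run a)

-- The function from the text, in toy form.
foo :: [Float] -> Acc
foo xs = Map "double" (* 2) (Use xs)

-- Generated "code" for foo, obtained without supplying a real array.
fooCode :: String
fooCode = precompile (foo undefined)
```

Evaluating `fooCode` succeeds because the `undefined` array is never demanded; evaluating `run (foo undefined)` would fail, which is exactly the difference between the two functions.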
> Summary
> ~~~~~~~
> I'd prefer a general mechanism to separate code
> generation from the first invocation of an Accelerate computation.
> We can achieve that by introducing 'precompile', a variant of 'run',
> that omits actual code execution. In combination with TH, this
> enables static code generation. We could use that approach uniformly
> in the CUDA backend and in a C backend (as well as other backends).
> What do you think?