static code generation and open Accs
Manuel M T Chakravarty
mchakravarty at mac.com
Mon Aug 23 07:31:56 EDT 2010
Stephen Blackheath [to Accelerate]:
> I mentioned to you at AusHac2010 that I've got some Haskell code in our
> video game that needs speeding up, and accelerate is looking like an
> attractive option to avoid having to write some C.
> Since our platforms are mobile ones (and therefore no compiler at
> runtime), I'm considering writing a static C back end for accelerate, so
> I took a look at what would be needed.
> The generated code would need to be a C function that takes arguments,
> and so the accelerate AST would have to represent a lambda expression.
> Assuming I'm reading the code right, it looks like I can do this easily
> by adding the arguments to the environment, then binding an Avar to the
> argument names at the Haskell level. Primitive values can be passed in
> as scalars.
> However, the accelerate language code could not have the current type of
> Acc a ~ OpenAcc () a, because that doesn't allow the environment type
> required by Avar. The obvious way to fix this is to change every
> function in the accelerate language to have type OpenAcc aenv a instead
> of Acc a. But - I can't imagine you would want to do this.
I don't think that this is necessary.
As far as the front-end of Accelerate is concerned, there is not really a need for array computations to be closed. Similar to the existing functions D.A.A.Smart.convertFun1 and D.A.A.Smart.convertFun2, we could add support for converting array-valued functions (over Acc). By using a type class, we could avoid an awkward family of conversion functions and instead have just one (overloaded) class method.
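A minimal sketch of the "one overloaded class method" idea, written against a toy expression type rather than Accelerate's real AST. The names Expr, Convertible, convertFrom and convert are made up for illustration and are not part of D.A.A.Smart; the sketch only shows that one base instance plus one instance for 'Expr -> r' covers functions of any arity.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Toy stand-in for an array expression; NOT Accelerate's AST.
data Expr
  = Arg Int          -- reference to the n-th function argument
  | Lit Int
  | Add Expr Expr
  deriving (Eq, Show)

-- Hypothetical class: turns a Haskell function over Expr into
-- (number of arguments, body with the arguments replaced by Arg nodes).
class Convertible f where
  convertFrom :: Int -> f -> (Int, Expr)

-- Base case: nothing left to abstract over.
instance Convertible Expr where
  convertFrom n body = (n, body)

-- Inductive case: apply the function to the next placeholder and recurse.
instance Convertible r => Convertible (Expr -> r) where
  convertFrom n f = convertFrom (n + 1) (f (Arg n))

-- The single entry point that replaces a convertFun1/convertFun2 family.
convert :: Convertible f => f -> (Int, Expr)
convert = convertFrom 0
```

With these definitions, 'convert (\x y -> Add x y)' and 'convert (Lit 3)' go through the same method; only the instance chosen by the type checker differs.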
The CUDA backend couldn't translate such functions in its current version. However, as you were planning to have a new backend anyway, that might not matter. Nevertheless, it might also make sense to find a more general solution that could be used with all backends. (Then, we might be able to use both the C backend and the CUDA backend dynamically as well as statically.) The latter might be achieved in either of two ways:
(1) The actual CUDA backend code D.A.A.CUDA.Compile.compileAcc takes an OpenAcc anyway. So, we could think about changing the interface. One disadvantage is that using a type class to avoid a family of run functions complicates the interface.
(2) Given a function 'foo :: Vector Float -> Acc (Vector Float)' and two invocations 'foo vec1' and 'foo vec2' in the same program, the CUDA backend doesn't really generate the code for 'foo' twice. Instead, it does actually abstract 'foo' out, compiles that once, and caches the result. On the second invocation of 'foo', the cached binary is used. We would only need to arrange for that cached code to be generated at compile time.
I'm in favour of Alternative (2). The caching works by identifying all occurrences of the 'use' function in the array expression. The compilation process abstracts over all of them, generating binary code that is effectively a function over all 'use'd arrays. We could exploit that by introducing, in addition to 'run', a 'precompile' function for each backend. The function 'precompile' would have the same signature as 'run' and it would behave in the same way with two exceptions: (1) it wouldn't attempt to transfer the 'use'd arrays from the host to the device (in the case of CUDA) and (2) it wouldn't actually invoke the generated code.
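To make the caching idea concrete, here is a small sketch on a toy array language. Everything in it (the Acc type, skeleton, Cache, RunInfo, and the pure signatures of 'run' and 'precompile') is an assumption made for illustration and does not reflect the CUDA backend's actual data structures; the "compiled code" is just a string.

```haskell
import qualified Data.Map as Map

-- Toy array language; NOT Accelerate's AST.
data Acc
  = Use [Float]            -- host array embedded in the expression
  | MapA String Acc        -- named elementwise operation
  | ZipA String Acc Acc    -- named binary elementwise operation

-- Structural key with every 'Use' payload replaced by a slot number, so
-- expressions that differ only in their data share one key.
-- The payload itself is never inspected.
skeleton :: Acc -> String
skeleton acc = fst (go acc 0)
  where
    go :: Acc -> Int -> (String, Int)
    go (Use _)      n = ("use#" ++ show n, n + 1)
    go (MapA f a)   n =
      let (s, n') = go a n
      in  ("map " ++ f ++ " (" ++ s ++ ")", n')
    go (ZipA f a b) n =
      let (s1, n1) = go a n
          (s2, n2) = go b n1
      in  ("zip " ++ f ++ " (" ++ s1 ++ ") (" ++ s2 ++ ")", n2)

-- Stand-in for the cache of compiled binaries: key -> "generated code".
type Cache = Map.Map String String

-- Hypothetical 'precompile': generate and cache code, do nothing else.
precompile :: Acc -> Cache -> Cache
precompile acc cache
  | Map.member key cache = cache
  | otherwise            = Map.insert key ("kernel for " ++ key) cache
  where
    key = skeleton acc

-- Collect the 'use'd arrays; only 'run' needs them.
usedArrays :: Acc -> [[Float]]
usedArrays (Use xs)     = [xs]
usedArrays (MapA _ a)   = usedArrays a
usedArrays (ZipA _ a b) = usedArrays a ++ usedArrays b

data RunInfo = RunInfo { cacheHit :: Bool, transferred :: Int }
  deriving (Eq, Show)

-- Hypothetical 'run': same caching path as 'precompile', plus a count of
-- the elements that would be transferred to the device.
run :: Acc -> Cache -> (RunInfo, Cache)
run acc cache = (RunInfo hit elems, precompile acc cache)
  where
    hit   = Map.member (skeleton acc) cache
    elems = sum (map length (usedArrays acc))
```

In this model, calling 'precompile' on 'foo vec1' and then 'run' on 'foo vec2' reports a cache hit, because both expressions have the same skeleton.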
Even in applications that don't have to statically generate code, 'precompile' will be useful to populate the cache (e.g., in a game, you don't want to take the hit of the initial compile during the actual game play, but you want to move that into the startup phase).
To precompile, we don't even need to make up an unused array argument. We can issue 'precompile (foo undefined)'; the 'undefined' array value won't be touched, thanks to laziness. Finally, to generate code at *compile* time, rather than at startup, we can use Template Haskell. (This should amount to a rather simple use of TH if I'm not overlooking anything.)
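The laziness argument can be checked on a toy model. The Acc type and the 'generate' function below are hypothetical stand-ins, not Accelerate's; the sketch assumes, as the real mechanism would have to, that code generation inspects only the shape of the expression and that the payload field of 'Use' is not strict.

```haskell
-- Toy array language; NOT Accelerate's AST.
data Acc
  = Use [Float]        -- lazy field: the array is not forced on construction
  | MapA String Acc

-- Hypothetical stand-in for code generation: looks only at the shape of
-- the expression and never at the contents of a 'Use' node.
generate :: Acc -> String
generate (Use _)    = "in"
generate (MapA f a) = f ++ "(" ++ generate a ++ ")"

foo :: [Float] -> Acc
foo v = MapA "sqrt" (MapA "abs" (Use v))

-- No real input array is needed to obtain the code for 'foo'.
precompiled :: String
precompiled = generate (foo undefined)   -- "sqrt(abs(in))"
```

If 'generate' ever forced the payload (for example to read the array's length), the 'undefined' trick would fail at that point, so a real 'precompile' would have to avoid such accesses.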
I'd prefer a general mechanism to separate code generation from the first invocation of an Accelerate computation. We can achieve that by introducing 'precompile', a variant of 'run', that omits actual code execution. In combination with TH, this enables static code generation. We could use that approach uniformly in the CUDA backend and in a C backend (as well as other backends).
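As a rough illustration of the TH part, the following sketch evaluates a string-producing expression while the module is being compiled and embeds the result as a literal. The "generator" here is just 'unlines' applied to a made-up C fragment, and kernelSource is a hypothetical name; a real generator such as the proposed 'precompile' would have to be imported from another module (TH's stage restriction) and, if it performs IO, be lifted into the Q monad with runIO.

```haskell
{-# LANGUAGE TemplateHaskell #-}

import Language.Haskell.TH (stringE)

-- The expression inside $( ... ) is evaluated by GHC while compiling this
-- module; only the resulting string literal ends up in the program.
kernelSource :: String
kernelSource =
  $(stringE (unlines
      [ "float twice(float x) {"
      , "  return 2.0f * x;"
      , "}"
      ]))
```

At run time, kernelSource is an ordinary constant, so no code generation happens on the target device.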
What do you think?