I am looking right now at Schneier's reference implementation of the Blowfish key setup function, and there is nothing in it that would not be straightforward to parallelize across a CUDA warp (read: no conditional branches). Is it your assertion that getting from here to a working bcrypt cracker would be difficult enough to constitute a significant "contribution to cryptography literature"?
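
To make the "no conditional branches" point concrete, here is a minimal sketch of the shape of Blowfish key setup in plain C. It is not Schneier's reference code: the names `bf_ctx`, `bf_f`, `bf_encrypt` and `bf_key_setup` are hypothetical, and the caller is assumed to have already filled `P` and `S` with the standard pi-derived initial constants, which are not reproduced here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical context layout: 18 subkeys plus four 256-entry S-boxes.
 * Assumption: the caller pre-loads P and S with the standard Blowfish
 * initial constants before calling bf_key_setup. */
typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_ctx;

/* Round function: four table lookups combined with add and xor.
 * The indices depend on the data, but nothing here is a branch. */
static uint32_t bf_f(const bf_ctx *c, uint32_t x)
{
    return ((c->S[0][x >> 24] + c->S[1][(x >> 16) & 0xff])
            ^ c->S[2][(x >> 8) & 0xff]) + c->S[3][x & 0xff];
}

/* 16-round Feistel network, two rounds per loop iteration. */
static void bf_encrypt(const bf_ctx *c, uint32_t *l, uint32_t *r)
{
    uint32_t L = *l, R = *r;
    for (int i = 0; i < 16; i += 2) {
        L ^= c->P[i];
        R ^= bf_f(c, L);
        R ^= c->P[i + 1];
        L ^= bf_f(c, R);
    }
    L ^= c->P[16];
    R ^= c->P[17];
    *l = R;   /* final swap of the two halves */
    *r = L;
}

/* Key setup: xor the key (cycled) into P, then replace every P and S
 * entry with successive encryptions of an initially all-zero block.
 * Every loop has a fixed trip count that does not depend on the key. */
void bf_key_setup(bf_ctx *c, const uint8_t *key, size_t keylen)
{
    size_t k = 0;
    for (int i = 0; i < 18; i++) {
        uint32_t w = 0;
        for (int j = 0; j < 4; j++) {
            w = (w << 8) | key[k];
            k = (k + 1) % keylen;   /* wrap around the key bytes */
        }
        c->P[i] ^= w;
    }

    uint32_t l = 0, r = 0;
    for (int i = 0; i < 18; i += 2) {
        bf_encrypt(c, &l, &r);
        c->P[i] = l;
        c->P[i + 1] = r;
    }
    for (int b = 0; b < 4; b++) {
        for (int i = 0; i < 256; i += 2) {
            bf_encrypt(c, &l, &r);
            c->S[b][i] = l;
            c->S[b][i + 1] = r;
        }
    }
}
```

In this sketch the only control flow is counted loops, so threads that each run it on a different candidate key would follow the same instruction sequence; the S-box reads are the only data-dependent part, and those are memory accesses rather than branches.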