"At the core of SnappyCam is a capture and image signal processing engine with innovations that took over 12 months of research and development. With it, we can also beat competing iOS camera apps by 400% on full-sensor shooting performance with the same iOS device and hardware.
Once photos are captured and buffered in real time, our multi-threaded JPEG compression engine takes over. It compresses shots in software at speeds that exceed those of the hardware encoder normally dedicated to the task.
We had to reinvent JPEG to do it. First we studied the fast discrete cosine transform (DCT) algorithms of the early 1990s, when JPEG was first introduced. We then extended some of that research to create a new algorithm that’s a good fit for the ARM NEON SIMD co-processor instruction set architecture. The final implementation comprises nearly 10,000 lines of hand-tuned assembly code, and over 20,000 lines of low-level C code. (In comparison, the SnappyCam app comprises almost 50,000 lines of Objective C code.)
At first we did try to leverage the iPhone graphics processing unit (GPU) for the DCT computation. It turned out to be a dead-end. Back then, iOS 4 limited the data transfer speed in and out of the GPU; but even with that limitation eliminated, with the introduction of OpenGL pixel buffers in iOS 5, it appeared that the GPU parallelism was limited to about two render units that ran at a slower clock-rate than the main CPU. Without support for OpenCL or multiple render targets, we were also forced to use a naive (slow) DCT algorithm that was essentially a full matrix multiplication.
The ARM NEON approach was optimal: the SIMD pipeline can perform up to 8 simultaneous arithmetic operations in parallel at the full clock rate of the device, without any data transfer overheads, and allowing us to use any DCT algorithm we could conceive. And when it comes to speed, it’s all about doing less for more. Less computation, more work done, faster.
We also optimized out pipeline bubbles using a cycle counter tool so that every clock tick was put to work.
JPEG compression comprises two parts: the DCT (above), and a lossless Huffman compression stage that forms a compact JPEG file. Having developed a blazing fast DCT implementation, Huffman then became a bottleneck. We innovated on that portion with tight hand-tuned assembly code that leverages special features of the ARM processor instruction set to make it as fast as possible.
Similar innovations were put into a custom JPEG decoder, powering the unique SnappyCam thumb-to-interact living photo viewer. When dealing with massive 8 Mpx (32 MByte BGRX uncompressed) images, decoder performance became critical to a great user experience."
This reads like genuine old-school low-level wizardry, the kind of hacking and tuning rarely seen these days. Amazingly well done by the developers. Sounds like an epic amount of effort went into it.
https://web.archive.org/web/20131010012005/http://www.snappy...