Re: [arm-gnu] how to compile C code to NEON instructions

David is right. You need to determine how Neon can best be used to implement your algorithm, which means structuring the algorithm for SIMD execution. If you then write the C code to follow that structure, there is a chance that the compiler will generate good code. However, learning the idiosyncrasies of the compiler's code generation will be difficult. I haven't found any documentation on how to write C for optimum SIMD vectorization. Trial-and-error coding or reading the compiler source code may be the best methods.

In working out the SIMD implementation of my algorithms, I found myself roughly coding the algorithm in Neon instructions. At that point it was easier for me to translate the algorithm to Neon Intrinsics than to figure out how to get the C compiler to produce similar results. The Intrinsics often generated unneeded register-to-register move instructions. If eliminating these extra instructions was important, I would implement the segment in inline assembly.
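
As a rough illustration of the intrinsics approach, here is a minimal sketch of a multiply-accumulate over float arrays. The function name madd_f32 and the HAVE_NEON macro are made up for this example; the intrinsics themselves (vld1q_f32, vmlaq_f32, vst1q_f32) come from arm_neon.h. The scalar fallback is an addition so that the same file also builds on a non-Neon target.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this example */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] * b[i] + c[i] for n floats. */
void madd_f32(float *dst, const float *a, const float *b,
              const float *c, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL && c != NULL);

#if HAVE_NEON
    /* Process four floats per iteration in 128-bit Q registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        /* vmlaq_f32(acc, x, y) computes acc + x * y per lane. */
        vst1q_f32(dst + i, vmlaq_f32(vc, va, vb));
    }
#endif

    /* Scalar tail (and the whole loop when Neon is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i] + c[i];
}
```

Compiling this with -S and reading the assembly is the quickest way to see whether the compiler added the extra register moves mentioned above.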

The link below provides some Neon coding examples.
http://www.arm.com/files/pdf/NEON_Support_in_the_ARM_Compiler.pdf

There is also a project to create a Neon-optimized math library. Many good coding samples can be found in its source.

For some things like in-place matrix transpose, only inline assembly will achieve optimal results. Neon can transpose a 4x4 matrix of 32-bit elements in four instructions. I have not seen a 'C' implementation that comes close to this.
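
For comparison, this is a sketch of how the same 4x4 transpose might be written with intrinsics rather than inline assembly. The function name transpose4x4_f32 and the HAVE_NEON macro are invented for this example, and the scalar branch is only there so the file builds on non-Neon targets. Whether a given compiler turns the vtrnq_f32/vcombine_f32 sequence into the four-instruction form is not guaranteed, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this example */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: transpose a row-major 4x4 float matrix in place. */
void transpose4x4_f32(float *m)
{
    assert(m != NULL);

#if HAVE_NEON
    float32x4_t r0 = vld1q_f32(m + 0);
    float32x4_t r1 = vld1q_f32(m + 4);
    float32x4_t r2 = vld1q_f32(m + 8);
    float32x4_t r3 = vld1q_f32(m + 12);

    /* Transpose each 2x2 block of 32-bit lanes within the row pairs. */
    float32x4x2_t t01 = vtrnq_f32(r0, r1);
    float32x4x2_t t23 = vtrnq_f32(r2, r3);

    /* Exchange the 64-bit halves to finish the 4x4 transpose. */
    vst1q_f32(m + 0,  vcombine_f32(vget_low_f32(t01.val[0]),
                                   vget_low_f32(t23.val[0])));
    vst1q_f32(m + 4,  vcombine_f32(vget_low_f32(t01.val[1]),
                                   vget_low_f32(t23.val[1])));
    vst1q_f32(m + 8,  vcombine_f32(vget_high_f32(t01.val[0]),
                                   vget_high_f32(t23.val[0])));
    vst1q_f32(m + 12, vcombine_f32(vget_high_f32(t01.val[1]),
                                   vget_high_f32(t23.val[1])));
#else
    /* Portable fallback: swap elements across the diagonal. */
    for (int i = 0; i < 4; ++i) {
        for (int j = i + 1; j < 4; ++j) {
            float tmp = m[i * 4 + j];
            m[i * 4 + j] = m[j * 4 + i];
            m[j * 4 + i] = tmp;
        }
    }
#endif
}
```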

Regards,
Bob

On 7/8/2011 12:33 AM, David Brown wrote:
On 08/07/2011 08:45, vandung.tran@xxxxxxxxxxxxxxxx wrote:

Hi Bob,
Thank you very much for your information.
> Yes, but the automatic vectorization is poor.
> Check out the Neon Intrinsics and using inline assembly in your C
> programs.

Can you provide me some documents that explain why automatic
vectorization is poor?

I want to know which compiler option is better: using -mfpu=neon or not?
So I am looking for some C source code (with no assembly inside) that can be
compiled to NEON instructions.
However, I can't find any.

Any information would be helpful.

Best Regards,
=====================
Tran Van Dung


What are you actually trying to do here? It sounds like you want to generate Neon instructions just so that you can say "the compiler can generate Neon instructions".

First, you'll have to learn about Neon - its programming model, registers, and instructions. Then /you/ have to write some C code that is relevant for /your/ application needs, and which you are confident would be best implemented using Neon. Then you compile it with various combinations of compiler flags and study the generated assembly code. Run the code and measure real-world timings - for both Neon and non-Neon variants. Then re-implement the code with explicit Neon intrinsics, and compare that to the compiler-generated code for speed and size.
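
The measuring step above could be sketched roughly as follows. The names kernel_fn, scale_kernel and time_kernel are invented for this example, and the build lines in the comment assume a toolchain called arm-linux-gnueabi-gcc and a source file called kernel.c. It is only a starting point: clock() is coarse, so the repeat count needs to be large enough to give a measurable time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Possible build lines (toolchain and file names are assumptions):
 *   arm-linux-gnueabi-gcc -march=armv7-a -O2 -S kernel.c
 *   arm-linux-gnueabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=softfp \
 *       -O3 -ftree-vectorize -S kernel.c
 * Compare the two .s files, then time both builds on the target.
 */

/* Hypothetical kernel signature shared by every variant under test. */
typedef void (*kernel_fn)(float *dst, const float *src, size_t n);

/* Plain C candidate for the compiler to vectorise: dst[i] = 2 * src[i]. */
void scale_kernel(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

/* Run fn reps times and return the elapsed CPU time in seconds. */
double time_kernel(kernel_fn fn, float *dst, const float *src,
                   size_t n, int reps)
{
    assert(fn != NULL && reps > 0);

    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        fn(dst, src, n);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_kernel once with the compiler-generated variant and once with an intrinsics variant of the same signature gives the speed comparison; code size can be read from the object files.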

/Then/, and only then, will you have a good understanding of how to work together with the compiler to get the fastest possible code.


As for automatic vectorisation being poor, it's actually a bit of a mixed bag. It has definitely been improving with newer versions of gcc - you have to be precise about the version number when asking about this, or when looking up the gcc manuals, as it's a part of gcc that has been under heavy development. The quality of the automatically vectorised code varies a lot - different arrangements of the source code can have a heavy influence on how well the compiler can understand and optimise it. It's important to give the compiler as much information as you can - for example, it is better to use arrays with fixed sizes rather than pointers, and the ordering of loops is vital. The compiler flags will also have a big effect - many of the loop and vectorisation optimisations are not enabled by any -O flags, but must be specified explicitly.
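
To make the points about fixed-size arrays, pointers and loop ordering concrete, here is a small sketch in plain C with no intrinsics or assembly. All names and sizes (N, ROWS, COLS, in_a, in_b, out_sum, add_fixed, add_ptr, scale_matrix) are invented for this example, and whether each loop is actually vectorised depends on the gcc version and flags, so treat it as something to experiment with rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define N    256   /* hypothetical trip count, known at compile time */
#define ROWS 64
#define COLS 64

float in_a[N], in_b[N], out_sum[N];

/* Fixed-size global arrays: the compiler can see the trip count and
 * knows the three arrays are distinct objects. */
void add_fixed(void)
{
    for (int i = 0; i < N; ++i)
        out_sum[i] = in_a[i] + in_b[i];
}

/* Pointer version: without restrict (C99) the compiler has to allow for
 * dst overlapping x or y, which can cost a runtime check or block
 * vectorisation altogether. */
void add_ptr(float *restrict dst, const float *restrict x,
             const float *restrict y, size_t n)
{
    assert(dst != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = x[i] + y[i];
}

/* Loop ordering: the inner loop walks the last index, so consecutive
 * iterations touch consecutive memory in a row-major C array. */
void scale_matrix(float m[ROWS][COLS], float k)
{
    assert(m != NULL);
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            m[r][c] *= k;
}
```

Swapping the two loops in scale_matrix makes the inner loop stride by COLS floats, which is the kind of rearrangement that can change what the vectoriser is able to do.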