Re: [arm-gnu] how to compile C code to NEON instructions
- To: vandung.tran@xxxxxxxxxxxxxxxx
- Subject: Re: [arm-gnu] how to compile C code to NEON instructions
- From: Bob Feretich <bob.feretich@xxxxxxxxxxxxxxx>
- Date: Fri, 08 Jul 2011 12:32:36 -0700
David is right. You need to determine how the Neon can be used to best
implement your algorithm. This involves structuring your algorithm for
SIMD execution. Once structured for SIMD, then if you write the "C" code
to follow this structure, there is a chance that the compiler will
generate good code. However, learning the idiosyncrasies of the
compiler's code generation will be difficult. I haven't found any
documentation on how to write C for optimum SIMD vectorization. Trial
and error coding or reading the compiler source code may be the best
methods.
In determining the SIMD implementation of my algorithms, I found myself
roughly coding the algorithm in Neon instructions. At that point it was
easier for me to translate the algorithm to Neon Intrinsics than to
figure out how to get the C compiler to produce similar results. The
Intrinsics often generated unneeded register-to-register move
instructions. If eliminating these extra instructions was important, I
would implement the segment in inline assembly.
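As a rough illustration of the gap between plain C and Intrinsics, here
is a minimal sketch that adds two float arrays both ways. The names
add_scalar and add_neon are made up for this example. The Intrinsics
path is only compiled when the compiler defines __ARM_NEON or
__ARM_NEON__; on any other target the function falls back to the scalar
loop, so the file still builds and runs.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Plain C version: the compiler may or may not vectorize this loop. */
void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Intrinsics version: four floats per iteration, then a scalar tail.
 * Without Neon support only the scalar tail loop is compiled. */
void add_neon(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add and store */
    }
#endif
    for (; i < n; i++)                          /* leftover elements */
        dst[i] = a[i] + b[i];
}
```

Comparing the assembly output (gcc -S) of the two functions is one way
to spot the extra move instructions mentioned above.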
The link below provides some Neon coding examples.
http://www.arm.com/files/pdf/NEON_Support_in_the_ARM_Compiler.pdf
There is also a project to create a Neon optimized Math library. Many
good coding samples can be found in its source.
For some things like in-place matrix transpose, only inline assembly
will achieve optimal results.
Neon can transpose a 4x4 matrix of 32-bit elements in four instructions.
I have not seen a 'C' implementation that comes close to this.
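For reference, one commonly cited four-instruction sequence is two
vtrn.32 instructions followed by two vswp instructions; whether that is
exactly the sequence meant here is an assumption. The sketch below
expresses the same data movement with Intrinsics (vtrnq_u32 plus
vcombine_u32). The function name transpose4x4_u32 is hypothetical, the
matrix is assumed to be row-major, and nothing guarantees that the
compiler will reduce this to four instructions. On targets without Neon
it falls back to a plain swap loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Transpose a row-major 4x4 matrix of 32-bit elements in place. */
void transpose4x4_u32(uint32_t m[16])
{
    assert(m != NULL);
#if HAVE_NEON
    uint32x4_t r0 = vld1q_u32(m + 0);
    uint32x4_t r1 = vld1q_u32(m + 4);
    uint32x4_t r2 = vld1q_u32(m + 8);
    uint32x4_t r3 = vld1q_u32(m + 12);

    /* Interleave pairs of rows: val[0] = {a0,b0,a2,b2},
     *                           val[1] = {a1,b1,a3,b3}. */
    uint32x4x2_t t01 = vtrnq_u32(r0, r1);
    uint32x4x2_t t23 = vtrnq_u32(r2, r3);

    /* Recombine 64-bit halves (the job vswp does in assembly). */
    vst1q_u32(m + 0,  vcombine_u32(vget_low_u32(t01.val[0]),
                                   vget_low_u32(t23.val[0])));
    vst1q_u32(m + 4,  vcombine_u32(vget_low_u32(t01.val[1]),
                                   vget_low_u32(t23.val[1])));
    vst1q_u32(m + 8,  vcombine_u32(vget_high_u32(t01.val[0]),
                                   vget_high_u32(t23.val[0])));
    vst1q_u32(m + 12, vcombine_u32(vget_high_u32(t01.val[1]),
                                   vget_high_u32(t23.val[1])));
#else
    /* Portable fallback: swap each element above the diagonal
     * with its mirror below the diagonal. */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            uint32_t tmp = m[4 * i + j];
            m[4 * i + j] = m[4 * j + i];
            m[4 * j + i] = tmp;
        }
    }
#endif
}
```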
Regards,
Bob
On 7/8/2011 12:33 AM, David Brown wrote:
On 08/07/2011 08:45, vandung.tran@xxxxxxxxxxxxxxxx wrote:
Hi Bob
Thank you very much for your information
> Yes, but the automatic vectorization is poor.
> Check out the Neon Intrinsics and using inline assembly in your C
> programs.
Can you provide me some documents that explain why automatic
vectorization is poor?
I want to know which is better: compiling with -mfpu=neon or without it?
So I am looking for some C source code (no assembly inside) that can be
compiled to NEON instructions.
However, I can't find any.
Any information would be helpful.
Best Regards,
=====================
Tran Van Dung
What are you actually trying to do here? It sounds like you want to
generate Neon instructions just so that you can say "the compiler can
generate Neon instructions".
First, you'll have to learn about Neon - its programming model,
registers, and instructions. Then /you/ have to write some C code
that is relevant for /your/ application needs, and which you are
confident would be best implemented using Neon. Then you compile it
using various selections of compiler flags, studying the generated
assembly code. Run the code and measure real world timings - for both
Neon and non-Neon variants. Then re-implement the code with explicit
Neon intrinsics, and compare that to the compiler-generated code for
speed and size.
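The measuring step can be sketched roughly as follows. This is a
minimal harness, not a rigorous benchmark: seconds_per_call and
sample_kernel are hypothetical names, clock() reports CPU time at
coarse resolution, and the kernel is only a stand-in for the real
routine under test.

```c
#include <assert.h>
#include <time.h>

/* Keeps the compiler from discarding the kernel's result. */
static volatile unsigned sink;

/* Stand-in workload; replace with the routine being measured. */
void sample_kernel(void)
{
    unsigned s = 0;
    for (unsigned i = 0; i < 1000; i++)
        s += i * i;
    sink = s;
}

/* Average CPU seconds per call of fn, measured over reps calls. */
double seconds_per_call(void (*fn)(void), int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Building the same file once with Neon enabled and once without, and
calling seconds_per_call on each build with a large reps value, gives
the two timings to compare.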
/Then/, and only then, will you have a good understanding about how to
work together with the compiler to get the fastest possible code.
As for automatic vectorisation being poor, it's actually a bit of a
mixed bag. It has definitely been improving with newer versions of
gcc - you have to be precise about the version number when asking
about this, or when looking up the gcc manuals, as it's a part of gcc
that has been under heavy development. The quality of the automatic
vectorisation code varies a lot - different arrangements of the source
code can have a heavy influence in how well the compiler can
understand and optimise it. It's important to give the compiler as
much information as you can - for example, it is better to use arrays
with fixed sizes rather than pointers, and the ordering of loops is
vital. The compiler flags will also have a big effect - many of the
loop and vectorisation optimisations are not enabled by any -O flags,
but must be specified explicitly.
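To make the advice about fixed-size arrays concrete, here is a small
sketch of a loop written to be easy for the vectoriser to analyse. The
array names, the length N and the function scale_add are all made up
for illustration. Integer data is used on purpose: the gcc manual notes
that floating-point code is not auto-vectorised for Neon unless
-funsafe-math-optimizations is also given. Whether a particular gcc
version actually emits Neon code for this loop has to be checked in the
generated assembly.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 1024 };  /* hypothetical fixed length */

/* Trip count is a multiple of four 32-bit lanes, so no scalar
 * tail loop is needed after vectorisation. */
static_assert(N % 4 == 0, "N should be a multiple of the vector width");

/* File-scope arrays with a known size: the compiler can see that
 * they do not overlap and how many iterations the loop runs. */
uint32_t src_a[N];
uint32_t src_b[N];
uint32_t dst[N];

/* dst[i] = src_a[i] * k + src_b[i], one simple counted loop with
 * unit-stride accesses and no function calls in the body. */
void scale_add(uint32_t k)
{
    for (int i = 0; i < N; i++)
        dst[i] = src_a[i] * k + src_b[i];
}
```

A possible command line to try, assuming an ARM gcc of that era, is
"gcc -O2 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -S file.c".
In those versions -ftree-vectorize is implied by -O3 but not by -O2.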