This is the MaxAlt optimization page.

Summary

gcc 4.2 has integrated Intel's patch for Conroe/Merom (Core 2 Duo) default optimizations at -O2. The patch was backported to gcc 4.1 and is attached to this page. When building, make sure that the -m64 biarch patch is enabled with

#define DRIVER_SELF_SPECS "%{m64:%{!mtune:-mtune=x86-64}}"

(this spec adds -mtune=x86-64 when -m64 is given without an explicit -mtune).

Not yet tested on a Bensley server, but on a Core 2 Duo laptop:
- with -O3, I get a 7% improvement
- with -O2, I get a 5% improvement for the static interpreter
- with -O2, I get a 15% improvement for the dynamically linked interpreter


Rationale

Use Cases

Scope

Design

Resolve the overhead of the interception algorithm that decides when to use the unchanged glibc strncmp and when to use the optimized strcmp, both when the routine is inlined and when it is reached by a call.
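One way to keep the interception cost low in the by-call case is to decide once and then go through a function pointer, so each later call pays only for one indirect call. The following is a minimal sketch under stated assumptions, not the proposed implementation: dispatch_strcmp, select_strcmp and optimized_strcmp_stub are hypothetical names, the "optimized" routine is only a placeholder, and __builtin_cpu_supports is a GCC builtin that is much newer than the gcc 4.1/4.2 releases discussed on this page.

```c
#include <string.h>

typedef int (*strcmp_fn)(const char *, const char *);

/* Hypothetical placeholder for the optimized routine; a real one would
 * use the techniques listed in the Design section. */
static int optimized_strcmp_stub(const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Fallback: the unchanged libc routine. */
static int libc_strcmp_wrapper(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Picks an implementation from the CPU features seen at run time. */
strcmp_fn select_strcmp(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return optimized_strcmp_stub;
#endif
    return libc_strcmp_wrapper;
}

/* Cached choice; NULL until the first call. This lazy initialisation is
 * not synchronised, so a threaded program should resolve it before
 * starting threads. */
static strcmp_fn resolved_strcmp;

int dispatch_strcmp(const char *a, const char *b)
{
    if (!resolved_strcmp)
        resolved_strcmp = select_strcmp();
    return resolved_strcmp(a, b);
}
```

The inlined case cannot be intercepted this way, because the compiler has already expanded the comparison at the call site; that part of the overhead question stays open.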

strcmp will contain generic optimizations and will not be microarchitecture specific. The code itself is single threaded, so the shared cache architecture does not affect the optimizations directly. The proposed code would:
 * take care of the alignment and length of the strings
 * prefetch into cache if the data is reused or threaded
 * use the correct optimized compiler flags and intrinsics
 * account for cache and cache line size
 * use SSE/SSE2
 * reduce branch mispredictions
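To make the alignment and SSE2 items above concrete, here is a rough sketch of a 16-bytes-at-a-time compare using SSE2 intrinsics. It is an illustration under assumptions, not the proposed code: the name sketch_strcmp_sse2 is hypothetical, a 4096-byte page size is assumed, and __builtin_ctz is a GCC builtin. Like most vectorised string routines it may load bytes past the terminating NUL, which is why it falls back to a byte loop whenever a 16-byte load would cross a page boundary.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* True if a 16-byte load starting at p would run into the next page. */
static int load_crosses_page(const void *p)
{
    return ((uintptr_t)p & (ASSUMED_PAGE_SIZE - 1)) > ASSUMED_PAGE_SIZE - 16;
}
#endif

int sketch_strcmp_sse2(const char *a, const char *b)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();

    for (;;) {
        if (!load_crosses_page(a) && !load_crosses_page(b)) {
            __m128i va = _mm_loadu_si128((const __m128i *)a);
            __m128i vb = _mm_loadu_si128((const __m128i *)b);
            unsigned eq  = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
            unsigned nul = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, zero));
            /* Bit i is set when byte i differs or a[i] is the terminator. */
            unsigned mask = (~eq | nul) & 0xFFFFu;
            if (mask) {
                unsigned i = (unsigned)__builtin_ctz(mask);
                return (unsigned char)a[i] - (unsigned char)b[i];
            }
        } else {
            /* Near a page end: compare this chunk one byte at a time. */
            for (int i = 0; i < 16; i++) {
                unsigned char ca = (unsigned char)a[i];
                unsigned char cb = (unsigned char)b[i];
                if (ca != cb)
                    return ca - cb;
                if (ca == 0)
                    return 0;
            }
        }
        a += 16;
        b += 16;
    }
#else
    /* No SSE2 available: plain byte loop. */
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
#endif
}
```

The loop has one data-dependent branch per 16 bytes instead of one per byte, which is where the reduction in mispredictions would come from. Prefetching and cache line awareness are not shown here.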

Run the new strcmp through harness tests.
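A harness for this could compare the sign of the candidate's result against the libc strcmp over a set of strings placed at every offset within a 16-byte block, so that aligned and unaligned paths both get exercised. The sketch below is only an assumption about what such a harness might look like; harness_count_mismatches and broken_always_equal are hypothetical names, and the sample strings are arbitrary.

```c
#include <stddef.h>
#include <string.h>

typedef int (*strcmp_fn)(const char *, const char *);

static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Deliberately wrong comparator, used only to show that the harness
 * notices a bad implementation. */
int broken_always_equal(const char *a, const char *b)
{
    (void)a;
    (void)b;
    return 0;
}

/* Returns how many cases the candidate gets a different sign from the
 * libc strcmp. Only the sign is compared, because the C standard does
 * not fix the magnitude of the result. */
int harness_count_mismatches(strcmp_fn candidate)
{
    static const char *const samples[] = {
        "", "a", "b", "ab", "abc", "abd",
        "0123456789abcdef",   /* exactly 16 bytes */
        "0123456789abcdefg",  /* one byte past a 16-byte block */
        "0123456789abcdeF",   /* differs in the last byte of the block */
        "\xff" "high byte",   /* checks unsigned char comparison */
    };
    const size_t n = sizeof samples / sizeof samples[0];
    char buf_a[64], buf_b[64];
    int mismatches = 0;

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t off_a = 0; off_a < 16; off_a++)
                for (size_t off_b = 0; off_b < 16; off_b++) {
                    /* Zero the buffers so reads past the NUL stay defined. */
                    memset(buf_a, 0, sizeof buf_a);
                    memset(buf_b, 0, sizeof buf_b);
                    strcpy(buf_a + off_a, samples[i]);
                    strcpy(buf_b + off_b, samples[j]);
                    int expected = sign_of(strcmp(buf_a + off_a, buf_b + off_b));
                    int actual = sign_of(candidate(buf_a + off_a, buf_b + off_b));
                    if (expected != actual)
                        mismatches++;
                }
    return mismatches;
}
```

A candidate passes when harness_count_mismatches returns 0. This covers correctness only; the timing side of the harness would need a separate benchmark.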

ld.so would benefit from this optimization as well: an optimized ld would be aware of the shared cache architecture and would prefetch shared strings into cache for multi-threaded compares.
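As a loose illustration of the prefetch idea, and not of how ld.so actually resolves symbols, the sketch below scans a table of names and asks for the next name to be brought into cache while the current one is being compared. lookup_with_prefetch is a hypothetical name and the linear table is an assumption made for brevity; __builtin_prefetch is a GCC builtin and is only a hint to the hardware.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index of wanted in names, or -1 if it is not present. */
long lookup_with_prefetch(const char *const *names, size_t count,
                          const char *wanted)
{
    for (size_t i = 0; i < count; i++) {
        /* Hint: read access (0), keep in all cache levels (3). */
        if (i + 1 < count)
            __builtin_prefetch(names[i + 1], 0, 3);
        if (strcmp(names[i], wanted) == 0)
            return (long)i;
    }
    return -1;
}
```

Whether this helps depends on how far apart the strings are in memory and on whether another thread sharing the cache has already touched them, so it would need measuring on the target hardware.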

Summary

Rationale

Implementation

Outstanding Issues

BoF agenda and discussion


CategorySpec

Maxalt (last edited 2008-08-06 16:16:29 by localhost)