This is MaxAlt's optimization page.

 * '''Launchpad Entry''': https://launchpad.net/distros/ubuntu/+spec/optimize-glibc-multi-core
 * '''Created''': <>
 * '''Contributors''': MaxAlt
 * '''Packages affected''':

== Summary ==

gcc 4.2 has integrated Intel's patch that enables Conroe/Merom (Core 2 Duo) optimizations by default at -O2. The patch was backported to gcc 4.1 and is attached to this page.

When building, make sure that the -m64 biarch patch is enabled with

 #define DRIVER_SELF_SPECS "%{m64:%{!mtune:-mtune=x86-64}}"

This has not been tested on a Bensley server, but on a Core 2 Duo laptop:

 * with -O3, I get a 7% improvement
 * with -O2, I get a 5% improvement for the static interpreter
 * with -O2, I get a 15% improvement for the dynamically linked interpreter
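
One way to confirm which tuning a given gcc invocation actually applied is to inspect the predefined macros from inside a small C program. The sketch below is only an illustration: the function names are hypothetical, and it assumes a GCC that predefines `__tune_core2__` and `__tune_k8__` for the corresponding `-mtune` values. Compilers that do not define them fall through to the last branch.

```c
#include <stdio.h>

/* Hypothetical helper: report the -mtune target as seen by the preprocessor.
 * Assumes GCC-style __tune_*__ macros; other compilers report "other". */
const char *sketch_tune_name(void)
{
#if defined(__tune_core2__)
    return "core2";
#elif defined(__tune_k8__)
    return "k8";
#else
    return "other";
#endif
}

/* Hypothetical helper: 1 when compiled as 64-bit x86 (for example with -m64). */
int sketch_is_x86_64(void)
{
#if defined(__x86_64__)
    return 1;
#else
    return 0;
#endif
}

/* Print both facts so that builds with different flags can be compared. */
void sketch_print_build_info(void)
{
    printf("x86-64: %d, tuned for: %s\n",
           sketch_is_x86_64(), sketch_tune_name());
}
```

Building this once with `-m64` alone and once with an explicit `-mtune=` option, then calling `sketch_print_build_info()`, shows whether the default tuning changed.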
== Rationale ==

== Use Cases ==

== Scope ==

 * Changes in glibc

== Design ==

Resolve the overhead of the interception algorithm that decides when to use the unchanged glibc strncmp and when to use the optimized strcmp, both when the call is inlined and when it is made as a function call.

strcmp will contain generic optimizations and will not be microarchitecture specific. The code itself is single threaded, so the shared cache architecture does not affect the optimizations directly.

The proposed code would:

 * take care of the alignment and length of the string
 * prefetch into cache if the string is reused or accessed from several threads
 * use the correct optimizing compiler flags and intrinsics
 * account for cache and cacheline size
 * use SSE/SSE2
 * reduce branch mispredictions

The new strcmp will be run through harness tests.

ld.so would benefit from the optimization as well: an optimized ld would be aware of the shared cache architecture and would prefetch shared strings into cache for multi-threaded compares.

=== Summary ===

=== Rationale ===

== Implementation ==

== Outstanding Issues ==

== BoF agenda and discussion ==

----
CategorySpec
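
The alignment and SSE2 points from the Design section can be illustrated with a minimal sketch. This is not the proposed glibc code. The name `sketch_strcmp`, the 4 KiB page size and the use of the GCC/Clang builtin `__builtin_ctz` are all assumptions made for the example. It compares 16 bytes at a time once the first string is 16-byte aligned, and falls back to a byte loop near a page boundary of the second string so that a vector load never touches an unmapped page. The vector loads may read a few bytes past a terminator inside the same page. That is safe on the hardware, but it is outside what ISO C guarantees and memory checkers will report it. Prefetching, cacheline sizing and misprediction tuning are not covered.

```c
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>              /* SSE2 intrinsics */
#endif

#define SKETCH_PAGE_SIZE 4096u      /* assumption: 4 KiB pages */

/* Hypothetical strcmp replacement: same sign convention as strcmp. */
int sketch_strcmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (;;) {
#ifdef __SSE2__
        /* Vector step: a is 16-byte aligned (its block stays inside one
         * page) and the unaligned 16-byte load from b stays inside b's page. */
        if (((uintptr_t)a & 15u) == 0 &&
            ((uintptr_t)b & (SKETCH_PAGE_SIZE - 1u)) <= SKETCH_PAGE_SIZE - 16u) {
            __m128i va  = _mm_load_si128((const __m128i *)a);
            __m128i vb  = _mm_loadu_si128((const __m128i *)b);
            __m128i eq  = _mm_cmpeq_epi8(va, vb);                  /* 0xFF where bytes match */
            __m128i nul = _mm_cmpeq_epi8(va, _mm_setzero_si128()); /* 0xFF where a has NUL  */
            /* One bit per byte: set where the bytes differ or a terminates. */
            unsigned mask = (~(unsigned)_mm_movemask_epi8(eq) |
                             (unsigned)_mm_movemask_epi8(nul)) & 0xFFFFu;
            if (mask == 0) {        /* 16 equal, non-NUL bytes: keep going */
                a += 16;
                b += 16;
                continue;
            }
            unsigned i = (unsigned)__builtin_ctz(mask);  /* first interesting byte */
            return (int)a[i] - (int)b[i];
        }
#endif
        /* Scalar step: used until a is aligned, near a page end of b,
         * and on targets without SSE2. */
        if (*a != *b || *a == 0)
            return (int)*a - (int)*b;
        ++a;
        ++b;
    }
}
```

A harness test for such a routine would compare its sign against the unchanged glibc strcmp over strings of varying length and alignment, including strings that end just before a page boundary.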