2024-02-11

Writing high-performance code.


Writing high-performance code requires a thorough understanding of the underlying hardware. Fields such as game development and high-frequency trading depend on this knowledge. This article highlights the relevant areas and where optimization potential lies.

"Top Ten" optimization approaches

  1. Profiling Before Optimizing: Understand where the program spends most of its time. Use profiling tools to identify bottlenecks.

  2. Algorithmic Improvements: Switching to a more efficient algorithm often yields larger performance gains than micro-optimizations.

  3. Data Structure Selection: Choosing the right data structure for the job can significantly impact performance, especially in terms of data access and manipulation.

  4. Code Simplification: Simplifying complex code can improve both readability and performance by reducing overhead and making further optimizations more apparent.

  5. Loop Unrolling: Manually expanding loops can reduce loop overhead in critical sections of code, though this should be used sparingly and tested for actual performance improvements.

  6. Avoiding Unnecessary Work: Eliminating redundant calculations and skipping work that does not need to be done can significantly improve performance.

  7. Memory Access Optimization: Optimizing how memory is accessed and laid out, including minimizing cache misses and arranging data structures for locality of reference (a short sketch follows the note below).

  8. Parallelization: Using multiple processors or cores to perform computations in parallel can dramatically speed up tasks that can be divided into independent units of work.

  9. Lazy Evaluation: Deferring computations until their results are needed can save resources, especially if there is a chance the results will never be required.

  10. Compiler Optimizations: Taking advantage of compiler optimizations by understanding how different compiler flags affect performance and writing code in a way that allows the compiler to optimize effectively.

NOTE: Use of Efficient Libraries: Leveraging highly optimized libraries and frameworks for common tasks can save development time and improve performance, as these libraries are often tuned by experts.
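
As a short illustration of memory access optimization (point 7 above), the following sketch contrasts a cache-friendly row-major traversal of a matrix with a column-major traversal that touches a different cache line on almost every access. The function names and matrix size are illustrative, not taken from any particular codebase.

#include <cstddef>
#include <vector>

// Illustrative size: large enough that the matrix does not fit in any cache.
constexpr std::size_t N = 4096;

// Cache-friendly: walks memory in the order it is laid out (row-major).
long long sum_row_major(const std::vector<long long>& m) {
    long long sum = 0;
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            sum += m[row * N + col];
    return sum;
}

// Cache-unfriendly: consecutive accesses are N elements apart, so almost
// every access lands on a different cache line.
long long sum_column_major(const std::vector<long long>& m) {
    long long sum = 0;
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            sum += m[row * N + col];
    return sum;
}

Both functions compute the same result; only the order of memory accesses differs, and the row-major version is typically several times faster on a matrix of this size.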

Definitions

Term         Definition
Latency      Amount of time it takes to finish one job (time/operation)
Throughput   Number of jobs that can be finished within a given time (operations/time)

Cache Line

A cache line is the smallest unit of data transfer between the cache and main memory. Modern CPUs use caches to reduce the latency of data access by storing frequently accessed data in faster, smaller memory near the CPU cores. These caches are significantly faster than accessing RAM but are also much smaller. The cache is organized into lines, where each line is typically 64 bytes on modern processors, though the exact size varies between CPU architectures.

To retrieve the cache line size on Linux, execute:

$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
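
The cache line size can also be queried programmatically at run time. A minimal sketch, assuming a glibc-based Linux system where sysconf supports the (non-POSIX) _SC_LEVEL1_DCACHE_LINESIZE query:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; it may return 0 or -1
    // on systems that do not expose this information.
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line_size > 0)
        printf("L1 data cache line size: %ld bytes\n", line_size);
    else
        printf("Cache line size not available via sysconf\n");
    return 0;
}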

Ensuring data structure alignment to cache line boundaries can significantly impact performance, especially on modern architectures where cache line sizes are typically 64 bytes. Aligning data structures to these boundaries can help reduce cache misses and false sharing in multithreaded applications.

// Define a struct and align it to a 64-byte boundary, typical for cache 
// lines 

typedef struct { 
    int data; 
    // More fields here 
} __attribute__((aligned(64))) AlignedStruct;

This example uses the __attribute__((aligned(64))) syntax to request that the AlignedStruct structure be aligned on a 64-byte boundary, which matches common cache line sizes.
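
In C++11 and later, the standard alignas specifier achieves the same alignment portably. The sketch below is an illustration (not taken from the code above) of how cache-line alignment helps avoid false sharing: two counters updated by different threads are placed on separate 64-byte cache lines, so writes to one do not invalidate the line holding the other.

#include <atomic>
#include <thread>

// Each counter occupies its own cache line (64 bytes assumed), so the two
// threads below never write to the same line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counter_a;
PaddedCounter counter_b;

int main() {
    std::thread t1([] { for (int i = 0; i < 1000000; ++i) counter_a.value.fetch_add(1); });
    std::thread t2([] { for (int i = 0; i < 1000000; ++i) counter_b.value.fetch_add(1); });
    t1.join();
    t2.join();
    return 0;
}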

Dynamic Allocation Example

#include <stdlib.h>

AlignedStruct* alloc_aligned_struct() {
    // posix_memalign requires the alignment to be a power of two and a
    // multiple of sizeof(void*); 64 satisfies both.
    void* ptr = NULL;

    if (posix_memalign(&ptr, 64, sizeof(AlignedStruct)) != 0) {
        return NULL; // Allocation failed
    }
    return (AlignedStruct*)ptr;
}
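
Memory allocated with posix_memalign is released with an ordinary free() call once it is no longer needed.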

time

The output of the CLI tool time under Linux can be used to explain the factors that influence execution speed.

time provides the following times:

Type   Explanation
real   Duration of the entire execution, including all I/O operations, such as reading a file from the hard drive.
user   Processing time in user mode
sys    Processing time in kernel mode

Example measurement 1)

real 0m1.949s
user 0m26.654s
sys 0m0.868s

A higher value under user than under real arises when multiple cores are used: user represents the CPU time aggregated across all cores, so it can exceed the wall-clock time.

Example measurement 2)

$ time cat index.db > /dev/null

0m02.94s real
0m00.01s user
0m00.87s system

Reading a large file like this is dominated by I/O, because the content has to be fetched from the hard drive. When the measurement is repeated immediately afterwards, the data is already cached and is therefore retrieved much faster.

$ time cat index.db  > /dev/null 

0m00.39s real     
0m00.00s user
0m00.33s system

mmap

mmap (memory map) is a Linux/UNIX system call that maps files or devices into memory. It provides a mechanism for applications to access files on disk as if they were in memory, facilitating both reading from and writing to the file. This can lead to performance improvements because it allows for direct file data manipulation without the need for read and write system calls, and it can also reduce the overhead of disk I/O operations.

How mmap Works

Mapping: mmap creates a new mapping in the virtual address space of the calling process. The file or device content appears in memory, starting at a specified address (if given) or at an address chosen by the kernel. The memory area is then accessible by the program as if it were a part of its own data space.

Lazy Loading

The actual loading of file data into memory is typically lazy; it doesn't happen all at once when mmap is called. Instead, data is loaded into memory on demand, i.e., as the program accesses the mapped areas.

Page Caching

The kernel uses the page cache to manage the mapped data, meaning that modifications to the mapped area can be cached and written back to the file later (in the case of a writable map), and reads can be satisfied from the cache directly if the data is already present.

Synchronization

Changes made to the mapped memory are eventually synchronized back to the file. This can happen automatically at intervals, when msync is explicitly called, or when the mapping is released (using munmap).
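
As a concrete illustration, here is a minimal sketch that maps a file read-only and sums its bytes. The file name is only an example, and error handling is kept to the bare minimum.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char* path = "index.db";   // example file, as used above
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; pages are loaded lazily on first access.
    unsigned char* data =
        (unsigned char*)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    unsigned long long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        sum += data[i];              // each first touch of a page may fault it in

    printf("byte sum: %llu\n", sum);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}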

constexpr

constexpr in C++ specifies that the value of a variable or function can be evaluated at compile time. This means the compiler will compute the value or result before the program runs, leading to potentially more efficient code. It's used for defining constants, allowing for compile-time calculations, and ensuring that functions can be executed during the compilation phase when their arguments are also constant expressions.
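
A minimal sketch of what this means in practice: a factorial computed from a constant argument is folded into the binary at compile time, while the same function still works with run-time arguments.

#include <iostream>

// Evaluated at compile time when the argument is a constant expression.
constexpr long long factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

int main() {
    // Forced compile-time evaluation: the compiler emits the constant 3628800.
    constexpr long long at_compile_time = factorial(10);

    // The same function can still be called with a run-time value.
    int n;
    std::cin >> n;
    long long at_run_time = factorial(n);

    std::cout << at_compile_time << " " << at_run_time << "\n";
    return 0;
}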

Using constexpr in C++ can have the following negative aspects or limitations:

  1. Complexity: It can increase the complexity of the code, especially for developers not familiar with compile-time computations.
  2. Restrictions on Code: constexpr functions are limited in what they can do. They cannot have side effects, use dynamic memory allocation, or contain static or thread-local variables, among other restrictions.
  3. Compile-Time Overhead: While it can make run-time execution faster, it may increase compile time because the compiler performs the computations during compilation.
  4. Maintenance and Readability: Overuse of constexpr can make code harder to maintain or read, especially in contexts where compile-time evaluation is not necessary or beneficial.
  5. Portability Issues: Older compilers, or those not fully compliant with the latest C++ standards, may not support constexpr well, leading to portability issues.
  6. Debugging Difficulty: Debugging constexpr expressions can be more challenging because they are evaluated at compile time.