Writing high-performance code requires a thorough understanding of the underlying hardware. Areas such as game development and high-frequency trading demand this knowledge to succeed. This article highlights the relevant areas and where the optimization potential lies.
- **Profiling before optimizing**: Understand where the program spends most of its time. Use profiling tools to identify bottlenecks.
- **Algorithmic improvements**: Switching to a more efficient algorithm often yields more significant performance gains than micro-optimizations.
- **Data structure selection**: Choosing the right data structure for the job can significantly impact performance, especially in terms of data access and manipulation.
- **Code simplification**: Simplifying complex code can improve both readability and performance by reducing overhead and making further optimizations more apparent.
- **Loop unrolling**: Manually expanding loops can reduce loop overhead in critical sections of code, though this should be used sparingly and tested for actual performance improvements (a minimal sketch follows after this list).
- **Avoiding unnecessary work**: Eliminating unnecessary calculations and avoiding work that does not need to be done can significantly improve performance.
- **Memory access optimization**: Optimizing the way memory is accessed and laid out, including minimizing cache misses and arranging data structures for locality of reference.
- **Parallelization**: Utilizing multiple processors or cores to perform computations in parallel can dramatically speed up processing for tasks that can be divided into independent units of work.
- **Lazy evaluation**: Deferring computations until their results are needed can save resources, especially if there is a chance the results may not be required.
- **Compiler optimizations**: Take advantage of compiler optimizations by understanding how different compiler flags affect performance and by writing code in a way that allows the compiler to optimize effectively.
- **Use of efficient libraries** (note): Leveraging highly optimized libraries and frameworks for common tasks can save development time and improve performance, as these libraries are often optimized by experts.
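To make the loop-unrolling point concrete, here is a minimal sketch (the function names are illustrative only). Modern compilers often perform this transformation on their own, so any manual unrolling should be benchmarked:

```cpp
#include <cstddef>

// Baseline: sum an array one element at a time.
double sum_simple(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Manually unrolled by four: fewer loop-control instructions and more
// independent additions, which can help instruction-level parallelism.
// Note: the extra accumulators reorder the floating-point additions,
// which may change the result slightly.
double sum_unrolled(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // remaining elements
    return (s0 + s1) + (s2 + s3);
}
```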
Term | Definition |
---|---|
Latency | Time it takes to finish one job (time/operation) |
Throughput | Number of jobs that can be finished within a given time (operations/time) |
A cache line is the smallest unit of data transfer between the cache and main memory. Modern CPUs use caches to reduce the latency of data access by storing frequently accessed data in faster, smaller memory near the CPU cores. These caches are significantly faster than accessing RAM but are also much smaller. The cache is organized into lines, where each line typically ranges from 32 to 64 bytes in modern processors, though the exact size can vary between CPU architectures.
To retrieve the cache line size on Linux, execute:

```
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
```
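The same value can also be queried from within a program. A minimal sketch, assuming Linux with glibc for the sysconf constant; the C++17 interference-size constant is only a portable compile-time hint, not necessarily the exact hardware value:

```cpp
#include <unistd.h>  // sysconf, _SC_LEVEL1_DCACHE_LINESIZE (glibc extension)
#include <new>       // std::hardware_destructive_interference_size (C++17)
#include <cstdio>

int main() {
    // May return -1 or 0 if the value is not available on this system.
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 data cache line size (sysconf): %ld bytes\n", line_size);

#ifdef __cpp_lib_hardware_interference_size
    std::printf("hardware_destructive_interference_size: %zu bytes\n",
                std::hardware_destructive_interference_size);
#endif
    return 0;
}
```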
Ensuring data structure alignment to cache line boundaries can significantly impact performance, especially on modern architectures where cache line sizes are typically 64 bytes. Aligning data structures to these boundaries can help reduce cache misses and false sharing in multithreaded applications.
```c
// Define a struct and align it to a 64-byte boundary, typical for cache lines
typedef struct {
    int data;
    // More fields here
} __attribute__((aligned(64))) AlignedStruct;
```
This example uses the `__attribute__((aligned(64)))` syntax to request that the AlignedStruct structure be aligned on a 64-byte boundary, which matches common cache line sizes.
```c
#include <stdlib.h>

AlignedStruct* alloc_aligned_struct(void) {
    void* ptr = NULL;
    // posix_memalign requires the alignment to be a power of two and a
    // multiple of sizeof(void*); it returns 0 on success.
    if (posix_memalign(&ptr, 64, sizeof(AlignedStruct)) != 0) {
        return NULL; // Allocation failed
    }
    return (AlignedStruct*)ptr;
}
```
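Alignment also matters for the false sharing mentioned above. A minimal multithreaded sketch, assuming 64-byte cache lines, in which each thread increments its own padded counter so that no two threads write to the same cache line:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Without alignas(64), adjacent counters could share a cache line, and every
// increment on one thread would invalidate that line for the other threads
// (false sharing). With the alignment each counter occupies its own line.
// Compile as C++17 or later so std::vector allocates over-aligned elements.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    const int n_threads = 4;
    std::vector<PaddedCounter> counters(n_threads);

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```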
The output of the CLI tool time under Linux helps explain the factors influencing execution speed. time reports the following values:
Type | Explanation |
---|---|
real | Duration of the entire execution, including all I/O operations, such as reading a file from the hard drive. |
user | CPU time spent in user mode |
sys | CPU time spent in kernel mode |
Example measurement 1)
```
real    0m1.949s
user    0m26.654s
sys     0m0.868s
```
A higher value for user than for real occurs when multiple cores are used: user is the CPU time aggregated across all cores.
Example measurement 2)
```
$ time cat index.db > /dev/null
0m02.94s real
0m00.01s user
0m00.87s system
```
Outputting a large file consists primarily of I/O operations, because the content has to be read from the hard drive. In a second measurement run immediately afterwards, the data is already cached and is therefore retrieved faster.
```
$ time cat index.db > /dev/null
0m00.39s real
0m00.00s user
0m00.33s system
```
mmap (memory map) is a Linux/UNIX system call that maps files or devices into memory. It provides a mechanism for applications to access files on disk as if they were in memory, facilitating both reading from and writing to the file. This can lead to performance improvements because it allows for direct file data manipulation without the need for read and write system calls, and it can also reduce the overhead of disk I/O operations.
- **Mapping**: mmap creates a new mapping in the virtual address space of the calling process. The file or device content appears in memory, starting from a specified address (if given) or at one chosen by the kernel. The memory area is then accessible by the program as if it were part of its own data space.
- **Lazy loading**: The actual loading of file data into memory is typically lazy; it does not happen all at once when mmap is called. Instead, data is loaded into memory on demand, i.e. as the program accesses the mapped areas.
- **Page cache**: The kernel uses the page cache to manage the mapped data, meaning that modifications to the mapped area can be cached and written back to the file later (in the case of a writable mapping), and reads can be satisfied directly from the cache if the data is already present.
- **Synchronization**: Changes made to the mapped memory are eventually synchronized back to the file. This can happen automatically at intervals, when msync is explicitly called, or when the mapping is released (using munmap).
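As an illustration of this mechanism, here is a minimal sketch (POSIX only; error handling kept short; the file name is just an example) that maps a file read-only and scans it without a single read call, letting the kernel fault pages in lazily:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// Count the newlines in a file through a read-only mapping.
long count_lines(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st{};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    const char* data = static_cast<const char*>(p);
    long lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)  // pages are faulted in on demand
        if (data[i] == '\n') ++lines;

    munmap(p, st.st_size);
    return lines;
}

int main() {
    std::printf("%ld lines\n", count_lines("index.db"));
    return 0;
}
```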
constexpr in C++ specifies that the value of a variable or function can be evaluated at compile time. This means the compiler will compute the value or result before the program runs, leading to potentially more efficient code. It is used for defining constants, allowing for compile-time calculations, and ensuring that functions can be executed during the compilation phase when their arguments are also constant expressions.
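A minimal sketch of both sides of this: a value computed entirely by the compiler, and the same function evaluated at run time (the fibonacci function is just an illustrative example):

```cpp
#include <cstddef>

// A constexpr function can be evaluated at compile time when its arguments
// are constant expressions, and at run time otherwise.
constexpr std::size_t fibonacci(std::size_t n) {
    return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);
}

// Computed by the compiler; no run-time cost.
constexpr std::size_t fib10 = fibonacci(10);
static_assert(fib10 == 55, "evaluated at compile time");

int main() {
    std::size_t n = 20;  // run-time value: the same function runs at run time
    return static_cast<int>(fibonacci(n) % 256);
}
```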
Using constexpr in C++ can have the following negative aspects or limitations:

- constexpr functions are limited in what they can do. They cannot have side effects, use dynamic memory allocation, or contain static or thread-local variables, among other restrictions.
- Overuse of constexpr can make code harder to maintain or read, especially if used in contexts where compile-time evaluation is not necessary or beneficial.
- Older compilers may not support constexpr well, leading to portability issues.
- Debugging constexpr expressions can be more challenging because they are evaluated at compile time.