# Native Memory Allocator Verification
This document describes how to verify the native memory allocator on Android.
Follow this procedure when upgrading the allocator or moving to a new one.
A minor upgrade might not need to run every benchmark. However, at least the
[SQL Allocation Trace Benchmark](#sql-allocation-trace-benchmark),
[Memory Replay Benchmarks](#memory-replay-benchmarks), and
[Performance Trace Benchmarks](#performance-trace-benchmarks) should be run.

A native allocator can run in two modes on Android. The first is the normal
allocator. The second is called the svelte config. It is designed for
memory-constrained systems and is a bit slower, but it uses less RSS. To
enable the svelte config, add this line to the `BoardConfig.mk` for the given
target:

    MALLOC_SVELTE := true

The `BoardConfig.mk` file is usually found in the directory
`device/<DEVICE_NAME>/` or in one of its subdirectories.

When evaluating a native allocator, make sure that you benchmark both
configurations.

## Android Extensions
Android supports a few non-standard functions and mallopt controls that
a native allocator needs to implement.

### Iterator Functions
These functions are used to implement a memory leak detector
called `libmemunreachable`.

#### malloc\_disable
When called, this function should pause every thread that makes a call to
an allocation function (malloc/free/etc.). When `malloc_enable` is called,
the paused threads should start running again.

#### malloc\_enable
This function does nothing unless there was a previous call to
`malloc_disable`. When called, it unpauses any thread that was blocked in an
allocation function (malloc/free/etc.) because of the earlier
`malloc_disable` call.

#### malloc\_iterate
This function enumerates all of the allocations currently live in the
system.
It is meant to be called after `malloc_disable`, so that no further
allocations happen while it is executing. The best description of the
expected behavior is the set of tests for this function in
`bionic/tests/malloc_iterate_test.cpp`.

### Mallopt Extensions
These are mallopt options that Android requires for a native allocator
to work efficiently.

#### M\_DECAY\_TIME
When this option is set to zero with `mallopt(M_DECAY_TIME, 0)`, an allocator
is expected to try to purge and release any unused memory back to the kernel
on free calls. This matters on Android because it avoids consuming extra RSS.

When it is set to a non-zero value, for example with `mallopt(M_DECAY_TIME, 1)`,
an allocator can delay the purge and release. The length of the delay is up
to the allocator implementation, but it should be reasonable. The jemalloc
allocator was implemented with a one second delay.

The drawback of this option is that most allocators do not have a separate
thread to handle the purge, so the decay is only processed when an
allocation operation occurs. For server processes, this can mean that
RSS is slightly higher while the server waits for the next connection
and makes no other allocation calls. The `M_PURGE` option is used to
force a purge in this case.

On Android, `mallopt(M_DECAY_TIME, 1)` is called by default for all
applications. This lets application frees run a bit faster, at the cost of a
small increase in RSS.

#### M\_PURGE
When `mallopt(M_PURGE, 0)` is called, an allocator should purge and release
any unused memory immediately. The argument for this call is ignored. If
possible, this call should also clear thread-cached memory. The
purpose is to release memory that has not been purged yet because
`M_DECAY_TIME` is set to one.
This is useful for a server application that does a lot of native
allocations and wants to purge that memory before waiting for the next
connection.

## Correctness Tests
These are the tests that should be run to verify that an allocator behaves
the way Android expects.

### Bionic Unit Tests
The bionic unit tests contain a small number of allocator tests. These
tests primarily verify Android extensions and non-standard behavior
of the allocation routines. One example is what happens when memalign is
given an alignment that is not a power of two.

To run all of the compliance tests:

    adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"
    adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"

The allocation tests are not meant to be complete, so a native allocator is
expected to have its own set of tests that can be run.

### Libmemunreachable Tests
The libmemunreachable tests verify that the iterator functions are working
properly.

To run all of the tests:

    adb shell /data/nativetest64/memunreachable_binder_test/memunreachable_binder_test
    adb shell /data/nativetest/memunreachable_binder_test/memunreachable_binder_test
    adb shell /data/nativetest64/memunreachable_test/memunreachable_test
    adb shell /data/nativetest/memunreachable_test/memunreachable_test
    adb shell /data/nativetest64/memunreachable_unit_test/memunreachable_unit_test
    adb shell /data/nativetest/memunreachable_unit_test/memunreachable_unit_test

### CTS Entropy Test
In addition to the bionic tests, there is also a CTS test. It verifies that
the addresses returned by malloc are randomized enough to make potential
security bugs harder to exploit.
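
Here is a rough illustration of what "sufficiently randomized" means. The sketch assumes you have already recorded the address returned by the first `malloc` call in many separate process launches. The collection step is not shown. The sketch reports which address bits changed across those samples, and more varying bits means more entropy. It is not taken from `AslrMallocTest`, and the helper names `varying_bits` and `count_bits` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns a mask of the address bits that differ
 * between at least two of the n sampled addresses. A set bit changed from
 * one sample to another; a clear bit was identical in every sample. */
static uintptr_t varying_bits(const uintptr_t *addrs, size_t n) {
  assert(addrs != NULL && n > 0);
  uintptr_t mask = 0;
  for (size_t i = 1; i < n; i++) {
    /* Any bit that differs between two samples also differs between one of
     * them and the first sample, so XOR against addrs[0] is sufficient. */
    mask |= addrs[i] ^ addrs[0];
  }
  return mask;
}

/* Hypothetical helper: counts the set bits in a mask without relying on
 * compiler builtins. */
static int count_bits(uintptr_t mask) {
  int count = 0;
  while (mask != 0) {
    mask &= mask - 1; /* clears the lowest set bit */
    count++;
  }
  return count;
}
```

A mask with only a few set bits suggests that allocations land at predictable addresses from one run to the next. The thresholds the CTS test actually applies are not described here.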

Run the CTS test with:

    atest AslrMallocTest

If multiple devices are connected to the system, use `-s <SERIAL>`
to specify a device.

## Performance
There are several ways to evaluate the performance of a native
allocator on Android. The first is allocation speed in various scenarios.
The second is the total RSS used by the allocator.

The last is the virtual address space consumed by 32 bit applications. Only
a limited amount of address space is available to 32 bit apps, and some
allocator bugs have caused allocation failures when too much virtual
address space was consumed. For 64 bit executables, this can be ignored.

### Bionic Benchmarks
These microbenchmarks are part of the bionic benchmark suite. They can be
built using this command:

    mmma -j bionic/benchmarks

These benchmarks only measure the speed of the allocator. They
ignore RSS and virtual address space consumption.

For all of these benchmark runs, it can be useful to add these two options:

    --benchmark_repetitions=XX
    --benchmark_report_aggregates_only=true

These options run the benchmark XX times and report only the mean, median,
and standard deviation. That gives a number you can compare between the
current allocator and the new one.

In addition, there is another option:

    --bionic_cpu=XX

This option locks the benchmark to core XX. It also avoids
any issue related to the code migrating from one core to another
with different characteristics. For example, on a big-little CPU, if the
benchmark moves from a big core to a little core or vice versa, scores can
fluctuate unpredictably.
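
The effect of a CPU-pinning option can be sketched with the Linux `sched_setaffinity` call. This is a minimal sketch that assumes a Linux or Android target. The helper name `pin_to_cpu` is hypothetical, and it is only an assumption that `--bionic_cpu` is implemented exactly this way.

```c
#define _GNU_SOURCE /* exposes sched_setaffinity, cpu_set_t and CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: restricts the calling thread to a single CPU so a
 * benchmark cannot migrate between big and little cores mid-run. Returns 0
 * on success and -1 on failure, with errno set by sched_setaffinity. */
static int pin_to_cpu(int cpu) {
  assert(cpu >= 0 && cpu < CPU_SETSIZE);
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  /* A pid of 0 applies the mask to the calling thread. */
  return sched_setaffinity(0, sizeof(set), &set);
}
```

A benchmark would call `pin_to_cpu(3)` once before its timed loop, so that every iteration runs on the same core.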

For most runs, the best set of options to add is:

    --benchmark_repetitions=10 --benchmark_report_aggregates_only=true --bionic_cpu=3

On most phones with a big-little CPU, core 3 is a little core. Running on a
little core tends to highlight performance differences.

#### Allocate/Free Benchmarks
These benchmarks measure the speed of a loop that does a single allocation,
touches every page in the allocation to make it resident, and then frees
the allocation.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1

The last value in the output is the size of the allocation in bytes. These
benchmarks are useful for checking that there are no outliers, but their
numbers should not be used to make a final decision. If these numbers are
slightly worse than those of the current allocator, look at the single thread
numbers from the trace benchmarks. They better represent real world
behavior.

#### Multiple Allocations Retained Benchmarks
These benchmarks examine how the allocator handles multiple
allocations of the same size that are live at the same time.

The first set of these benchmarks does a fixed number of 8192 byte
allocations in one loop, and then frees all of the allocations at the end of
the loop.
Only the time it takes to do the allocations is recorded; the frees are not
counted. The value of 8192 was chosen because the jemalloc native allocator
had issues with this size. Other sizes might show different
results, but as mentioned before, these microbenchmark numbers should
not be used as absolutes for deciding whether an allocator is worth using.

This benchmark is designed to verify that there is no performance issue
related to having multiple allocations alive at the same time.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1

For these benchmarks, the last parameter is the total number of allocations
done in each loop.

The other variation of this benchmark always does forty allocations in
each loop, but varies the size of those allocations. As with the other
benchmark, only the time it takes to do the allocations is tracked; the
frees are not counted. Forty is an arbitrary number that could
be changed in the future. It was chosen because a version of the jemalloc
native allocator showed a problem at forty allocations.
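
The allocation pattern these retained benchmarks describe can be sketched as follows. The sketch times only the allocation phase and frees everything after the timer stops. It also writes one byte per page, plus the last byte, so the memory becomes resident. That step mirrors the Allocate/Free description, and it is only an assumption that the retained benchmarks touch memory too. The function name `time_retained_allocs` is hypothetical, and this is not the bionic benchmark code.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime and sysconf */
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: allocates `count` blocks of `size` bytes and keeps
 * them all live, writing one byte per page (plus the last byte) so the pages
 * are resident. Returns the nanoseconds spent in the allocation phase only,
 * or -1 on failure. The frees happen after the timer stops. */
static long long time_retained_allocs(size_t count, size_t size) {
  assert(count > 0 && size > 0);
  char **ptrs = calloc(count, sizeof(*ptrs));
  if (ptrs == NULL) return -1;
  const size_t page = (size_t)sysconf(_SC_PAGESIZE);
  int failed = 0;
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (size_t i = 0; i < count; i++) {
    char *p = malloc(size);
    if (p == NULL) {
      failed = 1;
      break;
    }
    for (size_t off = 0; off < size; off += page) p[off] = 1;
    p[size - 1] = 1;
    ptrs[i] = p;
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  /* Untimed cleanup; free(NULL) is a no-op for slots never filled. */
  for (size_t i = 0; i < count; i++) free(ptrs[i]);
  free(ptrs);
  if (failed) return -1;
  return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL +
         (long long)(end.tv_nsec - start.tv_nsec);
}
```

Calling `time_retained_allocs(N, 8192)` corresponds to the first variation, and calling `time_retained_allocs(40, size)` corresponds to the forty-allocation variation.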

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1

For these benchmarks, the last parameter in the output is the size of the
allocation in bytes.

As with the other microbenchmarks, numbers close to the current allocator's
values are usually sufficient to consider making
a switch. The trace benchmarks are more important than these benchmarks
since they simulate real world allocation profiles.

#### SQL Allocation Trace Benchmark
This benchmark is a trace of the allocations performed when running
the SQLite BenchMark app.

This benchmark is designed to verify that the allocator performs well
in a real world allocation scenario. SQL operations were chosen as a
benchmark because they tend to make many malloc/realloc/free
calls, and they are often on the critical path of applications.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1

The new allocator should perform at least as well as the current allocator on
these benchmarks.

#### mallinfo Benchmark
This benchmark only verifies that mallinfo still performs about as well as
it does with the current allocator.

To run the benchmark, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallinfo
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallinfo

ART calls mallinfo, so a new allocator is required to be
nearly as performant as the current allocator for this call.

#### mallopt M\_PURGE Benchmark
This benchmark tracks the cost of calling `mallopt(M_PURGE, 0)`. As with the
mallinfo benchmark, the new allocator does not need to be faster than the
previous allocator. Its performance only needs to be within the same order
of magnitude.

To run the benchmark, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallopt_purge
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallopt_purge

These calls are used to release unused memory pages back to the kernel.
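
The sketch below is a rough analogue of what the M\_PURGE benchmark measures: it times one purge call after freeing some memory. It assumes bionic's `<malloc.h>` defines `M_PURGE` as a macro. On a glibc host, where `M_PURGE` does not exist, it falls back to glibc's `malloc_trim(0)` as a stand-in. On other C libraries the purge is a no-op. The names `purge_unused_memory` and `time_purge` are hypothetical, and this is not `BM_mallopt_purge` itself.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: asks the allocator to return unused memory to the
 * kernel, using whichever mechanism this C library provides. */
static void purge_unused_memory(void) {
#if defined(M_PURGE)
  mallopt(M_PURGE, 0); /* Android extension; the value is ignored */
#elif defined(__GLIBC__)
  malloc_trim(0); /* glibc stand-in so the sketch also runs on a Linux host */
#endif
  /* Other C libraries: no purge is attempted. */
}

/* Hypothetical helper: frees `count` touched blocks of `size` bytes so some
 * memory is eligible for purging, then times a single purge call. */
static long long time_purge(size_t count, size_t size) {
  enum { kMaxBlocks = 1024 };
  assert(count > 0 && count <= kMaxBlocks && size > 0);
  void *ptrs[kMaxBlocks];
  for (size_t i = 0; i < count; i++) {
    ptrs[i] = malloc(size);
    if (ptrs[i] != NULL) memset(ptrs[i], 1, size); /* make pages resident */
  }
  for (size_t i = 0; i < count; i++) free(ptrs[i]);

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  purge_unused_memory();
  clock_gettime(CLOCK_MONOTONIC, &end);
  return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL +
         (long long)(end.tv_nsec - start.tv_nsec);
}
```

As the text above notes, the goal is only that a new allocator's purge cost stays within the same order of magnitude as the current one.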

### Memory Trace Benchmarks
These benchmarks measure all three axes of a native allocator: RSS, virtual
address space consumed, and allocation speed. They are designed to
run on a trace of the allocations from a real world application or system
process.

To build these benchmarks:

    mmma -j system/extras/memory_replay

This builds two executables:

    /system/bin/memory_replay32
    /system/bin/memory_replay64

It also builds these two benchmark executables:

    /data/benchmarktest64/trace_benchmark/trace_benchmark
    /data/benchmarktest/trace_benchmark/trace_benchmark

#### Memory Replay Benchmarks
These benchmarks display RSS and virtual memory consumed (VA space). They
also do a bit of performance testing on actual traces taken from running
applications.

The trace data records which thread performs each operation. The replay
mechanism simulates this by creating threads and replaying each operation
on the thread that originally performed it. The trace data does not include
timestamps, so the replay cannot be completely accurate. All of the
operations are replayed one after another, which removes the time gaps
between them and causes many threads to allocate at the same time. This
makes the replay a worst case scenario for concurrent allocations.

To generate these traces, use the
[record\_allocs](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md#record_allocs_total_entries)
option described in the
[Malloc Debug documentation](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md).
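
A program can get a quick, whole-process view of the two memory numbers these benchmarks report by reading `/proc/self/statm`. Its first two fields are the total virtual size and the resident set size, both in pages. This sketch is only an approximation. Judging from the `NativeInfo.cpp` note later in this document, `memory_replay` seems to attribute memory to the allocator's named maps instead, so its numbers will differ. The function name `read_va_and_rss` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L /* for sysconf */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: reads the total virtual size and resident set size of
 * the calling process from /proc/self/statm (both reported in pages) and
 * converts them to bytes. Returns 0 on success, -1 on failure. */
static int read_va_and_rss(unsigned long long *va_bytes,
                           unsigned long long *rss_bytes) {
  assert(va_bytes != NULL && rss_bytes != NULL);
  FILE *fp = fopen("/proc/self/statm", "r");
  if (fp == NULL) return -1;
  unsigned long long va_pages = 0, rss_pages = 0;
  int fields = fscanf(fp, "%llu %llu", &va_pages, &rss_pages);
  fclose(fp);
  if (fields != 2) return -1;
  unsigned long long page = (unsigned long long)sysconf(_SC_PAGESIZE);
  *va_bytes = va_pages * page;
  *rss_bytes = rss_pages * page;
  return 0;
}
```

Calling this every fixed number of operations during a replay loop gives a coarse version of the periodic RSS and VA dump that `memory_replay` prints.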

To run these benchmarks, first copy the trace files to the target using
this command:

    adb push system/extras/traces /data/local/tmp

All of the traces come from applications, so the `memory_replay` program
always calls `mallopt(M_DECAY_TIME, 1)` before running the trace.

Run the benchmark with:

    adb shell memory_replay64 /data/local/tmp/traces/XXX.zip
    adb shell memory_replay32 /data/local/tmp/traces/XXX.zip

Here, XXX.zip is the name of a zipped trace file. The `memory_replay`
program can also process text files, but all trace files are currently
checked in as zip files.

The program dumps the RSS and VA space every 100000 allocation operations,
and prints the final RSS and VA space numbers at the end.
For the most part, the intermediate data can be ignored, but it is always
a good idea to look over the data to verify that no strange spikes are
occurring.

The performance number measures the time it takes to perform all of
the allocation calls (malloc/memalign/posix_memalign/realloc/free/etc.).
For any call that allocates a pointer, the number includes both the time for
the call and the time it takes to make the allocated memory completely
resident.

The performance numbers for these runs vary widely, so
they should not be used as absolute values for comparison against the
current allocator. However, they should be in the same range as the current
values.

When evaluating an allocator, one of the most important traces is the
camera.txt trace. The camera application does very large allocations,
and some allocators might leave large virtual address maps around
rather than unmap them. When that happens, it can lead to allocation
failures that cause the camera app to abort or crash.
When this trace runs under the 32 bit replay executable, it is important to
verify that the virtual address space consumed is not much larger than with
the current allocator. A small increase (on the order of a few MB) is
acceptable.

There is no specific benchmark for memory fragmentation. Instead, the RSS
measured while running the memory traces acts as a proxy for it. An allocator
that fragments badly will show an increase in RSS. The best trace for
tracking fragmentation is system\_server.txt, which is an extremely long
trace (~13 million operations). The total number of live allocations goes
up and down a bit but stays mostly the same. An allocator that fragments
badly would therefore likely show an abnormal increase in RSS on this trace.

NOTE: When a native allocator calls mmap, it is expected to
name the map using this call:

    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, <PTR>, <SIZE>, "libc_malloc");

If the native allocator uses a different name, then you must
modify this file:

    system/extras/memory_replay/NativeInfo.cpp

The `GetNativeInfo` function needs to be modified to include the names
of the maps that this allocator creates.

In addition, the frameworks code keeps track of the memory of a process, so
any named maps must be added to this file:

    frameworks/base/core/jni/android_os_Debug.cpp

Modify the `load_maps` function and add a check for the new expected name.

#### Performance Trace Benchmarks
This benchmark treats the trace data as if all allocations
occurred in a single thread. This scenario could
happen if all of the allocations were spaced out in time so that no thread
ever allocates at the same time as another thread.
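
The single-thread model can be sketched as a loop that replays every operation, in order, on the calling thread. The in-memory trace format here is hypothetical and is not the format `trace_benchmark` reads. Each operation has a type, a slot index that stands in for the recorded pointer, and a size. Timing is also left out; a benchmark would normally wrap the call with `clock_gettime`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical trace format: each entry names a slot that holds a live
 * pointer, so the trace does not depend on real addresses. */
enum op_type { OP_MALLOC, OP_REALLOC, OP_FREE };

struct trace_op {
  enum op_type type;
  size_t slot; /* index into the table of live pointers */
  size_t size; /* ignored for OP_FREE */
};

/* Hypothetical helper: replays every operation on the calling thread, one
 * after another. Assumes a well-formed trace (no OP_MALLOC into a slot that
 * is still live). Returns the number of allocations that failed. Any pointer
 * still live at the end is freed so the replay does not leak. */
static size_t replay_single_thread(const struct trace_op *ops, size_t n_ops,
                                   void **slots, size_t n_slots) {
  size_t failures = 0;
  for (size_t i = 0; i < n_ops; i++) {
    const struct trace_op *op = &ops[i];
    assert(op->slot < n_slots);
    void **slot = &slots[op->slot];
    switch (op->type) {
      case OP_MALLOC:
        *slot = malloc(op->size);
        if (*slot == NULL) failures++;
        break;
      case OP_REALLOC: {
        void *p = realloc(*slot, op->size);
        if (p == NULL) {
          failures++; /* the old block stays valid in its slot */
        } else {
          *slot = p;
        }
        break;
      }
      case OP_FREE:
        free(*slot);
        *slot = NULL;
        break;
    }
  }
  for (size_t s = 0; s < n_slots; s++) {
    free(slots[s]);
    slots[s] = NULL;
  }
  return failures;
}
```

The trace refers to slots rather than raw addresses, so the replay can map each recorded pointer to whatever address the allocator under test returns.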

Run these benchmarks thusly:

    adb shell /data/benchmarktest64/trace_benchmark/trace_benchmark
    adb shell /data/benchmarktest/trace_benchmark/trace_benchmark

When run without any arguments, the benchmark will run over all of the
traces and display data. It takes many minutes to complete these runs in
order to get as accurate a number as possible.