I was porting this very interesting feature to an older Glibc version for ARMv7, and I needed a simple tool to gauge any performance boost (or drop, for that matter). Surprisingly little information is available on this topic. I didn't search hard, but most papers on malloc performance date back many years. Other benchmark suites are too heavy for my simple task. I found some smaller tools, but they lack either portability or a credible test methodology.
I learned that Glibc's dev team has a set of very well-engineered benchtests for their benchmarking and regression testing. And this person has extracted the necessary bits from benchtests and made them compilable standalone on x86 Linux.
I prefer my tools to be essential and simple, so what I really need is the Glibc team's test routine. With a little tinkering, I cross-compiled it for ARMv7 Entware as well as for the stock firmware's uClibc. Then I wrote a script to direct the tests and drive the binaries.
So here comes the extremely simple package, 'bench-malloc', for the RT-AC56U and compatible machines.
About the Tests
The tool measures and reports the performance of memory allocation in the system's LIBC library under a simulated multi-threaded application workload.
The included script runs the test with 1, 2, 4, 8 and 16 threads. Going beyond 16 threads may be crash-prone on small systems. Each test lasts one minute, during which the thread(s) perform as many 'malloc/free' operations as possible. Each 'malloc' requests a random size between 4 and 32768 bytes.
The probability distribution of the random sizes seems to favor small values. Hence, the per-thread cache feature is expected to perform very well under such a workload. That's also typical of real-world applications.
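To illustrate how a random size generator can favor small values, here is a minimal Python sketch. It assumes an inverse-square power-law mapping from a uniform random number to a size. The exponent and the names `block_size`, `share_below`, `MIN_SIZE` and `MAX_SIZE` are illustrative assumptions, not taken from the benchmark; only the 4 to 32768 byte range comes from the test description.

```python
import random

MIN_SIZE = 4        # smallest request in bytes (from the test description)
MAX_SIZE = 32768    # largest request in bytes (from the test description)
EXPONENT = -2.0     # ASSUMPTION: inverse-square law, chosen for illustration


def block_size(u):
    """Map a uniform value u in [0, 1] to a size skewed toward MIN_SIZE."""
    lo = MIN_SIZE ** (EXPONENT + 1)
    hi = MAX_SIZE ** (EXPONENT + 1)
    size = ((hi - lo) * u + lo) ** (1.0 / (EXPONENT + 1))
    # Clamp to guard against floating-point rounding at the edges.
    return max(MIN_SIZE, min(MAX_SIZE, int(size)))


def share_below(limit, samples=100_000, seed=1):
    """Fraction of sampled sizes that are smaller than `limit` bytes."""
    rng = random.Random(seed)
    hits = sum(block_size(rng.random()) < limit for _ in range(samples))
    return hits / samples


if __name__ == "__main__":
    print("share of requests under 64 bytes:", share_below(64))
```

With a skew like this, the large majority of requests are tiny, which is the kind of workload a small-object cache is built for.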
At the end of each test, the actual time taken (by all threads) is divided by the total number of 'malloc/free' operations performed by all the threads.
The script presents this number as "per malloc(ns)", which is the average time one 'malloc' takes in nanoseconds. A lower "per malloc(ns)" indicates better performance!
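The arithmetic behind "per malloc(ns)" can be sketched in a few lines of Python. This is only an illustration: it reads "time taken (by all threads)" as the sum of per-thread times, which is an assumption, and the function name `per_malloc_ns` and the sample figures are hypothetical.

```python
def per_malloc_ns(thread_seconds, thread_ops):
    """Average time per malloc/free operation in nanoseconds.

    thread_seconds: time each thread spent in the test, in seconds
    thread_ops:     number of malloc/free operations each thread completed
    """
    total_ns = sum(thread_seconds) * 1e9   # ASSUMPTION: times are summed
    total_ops = sum(thread_ops)
    if total_ops == 0:
        raise ValueError("no operations were recorded")
    return total_ns / total_ops


if __name__ == "__main__":
    # HYPOTHETICAL figures: two threads, one minute each.
    print(per_malloc_ns([60.0, 60.0], [100_000_000, 100_000_000]))
```

Under this reading, doubling the thread count without gaining any throughput doubles the reported number, so the figure reflects contention as well as raw allocator speed.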
Hardware: 1.2GHz Cortex-A9 ASUS RT-AC56U
Stock firmware (uClibc 0.9.33.2)          Stock Entware (Glibc 2.23)
# th  per malloc(ns)  max rss(kB)         # th  per malloc(ns)  max rss(kB)
   1           534.4          752            1           284.2          744
   2          3156.3          756            2           596.0          872
   4          6924.8          812            4          1203.6         1180
   8         14555.6         1136            8          2568.0         1628
  16         32240.9         1744           16          5418.9         2240
The per-thread cache patch for Entware's LIBC is still under test. Below are some preliminary numbers!
Entware Glibc 2.23 w/ per-thread cache
# th  per malloc(ns)  max rss(kB)
   1           158.9          756
   2           315.4          956
   4           646.9         1324
   8          1415.4         1892
  16          3131.5         2812
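To put the RT-AC56U numbers side by side, the following Python sketch computes the ratio of "per malloc(ns)" between builds for each thread count. The figures are copied from the tables above; the variable and function names are illustrative.

```python
THREADS = [1, 2, 4, 8, 16]

# "per malloc(ns)" figures copied from the RT-AC56U tables above
UCLIBC_FW = [534.4, 3156.3, 6924.8, 14555.6, 32240.9]
GLIBC_ENTWARE = [284.2, 596.0, 1203.6, 2568.0, 5418.9]
GLIBC_TCACHE = [158.9, 315.4, 646.9, 1415.4, 3131.5]


def speedups(baseline, candidate):
    """Ratio baseline/candidate per row; above 1.0 means candidate is faster."""
    return [b / c for b, c in zip(baseline, candidate)]


if __name__ == "__main__":
    rows = zip(THREADS,
               speedups(UCLIBC_FW, GLIBC_ENTWARE),
               speedups(GLIBC_ENTWARE, GLIBC_TCACHE))
    for n, glibc_gain, tcache_gain in rows:
        print(f"{n:>2} threads: Glibc vs uClibc {glibc_gain:.2f}x, "
              f"per-thread cache vs Glibc {tcache_gain:.2f}x")
```

On these preliminary numbers, the per-thread cache build takes roughly 1.7 to 1.9 times less time per malloc than stock Entware Glibc 2.23 at every thread count.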
Hardware: 1.8GHz Cortex-A53 ASUS RT-AC86U
Stock firmware (Glibc 2.22)               Stock Entware (Glibc 2.27)
# th  per malloc(ns)  max rss(kB)         # th  per malloc(ns)  max rss(kB)
   1           201.0         2108            1           107.6         1976
   2           414.7         2120            2           221.9         1984
   4           826.5         2124            4           444.4         1988
   8          1687.5         2124            8           894.2         3084
  16          3460.5         2584           16          1805.2         4816
Tests performed by SNBforum member Asad Ali. Entware uses a newer Glibc version for armv8 devices, and I believe the per-thread cache feature is already in its Glibc. Hence, it outperforms the stock firmware, whose Glibc 2.22 doesn't have this feature.
Get bench-malloc package
Download & Extract
cd /opt/local
wget -qO- https://gitlab.com/kvic/Entware-Goodies/raw/master/bench-malloc.tgz | tar xzf -
README
bench-malloc-thread.Glibc-Entware
bench-malloc-thread.Glibc-Entware-aarch64
bench-malloc-thread.Glibc-FW-aarch64
bench-malloc-thread.Glibc-FW-armv8
bench-malloc-thread.uClibc-FW
bench-malloc-thread.uClibc-NPTL
runbench.sh
cd /opt/local/bench-malloc
./runbench.sh
Without arguments, the script prints its usage and quits.
Last update: Aug 5, 2018