Performance versus productivity: Getting high performance by using compilers, simple code rewrites and professional libraries. Evaluation using dense matrix transpose.

R. F. Rweye
11 min read · May 17, 2021

CMU 18–647: Computational Problem Solving final blog post.

0. Introduction / Big picture

A programmer’s time is finite. In a world where everything moves fast, the time to develop and deploy solutions is among the crucial benchmarks for measuring the productivity of engineers and enterprises. At the same time, the performance of the deployed solution is just as important. While these aspects are not in absolute opposition, when faced with a choice between the two, enterprises and programmers often choose productivity over performance [1], [2].

Performance in software, generally defined as “the response time or throughput as seen by the users” [3], is undoubtedly an important quality [4]. However, it requires special attention and the right choices, and it is also very hard to achieve. One outstanding question an engineer may ask is: is there a way to achieve both productivity and performance? Put differently, can one achieve performance productively?

In this article, we attempt to answer this question using a well-defined problem: transposing a square dense matrix. In subsequent sections, we define the problem and explain why it is an interesting example, take a naive implementation as a baseline, and explore different “relatively” easy ways to achieve better performance. At the end, we provide pointers to more advanced approaches; these may require more time and learning to make them work, which may bite into the programmer’s short-term productivity.

1. Matrix transpose problem

Matrix transpose is the exchange of the rows and columns of a matrix. It is one of the most useful and simplest concepts in linear algebra, a branch of mathematics that is particularly important to engineers [5]. For these reasons, and because linear algebra is the workhorse behind current machine learning approaches, which process a lot of data, there is a clear need for a high-performance transposition. Due to the importance of linear algebra, matrix transposition is well studied; there exists a widely accepted standard, BLAS [6], and many high-performance industry libraries to benchmark against.

In this article, we use square matrices for simplicity.

Given an N×N matrix A, its transpose Aᵀ is the N×N matrix obtained by exchanging rows and columns as follows:

(Aᵀ)ⱼ,ᵢ = Aᵢ,ⱼ for i, j = 1, …, N

This is easy to reason about and to implement intuitively.

For small dimensions (small N), the intuitive implementation is both productive and performant. However, as the matrix dimensions grow to the point where the matrix no longer fits in cache memory, performance tuning can bring considerable gains.

We will measure performance in terms of bandwidth, the rate at which data can be read from or stored to memory by the processor, measured in GB/s.
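To make the metric concrete: for an N×N transpose of double-precision elements, each element is read once and written once, so one plausible way to compute the bandwidth (our assumption about the exact definition used) is bandwidth = (2 × N² × 8 bytes) / elapsed time.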

2. Hardware and software

We have chosen to use the C language. C is mature, and the control it gives the programmer makes it possible to achieve higher performance. That control, however, can be difficult to master, so C demonstrates the trade-off well.

We experimented with different machines; in the end, we kept results from a local machine with an Intel Core™ i5–8250U CPU and from the Regular Memory nodes of PSC Bridges-2 [7].

Figure1: Regular Memory (RM) node from Bridges 2 [7].

When trying bigger matrices, our local machine stops once we have allocated all of its memory, in this case 16 GB. The memory available on an RM node of Bridges-2 lets us reach even bigger matrix dimensions; the only constraint is patience.

Compilers are key to performance tuning. We experimented with Clang [8] and with Intel’s oneAPI DPC++ Compiler and C++ Compiler Classic [9]. We vary the compiler optimization options to get a baseline and to see how much the compilers can improve on that baseline when asked for more optimization.

We also use Intel’s Math Kernel Library (MKL) [10] as an example of the professional software libraries available. Software library reuse [11] has proven key to achieving software reliability, performance, and scalability. Using such libraries, programmers can reduce development time and avoid reinventing the wheel each time. In addition, when they follow a sound decision metric for their choice [12], they know exactly what their trade-offs are. The downside of using libraries is the time it takes to learn them well enough to be productive.

3. Matrix transpose

Now that we have different components in place, we can play with matrix transposition.

3.1 Basic implementation and relying on compilers

In the basic implementation of dense square matrix transpose, we take pointers to the source and destination matrices. We traverse the source matrix row by row and write each element to the destination matrix at the address with the row and column indices exchanged.

Figure 2 : Naive dense matrix transpose

In this implementation, other than storing the two-dimensional matrix in a single contiguous memory block and accessing its elements linearly, there are no further enhancements. It is thus very simple to implement.
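Since Figure 2 is an image in the original post, here is a minimal sketch of what such a naive transpose looks like (the function name and row-major storage are our assumptions):

```c
#include <stddef.h>

/* Naive transpose: dst[j][i] = src[i][j]. Both matrices are n x n,
   stored row-major in single contiguous blocks of doubles. */
void transpose_naive(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}
```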

This implementation is useful for establishing baseline performance: first, to see how a compiler compiles it without any optimizations; and second, to see whether we can gain performance simply by choosing a different compiler and asking it for some optimizations.

Figure 3: Makefile extract for compiling program using the naive implementation.

To get the true baseline of our implementation, we first compile the program with the Clang compiler with minimal optimization, using the -O1 flag.

We then get the compiler to do more work for us. We try Clang and dpcpp (on the local machine) and icc (on Bridges-2) with maximum optimizations. With -O3, we ask the compiler to optimize our loops, as this is loop-intensive code.
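Figure 3 is an image in the original post; a minimal Makefile along those lines might look like the following (the target name and exact flag sets are our assumptions):

```make
CC     = clang            # swap in dpcpp or icc to compare compilers
CFLAGS = -O1              # baseline: minimal optimization
# CFLAGS = -O3            # maximum optimization, for comparison

transpose: transpose.c
	$(CC) $(CFLAGS) -o $@ $<
```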

Figure 4: Performance evaluation of intuitive implementation of transpose without and with compiler optimizations.

The figure above shows what happens when we compile and run our intuitive implementation, both on the local machine and on Bridges-2. This is not a memory-optimal implementation: beyond the 32768 × 32768 transpose, the machine either runs out of memory or the process gets killed by the operating system.

The results we observe are rather shocking. The code seems to run faster without compiler optimizations than with them, except at the smaller matrix dimensions on the Bridges-2 machine. It is interesting that the non-optimized code performs much better across both machines.

This bad news is good news. First, it shows that code does not automatically get a performance improvement from the compiler, especially if the program was not written with that in mind.

Second, when the compiler optimizes the code, it optimizes the entire code, not only the parts we want optimized. Performance gains in one block may be counterbalanced by poor performance caused by optimization of other blocks. Luckily, there are #pragma directives to tell the compiler where we want our code optimized.

Figure 5: Control which code block should be optimized and which should not [13].
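As a concrete illustration of Figure 5, here is a sketch using the Intel-specific #pragma optimize [13]; Clang offers its own equivalent (#pragma clang optimize off/on), and the function names here are hypothetical:

```c
#include <stddef.h>

/* Functions defined after this directive are compiled without
   optimization (Intel-specific, MSVC-compatible #pragma) [13]. */
#pragma optimize("", off)
void log_progress(size_t rows_done)
{
    /* bookkeeping code where optimization does not matter */
}
#pragma optimize("", on)  /* restore optimization from here on */

/* This hot loop is compiled with the full optimization settings. */
void transpose_hot(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}
```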

Third, one should always measure the effect of any optimization. Some changes that normally boost performance in one piece of code may not help in another, or worse, may have an adverse effect.

3.2 Code tweaks

Having found that the compiler alone will not make our code run faster, we can go back to the source code, see where we pay performance penalties, and change it to make the code faster. For our dense transpose problem, we can easily employ the following techniques: loop unrolling, loop blocking, and some parallelization.

3.2.1 Loop unrolling

Could it be that our code loses performance by redoing loop bookkeeping (index updates and branch checks) on every iteration over the matrix? Rewriting the loops to limit this overhead could then improve performance, and loop unrolling does exactly that. Instead of rewriting the loops by hand, we once again make use of the compiler, adding a #pragma instruction that tells it to unroll the loop by a specified unroll factor, as sketched below.

Figure 6: Signatures for various loop unrolling factors
Figure 7: Transpose with loop unrolling with #pragma.
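Since Figures 6 and 7 are images, here is a minimal sketch of the pragma-based unrolling (the factor of 8 and the function name are our choices; #pragma unroll(n) is supported by both Clang and icc):

```c
#include <stddef.h>

void transpose_unroll(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* ask the compiler to unroll the inner loop by a factor of 8 */
#pragma unroll(8)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
    }
}
```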

Here we still try to remain productive: we get the compiler to do the extra work for us and gain performance without rewriting the whole implementation.

Figure 8: Evaluating performance with different unroll factors, including without unrolling and with complete loop unrolling.

Evaluating the results, there is a slight performance increase at some matrix dimensions. For instance, with an unroll factor of 100, the improvement over no unrolling ranges from -2.3% to 12.8%. We can also observe that complete unrolling provides a consistent slight improvement most of the time, except at dimension 32768 × 32768.

3.2.2 Loop tiling

Loop tiling, or blocking, is another code rewrite one can apply to improve the performance of dense matrix transposition. With this approach, we restructure the iteration space into small blocks so that each block’s working set fits in the memory caches.

Figure 9: Blocking implementation with 4 nested loops.

In the implementation above, we have added extra loops to enable loop blocking. It is possible to use more nested loops and multiple levels of blocking; here we keep everything simple.
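Since Figure 9 is an image, here is a minimal sketch of such a four-loop blocked transpose (the function name and the bounds handling for sizes not divisible by the block are our assumptions):

```c
#include <stddef.h>

#define BLOCK 256  /* tile size; 256 performed best in our runs */

void transpose_blocked(double *dst, const double *src, size_t n)
{
    /* outer two loops walk over tiles, inner two transpose one tile */
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t imax = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t jmax = (jj + BLOCK < n) ? jj + BLOCK : n;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    dst[j * n + i] = src[i * n + j];
        }
}
```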

Figure 10: Blocking implementation, performance evaluation.

As usual, it is necessary to measure which block size gives the best performance, since it may vary with the processor and its caches. In our implementation, a block size of 256 gives the best performance, especially as the matrix grows. This is promising for larger matrices.

3.2.3 Parallelization

Since most current processors have several cores, we may want to take advantage of them with a parallel implementation. Again, because we want productivity as well as performance, we simply use #pragma directives to parallelize with OpenMP. To extract more performance, it may be necessary to rewrite the code with explicit threads.

Figure 11: Implementation using parallelization with omp and blocking.

In this parallel implementation, we make small changes to the blocking code so that the block loops are executed in parallel.
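As Figure 11 is an image, here is a sketch of the blocked transpose with OpenMP (the collapse(2) clause distributing both block loops across threads is our choice; compile with -fopenmp for Clang or -qopenmp for icc):

```c
#include <stddef.h>

#define BLOCK 256

void transpose_omp(double *dst, const double *src, size_t n)
{
    /* distribute the tile loops across the available cores */
#pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t imax = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t jmax = (jj + BLOCK < n) ? jj + BLOCK : n;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    dst[j * n + i] = src[i * n + j];
        }
}
```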

3.2.4 Code rewrite performance

Figure 12: Performance evaluation of various code rewrite attempts.

Now that we have tried different things, how do they compare to each other? Figure 12 shows that the blocking implementation gives more consistently high performance than parallelization, simple compiler optimization, and loop unrolling.

3.3 Using a professional library

Often we want to use industry-standard libraries, or we lack the time and domain knowledge to improve our implementation’s performance ourselves. In those cases, and in many others, professional libraries can improve code performance dramatically. In our case we use the Math Kernel Library [10]. What libraries cost in learning time, they make up for in performance and reliability.

Figure 13: Using MKL to apply transpose.
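Since Figure 13 is an image, here is one way to perform an out-of-place transpose with MKL’s mkl_domatcopy (our sketch; it assumes double precision and row-major storage):

```c
#include <stddef.h>
#include <mkl.h>  /* link against MKL, e.g. via icc -qmkl */

void transpose_mkl(double *dst, const double *src, size_t n)
{
    /* 'R' = row-major, 'T' = transpose, alpha = 1.0 (no scaling);
       the trailing arguments are the leading dimensions of src and dst */
    mkl_domatcopy('R', 'T', n, n, 1.0, src, n, dst, n);
}
```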
Figure 14: Performance of MKL library and using best implementation as baseline.

Taking the best performer from the code rewrites, blocking with a block size of 256, as the benchmark, how does the MKL implementation compare? We see that blocking performs quite well against MKL. However, MKL provides more consistent performance, especially on the Bridges-2 machine.

4. Need for more speed

When there is a need for more performance, one can try more exotic possibilities. Although we will not explore Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) here, depending on the problem at hand they may boost performance dramatically. They may, however, need even more time to learn and program before one sees any performance improvement.

5. Conclusion

We have seen that a programmer’s time is finite. The current technological landscape requires that enterprises and programmers be productive in deploying performant solutions. Using the matrix transpose example, we have seen how to go about achieving higher performance: first by relying on the compiler, then by rewriting parts of our implementation, and ultimately by using a professional library. There are further routes to even more performance. However, we should always weigh them against productivity: how much should we change the code to make it perform marginally better? That can only be determined by the problem at hand and by experience.

6. Acknowledgements

I feel I have barely scratched the surface of what can be achieved in performance engineering. Over the course of this semester, however, 18–647: Computational Problem Solving has taught me far more than I imagined. I am grateful for the help and tireless feedback from Prof. Franchetti and Sanil Rao. I have also learned a lot from my colleagues. Thank you.

References

[1] “Stack Overflow Developer Survey 2020,” Stack Overflow. https://insights.stackoverflow.com/survey/2020/?utm_source=social-share&utm_medium=social&utm_campaign=dev-survey-2020 (accessed May 01, 2021).

[2] M. Lindstrom, “The Truth About Being ‘Done’ Versus Being ‘Perfect,’” Fast Company, Sep. 25, 2012. https://www.fastcompany.com/3001533/truth-about-being-done-versus-being-perfect (accessed May 01, 2021).

[3] C. U. Smith, “Introduction to software performance engineering: Origins and outstanding problems,” in International School on Formal Methods for the Design of Computer, Communication and Software Systems, 2007, pp. 395–428.

[4] M. Woodside, G. Franks, and D. C. Petriu, “The Future of Software Performance Engineering,” in Future of Software Engineering (FOSE ’07), May 2007, pp. 171–187. doi: 10.1109/FOSE.2007.32.

[5] E. Kreyszig, “Advanced Engineering Mathematics 10th Edition,” 2009.

[6] The BLAS Technical Forum Standard, “BLAS (Basic Linear Algebra Subprograms).” https://www.netlib.org/blas/ (accessed May 01, 2021).

[7] Pittsburgh Supercomputing Center, “Bridges-2 | PSC,” Introducing Bridges-2. https://www.psc.edu/resources/bridges-2/ (accessed May 01, 2021).

[8] The Clang project, “Clang C Language Family Frontend for LLVM.” https://clang.llvm.org/ (accessed May 02, 2021).

[9] Intel, “Data Parallel C++ for Cross-Architecture Applications,” Intel. https://www.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-compiler.html (accessed May 16, 2021).

[10] “Accelerate Fast Math with Intel® oneAPI Math Kernel Library,” Intel. https://www.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html (accessed May 16, 2021).

[11] I. J. Mojica, B. Adams, M. Nagappan, S. Dienst, T. Berger, and A. E. Hassan, “A large-scale empirical study on software reuse in mobile apps,” IEEE Software, vol. 31, no. 2, pp. 78–86, 2013.

[12] F. López de la Mora and S. Nadi, “Which Library Should I Use?: A Metric-Based Comparison of Software Libraries,” in 2018 IEEE/ACM 40th International Conference on Software Engineering: New Ideas and Emerging Technologies Results (ICSE-NIER), May 2018, pp. 37–40.

[13] Intel, “optimize,” Intel. https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compiler-reference/pragmas/intel-specific-pragma-reference/optimize.html (accessed May 16, 2021).
