Choosing the right compiler options can drastically reduce the energy consumption of a program. Finding the right way to schedule those options can give equally significant gains. Modern compilers use pass managers to schedule the order in which optimisations are applied, but these aim to reduce some combination of compile time and execution time rather than energy efficiency. I am interested in the task of learning how to predict which sequences of compiler optimisations will minimise energy consumption of a given executable on a given machine. Before this problem can be tackled it is first necessary to obtain data on how performance metrics (such as execution time and energy consumption) are affected by the ordering of optimisations for various benchmarks on various architectures. Part of the Mageec project involves the collection of this data and its storage in a relational database. In this article, I examine a number of relational representations that could be used for this purpose and explain the rationale behind the most promising approach.
Introduction
The information in this article is taken from a presentation I gave to the Mageec group on 09/10/2013 entitled “A note on the Relational Representation of Benchmark Data acquired under the Sequential Scheduling of Compiler Passes”. The aim of this work is simply to find a good way of representing compiler pass sequences in a relational form that can be used by the Mageec project. The proposed solutions are based on methods from the field of relational data mining, which has developed relational techniques for encoding sequences (as well as more general structures such as trees and graphs, which are beyond the scope of the current application). Six different database designs are examined in turn, and it is argued why each successive one can be considered an improvement over its predecessor. This leads us to an elegant solution that satisfies good database design principles and can be efficiently queried with respect to the sorts of task we are interested in.
Database 1
prog |
seq |
time |
energy |
a |
“ABC” |
u |
f |
a |
“CBB” |
v |
g |
a |
… |
… |
… |
b |
… |
… |
… |
… |
… |
… |
… |
For the purposes of illustration assume we wish to store the execution time and energy consumption of several benchmarks when compiled under different sequences of compiler optimisations. Suppose we have a number of benchmark programs a,b,c… and a number of candidate optimisation passes A,B,C,… Then the above table shows that the result of compiling program a with passes A, B, C (in that order) gives time u and energy f. But using a different pass sequence B, C, C gives (potentially different) results v and g. Note how a given pass can appear once, many times, or no times in a sequence; and, in general, sequences can be of any finite length. Note also that underlined column headers denote primary key attributes.
The above table essentially uses a string to store the sequence of passes used in each compilation. While this is very a simple way to store the information in relational form, it does have several disadvantages. First any restriction on the maximum length of strings could be problematic in practice (as many passes may have to be scheduled). Second most interesting queries over this data representation will have to rely on string processing functions – which is not usually the strong point of query engines (or machine learning algorithms). It would be better to make the pass sequences more explicit in the structure of the database.
Database 2
prog |
1 |
2 |
3 |
… |
N |
time |
energy |
a |
A |
B |
C |
… |
null |
u |
f |
a |
C |
B |
B |
… |
null |
v |
g |
a |
… |
… |
… |
… |
… |
… |
… |
b |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
A better solution uses a separate column to store each individual pass in a sequence of successive positions 1,2,3,…,N as shown above. But this requires an arbitrary bound N to be placed on the length of any sequence – which is likely to be inconvenient in practice. This approach is also likely to lead to a sparse table with many null entries. Moreover these null’s will be in primary key attributes – which is not good practice or even allowed in many database systems.
Database 3
prog |
seq |
time |
energy |
a |
#1 |
u |
f |
a |
#2 |
v |
g |
a |
… |
… |
… |
b |
… |
… |
… |
… |
… |
… |
… |
|
seq |
1 |
2 |
3 |
… |
N |
#1 |
A |
B |
C |
… |
null |
#2 |
C |
B |
B |
… |
null |
… |
… |
… |
… |
… |
… |
|
The answer is to split the data into two separate tables by using a unique identifier #1,#2,… to represent each sequence of passes as shown above. This structure ensures that primary key attributes are non-null but the second table is still likely to be sparse (as most sequences will not contain the maximum number of passes) and the arbitrary bound N is still likely to be inconvenient in practice.
Database 4
prog |
seq |
time |
energy |
a |
#1 |
u |
f |
a |
#2 |
v |
g |
a |
… |
… |
… |
b |
… |
… |
… |
… |
… |
… |
… |
|
seq |
A |
B |
C |
… |
X |
#1 |
{1} |
{2} |
{3} |
… |
null |
#2 |
null |
{2,3} |
{1} |
… |
null |
… |
… |
… |
… |
… |
… |
|
Another possibility is to use each non-key column in the second table to store the position(s) in a sequence occupied by each of the possible passes A,B,C,…,X as shown above. But the fact some passes may not be used means the table will still be sparse; and the fact that some passes may be used more than once means individual entries are non-atomic. Although non-atomicity is not allowed in many database systems, the above representation can be achieved in MySQL, for example, using the group_concat functionality. But we would prefer to rely on standard SQL functionality.
Database 5
prog |
seq |
time |
energy |
a |
#1 |
u |
f |
a |
#2 |
v |
g |
a |
… |
… |
… |
b |
… |
… |
… |
… |
… |
… |
… |
|
seq |
next |
pass |
#1 |
#1′ |
A |
#1′ |
#1” |
B |
#1” |
null |
C |
#2 |
#2′ |
C |
#2′ |
#2” |
B |
#2” |
null |
B |
… |
… |
… |
|
An alternative approach is to use the second table to explicitly store the implicit sequence successor relation as a linked list by introducing additional identifiers to effectively denote list suffixes as shown above. This structure is much less sparse (with just one null in the second table for each sequence used in the first table). But reconstructing a given sequence effective relies on computing transitive closures – which is not currently supported by some database systems such as MySQL.
Database 6
prog |
seq |
time |
energy |
a |
#1 |
u |
f |
a |
#2 |
v |
g |
a |
… |
… |
… |
b |
… |
… |
… |
… |
… |
… |
… |
|
seq |
pos |
pass |
#1 |
1 |
A |
#1 |
2 |
B |
#1 |
3 |
C |
#2 |
1 |
C |
#2 |
2 |
B |
#2 |
3 |
B |
… |
… |
… |
|
A more elegant solution is to explicitly index each position in the sequence in the second table as shown above. This structure results in no null values, non non-atomic values, no need to compute transitive closures, and can easily be queried. For added efficiency it is convenient to generate sequence identifiers #1,#2,… by applying a hash function to the sequence they represent (so that #1=hash(“ABC“) for example). This makes it slightly easier to determine if a given sequence is already in the database. (So, for example, if program b is compiled with passes ABC then it is easy to check that a corresponding identifier #1 is already in the database using the hash function. Note that without the hash function we would have to query for an identifier X such that (X,1,A), (X,2,B), (X,3,B) are all in the second table but (X,4,Y) is not for any Y).
Conclusion
From the above discussion, Database 6 seems to be the best database structure in both theory and practice. Moreover, this approach can be utilised to handle many extensions of the data that we envisage. For example we can easily include parameters for each pass by adding another column or using another table to store parameter sequences in the same way we stored pass sequences. We can also store information about the target architecture and associated compiler, as well as other performance metrics such as compilation time or average and peak power consumption. Because optimisation passes are compiler dependent, it may be sensible to use a different database to store data relating to each compiler. But if necessary the information could potentially be stored in a single database as follows:
machine |
compiler |
prog |
seq |
time |
energy |
max power |
avg power |
PC |
gcc |
a |
#1 |
u |
f |
p |
r |
PC |
gcc |
a |
#2 |
v |
g |
q |
s |
PC |
gcc |
a |
… |
… |
… |
… |
… |
PC |
gcc |
b |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
|
seq |
pos |
pass |
param |
#1 |
1 |
A |
… |
#1 |
2 |
B |
… |
#1 |
3 |
C |
… |
#2 |
1 |
C |
… |
#2 |
2 |
B |
… |
#2 |
3 |
B |
… |
… |
… |
… |
… |
|