Initial review of data interpretation/collection for Summer 2014
Hello, my name is Greg Cawthorne. I’m with the MAGEEC project from 18th June to around September 4th, and I will be working with the team at Embecosm along with my peers at Bristol University.
In this first blog post I shall discuss the work that has been done so far, mainly around Machine Learning and collecting energy readings, as well as where I shall be doing further research and the potential routes for improvement that I intend to explore.
Review of Machine Learning in Munaaf’s paper
Munaaf collected and benchmarked 84 programs running on a Pandaboard, which utilises an ARM Cortex-A9 processor. He used three STM32F4-Discovery boards, connected to the power rails VDD_CORE1 and VDD_CORE2 and to the power rail VDD_MEM, to monitor the energy consumption of the processor and the memory respectively.
Munaaf used 500 randomised combinations of optimisation flags across all the benchmarks. He analysed each flag and its corresponding energy readings and classified it as ‘good’, represented by a 1, or ‘bad’, represented by a 0. To do this he sorted the energy readings of all the flag combinations for each program, counted how many times each flag appeared among the lower energy readings and how many times among the higher ones, giving a ratio of good appearances to bad ones. Each flag was then classified as good or bad for a given program.
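To make the classification step concrete, here is a rough Python sketch of the procedure as I understand it; the data layout (a list of flag-combination/energy pairs per program) and the even split between ‘lower’ and ‘higher’ readings are my own assumptions, not details taken from the dissertation.

```python
def classify_flags(runs, all_flags):
    """runs: list of (flags_used, energy_reading) pairs for one program.
    Returns {flag: 1 or 0}, where 1 means 'good' and 0 means 'bad'."""
    ordered = sorted(runs, key=lambda r: r[1])                 # sort runs by energy reading
    lower, higher = ordered[:len(ordered) // 2], ordered[len(ordered) // 2:]
    classification = {}
    for flag in all_flags:
        good = sum(1 for flags, _ in lower if flag in flags)   # appearances in low-energy runs
        bad = sum(1 for flags, _ in higher if flag in flags)   # appearances in high-energy runs
        classification[flag] = 1 if good >= bad else 0         # ratio of good to bad appearances
    return classification
```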
The next step was to create feature vectors for each program, using the same features chosen in the MILEPOST research project. Once this was done, the flag combinations were split into two mutually exclusive sets of size 78, and each combination stored a feature vector for each of the 84 programs together with a ‘class’ variable specifying whether it was good or bad.
Next, eight different machine learning algorithms were run using the WEKA (Waikato Environment for Knowledge Analysis) libraries to try to predict whether the flag combinations in the second (unseen) set were good or bad. The classifications the machine learners produced were then compared with how the initial run had classified them, recording statistics such as how closely they matched the initial analysis, the number of false negatives and the execution time of each ML algorithm.
It was concluded that the favourite algorithms were:
- J48 (Fastest, 70% accuracy, high N° of FN)
- 1-NN (Fast, 65% accuracy, low N° of FN)
- LADTree (Slow, 65% accuracy, lowest N° of FN)
- KStar (Very slow, average accuracy, low N° of FN)
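Munaaf’s experiments used WEKA, which is a Java library; purely to illustrate the shape of the experiment, the sketch below uses scikit-learn in Python as a stand-in, with hypothetical training and test data for the two sets of flag combinations.

```python
# Illustrative stand-in for the WEKA experiment: train on one set of flag
# combinations, predict on the unseen set, and record accuracy plus false negatives.
from sklearn.tree import DecisionTreeClassifier        # roughly comparable to J48
from sklearn.neighbors import KNeighborsClassifier     # 1-NN
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted).ravel()
    return accuracy_score(y_test, predicted), fn       # accuracy and number of false negatives

# e.g. evaluate(DecisionTreeClassifier(), ...) versus
#      evaluate(KNeighborsClassifier(n_neighbors=1), ...)
```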
Future plans & ideas
Gathering more programs
Munaaf’s dissertation noted the lack of programs to benchmark. BEEBS (the Bristol/Embecosm Embedded Benchmark Suite) aims to solve this, and more work is to be done on it.
Additional thought: automatic code generation could potentially be explored and may lead to an effectively unlimited number of programs. A program could be generated to fit a desired feature vector (after the features are verified to be useful with ILP, another potential extra). N.B. we must ensure the programs are produced with random variation to avoid over-fitting.
Improving data collection
Munaaf first selected 500 random combinations of flags. I propose a potentially more effective approach: first create feature vectors for all the benchmark programs, then apply, for example, each flag on its own, compute the feature vector of the program’s representation after optimisation, and benchmark as normal to get an energy reading. With this data we can ‘guess’ potentially good flag combinations during the data collection stage, which may improve the ML stages.
The advantage of having the starting feature vector as well as the feature vector of the optimised representation is as follows. Imagine we have a program p1 with feature vector v1, which is optimised with flag combination f1 to get p2 with corresponding feature vector v2, and which is then compiled and benchmarked to give an energy reading e1. Say e1 represents high energy consumption (poor optimisation). Now suppose that in the history of previously compiled programs there is a second program p3 with feature vector v3, which was optimised with flag combination f2 to produce p4 and v4, and benchmarked to give e2, and that e2 turns out to represent low energy consumption (good optimisation). If we notice that v3 is similar to v2, then (assuming the feature vector is truly representative) we can infer that p2 is similar to p3, so it would be sensible to next try optimising program p1 with flags f1 followed by f2, in the hope of obtaining a new energy reading similar to e2. I hope to test this theory on hardware as soon as possible.
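To illustrate the idea, here is a rough Python sketch of the ‘guessing’ step; the history format, the distance measure and the threshold are placeholders of my own, not part of MAGEEC.

```python
import math

def distance(v_a, v_b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_a, v_b)))

def suggest_next_flags(v2, history, threshold=1.0):
    """Given v2 (the feature vector of p1 after applying f1), look for a past
    compilation whose starting vector v3 is close to v2 and whose energy
    reading e2 was low, and suggest the flags f2 that compilation used."""
    best = None
    for entry in history:  # entry: {'v_before': v3, 'flags': f2, 'energy': e2}
        if distance(v2, entry['v_before']) <= threshold:
            if best is None or entry['energy'] < best['energy']:
                best = entry
    return best['flags'] if best else None

# Hypothetical usage: one previous 'good' compilation in the history.
history = [{'v_before': [3.0, 1.0, 7.0], 'flags': ['-funroll-loops'], 'energy': 0.8}]
print(suggest_next_flags([3.1, 1.2, 6.9], history))
```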
Improving Machine Learning techniques
Currently all our machine learning works on a flag-by-flag basis and unfortunately doesn’t take the order of flags into account (which can make a huge difference, since flags are like functions operating on ‘feature vector’ space and some will be non-commutative). I plan to come up with another system in which the order of the flags is taken into consideration, and potentially use lazy machine learning algorithms to predict good flags, which are then actually tested on hardware to get energy readings. This way the machine learner’s database can be updated in a real-time feedback loop. I have already begun to experiment with the WEKA libraries.
Alternate flag picking system 1: scoring all flags
This approach considers all potential flags after each optimisation has been performed, looking at the most recent feature vector, v1, of the program. It then uses ML to decide which flag is most likely to help, based on how frequently that flag appeared in the training data for programs with feature vectors similar to v1 that executed with low energy readings. This is a greedy approach and could get stuck in local maxima, or loop recommending the same flag (implying another optimisation is required before that flag takes effect, since it has no effect on the program in its current form).
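A rough Python sketch of the scoring step follows, under assumed data structures: each training record holds a feature vector, the flag applied and whether the resulting reading was good. The similarity measure and cut-off are illustrative only.

```python
import math

def similarity(v_a, v_b):
    """Higher is more similar (negative Euclidean distance)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(v_a, v_b)))

def score_flags(v1, training, k=10):
    """Score every candidate flag by how often it appears among the k training
    records most similar to v1 that had low (good) energy readings."""
    nearest = sorted(training, key=lambda r: similarity(v1, r['vector']), reverse=True)[:k]
    scores = {}
    for record in nearest:  # record: {'vector': [...], 'flag': '-f...', 'good': True/False}
        if record['good']:
            scores[record['flag']] = scores.get(record['flag'], 0) + 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Greedy step: pick the top-scoring flag, apply it, recompute v1, repeat.
```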
Alternate flag picking system 2: using displacement vectors
This method looks at the feature vector of the original source code and the feature vector of the most recently optimised representation and creates a displacement vector (potentially characterising all the flags applied so far). We then scan through our database of compilations (with their corresponding intermediate feature vectors) that were considered ‘good’ and see if any flag, or combination of flags, produced a similar displacement vector, then apply the next flag they did (and perhaps the rest?). Note that although the displacement vector we build from our working solution covers every flag applied so far, the displacement vector we find could represent any sub-combination of the compilation in the database.
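A rough Python sketch of the matching step; the database format and the distance threshold are assumptions for illustration.

```python
import math

def displacement(v_start, v_current):
    """Element-wise difference between the original and current feature vectors."""
    return [c - s for s, c in zip(v_start, v_current)]

def match_displacement(d_ours, database, threshold=1.0):
    """Find 'good' compilations whose displacement vectors are close to ours
    and return the flags each of them applied next."""
    suggestions = []
    for entry in database:  # entry: {'displacement': [...], 'next_flags': [...]}
        gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(d_ours, entry['displacement'])))
        if gap <= threshold:
            suggestions.append(entry['next_flags'])
    return suggestions
```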
Alternate flag picking system 3: use compilation (large overhead)
Currently our ML systems are set up to evaluate individual flags. Perhaps altering the ML to instead evaluate feature vectors, i.e. how likely the program they represent is to have a good energy reading, could lead to a better solution. The new ML system would be given a feature vector from GCC and compare it against the entire database of feature vectors that were benchmarked and have corresponding energy readings. It would then find the closest feature vector and return a score based on the energy reading associated with that benchmark.
It is important to note that the input feature vector from GCC will be compared with the feature vectors of programs post-optimisation (similar to the method in the data collection section above). If they were pre-optimisation, the ML would also have to provide the flags for the optimisation, as the energy reading’s relationship with the original input code is not relevant.
The GCC plugin can be adapted so that when a new program is introduced it starts with no flags. It then creates the feature vector, v1, for the input unit of code. It then cycles through every flag, compiling the code with just one flag at a time, and creates new optimised feature vectors from the intermediate code. Next it evaluates all the generated feature vectors, finds the one with the lowest predicted energy reading (a greedy algorithm) within a given tolerance, decides which is best for the given architecture and commits to that optimisation. It may pick v1, in which case no flag is added at all. This can be repeated to add more flags, using the last optimised feature vector as the new starting feature vector. This system also takes the order of optimisations into consideration.
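Here is a rough Python sketch of that greedy loop; compile_with, feature_vector and predict_energy are hypothetical placeholders standing in for the GCC plugin, the feature extraction and the ML scorer described above, not real MAGEEC interfaces.

```python
def greedy_optimise(source, candidate_flags, compile_with, feature_vector, predict_energy):
    """Repeatedly add the single flag that most reduces the predicted energy."""
    chosen = []  # flags committed so far, in order
    best_energy = predict_energy(feature_vector(compile_with(source, chosen)))
    improved = True
    while improved:
        improved = False
        for flag in candidate_flags:
            vec = feature_vector(compile_with(source, chosen + [flag]))
            energy = predict_energy(vec)
            if energy < best_energy:   # track the best single addition this round
                best_energy, best_flag, improved = energy, flag, True
        if improved:
            chosen.append(best_flag)   # commit to it and start another round
    return chosen
```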
N.B. this will have a longer execution time (number of flags² × number of code units = number of compilations) and may be too much of an overhead. Extra methodology would need to be implemented to handle flags that depend on other flags in order to compile. Also, as it is greedy, it has the risk of only finding local maxima.
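As a purely illustrative back-of-envelope figure (the flag count is a guess on my part; the 84 is simply the number of benchmark programs from the review above):

```python
# Hypothetical numbers only: ~200 candidate flags, 84 programs treated as the code units.
num_flags = 200
num_code_units = 84
print(num_flags ** 2 * num_code_units)  # -> 3360000 compilations
```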
Final observations
A mixture of the proposed improvements could be used, as well as perhaps some randomisation in order to help each of the algorithms overcome things like getting stuck in local maxima and to offset the overhead. Next I will be putting together hardware and monitoring equipment to gather data and test out my ideas further.