Greening software

The cache performance of a program can be improved by reorganizing and re-laying-out its data and data structures. Previous techniques arranged distinct instances relative to one another to obtain greater reference locality and thereby improve cache performance; their limitation is that they work well only for small structures.

This text focuses on the internal arrangement of fields within a structure, examining two techniques: structure splitting and field reordering. These techniques are expected to improve the cache performance of programs whose structures are larger and cannot fit in a single cache block. Splitting works for structures comparable in size to a cache block: by separating out the rarely used fields, more hot fields can be packed into each cache block; for example, splitting reduced miss rates in Java programs by up to 27 percent. For structures that cannot fit in a cache block, fields that have high affinity for each other are reordered so that they land in the same block; the bbcache tool is used to recommend the reordering, which has been shown to improve performance by 2-3 percent.

The growing processor-memory performance gap can be addressed effectively by placing data structures so as to increase a program's reference locality and improve its cache performance. A cache-conscious data layout places objects that are temporally related into the same cache block, or into blocks that do not conflict with each other. Caches transfer data in units of cache blocks and have finite associativity, which limits where a given block can be placed in the cache; packing related items into the same cache block uses the space more efficiently and reduces the program's cache block footprint.
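For concreteness, here is a tiny C sketch of how addresses map to cache sets; the cache geometry is an assumption made for illustration, and blocks that map to the same set compete for its limited associativity and can conflict:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed cache geometry for illustration: 32 KB, 64-byte blocks,
     * 4-way set associative (real parameters vary by processor). */
    #define BLOCK_SIZE 64
    #define NUM_SETS   (32 * 1024 / BLOCK_SIZE / 4)

    /* Two blocks conflict when they map to the same set: they then
     * compete for that set's few ways even if the rest of the cache
     * is empty. */
    static unsigned cache_set(uintptr_t addr) {
        return (unsigned)((addr / BLOCK_SIZE) % NUM_SETS);
    }

    int main(void) {
        long a[1024], b[1024];
        printf("a[0] -> set %u, b[0] -> set %u\n",
               cache_set((uintptr_t)&a[0]), cache_set((uintptr_t)&b[0]));
        return 0;
    }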

Cache misses are reduced by mapping structure elements that are accessed concurrently to cache blocks that do not conflict; structures smaller than a cache block are less likely to benefit from such manipulation. When a structure's size is comparable to a cache block, the structure can be split into hot and cold portions, which permits cache-conscious reorganization of those portions. Many Java objects fall into this category and therefore allow the technique. The fields are first profiled to determine how frequently they are accessed and classified as hot (accessed frequently) or cold (rarely accessed); the cold fields are moved into a new object and accessed indirectly through a reference, while access to the hot fields remains unchanged. Applied to Java structures, the cache-conscious techniques reduced miss rates by 29-43 percent, with class splitting accounting for 26-42 percent of that reduction; they improved performance by 18-28 percent, with class splitting contributing 22-66 percent of the improvement.
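As a rough C sketch of the idea (the structure, its fields, and the hot/cold assignment are invented for illustration; the paper's transformation is actually applied to Java classes by the compiler), a split structure keeps only the frequently accessed fields in place and reaches the cold fields through one extra reference:

    #include <stdlib.h>

    /* Cold portion: fields the profile marks as rarely touched. */
    struct node_cold {
        char debug_name[32];    /* diagnostic label, rarely read */
        long error_count;       /* only used on error paths      */
    };

    /* Hot portion: fields on the common path plus one reference to
     * the cold portion, so more hot instances fit per cache block. */
    struct node {
        int               key;   /* frequently compared        */
        struct node      *next;  /* traversed on every lookup  */
        struct node_cold *cold;  /* indirection to cold fields */
    };

    struct node *node_new(int key) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        n->key  = key;
        n->next = NULL;
        n->cold = calloc(1, sizeof *n->cold);  /* cold part allocated separately */
        if (n->cold == NULL) {
            free(n);
            return NULL;
        }
        return n;
    }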

Structure splitting

Chilimbi and Larus proposed using the garbage collector to lay objects out dynamically, placing objects with temporal affinity next to each other so that they end up in the same cache block. They demonstrated that objects in Cecil programs (Cecil being a language with similarities to Java) were typically smaller than half a cache block. This makes low-overhead, real-time data profiling possible, and they also devised a new copying algorithm that produces a cache-conscious layout. Experiments show that Java objects are also small, but approximately 8 bytes larger than Cecil objects, and when Chilimbi and Larus's cache-conscious co-location techniques are applied they give lower performance improvements of 10-20 percent, compared with 14-37 percent for Cecil objects. The difference is attributed to Java objects being larger, which reduces the number of contemporaneously accessed objects that can be packed into a cache block. One way to reduce the effective size of Java objects is to split them into hot and cold portions; splitting enables a larger number of concurrently accessed objects to be placed in a cache block. Structure splitting is a well-known optimization for Java objects that has previously been applied manually.

The class splitting process begins by analyzing the Java program in bytecode form and instrumenting it with BIT. The analysis yields class information including field names, types, and sizes; the instrumented Java program is then executed to collect a profile. An algorithm then selects the classes that should be split based on this static and dynamic data. The splitting decisions are communicated to the Vortex compiler, which compiles Java bytecode to native code; Vortex splits the selected classes and transforms the program to account for the changes. Applied to Java programs, class splitting leads to performance improvements of 18-28 percent.

Average Java Object Size

An experiment was carried out to test the hypothesis that most heap-allocated Java objects are, on average, smaller than a cache block; objects of 256 bytes or more are considered large and are managed differently. Small objects tend to die faster than large objects, yet the cache-conscious layout described in the Chilimbi-Larus paper only works for objects that live longer, so the experiment measures the number of small objects that are still live after each scavenge throughout the program's execution. The results confirm the hypothesis: a great number of Java objects are small, and the average live object is smaller than a cache block. Comparing Java and Cecil objects, Java objects are about 8 bytes bigger, which is attributed to Java's larger object headers; the extra size makes packing Java objects into a cache block less effective. To be effective, data access patterns need to be measured; typically a training run made earlier is used to guide program optimization. To obtain correct access patterns, real-time profiling of data is necessary, and it also saves the programmer an extra profile-execute cycle. The overhead of real-time profiling must be kept low so that it is not outweighed by its costs.

Class Information

BIT is an instrumentation tool used to gather static class information, including names, the number of fields, access types, and descriptors for the non-static fields. The non-static fields constitute the instance variables that live in the heap. BIT is also used to obtain access frequencies for the fields.

Hot/cold class splitting algorithm

The primary advantage of splitting is the ability to pack more hot instances into a cache block, while the disadvantages are the cost of the additional reference needed to reach the cold portion, the indirection on cold-field accesses, and the larger number of objects in memory. The difficulty of deciding how to split classes into hot and cold parts could be sidestepped by rerunning the program on the same input data, but the interest here is in splitting classes so that greater performance is achieved over a wide range of inputs. A check is performed to ensure that classes are not split without sufficient representative accesses: a threshold formula, At > LS / (100 * C), is used, where At is the number of accesses to a class's fields, LS is the overall number of field accesses, and C is the total number of classes with at least one field access; classes exceeding the threshold are considered live. A cache-conscious copying algorithm is used to move the data. It can be divided into three steps: first, the roles of the TO and FROM spaces are exchanged and the unprocessed and free pointers are initialized to the start of the TO space; then, on the affinity graph, the node with the highest affinity weight is picked and a greedy depth-first traversal of the graph is performed starting at that node.
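A minimal sketch of how the liveness threshold and a hot/cold classification might look in code (plain C; the field names, access counts, and the 5 percent hot/cold cutoff are all assumptions made for the example, and the real algorithm works on the BIT profile of Java classes):

    #include <stdio.h>

    #define NFIELDS 4

    /* Profile data for one class: per-field access counts
     * (the counts below are made-up illustrative numbers). */
    static const char *field_name[NFIELDS]  = { "key", "next", "debug_name", "error_count" };
    static const long  field_count[NFIELDS] = { 90000, 88000, 12, 3 };

    int main(void) {
        long LS = 1000000;  /* overall number of field accesses in the program  */
        long C  = 40;       /* number of classes with at least one field access */
        long At = 0;        /* accesses to this class's fields                  */

        for (int i = 0; i < NFIELDS; i++)
            At += field_count[i];

        /* Liveness check: only classes with enough representative
         * accesses (At > LS / (100 * C)) are considered for splitting. */
        if (At <= LS / (100 * C)) {
            printf("class is not live enough to split\n");
            return 0;
        }

        /* Classify each field as hot or cold relative to the class's own
         * accesses; here a field is cold if it accounts for less than 5%
         * of the class's accesses (an assumed cutoff, not the paper's). */
        for (int i = 0; i < NFIELDS; i++) {
            int hot = field_count[i] * 20 >= At;
            printf("%-12s %8ld  %s\n", field_name[i], field_count[i],
                   hot ? "hot" : "cold");
        }
        return 0;
    }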

Program Transformation

The Vortex compiler is modified to split the classes selected by the splitting algorithm and to carry out the corresponding program transformations. The frequently accessed fields are left unchanged, and the cold fields are placed in a new counterpart class. The compiler also modifies the code that uses these classes to account for the split: accesses to cold fields are replaced with indirect accesses through a reference to the cold counterpart that is added to the hot class.
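A small before/after sketch of that rewrite, shown as C rather than the Java bytecode Vortex actually transforms (the structures and field names are hypothetical):

    /* Unsplit layout: all fields in one structure. */
    struct original_node {
        int  key;
        long error_count;
    };

    /* Split layout: the hot structure holds a reference to the cold part. */
    struct node_cold { long error_count; };
    struct node      { int key; struct node_cold *cold; };

    /* Before splitting: the cold field is reached directly. */
    long errors_before(struct original_node *n) {
        return n->error_count;
    }

    /* After splitting: hot accesses are unchanged, while cold accesses go
     * through the reference the compiler added to the hot class. */
    int  key_after(struct node *n)    { return n->key; }
    long errors_after(struct node *n) { return n->cold->error_count; }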

 Field reordering

Large structures are usually laid out by logical grouping, and the resulting layout may not correspond to the program's access patterns; the poor interaction between the structure layout and the data access patterns then leads to unnecessary cache misses. To increase cache block utilization and reduce pressure on the cache, bbcache's recommendations are applied, putting fields with high temporal affinity for each other together in one block. Reordering a structure can interfere with program correctness in languages such as C, so bbcache's recommendations must be verified by a programmer before they are applied to a C program.
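A minimal illustration of the kind of reordering such a recommendation might suggest (the structure and its hot/cold annotations are invented; real recommendations come from bbcache's profile of the program being tuned):

    /* Declaration order follows logical grouping: the two fields touched
     * on every packet sit far apart and can land in different cache
     * blocks for a large structure. */
    struct conn_orig {
        char peer_name[48];     /* cold: diagnostics only        */
        int  state;             /* hot: checked on every packet  */
        char last_error[64];    /* cold: error reporting         */
        long bytes_seen;        /* hot: updated on every packet  */
    };

    /* Reordered so the temporally affine hot fields share the first
     * cache block; cold fields are pushed to the end. */
    struct conn_reordered {
        int  state;             /* hot  */
        long bytes_seen;        /* hot  */
        char peer_name[48];     /* cold */
        char last_error[64];    /* cold */
    };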

Bbcache

To use bbcache, a program is first instrumented to record a trace of the accesses it makes; the trace contains temporal information along with the frequency of access to each structure field. bbcache combines this dynamic data with a static analysis of the program. The algorithm that produces a field order has three steps: first, static and dynamic data about the structure's field accesses are combined into a single database; second, the database is used to construct a field affinity graph for each structure; third, the graphs are processed to produce the recommended field orders. bbcache also contains an evaluation facility that computes a cost metric, which represents the cache block working set, and a locality metric, which shows cache block utilization. These two metrics allow a proposed layout to be compared with the existing one: one indicates the active cache blocks while the program executes, and the other measures how well each cache block is used.

To build the database, an AST toolkit is used: it analyses the source program and records data on structure field accesses in a file that includes the source file, the field name, and the structure type and name. The database thus contains information on both fields and structures. bbcache constructs a field affinity graph for each structure instance, whose nodes represent fields and whose edges carry temporal information about fields accessed close together in time. The instance affinity graphs are then combined to produce one affinity graph per structure. The aim is to increase inherent locality by placing fields with high temporal affinity into the same cache block. The quality of a final configuration is the sum of the layout's affinities, and the field order of the structure is produced by bbcache using a greedy algorithm over the constructed affinity graphs. The process starts with the pair of fields connected by the maximum-affinity edge, and one field is appended to the layout at each stage; the field selected is the one that yields the largest increase in locality. When it comes to evaluating a structure's field order, the best method is to measure performance directly by editing and rerunning the application.
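A short sketch of that greedy step over a field affinity matrix (the field names and affinity weights are made up, and the "largest increase in locality" is approximated here by the total affinity to fields already placed, a simplification of what bbcache computes):

    #include <stdio.h>

    #define N 5

    /* Symmetric field-to-field affinity weights (made-up numbers):
     * higher means the two fields are accessed closer together in time. */
    static const int affinity[N][N] = {
        /*            key  next  len  name  err */
        /* key  */ {   0,   90,  40,    2,   1 },
        /* next */ {  90,    0,  35,    1,   1 },
        /* len  */ {  40,   35,   0,    5,   2 },
        /* name */ {   2,    1,   5,    0,  10 },
        /* err  */ {   1,    1,   2,   10,   0 },
    };
    static const char *fname[N] = { "key", "next", "len", "name", "err" };

    int main(void) {
        int placed[N] = {0}, order[N], count = 0;

        /* Start with the pair joined by the maximum-affinity edge. */
        int bi = 0, bj = 1;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (affinity[i][j] > affinity[bi][bj]) { bi = i; bj = j; }
        order[count++] = bi; placed[bi] = 1;
        order[count++] = bj; placed[bj] = 1;

        /* Greedily append the unplaced field with the largest total
         * affinity to the fields already in the layout. */
        while (count < N) {
            int best = -1, best_gain = -1;
            for (int f = 0; f < N; f++) {
                if (placed[f]) continue;
                int gain = 0;
                for (int k = 0; k < count; k++) gain += affinity[f][order[k]];
                if (gain > best_gain) { best_gain = gain; best = f; }
            }
            order[count++] = best; placed[best] = 1;
        }

        printf("recommended field order:");
        for (int k = 0; k < N; k++) printf(" %s", fname[order[k]]);
        printf("\n");
        return 0;
    }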

The experimental evaluation of class splitting describes the methodology used to determine how effective the splitting algorithm is and to measure its impact on program performance, in particular for Java programs. The methodology used the Vortex compiler with aggressive optimization. The compiled programs were run on a single processor of an UltraServer equipped with twelve 167 MHz UltraSPARC processors and 2 GB of memory. The large memory ensures that any improvement comes from the locality benefit alone and not from reduced paging activity. The first set of experiments aimed to gauge the potential for class splitting in Java programs and to determine how sensitive the splitting decisions are to the program inputs. The results show that the five benchmarked Java programs have a good number of classes suitable for splitting, as can be seen from their field access profiles.

The cold fields typically hold variables used to handle errors, store limit values, and reference information needed only rarely during structure traversal. On the question of sensitivity, the algorithm is not sensitive to the input data used for the profiling accesses. Field accesses to the split classes account for 45-64 percent of all field accesses in the programs; the average dynamic size of the split classes was obtained by weighting each class by its number of split instances. The splitting algorithm reduces the dynamic class size by 17-23 percent, reflecting the division into hot and cold portions.

During the experiment, the impact of Chilimbi and Larus's cache-conscious object co-location technique was measured on the five benchmarked Java programs, along with the results of hot/cold class splitting. The measurements show that the object co-location technique reduces miss rates by 10-27 percent and improves execution time by 6-18 percent.

 Structure Field Reordering for C

A four-processor 400 MHz Pentium Xeon system with 1 MB of L2 cache per processor was used; the system had 4 GB of memory and 200 disks. Microsoft SQL Server was instrumented to collect traces of structure field accesses while running the TPC-C benchmark, and the traces produced on this system were used by bbcache to produce recommended structure field orders. bbcache can account for up to 98% of the structures: of roughly 2,000 structures defined in the SQL Server source, about 163 were indicated to be active, and the top 25 of those 163 structures accounted for nearly 85% of structure accesses. The focus is therefore placed on these top 25 structures. However, a high fraction of SQL Server structures must retain their existing field order to avoid affecting on-disk compatibility, and many structures are highly interdependent, which prevents reordering one without affecting others.

Cache performance can thus be improved by two techniques, class splitting and field reordering, which restructure the internal organization of data within a structure and improve the performance of larger structures. In Java programs, field access profiles allow classes to be split into hot and cold parts, and the splitting decisions are robust and largely insensitive to the input data used for profiling. The splitting technique reduced the miss rates of Java programs by up to 27 percent and improved performance by 6-18 percent. For large structures, reordering fields so that those with high temporal affinity land in the same cache block improves cache block utilization.

