Blog
Performance of Benchmark Program RAY
(Refer to my blog post (here) to compare each benchmark program's performance results, which helps explain why we chose RAY.) At last week's meeting, it became apparent that we would want to look at the benchmark RAY for further testing and data analysis. There were two main reasons why we chose it: 1) There was a consistently direct linear relationship between the number of SM cores and the miss and stall rates. 2) The miss and stall rates were significantly larger and more visible than for any other benchmark program. The second reason would also mean that...
GTX580 Stall and Miss Rates
Donghyeon and I looked at the miss and stall rates of the benchmark programs when varying the number of cores used. Refer to his blog post (here) for more detailed explanations of the parameters varied. As mentioned in his post, the GTX580 has 16 SM cores. Thus, the number of SM cores was varied as 1, 2, 4, 8, and 16. We looked at benchmark programs with a 128-byte block size, associativity varied from 2 to 32 in increments of 2, and SetNum varied over 1, 2, 4, 6, 8, 12, and 16. However, these did not affect the...
Characterisation
BFS finds use in state-space searching, graph partitioning, automatic theorem proving, etc., and is one of the most used graph operations in practical graph algorithms. The BFS problem is: given an undirected, unweighted graph G(V, E) and a source vertex S, find the minimum number of edges needed to reach every vertex in V from the source vertex S. The optimal sequential solution for this problem takes O(V + E) time. CUDA implementation of BFS: we solve the BFS problem using level synchronization. BFS traverses the graph in levels; once a level is visited, it is not visited again. The BFS...
GTX580 and How to Config in GPGPU-Sim
Introducing the GTX580! This is the fully unlocked version of the GTX480 that has been provided by GPGPU-Sim. The only real difference is that we are going from 15 SM cores to 16 SM cores. This is done by setting the gpgpu_n_clusters parameter in the config file; everything else remains the same. ====== How to Configure GPU Cores to Share the L1I Cache There are two files that need to be changed: gpgpusim.config and config_fermi_islip.icnt. Considering that the default GTX580 has a total of 16 SM cores, let NSCALE be the number of SM cores we want to group together. (e.g., if NSCALE = 2, we are...
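The GTX480-to-GTX580 change described above is a one-line config edit. A sketch of the relevant gpgpusim.config fragment (the option name is GPGPU-Sim's; the rest of the file is left untouched, as the post says):

```text
# gpgpusim.config (excerpt)
# GTX480 ships with 15 SM clusters; bump to 16 for the GTX580.
-gpgpu_n_clusters 16
```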
TTR Case Study
I finally got around to making a new blog post. First, I implemented zooming on the x axis and panning for the web-based TTR visualization tool. Additionally, I added a slider to zoom on the y axis. The control is a bit clunky and unintuitive, but for my purposes, effective. I have been spending a lot of time familiarizing myself with TTR curves and trying to figure out both how they could be useful and how to use the added information from a TTR curve to make it more useful than miss rate. I tried for a while to...
GTX480 Stall and Miss Rates
[edited/corrected version] I looked at the miss and stall rates of the benchmark programs when varying the number of sets (called NumSet), run on the GTX480 machine. In particular, I looked at a unicore (one-core cluster) with a 64-byte block size and 4-way set associativity. The NumSet was varied as 2, 4, 8, 16, and 60. Here is a table of the miss and stall rate data. Here is a bar graph for the original miss rates. Note that there is basically no variability between BFS and STO. Here is a zoomed-in version so that...
Architectures
In this blog, I have detailed some relevant stats about some of the more recent architectures. These stats are in the form of SM block diagrams and hardware specifications in tables. All the information used to compile this blog has been taken from NVIDIA's white papers on the respective architectures. Since we have been discussing the possibility of shifting to either Kepler or Maxwell, I have included a somewhat qualitative comparison between the two architectures. Comparison between Kepler and Maxwell First, some stats about Kepler. It has 4 warp schedulers, 8 instruction dispatch units, four 32-wide vector ALUs, and another...
2014-2-25-Minutes
Here are the minutes from 2/25/14. In Attendance: Prof. Spjut, Donghyeon, Fabiha, and Akhil Start time: 12:05 pm End time: 1:01 pm Updates: Akhil finished running the instruction counter for all of the benchmarks. This blog post has been updated with the numbers. I finished writing a parser to extract information and calculate the stalls due to the instruction cache. I couldn't work on the instruction duplication because we have not yet edited and looked at the banks. Donghyeon edited the instruction cache output to display the stats for each core's performance and made a more thorough list of each...