
Compiling BookSim with clang

I recently played around with using clang to compile BookSim instead of GCC. Here are some results running on one of our dual-quadcore Nehalem machines:

sh-4.0$ g++ -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man \
--infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla \
--enable-bootstrap --enable-shared --enable-threads=posix \
--enable-checking=release --with-system-zlib --enable-__cxa_atexit \
--disable-libunwind-exceptions --enable-gnu-unique-object \
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk \
--disable-dssi --enable-plugin \
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre \
--enable-libgcj-multifile --enable-java-maintainer-mode \
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib \
--with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 \
--build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) 

sh-4.0$ make clean; time make -j32 CPP=g++
[...]
real	0m14.907s
user	1m34.602s
sys	0m6.589s

sh-4.0$ clang++ -v
clang version 1.1 (branches/release_27)
Target: x86_64-redhat-linux-gnu
Thread model: posix

sh-4.0$ make clean; time make -j32 CPP=clang++
[...]
real	0m8.078s
user	1m3.135s
sys	0m4.282s

Not bad at all: Total compilation time was almost cut in half. However, the real improvement over GCC is that clang generates warning and error messages that are actually helpful:

sh-4.0$ make clean; make -j32 CPP=g++
[...]
networks/dragonfly.cpp: In function ‘void min_dragonflynew(const Router*, \
const Flit*, int, OutputSet*, bool)’:
networks/dragonfly.cpp:487: warning: statement has no effect
[...]

sh-4.0$ make clean; make -j32 CPP=clang++
[...]
networks/dragonfly.cpp:487:11: warning: expression result unused
      [-Wunused-value]
    f->ph == 2;
    ~~~~~ ^  ~
[...]

The warning emitted by GCC doesn't go much beyond "there may be something wrong on line 487," while clang pinpoints exactly what is wrong: A comparison operator where an assignment was intended. It even marks the specific column. Nice.

How about the speed of the generated code?

sh-4.0$ make clean; make -j32 CPP=g++
[...]

sh-4.0$ time ./booksim examples/mesh88_lat
[...]
real	0m1.645s
user	0m1.624s
sys	0m0.019s

sh-4.0$ make clean; make -j32 CPP=clang++
[...]

sh-4.0$ time ./booksim examples/mesh88_lat
[...]
real	0m1.808s
user	0m1.784s
sys	0m0.021s

So it appears that GCC's level of maturity does result in somewhat better code generation than what clang currently produces. However, if nothing else, between the significantly reduced compile times and the much more helpful warning and error messages, clang seems like a clear choice at least during development.

Efficient One-Hot / Multi-Hot Detect

Here's an efficient, scalable way of checking if a given vector has more than a single bit set (which is the same as checking if it is not a valid one-hot encoded vector):

The basic idea is to split the vector into pairs of bits and then, for each pair, check whether both of its bits are set (in which case we know that the vector is not one-hot encoded). Additionally, we check whether either of the pair's bits is set, and then recursively perform the same evaluation for pairs of pairs, etc.

As an example, let's assume we have an 8-bit vector in[0:7]. The first level of logic will then look like this:

wire [0:7] L1 = in;

wire [0:3] L1_both = {L1[0] & L1[1], L1[2] & L1[3], L1[4] & L1[5], L1[6] & L1[7]};

wire L1_multi_hot = |L1_both;

wire [0:3] L1_any = {L1[0] | L1[1], L1[2] | L1[3], L1[4] | L1[5], L1[6] | L1[7]};

The L1_any vector indicating if one or more bits in a pair are set is then fed into an identical (except for vector widths) second stage:

wire [0:3] L2 = L1_any;

wire [0:1] L2_both = {L2[0] & L2[1], L2[2] & L2[3]};

wire L2_multi_hot = |L2_both;

wire [0:1] L2_any = {L2[0] | L2[1], L2[2] | L2[3]};

This recursion continues with additional stages as necessary until the input vector is reduced to two bits; in our case, with an input vector of width 8, this happens in the third stage:

wire [0:1] L3 = L2_any;

wire L3_both = L3[0] & L3[1];

wire L3_multi_hot = L3_both;

The overall check bit can then simply be computed by ORing the multi_hot signals generated by the individual stages:

wire multi_hot = L1_multi_hot | L2_multi_hot | L3_multi_hot;

This approach yields a delay that scales approximately logarithmically with the width of the input vector, and avoids the hardware cost of directly comparing all possible pairs of bits.
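For cross-checking the logic, here is a small Python model of the same pairwise reduction (a sketch only; the function name and list-of-bits interface are chosen purely for illustration):

```python
def multi_hot(bits):
    """Return True iff more than one bit of `bits` is set.

    Python model of the pairwise-reduction scheme above; `bits` is a
    list of 0/1 values whose length is a power of two.
    """
    multi = False
    level = bits
    while len(level) > 1:
        pairs = list(zip(level[::2], level[1::2]))
        # If both bits of any pair are set, the vector is multi-hot.
        multi |= any(a & b for a, b in pairs)
        # Propagate "either bit set" to the next level of the tree.
        level = [a | b for a, b in pairs]
    return bool(multi)
```

Each loop iteration corresponds to one of the L1/L2/L3 stages above, and the final OR of the per-stage flags is accumulated in `multi`.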

EDIT: An implementation of this scheme is available in our subversion repository: c_multi_hot_det.v.

Supercomputing '09

George and I are in Portland, OR, this week for the SC09 conference. We're just about done with the first two days of the conference, which comprise tutorials on various topics; the quality has been a bit of a hit-or-miss affair so far, but the OpenCL tutorial we're both sitting in as I type this has been pretty interesting. Tonight will have the conference opening gala, and then the technical program will kick off tomorrow and run through the end of the week. George and I will be presenting our papers in room E145-146 on Wednesday afternoon, 4:00-4:30 and 4:30-5:00, respectively.

Hello World

In the interest of sharing information and keeping people up-to-date on what's happening in our group, I've changed the access permissions to the blog such that entries can now be made private or public by prefixing their shortname accordingly. Public entries are visible to (and can be commented on by) everybody (no user account required).

ACS Workshop

George and I will be attending the NSA-sponsored 2009 Advanced Computing Systems Research Program Workshop in Annapolis, MD, on Wednesday and Thursday. In particular, we will be presenting a poster about the NoC-related research going on in our group. Curt and JongSoo will be presenting ELM, and Bill will give a talk on Power-Efficient Supercomputing.

Verilog Vector Extension Oddity

So it turns out I was just bitten again by the exact same problem I already had in r1282; for future reference, I'll quote from the commit message here:

Apparently Verilog (or just VCS?) and I have different ideas about how the expression

cred_count_q - ~credit

should be interpreted; I thought this should subtract a single-bit value, corresponding to the bit complement of credit, from the cred_count_q vector, but what VCS actually ends up doing is first extending credit to match the width of cred_count_q, and then bit-inverting it, effectively resulting in the following expression:

cred_count_q - (-credit-1) = cred_count_q + credit + 1

Interestingly, the unexpected behavior even persists when adding parentheses to override operator precedence:

cred_count_q - (~credit)

It is unclear to me if this is a bug in VCS, undefined behavior according to the Verilog LRM, or just me misunderstanding the rules for bit vector expansion.
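The mismatch is easy to reproduce with plain two's-complement arithmetic; here is a Python sketch (the 8-bit WIDTH and the function names are illustrative only, not taken from the RTL):

```python
WIDTH = 8                    # illustrative width for cred_count_q
MASK = (1 << WIDTH) - 1

def intended(cred_count_q, credit):
    # My reading: invert the 1-bit credit first, then subtract.
    return (cred_count_q - (~credit & 1)) & MASK

def vcs_behavior(cred_count_q, credit):
    # VCS's reading: zero-extend credit to WIDTH bits, then invert.
    return (cred_count_q - (~credit & MASK)) & MASK

for credit in (0, 1):
    # After extension, ~credit equals -credit-1 in two's complement, so
    # x - ~credit == x + credit + 1 (mod 2**WIDTH), as quoted above.
    assert vcs_behavior(10, credit) == (10 + credit + 1) & MASK
```

With credit = 1, the intended expression leaves the counter unchanged, while the width-extended interpretation adds 2.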

STM Research-in-Progress Trip

George and I will be visiting ST Microelectronics in Grenoble (France) between 10/10 and 10/14 to talk about the research activities currently going on in the NoC group. Besides the two of us, Curt will be there to talk about ELM, and students from Prof. Murrmann's and Prof. Nishi's groups will present more device-oriented research.

This should be a great opportunity to have some face-to-face discussions with the people from ST about the test chip we plan to tape out.

  • Posted: 2009-09-10 00:00 (Updated: 2009-09-30 22:23)
  • Author: dub
  • Categories: news

Ungrouping all instances of a design in Design Compiler

Throughout my router RTL, I make extensive use of small helper modules for commonly encountered logic, such as select muxes or flipflops. While this allows me to avoid a rather large amount of unnecessary code duplication, it does add additional levels of hierarchy to the design, which in many cases are unwanted and in fact may be detrimental to optimization.

While I have thus far usually just flattened all the top-level blocks inside the router to avoid this issue, the latter approach makes it somewhat difficult to e.g. get power breakdowns for different parts of the router.

As the common modules usually do not represent actual levels of hierarchy, a better solution is to ungroup all of these modules into their parent module, which can be achieved using the following TCL commands:

set common_designs [find reference "c_*" -hierarchy]
set common_macros  [find reference "c_*_mac*" -hierarchy]
set common_others  [remove_from_collection ${common_designs} ${common_macros}]
ungroup ${common_others}

(Note that this does not ungroup those common modules that actually represent macros.)

  • Posted: 2009-08-31 17:00 (Updated: 2010-02-02 16:00)
  • Author: dub
  • Categories: notes tips

Flit buffer implementation variants

With revision r1431, I added support for a bunch of additional register file types to build the flit buffer from, including latch-based variants and DesignWare implementations. Here are preliminary synthesis results for a mesh router (2x1x1 VCs, 8 buffers each, 4 header buffers, 32 bits per flit, separable input-first allocators).

Implementation    | Delay (ns) | Comb. Area   | Noncomb. Area | Cell Internal Power | Net Switching Power | Cell Leakage Power
FF 2D             | 2.1        | 16850.081075 | 28909.137822  | 2.3401E-3           | 1.3423E-3           | 45.0393E-6
FF 1D MUX         | 2.1        | 16840.908296 | 28905.609840  | 2.3186E-3           | 1.3352E-3           | 44.8422E-6
FF 1D Tristate    | 2.1        | 14304.629081 | 35643.736943  | 2.3015E-3           | 1.2529E-3           | 46.8056E-6
FF DesignWare     | 2.3        | 11274.782586 | 14027.327833  | 1.6845E-3           | 907.9999E-6         | 30.1393E-6
Latch 2D          | 2.1        | 18130.039453 | 26320.643963  | 1.9813E-3           | 1.3119E-3           | 64.6658E-6
Latch 1D MUX      | 2.1        | 17436.434644 | 24306.508821  | 1.9470E-3           | 1.2205E-3           | 57.6310E-6
Latch 1D Tristate | 2.3        | 13005.972269 | 28779.307125  | 1.7287E-3           | 1.0423E-3           | 44.8526E-6
Latch DesignWare  | 2.3        | 11269.843385 | 14032.619832  | 1.6750E-3           | 889.3071E-6         | 30.1315E-6

Surprisingly, the area difference between the latch- and register-based implementations is fairly small; however, the latch-based implementations do use less power and energy per cycle. The DesignWare implementations are slightly slower than most of the self-built ones, but use significantly less area (although once again the difference between register- and latch-based implementations is minimal). EDIT: The DesignWare results are actually invalid, as synthesis was unable to find the corresponding DW_ram_* blocks, and thus left them unresolved.

Router Implementation Tradeoffs

I performed a bunch of synthesis sweeps to evaluate the impact of different router configuration options:

Configuration | Ports | VCs   | Buffers | Headers | Flit Width | VC Allocator | Switch Allocator
Mesh          | 5     | 2x1x1 | 8       | 4       | 32         | sep_if       | sep_if
Fbfly         | 10    | 2x2x1 | 8       | 4       | 32         | sep_if       | sep_if

Here are the synthesis results:

Config. | Crossbar  | Header FIFO | Arbiters | Delay | Comb. Area | Noncomb. Area | Total Area | Cell Internal Power | Net Switching Power | Total Dynamic Power | Dynamic Energy per Cycle | Cell Leakage Power
Mesh    | MUX       | Shifting    | RR       | 2.2ns | 18067      | 26933         | 45000      | 2.22mW              | 1.20mW              | 3.43mW              | 7.546pJ                  | 44.99uW
Mesh    | Dist. MUX | Shifting    | RR       | 2.2ns | 18512      | 26994         | 45506      | 2.23mW              | 1.19mW              | 3.42mW              | 7.524pJ                  | 45.63uW
Mesh    | Tristate  | Shifting    | RR       | 2.2ns | 17086      | 29144         | 46230      | 2.22mW              | 1.20mW              | 3.42mW              | 7.524pJ                  | 45.07uW
Mesh    | MUX       | Shifting    | Matrix   | 2.2ns | 18587      | 27934         | 46521      | 2.26mW              | 1.20mW              | 3.46mW              | 7.612pJ                  | 45.70uW
Mesh    | Dist. MUX | Shifting    | Matrix   | 2.2ns | 18905      | 27752         | 46657      | 2.24mW              | 1.24mW              | 3.48mW              | 7.656pJ                  | 45.76uW
Mesh    | Tristate  | Shifting    | Matrix   | 2.2ns | 17153      | 30141         | 47294      | 2.24mW              | 1.28mW              | 3.52mW              | 7.744pJ                  | 44.06uW
Mesh    | MUX       | Indexed     | RR       | 2.1ns | 18789      | 27080         | 45869      | 2.33mW              | 1.28mW              | 3.61mW              | 7.581pJ                  | 48.00uW
Mesh    | Dist. MUX | Indexed     | RR       | 2.1ns | 19315      | 26945         | 46259      | 2.33mW              | 1.24mW              | 3.57mW              | 7.497pJ                  | 48.51uW
Mesh    | Tristate  | Indexed     | RR       | 2.0ns | 18728      | 29139         | 47867      | 2.43mW              | 1.29mW              | 3.72mW              | 7.44pJ                   | 49.78uW
Mesh    | MUX       | Indexed     | Matrix   | 2.1ns | 18245      | 27876         | 46122      | 2.33mW              | 1.33mW              | 3.66mW              | 7.686pJ                  | 45.99uW
Mesh    | Dist. MUX | Indexed     | Matrix   | 2.1ns | 19006      | 27890         | 46896      | 2.36mW              | 1.28mW              | 3.64mW              | 7.644pJ                  | 47.96uW
Mesh    | Tristate  | Indexed     | Matrix   | 2.1ns | 16236      | 30126         | 46362      | 2.29mW              | 1.25mW              | 3.54mW              | 7.434pJ                  | 43.17uW
Fbfly   | MUX       | Shifting    | RR       | 2.9ns | 103242     | 111082        | 214324     | 5.95mW              | 3.82mW              | 9.77mW              | 28.333pJ                 | 249.57uW
Fbfly   | Dist. MUX | Shifting    | RR       | 3.0ns | 99304      | 111101        | 210405     | 5.69mW              | 3.57mW              | 9.26mW              | 27.78pJ                  | 238.03uW
Fbfly   | Tristate  | Shifting    | RR       | 3.0ns | 91287      | 121334        | 212621     | 5.67mW              | 3.67mW              | 9.34mW              | 28.02pJ                  | 231.23uW
Fbfly   | MUX       | Shifting    | Matrix   | 2.8ns | 121282     | 128140        | 249422     | 8.19mW              | 4.74mW              | 12.93mW             | 36.204pJ                 | 285.55uW
Fbfly   | Dist. MUX | Shifting    | Matrix   | 2.9ns | 113708     | 128063        | 241771     | 7.86mW              | 4.58mW              | 12.44mW             | 36.076pJ                 | 264.93uW
Fbfly   | Tristate  | Shifting    | Matrix   | 2.9ns | 104154     | 137647        | 241801     | 7.78mW              | 4.49mW              | 12.27mW             | 35.583pJ                 | 254.70uW
Fbfly   | MUX       | Indexed     | RR       | 2.9ns | 96797      | 111271        | 208067     | 5.87mW              | 3.66mW              | 9.54mW              | 27.666pJ                 | 237.82uW
Fbfly   | Dist. MUX | Indexed     | RR       | 3.1ns | 89714      | 111164        | 200878     | 5.41mW              | 3.36mW              | 8.77mW              | 27.187pJ                 | 218.67uW
Fbfly   | Tristate  | Indexed     | RR       | 2.8ns | 94355      | 120646        | 215013     | 6.03mW              | 4.16mW              | 10.19mW             | 28.532pJ                 | 242.61uW
Fbfly   | MUX       | Indexed     | Matrix   | 2.8ns | 115120     | 128432        | 243552     | 8.13mW              | 4.75mW              | 12.89mW             | 36.092pJ                 | 276.85uW
Fbfly   | Dist. MUX | Indexed     | Matrix   | 2.9ns | 114956     | 128337        | 243293     | 7.83mW              | 4.44mW              | 12.27mW             | 35.583pJ                 | 273.37uW
Fbfly   | Tristate  | Indexed     | Matrix   | 2.8ns | 110200     | 138614        | 248814     | 8.10mW              | 4.84mW              | 12.94mW             | 36.232pJ                 | 273.31uW

For the mesh configuration with shifting header FIFO, the crossbar is not critical, and thus all variants lead to the same cycle time; area and power differences are minor, but overall the plain MUX-based implementation is probably the best choice.

If an indexed header FIFO is used, the tristate-based implementation is somewhat faster at the expense of somewhat higher power dissipation; the MUX-based implementation, on the other hand, yields lowest area, and the one based on distributed MUXes dissipates the least power.

For the flattened butterfly, overall the MUX-based implementation is fastest when using shifting header FIFOs, while the distributed MUX has least area, and the tristate-based implementation dissipates the least power.

When using indexed header FIFOs, the tristate-based implementation is the fastest but most expensive (both in terms of area and power), while the distributed MUX is the most power and area efficient.

UPDATE: Added results for energy per cycle.

SC09 Paper Accepted!

So it appears my SC09 submission got accepted -- sweet. The final version of the paper is due about two months from now, which should give me plenty of time to re-run all my booksim simulations with the fixed version that properly accounts for switch allocation delay, and possibly even include results for combined allocation; likewise, I need to re-run all synthesis jobs with properly constrained input and output timing.

There's also a few other things I might look into, time permitting; in particular, it might be interesting to compare synthesis results for the entire router, rather than for the allocator by itself. Perhaps this could even include P&R results. Furthermore, I would like to investigate the performance impact of "holding" output ports once they are granted and until either credits or flits run out or the packet ends.

Synthesis Results for Other Router Models

In order to check whether my router implementation is reasonable from an area and cycle time perspective, I'm performing synthesis sweeps for Mullins's and Peh's router models. Initial results seem to indicate that neither model properly accounts for the fact that the router inputs should arrive late in the cycle, having just traversed the channel. I'm kicking off some additional synthesis runs without the port constraints to see what the situation is like for the internal logic.

Update: When allowing for 0.5ns (about 2x the library setup time) of the channel cycle to be used by the first router stage, both Mullins's and Peh's designs fail to make timing regardless of cycle time; it seems that both models have non-trivial amounts of logic that occurs before the first retiming stage. The fact that my router design does not require stealing time from the channel cycles is something I should probably stress when publishing my RTL.

Here are the synthesis results for the models I looked at when synthesized without port constraints:

Router model             | Cells | Comb. Area | Noncomb. Area | Total Area | Cycle Time | Cell Power | Net Power | Dyn. Power | Leak. Power
Peh (2 VCs)              | 24098 | 32687      | 51006         | 83693      | 2.7        | 11.91      | 3.75      | 15.65      | 0.07
Mullins (nonspec, 2 VCs) | 9467  | 13518      | 19747         | 33265      | 2.0        | 6.76       | 2.95      | 9.70       | 0.03
Mullins (spec, 2 VCs)    | 9177  | 12477      | 19670         | 32148      | 2.3        | 5.86       | 2.54      | 8.41       | 0.03

Using DesignWare Components

Browsing through the DesignWare overview, it seems I am currently duplicating a lot of functionality that is provided as ready-made IP blocks by DesignWare. In particular, I may be able to use the following DesignWare components to replace modules I had written from scratch:

DW_arb_fcfs                      | matrix_arbiter
DW_arb_rr                        | rr_arbiter
DW01_binenc                      | encoder
DW01_decode                      | decoder
DW_fifo_s1_sf                    | fifo
DW_fifoctl_s1_sf                 | fifo_ctrl
DW_ram_r_w_s_dff, DW_ram_r_w_s_lat | regfile

If I do go ahead and implement these, I should probably leave in the current implementation as well so the RTL can be compiled even without DesignWare.

EDIT: The DW_arb_rr component is actually not a suitable replacement for rr_arbiter; even when output_mode is set to 0, grants trail requests by one cycle, which is not what the remaining logic expects. Furthermore, there's no provision to gate the clock on the internal state to keep it unchanged; enable deactivates the arbiter completely, which is not what we want. DW_arb_fcfs, on the other hand, works as expected.

More Notes on Async vs. Sync Reset

Here are some post-synthesis results for a 5x5 router with one VC per message class, 32-bit flit width, 8 flit buffers and 4 header buffers per VC:

Reset | Delay | Area
sync  | 2.0   | 45530
async | 2.1   | 46531

Asynchronous vs. Synchronous Reset

It looks like both Mullins and Peh use synchronous reset for their router implementations; the coding guidelines for McKeown's NetFPGA project also require sync reset. So far, I've been using asynchronous reset for my synthesis runs; is there some disadvantage to that approach that I'm missing? Intuitively, it seems to me that sync reset might extend the critical path and make clock gating expressions more complicated (need escape path to enable clocks for reset); I'll run synthesis sweeps with both approaches to see what the actual impact is.

More Notes on Traffic Generation

One issue with random traffic generation is how to generate routing information. In particular, the router RTL currently expects the destination addresses for successive resource classes to be distinct: It makes the basic assumption that at each hop, packets either stay in the same resource class or advance to the next, but no further than that. If two successive resource classes had the same destination address, the router at that address would have to advance by two or more resource classes at once.

For the sim-only testbench, this is not an issue: I can iteratively build the routing information by starting with a random destination and generating each subsequent one by adding a random (but non-zero) offset. For the synthesizable traffic generator, however, this is not an option, since I cannot restrict the random number generation appropriately: While an LFSR by itself produces only non-zero values, its period would be too short; taking the LSBs of a (potentially much) wider LFSR, or using all or part of a CFSR, can in fact produce an all-zeroes vector (and thus the same address again).
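The truncation problem can be demonstrated with a toy 4-bit maximal-length Fibonacci LFSR (taps at bits 4 and 3, a standard maximal tap set); this Python sketch is for illustration only:

```python
def lfsr4_states(seed=1):
    """Enumerate one full period of a maximal 4-bit Fibonacci LFSR
    (taps at bits 4 and 3, i.e. polynomial x^4 + x^3 + 1)."""
    states, s = [], seed
    while True:
        states.append(s)
        bit = ((s >> 3) ^ (s >> 2)) & 1        # feedback = XOR of the taps
        s = ((s << 1) | bit) & 0xF
        if s == seed:
            return states

states = lfsr4_states()
assert 0 not in states                         # full word: never all-zeroes
assert any(s & 0b11 == 0 for s in states)      # truncated LSBs: can be zero
```

The full 4-bit word cycles through all 15 non-zero values, but its two LSBs pass through zero several times per period, which is exactly the failure mode described above.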

Router Synthesis Results

Here are a couple of synthesis results:

Configuration:

  • 2D flattened butterfly with 64 nodes
  • 4 nodes per router
  • 8 flit buffers, 4 header buffers
  • 2x2x1 VCs
  • 32 bits per flit (+overhead)
Credit Handling | Track VCs | Cells | Comb. Area | Noncomb. Area | Total Area | Cycle Time | Cell Power | Net Power | Dyn. Power | Leak. Power
Level           | Output    | 475   | 92507      | 121897        | 214404     | 2.8        | 6.13       | 3.90      | 10.04      | 0.24
Push            | Output    | 628   | 96338      | 122626        | 218964     | 2.9        | 6.08       | 3.88      | 9.96       | 0.25
Level           | Input     | 638   | 87805      | 118505        | 206310     | 2.8        | 5.93       | 3.50      | 9.43       | 0.23
Push            | Input     | 593   | 92964      | 112114        | 212077     | 2.9        | 5.90       | 3.66      | 9.57       | 0.24

More Notes on Packet Generation

Regarding the bias issues I mentioned earlier, I found references to so-called complete feedback shift registers (CFSRs), which insert the all-zeroes state after the 'd1 state. This eliminates the bias, and should allow me to directly generate random VC IDs etc.
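As a sketch of the CFSR idea (again modeled on a toy 4-bit maximal LFSR with taps at bits 4 and 3; the function name is illustrative), the all-zeroes state can be spliced in right after the 'd1 state:

```python
def cfsr4_next(s):
    """One step of a 4-bit 'complete' FSR: a maximal LFSR (taps 4, 3)
    with the all-zeroes state spliced in right after state 'd1."""
    if s == 1:
        return 0                 # insert the all-zeroes state after 'd1 ...
    if s == 0:
        return 2                 # ... then resume where the LFSR left off
    bit = ((s >> 3) ^ (s >> 2)) & 1
    return ((s << 1) | bit) & 0xF

seq, s = [], 1
for _ in range(16):
    seq.append(s)
    s = cfsr4_next(s)
```

The resulting cycle visits all 16 states (including all-zeroes) exactly once per period, which removes the bias.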

How To Generate Random Destination VCs

One way to go about this would be to just generate a random VC ID by taking an LFSR output modulo num_vcs. The drawback with that is once again the fact that the range of the LFSR is biased due to the all-zeroes word being illegal. To avoid this, an alternative implementation would be to just specify separate injection rates for each packet class; within each class, VCs can then be used in a round-robin fashion. Another option would be to use the per-class injection rates to drive independent packet generation processes for each VC within the class.
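The bias can be quantified over one full LFSR period; in this Python sketch, the 4-bit LFSR and NUM_VCS = 4 are hypothetical values chosen only for illustration:

```python
from collections import Counter

def lfsr4_period(seed=1):
    """One full period of a maximal 4-bit Fibonacci LFSR (taps 4, 3)."""
    states, s = [], seed
    while True:
        states.append(s)
        s = ((s << 1) | (((s >> 3) ^ (s >> 2)) & 1)) & 0xF
        if s == seed:
            return states

NUM_VCS = 4  # hypothetical VC count, for illustration only

# Tally VC IDs drawn as (LFSR value) % NUM_VCS over one full period.
counts = Counter(v % NUM_VCS for v in lfsr4_period())
# The missing all-zeroes word shortchanges VC 0: residue 0 occurs
# 3 times per period, while residues 1..3 occur 4 times each.
```

The per-class injection-rate alternative sidesteps this entirely, since VCs are then assigned round-robin rather than pseudo-randomly.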

Test Chip Infrastructure Notes

Packet arrivals

To model packet arrivals, we use an LFSR and compare its value against a threshold register; if the threshold value meets or exceeds the generated random value, a packet is generated. This gives us geometrically distributed arrival times. The size of the LFSR solely depends on the desired granularity; 8 bits (256 possible arrival rates) might be sufficient. One open question in this context is whether inter-arrival times should be measured head-to-head or tail-to-head. In the former case, packet generation is free-running and increments a pending packets counter whenever a packet is generated. In the latter case, on the other hand, packet generation is suspended whenever a packet is generated, and only reactivated once the packet has been sent out; with this approach, the LFSR directly triggers packet generation, and no additional counter is required. The first mechanism appears to be the more realistic choice; however, it might be useful to implement both mechanisms.
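As a sanity check of the threshold scheme, here is a Python model that substitutes Python's random module for the actual LFSR (an approximation; the 8-bit width, threshold value, and all names are assumptions for illustration):

```python
import random

def inter_arrival_times(threshold, n, width=8, seed=0):
    """Model the arrival process: each cycle, draw a pseudo-random value
    (standing in for the LFSR output, which ranges over 1 .. 2**width - 1)
    and generate a packet whenever threshold >= value.
    Returns n head-to-head inter-arrival times."""
    rng = random.Random(seed)
    times, gap = [], 0
    while len(times) < n:
        gap += 1
        if threshold >= rng.randrange(1, 1 << width):
            times.append(gap)
            gap = 0
    return times

# With threshold t, P(packet per cycle) = t / (2**width - 1), so the
# inter-arrival times are geometric with mean (2**width - 1) / t.
times = inter_arrival_times(threshold=51, n=100_000)
```

With threshold 51 and an 8-bit range, the per-cycle probability is 51/255 = 0.2, so the sample mean should land near 5 cycles.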

Packet lengths

Similarly, we can use an LFSR to generate pseudo-random packet lengths. This will generate packets with almost uniformly distributed lengths; the exception is that the LFSR will never generate zero-length packets. Consequently, we may want to set the packet's payload length to the LFSR value minus one. The question here, though, is how we limit the packet length: By itself, the LFSR will generate lengths in the range from one to 2^n-1; if we require a shorter maximum packet length, we will have to add logic to handle the out-of-range values. Maybe the best thing to do would be to make the packet lengths bimodal and use an LFSR to select one of two configurable values (short packets vs. long packets -- or maybe we should have a configurable length per message type?); however, in this case, there will be a slight bias towards one of the choices due to the fact that the LFSR will never generate the all-zeroes word. The impact of this can be minimized by making the LFSR wide enough.
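The residual bias of the bimodal selection, and how it shrinks with LFSR width, can be sketched as follows (the helper name and midpoint-comparison rule are hypothetical choices for illustration):

```python
def select_bias(width):
    """Deviation from a fair coin when an LFSR word of `width` bits
    (cycling through 1 .. 2**width - 1, never all-zeroes) is compared
    against the midpoint to pick the "short" packet length."""
    period = (1 << width) - 1
    half = 1 << (width - 1)
    short_picks = sum(1 for v in range(1, period + 1) if v < half)
    return short_picks / period - 0.5

# An 8-bit LFSR under-selects "short" by about 0.2%; at 16 bits the
# bias shrinks to roughly 8e-6 -- widening the LFSR suppresses it.
```

This matches the intuition above: the single missing all-zeroes word matters less and less as the period grows.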

Destination addresses

Packet destination addresses can likewise be generated via an LFSR; here, the fact that LFSRs never generate the all-zeroes value is actually beneficial: We can compute the destination address as the current address plus the LFSR value, modulo the total number of nodes in the network (would 64 be a reasonable choice?). For cases like the flattened butterfly where we can have an intermediate node address, we need to make sure that the origin, intermediate and final address are all different. One possible approach would be to generate the intermediate and final address for non-minimal packets by adding an LFSR value to the current address for each, and then comparing the resulting intermediate and final address; if both happen to be identical, the packet is forced into the minimal routing class.
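Assuming a 64-node network and a log2(N)-bit offset LFSR (both assumptions for illustration), the never-all-zeroes property indeed guarantees that a node never addresses itself while keeping every other node reachable; a quick Python check:

```python
N = 64  # network size; 64 nodes, as suggested in the text

def reachable(cur):
    """Destinations (cur + offset) % N for offsets 1 .. N-1, i.e. the
    values a log2(N)-bit LFSR can emit (it never emits all-zeroes)."""
    return {(cur + off) % N for off in range(1, N)}

for cur in range(N):
    dests = reachable(cur)
    assert cur not in dests      # nonzero offset => never self-addressed
    assert len(dests) == N - 1   # every other node remains reachable
```

Note this guarantee relies on the offset staying strictly below N; a wider LFSR taken modulo N could produce a zero offset and reintroduce self-addressed packets.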

Payload data

Finally, for the actual payload data, we can just use a big LFSR to generate each flit, overriding individual bit ranges as appropriate for head flits.