Traditionally, cosimulation refers to the modeling of systems whose parts are represented at different levels of abstraction or written in different languages: for example, SystemC models combined with RTL code, or TLM models with RTL. The RTL part can be simulated in software or executed in real time on an FPGA prototype; in the latter case, some transaction interface is assumed between the FPGA platform and the host machine that simulates the rest of the system.
For FPGA prototyping, Baikal Electronics uses the Synopsys HAPS®-80 platform, which makes it possible, during chip development, to run scenarios as complex as booting an operating system, something that would be impossible to reproduce with RTL simulation in an acceptable time frame.
But FPGA prototyping cannot replace RTL simulation in the full verification of individual subsystems, since it is impossible to reproduce on an FPGA, in all their nuances, the behavior of such elements of the future chip as, for example, PHY interface controllers. It is also difficult to implement on an FPGA the large number of clock domains characteristic of modern systems-on-chip.
So in some cases full-fledged RTL simulation is indispensable; but what about the huge runtimes? For example, simulating DDR4 training code can take two weeks. Baikal's engineers faced the question: is it possible to carve out of this verification environment the part that can be fully synthesized for the FPGA platform, and to co-simulate the non-synthesizable part on the simulator while the synthesized part executes in real time on the FPGA? After all, it is obvious that the lion's share of simulation time is spent reproducing the switching activity of highly parallel structures, which port perfectly to FPGAs.
The criteria that an RTL + FPGA hybrid cosimulation flow must satisfy (if the goal is to speed up development) are easy to formulate:
– cosimulation must demonstrate a significant runtime advantage over pure RTL simulation;
– the original verification environment must require only minimal, clearly formalized rework, otherwise the gain in simulation time is "eaten up" by the time spent adjusting the environment. The same applies to rework of the FPGA prototype part.
The starting point for Baikal's engineers in developing the new cosimulation flow was the communication infrastructure connecting the HAPS-80 platform to the host machine: the proprietary Synopsys UMR-Bus®, which links the platform to the host via a PCIe cable. Synopsys supplies a set of transactors, in the form of additional IP blocks, that allow the HAPS-80 to be connected to a virtual platform; on the virtual side of this hybrid prototyping setup the simulated device is represented as a TLM or SystemC model. Our idea, however, required pairing HAPS with an RTL simulator, specifically the Synopsys VCS® simulator.
Synopsys did not have such adapters out of the box, but it was obvious to us that the UMR Bus infrastructure had everything needed to build them. So Baikal contacted the Synopsys development team in Erfurt, which had invented the concept of hybrid prototyping and the UMR Bus itself. Our hopes were justified: our colleagues kindly provided scripts that redirect data streams from the VCS simulator onto the UMR Bus via DPI function calls. The other end of the UMR Bus, the part residing on the HAPS platform, Baikal's engineers took apart down to the atoms on their own, since the colleagues from Erfurt had thoroughly documented the foundation of the UMR Bus infrastructure, the so-called CAPIM modules.
It would seem that we now had transport for developing our own adapter between HAPS and VCS; it remained to split the simulated subsystem into a non-synthesizable VCS part and a synthesizable FPGA part, then simply run the testbench on the VCS side. But how to synchronize the two worlds, RTL simulation and FPGA, when processes on the FPGA run thousands of times faster? One would have to build a complex hardware-software interface that aligns the data streams and buffers the signals of the VCS-HAPS adapter. As a reminder, we were not setting out to invent a fancy cosimulation rig out of academic interest; we wanted to speed up quite specific business processes with rather limited resources. Therefore the criterion of minimal effort to rework the existing verification environment was key.
The idea that got Baikal out of this difficulty was, as we later found out, not novel. Similar principles were used in the SCE-MI co-emulation interface; apparently they lie on the surface. But we reinvented this wheel for our own environment, and this approach to using specific Synopsys tools, IP, and the hardware prototyping platform was a definite innovation.
The solution is to stop the simulation on the rising edge of each system clock, then, via DPI function calls, send to the FPGA side all the signals of the "bundle" connecting the VCS and HAPS parts of the simulated design. After the transmitted stimuli are applied to the clocked elements of the HAPS part, a single system-clock pulse is issued; the signals that changed after that pulse are sent back across the bundle to the VCS side, and simulation continues. In fact, it is the VCS simulation that is the performance bottleneck of the entire hybrid cosimulation: exchanging one cycle's worth of bundle data between the VCS and HAPS sides is faster than simulating one cycle of the VCS part. It is the VCS testbench that generates the clock for the HAPS side, producing widely spaced clock pulses, between which the circuit on HAPS "freezes" until the next clock edge arrives from VCS. This achieves absolute synchronization of the cosimulation, accurate to the clock period.
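The lockstep exchange described above can be sketched on the VCS side roughly as follows. This is a minimal illustration only: the DPI function names (`umr_send_bundle`, `umr_pulse_clock`, `umr_recv_bundle`), the module name, and the 128-bit bundle widths are invented for the example and are not the actual Synopsys UMR-Bus API.

```systemverilog
// Hypothetical sketch of the VCS-side bridge: on every rising system-clock
// edge the simulation pauses, ships the input bundle to the FPGA, issues
// exactly one clock pulse on the HAPS side, and reads the results back.
module vcs_haps_bridge (
  input  logic         sys_clk,    // system clock generated by the testbench
  input  logic [127:0] to_haps,    // bundle driven by the VCS part
  output logic [127:0] from_haps   // bundle sampled from the HAPS part
);
  // DPI-C functions implemented on the host, wrapping UMR-Bus transfers
  // (names are illustrative, not the real Synopsys interface)
  import "DPI-C" function void umr_send_bundle(input  bit [127:0] data);
  import "DPI-C" function void umr_pulse_clock();
  import "DPI-C" function void umr_recv_bundle(output bit [127:0] data);

  bit [127:0] rx;

  always @(posedge sys_clk) begin
    umr_send_bundle(to_haps);  // 1. serialize and send the input bundle
    umr_pulse_clock();         // 2. one clock pulse advances the HAPS part
    umr_recv_bundle(rx);       // 3. read back the signals that changed
    from_haps <= rx;           // 4. resume VCS simulation with fresh values
  end
endmodule
```

Because the DPI calls block until the FPGA exchange completes, the VCS testbench never observes the HAPS part in an intermediate state, which is what gives the cycle-accurate synchronization described above.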
Figure 1 shows the data streams and compares the timing diagrams of VCS simulation and FPGA prototyping. The red cone marks the stretching-out of the time that passes in the FPGA between the moments when the VCS simulation is stopped. The UMR Bus interface achieves an exchange rate between the hardware and simulation parts of 400 MB/s. In the figure, the UMR clock frequency is 100 MHz and the UMR Bus is 32 bits wide, so serialization of the input data (interface unit CAPIM1) and deserialization of the output data (CAPIM3) of the bundle, which usually consists of several thousand signals, are required. Figure 1 presents an example of a "time frame" for a small bundle of 128 input and 128 output signals.
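A back-of-the-envelope check for that small bundle (assuming one 32-bit word per UMR clock and ignoring protocol overhead; note that 4 bytes × 100 MHz is consistent with the stated 400 MB/s):

```text
128 input signals   -> ceil(128/32) = 4 words to serialize   (CAPIM1)
128 output signals  -> ceil(128/32) = 4 words to deserialize (CAPIM3)
8 words × 10 ns (at 100 MHz) = 80 ns minimum bus time per simulated clock
```

For a realistic bundle of several thousand signals the exchange takes proportionally longer, which is why bundle width directly limits cosimulation speed.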
Figure 2 presents an example of the hardware-software bridge developed by Baikal on top of the UMR Bus infrastructure. Obviously, the more signals in the bundle between the hardware and simulation worlds, the slower the cosimulation, and the longer serialization and deserialization take.
Theoretically, the accuracy of the developed hybrid cosimulation method can reach the granularity of the VCS simulation itself: it is enough to add the corresponding signal to the sensitivity list of the bridge hardware, and the bundle signals will be sampled on it. In the simplest case, shown in Figure 1, the sensitivity list contains the clock of a single clock domain. The list can be expanded to two domains (see Figure 3).
In that case we add an additional condition for stopping the simulation and sampling the bundle signals: the edge of one more clock. Clearly, a two-domain exchange will noticeably slow cosimulation compared to a single-domain one. A prerequisite for efficient multi-domain cosimulation is the automatic gated-clock conversion option in Synopsys ProtoCompiler, the synthesis tool for HAPS projects.
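A minimal sketch of such a two-domain stop-and-sample condition on the VCS side might look as follows (DPI function names and signal names are illustrative assumptions, not the actual implementation):

```systemverilog
// Hypothetical two-domain variant: the simulation now stops, and the
// bundle is exchanged, on the rising edge of either domain's clock, so
// the HAPS side stays cycle-accurate in both domains. The extra stops
// are exactly why a two-domain exchange runs slower than one domain.
always @(posedge clk_domain_a or posedge clk_domain_b) begin
  umr_send_bundle(to_haps);  // ship inputs for whichever domain just clocked
  umr_pulse_clock();         // advance the HAPS side by one clock pulse
  umr_recv_bundle(rx);       // sample the bundle outputs
  from_haps <= rx;
end
```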
A separate task in hybrid cosimulation is selecting the parts of the DUT (Device Under Test) that are suitable for FPGA synthesis and transferring them to the HAPS project. They are not always at the same level of the module hierarchy. Here XMR (Cross-Module Reference) comes to the rescue: a Verilog mechanism for referring to signals inside modules at different levels of the hierarchy without having to "pull" them out through module ports.
Since XMR syntax is supported by both VCS and ProtoCompiler from Synopsys, we can easily split the DUT into VCS and HAPS parts without modifying the original RTL code. Figure 4 presents an example of XMR use, where the shaded parts of the DUT are transferred via XMR from different levels of the VCS project to a single level of the HAPS project, without manual modification of the original RTL code and with absolute logical equivalence preserved.
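As an illustration of the mechanism itself (the hierarchy, module, and signal names below are invented for the example), an XMR lets a bridge module probe and drive signals deep inside the hierarchy without adding ports anywhere along the path:

```systemverilog
// Hypothetical hierarchy: tb.dut.u_phy.u_train sits three levels deep.
// Instead of routing its signals up through the ports of every enclosing
// module, the VCS-side bridge references them hierarchically (XMR).
module xmr_bridge;
  // probe: read an internal signal by its full hierarchical name
  wire [31:0] train_state = tb.dut.u_phy.u_train.state;

  // drive: force an internal control signal from outside the module
  initial begin
    force tb.dut.u_phy.u_train.bypass_en = 1'b1;
  end
endmodule
```

The same hierarchical names, accepted by both VCS and ProtoCompiler, are what allow scattered sub-blocks to be stitched into one flat HAPS partition while the RTL source stays untouched.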
The described technology was applied for prototyping and early software development in the Baikal-M and Baikal-S projects; the speedup over RTL simulation averaged 2.5x. For example, simulating one software-debug iteration of the DDR4 training routine took 6 days instead of 2 weeks.
Figure 5 presents an extreme example of accelerating RTL simulation in the Baikal-M project, where the entire DUT was synthesized to FPGA. The delivery environment of the interconnect IP from ARM was taken, and the whole DUT (i.e., without splitting off synthesizable parts) was moved to HAPS, where it occupied two FPGA chips. On the VCS side, a CHI interface transactor emulated requests from an ARM Cortex-A57 cluster to the interconnect on the HAPS side.
The highest acceleration achieved over RTL simulation was 20x: a debug iteration took half an hour instead of half a day.
The described technology was presented at SNUG, the conference Synopsys regularly holds around the world, and received high marks from colleagues in the field of verification and prototyping.
Typically, hardware emulators such as Synopsys ZeBu® Server are used for cosimulation tasks, and these platforms are quite expensive. The engineers of Baikal Electronics instead used their HAPS-70/80 platforms in a non-standard scenario to solve the same problems, significantly reducing costs. Once again, Russia's Lefties have shod the flea.