November 10, 2009
The goal of multi-processor verification, simply put, is to find errors that emerge when individual CPUs interact with each other. This typically involves behaviors encountered when sharing various resources such as caches, registers and main memory. Multi-processor (MP) errors are among the most difficult to debug in all of functional verification, especially given the complexity modern MP systems. This document serves to outline Obsidian’s random testing methodology for verifying caches. Although written specifically for the ARM architecture, the underlying methodology of this paper may be applied generally to encompass other designs as well.
Don’t Begin Multi-Processor Testing Before You’re Ready
One of the core principles of Obsidian’s verification methodology is to catch every possible bug in the simplest possible configuration. That stands to reason that the single processor configuration needs to be thoroughly verified with a high degree of confidence before MP verification begins. The reason for this is simple – if the same bugs can be found in a simpler processor configuration, much less time will go to waste. Multiplied hundreds of times over, this savings in time can result in a time-to-market improvement of weeks or even months.
Beginning Multi-Processor Verification with Non-Sharing Tests
The first stage of MP testing that we implement is non-sharing. The goal at this stage is to find and correct errors that occur with the least amount of sharing as possible to simplify debugging and save time troubleshooting complex scenarios.
In the initial stages of MP verification, it’s best to divide the memory up into large partitions that are each allocated to a single processor core. Some sharing will always exist between the CPUs in spite of restrictions. This is due to the way that multiple CPUs behave upon reset (i.e. beginning the execution stream at the same reset address) and because complex MP systems typically require each CPU to be aware of their own CPU number (i.e. CPU0, CPU1, etc.).
Ideally, sharing should be minimized as much as possible at this stage of verification, although it is impossible to eliminate it altogether. If the architecture permits, it’s also beneficial to start with paging disabled to simplify debugging. This eliminates the possibility of errors where one physical address is mapped to multiple pages by more than one CPU.
With these restrictions imposed, large numbers of random test sequences are generated and debugged as needed. Once errors become extremely sparse in this mode of testing, the memory is divided up into cache lines which are each assigned to allow accesses by only a particular CPU. A more complex environment is then created in which more complex bugs can be found, such as paging errors. As before, copious amounts of random test sequences are generated and more errors will be found.
Some verification teams will completely forgo the large partition phase of false-sharing and start by immediately dividing the memory into cache lines. While this does work, it also tends to complicate debugging and forces the verification team being to prematurely deal with paging issues while debugging simple errors.
False-Sharing Multi-Processor Tests
The false-sharing arrangement is similar to the second stage of non-sharing in that the memory is divided up into cache lines. In this mode, multiple CPUs are allowed to hit the same cache lines, but not the exact same physical addresses. What this does is allow the cache accesses to come very close to each other without hitting the same spots. Several cache problems can be found and corrected using this method.
Debugging caches in this way greatly simplifies the process of testing actual shared memory and better limits the occurrence of simple errors in complex scenarios. At every stage of the process, it is important to perform cacheable and non-cacheable testing as there may be bugs lurking in both of these areas.
True-Sharing Multi-Processor Tests
True-sharing tests are more complex as they allow multiple CPUs to access the same physical memory addresses. In both deterministic and in non-deterministic modes, tests generated by RAVEN are guaranteed to execute correctly regardless of timing issues including fast or slow memory or insertion of external interrupts.
Deterministic True-Sharing
As a simplified example of deterministic true-sharing, let us assume that we are verifying a dual CPU configuration. We begin by dividing up the execution stream using semaphores. Each divided zone may contain different numbers of instructions for each CPU with varying amounts of memory accesses, exceptions, etc. Because of this, some CPUs may complete their zones more quickly than others. During the same semaphore, the CPUs are unaware of values written by the other. Not until the semaphore has been completed will the CPUs sync up again and be aware of the other’s transactions. Once this happens, CPU0 can determine that it is legal to write to a line that CPU1 completed in the last semaphore. Upon completion of all zones, final values are compared, checking for consistency and errors. Although our rules are much more complicated than this, the example should elaborate on the kind of errors that can occur in MP verification. Because our tool implements complex rules, the test generated can determine if memory location will be deterministic or non-deterministic based on execution divided by semaphores. This is how we do deterministic sharing and actual reads and writes to the same physical memory addresses. This can find a lot of bugs.
Non-Deterministic True-Sharing
Non-deterministic true-sharing is the most complex mode of sharing and should always be conducted as the final stage of MP verification. In this method, we don’t use semaphores to divide up the execution stream, but instead allow CPUs to read addresses that are being written by other CPUs. If two CPUs write to the same address, then a read of this address will result in an unknown value. If this value was intended for use in address generation or arithmetic calculation, then the result of these tasks will be unknown, and verification becomes much more difficult. Non-deterministic registers then cannot be used as sources, but only destinations until they take on deterministic values.
If non-deterministic transactions set flags, then the flag register must immediately be re-written. This is especially important in ARM processors where every instruction is conditionally executed. If something is done that results in a flag, the next instruction must be set to always execute regardless of flag values and perform the additional step of re-writing the flags. We have settings and rules that determine what percentage of registers are allowed to become non-deterministic based on their type (i.e FPR, GPR, etc.). So we do have some pretty complicated algorithms for handling determinism and non-determinism and determining legality of actions.
Non-deterministic tests become much more difficult to debug because you will have a list of possible values for each access and the errors can be timing dependent. These areas are important to cover, but they should be done last as the difficulty in debugging errors of this type is extremely high.
Verifying the Snooping Mechanism
One of the most critical areas in MP verification is cache coherence. Errors of this type can be detrimental to practical operation of the design. For example, it is possible to have errors in which data integrity is compromised due to multiple CPUs accessing shared memory.
To further illustrate an error of this type, let us assume that we are verifying two CPUs that both share a single L1 cache. At some point the L1 cache will have an image of data that resides in the main memory. If both CPUs write to the same cache line of the L1, then it is necessary to verify that the correct values get written back to main memory.
Adding an L2 memory to a situation such as this makes matters even more complicated. In this case, cache lines in the L1 will be an image taken the L2, which itself contains portions of data in the main memory. However, there is now the additional problem of L1 data being written back to main memory without crossing the L2. This creates a corresponding line of L2 cache that must now be invalidated to avoid errors.
Further conflicts can occur if a CPU flushes a cache line in the L1 that another CPU has written to, was going to write to, or is in the process of writing to. Fox example, suppose that we have a four CPU situation with two L1 caches, each being shared by a pair of CPUs and an L2 that is shared by all four CPUs. This is a common paradigm in MP designs. In this situation, it is physically possible for CPUs 0 /1 to write to the same line of memory as CPUs 2/3, creating another situation that has to be resolved. Each cluster of CPUs must then ask the other if for permission to read/write to main memory to avoid conflicts.
Verification of Shared Registers
One of the great things about having a random test generator is that it can effectively deal with obscure cases and propagate them across multiple CPUs. In the case of shared registers, we can divide them up into an arrangement similar to our true sharing example or allow only one CPU to handle all the shared registers. We can also use semaphores to do time/execution division of accesses and do non-deterministic testing, although this depends on the type of register. Although shared registers present some unique challenges, they also allow for some different opportunities to take advantage of.
These are all complicated issues which may be further compounded by timing issues. Errors such as these are what MP verification is really about and where the bulk of the errors can be found.