O3 CPU IEW

IEW: Issue/Execute/Writeback

GEM5 performs both execute and writeback when an instruction's execute() function is called. Therefore, GEM5 combines the Issue, Execute, and Writeback stages into a single stage called IEW. This stage handles dispatching instructions to the instruction queue, telling the instruction queue to issue instructions, and executing and writing back instructions.

The GEM5 documentation provides a nice description of the IEW stage. It also lists the functions that are mainly responsible for those three operations:

Rename::tick()->Rename::RenameInsts()
IEW::tick()->IEW::dispatchInsts()
IEW::tick()->InstructionQueue::scheduleReadyInsts()
IEW::tick()->IEW::executeInsts()
IEW::tick()->IEW::writebackInsts()

In this posting, I will explain dispatch, schedule, execute, and writeback in detail; the commit stage will be studied in another posting. As in the other stages, the tick function is the main body of the IEW stage's execution, so I will explain each part of the stage following the tick implementation. The dispatch step tries to dispatch renamed instructions to the LSQ/IQ (note that the rename stage has already checked the availability of the LSQ and IQ), and instructions are issued every cycle. The execute latency is tied to the issue latency so that the IQ can do back-to-back scheduling without having to speculatively schedule instructions. The IEW separates memory instructions from non-memory instructions by issuing them to different queues (LSQ or IQ). The writeback portion of IEW completes instructions, wakes up any dependents, and marks the destination register as ready on the scoreboard. With that information, the IQ can tell which instructions can be woken up and issued.

Dispatch

1502 template<class Impl>
1503 void
1504 DefaultIEW<Impl>::tick()
1505 {
1506     wbNumInst = 0;
1507     wbCycle = 0;
1508 
1509     wroteToTimeBuffer = false;
1510     updatedQueues = false;
1511 
1512     ldstQueue.tick();
1513 
1514     sortInsts();
1515 
1516     // Free function units marked as being freed this cycle.
1517     fuPool->processFreeUnits();
1518 
1519     list<ThreadID>::iterator threads = activeThreads->begin();
1520     list<ThreadID>::iterator end = activeThreads->end();
1521 
1522     // Check stall and squash signals, dispatch any instructions.
1523     while (threads != end) {
1524         ThreadID tid = *threads++;
1525 
1526         DPRINTF(IEW,"Issue: Processing [tid:%i]\n",tid);
1527 
1528         checkSignalsAndUpdate(tid);
1529         dispatch(tid);
1530     }

As shown in the tick function, after checking signals such as block and squash, the first job done by the IEW is dispatching the renamed instructions. The main goal of dispatch is inserting each renamed instruction into the IQ and LSQ based on the instruction's type.

Dispatch implementation

 911 template<class Impl>
 912 void
 913 DefaultIEW<Impl>::dispatch(ThreadID tid)
 914 {
 915     // If status is Running or idle,
 916     //     call dispatchInsts()
 917     // If status is Unblocking,
 918     //     buffer any instructions coming from rename
 919     //     continue trying to empty skid buffer
 920     //     check if stall conditions have passed
 921 
 922     if (dispatchStatus[tid] == Blocked) {
 923         ++iewBlockCycles;
 924 
 925     } else if (dispatchStatus[tid] == Squashing) {
 926         ++iewSquashCycles;
 927     }
 928 
 929     // Dispatch should try to dispatch as many instructions as its bandwidth
 930     // will allow, as long as it is not currently blocked.
 931     if (dispatchStatus[tid] == Running ||
 932         dispatchStatus[tid] == Idle) {
 933         DPRINTF(IEW, "[tid:%i] Not blocked, so attempting to run "
 934                 "dispatch.\n", tid);
 935 
 936         dispatchInsts(tid);
 937     } else if (dispatchStatus[tid] == Unblocking) {
 938         // Make sure that the skid buffer has something in it if the
 939         // status is unblocking.
 940         assert(!skidsEmpty());
 941 
 942         // If the status was unblocking, then instructions from the skid
 943         // buffer were used.  Remove those instructions and handle
 944         // the rest of unblocking.
 945         dispatchInsts(tid);
 946 
 947         ++iewUnblockCycles;
 948 
 949         if (validInstsFromRename()) {
 950             // Add the current inputs to the skid buffer so they can be
 951             // reprocessed when this stage unblocks.
 952             skidInsert(tid);
 953         }
 954 
 955         unblock(tid);
 956     }
 957 }

The dispatch function is just a wrapper around dispatchInsts. Based on the current status of the dispatch stage, additional operations are executed around the main dispatch function, dispatchInsts. Because dispatchInsts is fairly complex, I will explain it piece by piece.
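Before diving in, it helps to keep the per-thread status machine in mind. The following is a condensed sketch; the states mirror the dispatchStatus values used in the code above, while the transition summary is my paraphrase:

```cpp
// Condensed sketch of the dispatch-side status machine; states mirror the
// dispatchStatus[tid] values used in DefaultIEW::dispatch() above.
enum class DispatchStatus { Running, Idle, Blocked, Unblocking };

// Per-cycle handling, in brief:
//   Running / Idle -> dispatchInsts() straight from rename's output
//   Blocked        -> count a blocked cycle; incoming instructions pile up
//                     in the skid buffer until the stall clears
//   Unblocking     -> dispatchInsts() from the skid buffer; new arrivals
//                     keep going to the skid buffer until it drains
```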

Checking availability of resources to dispatch instruction

959 template <class Impl>
 960 void
 961 DefaultIEW<Impl>::dispatchInsts(ThreadID tid)
 962 {
 963     // Obtain instructions from skid buffer if unblocking, or queue from rename
 964     // otherwise.
 965     std::queue<DynInstPtr> &insts_to_dispatch =
 966         dispatchStatus[tid] == Unblocking ?
 967         skidBuffer[tid] : insts[tid];
 968 
 969     int insts_to_add = insts_to_dispatch.size();
 970 
 971     DynInstPtr inst;
 972     bool add_to_iq = false;
 973     int dis_num_inst = 0;
 974 
 975     // Loop through the instructions, putting them in the instruction
 976     // queue.
 977     for ( ; dis_num_inst < insts_to_add &&
 978               dis_num_inst < dispatchWidth;
 979           ++dis_num_inst)
 980     {
 981         inst = insts_to_dispatch.front();
 982 
 983         if (dispatchStatus[tid] == Unblocking) {
 984             DPRINTF(IEW, "[tid:%i] Issue: Examining instruction from skid "
 985                     "buffer\n", tid);
 986         }
 987 
 988         // Make sure there's a valid instruction there.
 989         assert(inst);
 990 
 991         DPRINTF(IEW, "[tid:%i] Issue: Adding PC %s [sn:%lli] [tid:%i] to "
 992                 "IQ.\n",
 993                 tid, inst->pcState(), inst->seqNum, inst->threadNumber);
 994 
 995         // Be sure to mark these instructions as ready so that the
 996         // commit stage can go ahead and execute them, and mark
 997         // them as issued so the IQ doesn't reprocess them.
 998 
 999         // Check for squashed instructions.
1000         if (inst->isSquashed()) {
1001             DPRINTF(IEW, "[tid:%i] Issue: Squashed instruction encountered, "
1002                     "not adding to IQ.\n", tid);
1003 
1004             ++iewDispSquashedInsts;
1005 
1006             insts_to_dispatch.pop();
1007 
1008             //Tell Rename That An Instruction has been processed
1009             if (inst->isLoad()) {
1010                 toRename->iewInfo[tid].dispatchedToLQ++;
1011             }
1012             if (inst->isStore() || inst->isAtomic()) {
1013                 toRename->iewInfo[tid].dispatchedToSQ++;
1014             }
1015 
1016             toRename->iewInfo[tid].dispatched++;
1017    
1018             continue;
1019         }
1020  
1021         // Check for full conditions.
1022         if (instQueue.isFull(tid)) {
1023             DPRINTF(IEW, "[tid:%i] Issue: IQ has become full.\n", tid);
1024    
1025             // Call function to start blocking.
1026             block(tid);
1027    
1028             // Set unblock to false. Special case where we are using
1029             // skidbuffer (unblocking) instructions but then we still
1030             // get full in the IQ.
1031             toRename->iewUnblock[tid] = false;
1032    
1033             ++iewIQFullEvents;
1034             break;
1035         }
1036    
1037         // Check LSQ if inst is LD/ST
1038         if ((inst->isAtomic() && ldstQueue.sqFull(tid)) ||
1039             (inst->isLoad() && ldstQueue.lqFull(tid)) ||
1040             (inst->isStore() && ldstQueue.sqFull(tid))) {
1041             DPRINTF(IEW, "[tid:%i] Issue: %s has become full.\n",tid,
1042                     inst->isLoad() ? "LQ" : "SQ");
1043    
1044             // Call function to start blocking.
1045             block(tid);
1046    
1047             // Set unblock to false. Special case where we are using
1048             // skidbuffer (unblocking) instructions but then we still
1049             // get full in the IQ.
1050             toRename->iewUnblock[tid] = false;
1051 
1052             ++iewLSQFullEvents;
1053             break;
1054         }

First, it checks whether the current instruction has already been squashed. If so, it ignores the current instruction and jumps to the next one. If the instruction is not squashed, it checks the availability of the resources required for issuing it. Regardless of the instruction type, it requires one entry in the instruction queue. Also, if it is a memory instruction, it requires one entry in the load queue or store queue, depending on whether it is a load or a store.

Checking instruction type

1056         // Otherwise issue the instruction just fine.
1057         if (inst->isAtomic()) {
1058             DPRINTF(IEW, "[tid:%i] Issue: Memory instruction "
1059                     "encountered, adding to LSQ.\n", tid);
1060 
1061             ldstQueue.insertStore(inst);
1062 
1063             ++iewDispStoreInsts;
1064 
1065             // AMOs need to be set as "canCommit()"
1066             // so that commit can process them when they reach the
1067             // head of commit.
1068             inst->setCanCommit();
1069             instQueue.insertNonSpec(inst);
1070             add_to_iq = false;
1071 
1072             ++iewDispNonSpecInsts;
1073 
1074             toRename->iewInfo[tid].dispatchedToSQ++;
1075         } else if (inst->isLoad()) {
1076             DPRINTF(IEW, "[tid:%i] Issue: Memory instruction "
1077                     "encountered, adding to LSQ.\n", tid);
1078 
1079             // Reserve a spot in the load store queue for this
1080             // memory access.
1081             ldstQueue.insertLoad(inst);
1082 
1083             ++iewDispLoadInsts;
1084 
1085             add_to_iq = true;
1086 
1087             toRename->iewInfo[tid].dispatchedToLQ++;
1088         } else if (inst->isStore()) {
1089             DPRINTF(IEW, "[tid:%i] Issue: Memory instruction "
1090                     "encountered, adding to LSQ.\n", tid);
1091 
1092             ldstQueue.insertStore(inst);
1093 
1094             ++iewDispStoreInsts;
1095 
1096             if (inst->isStoreConditional()) {
1097                 // Store conditionals need to be set as "canCommit()"
1098                 // so that commit can process them when they reach the
1099                 // head of commit.
1100                 // @todo: This is somewhat specific to Alpha.
1101                 inst->setCanCommit();
1102                 instQueue.insertNonSpec(inst);
1103                 add_to_iq = false;
1104 
1105                 ++iewDispNonSpecInsts;
1106             } else {
1107                 add_to_iq = true;
1108             }
1109 
1110             toRename->iewInfo[tid].dispatchedToSQ++;
1111         } else if (inst->isMemBarrier() || inst->isWriteBarrier()) {
1112             // Same as non-speculative stores.
1113             inst->setCanCommit();
1114             instQueue.insertBarrier(inst);
1115             add_to_iq = false;
1116         } else if (inst->isNop()) {
1117             DPRINTF(IEW, "[tid:%i] Issue: Nop instruction encountered, "
1118                     "skipping.\n", tid);
1119 
1120             inst->setIssued();
1121             inst->setExecuted();
1122             inst->setCanCommit();
1123 
1124             instQueue.recordProducer(inst);
1125 
1126             iewExecutedNop[tid]++;
1127 
1128             add_to_iq = false;
1129         } else {
1130             assert(!inst->isExecuted());
1131             add_to_iq = true;
1132         }

Although the details will not be clear until we understand the internals of the instQueue and ldstQueue, the above code routes each instruction based on its type. For example, a load is pushed to the ldstQueue with the insertLoad function, and a store is inserted into the same queue through the insertStore function. Normal instructions are simply enqueued to the instQueue.

Issuing instruction

1134         if (add_to_iq && inst->isNonSpeculative()) {
1135             DPRINTF(IEW, "[tid:%i] Issue: Nonspeculative instruction "
1136                     "encountered, skipping.\n", tid);
1137 
1138             // Same as non-speculative stores.
1139             inst->setCanCommit();
1140 
1141             // Specifically insert it as nonspeculative.
1142             instQueue.insertNonSpec(inst);
1143 
1144             ++iewDispNonSpecInsts;
1145 
1146             add_to_iq = false;
1147         }
1148 
1149         // If the instruction queue is not full, then add the
1150         // instruction.
1151         if (add_to_iq) {
1152             instQueue.insert(inst);
1153         }
1154 
1155         insts_to_dispatch.pop();
1156 
1157         toRename->iewInfo[tid].dispatched++;
1158 
1159         ++iewDispatchedInsts;
1160 
1161 #if TRACING_ON
1162         inst->dispatchTick = curTick() - inst->fetchTick;
1163 #endif
1164         ppDispatch->notify(inst);
1165     }

After each instruction is handled by inserting it into the corresponding queue with the associated method, some instructions should also be inserted into the instruction queue. Note that the add_to_iq flag is set based on the instruction type; when this flag is set, the instruction is added to the instQueue (lines 1151-1153).

End of the dispatching

1167     if (!insts_to_dispatch.empty()) {
1168         DPRINTF(IEW,"[tid:%i] Issue: Bandwidth Full. Blocking.\n", tid);
1169         block(tid);
1170         toRename->iewUnblock[tid] = false;
1171     }
1172 
1173     if (dispatchStatus[tid] == Idle && dis_num_inst) {
1174         dispatchStatus[tid] = Running;
1175 
1176         updatedQueues = true;
1177     }
1178 
1179     dis_num_inst = 0;
1180 }

After dispatching the renamed instructions, it checks whether instructions still remain in the queue. When an instruction cannot be processed further because of throttling, the stage blocks and handles the rest of the instructions in the next cycle.

Instruction Queue and Load/Store Queue

Before moving on to the next stage, I'd like to cover some parts of the IQ and LSQ.

The instruction queue keeps several lists of instructions

The main job of the queue is managing instructions and providing interfaces to process the enqueued instructions.

gem5/src/cpu/o3/inst_queue.hh

311     //////////////////////////////////////
312     // Instruction lists, ready queues, and ordering
313     //////////////////////////////////////
314 
315     /** List of all the instructions in the IQ (some of which may be issued). */
316     std::list<DynInstPtr> instList[Impl::MaxThreads];
317 
318     /** List of instructions that are ready to be executed. */
319     std::list<DynInstPtr> instsToExecute;
320 
321     /** List of instructions waiting for their DTB translation to
322      *  complete (hw page table walk in progress).
323      */
324     std::list<DynInstPtr> deferredMemInsts;
325 
326     /** List of instructions that have been cache blocked. */
327     std::list<DynInstPtr> blockedMemInsts;
328 
329     /** List of instructions that were cache blocked, but a retry has been seen
330      * since, so they can now be retried. May fail again go on the blocked list.
331      */
332     std::list<DynInstPtr> retryMemInsts;

Insert new entries to the instruction queue

The insert function is a canonical example of those interfaces. It inserts a new entry into the instruction list managed by the instruction queue.

 578 template <class Impl>
 579 void
 580 InstructionQueue<Impl>::insert(const DynInstPtr &new_inst)
 581 {
 582     if (new_inst->isFloating()) {
 583         fpInstQueueWrites++;
 584     } else if (new_inst->isVector()) {
 585         vecInstQueueWrites++;
 586     } else {
 587         intInstQueueWrites++;
 588     }
 589     // Make sure the instruction is valid
 590     assert(new_inst);
 591 
 592     DPRINTF(IQ, "Adding instruction [sn:%llu] PC %s to the IQ.\n",
 593             new_inst->seqNum, new_inst->pcState());
 594 
 595     assert(freeEntries != 0);
 596 
 597     instList[new_inst->threadNumber].push_back(new_inst);
 598 
 599     --freeEntries;
 600 
 601     new_inst->setInIQ();
 602 
 603     // Look through its source registers (physical regs), and mark any
 604     // dependencies.
 605     addToDependents(new_inst);
 606 
 607     // Have this instruction set itself as the producer of its destination
 608     // register(s).
 609     addToProducers(new_inst);
 610 
 611     if (new_inst->isMemRef()) {
 612         memDepUnit[new_inst->threadNumber].insert(new_inst);
 613     } else {
 614         addIfReady(new_inst);
 615     }
 616 
 617     ++iqInstsAdded;
 618 
 619     count[new_inst->threadNumber]++;
 620 
 621     assert(freeEntries == (numEntries - countInsts()));
 622 }

Inserting the instruction into the list is done by a simple push_back operation. However, insert also invokes two important functions: addToProducers and addToDependents. These two functions record the producer and consumer dependencies among instructions' operands, i.e., registers. When one instruction waits until a specific register's value becomes ready (a consumer), that must be tracked by some hardware component. Also, when the data becomes ready as a result of one instruction's execution (a producer), it should be forwarded to the consumers waiting for the value. For that purpose, GEM5 utilizes the DependencyGraph. After recording dependencies for the unavailable registers, if the instruction references memory during its execution, it is specially handled by the memory dependency unit. The details will be explained together with the DependencyGraph later.
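Since the DependencyGraph details are deferred, here is a simplified mental model of what addToDependents and addToProducers maintain. This is only a sketch with hypothetical names (RegEntry, markProducer, and so on); gem5's actual structure keeps a linked list of dependents per physical register:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Inst;  // stand-in for gem5's DynInstPtr

// One slot per physical register.
struct RegEntry {
    Inst *producer = nullptr;       // instruction that will write this reg
    std::vector<Inst *> consumers;  // instructions waiting on this reg
    bool ready = true;              // value already in the register file?
};

struct DepGraphSketch {
    std::vector<RegEntry> regs;

    explicit DepGraphSketch(std::size_t num_phys_regs) : regs(num_phys_regs) {}

    // addToDependents(): for each source register that is not ready yet,
    // record the instruction as a consumer. Returns true if it must wait.
    bool addConsumer(Inst *inst, std::size_t src_reg) {
        if (regs[src_reg].ready)
            return false;
        regs[src_reg].consumers.push_back(inst);
        return true;
    }

    // addToProducers(): a destination register is not ready until the
    // producing instruction writes back.
    void markProducer(Inst *inst, std::size_t dst_reg) {
        regs[dst_reg].producer = inst;
        regs[dst_reg].ready = false;
    }

    // Writeback: mark the register ready and hand back the consumers that
    // can now be woken up (and possibly moved to the ready lists).
    std::vector<Inst *> complete(std::size_t dst_reg) {
        regs[dst_reg].ready = true;
        regs[dst_reg].producer = nullptr;
        return std::move(regs[dst_reg].consumers);
    }
};
```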

1450 template <class Impl>
1451 void
1452 InstructionQueue<Impl>::addIfReady(const DynInstPtr &inst)
1453 {
1454     // If the instruction now has all of its source registers
1455     // available, then add it to the list of ready instructions.
1456     if (inst->readyToIssue()) {
1457 
1458         //Add the instruction to the proper ready list.
1459         if (inst->isMemRef()) {
1460 
1461             DPRINTF(IQ, "Checking if memory instruction can issue.\n");
1462 
1463             // Message to the mem dependence unit that this instruction has
1464             // its registers ready.
1465             memDepUnit[inst->threadNumber].regsReady(inst);
1466 
1467             return;
1468         }
1469 
1470         OpClass op_class = inst->opClass();
1471 
1472         DPRINTF(IQ, "Instruction is ready to issue, putting it onto "
1473                 "the ready list, PC %s opclass:%i [sn:%llu].\n",
1474                 inst->pcState(), op_class, inst->seqNum);
1475 
1476         readyInsts[op_class].push(inst);
1477 
1478         // Will need to reorder the list if either a queue is not on the list,
1479         // or it has an older instruction than last time.
1480         if (!queueOnList[op_class]) {
1481             addToOrderList(op_class);
1482         } else if (readyInsts[op_class].top()->seqNum  <
1483                    (*readyIt[op_class]).oldestInst) {
1484             listOrder.erase(readyIt[op_class]);
1485             addToOrderList(op_class);
1486         }
1487     }
1488 }

At the end of the insert function, addIfReady adds the instruction to the readyInsts buffer if all of its source registers are available (line 1476). If the instruction is not ready, meaning some source registers are not yet available, it is not enqueued to the readyInsts buffer. Instructions waiting for a source register will be added to the readyInsts buffer when the instructions they depend on complete.
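Conceptually, readyInsts is one age-ordered queue per operation class. A minimal sketch of that idea (the types here are illustrative, not gem5's declarations):

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct InstSketch {
    std::uint64_t seqNum;  // program order: smaller == older
};

// Oldest instruction at the top, so each op class issues in age order.
struct OlderFirst {
    bool operator()(const InstSketch *a, const InstSketch *b) const {
        return a->seqNum > b->seqNum;
    }
};

using ReadyQueue =
    std::priority_queue<InstSketch *, std::vector<InstSketch *>, OlderFirst>;

// One ReadyQueue per OpClass (IntAlu, MemRead, FloatAdd, ...). On top of
// this, the code above keeps listOrder: the op classes sorted by the seqNum
// of their oldest entry, so scheduling can also walk classes in age order.
// That is what the queueOnList / addToOrderList bookkeeping at lines
// 1480-1486 maintains.
```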

Execute

To understand what happens after dispatching the instructions, let's go back to the tick function of the IEW stage.

1532     if (exeStatus != Squashing) {
1533         executeInsts();
1534 
1535         writebackInsts();
1536 
1537         // Have the instruction queue try to schedule any ready instructions.
1538         // (In actuality, this scheduling is for instructions that will
1539         // be executed next cycle.)
1540         instQueue.scheduleReadyInsts();
1541 
1542         // Also should advance its own time buffers if the stage ran.
1543         // Not the best place for it, but this works (hopefully).
1544         issueToExecQueue.advance();
1545     }

If the execute stage is not in the squashing state, it executes the instructions stored in the instQueue, particularly in the readyInsts queue. Here the execute() function of a compute instruction is invoked and the instruction is sent to commit. Note that execute() already writes the result to the destination registers; writebackInsts, called after executeInsts, marks those registers as ready on the scoreboard and wakes up any dependents, so instructions depending on the just-executed one can be added to the ready list for scheduling.

executeInsts

1205 template <class Impl>
1206 void
1207 DefaultIEW<Impl>::executeInsts()
1208 {
1209     wbNumInst = 0;
1210     wbCycle = 0;
1211 
1212     list<ThreadID>::iterator threads = activeThreads->begin();
1213     list<ThreadID>::iterator end = activeThreads->end();
1214 
1215     while (threads != end) {
1216         ThreadID tid = *threads++;
1217         fetchRedirect[tid] = false;
1218     }
1219 
1220     // Uncomment this if you want to see all available instructions.
1221     // @todo This doesn't actually work anymore, we should fix it.
1222     // printAvailableInsts();
1223 
1224     // Execute/writeback any instructions that are available.
1225     int insts_to_execute = fromIssue->size;
1226     int inst_num = 0;
1227     for (; inst_num < insts_to_execute;
1228           ++inst_num) {
1229 
1230         DPRINTF(IEW, "Execute: Executing instructions from IQ.\n");
1231 
1232         DynInstPtr inst = instQueue.getInstToExecute();
1233 
1234         DPRINTF(IEW, "Execute: Processing PC %s, [tid:%i] [sn:%llu].\n",
1235                 inst->pcState(), inst->threadNumber,inst->seqNum);
1236 
1237         // Notify potential listeners that this instruction has started
1238         // executing
1239         ppExecute->notify(inst);
1240 
1241         // Check if the instruction is squashed; if so then skip it
1242         if (inst->isSquashed()) {
1243             DPRINTF(IEW, "Execute: Instruction was squashed. PC: %s, [tid:%i]"
1244                          " [sn:%llu]\n", inst->pcState(), inst->threadNumber,
1245                          inst->seqNum);
1246 
1247             // Consider this instruction executed so that commit can go
1248             // ahead and retire the instruction.
1249             inst->setExecuted();
1250 
1251             // Not sure if I should set this here or just let commit try to
1252             // commit any squashed instructions.  I like the latter a bit more.
1253             inst->setCanCommit();
1254 
1255             ++iewExecSquashedInsts;
1256 
1257             continue;
1258         }

The executeInsts function executes as many instructions as it can afford, implemented as the loop at line 1227 and after. First it retrieves an instruction that can be executed by invoking the getInstToExecute function of the instQueue. Once an instruction is retrieved, it checks whether the instruction has been squashed. Although squashed instructions are not really executed, they are treated as executed so that they can be committed. After this check, the instruction is processed differently depending on its type.

Execute memory instruction

1259 
1260         Fault fault = NoFault;
1261 
1262         // Execute instruction.
1263         // Note that if the instruction faults, it will be handled
1264         // at the commit stage.
1265         if (inst->isMemRef()) {
1266             DPRINTF(IEW, "Execute: Calculating address for memory "
1267                     "reference.\n");
1268 
1269             // Tell the LDSTQ to execute this instruction (if it is a load).
1270             if (inst->isAtomic()) {
1271                 // AMOs are treated like store requests
1272                 fault = ldstQueue.executeStore(inst);
1273 
1274                 if (inst->isTranslationDelayed() &&
1275                     fault == NoFault) {
1276                     // A hw page table walk is currently going on; the
1277                     // instruction must be deferred.
1278                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1279                             "store.\n");
1280                     instQueue.deferMemInst(inst);
1281                     continue;
1282                 }
1283             } else if (inst->isLoad()) {
1284                 // Loads will mark themselves as executed, and their writeback
1285                 // event adds the instruction to the queue to commit
1286                 fault = ldstQueue.executeLoad(inst);
1287 
1288                 if (inst->isTranslationDelayed() &&
1289                     fault == NoFault) {
1290                     // A hw page table walk is currently going on; the
1291                     // instruction must be deferred.
1292                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1293                             "load.\n");
1294                     instQueue.deferMemInst(inst);
1295                     continue;
1296                 }
1297 
1298                 if (inst->isDataPrefetch() || inst->isInstPrefetch()) {
1299                     inst->fault = NoFault;
1300                 }
1301             } else if (inst->isStore()) {
1302                 fault = ldstQueue.executeStore(inst);
1303 
1304                 if (inst->isTranslationDelayed() &&
1305                     fault == NoFault) {
1306                     // A hw page table walk is currently going on; the
1307                     // instruction must be deferred.
1308                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1309                             "store.\n");
1310                     instQueue.deferMemInst(inst);
1311                     continue;
1312                 }
1313 
1314                 // If the store had a fault then it may not have a mem req
1315                 if (fault != NoFault || !inst->readPredicate() ||
1316                     !inst->isStoreConditional()) {
1317                     // If the instruction faulted, then we need to send it along
1318                     // to commit without the instruction completing.
1319                     // Send this instruction to commit, also make sure iew stage
1320                     // realizes there is activity.
1321                     inst->setExecuted();
1322                     instToCommit(inst);
1323                     activityThisCycle();
1324                 }
1325 
1326                 // Store conditionals will mark themselves as
1327                 // executed, and their writeback event will add the
1328                 // instruction to the queue to commit.
1329             } else {
1330                 panic("Unexpected memory type!\n");
1331             }
1332 
1333         } else {

A memory operation can be one of three instruction types: atomic, load, or store. The load/store queue is basically in charge of executing memory instructions, but based on the type of memory operation, it needs to handle each instruction differently. Let's take a look at how load and store instructions are processed.

Execute load instruction
1283             } else if (inst->isLoad()) {
1284                 // Loads will mark themselves as executed, and their writeback
1285                 // event adds the instruction to the queue to commit
1286                 fault = ldstQueue.executeLoad(inst);
1287
1288                 if (inst->isTranslationDelayed() &&
1289                     fault == NoFault) {
1290                     // A hw page table walk is currently going on; the
1291                     // instruction must be deferred.
1292                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1293                             "load.\n");
1294                     instQueue.deferMemInst(inst);
1295                     continue;
1296                 }
1297
1298                 if (inst->isDataPrefetch() || inst->isInstPrefetch()) {
1299                     inst->fault = NoFault;
1300                 }

The main execution of the load instruction is done by the executeLoad function of the ldstQueue. After the execution, it needs to check whether the translation is the bottleneck preventing progress on the load. Note that when the virtual-to-physical address resolution is delayed because of a long TLB latency, the load should be executed at a later clock cycle when the TLB is ready. Therefore, when the instruction cannot be executed at this moment, the current load instruction is marked as deferred (deferMemInst). Also, when the load operation is just a prefetch, any fault generated by it is ignored (lines 1298-1299). Let's take a look at our important function, executeLoad, in detail!

gem5/src/cpu/o3/lsq_impl.hh

 251 template<class Impl>
 252 Fault
 253 LSQ<Impl>::executeLoad(const DynInstPtr &inst)
 254 {
 255     ThreadID tid = inst->threadNumber;
 256
 257     return thread[tid].executeLoad(inst);
 258 }

gem5/src/cpu/o3/lsq.hh

  63 template <class Impl>
  64 class LSQ
  65 
  66 {
......
1104     /** Total Size of LQ Entries. */
1105     unsigned LQEntries;
1106     /** Total Size of SQ Entries. */
1107     unsigned SQEntries;
1108 
1109     /** Max LQ Size - Used to Enforce Sharing Policies. */
1110     unsigned maxLQEntries;
1111 
1112     /** Max SQ Size - Used to Enforce Sharing Policies. */
1113     unsigned maxSQEntries;
1114 
1115     /** Data port. */
1116     DcachePort dcachePort;
1117 
1118     /** The LSQ units for individual threads. */
1119     std::vector<LSQUnit> thread;
1120 
1121     /** Number of Threads. */
1122     ThreadID numThreads;
1123 };

gem5/src/cpu/o3/lsq_unit_impl.hh

 558 template <class Impl>
 559 Fault
 560 LSQUnit<Impl>::executeLoad(const DynInstPtr &inst)
 561 {  
 562     using namespace TheISA;
 563     // Execute a specific load.
 564     Fault load_fault = NoFault;
 565    
 566     DPRINTF(LSQUnit, "Executing load PC %s, [sn:%lli]\n",
 567             inst->pcState(), inst->seqNum);
 568    
 569     assert(!inst->isSquashed());
 570    
 571     load_fault = inst->initiateAcc();
 572 
 573     if (load_fault == NoFault && !inst->readMemAccPredicate()) {
 574         assert(inst->readPredicate());
 575         inst->setExecuted();
 576         inst->completeAcc(nullptr);
 577         iewStage->instToCommit(inst);
 578         iewStage->activityThisCycle();
 579         return NoFault;
 580     }
 581        
 582     if (inst->isTranslationDelayed() && load_fault == NoFault)
 583         return load_fault;
 584            
 585     if (load_fault != NoFault && inst->translationCompleted() &&
 586         inst->savedReq->isPartialFault() && !inst->savedReq->isComplete()) {
 587         assert(inst->savedReq->isSplit());
 588         // If we have a partial fault where the mem access is not complete yet
 589         // then the cache must have been blocked. This load will be re-executed
 590         // when the cache gets unblocked. We will handle the fault when the
 591         // mem access is complete.
 592         return NoFault;
 593     }  
 594        
 595     // If the instruction faulted or predicated false, then we need to send it
 596     // along to commit without the instruction completing.
 597     if (load_fault != NoFault || !inst->readPredicate()) {
 598         // Send this instruction to commit, also make sure iew stage
 599         // realizes there is activity.  Mark it as executed unless it
 600         // is a strictly ordered load that needs to hit the head of
 601         // commit.
 602         if (!inst->readPredicate())
 603             inst->forwardOldRegs();
 604         DPRINTF(LSQUnit, "Load [sn:%lli] not executed from %s\n",
 605                 inst->seqNum,
 606                 (load_fault != NoFault ? "fault" : "predication"));
 607         if (!(inst->hasRequest() && inst->strictlyOrdered()) ||
 608             inst->isAtCommit()) {
 609             inst->setExecuted();
 610         }
 611         iewStage->instToCommit(inst);
 612         iewStage->activityThisCycle();
 613     } else {
 614         if (inst->effAddrValid()) {
 615             auto it = inst->lqIt;
 616             ++it;
 617 
 618             if (checkLoads)
 619                 return checkViolations(it, inst);
 620         }
 621     }
 622 
 623     return load_fault;
 624 }

initiateAcc: handling TLB request

I already covered the initiateAcc of memory instructions before. However, compared to the simple processors, the O3 CPU processes initiateAcc in a different way.

147 template <class Impl>
148 Fault
149 BaseO3DynInst<Impl>::initiateAcc()
150 {    
151     // @todo: Pretty convoluted way to avoid squashing from happening
152     // when using the TC during an instruction's execution
153     // (specifically for instructions that have side-effects that use
154     // the TC).  Fix this.
155     bool no_squash_from_TC = this->thread->noSquashFromTC;
156     this->thread->noSquashFromTC = true;
157 
158     this->fault = this->staticInst->initiateAcc(this, this->traceData);
159 
160     this->thread->noSquashFromTC = no_squash_from_TC;
161 
162     return this->fault;
163 }    

Because the staticInst stored in the dynamic instruction is the class object of a specific micro-operation, it invokes the initiateAcc function of that micro load/store operation. For the memory read case, it invokes the initiateMemRead function on the architecture side, which ends up invoking the initiateMemRead function on the CPU side.
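To make the call chain concrete, the generated code for a load microop has roughly the following shape. This is only a sketch, not the actual generated code: computeEffectiveAddress() is a hypothetical stand-in for the inlined address computation, dataSize/memFlags stand for the microop's encoded access size and flags, and the gem5 types (Fault, ExecContext, and so on) are assumed:

```cpp
// Rough shape of a generated load microop's initiateAcc() (sketch only).
Fault
LdSketch::initiateAcc(ExecContext *xc, Trace::InstRecord *traceData) const
{
    // Compute the effective (virtual) address from the source registers.
    // computeEffectiveAddress() is a hypothetical stand-in.
    Addr EA = computeEffectiveAddress(xc);

    // Hand the access off to the CPU in timing mode; for the O3 CPU this
    // reaches BaseDynInst::initiateMemRead() -> cpu->pushRequest() -> LSQ.
    return initiateMemRead(xc, traceData, EA, dataSize, memFlags);
}
```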

 42 namespace X86ISA
 43 {
 44 
 45 /// Initiate a read from memory in timing mode.
 46 static Fault
 47 initiateMemRead(ExecContext *xc, Trace::InstRecord *traceData, Addr addr,
 48                 unsigned dataSize, Request::Flags flags)
 49 {
 50     return xc->initiateMemRead(addr, dataSize, flags);
 51 }
 962 template<class Impl>
 963 Fault
 964 BaseDynInst<Impl>::initiateMemRead(Addr addr, unsigned size,
 965                                    Request::Flags flags,
 966                                    const std::vector<bool>& byte_enable)
 967 {
 968     assert(byte_enable.empty() || byte_enable.size() == size);
 969     return cpu->pushRequest(
 970             dynamic_cast<typename DynInstPtr::PtrType>(this),
 971             /* ld */ true, nullptr, size, addr, flags, nullptr, nullptr,
 972             byte_enable);
 973 }

Because an instruction of the O3 CPU is an instance of BaseO3DynInst, which inherits from BaseDynInst, when the instruction implementation invokes initiateMemRead (called through the instruction's initiateAcc implementation), it invokes the corresponding method implemented in the BaseDynInst class.

pushRequest

713     /** CPU pushRequest function, forwards request to LSQ. */
714     Fault pushRequest(const DynInstPtr& inst, bool isLoad, uint8_t *data,
715                       unsigned int size, Addr addr, Request::Flags flags,
716                       uint64_t *res, AtomicOpFunctorPtr amo_op = nullptr,
717                       const std::vector<bool>& byte_enable =
718                           std::vector<bool>())
719 
720     {
721         return iew.ldstQueue.pushRequest(inst, isLoad, data, size, addr,
722                 flags, res, std::move(amo_op), byte_enable);
723     }

Instead of directly handling the load operation, initiateMemRead pushes the request to the load queue through the pushRequest function. This design may seem odd because the initiateAcc function was invoked by the LSQ in the first place, and the instruction forwards the request to the load/store queue once again; it could have been implemented as a simple function that handles the request directly without going through multiple units. In any case, initiateMemRead invokes the CPU-side pushRequest, which ends up invoking the pushRequest of the LSQ.

 693 template<class Impl>
 694 Fault
 695 LSQ<Impl>::pushRequest(const DynInstPtr& inst, bool isLoad, uint8_t *data,
 696                        unsigned int size, Addr addr, Request::Flags flags,
 697                        uint64_t *res, AtomicOpFunctorPtr amo_op,
 698                        const std::vector<bool>& byte_enable)
 699 {
 700     // This comming request can be either load, store or atomic.
 701     // Atomic request has a corresponding pointer to its atomic memory
 702     // operation
 703     bool isAtomic M5_VAR_USED = !isLoad && amo_op;
 704 
 705     ThreadID tid = cpu->contextToThread(inst->contextId());
 706     auto cacheLineSize = cpu->cacheLineSize();
 707     bool needs_burst = transferNeedsBurst(addr, size, cacheLineSize);
 708     LSQRequest* req = nullptr;
 709 
 710     // Atomic requests that access data across cache line boundary are
 711     // currently not allowed since the cache does not guarantee corresponding
 712     // atomic memory operations to be executed atomically across a cache line.
 713     // For ISAs such as x86 that supports cross-cache-line atomic instructions,
 714     // the cache needs to be modified to perform atomic update to both cache
 715     // lines. For now, such cross-line update is not supported.
 716     assert(!isAtomic || (isAtomic && !needs_burst));
 717 
 718     if (inst->translationStarted()) {
 719         req = inst->savedReq;
 720         assert(req);
 721     } else {
 722         if (needs_burst) {
 723             req = new SplitDataRequest(&thread[tid], inst, isLoad, addr,
 724                     size, flags, data, res);
 725         } else {
 726             req = new SingleDataRequest(&thread[tid], inst, isLoad, addr,
 727                     size, flags, data, res, std::move(amo_op));
 728         }
 729         assert(req);
 730         if (!byte_enable.empty()) {
 731             req->_byteEnable = byte_enable;
 732         }
 733         inst->setRequest();
 734         req->taskId(cpu->taskId());
 735 
 736         // There might be fault from a previous execution attempt if this is
 737         // a strictly ordered load
 738         inst->getFault() = NoFault;
 739 
 740         req->initiateTranslation();
 741     }
 742 
 743     /* This is the place were instructions get the effAddr. */
 744     if (req->isTranslationComplete()) {
 745         if (req->isMemAccessRequired()) {
 746             inst->effAddr = req->getVaddr();
 747             inst->effSize = size;
 748             inst->effAddrValid(true);
 749 
 750             if (cpu->checker) {
 751                 inst->reqToVerify = std::make_shared<Request>(*req->request());
 752             }
 753             Fault fault;
 754             if (isLoad)
 755                 fault = cpu->read(req, inst->lqIdx);
 756             else
 757                 fault = cpu->write(req, data, inst->sqIdx);
 758             // inst->getFault() may have the first-fault of a
 759             // multi-access split request at this point.
 760             // Overwrite that only if we got another type of fault
 761             // (e.g. re-exec).
 762             if (fault != NoFault)
 763                 inst->getFault() = fault;
 764         } else if (isLoad) {
 765             inst->setMemAccPredicate(false);
 766             // Commit will have to clean up whatever happened.  Set this
 767             // instruction as executed.
 768             inst->setExecuted();
 769         }
 770     }
 771 
 772     if (inst->traceData)
 773         inst->traceData->setMem(addr, size, flags);
 774 
 775     return inst->getFault();
 776 }

The dynamic instruction can track whether it has already started its TLB translation by checking a flag stored in the instruction, accessible through the translationStarted interface. When the instruction sets that flag, it means the instruction has already started the TLB access and is waiting for the response. In the delayed-TLB-response case, the instruction stores the request information in its instruction object, so it can retrieve the request previously sent to the TLB. However, if this is the first execution attempt, a new request must be generated. As shown in lines 722-728, if the request accesses two separate cache blocks, a SplitDataRequest object is generated; if it accesses only one block, a SingleDataRequest object is generated instead. After the request has been produced, the proper flag of the instruction object is set to indicate that the instruction initiated the TLB access (line 733). After that, the initiateTranslation function provided by the request object is invoked to actually generate accesses to the TLBs.
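As an aside, the cross-cache-line check that selects between the two request types boils down to a simple offset computation. Here is a self-contained sketch that is equivalent in spirit to the transferNeedsBurst helper used above:

```cpp
#include <cstdint>

using Addr = std::uint64_t;

// Does an access of `size` bytes starting at `addr` touch two cache lines?
// Equivalent in spirit to the transferNeedsBurst() check used above.
static inline bool
needsBurstSketch(Addr addr, unsigned size, unsigned cache_line_size)
{
    // Offset of the first byte within its line plus the access size: if it
    // runs past the end of the line, the access spans two lines.
    return (addr % cache_line_size) + size > cache_line_size;
}

// Example: with 64-byte lines, addr = 0x3c, size = 8 gives 60 + 8 = 68 > 64,
// so the LSQ would build a SplitDataRequest for it.
```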

 860 template<class Impl>
 861 void
 862 LSQ<Impl>::SingleDataRequest::initiateTranslation()
 863 {
 864     assert(_requests.size() == 0);
 865 
 866     this->addRequest(_addr, _size, _byteEnable);
 867 
 868     if (_requests.size() > 0) {
 869         _requests.back()->setReqInstSeqNum(_inst->seqNum);
 870         _requests.back()->taskId(_taskId);
 871         _inst->translationStarted(true);
 872         setState(State::Translation);
 873         flags.set(Flag::TranslationStarted);
 874 
 875         _inst->savedReq = this;
 876         sendFragmentToTranslation(0);
 877     } else {
 878         _inst->setMemAccPredicate(false);
 879     }
 880 }

The addRequest function just generates the packet(s) that need to be sent to the TLB unit. Although the current object can be interpreted as a request that could be sent directly to the TLB unit, it is really a wrapper for all the interfaces and data structures required to resolve the TLB access. For example, it includes the port connected to the TLB unit so that the generated request and its response can be communicated through that port. In any case, addRequest just generates the real packet understandable by the TLB unit.

 407         void
 408         addRequest(Addr addr, unsigned size,
 409                    const std::vector<bool>& byte_enable)
 410         {
 411             if (byte_enable.empty() ||
 412                 isAnyActiveElement(byte_enable.begin(), byte_enable.end())) {
 413                 auto request = std::make_shared<Request>(_inst->getASID(),
 414                         addr, size, _flags, _inst->masterId(),
 415                         _inst->instAddr(), _inst->contextId(),
 416                         std::move(_amo_op));
 417                 if (!byte_enable.empty()) {
 418                     request->setByteEnable(byte_enable);
 419                 }
 420             _requests.push_back(request);
 421             }
 422         }

The addRequest function of the LSQRequest class just generates the request and saves it in the _requests vector to be sent later. After the request packets are generated, initiateTranslation invokes sendFragmentToTranslation to send the generated packet(s) to the TLB.

 980 template<class Impl>
 981 void
 982 LSQ<Impl>::LSQRequest::sendFragmentToTranslation(int i)
 983 {
 984     numInTranslationFragments++;
 985     _port.dTLB()->translateTiming(
 986             this->request(i),
 987             this->_inst->thread->getTC(), this,
 988             this->isLoad() ? BaseTLB::Read : BaseTLB::Write);
 989 }

Remember that the SingleDataRequest has only one request packet, so it has only one entry in the _requests vector. This function sends the request stored in the _requests vector to the TLB; the argument is used to index the entries of the _requests vector. You can see that it invokes the translateTiming of the dTLB connected to the LSQ. The details of the TLB's translateTiming function were explained in a previous posting. Also, note that it passes this as the translation object parameter, because the translation object is used to invoke the finish function when the TLB access is resolved.
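Passing this works because the LSQ request class implements the TLB's translation callback interface. Paraphrased, the interface has roughly the following shape (see BaseTLB::Translation in gem5's generic TLB code; the gem5 types Fault, RequestPtr, ThreadContext, and BaseTLB::Mode are assumed):

```cpp
// Paraphrased shape of the translation callback interface. LSQRequest
// implements it, which is why `this` can be handed to translateTiming()
// and later receive the finish() call.
class TranslationSketch
{
  public:
    virtual ~TranslationSketch() {}

    // Signal that the translation could not complete immediately
    // (e.g., a hardware page table walk is in progress).
    virtual void markDelayed() = 0;

    // Called by the TLB once the translation is resolved; `req` now
    // carries the physical address, and `fault` reports any problem.
    virtual void finish(const Fault &fault, const RequestPtr &req,
                        ThreadContext *tc, BaseTLB::Mode mode) = 0;
};
```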

Response of LSQ for the TLB resolution

 778 template<class Impl>
 779 void
 780 LSQ<Impl>::SingleDataRequest::finish(const Fault &fault, const RequestPtr &req,
 781         ThreadContext* tc, BaseTLB::Mode mode)
 782 {
 783     _fault.push_back(fault);
 784     numInTranslationFragments = 0;
 785     numTranslatedFragments = 1;
 786     /* If the instruction has been squahsed, let the request know
 787      * as it may have to self-destruct. */
 788     if (_inst->isSquashed()) {
 789         this->squashTranslation();
 790     } else {
 791         _inst->strictlyOrdered(req->isStrictlyOrdered());
 792 
 793         flags.set(Flag::TranslationFinished);
 794         if (fault == NoFault) {
 795             _inst->physEffAddr = req->getPaddr();
 796             _inst->memReqFlags = req->getFlags();
 797             if (req->isCondSwap()) {
 798                 assert(_res);
 799                 req->setExtraData(*_res);
 800             }
 801             setState(State::Request);
 802         } else {
 803             setState(State::Fault);
 804         }
 805 
 806         LSQRequest::_inst->fault = fault;
 807         LSQRequest::_inst->translationCompleted(true);
 808     }
 809 }

When the translation is completed, the finish function provided by the request generated by the LSQ is invoked at the end of the translation. As shown in the above code, it first checks whether the instruction was squashed while the TLB processed the request. If it has not been squashed, it sets the flags indicating that the translation is complete for this instruction. Note that it sets various fields of the instruction that initiated the TLB request (_inst in lines 790-808). One of the most important fields changed by the finish function is the _state field of the request. This field indicates the current status of the TLB request and is changed with the setState function. Recall how the simple CPU starts the memory access after the TLB is resolved: it initiates the memory operation at the end of the finish function. However, the O3 CPU does not invoke any function to generate the actual memory request when the TLB is resolved. Then when and where does the O3 CPU, particularly the LSQ, initiate the memory operation? The answer is in pushRequest!

 693 template<class Impl>
 694 Fault
 695 LSQ<Impl>::pushRequest(const DynInstPtr& inst, bool isLoad, uint8_t *data,
 696                        unsigned int size, Addr addr, Request::Flags flags,
 697                        uint64_t *res, AtomicOpFunctorPtr amo_op,
 698                        const std::vector<bool>& byte_enable)
 699 {
 ......
 743     /* This is the place were instructions get the effAddr. */
 744     if (req->isTranslationComplete()) {
 745         if (req->isMemAccessRequired()) {
 746             inst->effAddr = req->getVaddr();
 747             inst->effSize = size;
 748             inst->effAddrValid(true);
 749
 750             if (cpu->checker) {
 751                 inst->reqToVerify = std::make_shared<Request>(*req->request());
 752             }
 753             Fault fault;
 754             if (isLoad)
 755                 fault = cpu->read(req, inst->lqIdx);
 756             else
 757                 fault = cpu->write(req, data, inst->sqIdx);
 758             // inst->getFault() may have the first-fault of a
 759             // multi-access split request at this point.
 760             // Overwrite that only if we got another type of fault
 761             // (e.g. re-exec).
 762             if (fault != NoFault)
 763                 inst->getFault() = fault;
 764         } else if (isLoad) {
 765             inst->setMemAccPredicate(false);
 766             // Commit will have to clean up whatever happened.  Set this
 767             // instruction as executed.
 768             inst->setExecuted();
 769         }
 770     }
 771
 772     if (inst->traceData)
 773         inst->traceData->setMem(addr, size, flags);
 774
 775     return inst->getFault();
 776 }

It first checks whether the TLB translation has finished by invoking isTranslationComplete.

 586         bool
 587         isInTranslation()
 588         {
 589             return _state == State::Translation;
 590         }
 591 
 592         bool
 593         isTranslationComplete()
 594         {
 595             return flags.isSet(Flag::TranslationStarted) &&
 596                    !isInTranslation();
 597         }

You might remember that the _state field was changed when the finish function of the TLB request was invoked. Therefore, if the TLB request has already been resolved, the isTranslationComplete function returns true, and then the actual memory read or write operation is made based on the instruction type. Because the request req now carries the physical address translated from the virtual address, it is passed to the operation: the memory operation must target the physical address, not the virtual address. Since we are currently dealing with the read operation, let's take a look at how the O3 CPU accesses the real memory.
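Putting the pieces together, the lifecycle behind these two checks can be condensed into a small state machine. The state names follow the LSQ request code shown above, but the helper itself is a runnable paraphrase, not the real implementation:

```cpp
// Condensed, runnable paraphrase of the request lifecycle behind
// isTranslationComplete(); state names follow the LSQ request code above.
enum class State { NotIssued, Translation, Request, Fault };

struct LSQRequestSketch
{
    State state = State::NotIssued;
    bool translationStarted = false;

    // pushRequest() -> initiateTranslation(): enter the Translation state.
    void initiateTranslation()
    {
        translationStarted = true;
        state = State::Translation;
    }

    // TLB callback: finish() moves the request to Request (paddr valid)
    // or Fault, which is what makes isTranslationComplete() flip to true.
    void finish(bool faulted)
    {
        state = faulted ? State::Fault : State::Request;
    }

    bool isTranslationComplete() const
    {
        return translationStarted && state != State::Translation;
    }
};
```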

CPU->read->LSQ::read->LSQUnit::read

725     /** CPU read function, forwards read to LSQ. */
726     Fault read(LSQRequest* req, int load_idx)
727     {
728         return this->iew.ldstQueue.read(req, load_idx);
729     }
1125 template <class Impl>
1126 Fault
1127 LSQ<Impl>::read(LSQRequest* req, int load_idx)
1128 {
1129     ThreadID tid = cpu->contextToThread(req->request()->contextId());
1130 
1131     return thread.at(tid).read(req, load_idx);
1132 }

The read function handles four different kinds of memory loads: LLSC (locked load/store), memory-mapped IPR (memory-mapped registers), store forwarding, and the plain memory load. I will cover the plain memory load, which tries to access the data from the cache and memory; the store-forwarding case will be handled in another posting.

621 LSQUnit<Impl>::read(LSQRequest *req, int load_idx)
622 {
623     LQEntry& load_req = loadQueue[load_idx];
624     const DynInstPtr& load_inst = load_req.instruction();
625 
626     load_req.setRequest(req);
627     assert(load_inst);
628 
629     assert(!load_inst->isExecuted());
630 
631     // Make sure this isn't a strictly ordered load
632     // A bit of a hackish way to get strictly ordered accesses to work
633     // only if they're at the head of the LSQ and are ready to commit
634     // (at the head of the ROB too).
635 
636     if (req->mainRequest()->isStrictlyOrdered() &&
637         (load_idx != loadQueue.head() || !load_inst->isAtCommit())) {
638         // Tell IQ/mem dep unit that this instruction will need to be
639         // rescheduled eventually
640         iewStage->rescheduleMemInst(load_inst);
641         load_inst->clearIssued();
642         load_inst->effAddrValid(false);
643         ++lsqRescheduledLoads;
644         DPRINTF(LSQUnit, "Strictly ordered load [sn:%lli] PC %s\n",
645                 load_inst->seqNum, load_inst->pcState());
646 
647         // Must delete request now that it wasn't handed off to
648         // memory.  This is quite ugly.  @todo: Figure out the proper
649         // place to really handle request deletes.
650         load_req.setRequest(nullptr);
651         req->discard();
652         return std::make_shared<GenericISA::M5PanicFault>(
653             "Strictly ordered load [sn:%llx] PC %s\n",
654             load_inst->seqNum, load_inst->pcState());
655     }
656 
657     DPRINTF(LSQUnit, "Read called, load idx: %i, store idx: %i, "
658             "storeHead: %i addr: %#x%s\n",
659             load_idx - 1, load_inst->sqIt._idx, storeQueue.head() - 1,
660             req->mainRequest()->getPaddr(), req->isSplit() ? " split" : "");
661 
662     if (req->mainRequest()->isLLSC()) {
663         // Disable recording the result temporarily.  Writing to misc
664         // regs normally updates the result, but this is not the
665         // desired behavior when handling store conditionals.
666         load_inst->recordResult(false);
667         TheISA::handleLockedRead(load_inst.get(), req->mainRequest());
668         load_inst->recordResult(true);
669     }
670 
671     if (req->mainRequest()->isMmappedIpr()) {
672         assert(!load_inst->memData);
673         load_inst->memData = new uint8_t[MaxDataBytes];
674 
675         ThreadContext *thread = cpu->tcBase(lsqID);
676         PacketPtr main_pkt = new Packet(req->mainRequest(), MemCmd::ReadReq);
677 
678         main_pkt->dataStatic(load_inst->memData);
679 
680         Cycles delay = req->handleIprRead(thread, main_pkt);
681 
682         WritebackEvent *wb = new WritebackEvent(load_inst, main_pkt, this);
683         cpu->schedule(wb, cpu->clockEdge(delay));
684         return NoFault;
685     }
686 
687     // Check the SQ for any previous stores that might lead to forwarding
......
840     // If there's no forwarding case, then go access memory
841     DPRINTF(LSQUnit, "Doing memory access for inst [sn:%lli] PC %s\n",
842             load_inst->seqNum, load_inst->pcState());
843 
844     // Allocate memory if this is the first time a load is issued.
845     if (!load_inst->memData) {
846         load_inst->memData = new uint8_t[req->mainRequest()->getSize()];
847     }
848 
849     // For now, load throughput is constrained by the number of
850     // load FUs only, and loads do not consume a cache port (only
851     // stores do).
852     // @todo We should account for cache port contention
853     // and arbitrate between loads and stores.
854 
855     // if we the cache is not blocked, do cache access
856     if (req->senderState() == nullptr) {
857         LQSenderState *state = new LQSenderState(
858                 loadQueue.getIterator(load_idx));
859         state->isLoad = true;
860         state->inst = load_inst;
861         state->isSplit = req->isSplit();
862         req->senderState(state);
863     }
864     req->buildPackets();
865     req->sendPacketToCache();
866     if (!req->isSent())
867         iewStage->blockMemInst(load_inst);
868 
869     return NoFault;
870 }

Execute store instruction

Execute non-memory instruction

1333         } else {
1334             // If the instruction has already faulted, then skip executing it.
1335             // Such case can happen when it faulted during ITLB translation.
1336             // If we execute the instruction (even if it's a nop) the fault
1337             // will be replaced and we will lose it.
1338             if (inst->getFault() == NoFault) {
1339                 inst->execute();
1340                 if (!inst->readPredicate())
1341                     inst->forwardOldRegs();
1342             }
1343 
1344             inst->setExecuted();
1345 
1346             instToCommit(inst);
1347         }
1348 
1349         updateExeInstStats(inst);

1351         // Check if branch prediction was correct, if not then we need
1352         // to tell commit to squash in flight instructions.  Only
1353         // handle this if there hasn't already been something that
1354         // redirects fetch in this group of instructions.
1355 
1356         // This probably needs to prioritize the redirects if a different
1357         // scheduler is used.  Currently the scheduler schedules the oldest
1358         // instruction first, so the branch resolution order will be correct.
1359         ThreadID tid = inst->threadNumber;
1360 
1361         if (!fetchRedirect[tid] ||
1362             !toCommit->squash[tid] ||
1363             toCommit->squashedSeqNum[tid] > inst->seqNum) {
1364 
1365             // Prevent testing for misprediction on load instructions,
1366             // that have not been executed.
1367             bool loadNotExecuted = !inst->isExecuted() && inst->isLoad();
1368 
1369             if (inst->mispredicted() && !loadNotExecuted) {
1370                 fetchRedirect[tid] = true;
1371 
1372                 DPRINTF(IEW, "[tid:%i] [sn:%llu] Execute: "
1373                         "Branch mispredict detected.\n",
1374                         tid,inst->seqNum);
1375                 DPRINTF(IEW, "[tid:%i] [sn:%llu] "
1376                         "Predicted target was PC: %s\n",
1377                         tid,inst->seqNum,inst->readPredTarg());
1378                 DPRINTF(IEW, "[tid:%i] [sn:%llu] Execute: "
1379                         "Redirecting fetch to PC: %s\n",
1380                         tid,inst->seqNum,inst->pcState());
1381                 // If incorrect, then signal the ROB that it must be squashed.
1382                 squashDueToBranch(inst, tid);
1383 
1384                 ppMispredict->notify(inst);
1385 
1386                 if (inst->readPredTaken()) {
1387                     predictedTakenIncorrect++;
1388                 } else {
1389                     predictedNotTakenIncorrect++;
1390                 }
1391             } else if (ldstQueue.violation(tid)) {
1392                 assert(inst->isMemRef());
1393                 // If there was an ordering violation, then get the
1394                 // DynInst that caused the violation.  Note that this
1395                 // clears the violation signal.
1396                 DynInstPtr violator;
1397                 violator = ldstQueue.getMemDepViolator(tid);
1398 
1399                 DPRINTF(IEW, "LDSTQ detected a violation. Violator PC: %s "
1400                         "[sn:%lli], inst PC: %s [sn:%lli]. Addr is: %#x.\n",
1401                         violator->pcState(), violator->seqNum,
1402                         inst->pcState(), inst->seqNum, inst->physEffAddr);
1403 
1404                 fetchRedirect[tid] = true;
1405 
1406                 // Tell the instruction queue that a violation has occured.
1407                 instQueue.violation(inst, violator);
1408 
1409                 // Squash.
1410                 squashDueToMemOrder(violator, tid);
1411 
1412                 ++memOrderViolationEvents;
1413             }
1414         } else {
1415             // Reset any state associated with redirects that will not
1416             // be used.
1417             if (ldstQueue.violation(tid)) {
1418                 assert(inst->isMemRef());
1419 
1420                 DynInstPtr violator = ldstQueue.getMemDepViolator(tid);
1421 
1422                 DPRINTF(IEW, "LDSTQ detected a violation. Violator PC: "
1423                         "%s, inst PC: %s. Addr is: %#x.\n",
1424                         violator->pcState(), inst->pcState(),
1425                         inst->physEffAddr);
1426                 DPRINTF(IEW, "Violation will not be handled because "
1427                         "already squashing\n");
1428 
1429                 ++memOrderViolationEvents;
1430             }
1431         }
1432     }
1433 
1434     // Update and record activity if we processed any instructions.
1435     if (inst_num) {
1436         if (exeStatus == Idle) {
1437             exeStatus = Running;
1438         }
1439 
1440         updatedQueues = true;
1441 
1442         cpu->activityThisCycle();
1443     }
1444 
1445     // Need to reset this in case a writeback event needs to write into the
1446     // iew queue. That way the writeback event will write into the correct
1447     // spot in the queue.
1448     wbNumInst = 0;
1449 
1450 }

Schedule

Schedule (InstructionQueue::scheduleReadyInsts()): the IQ manages the ready instructions (those whose operands are ready) in a ready list and schedules them to an available FU. The latency of the FU is set here, and the instruction is sent to execution when the FU finishes.
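To make that concrete, here is a small self-contained sketch of the scheduling idea: pop ready instructions oldest-first, claim a matching FU from a pool, and record the operation latency after which the result becomes available. All names and numbers are illustrative; the real logic lives in InstructionQueue::scheduleReadyInsts() and FUPool:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <queue>
#include <utility>
#include <vector>

enum class OpClass { IntAlu, MemRead, FloatAdd };

struct Inst {
    std::uint64_t seqNum;  // program order: smaller == older
    OpClass opClass;
};

// Toy FU pool: a few units and a fixed op latency per class.
struct FUPoolSketch {
    std::map<OpClass, int> freeUnits{{OpClass::IntAlu, 2},
                                     {OpClass::MemRead, 1},
                                     {OpClass::FloatAdd, 1}};
    std::map<OpClass, int> opLatency{{OpClass::IntAlu, 1},
                                     {OpClass::MemRead, 2},
                                     {OpClass::FloatAdd, 4}};

    bool claim(OpClass c) {
        if (freeUnits[c] == 0)
            return false;
        --freeUnits[c];
        return true;
    }
};

// Oldest (smallest seqNum) first.
struct OlderFirst {
    bool operator()(const Inst &a, const Inst &b) const {
        return a.seqNum > b.seqNum;
    }
};

int main()
{
    std::priority_queue<Inst, std::vector<Inst>, OlderFirst> ready;
    ready.push({12, OpClass::MemRead});
    ready.push({10, OpClass::IntAlu});
    ready.push({11, OpClass::IntAlu});

    FUPoolSketch fuPool;
    std::vector<std::pair<Inst, int>> issued;  // instruction + FU latency
    std::vector<Inst> stalled;                 // retried next cycle

    const std::size_t issueWidth = 4;
    while (issued.size() < issueWidth && !ready.empty()) {
        Inst inst = ready.top();
        ready.pop();
        if (!fuPool.claim(inst.opClass)) {
            stalled.push_back(inst);           // no free FU this cycle
            continue;
        }
        // The instruction completes (and wakes dependents) after
        // opLatency cycles on its functional unit.
        issued.emplace_back(inst, fuPool.opLatency[inst.opClass]);
    }
    return 0;
}
```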

This post is licensed under CC BY 4.0 by the author.
