
Gem5 Memaccess

In my previous post, I discussed the automatic generation of C++ classes for macroops and microops using various GEM5 tools, including a Python-based parser and string-based template substitution. I also provided an example, explaining how a class definition and its constructor for micro-load instructions are generated. Additionally, I presented several definitions that implement the actual semantics of micro-load operations, including the execute function.

Instructions are designed to change the internal state of the system: executing them induces specific changes in registers, memory, or other internal state represented as architectural elements. Given that GEM5 operates as an architecture-level emulator, the execution of a single micro-op should result in the alteration of a particular data structure representing a piece of the architecture. To achieve this, GEM5 provides the “ExecContext” class, which models the underlying architectural state, and the “execute” method and the other definitions of a micro-operation are designed to modify the “ExecContext” as a consequence of their execution. In essence, these definitions emulate the semantics of the instructions. We will explore how executing a micro-op modifies the underlying architectural state by updating the “ExecContext”. To comprehend how GEM5 executes micro-operations, we will briefly examine the pipeline of a simple processor.

CPU pipeline of the simple processor: fetch-decode-execute

To understand how GEM5 emulates the entire architecture, one crucial question to address is: when and how does GEM5 execute the next instruction? In other words, we must understand who uses the automatically generated micro-op classes and their functions. Each CPU model features a distinct pipeline architecture, and this difference significantly influences how instructions are executed within the pipeline. To shed light on this, we will examine the TimingSimpleCPU model, the most basic CPU pipeline model supported by GEM5.

Here are the key characteristics of the TimingSimpleCPU model in GEM5:

  • Single-Cycle Execution: It operates on a single-cycle execution model, where each instruction is executed in one clock cycle.

  • Minimal Microarchitecture: It lacks the complexity of multiple pipeline stages, making it relatively simple and easy to understand.

  • Idealized Timing: It does not account for detailed pipeline timing, such as hazards or stalls, and assumes that instructions progress through the pipeline without delays.

Processor invokes fetch when the scheduled fetch event fires

To understand how the TimingSimpleCPU processes events, we should first understand which function is invoked when a scheduled event fires. Scheduling a specific event requires an EventFunctionWrapper instance, which carries the event handler function.

gem5/src/cpu/simple/timing.hh

 51 class TimingSimpleCPU : public BaseSimpleCPU
 52 {
......
325   private:
326 
327     EventFunctionWrapper fetchEvent;

As shown in the above class declaration of TimingSimpleCPU, the fetchEvent member field is declared as an EventFunctionWrapper. To use this wrapper to schedule events, proper initialization code is required.

gem5/src/cpu/simple/timing.cc

  82 TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams *p)
  83     : BaseSimpleCPU(p), fetchTranslation(this), icachePort(this),
  84       dcachePort(this), ifetch_pkt(NULL), dcache_pkt(NULL), previousCycle(0),
  85       fetchEvent([this]{ fetch(); }, name())
  86 {
  87     _status = Idle;
  88 }

As shown in the constructor of TimingSimpleCPU, it initializes the fetchEvent member field with a lambda that calls fetch. Therefore, whenever fetchEvent is scheduled and fires, GEM5 invokes the fetch function and starts fetching the next instruction from memory (or the cache).
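For reference, the fetch loop is kicked off by scheduling this event; a minimal sketch of how that would look inside the CPU (using gem5's schedule and clockEdge helpers, not the verbatim gem5 code) is:

    // Sketch: start fetching by scheduling fetchEvent at the next clock edge.
    // When the event fires, the lambda above runs and calls fetch().
    if (!fetchEvent.scheduled())
        schedule(fetchEvent, clockEdge(Cycles(0)));

In gem5 this happens, for example, when a thread context is activated.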

fetch: retrieving the next instruction to execute from memory

gem5/src/cpu/simple/timing.cc

 653 void
 654 TimingSimpleCPU::fetch()
 655 {
 656     // Change thread if multi-threaded
 657     swapActiveThread();
 658
 659     SimpleExecContext &t_info = *threadInfo[curThread];
 660     SimpleThread* thread = t_info.thread;
 661
 662     DPRINTF(SimpleCPU, "Fetch\n");
 663
 664     if (!curStaticInst || !curStaticInst->isDelayedCommit()) {
 665         checkForInterrupts();
 666         checkPcEventQueue();
 667     }
 668
 669     // We must have just got suspended by a PC event
 670     if (_status == Idle)
 671         return;
 672
 673     TheISA::PCState pcState = thread->pcState();
 674     bool needToFetch = !isRomMicroPC(pcState.microPC()) &&
 675                        !curMacroStaticInst;
 676
 677     if (needToFetch) {
 678         _status = BaseSimpleCPU::Running;
 679         RequestPtr ifetch_req = std::make_shared<Request>();
 680         ifetch_req->taskId(taskId());
 681         ifetch_req->setContext(thread->contextId());
 682         setupFetchRequest(ifetch_req);
 683         DPRINTF(SimpleCPU, "Translating address %#x\n", ifetch_req->getVaddr());
 684         thread->itb->translateTiming(ifetch_req, thread->getTC(),
 685                 &fetchTranslation, BaseTLB::Execute);
 686     } else {
 687         _status = IcacheWaitResponse;
 688         completeIfetch(NULL);
 689
 690         updateCycleCounts();
 691         updateCycleCounters(BaseCPU::CPU_STATE_ON);
 692     }
 693 }

decode

The processor can decode a fetched memory block as an instruction only after it has been brought in from the cache or memory. Because the timing simple CPU assumes a memory access takes more than a single cycle, it needs to be notified when the requested memory block has arrived at the processor.

 874 void
 875 TimingSimpleCPU::IcachePort::ITickEvent::process()
 876 {
 877     cpu->completeIfetch(pkt);
 878 }
 879 
 880 bool
 881 TimingSimpleCPU::IcachePort::recvTimingResp(PacketPtr pkt)
 882 {
 883     DPRINTF(SimpleCPU, "Received fetch response %#x\n", pkt->getAddr());
 884     // we should only ever see one response per cycle since we only
 885     // issue a new request once this response is sunk
 886     assert(!tickEvent.scheduled());
 887     // delay processing of returned data until next CPU clock edge
 888     tickEvent.schedule(pkt, cpu->clockEdge());
 889 
 890     return true;
 891 }

Because the processor is connected to the memory subsystem through ports, the port must be programmed to invoke a function that can handle the fetched instruction, completeIfetch. When the IcachePort receives a response from the memory subsystem, it schedules an event carrying the received packet. Because that event is scheduled to fire at the very next clock edge, it ends up invoking the completeIfetch function of the TimingSimpleCPU.
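The tickEvent used here is a small helper event that remembers the packet and schedules itself on the CPU's event queue; its schedule method in timing.cc looks roughly like this (abridged sketch):

    void
    TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
    {
        pkt = _pkt;              // keep the response packet for process()
        cpu->schedule(this, t);  // fire at tick t, i.e. the next clock edge
    }

When that event fires, ITickEvent::process() runs and hands the packet to completeIfetch.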

 775 void
 776 TimingSimpleCPU::completeIfetch(PacketPtr pkt)
 777 {
 778     SimpleExecContext& t_info = *threadInfo[curThread];
 779
 780     DPRINTF(SimpleCPU, "Complete ICache Fetch for addr %#x\n", pkt ?
 781             pkt->getAddr() : 0);
 782
 783     // received a response from the icache: execute the received
 784     // instruction
 785     assert(!pkt || !pkt->isError());
 786     assert(_status == IcacheWaitResponse);
 787
 788     _status = BaseSimpleCPU::Running;
 789
 790     updateCycleCounts();
 791     updateCycleCounters(BaseCPU::CPU_STATE_ON);
 792
 793     if (pkt)
 794         pkt->req->setAccessLatency();
 795
 796
 797     preExecute();
 798     if (curStaticInst && curStaticInst->isMemRef()) {
 799         // load or store: just send to dcache
 800         Fault fault = curStaticInst->initiateAcc(&t_info, traceData);
 801
 802         // If we're not running now the instruction will complete in a dcache
 803         // response callback or the instruction faulted and has started an
 804         // ifetch
 805         if (_status == BaseSimpleCPU::Running) {
 806             if (fault != NoFault && traceData) {
 807                 // If there was a fault, we shouldn't trace this instruction.
 808                 delete traceData;
 809                 traceData = NULL;
 810             }
 811
 812             postExecute();
 813             // @todo remove me after debugging with legion done
 814             if (curStaticInst && (!curStaticInst->isMicroop() ||
 815                         curStaticInst->isFirstMicroop()))
 816                 instCnt++;
 817             advanceInst(fault);
 818         }
 819     } else if (curStaticInst) {
 820         // non-memory instruction: execute completely now
 821         Fault fault = curStaticInst->execute(&t_info, traceData);
 822
 823         // keep an instruction count
 824         if (fault == NoFault)
 825             countInst();
 826         else if (traceData && !DTRACE(ExecFaulting)) {
 827             delete traceData;
 828             traceData = NULL;
 829         }
 830
 831         postExecute();
 832         // @todo remove me after debugging with legion done
 833         if (curStaticInst && (!curStaticInst->isMicroop() ||
 834                 curStaticInst->isFirstMicroop()))
 835             instCnt++;
 836         advanceInst(fault);
 837     } else {
 838         advanceInst(NoFault);
 839     }
 840
 841     if (pkt) {
 842         delete pkt;
 843     }
 844 }

completeIfetch consists of four parts: preExecute, instruction execution, postExecute, and advanceInst. Although we cannot see any decoding logic here, it makes use of curStaticInst to execute the fetched instruction. So who decodes the fetched packet and generates curStaticInst? That is the preExecute function, which decodes the instruction and, for control instructions, also consults the branch predictor.
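Condensed (comments only, not compilable code), the control flow of completeIfetch above is:

    // preExecute();                        // decode -> curStaticInst
    // if (curStaticInst->isMemRef())
    //     curStaticInst->initiateAcc(...); // load/store: start the memory access
    // else
    //     curStaticInst->execute(...);     // ALU/branch: finish right now
    // postExecute();                       // statistics and tracing
    // advanceInst(fault);                  // move on to the next instruction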

gem5/src/cpu/simple/base.cc

481 void
482 BaseSimpleCPU::preExecute()
483 {
484     SimpleExecContext &t_info = *threadInfo[curThread];
485     SimpleThread* thread = t_info.thread;
486
487     // maintain $r0 semantics
488     thread->setIntReg(ZeroReg, 0);
489 #if THE_ISA == ALPHA_ISA
490     thread->setFloatReg(ZeroReg, 0);
491 #endif // ALPHA_ISA
492
493     // resets predicates
494     t_info.setPredicate(true);
495     t_info.setMemAccPredicate(true);
496
497     // check for instruction-count-based events
498     thread->comInstEventQueue.serviceEvents(t_info.numInst);
499
500     // decode the instruction
501     TheISA::PCState pcState = thread->pcState();
502
503     if (isRomMicroPC(pcState.microPC())) {
504         t_info.stayAtPC = false;
505         curStaticInst = microcodeRom.fetchMicroop(pcState.microPC(),
506                                                   curMacroStaticInst);
507     } else if (!curMacroStaticInst) {
508         //We're not in the middle of a macro instruction
509         StaticInstPtr instPtr = NULL;
510
511         TheISA::Decoder *decoder = &(thread->decoder);
512
513         //Predecode, ie bundle up an ExtMachInst
514         //If more fetch data is needed, pass it in.
515         Addr fetchPC = (pcState.instAddr() & PCMask) + t_info.fetchOffset;
516         //if (decoder->needMoreBytes())
517             decoder->moreBytes(pcState, fetchPC, inst);
518         //else
519         //    decoder->process();
520
521         //Decode an instruction if one is ready. Otherwise, we'll have to
522         //fetch beyond the MachInst at the current pc.
523         instPtr = decoder->decode(pcState);
524         if (instPtr) {
525             t_info.stayAtPC = false;
526             thread->pcState(pcState);
527         } else {
528             t_info.stayAtPC = true;
529             t_info.fetchOffset += sizeof(MachInst);
530         }
531
532         //If we decoded an instruction and it's microcoded, start pulling
533         //out micro ops
534         if (instPtr && instPtr->isMacroop()) {
535             curMacroStaticInst = instPtr;
536             curStaticInst =
537                 curMacroStaticInst->fetchMicroop(pcState.microPC());
538         } else {
539             curStaticInst = instPtr;
540         }
541     } else {
542         //Read the next micro op from the macro op
543         curStaticInst = curMacroStaticInst->fetchMicroop(pcState.microPC());
544     }
545
546     //If we decoded an instruction this "tick", record information about it.
547     if (curStaticInst) {
548 #if TRACING_ON
549         traceData = tracer->getInstRecord(curTick(), thread->getTC(),
550                 curStaticInst, thread->pcState(), curMacroStaticInst);
551
552         DPRINTF(Decode,"Decode: Decoded %s instruction: %#x\n",
553                 curStaticInst->getName(), curStaticInst->machInst);
554 #endif // TRACING_ON
555     }
556
557     if (branchPred && curStaticInst &&
558         curStaticInst->isControl()) {
559         // Use a fake sequence number since we only have one
560         // instruction in flight at the same time.
561         const InstSeqNum cur_sn(0);
562         t_info.predPC = thread->pcState();
563         const bool predict_taken(
564             branchPred->predict(curStaticInst, cur_sn, t_info.predPC,
565                                 curThread));
566
567         if (predict_taken)
568             ++t_info.numPredictedBranches;
569     }
570 }
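To summarize the three decode paths in preExecute (comments only, not compilable code):

    // if (isRomMicroPC(microPC))     -> fetch the microop from microcodeRom
    // else if (!curMacroStaticInst)  -> feed bytes to the ISA decoder; if the
    //                                   decoded instruction is a macroop, keep
    //                                   it and fetch its first microop
    // else                           -> keep pulling microops out of the
    //                                   current macroop at the current microPC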

execute: modify ExecContext based on instruction

gem5/build/X86/arch/x86/generated/exec-ns.cc.inc

19101     Fault Ld::execute(ExecContext *xc,
19102           Trace::InstRecord *traceData) const
19103     {
19104         Fault fault = NoFault;
19105         Addr EA;
19106
19107         uint64_t Index = 0;
19108 uint64_t Base = 0;
19109 uint64_t Data = 0;
19110 uint64_t SegBase = 0;
19111 uint64_t Mem;
19112 ;
19113         Index = xc->readIntRegOperand(this, 0);
19114 Base = xc->readIntRegOperand(this, 1);
19115 Data = xc->readIntRegOperand(this, 2);
19116 SegBase = xc->readMiscRegOperand(this, 3);
19117 ;
19118         EA = SegBase + bits(scale * Index + Base + disp, addressSize * 8 - 1, 0);;
19119         DPRINTF(X86, "%s : %s: The address is %#x\n", instMnem, mnemonic, EA);
19120
19121         fault = readMemAtomic(xc, traceData, EA, Mem, dataSize, memFlags);
19122
19123         if (fault == NoFault) {
19124             Data = merge(Data, Mem, dataSize);;
19125         } else if (memFlags & Request::PREFETCH) {
19126             // For prefetches, ignore any faults/exceptions.
19127             return NoFault;
19128         }
19129         if(fault == NoFault)
19130         {
19131
19132
19133         {
19134             uint64_t final_val = Data;
19135             xc->setIntRegOperand(this, 0, final_val);
19136
19137             if (traceData) { traceData->setData(final_val); }
19138         };
19139         }
19140
19141         return fault;
19142     }

gem5/src/arch/x86/memhelpers.hh

106 static Fault
107 readMemAtomic(ExecContext *xc, Trace::InstRecord *traceData, Addr addr,
108               uint64_t &mem, unsigned dataSize, Request::Flags flags)
109 {
110     memset(&mem, 0, sizeof(mem));
111     Fault fault = xc->readMem(addr, (uint8_t *)&mem, dataSize, flags);
112     if (fault == NoFault) {
113         // If LE to LE, this is a nop, if LE to BE, the actual data ends up
114         // in the right place because the LSBs where at the low addresses on
115         // access. This doesn't work for BE guests.
116         mem = letoh(mem);
117         if (traceData)
118             traceData->setData(mem);
119     }
120     return fault;
121 }

gem5/src/cpu/exec_context.hh

 57 /**
 58  * The ExecContext is an abstract base class the provides the
 59  * interface used by the ISA to manipulate the state of the CPU model.
 60  *
 61  * Register accessor methods in this class typically provide the index
 62  * of the instruction's operand (e.g., 0 or 1), not the architectural
 63  * register index, to simplify the implementation of register
 64  * renaming.  The architectural register index can be found by
 65  * indexing into the instruction's own operand index table.
 66  *
 67  * @note The methods in this class typically take a raw pointer to the
 68  * StaticInst is provided instead of a ref-counted StaticInstPtr to
 69  * reduce overhead as an argument. This is fine as long as the
 70  * implementation doesn't copy the pointer into any long-term storage
 71  * (which is pretty hard to imagine they would have reason to do).
 72  */
 73 class ExecContext {
 74   public:
 75     typedef TheISA::PCState PCState;
 76
 77     using VecRegContainer = TheISA::VecRegContainer;
 78     using VecElem = TheISA::VecElem;
 79     using VecPredRegContainer = TheISA::VecPredRegContainer;
 ...
 226     /**
 227      * @{
 228      * @name Memory Interface
 229      */
 230     /**
 231      * Perform an atomic memory read operation.  Must be overridden
 232      * for exec contexts that support atomic memory mode.  Not pure
 233      * virtual since exec contexts that only support timing memory
 234      * mode need not override (though in that case this function
 235      * should never be called).
 236      */
 237     virtual Fault readMem(Addr addr, uint8_t *data, unsigned int size,
 238             Request::Flags flags,
 239             const std::vector<bool>& byte_enable = std::vector<bool>())
 240     {
 241         panic("ExecContext::readMem() should be overridden\n");
 242     }

As mentioned in the comment, the ExecContext class is an abstract base class used to manipulate the state of the CPU model. Therefore, each CPU model provides a concrete implementation that actually updates the CPU context. As an example, let's take a look at the simple CPU model.

gem5/src/cpu/simple/exec_context.hh

 61 class SimpleExecContext : public ExecContext {
 62   protected:
 63     using VecRegContainer = TheISA::VecRegContainer;
 64     using VecElem = TheISA::VecElem;
 65
 66   public:
 67     BaseSimpleCPU *cpu;
 68     SimpleThread* thread;
 69
 70     // This is the offset from the current pc that fetch should be performed
 71     Addr fetchOffset;
 72     // This flag says to stay at the current pc. This is useful for
 73     // instructions which go beyond MachInst boundaries.
 74     bool stayAtPC;
 75
 76     // Branch prediction
 77     TheISA::PCState predPC;
 78
 79     /** PER-THREAD STATS */
 80
 81     // Number of simulated instructions
 82     Counter numInst;
 83     Stats::Scalar numInsts;
 84     Counter numOp;
 85     Stats::Scalar numOps;
 86
 87     // Number of integer alu accesses
 88     Stats::Scalar numIntAluAccesses;
...
437     Fault
438     readMem(Addr addr, uint8_t *data, unsigned int size,
439             Request::Flags flags,
440             const std::vector<bool>& byte_enable = std::vector<bool>())
441         override
442     {
443         assert(byte_enable.empty() || byte_enable.size() == size);
444         return cpu->readMem(addr, data, size, flags, byte_enable);
445     }

As shown in lines 437-445, the readMem method is overridden by the SimpleExecContext class, which inherits from the abstract ExecContext class. Because ExecContext is only an interface, the actual memory read is performed by the corresponding CPU class. Note that the timing CPU does not use execute to run the ld microop; it uses the other auto-generated methods, initiateAcc and completeAcc, instead.

InitiateAcc: send memory reference

119 def template MicroLoadInitiateAcc {{
120     Fault %(class_name)s::initiateAcc(ExecContext * xc,
121             Trace::InstRecord * traceData) const

122     {
123         Fault fault = NoFault;
124         Addr EA;
125
126         %(op_decl)s;
127         %(op_rd)s;
128         %(ea_code)s;
129         DPRINTF(X86, "%s : %s: The address is %#x\n", instMnem, mnemonic, EA);
130
131         fault = initiateMemRead(xc, traceData, EA,
132                                 %(memDataSize)s, memFlags);
133
134         return fault;
135     }
136 }};
137

All of this template code exists to generate the load address, and the generated address is then used by the initiateMemRead helper to actually access memory. Note that this method receives the ExecContext, which is the interface to the CPU model, and the generated logical address EA. Memory flags such as prefetch are also delivered to the memory subsystem; remember that memFlags are passed to the class when the microop is constructed.
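To make the address generation concrete, here is a small standalone program that mimics the generated expression EA = SegBase + bits(scale * Index + Base + disp, addressSize * 8 - 1, 0). The register values are made up for the example, and bits() is a local stand-in for gem5's bit-slicing helper:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    // Keep bit positions [last:first] of val, like gem5's bits() helper.
    static uint64_t bits(uint64_t val, int last, int first)
    {
        int nbits = last - first + 1;
        uint64_t mask = (nbits >= 64) ? ~0ULL : ((1ULL << nbits) - 1);
        return (val >> first) & mask;
    }

    int main()
    {
        // Hypothetical operand values, e.g. something like mov eax, [ebx+ecx*8+0x10]
        uint64_t Index = 0x2, Base = 0x1000, SegBase = 0;
        uint64_t scale = 8, disp = 0x10;
        unsigned addressSize = 4;   // 32-bit effective address size

        // Truncate the linear part to addressSize bytes, then add the segment base.
        uint64_t EA = SegBase + bits(scale * Index + Base + disp,
                                     addressSize * 8 - 1, 0);
        std::printf("EA = %#" PRIx64 "\n", EA);   // prints EA = 0x1020
        return 0;
    }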

gem5/build/X86/arch/x86/generated/exec-ns.cc.inc

19144     Fault Ld::initiateAcc(ExecContext * xc,
19145             Trace::InstRecord * traceData) const
19146     {
19147         Fault fault = NoFault;
19148         Addr EA;
19149
19150         uint64_t Index = 0;
19151 uint64_t Base = 0;
19152 uint64_t SegBase = 0;
19153 ;
19154         Index = xc->readIntRegOperand(this, 0);
19155 Base = xc->readIntRegOperand(this, 1);
19156 SegBase = xc->readMiscRegOperand(this, 3);
19157 ;
19158         EA = SegBase + bits(scale * Index + Base + disp, addressSize * 8 - 1, 0);;
19159         DPRINTF(X86, "%s : %s: The address is %#x\n", instMnem, mnemonic, EA);
19160
19161         fault = initiateMemRead(xc, traceData, EA,
19162                                 dataSize, memFlags);
19163
19164         return fault;
19165     }

For a memory operation, initiateAcc is the most important function: it actually initiates the memory access. initiateAcc invokes the initiateMemRead helper, and each CPU class overrides the underlying initiateMemRead method. Before we look at the detailed implementation, keep in mind that all CPU-specific functions are invoked through the ExecContext interface.
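Putting the pieces we are about to walk through together, the call chain for a timing-mode load looks like this (comments only):

    // Ld::initiateAcc(xc, traceData)                       generated microop code
    //   -> initiateMemRead(xc, ..., EA, dataSize, memFlags) x86 memhelpers.hh
    //     -> xc->initiateMemRead(addr, size, flags)         ExecContext interface
    //       -> SimpleExecContext::initiateMemRead           CPU-model override
    //         -> TimingSimpleCPU::initiateMemRead           builds the request and
    //                                                       starts TLB translation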

gem5/src/arch/x86/memhelpers.hh

 45 /// Initiate a read from memory in timing mode.
 46 static Fault
 47 initiateMemRead(ExecContext *xc, Trace::InstRecord *traceData, Addr addr,
 48                 unsigned dataSize, Request::Flags flags)
 49 {
 50     return xc->initiateMemRead(addr, dataSize, flags);
 51 }

The initiateMemRead helper defined in the x86 arch directory invokes the actual initiateMemRead implementation through the ExecContext interface.

gem5/src/cpu/simple/exec_context.hh

447     Fault
448     initiateMemRead(Addr addr, unsigned int size,
449                     Request::Flags flags,
450                     const std::vector<bool>& byte_enable = std::vector<bool>())
451         override
452     {
453         assert(byte_enable.empty() || byte_enable.size() == size);
454         return cpu->initiateMemRead(addr, size, flags, byte_enable);
455     }

Because we are interested in the timing CPU model, let's figure out how it implements initiateMemRead.

gem5/src/cpu/simple/timing.cc

 418 Fault
 419 TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
 420                                  Request::Flags flags,
 421                                  const std::vector<bool>& byte_enable)
 422 {
 423     SimpleExecContext &t_info = *threadInfo[curThread];
 424     SimpleThread* thread = t_info.thread;
 425
 426     Fault fault;
 427     const int asid = 0;
 428     const Addr pc = thread->instAddr();
 429     unsigned block_size = cacheLineSize();
 430     BaseTLB::Mode mode = BaseTLB::Read;
 431
 432     if (traceData)
 433         traceData->setMem(addr, size, flags);
 434
 435     RequestPtr req = std::make_shared<Request>(
 436         asid, addr, size, flags, dataMasterId(), pc,
 437         thread->contextId());
 438     if (!byte_enable.empty()) {
 439         req->setByteEnable(byte_enable);
 440     }
 441
 442     req->taskId(taskId());
 443
 444     Addr split_addr = roundDown(addr + size - 1, block_size);
 445     assert(split_addr <= addr || split_addr - addr < block_size);
 446
 447     _status = DTBWaitResponse;
 448     if (split_addr > addr) {
 449         RequestPtr req1, req2;
 450         assert(!req->isLLSC() && !req->isSwap());
 451         req->splitOnVaddr(split_addr, req1, req2);
 452
 453         WholeTranslationState *state =
 454             new WholeTranslationState(req, req1, req2, new uint8_t[size],
 455                                       NULL, mode);
 456         DataTranslation<TimingSimpleCPU *> *trans1 =
 457             new DataTranslation<TimingSimpleCPU *>(this, state, 0);
 458         DataTranslation<TimingSimpleCPU *> *trans2 =
 459             new DataTranslation<TimingSimpleCPU *>(this, state, 1);
 460
 461         thread->dtb->translateTiming(req1, thread->getTC(), trans1, mode);
 462         thread->dtb->translateTiming(req2, thread->getTC(), trans2, mode);
 463     } else {
 464         WholeTranslationState *state =
 465             new WholeTranslationState(req, new uint8_t[size], NULL, mode);
 466         DataTranslation<TimingSimpleCPU *> *translation
 467             = new DataTranslation<TimingSimpleCPU *>(this, state);
 468         thread->dtb->translateTiming(req, thread->getTC(), translation, mode);
 469     }
 470
 471     return NoFault;
 472 }

This function first handles split memory accesses that require two memory requests: when the address is not aligned and the access crosses a cache-block boundary, it must be split into two separate requests. In either case it then invokes the translateTiming function of the data TLB object (dtb), twice for a split access and once otherwise. Note that initiateMemRead does not actually bring the data from memory into the cache; it only starts a timing address translation in the TLB, and the actual data access begins once the translation completes.
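A short standalone example of the split check, with a local roundDown standing in for gem5's helper and a made-up address:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    // Round addr down to a multiple of align (align must be a power of two).
    static uint64_t roundDown(uint64_t addr, uint64_t align)
    {
        return addr & ~(align - 1);
    }

    int main()
    {
        const uint64_t block_size = 64;   // assumed cache line size
        uint64_t addr = 0x7c;             // hypothetical unaligned address
        unsigned size = 8;                // 8-byte load: touches 0x7c..0x83

        // Block that contains the last byte of the access.
        uint64_t split_addr = roundDown(addr + size - 1, block_size);

        if (split_addr > addr) {
            // 0x7c..0x7f live in one cache line, 0x80..0x83 in the next,
            // so two requests (req1, req2) are needed.
            std::printf("split at %#" PRIx64 "\n", split_addr);
        } else {
            std::printf("single request is enough\n");
        }
        return 0;
    }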

completeAcc: execute memory instruction and bring the data

138 def template MicroLoadCompleteAcc {{
139     Fault %(class_name)s::completeAcc(PacketPtr pkt, ExecContext * xc,
140                                       Trace::InstRecord * traceData) const

141     {
142         Fault fault = NoFault;
143
144         %(op_decl)s;
145         %(op_rd)s;
146
147         getMem(pkt, Mem, dataSize, traceData);
148
149         %(code)s;
150
151         if(fault == NoFault)
152         {
153             %(op_wb)s;
154         }
155
156         return fault;
157     }
158 }};
19167     Fault Ld::completeAcc(PacketPtr pkt, ExecContext * xc,
19168                                       Trace::InstRecord * traceData) const
19169     {
19170         Fault fault = NoFault;
19171
19172         uint64_t Data = 0;
19173 uint64_t Mem;
19174 ;
19175         Data = xc->readIntRegOperand(this, 2);
19176 ;
19177
19178         getMem(pkt, Mem, dataSize, traceData);
19179
19180         Data = merge(Data, Mem, dataSize);;
19181
19182         if(fault == NoFault)
19183         {
19184
19185
19186         {
19187             uint64_t final_val = Data;
19188             xc->setIntRegOperand(this, 0, final_val);
19189
19190             if (traceData) { traceData->setData(final_val); }
19191         };
19192         }
19193
19194         return fault;
19195     }

The completeAcc function receives the pkt as its parameter. pkt contains the actual data read from memory, so the getMem helper extracts the proper amount of data from the packet. Although the memory system moves data around at cache-line granularity (typically 64 bytes), the microop only needs dataSize bytes, so getMem reads exactly that much and feeds it to the rest of the microop, where it is merged into the destination register.
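As a rough standalone illustration of what a size-switched extraction like getMem has to do (the real helper lives in gem5/src/arch/x86/memhelpers.hh and works on the packet directly), the buffer below stands in for the packet payload:

    #include <cstdint>
    #include <cstring>

    // Copy dataSize bytes of the (little-endian) payload into a register-sized
    // variable, zero-extending the rest; assumes a little-endian host.
    static uint64_t extract(const uint8_t *payload, unsigned dataSize)
    {
        uint64_t mem = 0;
        std::memcpy(&mem, payload, dataSize);
        return mem;
    }

    int main()
    {
        const uint8_t payload[8] = {0x78, 0x56, 0x34, 0x12, 0, 0, 0, 0};
        uint64_t Mem = extract(payload, 4);      // -> 0x12345678
        return Mem == 0x12345678 ? 0 : 1;
    }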

Instruction execution: execute decoded microop instruction

For memory operations (lines 798-819), completeIfetch invokes the initiateAcc method of the current microop, represented by curStaticInst (line 800). As we have seen before, each load/store microop defines its own initiateAcc function; for Ld, it is Ld::initiateAcc. For non-memory instructions, completeIfetch invokes the execute method of the microop instead of initiateAcc.

The remaining two phases of completeIfetch, postExecute and advanceInst, manage execution statistics and start fetching the next instruction, respectively.

finishTranslation: resume the memory access after address translation

After the data translation finishes, and when no fault has been detected by the finish function, the CPU starts to read the actual data from memory.

 627 void
 628 TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
 629 {
 630     _status = BaseSimpleCPU::Running;
 631
 632     if (state->getFault() != NoFault) {
 633         if (state->isPrefetch()) {
 634             state->setNoFault();
 635         }
 636         delete [] state->data;
 637         state->deleteReqs();
 638         translationFault(state->getFault());
 639     } else {
 640         if (!state->isSplit) {
 641             sendData(state->mainReq, state->data, state->res,
 642                      state->mode == BaseTLB::Read);
 643         } else {
 644             sendSplitData(state->sreqLow, state->sreqHigh, state->mainReq,
 645                           state->data, state->mode == BaseTLB::Read);
 646         }
 647     }
 648
 649     delete state;
 650 }

As shown in lines 639-646, when no fault was raised during translation, the CPU sends the memory access packet toward memory through the sendData function.

 287 void
 288 TimingSimpleCPU::sendData(const RequestPtr &req, uint8_t *data, uint64_t *res,
 289                           bool read)
 290 {
 291     SimpleExecContext &t_info = *threadInfo[curThread];
 292     SimpleThread* thread = t_info.thread;
 293
 294     PacketPtr pkt = buildPacket(req, read);
 295     pkt->dataDynamic<uint8_t>(data);
 296
 297     if (req->getFlags().isSet(Request::NO_ACCESS)) {
 298         assert(!dcache_pkt);
 299         pkt->makeResponse();
 300         completeDataAccess(pkt);
 301     } else if (read) {
 302         handleReadPacket(pkt);
 303     } else {
 304         bool do_access = true;  // flag to suppress cache access
 305
 306         if (req->isLLSC()) {
 307             do_access = TheISA::handleLockedWrite(thread, req, dcachePort.cacheBlockMask);
 308         } else if (req->isCondSwap()) {
 309             assert(res);
 310             req->setExtraData(*res);
 311         }
 312
 313         if (do_access) {
 314             dcache_pkt = pkt;
 315             handleWritePacket();
 316             threadSnoop(pkt, curThread);
 317         } else {
 318             _status = DcacheWaitResponse;
 319             completeDataAccess(pkt);
 320         }
 321     }
 322 }

Since we are looking at a load instruction rather than a store, we can assume the read flag is set, so sendData invokes handleReadPacket(pkt) (lines 301-302). Note that the packet pkt is built from req together with the read flag (line 294); because req carries the address and data size required to access memory, that information is embedded in the request packet.

 258 bool
 259 TimingSimpleCPU::handleReadPacket(PacketPtr pkt)
 260 {
 261     SimpleExecContext &t_info = *threadInfo[curThread];
 262     SimpleThread* thread = t_info.thread;
 263
 264     const RequestPtr &req = pkt->req;
 265
 266     // We're about the issues a locked load, so tell the monitor
 267     // to start caring about this address
 268     if (pkt->isRead() && pkt->req->isLLSC()) {
 269         TheISA::handleLockedRead(thread, pkt->req);
 270     }
 271     if (req->isMmappedIpr()) {
 272         Cycles delay = TheISA::handleIprRead(thread->getTC(), pkt);
 273         new IprEvent(pkt, this, clockEdge(delay));
 274         _status = DcacheWaitResponse;
 275         dcache_pkt = NULL;
 276     } else if (!dcachePort.sendTimingReq(pkt)) {
 277         _status = DcacheRetry;
 278         dcache_pkt = pkt;
 279     } else {
 280         _status = DcacheWaitResponse;
 281         // memory system takes ownership of packet
 282         dcache_pkt = NULL;
 283     }
 284     return dcache_pkt == NULL;
 285 }

Because the CPU is connected to the memory subsystem through master and slave ports in GEM5, it can initiate a memory access by sending a request packet with the sendTimingReq method. Since the CPU goes through the data cache before touching physical memory, sendTimingReq is invoked on the dcachePort.
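To see the handshake in isolation, here is a toy model of the request path between a CPU-side port and a memory-side port; the class names are illustrative, not the real gem5 classes:

    #include <cstdio>

    struct Packet { const char *what; };

    struct SlavePort {
        bool recvTimingReq(Packet *pkt) {
            std::printf("memory side received: %s\n", pkt->what);
            return true;   // accepted; a response packet will come back later
        }
    };

    struct MasterPort {
        SlavePort *peer = nullptr;
        bool sendTimingReq(Packet *pkt) {
            // Like TimingRequestProtocol::sendReq: simply forward to the peer.
            return peer->recvTimingReq(pkt);
        }
    };

    int main() {
        SlavePort  dcache;
        MasterPort dcachePort;
        dcachePort.peer = &dcache;

        Packet load{"read 8 bytes @ 0x1020"};
        if (!dcachePort.sendTimingReq(&load)) {
            // gem5's TimingSimpleCPU would switch to DcacheRetry here
            // and resend the packet when the cache asks for a retry.
        }
        return 0;
    }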

gem5/src/mem/port.hh

444 inline bool
445 MasterPort::sendTimingReq(PacketPtr pkt)
446 {
447     return TimingRequestProtocol::sendReq(_slavePort, pkt);
448 }

gem5/src/mem/protocol/timing.cc

 47 /* The request protocol. */
 48 
 49 bool
 50 TimingRequestProtocol::sendReq(TimingResponseProtocol *peer, PacketPtr pkt)
 51 {
 52     assert(pkt->isRequest());
 53     return peer->recvTimingReq(pkt);
 54 }

recvTimingResp

When the request has been handled by the slave (the data cache), the recvTimingResp method of the DcachePort is invoked to handle the result of the memory access.

 978 bool
 979 TimingSimpleCPU::DcachePort::recvTimingResp(PacketPtr pkt)
 980 {
 981     DPRINTF(SimpleCPU, "Received load/store response %#x\n", pkt->getAddr());
 982
 983     // The timing CPU is not really ticked, instead it relies on the
 984     // memory system (fetch and load/store) to set the pace.
 985     if (!tickEvent.scheduled()) {
 986         // Delay processing of returned data until next CPU clock edge
 987         tickEvent.schedule(pkt, cpu->clockEdge());
 988         return true;
 989     } else {
 990         // In the case of a split transaction and a cache that is
 991         // faster than a CPU we could get two responses in the
 992         // same tick, delay the second one
 993         if (!retryRespEvent.scheduled())
 994             cpu->schedule(retryRespEvent, cpu->clockEdge(Cycles(1)));
 995         return false;
 996     }
 997 }

At first glance this does not seem to handle the received packet; instead, it schedules tickEvent to process the received packet at the next CPU clock edge.

 999 void
1000 TimingSimpleCPU::DcachePort::DTickEvent::process()
1001 {
1002     cpu->completeDataAccess(pkt);
1003 }
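completeDataAccess is where the load finally completes: it hands the response packet to the microop's completeAcc (the Ld::completeAcc we saw earlier), which merges the loaded bytes into the destination register, and then advances to the next instruction. An abridged sketch of the relevant part of gem5/src/cpu/simple/timing.cc (error handling, split accesses, and LLSC handling omitted; see the source for the full function):

    // Abridged sketch, not the complete function.
    void
    TimingSimpleCPU::completeDataAccess(PacketPtr pkt)
    {
        SimpleExecContext &t_info = *threadInfo[curThread];
        // ...
        // Let the microop consume the data: for Ld this calls Ld::completeAcc,
        // which writes the merged value back with setIntRegOperand.
        Fault fault = curStaticInst->completeAcc(pkt, &t_info, traceData);
        // ...
        postExecute();
        advanceInst(fault);   // schedule the fetch of the next (micro)instruction
    }

With that, the cycle closes: advanceInst eventually triggers the next fetch, and the fetch-decode-execute loop of the TimingSimpleCPU continues.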