
gem5 X86 TLB


layout: post title: “Pagetable walking and pagefault handling in gem5” categories: GEM5, TLB — In this post, we take a look at how memory accesses are resolved through the TLB and page table walking.

Who initiates TLB access?

The TLB maintains virtual-to-physical address translation information so that the entire page table does not have to be walked on every memory access. In other words, it is a cache of virtual-to-physical mappings, usually maintained by the processor itself. So which part of the CPU logic initiates a TLB access, and what operations does the TLB component have to perform?
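As a purely conceptual sketch (not gem5’s actual implementation), a TLB can be pictured as a small map from virtual page numbers to physical page numbers; all names below are hypothetical.

#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical mental model of a TLB: a cache of VPN -> PPN mappings.
// gem5's real x86 TLB stores full TlbEntry objects with permission bits.
struct TinyTlb {
    std::unordered_map<uint64_t, uint64_t> entries; // VPN -> PPN

    std::optional<uint64_t>
    lookup(uint64_t vaddr, unsigned pageShift = 12) const
    {
        auto it = entries.find(vaddr >> pageShift);
        if (it == entries.end())
            return std::nullopt;                    // miss: walk the page table
        uint64_t offMask = (1ULL << pageShift) - 1;
        return (it->second << pageShift) | (vaddr & offMask); // hit: paddr
    }
};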

Interface between CPU pipeline and TLB component

 425 Fault
 426 TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
 427                                  Request::Flags flags,
 428                                  const std::vector<bool>& byte_enable)
 429 {
 430     SimpleExecContext &t_info = *threadInfo[curThread];
 431     SimpleThread* thread = t_info.thread;
 432 
 439     Fault fault;
 440     const int asid = 0;
 441     const Addr pc = thread->instAddr();
 442     unsigned block_size = cacheLineSize();
 443     BaseTLB::Mode mode = BaseTLB::Read;
 444 
 445     if (traceData)
 446         traceData->setMem(addr, size, flags);
 447 
 448     RequestPtr req = std::make_shared<Request>(
 449         asid, addr, size, flags, dataMasterId(), pc,
 450         thread->contextId());
 451     if (!byte_enable.empty()) {
 452         req->setByteEnable(byte_enable);
 453     }
 454    
 455     req->taskId(taskId());
 456 
 457     Addr split_addr = roundDown(addr + size - 1, block_size);
 458     assert(split_addr <= addr || split_addr - addr < block_size);
 459                                  
 460     _status = DTBWaitResponse;
 461     if (split_addr > addr) {
 462         RequestPtr req1, req2;
 463         assert(!req->isLLSC() && !req->isSwap());
 464         req->splitOnVaddr(split_addr, req1, req2);
 465    
 466         WholeTranslationState *state =
 467             new WholeTranslationState(req, req1, req2, new uint8_t[size],
 468                                       NULL, mode);
 469         DataTranslation<TimingSimpleCPU *> *trans1 =
 470             new DataTranslation<TimingSimpleCPU *>(this, state, 0);
 471         DataTranslation<TimingSimpleCPU *> *trans2 =
 472             new DataTranslation<TimingSimpleCPU *>(this, state, 1);
 473 
 474         thread->dtb->translateTiming(req1, thread->getTC(), trans1, mode);
 475         thread->dtb->translateTiming(req2, thread->getTC(), trans2, mode);
 476     } else {
 477         WholeTranslationState *state =
 478             new WholeTranslationState(req, new uint8_t[size], NULL, mode);
 479         DataTranslation<TimingSimpleCPU *> *translation
 480             = new DataTranslation<TimingSimpleCPU *>(this, state);
 481         thread->dtb->translateTiming(req, thread->getTC(), translation, mode);
 482     }
 483 
 484     return NoFault;
 485 }

One of the most important basic capabilities of a processor is accessing memory. gem5 makes each processor model implement its own memory access building blocks as member functions of the corresponding CPU class. We are going to look at a simple processor, TimingSimpleCPU, and its memory read function, initiateMemRead. Note that at the end of the initiateMemRead function it creates a DataTranslation object and passes it to the translateTiming function of the processor’s data TLB component. This translation object will be used to process the current TLB access request. Also note that translateTiming needs the ThreadContext to execute the TLB access and a RequestPtr object containing all the memory access request information, such as the virtual address.
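As a side note on lines 457-461, the split path triggers only when the access crosses a cache-line boundary. A hypothetical worked example, assuming a 64-byte cache line and a local re-implementation of gem5’s roundDown helper:

#include <cassert>
#include <cstdint>

// Local stand-in for gem5's roundDown helper: clear the low alignment bits.
static uint64_t roundDown(uint64_t val, uint64_t align) { return val & ~(align - 1); }

int main()
{
    const uint64_t block_size = 64;                   // assumed cache line size
    uint64_t addr = 0x7C;                             // last 4 bytes of one line...
    uint64_t size = 8;                                // ...plus 4 bytes of the next
    uint64_t split_addr = roundDown(addr + size - 1, block_size);
    assert(split_addr == 0x80 && split_addr > addr);  // crosses the line boundary,
                                                      // so two requests and two
                                                      // DataTranslations are created
    return 0;
}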

It’s all about TLB! No actual memory access to the virtual address!

The initiateMemRead function does not initiate an actual memory access; it only asks the TLB component to produce the virtual-to-physical address mapping, either from its TLB cache or via a page table walk.

The name initiateMemRead can be confusing, but the actual memory access can only occur after the TLB request has been successfully resolved. I will describe how the actual memory access happens in this posting []. Keep in mind that here we will only focus on the translation part!

//gem5/src/arch/x86/tlb.cc

441 void
442 TLB::translateTiming(const RequestPtr &req, ThreadContext *tc,
443         Translation *translation, Mode mode)
444 {
445     bool delayedResponse;
446     assert(translation);
447     Fault fault =
448         TLB::translate(req, tc, translation, mode, delayedResponse, true);
449
450     if (!delayedResponse)
451         translation->finish(fault, req, tc, mode);
452     else
453         translation->markDelayed();
454 }

Since we assume gem5 is compiled for the X86 architecture, this invokes the X86 TLB implementation. Please be aware that the translateTiming function is implemented as part of the TLB class, indicating that we are now working inside the TLB component, having left the processor pipeline. The actual translation is done by the TLB::translate function. Depending on whether the target virtual address has previously been resolved and its mapping cached in the TLB, the function either retrieves the TLB entry from the cache or obtains it by walking the page table.
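The translation argument can be any subclass of BaseTLB::Translation; the only contract translateTiming relies on is the pair of callbacks used above. A minimal hypothetical consumer could look like the sketch below, assuming the gem5 headers that declare BaseTLB, Fault and RequestPtr are available (the real consumer used by TimingSimpleCPU is the DataTranslation class shown later in this post).

// Hypothetical minimal Translation consumer, only to illustrate the callback
// contract used by translateTiming(): markDelayed() when a page table walk is
// pending, finish() once the fault / physical address is known.
class MyTranslation : public BaseTLB::Translation
{
  public:
    void markDelayed() { delayed = true; }

    void
    finish(const Fault &fault, const RequestPtr &req, ThreadContext *tc,
           BaseTLB::Mode mode)
    {
        // If fault == NoFault, req->getPaddr() is now valid and the CPU model
        // can issue the actual memory access for this request.
    }

    bool squashed() const { return false; }

  private:
    bool delayed = false;
};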

// gem5/src/arch/x86/tlb.cc

277 Fault
278 TLB::translate(const RequestPtr &req,
279         ThreadContext *tc, Translation *translation,
280         Mode mode, bool &delayedResponse, bool timing)
281 {
282     Request::Flags flags = req->getFlags();
283     int seg = flags & SegmentFlagMask;
284     bool storeCheck = flags & (StoreCheck << FlagShift);
...
341         // If paging is enabled, do the translation.
342         if (m5Reg.paging) {
343             DPRINTF(TLB, "Paging enabled.\n");
344             // The vaddr already has the segment base applied.
345             TlbEntry *entry = lookup(vaddr);
346             if (mode == Read) {
347                 rdAccesses++;
348             } else {
349                 wrAccesses++;
350             }
351             if (!entry) {
352                 DPRINTF(TLB, "Handling a TLB miss for "
353                         "address %#x at pc %#x.\n",
354                         vaddr, tc->instAddr());
355                 if (mode == Read) {
356                     rdMisses++;
357                 } else {
358                     wrMisses++;
359                 }
360                 if (FullSystem) {
361                     Fault fault = walker->start(tc, translation, req, mode);
362                     if (timing || fault != NoFault) {
363                         // This gets ignored in atomic mode.
364                         delayedResponse = true;
365                         return fault;
366                     }
367                     entry = lookup(vaddr);
368                     assert(entry);
369                 } else {

The first step in the translate function is a TLB lookup to check whether the needed translation entry is present (line 345). If the entry is absent, the function walks the page table, which is stored in memory, to acquire the virtual-to-physical translation (lines 351-395). Since we are interested in full-system emulation, I will focus on the FullSystem parts of TLB handling. In gem5’s full-system mode, when a TLB miss occurs, the page table is walked by the “pagetable_walker” object (line 361). Note that the “req” parameter is passed to the pagetable_walker because it contains all the essential information, including the address and flags, needed to correctly resolve the memory access.

Page table walking in TLB

When an address is accessed for the first time, or its TLB entry has been evicted from the TLB cache, the page table must be traversed to obtain the virtual-to-physical mapping. Let’s examine how the TLB walks the page table and retrieves the final-level page table entry.

WalkerState per request

In contrast to simpler operations, it’s typically not possible to resolve TLB misses in a single cycle.

As the page table is structured with multiple levels, the page table walking demands numerous memory accesses. These accesses are essential for reaching the leaf page table entry that contains the virtual-to-physical mapping and other pertinent flags.
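Conceptually, the walk is a chain of dependent loads: the address of each level’s entry only becomes known once the previous level’s entry has been read, so the levels cannot be fetched in parallel. Before diving into the gem5 code, here is a hypothetical sketch of a 4-level long-mode walk (ignoring faults, large pages and accessed/dirty-bit updates; readPhys stands in for the walker’s memory packets).

#include <cstdint>
#include <functional>

// Hypothetical sketch of a 4-level x86-64 walk. Each iteration's address
// depends on the PTE fetched in the previous iteration.
uint64_t
walkSketch(uint64_t cr3, uint64_t vaddr,
           const std::function<uint64_t(uint64_t)> &readPhys)
{
    uint64_t table = cr3 & ~0xfffULL;                  // PML4 base from CR3
    for (int level = 4; level >= 1; --level) {
        uint64_t idx = (vaddr >> (12 + 9 * (level - 1))) & 0x1ff; // 9-bit index
        uint64_t pte = readPhys(table + idx * 8);      // dependent load
        table = pte & (((1ULL << 40) - 1) << 12);      // next table / frame base
    }
    return table | (vaddr & 0xfff);                    // paddr of a 4 KB page
}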

//gem5/src/arch/x86/pagetable_walker.cc

 71 Fault
 72 Walker::start(ThreadContext * _tc, BaseTLB::Translation *_translation,
 73               const RequestPtr &_req, BaseTLB::Mode _mode)
 74 {
 75     // TODO: in timing mode, instead of blocking when there are other
 76     // outstanding requests, see if this request can be coalesced with
 77     // another one (i.e. either coalesce or start walk)
 78     WalkerState * newState = new WalkerState(this, _translation, _req);
 79     newState->initState(_tc, _mode, sys->isTimingMode());
 80     if (currStates.size()) {
 81         assert(newState->isTiming());
 82         DPRINTF(PageTableWalker, "Walks in progress: %d\n", currStates.size());
 83         currStates.push_back(newState);
 84         return NoFault;
 85     } else {
 86         currStates.push_back(newState);
 87         Fault fault = newState->startWalk();
 88         if (!newState->isTiming()) {
 89             currStates.pop_front();
 90             delete newState;
 91         }
 92         return fault;
 93     }
 94 }

It is important to note that TLB misses can occur simultaneously because multiple processors might try to access memory addresses for which the virtual-to-physical mapping is not stored in the TLB cache. Additionally, since a request cannot be handled in a single clock cycle, the state of the page table walk must be stored for each request. The WalkerState is employed for this specific purpose, maintaining all the information needed for page table walking on a per-request basis.

The “currStates” list keeps track of all the outstanding requests, i.e., those that were issued previously but have not yet been resolved. If there are any unresolved TLB misses, the current request is simply appended to the list and waits until the preceding requests have been resolved (lines 80-84). Once the outstanding request has been resolved, the pending requests are processed one after another.

If there are no remaining requests in the list (lines 85-92), the newly created state is added and its “startWalk” function is called. When “startWalk” returns, a timing CPU does not remove the current state from the “currStates” list here, because it is removed later when the walk’s response comes back; for a non-timing (atomic) CPU the state is popped and deleted immediately.

startWalk, initiating page table walking

gem5/src/arch/x86/pagetable_walker.cc

229 Fault
230 Walker::WalkerState::startWalk()
231 {
232     Fault fault = NoFault;
233     assert(!started);
234     started = true;
235     setupWalk(req->getVaddr());
236     if (timing) {
237         nextState = state;
238         state = Waiting;
239         timingFault = NoFault;
240         sendPackets();
241     } else {
242         do {
243             walker->port.sendAtomic(read);
244             PacketPtr write = NULL;
245             fault = stepWalk(write);
246             assert(fault == NoFault || read == NULL);
247             state = nextState;
248             nextState = Ready;
249             if (write)
250                 walker->port.sendAtomic(write);
251         } while (read);
252         state = Ready;
253         nextState = Waiting;
254     }
255     return fault;
256 }

Since the page table is stored in memory (and possibly cached), whenever a TLB miss happens the walker must retrieve page table contents from the memory subsystem. To this end, in timing mode the walker issues its memory requests through the sendPackets function.

multi-level page table walking process = multiple packets

661 void
662 Walker::WalkerState::sendPackets()
663 {
664     //If we're already waiting for the port to become available, just return.
665     if (retrying)
666         return;
667
668     //Reads always have priority
669     if (read) {
670         PacketPtr pkt = read;
671         read = NULL;
672         inflight++;
673         if (!walker->sendTiming(this, pkt)) {
674             retrying = true;
675             read = pkt;
676             inflight--;
677             return;
678         }
679     }
680     //Send off as many of the writes as we can.
681     while (writes.size()) {
682         PacketPtr write = writes.back();
683         writes.pop_back();
684         inflight++;
685         if (!walker->sendTiming(this, write)) {
686             retrying = true;
687             writes.push_back(write);
688             inflight--;
689             return;
690         }
691     }
692 }

With modern processors using multi-level page tables, the address of a page table entry cannot be determined before the memory access to the previous level has been resolved. Because of this interdependence among page table accesses, the entries must be fetched sequentially rather than in parallel. Consequently, the walk is structured into multiple stages, with each stage responsible for accessing one level of the page table.

Since the walker has to ask the memory subsystem to fetch the next-level page table entry one step at a time, it must send a different packet at each stage to access the corresponding level of the page table.

When you look at the “sendPackets” function, you will notice a familiar function name, “sendTiming”, which dispatches the page table access request packets to the memory subsystem (e.g., the cache or memory).

Initial page table access packet creation

When you take a look at the “sendPackets” function, you won’t find any packet creation within it; you will only notice that the “sendTiming” function receives a parameter named pkt. So where does this pkt come from? The “setupWalk” function, called from “startWalk”, is responsible for populating the initial request packet that starts the page table access.

551 void
552 Walker::WalkerState::setupWalk(Addr vaddr)
553 {
554     VAddr addr = vaddr;
555     CR3 cr3 = tc->readMiscRegNoEffect(MISCREG_CR3);
556     // Check if we're in long mode or not
557     Efer efer = tc->readMiscRegNoEffect(MISCREG_EFER);
558     dataSize = 8;
559     Addr topAddr;
560     if (efer.lma) {
561         // Do long mode.
562         state = LongPML4;
563         topAddr = (cr3.longPdtb << 12) + addr.longl4 * dataSize;
564         enableNX = efer.nxe;
565     } else {
566         // We're in some flavor of legacy mode.
567         CR4 cr4 = tc->readMiscRegNoEffect(MISCREG_CR4);
568         if (cr4.pae) {
569             // Do legacy PAE.
570             state = PAEPDP;
571             topAddr = (cr3.paePdtb << 5) + addr.pael3 * dataSize;
572             enableNX = efer.nxe;
573         } else {
574             dataSize = 4;
575             topAddr = (cr3.pdtb << 12) + addr.norml2 * dataSize;
576             if (cr4.pse) {
577                 // Do legacy PSE.
578                 state = PSEPD;
579             } else {
580                 // Do legacy non PSE.
581                 state = PD;
582             }
583             enableNX = false;
584         }
585     }
586
587     nextState = Ready;
588     entry.vaddr = vaddr;
589
590     Request::Flags flags = Request::PHYSICAL;
591     if (cr3.pcd)
592         flags.set(Request::UNCACHEABLE);
593
594     RequestPtr request = std::make_shared<Request>(
595         topAddr, dataSize, flags, walker->masterId);
596
597     read = new Packet(request, MemCmd::ReadReq);
598     read->allocate();
599 }

We’ve learned that the “sendPackets” function is employed to transmit multiple page table access requests, depending on the various stages of the page table walking process. So, how are the packets for the subsequent stages created and provided to the “sendPackets” function? Please bear with me as we progress through one complete step of page table walking; I will address this aspect shortly.

sendTiming function: sends the request and saves the current state

Now, let’s explore how the “sendTiming” function transmits the generated page table access request packet to the memory subsystem via the designated port.

156 bool Walker::sendTiming(WalkerState* sendingState, PacketPtr pkt)
157 {
158     WalkerSenderState* walker_state = new WalkerSenderState(sendingState);
159     pkt->pushSenderState(walker_state);
160     if (port.sendTimingReq(pkt)) {
161         return true;
162     } else {
163         // undo the adding of the sender state and delete it, as we
164         // will do it again the next time we attempt to send it
165         pkt->popSenderState();
166         delete walker_state;
167         return false;
168     }
169
170 }

It’s worth noting that the “sendTiming” function initially generates a separate state called “WalkerSenderState.” This state variable is essential for handling the requested page table access and for processing the response from the memory subsystem once the page table access has been completed.
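This push/pop of sender state is a general gem5 idiom: the requester attaches its own bookkeeping to the packet itself, so that when the response returns it can recover which in-flight walk the packet belongs to. A rough sketch of the pattern, assuming the usual gem5 Packet API (the real WalkerSenderState simply stores the WalkerState pointer); WalkerStateStub and SenderStateSketch are illustrative names.

#include "mem/packet.hh"    // gem5's Packet / SenderState

struct WalkerStateStub;      // stands in for the walker's per-walk state

// Sketch of the SenderState idiom: attach per-walk bookkeeping to the packet
// before sending it, recover it from the response packet later.
struct SenderStateSketch : public Packet::SenderState
{
    WalkerStateStub *senderWalk;
    SenderStateSketch(WalkerStateStub *ws) : senderWalk(ws) {}
};

// On send:    pkt->pushSenderState(new SenderStateSketch(thisWalk));
//             port.sendTimingReq(pkt);
// On receive: auto *ss =
//                 dynamic_cast<SenderStateSketch *>(pkt->popSenderState());
//             // ss->senderWalk identifies which walk this response belongs to
//             delete ss;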

Handling return packet from memory sub-system

When the memory subsystem has handled the page table access request, the pagetable_walker receives the result packet through its port. When the packet arrives at the port connecting the pagetable_walker and the memory subsystem, the port invokes the recvTimingResp function of the walker.

104 bool
105 Walker::WalkerPort::recvTimingResp(PacketPtr pkt)
106 {
107     return walker->recvTimingResp(pkt);
108 }
109
110 bool
111 Walker::recvTimingResp(PacketPtr pkt)
112 {
113     WalkerSenderState * senderState =
114         dynamic_cast<WalkerSenderState *>(pkt->popSenderState());
115     WalkerState * senderWalk = senderState->senderWalk;
116     bool walkComplete = senderWalk->recvPacket(pkt);
117     delete senderState;
118     if (walkComplete) {
119         std::list<WalkerState *>::iterator iter;
120         for (iter = currStates.begin(); iter != currStates.end(); iter++) {
121             WalkerState * walkerState = *(iter);
122             if (walkerState == senderWalk) {
123                 iter = currStates.erase(iter);
124                 break;
125             }
126         }
127         delete senderWalk;
128         // Since we block requests when another is outstanding, we
129         // need to check if there is a waiting request to be serviced
130         if (currStates.size() && !startWalkWrapperEvent.scheduled())
131             // delay sending any new requests until we are finished
132             // with the responses
133             schedule(startWalkWrapperEvent, clockEdge());
134     }
135     return true;
136 }

As we’ve seen before, WalkerSenderState wraps the WalkerState instance that was used to send the page table access request associated with the packet just received.

recvPacket handles the received packet and sends another packet for the next stage of the page table access

The retrieved WalkerState instance handles the received packet through its recvPacket function.

602 bool
603 Walker::WalkerState::recvPacket(PacketPtr pkt)
604 {
605     assert(pkt->isResponse());
606     assert(inflight);
607     assert(state == Waiting);
608     inflight--;
609     if (squashed) {
610         // if were were squashed, return true once inflight is zero and
611         // this WalkerState will be freed there.
612         return (inflight == 0);
613     }
614     if (pkt->isRead()) {
615         // should not have a pending read it we also had one outstanding
616         assert(!read);
617
618         // @todo someone should pay for this
619         pkt->headerDelay = pkt->payloadDelay = 0;
620
621         state = nextState;
622         nextState = Ready;
623         PacketPtr write = NULL;
624         read = pkt;
625         timingFault = stepWalk(write);
626         state = Waiting;
627         assert(timingFault == NoFault || read == NULL);
628         if (write) {
629             writes.push_back(write);
630         }
631         sendPackets();
632     } else {
633         sendPackets();
634     }
635     if (inflight == 0 && read == NULL && writes.size() == 0) {
636         state = Ready;
637         nextState = Waiting;
638         if (timingFault == NoFault) {
639             /*
640              * Finish the translation. Now that we know the right entry is
641              * in the TLB, this should work with no memory accesses.
642              * There could be new faults unrelated to the table walk like
643              * permissions violations, so we'll need the return value as
644              * well.
645              */
646             bool delayedResponse;
647             Fault fault = walker->tlb->translate(req, tc, NULL, mode,
648                                                  delayedResponse, true);
649             assert(!delayedResponse);
650             // Let the CPU continue.
651             translation->finish(fault, req, tc, mode);
652         } else {
653             // There was a fault during the walk. Let the CPU know.
654             translation->finish(timingFault, req, tc, mode);
655         }
656         return true;
657     }
658
659     return false;
660 }

Because the recvPacket function has been invoked as the result of a memory read (the initial page table access), lines 614-634 will be executed. There are some functions we haven’t covered yet, but in the end it invokes the sendPackets function again. Wait, why call sendPackets once more in the receive path?

Remember! Page table walking is not a single memory access

Note that we are currently dealing with the response packet for the initial page table access request (the first level of the page table). Therefore, the received packet contains a pointer to the next level of the page table, not the leaf page table entry that actually holds the virtual-to-physical mapping. To acquire the last-level page table entry, additional memory accesses to the lower levels of the page table are needed, which requires calling sendPackets again.

Preparing packets for the next pagetable access requests

Just as the initial packet was generated by setupWalk, the packets required for accessing the further page table levels are prepared by the stepWalk function.

282 Fault
283 Walker::WalkerState::stepWalk(PacketPtr &write)
284 {
285     assert(state != Ready && state != Waiting);
286     Fault fault = NoFault;
287     write = NULL;
288     PageTableEntry pte;
289     if (dataSize == 8)
290         pte = read->getLE<uint64_t>();
291     else
292         pte = read->getLE<uint32_t>();
293     VAddr vaddr = entry.vaddr;
294     bool uncacheable = pte.pcd;
295     Addr nextRead = 0;
296     bool doWrite = false;
297     bool doTLBInsert = false;
298     bool doEndWalk = false;
299     bool badNX = pte.nx && mode == BaseTLB::Execute && enableNX;
300     switch(state) {
301       case LongPML4:
302         DPRINTF(PageTableWalker,
303                 "Got long mode PML4 entry %#016x.\n", (uint64_t)pte);
304         nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl3 * dataSize;
305         doWrite = !pte.a;
306         pte.a = 1;
307         entry.writable = pte.w;
308         entry.user = pte.u;
309         if (badNX || !pte.p) {
310             doEndWalk = true;
311             fault = pageFault(pte.p);
312             break;
313         }
314         entry.noExec = pte.nx;
315         nextState = LongPDP;
316         break;
317       case LongPDP:
318         DPRINTF(PageTableWalker,
319                 "Got long mode PDP entry %#016x.\n", (uint64_t)pte);
320         nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl2 * dataSize;
321         doWrite = !pte.a;
322         pte.a = 1;
323         entry.writable = entry.writable && pte.w;
324         entry.user = entry.user && pte.u;
325         if (badNX || !pte.p) {
326             doEndWalk = true;
327             fault = pageFault(pte.p);
328             break;
329         }
330         nextState = LongPD;
331         break;
332       case LongPD:
333         DPRINTF(PageTableWalker,
334                 "Got long mode PD entry %#016x.\n", (uint64_t)pte);
335         doWrite = !pte.a;
336         pte.a = 1;
337         entry.writable = entry.writable && pte.w;
338         entry.user = entry.user && pte.u;
339         if (badNX || !pte.p) {
340             doEndWalk = true;
341             fault = pageFault(pte.p);
342             break;
343         }
344         if (!pte.ps) {
345             // 4 KB page
346             entry.logBytes = 12;
347             nextRead =
348                 ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl1 * dataSize;
349             nextState = LongPTE;
350             break;
351         } else {
352             // 2 MB page
353             entry.logBytes = 21;
354             entry.paddr = (uint64_t)pte & (mask(31) << 21);
355             entry.uncacheable = uncacheable;
356             entry.global = pte.g;
357             entry.patBit = bits(pte, 12);
358             entry.vaddr = entry.vaddr & ~((2 * (1 << 20)) - 1);
359             doTLBInsert = true;
360             doEndWalk = true;
361             break;
362         }
363       case LongPTE:
364         DPRINTF(PageTableWalker,
365                 "Got long mode PTE entry %#016x.\n", (uint64_t)pte);
366         doWrite = !pte.a;
367         pte.a = 1;
368         entry.writable = entry.writable && pte.w;
369         entry.user = entry.user && pte.u;
370         if (badNX || !pte.p) {
371             doEndWalk = true;
372             fault = pageFault(pte.p);
373             break;
374         }
375         entry.paddr = (uint64_t)pte & (mask(40) << 12);
376         entry.uncacheable = uncacheable;
377         entry.global = pte.g;
378         entry.patBit = bits(pte, 12);
379         entry.vaddr = entry.vaddr & ~((4 * (1 << 10)) - 1);
380         doTLBInsert = true;
381         doEndWalk = true;
382         break;
383       case PAEPDP:
384         DPRINTF(PageTableWalker,
385                 "Got legacy mode PAE PDP entry %#08x.\n", (uint32_t)pte);
386         nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.pael2 * dataSize;
387         if (!pte.p) {
388             doEndWalk = true;
389             fault = pageFault(pte.p);
390             break;
391         }
392         nextState = PAEPD;
393         break;
394       case PAEPD:
395         DPRINTF(PageTableWalker,
396                 "Got legacy mode PAE PD entry %#08x.\n", (uint32_t)pte);
397         doWrite = !pte.a;
398         pte.a = 1;
399         entry.writable = pte.w;
400         entry.user = pte.u;
401         if (badNX || !pte.p) {
402             doEndWalk = true;
403             fault = pageFault(pte.p);
404             break;
405         }
406         if (!pte.ps) {
407             // 4 KB page
408             entry.logBytes = 12;
409             nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.pael1 * dataSize;
410             nextState = PAEPTE;
411             break;
412         } else {
413             // 2 MB page
414             entry.logBytes = 21;
415             entry.paddr = (uint64_t)pte & (mask(31) << 21);
416             entry.uncacheable = uncacheable;
417             entry.global = pte.g;
418             entry.patBit = bits(pte, 12);
419             entry.vaddr = entry.vaddr & ~((2 * (1 << 20)) - 1);
420             doTLBInsert = true;
421             doEndWalk = true;
422             break;
...
443         break;
444       case PSEPD:
445         DPRINTF(PageTableWalker,
446                 "Got legacy mode PSE PD entry %#08x.\n", (uint32_t)pte);
447         doWrite = !pte.a;
448         pte.a = 1;
449         entry.writable = pte.w;
450         entry.user = pte.u;
451         if (!pte.p) {
452             doEndWalk = true;
453             fault = pageFault(pte.p);
454             break;
455         }
456         if (!pte.ps) {
457             // 4 KB page
458             entry.logBytes = 12;
459             nextRead =
460                 ((uint64_t)pte & (mask(20) << 12)) + vaddr.norml2 * dataSize;
461             nextState = PTE;
462             break;
463         } else {
464             // 4 MB page
465             entry.logBytes = 21;
466             entry.paddr = bits(pte, 20, 13) << 32 | bits(pte, 31, 22) << 22;
467             entry.uncacheable = uncacheable;
468             entry.global = pte.g;
469             entry.patBit = bits(pte, 12);
470             entry.vaddr = entry.vaddr & ~((4 * (1 << 20)) - 1);
471             doTLBInsert = true;
472             doEndWalk = true;
473             break;
474         }
475       case PD:
476         DPRINTF(PageTableWalker,
477                 "Got legacy mode PD entry %#08x.\n", (uint32_t)pte);
478         doWrite = !pte.a;
479         pte.a = 1;
480         entry.writable = pte.w;
481         entry.user = pte.u;
482         if (!pte.p) {
483             doEndWalk = true;
484             fault = pageFault(pte.p);
485             break;
486         }
487         // 4 KB page
488         entry.logBytes = 12;
489         nextRead = ((uint64_t)pte & (mask(20) << 12)) + vaddr.norml2 * dataSize;
490         nextState = PTE;
491         break;
492       case PTE:
493         DPRINTF(PageTableWalker,
494                 "Got legacy mode PTE entry %#08x.\n", (uint32_t)pte);
495         doWrite = !pte.a;
496         pte.a = 1;
497         entry.writable = pte.w;
498         entry.user = pte.u;
499         if (!pte.p) {
500             doEndWalk = true;
501             fault = pageFault(pte.p);
502             break;
503         }
504         entry.paddr = (uint64_t)pte & (mask(20) << 12);
505         entry.uncacheable = uncacheable;
506         entry.global = pte.g;
507         entry.patBit = bits(pte, 7);
508         entry.vaddr = entry.vaddr & ~((4 * (1 << 10)) - 1);
509         doTLBInsert = true;
510         doEndWalk = true;
511         break;
512       default:
513         panic("Unknown page table walker state %d!\n");
514     }
515     if (doEndWalk) {
516         if (doTLBInsert)
517             if (!functional)
518                 walker->tlb->insert(entry.vaddr, entry, tc);
519         endWalk();
520     } else {
521         PacketPtr oldRead = read;
522         //If we didn't return, we're setting up another read.
523         Request::Flags flags = oldRead->req->getFlags();
524         flags.set(Request::UNCACHEABLE, uncacheable);
525         RequestPtr request = std::make_shared<Request>(
526             nextRead, oldRead->getSize(), flags, walker->masterId);
527         read = new Packet(request, MemCmd::ReadReq);
528         read->allocate();
529         // If we need to write, adjust the read packet to write the modified
530         // value back to memory.
531         if (doWrite) {
532             write = oldRead;
533             write->setLE<uint64_t>(pte);
534             write->cmd = MemCmd::WriteReq;
535         } else {
536             write = NULL;
537             delete oldRead;
538         }
539     }
540     return fault;
541 }

Even though the function is very long, its job is simple: based on the current state, which represents the page table level accessed by the packet just received, it computes the next-level page table address and populates the corresponding packet.

Because we got here as a result of accessing the PML4 (the topmost level of the page table), lines 301-316 will be executed to prepare the information needed to access the next level. Note that nextState is set to the next page table level, LongPDP.

After setting the fields associated with the next page table level, it generates another read packet (lines 521-538) carrying all the information required to access that level. Note that the newly populated packet is assigned to the read field of the current WalkerState object; this read packet is what sendPackets later sends to access the further page table levels.
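As a concrete, made-up example of the address computed at line 304: suppose the PML4 entry read back is 0x123456027 and the PDP index taken from the virtual address (vaddr.longl3) is 5; the next read then targets bits 51:12 of the entry plus 5 * dataSize.

#include <cassert>
#include <cstdint>

int main()
{
    uint64_t pte = 0x0000000123456027ULL;          // hypothetical PML4 entry
    uint64_t mask40 = (1ULL << 40) - 1;            // gem5's mask(40)
    uint64_t tableBase = pte & (mask40 << 12);     // 0x123456000: PDP table base
    uint64_t longl3 = 5;                           // PDP index from the vaddr
    uint64_t nextRead = tableBase + longl3 * 8;    // dataSize == 8 in long mode
    assert(nextRead == 0x123456028ULL);
    return 0;
}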

These send and receive steps are repeated until the final PTE is read. When the leaf PTE arrives from the memory subsystem, stepWalk sets the doEndWalk and doTLBInsert flags, and when they are set a new TLB entry is inserted into the TLB (lines 515-519).
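For reference, the entry handed to walker->tlb->insert() roughly carries the fields assigned above; the struct below is only an approximation of that shape, not the actual X86ISA::TlbEntry declaration.

#include <cstdint>

// Approximate shape of the information stepWalk accumulates for the TLB
// insert; the real X86ISA::TlbEntry has additional members.
struct TlbEntrySketch
{
    uint64_t paddr;       // physical frame base from the leaf entry
    uint64_t vaddr;       // virtual address rounded down to the page base
    unsigned logBytes;    // 12 for 4 KB pages, 21 for 2 MB pages
    bool writable;        // AND of the W bits seen along the walk
    bool user;            // AND of the U/S bits seen along the walk
    bool uncacheable;     // from the PCD bit
    bool global;          // G bit of the leaf entry
    bool patBit;          // PAT selection bit
    bool noExec;          // NX bit (only meaningful when EFER.NXE is set)
};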

Finish TLB translation

After the translation has finished, whether it ended as a TLB hit, a TLB miss followed by a page table walk, or an unexpected fault, the finish function is invoked through the translation object.

gem5/src/arch/x86/tlb.cc

441 void
442 TLB::translateTiming(const RequestPtr &req, ThreadContext *tc,
443         Translation *translation, Mode mode)
444 {
445     bool delayedResponse;
446     assert(translation);
447     Fault fault =
448         TLB::translate(req, tc, translation, mode, delayedResponse, true);
449
450     if (!delayedResponse)
451         translation->finish(fault, req, tc, mode);
452     else
453         translation->markDelayed();
454 }

Translation object

Wait, what is the translation object? We haven’t dealt with it in detail yet. Let’s go back to the initiateMemRead function to understand what the translation object is.

DataTranslation class and finish method

 464         WholeTranslationState *state =
 465             new WholeTranslationState(req, new uint8_t[size], NULL, mode);
 466         DataTranslation<TimingSimpleCPU *> *translation
 467             = new DataTranslation<TimingSimpleCPU *>(this, state);
 468         thread->dtb->translateTiming(req, thread->getTC(), translation, mode);

At lines 464-468, we can see that it is an object of the DataTranslation class. To find the implementation of the finish function, let’s take a look at that class.

gem5/src/cpu/translation.hh

208 /**
209  * This class represents part of a data address translation.  All state for
210  * the translation is held in WholeTranslationState (above).  Therefore this
211  * class does not need to know whether the translation is split or not.  The
212  * index variable determines this but is simply passed on to the state class.
213  * When this part of the translation is completed, finish is called.  If the
214  * translation state class indicate that the whole translation is complete
215  * then the execution context is informed.
216  */
217 template <class ExecContextPtr>
218 class DataTranslation : public BaseTLB::Translation
219 {
220   protected:
221     ExecContextPtr xc;
222     WholeTranslationState *state;
223     int index;
224
225   public:
226     DataTranslation(ExecContextPtr _xc, WholeTranslationState* _state)
227         : xc(_xc), state(_state), index(0)
228     {
229     }
230
231     DataTranslation(ExecContextPtr _xc, WholeTranslationState* _state,
232                     int _index)
233         : xc(_xc), state(_state), index(_index)
234     {
235     }
236
237     /**
238      * Signal the translation state that the translation has been delayed due
239      * to a hw page table walk.  Split requests are transparently handled.
240      */
241     void
242     markDelayed()
243     {
244         state->delay = true;
245     }
246
247     /**
248      * Finish this part of the translation and indicate that the whole
249      * translation is complete if the state says so.
250      */
251     void
252     finish(const Fault &fault, const RequestPtr &req, ThreadContext *tc,
253            BaseTLB::Mode mode)
254     {
255         assert(state);
256         assert(mode == state->mode);
257         if (state->finish(fault, index)) {
258             if (state->getFault() == NoFault) {
259                 // Don't access the request if faulted (due to squash)
260                 req->setTranslateLatency();
261             }
262             xc->finishTranslation(state);
263         }
264         delete this;
265     }
266
267     bool
268     squashed() const
269     {
270         return xc->isSquashed();
271     }
272 };

We can see that the finish function is implemented in the DataTranslation class. It re-invokes another finish function through the state member field (line 257), and once that call indicates that the whole translation is complete, it calls the finishTranslation method of the execution context xc (line 262), regardless of whether a fault was raised during the TLB processing.

Looking at the initiateMemRead function again, a WholeTranslationState instance is passed to the DataTranslation constructor as the state parameter.

Therefore, the state->finish call inside DataTranslation invokes the WholeTranslationState::finish method. Note that WholeTranslationState holds the actual request (or the split requests) whose translation was just resolved by the TLB.

gem5/src/cpu/translation.hh

 51 /**
 52  * This class captures the state of an address translation.  A translation
 53  * can be split in two if the ISA supports it and the memory access crosses
 54  * a page boundary.  In this case, this class is shared by two data
 55  * translations (below).  Otherwise it is used by a single data translation
 56  * class.  When each part of the translation is finished, the finish
 57  * function is called which will indicate whether the whole translation is
 58  * completed or not.  There are also functions for accessing parts of the
 59  * translation state which deal with the possible split correctly.
 60  */
 61 class WholeTranslationState
 62 {
 63   protected:
 64     int outstanding;
 65     Fault faults[2];
 66
 67   public:
 68     bool delay;
 69     bool isSplit;
 70     RequestPtr mainReq;
 71     RequestPtr sreqLow;
 72     RequestPtr sreqHigh;
 73     uint8_t *data;
 74     uint64_t *res;
 75     BaseTLB::Mode mode;
 76
 77     /**
 78      * Single translation state.  We set the number of outstanding
 79      * translations to one and indicate that it is not split.
 80      */
 81     WholeTranslationState(const RequestPtr &_req, uint8_t *_data,
 82                           uint64_t *_res, BaseTLB::Mode _mode)
 83         : outstanding(1), delay(false), isSplit(false), mainReq(_req),
 84           sreqLow(NULL), sreqHigh(NULL), data(_data), res(_res), mode(_mode)
 85     {
 86         faults[0] = faults[1] = NoFault;
 87         assert(mode == BaseTLB::Read || mode == BaseTLB::Write);
 88     }
 89
 90     /**
 91      * Split translation state.  We copy all state into this class, set the
 92      * number of outstanding translations to two and then mark this as a
 93      * split translation.
 94      */
 95     WholeTranslationState(const RequestPtr &_req, const RequestPtr &_sreqLow,
 96                           const RequestPtr &_sreqHigh, uint8_t *_data,
 97                           uint64_t *_res, BaseTLB::Mode _mode)
 98         : outstanding(2), delay(false), isSplit(true), mainReq(_req),
 99           sreqLow(_sreqLow), sreqHigh(_sreqHigh), data(_data), res(_res),
100           mode(_mode)
101     {
102         faults[0] = faults[1] = NoFault;
103         assert(mode == BaseTLB::Read || mode == BaseTLB::Write);
104     }
105
106     /**
107      * Finish part of a translation.  If there is only one request then this
108      * translation is completed.  If the request has been split in two then
109      * the outstanding count determines whether the translation is complete.
110      * In this case, flags from the split request are copied to the main
111      * request to make it easier to access them later on.
112      */
113     bool
114     finish(const Fault &fault, int index)
115     {
116         assert(outstanding);
117         faults[index] = fault;
118         outstanding--;
119         if (isSplit && outstanding == 0) {
120
121             // For ease later, we copy some state to the main request.
122             if (faults[0] == NoFault) {
123                 mainReq->setPaddr(sreqLow->getPaddr());
124             }
125             mainReq->setFlags(sreqLow->getFlags());
126             mainReq->setFlags(sreqHigh->getFlags());
127         }
128         return outstanding == 0;
129     }

The finish function of WholeTranslationState stores the generated fault in its internal faults array (line 117). After the fault has been recorded, the remaining part of DataTranslation’s finish function invokes xc->finishTranslation(state). Note that finishTranslation takes the WholeTranslationState instance as its state argument.

To understand the details, we have to look at what the xc variable is. Because DataTranslation is declared as a template class and xc has the template parameter type, xc is an instance of whatever type the template was instantiated with.

Now it is the processor’s turn, not the TLB’s

DataTranslation as an interface to interact with CPU

 466         DataTranslation<TimingSimpleCPU *> *translation
 467             = new DataTranslation<TimingSimpleCPU *>(this, state);

Because the translation variable has been instantiated as DataTranslation&lt;TimingSimpleCPU *&gt;, the xc variable is a TimingSimpleCPU pointer. Therefore, when xc->finishTranslation(state) is called, it invokes the TimingSimpleCPU::finishTranslation function. Note that we are jumping back into the CPU code from the TLB module.
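The consequence is that the callback into the CPU is resolved at compile time, with no virtual dispatch from the translation object back to the CPU model. A stripped-down, hypothetical illustration of the same pattern (none of the names below are gem5 code):

#include <iostream>

struct FakeState {};                       // stands in for WholeTranslationState

// The "translation" calls back into whatever CPU type it was instantiated
// with; the call is resolved statically, like DataTranslation<TimingSimpleCPU *>.
template <class ExecContextPtr>
class TranslationSketch
{
    ExecContextPtr xc;
  public:
    explicit TranslationSketch(ExecContextPtr _xc) : xc(_xc) {}
    void finish(FakeState *state) { xc->finishTranslation(state); }
};

struct FakeTimingCPU
{
    void finishTranslation(FakeState *) { std::cout << "CPU resumes\n"; }
};

int main()
{
    FakeTimingCPU cpu;
    TranslationSketch<FakeTimingCPU *> t(&cpu);
    t.finish(nullptr);
    return 0;
}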

What the CPU has to do after the TLB finishes its job

 627 void
 628 TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
 629 {
 630     _status = BaseSimpleCPU::Running;
 631
 632     if (state->getFault() != NoFault) {
 633         if (state->isPrefetch()) {
 634             state->setNoFault();
 635         }
 636         delete [] state->data;
 637         state->deleteReqs();
 638         translationFault(state->getFault());
 639     } else {
 640         if (!state->isSplit) {
 641             sendData(state->mainReq, state->data, state->res,
 642                      state->mode == BaseTLB::Read);
 643         } else {
 644             sendSplitData(state->sreqLow, state->sreqHigh, state->mainReq,
 645                           state->data, state->mode == BaseTLB::Read);
 646         }
 647     }
 648
 649     delete state;
 650 }

When there is a translation fault, it ends up invoking the translationFault function of the CPU with the previously stored fault (line 638). Note that the state->getFault method returns the fault previously stored by WholeTranslationState’s finish. When the translation was triggered by a prefetch instruction, the generated fault is suppressed because it is not critical for execution.

However, when no fault was encountered during the translation, it invokes the sendData function, which we will cover later.

Let the CPU handle the TLB fault

 361 void
 362 TimingSimpleCPU::translationFault(const Fault &fault)
 363 {
 364     // fault may be NoFault in cases where a fault is suppressed,
 365     // for instance prefetches.
 366     updateCycleCounts();
 367     updateCycleCounters(BaseCPU::CPU_STATE_ON);
 368
 369     if (traceData) {
 370         // Since there was a fault, we shouldn't trace this instruction.
 371         delete traceData;
 372         traceData = NULL;
 373     }
 374
 375     postExecute();
 376
 377     advanceInst(fault);
 378 }

The translationFault function invokes the postExecute and advanceInst functions. From the function argument we can infer that advanceInst is what actually deals with the fault. The postExecute function does not do anything to advance the pipeline; it only updates processor statistics such as the power model state, load instruction counters, etc. Therefore, let’s jump into the advanceInst function.

advanceInst to process generated translation fault

gem5/src/cpu/simple/timing.cc

 734 void
 735 TimingSimpleCPU::advanceInst(const Fault &fault)
 736 {
 737     SimpleExecContext &t_info = *threadInfo[curThread];
 738
 739     if (_status == Faulting)
 740         return;
 741
 742     if (fault != NoFault) {
 743         DPRINTF(SimpleCPU, "Fault occured. Handling the fault\n");
 744
 745         advancePC(fault);
 746
 747         // A syscall fault could suspend this CPU (e.g., futex_wait)
 748         // If the _status is not Idle, schedule an event to fetch the next
 749         // instruction after 'stall' ticks.
 750         // If the cpu has been suspended (i.e., _status == Idle), another
 751         // cpu will wake this cpu up later.
 752         if (_status != Idle) {
 753             DPRINTF(SimpleCPU, "Scheduling fetch event after the Fault\n");
 754
 755             Tick stall = dynamic_pointer_cast<SyscallRetryFault>(fault) ?
 756                          clockEdge(syscallRetryLatency) : clockEdge();
 757             reschedule(fetchEvent, stall, true);
 758             _status = Faulting;
 759         }
 760
 761         return;
 762     }
 763
 764     if (!t_info.stayAtPC)
 765         advancePC(fault);
 766
 767     if (tryCompleteDrain())
 768         return;
 769
 770     if (_status == BaseSimpleCPU::Running) {
 771         // kick off fetch of next instruction... callback from icache
 772         // response will cause that instruction to be executed,
 773         // keeping the CPU running.
 774         fetch();
 775     }
 776 }

When there is a pending translation fault, advanceInst delegates the fault to the advancePC function, which actually controls the PC state of the CPU. Since TimingSimpleCPU inherits this function from BaseSimpleCPU, we will look at the BaseSimpleCPU class.

gem5/src/cpu/simple/base.cc

661 void
662 BaseSimpleCPU::advancePC(const Fault &fault)
663 {
664     SimpleExecContext &t_info = *threadInfo[curThread];
665     SimpleThread* thread = t_info.thread;
666
667     const bool branching(thread->pcState().branching());
668
669     //Since we're moving to a new pc, zero out the offset
670     t_info.fetchOffset = 0;
671     if (fault != NoFault) {
672         curMacroStaticInst = StaticInst::nullStaticInstPtr;
673         fault->invoke(threadContexts[curThread], curStaticInst);
674         thread->decoder.reset();
675     } else {
676         if (curStaticInst) {
677             if (curStaticInst->isLastMicroop())
678                 curMacroStaticInst = StaticInst::nullStaticInstPtr;
679             TheISA::PCState pcState = thread->pcState();
680             TheISA::advancePC(pcState, curStaticInst);
681             thread->pcState(pcState);
682         }
683     }
684
685     if (branchPred && curStaticInst && curStaticInst->isControl()) {
686         // Use a fake sequence number since we only have one
687         // instruction in flight at the same time.
688         const InstSeqNum cur_sn(0);
689
690         if (t_info.predPC == thread->pcState()) {
691             // Correctly predicted branch
692             branchPred->update(cur_sn, curThread);
693         } else {
694             // Mis-predicted branch
695             branchPred->squash(cur_sn, thread->pcState(), branching, curThread);
696             ++t_info.numBranchMispred;
697         }
698     }
699 }

In general, the advancePC function updates the current CPU context. Depending on whether a fault has been raised or not, it chooses different paths to handle the fault and redirect the PC. The invoke function called through the fault object handles the generated fault, usually with the help of pre-defined ROM microcode. It also resets the decoder and sets curMacroStaticInst to null, because execution has to continue from a new PC after the fault is handled.

On the other hand, on the usual path where no fault was raised during the current instruction’s execution, it simply advances the PC state of the processor to the next (micro)instruction (lines 676-682).

pre-defined ROM code handles generated fault!

Then let’s take a look at how the fault can be handled by the invoke function implemented in the fault class.

gem5/src/arch/x86/faults.cc

 53 namespace X86ISA
 54 {
 55     void X86FaultBase::invoke(ThreadContext * tc, const StaticInstPtr &inst)
 56     {
 57         if (!FullSystem) {
 58             FaultBase::invoke(tc, inst);
 59             return;
 60         }
 61
 62         PCState pcState = tc->pcState();
 63         Addr pc = pcState.pc();
 64         DPRINTF(Faults, "RIP %#x: vector %d: %s\n",
 65                 pc, vector, describe());
 66         using namespace X86ISAInst::RomLabels;
 67         HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
 68         MicroPC entry;
 69         if (m5reg.mode == LongMode) {
 70             if (isSoft()) {
 71                 entry = extern_label_longModeSoftInterrupt;
 72             } else {
 73                 entry = extern_label_longModeInterrupt;
 74             }
 75         } else {
 76             entry = extern_label_legacyModeInterrupt;
 77         }
 78         tc->setIntReg(INTREG_MICRO(1), vector);
 79         tc->setIntReg(INTREG_MICRO(7), pc);
 80         if (errorCode != (uint64_t)(-1)) {
 81             if (m5reg.mode == LongMode) {
 82                 entry = extern_label_longModeInterruptWithError;
 83             } else {
 84                 panic("Legacy mode interrupts with error codes "
 85                         "aren't implementde.\n");
 86             }
 87             // Software interrupts shouldn't have error codes. If one
 88             // does, there would need to be microcode to set it up.
 89             assert(!isSoft());
 90             tc->setIntReg(INTREG_MICRO(15), errorCode);
 91         }
 92         pcState.upc(romMicroPC(entry));
 93         pcState.nupc(romMicroPC(entry) + 1);
 94         tc->pcState(pcState);
 95     }

To understand the behavior of the fault’s invoke function, we have to look at the fault-related classes first. gem5 provides a base interface for every fault defined in the x86 architecture. x86 defines several types of events that can interrupt the execution flow: faults, aborts, traps, and interrupts. All of these events inherit from the base x86 fault class X86FaultBase, which provides the general interface and semantics of x86 fault events.

Depending on the type of event, classes inheriting from X86FaultBase can override the invoke function to define their own semantics. For example, the PageFault class inherits from X86FaultBase and overrides invoke to add its own page-fault-related behavior before calling the parent’s invoke provided by X86FaultBase.

invoke changes the current RIP to pre-defined microops

Basically, the invoke function makes the processor jump to a pre-defined microcode routine that implements the actual semantics of x86 fault handling. When a fault or interrupt is reported to the processor, it must first save the current context of the processor, and then transfer control flow to the designated fault handler, which is located through the IDT (pointed to by the IDTR register) in x86.

To jump to the pre-defined ROM code, the invoke function makes use of ROM labels that statically index into the sequence of x86 microops stored in the microcode ROM. All the available ROM labels are defined in the RomLabels namespace, as shown below.

gem5/build/X86/arch/x86/generated/decoder-ns.hh.inc

 4587 namespace RomLabels {
 4588 const static uint64_t label_longModeSoftInterrupt_stackSwitched = 92;
 4589 const static uint64_t label_longModeInterrupt_processDescriptor = 11;
 4590 const static uint64_t label_longModeInterruptWithError_cplStackSwitch = 152;
 4591 const static uint64_t label_longModeInterrupt_istStackSwitch = 28;
 4592 const static uint64_t label_jmpFarWork = 192;
 4593 const static uint64_t label_farJmpSystemDescriptor = 207;
 4594 const static uint64_t label_longModeSoftInterrupt_globalDescriptor = 71;
 4595 const static uint64_t label_farJmpGlobalDescriptor = 199;
 4596 const static uint64_t label_initIntHalt = 186;
 4597 const static uint64_t label_longModeInterruptWithError_istStackSwitch = 150;
 4598 const static uint64_t label_legacyModeInterrupt = 184;
 4599 const static uint64_t label_longModeInterruptWithError_globalDescriptor = 132;
 4600 const static uint64_t label_longModeSoftInterrupt_processDescriptor = 72;
 4601 const static uint64_t label_longModeInterruptWithError = 122;
 4602 const static uint64_t label_farJmpProcessDescriptor = 200;
 4603 const static uint64_t label_longModeSoftInterrupt = 61;
 4604 const static uint64_t label_longModeSoftInterrupt_istStackSwitch = 89;
 4605 const static uint64_t label_longModeInterrupt_globalDescriptor = 10;
 4606 const static uint64_t label_longModeInterrupt_cplStackSwitch = 30;
 4607 const static uint64_t label_longModeInterrupt = 0;
 4608 const static uint64_t label_longModeInterruptWithError_processDescriptor = 133;
 4609 const static uint64_t label_longModeInterruptWithError_stackSwitched = 153;
 4610 const static uint64_t label_longModeInterrupt_stackSwitched = 31;
 4611 const static uint64_t label_longModeSoftInterrupt_cplStackSwitch = 91;
 4612 const static MicroPC extern_label_initIntHalt = 186;
 4613 const static MicroPC extern_label_longModeInterruptWithError = 122;
 4614 const static MicroPC extern_label_longModeInterrupt = 0;
 4615 const static MicroPC extern_label_longModeSoftInterrupt = 61;
 4616 const static MicroPC extern_label_legacyModeInterrupt = 184;
 4617 const static MicroPC extern_label_jmpFarWork = 192;
 4618 }

PageFault handling ROM code

Although we have been calling it a translation fault, note that it is represented as a PageFault in x86.

gem5/src/arch/x86/faults.cc

137     void PageFault::invoke(ThreadContext * tc, const StaticInstPtr &inst)
138     {
139         if (FullSystem) {
140             /* Invalidate any matching TLB entries before handling the page fault */
141             tc->getITBPtr()->demapPage(addr, 0);
142             tc->getDTBPtr()->demapPage(addr, 0);
143             HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
144             X86FaultBase::invoke(tc);
145             /*
146              * If something bad happens while trying to enter the page fault
147              * handler, I'm pretty sure that's a double fault and then all
148              * bets are off. That means it should be safe to update this
149              * state now.
150              */
151             if (m5reg.mode == LongMode) {
152                 tc->setMiscReg(MISCREG_CR2, addr);
153             } else {
154                 tc->setMiscReg(MISCREG_CR2, (uint32_t)addr);
155             }
156         } else {
157             PageFaultErrorCode code = errorCode;
158             const char *modeStr = "";
159             if (code.fetch)
160                 modeStr = "execute";
161             else if (code.write)
162                 modeStr = "write";
163             else
164                 modeStr = "read";
165
166             // print information about what we are panic'ing on
167             if (!inst) {
168                 panic("Tried to %s unmapped address %#x.\n", modeStr, addr);
169             } else {
170                 panic("Tried to %s unmapped address %#x.\nPC: %#x, Instr: %s",
171                       modeStr, addr, tc->pcState().pc(),
172                       inst->disassemble(tc->pcState().pc(), debugSymbolTable));
173             }
174         }
175     }
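
The SE-mode branch above tells reads, writes, and instruction fetches apart by inspecting bits of errorCode. For reference, the PageFaultErrorCode bit-union is declared in gem5/src/arch/x86/faults.hh roughly as follows (abridged from the gem5 version I am reading; check your tree for the exact form):

BitUnion32(PageFaultErrorCode)
    Bitfield<0> present;
    Bitfield<1> write;
    Bitfield<2> user;
    Bitfield<3> reserved;
    Bitfield<4> fetch;
EndBitUnion(PageFaultErrorCode)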

Because most of the fault handling logic of the PageFault class overlaps with X86FaultBase, after handling the TLB-related work it simply calls the invoke function of the X86FaultBase class. Because the translation fault we are following occurs in long mode and is not a software interrupt, we will take a look at the ROM label named label_longModeInterrupt.

Pass arguments to the ROM code

Also, before jumping to the ROM label, invoke sets microarchitectural registers to pass the interrupt number and the PC address to the ROM code. Additionally, when the interrupt carries an error code, that code must be passed to the microcode as well.
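
The relevant part of X86FaultBase::invoke in gem5/src/arch/x86/faults.cc looks roughly like the excerpt below (abridged by me; the legacy-mode error-code panic is dropped, so the exact code may differ slightly in your gem5 version):

using namespace X86ISAInst::RomLabels;
HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
MicroPC entry;
if (m5reg.mode == LongMode) {
    entry = isSoft() ? extern_label_longModeSoftInterrupt
                     : extern_label_longModeInterrupt;
} else {
    entry = extern_label_legacyModeInterrupt;
}
tc->setIntReg(INTREG_MICRO(1), vector);           // interrupt/exception vector
tc->setIntReg(INTREG_MICRO(7), pc);               // RIP of the faulting instruction
if (errorCode != (uint64_t)(-1)) {
    entry = extern_label_longModeInterruptWithError;
    tc->setIntReg(INTREG_MICRO(15), errorCode);   // error code, if the fault has one
}
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);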

To pass the arguments to the microcode world, it invokes the setIntReg function defined in the thread context. The thread context is an instance of the SimpleThread class defined in cpu/simple_thread.hh (when you use the O3 out-of-order CPU model, you have to look at the O3ThreadContext class instead). Regardless of your processor model, both classes inherit from the ThreadContext class, which provides the generic register context and the interface for manipulating the registers.

gem5/src/cpu/simple_thread.hh

 98 class SimpleThread : public ThreadState, public ThreadContext
 99 {
100   protected:
101     typedef TheISA::MachInst MachInst;
102     using VecRegContainer = TheISA::VecRegContainer;
103     using VecElem = TheISA::VecElem;
104     using VecPredRegContainer = TheISA::VecPredRegContainer;
105   public:
106     typedef ThreadContext::Status Status;
107
108   protected:
109     std::array<RegVal, TheISA::NumFloatRegs> floatRegs;
110     std::array<RegVal, TheISA::NumIntRegs> intRegs;
111     std::array<VecRegContainer, TheISA::NumVecRegs> vecRegs;
112     std::array<VecPredRegContainer, TheISA::NumVecPredRegs> vecPredRegs;
113     std::array<RegVal, TheISA::NumCCRegs> ccRegs;
114     TheISA::ISA *const isa;    // one "instance" of the current ISA.
115
116     TheISA::PCState _pcState;

477     void
478     setIntReg(RegIndex reg_idx, RegVal val) override
479     {
480         int flatIndex = isa->flattenIntIndex(reg_idx);
481         assert(flatIndex < TheISA::NumIntRegs);
482         DPRINTF(IntRegs, "Setting int reg %d (%d) to %#x.\n",
483                 reg_idx, flatIndex, val);
484         setIntRegFlat(flatIndex, val);
485     }

Detour to TheISA namespace

Although the SimpleThread class can be seen as providing generic registers regardless of architecture, it declares ISA-dependent registers. The magic is the TheISA symbol: it is translated into an architecture-specific namespace depending on the architecture that Gem5 has been compiled for. Let's take a short detour and figure out how the TheISA namespace works.

When you don't know what the TheISA namespace is, you may want to grep for "namespace TheISA" to find the files that define it. Unfortunately, you will only find a very few places where the TheISA namespace is declared, with a handful of member functions. So where do the functions and variables of the TheISA namespace come from? To understand TheISA::, we should look at the build files, not the source files.

build/X86/config/the_isa.hh

  1 #ifndef __CONFIG_THE_ISA_HH__
  2 #define __CONFIG_THE_ISA_HH__
  3
  4 #define ALPHA_ISA 1
  5 #define ARM_ISA 2
  6 #define MIPS_ISA 3
  7 #define NULL_ISA 4
  8 #define POWER_ISA 5
  9 #define RISCV_ISA 6
 10 #define SPARC_ISA 7
 11 #define X86_ISA 8
 12
 13 enum class Arch {
 14   AlphaISA = ALPHA_ISA,
 15   ArmISA = ARM_ISA,
 16   MipsISA = MIPS_ISA,
 17   NullISA = NULL_ISA,
 18   PowerISA = POWER_ISA,
 19   RiscvISA = RISCV_ISA,
 20   SparcISA = SPARC_ISA,
 21   X86ISA = X86_ISA
 22 };
 23
 24 #define THE_ISA X86_ISA
 25 #define TheISA X86ISA
 26 #define THE_ISA_STR "x86"
 27
 28 #endif // __CONFIG_THE_ISA_HH__

Here we can see that TheISA is defined as X86ISA, because I compiled GEM5 with the X86 configuration. When we look at the SConscript files, we can also find a Python function named makeTheISA that generates the contents of the config/the_isa.hh file at build time.

Therefore, when TheISA is used in the CPU-related files, it is not an actual namespace called "TheISA" but the architecture-dependent ISA namespace. Consequently, when you encounter namespace TheISA, first check whether the config/the_isa.hh header has been included in your target source file; if it has, look at the architecture-dependent namespace defined under the gem5/src/arch/YOUR_ARCHITECTURE directory. In my case, because I use X86, it is the X86ISA namespace.
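
As a quick sanity check, the tiny sketch below (my own example, not from the gem5 tree; I am assuming the register constants live in arch/x86/registers.hh, as they do in the version I am reading) shows what the macro expansion means for an X86 build:

#include "config/the_isa.hh"      // generated at build time; contains "#define TheISA X86ISA"
#include "arch/x86/registers.hh"  // declares X86ISA::NumIntRegs

// After preprocessing, the next line is identical to
// "constexpr int regs = X86ISA::NumIntRegs;"
constexpr int regs = TheISA::NumIntRegs;

static_assert(THE_ISA == X86_ISA, "this sketch assumes an X86 build of gem5");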

SetIntReg with TheISA

Now let's go back to the SimpleThread class. In addition to the architecture-specific register context, it provides the setIntReg function, which lets the processor store a value into the intRegs array at the given index.

477     void
478     setIntReg(RegIndex reg_idx, RegVal val) override
479     {
480         int flatIndex = isa->flattenIntIndex(reg_idx);
481         assert(flatIndex < TheISA::NumIntRegs);
482         DPRINTF(IntRegs, "Setting int reg %d (%d) to %#x.\n",
483                 reg_idx, flatIndex, val);
484         setIntRegFlat(flatIndex, val);
485     }

618     void
619     setIntRegFlat(RegIndex idx, RegVal val) override
620     {
621         intRegs[idx] = val;
622     }

Note that val is stored in the intRegs array through the unified setIntReg interface. intRegs contains not only the architectural registers, such as rsi, rdi, and rcx in x86, but also the integer micro-registers used only by the microops.

Because x86 in GEM5 defines 16 integer micro-registers available to the microops (see NumMicroIntRegs in gem5/src/arch/x86/x86_traits.hh), up to 16 integer values can be passed to the microcode through the setIntReg function. As shown in the invoke function, micro-registers 1, 7, and 15 are used to pass the fault-related arguments to the microops.
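
The micro integer registers are simply indexed past the architectural ones. In the gem5 version I am reading, the mapping in gem5/src/arch/x86/regs/int.hh looks roughly like this (abridged; names may differ slightly across versions):

static inline X86IntRegIndex
INTREG_MICRO(int index)
{
    // micro register i lives right after the NUM_INTREGS architectural slots
    return (X86IntRegIndex)(NUM_INTREGS + index);
}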

Jump to the ROM code!

After the required parameters are set, the processor jumps to the ROM code pointed to by the label. This control flow transition is done by updating the _pcState member field of the SimpleThread class object.

gem5/src/arch/x86/faults.cc

 92         pcState.upc(romMicroPC(entry));
 93         pcState.nupc(romMicroPC(entry) + 1);
 94         tc->pcState(pcState);
 95     }

Looking at the above code from the invoke function of the X86FaultBase class, we can see that it updates the upc field of the pcState to the location of the ROM code.

gem5/src/base/types.hh

144 typedef uint16_t MicroPC;
145
146 static const MicroPC MicroPCRomBit = 1 << (sizeof(MicroPC) * 8 - 1);
147
148 static inline MicroPC
149 romMicroPC(MicroPC upc)
150 {
151     return upc | MicroPCRomBit;
152 }
153
154 static inline MicroPC
155 normalMicroPC(MicroPC upc)
156 {
157     return upc & ~MicroPCRomBit;
158 }
159
160 static inline bool
161 isRomMicroPC(MicroPC upc)
162 {
163     return MicroPCRomBit & upc;
164 }

Note that the romMicroPC function sets a flag bit to indicate that the upc points into the microcode ROM. The flag, MicroPCRomBit (the top bit of the 16-bit MicroPC), is simply bitwise ORed into the upc value.
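
As a standalone worked example (my own code, restating the helpers above so it compiles on its own), here is what the ROM bit does for label_longModeInterrupt, whose entry is 0 in the generated RomLabels shown earlier:

#include <cassert>
#include <cstdint>

typedef uint16_t MicroPC;
static const MicroPC MicroPCRomBit = 1 << (sizeof(MicroPC) * 8 - 1);        // 0x8000
static MicroPC romMicroPC(MicroPC upc)    { return upc | MicroPCRomBit; }
static MicroPC normalMicroPC(MicroPC upc) { return upc & ~MicroPCRomBit; }
static bool    isRomMicroPC(MicroPC upc)  { return MicroPCRomBit & upc; }

int main()
{
    const MicroPC entry = 0;                 // label_longModeInterrupt
    const MicroPC upc = romMicroPC(entry);   // 0 | 0x8000 == 0x8000
    assert(isRomMicroPC(upc));               // the fetch logic sees the ROM bit
    assert(normalMicroPC(upc) == entry);     // the ROM offset is still recoverable
    return 0;
}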

arch/generic/types.hh

193 // A PC and microcode PC.
194 template <class MachInst>
195 class UPCState : public SimplePCState<MachInst>
196 {
197   protected:
198     typedef SimplePCState<MachInst> Base;
199 
200     MicroPC _upc;
201     MicroPC _nupc;
202 
203   public:
204 
205     MicroPC upc() const { return _upc; }
206     void upc(MicroPC val) { _upc = val; }
207 
208     MicroPC nupc() const { return _nupc; }
209     void nupc(MicroPC val) { _nupc = val; }

After the ROM upc value is computed, the pcState has to be updated. Rather than modifying the thread context's PC state in place, the invoke function works on a local copy of the PCState: it calls upc() and nupc() on that copy and then writes it back with tc->pcState(pcState). This updates the _pcState member field of the thread context, so the processor will run from the new micro PC address when the next fetch happens.

However, note that this function only updates the _pcState member field of the ThreadContext. So who actually redirects the pipeline to fetch the new instructions from the ROM instead of from the faulting instruction? Let's go back to the advancePC function that called the invoke function.

Let’s go back to advancePC & advanceInst

gem5/src/cpu/simple/base.cc

673         fault->invoke(threadContexts[curThread], curStaticInst);
674         thread->decoder.reset();

After the invoke function is called as part of advancePC, the decoder is reset, which puts the decoder state back to ResetState.

gem5/src/cpu/simple/timing.cc

 730 void
 731 TimingSimpleCPU::advanceInst(const Fault &fault)
 732 {
 733     SimpleExecContext &t_info = *threadInfo[curThread];
 734
 735     if (_status == Faulting)
 736         return;
 737
 738     if (fault != NoFault) {
 739         DPRINTF(SimpleCPU, "Fault occured. Handling the fault\n");
 740
 741         advancePC(fault);
 746
 747         // A syscall fault could suspend this CPU (e.g., futex_wait)
 748         // If the _status is not Idle, schedule an event to fetch the next
 749         // instruction after 'stall' ticks.
 750         // If the cpu has been suspended (i.e., _status == Idle), another
 751         // cpu will wake this cpu up later.
 752         if (_status != Idle) {
 753             DPRINTF(SimpleCPU, "Scheduling fetch event after the Fault\n");
 754
 755             Tick stall = dynamic_pointer_cast<SyscallRetryFault>(fault) ?
 756                          clockEdge(syscallRetryLatency) : clockEdge();
 757             reschedule(fetchEvent, stall, true);
 758             _status = Faulting;
 759         }
 760
 761         return;
 762     }

After returning from advancePC, the advanceInst function checks the status of the current processor. When the processor is not in the idle state, it reschedules fetchEvent to run at the tick stored in stall (the next clock edge, or a later edge for a SyscallRetryFault). Also note that the status of the processor is changed to Faulting.

fetchEvent invokes fetch() function

By the way, what is fetchEvent? The constructor below (gem5/src/cpu/simple/timing.cc) shows how it is set up.

  79 TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams *p)
  80     : BaseSimpleCPU(p), fetchTranslation(this), icachePort(this),
  81       dcachePort(this), ifetch_pkt(NULL), dcache_pkt(NULL), previousCycle(0),
  82       fetchEvent([this]{ fetch(); }, name())
  83 {
  84     _status = Idle;
  85 }

fetchEvent is an EventFunctionWrapper, the type used for registering events in GEM5. The TimingSimpleCPU constructor binds it to a lambda that calls the fetch() function, so when the scheduled tick arrives, the event fires and fetch() is invoked to fetch the next instruction.
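
For intuition, here is a hypothetical, gem5-flavored sketch (my own code, not from the tree; MyObject, kick, and fetch are made-up names) of the same pattern TimingSimpleCPU uses: an EventFunctionWrapper member bound to a lambda and re-armed with reschedule:

#include "base/types.hh"
#include "sim/eventq.hh"
#include "sim/sim_object.hh"

class MyObject : public SimObject
{
    // Runs the bound lambda when the event is processed.
    EventFunctionWrapper fetchEvent;

  public:
    MyObject(const SimObjectParams *p)
        : SimObject(p), fetchEvent([this]{ fetch(); }, name())
    {}

    void fetch() { /* work to run when the event fires */ }

    void kick(Tick when)
    {
        // The third argument ("always") schedules the event even if it is
        // not currently pending, which is why advanceInst can call
        // reschedule() unconditionally after a fault.
        reschedule(fetchEvent, when, true);
    }
};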

Now start to fetch from the updated micro PC: the ROM code!

 653 void
 654 TimingSimpleCPU::fetch()
 655 {
 656     // Change thread if multi-threaded
 657     swapActiveThread();
 658
 659     SimpleExecContext &t_info = *threadInfo[curThread];
 660     SimpleThread* thread = t_info.thread;
 661
 662     DPRINTF(SimpleCPU, "Fetch\n");
 663
 664     if (!curStaticInst || !curStaticInst->isDelayedCommit()) {
 665         checkForInterrupts();
 666         checkPcEventQueue();
 667     }
 668
 669     // We must have just got suspended by a PC event
 670     if (_status == Idle)
 671         return;
 672
 673     TheISA::PCState pcState = thread->pcState();
 674     bool needToFetch = !isRomMicroPC(pcState.microPC()) &&
 675                        !curMacroStaticInst;
 676
 677     if (needToFetch) {
 678         _status = BaseSimpleCPU::Running;
 679         RequestPtr ifetch_req = std::make_shared<Request>();
 680         ifetch_req->taskId(taskId());
 681         ifetch_req->setContext(thread->contextId());
 682         setupFetchRequest(ifetch_req);
 683         DPRINTF(SimpleCPU, "Translating address %#x\n", ifetch_req->getVaddr());
 684         thread->itb->translateTiming(ifetch_req, thread->getTC(),
 685                 &fetchTranslation, BaseTLB::Execute);
 686     } else {
 687         _status = IcacheWaitResponse;
 688         completeIfetch(NULL);
 689
 690         updateCycleCounts();
 691         updateCycleCounters(BaseCPU::CPU_STATE_ON);
 692     }
 693 }

Remember that curMacroStaticInst has been set to StaticInst::nullStaticInstPtr by advancePC, and that the upc has been updated to the ROM address with the MicroPCRomBit flag set. As a result, needToFetch evaluates to false, so fetch() takes the else branch and calls completeIfetch(NULL): the microops are supplied directly from the microcode ROM rather than being fetched from memory.
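
Spelled out (my own annotation, not gem5 source), the condition evaluates like this when fetch() runs after PageFault::invoke:

// pcState.microPC() now has MicroPCRomBit set, so:
bool needToFetch = !isRomMicroPC(pcState.microPC())   // !true -> false
                   && !curMacroStaticInst;            // short-circuited away
// needToFetch == false: no icache request is issued; completeIfetch(NULL)
// runs and the decoder starts emitting microops from the microcode ROM.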
