
gem5 X86 TLB


layout: post title: “Pagetable walking and pagefault handling in gem5” categories: GEM5, TLB — In this post, we take a look at how memory accesses are resolved through the TLB and page table walking.

Who initiates TLB access?

The TLB maintains virtual-to-physical address translation information so that the entire page table does not have to be walked on every memory access. In other words, it is a cache of virtual-to-physical mappings, usually maintained by the processor itself. So which part of the CPU logic initiates a TLB access, and what operations does the TLB component have to perform?
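As a purely conceptual sketch (not gem5’s actual implementation), a TLB can be pictured as a small map from virtual page numbers to physical page numbers; all names below are hypothetical.

#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical mental model of a TLB: a cache of VPN -> PPN mappings.
// gem5's real x86 TLB stores full TlbEntry objects with permission bits.
struct TinyTlb {
    std::unordered_map<uint64_t, uint64_t> entries; // VPN -> PPN

    std::optional<uint64_t>
    lookup(uint64_t vaddr, unsigned pageShift = 12) const
    {
        auto it = entries.find(vaddr >> pageShift);
        if (it == entries.end())
            return std::nullopt;                    // miss: walk the page table
        uint64_t offMask = (1ULL << pageShift) - 1;
        return (it->second << pageShift) | (vaddr & offMask); // hit: paddr
    }
};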

Interface between CPU pipeline and TLB component

 425 Fault
 426 TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
 427                                  Request::Flags flags,
 428                                  const std::vector<bool>& byte_enable)
 429 {
 430     SimpleExecContext &t_info = *threadInfo[curThread];
 431     SimpleThread* thread = t_info.thread;
 432 
 439     Fault fault;
 440     const int asid = 0;
 441     const Addr pc = thread->instAddr();
 442     unsigned block_size = cacheLineSize();
 443     BaseTLB::Mode mode = BaseTLB::Read;
 444 
 445     if (traceData)
 446         traceData->setMem(addr, size, flags);
 447 
 448     RequestPtr req = std::make_shared<Request>(
 449         asid, addr, size, flags, dataMasterId(), pc,
 450         thread->contextId());
 451     if (!byte_enable.empty()) {
 452         req->setByteEnable(byte_enable);
 453     }
 454    
 455     req->taskId(taskId());
 456 
 457     Addr split_addr = roundDown(addr + size - 1, block_size);
 458     assert(split_addr <= addr || split_addr - addr < block_size);
 459                                  
 460     _status = DTBWaitResponse;
 461     if (split_addr > addr) {
 462         RequestPtr req1, req2;
 463         assert(!req->isLLSC() && !req->isSwap());
 464         req->splitOnVaddr(split_addr, req1, req2);
 465    
 466         WholeTranslationState *state =
 467             new WholeTranslationState(req, req1, req2, new uint8_t[size],
 468                                       NULL, mode);
 469         DataTranslation<TimingSimpleCPU *> *trans1 =
 470             new DataTranslation<TimingSimpleCPU *>(this, state, 0);
 471         DataTranslation<TimingSimpleCPU *> *trans2 =
 472             new DataTranslation<TimingSimpleCPU *>(this, state, 1);
 473 
 474         thread->dtb->translateTiming(req1, thread->getTC(), trans1, mode);
 475         thread->dtb->translateTiming(req2, thread->getTC(), trans2, mode);
 476     } else {
 477         WholeTranslationState *state =
 478             new WholeTranslationState(req, new uint8_t[size], NULL, mode);
 479         DataTranslation<TimingSimpleCPU *> *translation
 480             = new DataTranslation<TimingSimpleCPU *>(this, state);
 481         thread->dtb->translateTiming(req, thread->getTC(), translation, mode);
 482     }
 483 
 484     return NoFault;
 485 }

One of the most important basic capabilities of a processor is accessing memory. gem5 makes each processor model implement its own memory access building blocks as member functions of the corresponding CPU class. We are going to look at a simple processor, TimingSimpleCPU, and its memory read function, initiateMemRead. Note that at the end of the initiateMemRead function it creates a DataTranslation object and passes it to the translateTiming function of the processor’s data TLB component. This translation object will be used to process the current TLB access request. Also note that translateTiming needs the ThreadContext to execute the TLB access and a RequestPtr object containing all the memory access request information, such as the virtual address.
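As a side note on lines 457-461, the split path triggers only when the access crosses a cache-line boundary. A hypothetical worked example, assuming a 64-byte cache line and a local re-implementation of gem5’s roundDown helper:

#include <cassert>
#include <cstdint>

// Local stand-in for gem5's roundDown helper: clear the low alignment bits.
static uint64_t roundDown(uint64_t val, uint64_t align) { return val & ~(align - 1); }

int main()
{
    const uint64_t block_size = 64;                   // assumed cache line size
    uint64_t addr = 0x7C;                             // last 4 bytes of one line...
    uint64_t size = 8;                                // ...plus 4 bytes of the next
    uint64_t split_addr = roundDown(addr + size - 1, block_size);
    assert(split_addr == 0x80 && split_addr > addr);  // crosses the line boundary,
                                                      // so two requests and two
                                                      // DataTranslations are created
    return 0;
}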

It’s all about TLB! No actual memory access to the virtual address!

The initiateMemRead function does not initiate an actual memory access; it only asks the TLB component to produce the virtual-to-physical address mapping, either from its TLB cache or via a page table walk.

The name initiateMemRead can be confusing, but the actual memory access can only occur after the TLB request has been successfully resolved. I will describe how the actual memory access happens in this posting []. Keep in mind that here we will only focus on the translation part!

//gem5/src/arch/x86/tlb.cc

441 void
442 TLB::translateTiming(const RequestPtr &req, ThreadContext *tc,
443         Translation *translation, Mode mode)
444 {
445     bool delayedResponse;
446     assert(translation);
447     Fault fault =
448         TLB::translate(req, tc, translation, mode, delayedResponse, true);
449
450     if (!delayedResponse)
451         translation->finish(fault, req, tc, mode);
452     else
453         translation->markDelayed();
454 }

Since we assume gem5 is compiled for the X86 architecture, this invokes the X86 TLB implementation. Please be aware that the translateTiming function is implemented as part of the TLB class, indicating that we are now working inside the TLB component, having left the processor pipeline. The actual translation is done by the TLB::translate function. Depending on whether the target virtual address has previously been resolved and its mapping cached in the TLB, the function either retrieves the TLB entry from the cache or obtains it by walking the page table.
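The translation argument can be any subclass of BaseTLB::Translation; the only contract translateTiming relies on is the pair of callbacks used above. A minimal hypothetical consumer could look like the sketch below, assuming the gem5 headers that declare BaseTLB, Fault and RequestPtr are available (the real consumer used by TimingSimpleCPU is the DataTranslation class shown later in this post).

// Hypothetical minimal Translation consumer, only to illustrate the callback
// contract used by translateTiming(): markDelayed() when a page table walk is
// pending, finish() once the fault / physical address is known.
class MyTranslation : public BaseTLB::Translation
{
  public:
    void markDelayed() { delayed = true; }

    void
    finish(const Fault &fault, const RequestPtr &req, ThreadContext *tc,
           BaseTLB::Mode mode)
    {
        // If fault == NoFault, req->getPaddr() is now valid and the CPU model
        // can issue the actual memory access for this request.
    }

    bool squashed() const { return false; }

  private:
    bool delayed = false;
};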

// gem5/src/arch/x86/tlb.cc

277 Fault
278 TLB::translate(const RequestPtr &req,
279         ThreadContext *tc, Translation *translation,
280         Mode mode, bool &delayedResponse, bool timing)
281 {
282     Request::Flags flags = req->getFlags();
283     int seg = flags & SegmentFlagMask;
284     bool storeCheck = flags & (StoreCheck << FlagShift);
...
341         // If paging is enabled, do the translation.
342         if (m5Reg.paging) {
343             DPRINTF(TLB, "Paging enabled.\n");
344             // The vaddr already has the segment base applied.
345             TlbEntry *entry = lookup(vaddr);
346             if (mode == Read) {
347                 rdAccesses++;
348             } else {
349                 wrAccesses++;
350             }
351             if (!entry) {
352                 DPRINTF(TLB, "Handling a TLB miss for "
353                         "address %#x at pc %#x.\n",
354                         vaddr, tc->instAddr());
355                 if (mode == Read) {
356                     rdMisses++;
357                 } else {
358                     wrMisses++;
359                 }
360                 if (FullSystem) {
361                     Fault fault = walker->start(tc, translation, req, mode);
362                     if (timing || fault != NoFault) {
363                         // This gets ignored in atomic mode.
364                         delayedResponse = true;
365                         return fault;
366                     }
367                     entry = lookup(vaddr);
368                     assert(entry);
369                 } else {

The first step in the translate function is a TLB lookup to check whether the needed translation entry is present (line 345). If the entry is absent, the function walks the page table, which is stored in memory, to acquire the virtual-to-physical translation (lines 351-395). Since we are interested in full-system emulation, I will focus on the FullSystem parts of TLB handling. In gem5’s full-system mode, when a TLB miss occurs, the page table is walked by the “pagetable_walker” object (line 361). Note that the “req” parameter is passed to the pagetable_walker because it contains all the essential information, including the address and flags, needed to correctly resolve the memory access.

Page table walking in TLB

When an address is accessed for the first time, or its TLB entry has been evicted from the TLB cache, the page table must be traversed to obtain the virtual-to-physical mapping. Let’s examine how the TLB walks the page table and retrieves the final-level page table entry.

WalkerState per request

In contrast to simpler operations, it’s typically not possible to resolve TLB misses in a single cycle.

As the page table is structured with multiple levels, the page table walking demands numerous memory accesses. These accesses are essential for reaching the leaf page table entry that contains the virtual-to-physical mapping and other pertinent flags.
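Conceptually, the walk is a chain of dependent loads: the address of each level’s entry only becomes known once the previous level’s entry has been read, so the levels cannot be fetched in parallel. Before diving into the gem5 code, here is a hypothetical sketch of a 4-level long-mode walk (ignoring faults, large pages and accessed/dirty-bit updates; readPhys stands in for the walker’s memory packets).

#include <cstdint>
#include <functional>

// Hypothetical sketch of a 4-level x86-64 walk. Each iteration's address
// depends on the PTE fetched in the previous iteration.
uint64_t
walkSketch(uint64_t cr3, uint64_t vaddr,
           const std::function<uint64_t(uint64_t)> &readPhys)
{
    uint64_t table = cr3 & ~0xfffULL;                  // PML4 base from CR3
    for (int level = 4; level >= 1; --level) {
        uint64_t idx = (vaddr >> (12 + 9 * (level - 1))) & 0x1ff; // 9-bit index
        uint64_t pte = readPhys(table + idx * 8);      // dependent load
        table = pte & (((1ULL << 40) - 1) << 12);      // next table / frame base
    }
    return table | (vaddr & 0xfff);                    // paddr of a 4 KB page
}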

//gem5/src/arch/x86/pagetable_walker.cc

 71 Fault
 72 Walker::start(ThreadContext * _tc, BaseTLB::Translation *_translation,
 73               const RequestPtr &_req, BaseTLB::Mode _mode)
 74 {
 75     // TODO: in timing mode, instead of blocking when there are other
 76     // outstanding requests, see if this request can be coalesced with
 77     // another one (i.e. either coalesce or start walk)
 78     WalkerState * newState = new WalkerState(this, _translation, _req);
 79     newState->initState(_tc, _mode, sys->isTimingMode());
 80     if (currStates.size()) {
 81         assert(newState->isTiming());
 82         DPRINTF(PageTableWalker, "Walks in progress: %d\n", currStates.size());
 83         currStates.push_back(newState);
 84         return NoFault;
 85     } else {
 86         currStates.push_back(newState);
 87         Fault fault = newState->startWalk();
 88         if (!newState->isTiming()) {
 89             currStates.pop_front();
 90             delete newState;
 91         }
 92         return fault;
 93     }
 94 }

It is important to note that TLB misses can occur simultaneously because multiple processors might try to access memory addresses for which the virtual-to-physical mapping is not stored in the TLB cache. Additionally, since a request cannot be handled in a single clock cycle, the state of the page table walk must be stored for each request. The WalkerState is employed for this specific purpose, maintaining all the information needed for page table walking on a per-request basis.

The “currStates” list keeps track of all the outstanding requests, i.e., those that were issued previously but have not yet been resolved. If there are any unresolved TLB misses, the current request is simply appended to the list and waits until the preceding requests have been resolved (lines 80-84). Once the outstanding request has been resolved, the pending requests are processed one after another.

If there are no remaining requests in the list (lines 85-92), the newly created state is added and its “startWalk” function is called. When “startWalk” returns, a timing CPU does not remove the current state from the “currStates” list here, because it is removed later when the walk’s response comes back; for a non-timing (atomic) CPU the state is popped and deleted immediately.

startWalk, initiating page table walking

gem5/src/arch/x86/pagetable_walker.cc

229 Fault
230 Walker::WalkerState::startWalk()
231 {
232     Fault fault = NoFault;
233     assert(!started);
234     started = true;
235     setupWalk(req->getVaddr());
236     if (timing) {
237         nextState = state;
238         state = Waiting;
239         timingFault = NoFault;
240         sendPackets();
241     } else {
242         do {
243             walker->port.sendAtomic(read);
244             PacketPtr write = NULL;
245             fault = stepWalk(write);
246             assert(fault == NoFault || read == NULL);
247             state = nextState;
248             nextState = Ready;
249             if (write)
250                 walker->port.sendAtomic(write);
251         } while (read);
252         state = Ready;
253         nextState = Waiting;
254     }
255     return fault;
256 }

Since the page table is stored in memory (and possibly cached), whenever a TLB miss happens the walker must retrieve page table contents from the memory subsystem. To this end, in timing mode the walker issues its memory requests through the sendPackets function.

multi-level page table walking process = multiple packets

661 void
662 Walker::WalkerState::sendPackets()
663 {
664     //If we're already waiting for the port to become available, just return.
665     if (retrying)
666         return;
667
668     //Reads always have priority
669     if (read) {
670         PacketPtr pkt = read;
671         read = NULL;
672         inflight++;
673         if (!walker->sendTiming(this, pkt)) {
674             retrying = true;
675             read = pkt;
676             inflight--;
677             return;
678         }
679     }
680     //Send off as many of the writes as we can.
681     while (writes.size()) {
682         PacketPtr write = writes.back();
683         writes.pop_back();
684         inflight++;
685         if (!walker->sendTiming(this, write)) {
686             retrying = true;
687             writes.push_back(write);
688             inflight--;
689             return;
690         }
691     }
692 }

With modern processors using multi-level page tables, the address of a page table entry cannot be determined before the memory access to the previous level has been resolved. Because of this interdependence among page table accesses, the entries must be fetched sequentially rather than in parallel. Consequently, the walk is structured into multiple stages, with each stage responsible for accessing one level of the page table.

Since the walker has to ask the memory subsystem to fetch the next-level page table entry one step at a time, it must send a different packet at each stage to access the corresponding level of the page table.

When you look at the “sendPackets” function, you will notice a familiar function name, “sendTiming”, which dispatches the page table access request packets to the memory subsystem (e.g., the cache or memory).

Initial page table access packet creation

When you take a look at the “sendPackets” function, you won’t find any packet creation within it; you will only notice that the “sendTiming” function receives a parameter named pkt. So where does this pkt come from? The “setupWalk” function, called from “startWalk”, is responsible for populating the initial request packet that starts the page table access.

551 void
552 Walker::WalkerState::setupWalk(Addr vaddr)
553 {
554     VAddr addr = vaddr;
555     CR3 cr3 = tc->readMiscRegNoEffect(MISCREG_CR3);
556     // Check if we're in long mode or not
557     Efer efer = tc->readMiscRegNoEffect(MISCREG_EFER);
558     dataSize = 8;
559     Addr topAddr;
560     if (efer.lma) {
561         // Do long mode.
562         state = LongPML4;
563         topAddr = (cr3.longPdtb << 12) + addr.longl4 * dataSize;
564         enableNX = efer.nxe;
565     } else {
566         // We're in some flavor of legacy mode.
567         CR4 cr4 = tc->readMiscRegNoEffect(MISCREG_CR4);
568         if (cr4.pae) {
569             // Do legacy PAE.
570             state = PAEPDP;
571             topAddr = (cr3.paePdtb << 5) + addr.pael3 * dataSize;
572             enableNX = efer.nxe;
573         } else {
574             dataSize = 4;
575             topAddr = (cr3.pdtb << 12) + addr.norml2 * dataSize;
576             if (cr4.pse) {
577                 // Do legacy PSE.
578                 state = PSEPD;
579             } else {
580                 // Do legacy non PSE.
581                 state = PD;
582             }
583             enableNX = false;
584         }
585     }
586
587     nextState = Ready;
588     entry.vaddr = vaddr;
589
590     Request::Flags flags = Request::PHYSICAL;
591     if (cr3.pcd)
592         flags.set(Request::UNCACHEABLE);
593
594     RequestPtr request = std::make_shared<Request>(
595         topAddr, dataSize, flags, walker->masterId);
596
597     read = new Packet(request, MemCmd::ReadReq);
598     read->allocate();
599 }

We’ve learned that the “sendPackets” function is employed to transmit multiple page table access requests, depending on the various stages of the page table walking process. So, how are the packets for the subsequent stages created and provided to the “sendPackets” function? Please bear with me as we progress through one complete step of page table walking; I will address this aspect shortly.

sendTiming function: sends the request and saves the current state

Now, let’s explore how the “sendTiming” function transmits the generated page table access request packet to the memory subsystem via the designated port.

156 bool Walker::sendTiming(WalkerState* sendingState, PacketPtr pkt)
157 {
158     WalkerSenderState* walker_state = new WalkerSenderState(sendingState);
159     pkt->pushSenderState(walker_state);
160     if (port.sendTimingReq(pkt)) {
161         return true;
162     } else {
163         // undo the adding of the sender state and delete it, as we
164         // will do it again the next time we attempt to send it
165         pkt->popSenderState();
166         delete walker_state;
167         return false;
168     }
169
170 }

It’s worth noting that the “sendTiming” function initially generates a separate state called “WalkerSenderState.” This state variable is essential for handling the requested page table access and for processing the response from the memory subsystem once the page table access has been completed.
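This push/pop of sender state is a general gem5 idiom: the requester attaches its own bookkeeping to the packet itself, so that when the response returns it can recover which in-flight walk the packet belongs to. A rough sketch of the pattern, assuming the usual gem5 Packet API (the real WalkerSenderState simply stores the WalkerState pointer); WalkerStateStub and SenderStateSketch are illustrative names.

#include "mem/packet.hh"    // gem5's Packet / SenderState

struct WalkerStateStub;      // stands in for the walker's per-walk state

// Sketch of the SenderState idiom: attach per-walk bookkeeping to the packet
// before sending it, recover it from the response packet later.
struct SenderStateSketch : public Packet::SenderState
{
    WalkerStateStub *senderWalk;
    SenderStateSketch(WalkerStateStub *ws) : senderWalk(ws) {}
};

// On send:    pkt->pushSenderState(new SenderStateSketch(thisWalk));
//             port.sendTimingReq(pkt);
// On receive: auto *ss =
//                 dynamic_cast<SenderStateSketch *>(pkt->popSenderState());
//             // ss->senderWalk identifies which walk this response belongs to
//             delete ss;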

Handling return packet from memory sub-system

When the memory subsystem has handled the page table access request, the pagetable_walker receives the result packet through its port. When the packet arrives at the port connecting the pagetable_walker and the memory subsystem, the port invokes the recvTimingResp function of the walker.

104 bool
105 Walker::WalkerPort::recvTimingResp(PacketPtr pkt)
106 {
107     return walker->recvTimingResp(pkt);
108 }
109
110 bool
111 Walker::recvTimingResp(PacketPtr pkt)
112 {
113     WalkerSenderState * senderState =
114         dynamic_cast<WalkerSenderState *>(pkt->popSenderState());
115     WalkerState * senderWalk = senderState->senderWalk;
116     bool walkComplete = senderWalk->recvPacket(pkt);
117     delete senderState;
118     if (walkComplete) {
119         std::list<WalkerState *>::iterator iter;
120         for (iter = currStates.begin(); iter != currStates.end(); iter++) {
121             WalkerState * walkerState = *(iter);
122             if (walkerState == senderWalk) {
123                 iter = currStates.erase(iter);
124                 break;
125             }
126         }
127         delete senderWalk;
128         // Since we block requests when another is outstanding, we
129         // need to check if there is a waiting request to be serviced
130         if (currStates.size() && !startWalkWrapperEvent.scheduled())
131             // delay sending any new requests until we are finished
132             // with the responses
133             schedule(startWalkWrapperEvent, clockEdge());
134     }
135     return true;
136 }

As we’ve seen before, WalkerSenderState wraps the WalkerState instance that was used to send the page table access request associated with the packet just received.

recvPacket handles the received packet and sends another packet for the next stage of the page table access

The retrieved WalkerState instance handles the received packet through its recvPacket function.

602 bool
603 Walker::WalkerState::recvPacket(PacketPtr pkt)
604 {
605     assert(pkt->isResponse());
606     assert(inflight);
607     assert(state == Waiting);
608     inflight--;
609     if (squashed) {
610         // if were were squashed, return true once inflight is zero and
611         // this WalkerState will be freed there.
612         return (inflight == 0);
613     }
614     if (pkt->isRead()) {
615         // should not have a pending read it we also had one outstanding
616         assert(!read);
617
618         // @todo someone should pay for this
619         pkt->headerDelay = pkt->payloadDelay = 0;
620
621         state = nextState;
622         nextState = Ready;
623         PacketPtr write = NULL;
624         read = pkt;
625         timingFault = stepWalk(write);
626         state = Waiting;
627         assert(timingFault == NoFault || read == NULL);
628         if (write) {
629             writes.push_back(write);
630         }
631         sendPackets();
632     } else {
633         sendPackets();
634     }
635     if (inflight == 0 && read == NULL && writes.size() == 0) {
636         state = Ready;
637         nextState = Waiting;
638         if (timingFault == NoFault) {
639             /*
640              * Finish the translation. Now that we know the right entry is
641              * in the TLB, this should work with no memory accesses.
642              * There could be new faults unrelated to the table walk like
643              * permissions violations, so we'll need the return value as
644              * well.
645              */
646             bool delayedResponse;
647             Fault fault = walker->tlb->translate(req, tc, NULL, mode,
648                                                  delayedResponse, true);
649             assert(!delayedResponse);
650             // Let the CPU continue.
651             translation->finish(fault, req, tc, mode);
652         } else {
653             // There was a fault during the walk. Let the CPU know.
654             translation->finish(timingFault, req, tc, mode);
655         }
656         return true;
657     }
658
659     return false;
660 }

Because the recvPacket function has been invoked as the result of a memory read (the initial page table access), lines 614-634 will be executed. There are some functions we haven’t covered yet, but in the end it invokes the sendPackets function again. Wait, why call sendPackets once more in the receive path?

Remember! Page table walking is not a single memory access

Note that we are currently dealing with the response packet for the initial page table access request (the first level of the page table). Therefore, the received packet contains a pointer to the next level of the page table, not the leaf page table entry that actually holds the virtual-to-physical mapping. To acquire the last-level page table entry, additional memory accesses to the lower levels of the page table are needed, which requires calling sendPackets again.

Preparing packets for the next pagetable access requests

Just as the initial packet was generated by setupWalk, the packets required for accessing the further page table levels are prepared by the stepWalk function.

282 Fault
283 Walker::WalkerState::stepWalk(PacketPtr &write)
284 {
285     assert(state != Ready && state != Waiting);
286     Fault fault = NoFault;
287     write = NULL;
288     PageTableEntry pte;
289     if (dataSize == 8)
290         pte = read->getLE<uint64_t>();
291     else
292         pte = read->getLE<uint32_t>();
293     VAddr vaddr = entry.vaddr;
294     bool uncacheable = pte.pcd;
295     Addr nextRead = 0;
296     bool doWrite = false;
297     bool doTLBInsert = false;
298     bool doEndWalk = false;
299     bool badNX = pte.nx && mode == BaseTLB::Execute && enableNX;
300     switch(state) {
301       case LongPML4:
302         DPRINTF(PageTableWalker,
303                 "Got long mode PML4 entry %#016x.\n", (uint64_t)pte);
304         nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl3 * dataSize;
305         doWrite = !pte.a;
306         pte.a = 1;
307         entry.writable = pte.w;
308         entry.user = pte.u;
309         if (badNX || !pte.p) {
310             doEndWalk = true;
311             fault = pageFault(pte.p);
312             break;
313         }
314         entry.noExec = pte.nx;
315         nextState = LongPDP;
316         break;
317       case LongPDP:
318         DPRINTF(PageTableWalker,
319                 "Got long mode PDP entry %#016x.\n", (uint64_t)pte);
320         nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl2 * dataSize;
321         doWrite = !pte.a;
322         pte.a = 1;
323         entry.writable = entry.writable && pte.w;
324         entry.user = entry.user && pte.u;
325         if (badNX || !pte.p) {
326             doEndWalk = true;
327             fault = pageFault(pte.p);
328             break;
329         }
330         nextState = LongPD;
331         break;
332       case LongPD:
333         DPRINTF(PageTableWalker,
334                 "Got long mode PD entry %#016x.\n", (uint64_t)pte);
335         doWrite = !pte.a;
336         pte.a = 1;
337         entry.writable = entry.writable && pte.w;
338         entry.user = entry.user && pte.u;
339         if (badNX || !pte.p) {
340             doEndWalk = true;
341             fault = pageFault(pte.p);
342             break;
343         }
344         if (!pte.ps) {
345             // 4 KB page
346             entry.logBytes = 12;
347             nextRead =
348                 ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl1 * dataSize;
349             nextState = LongPTE;
350             break;
351         } else {
352             // 2 MB page
353             entry.logBytes = 21;
354             entry.paddr = (uint64_t)pte & (mask(31) << 21);
355             entry.uncacheable = uncacheable;
356             entry.global = pte.g;
357             entry.patBit = bits(pte, 12);
358             entry.vaddr = entry.vaddr & ~((2 * (1 << 20)) - 1);
359             doTLBInsert = true;
360             doEndWalk = true;
361             break;
362         }
363       case LongPTE:
364         DPRINTF(PageTableWalker,
365                 "Got long mode PTE entry %#016x.\n", (uint64_t)pte);
366         doWrite = !pte.a;
367         pte.a = 1;
368         entry.writable = entry.writable && pte.w;
369         entry.user = entry.user && pte.u;
370         if (badNX || !pte.p) {
371             doEndWalk = true;
372             fault = pageFault(pte.p);
373             break;
374         }
375         entry.paddr = (uint64_t)pte & (mask(40) << 12);
376         entry.uncacheable = uncacheable;
377         entry.global = pte.g;
378         entry.patBit = bits(pte, 12);
379         entry.vaddr = entry.vaddr & ~((4 * (1 << 10)) - 1);
380         doTLBInsert = true;
381         doEndWalk = true;
382         break;
383       case PAEPDP:
384         DPRINTF(PageTableWalker,
385                 "Got legacy mode PAE PDP entry %#08x.\n", (uint32_t)pte);
386         nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.pael2 * dataSize;
387         if (!pte.p) {
388             doEndWalk = true;
389             fault = pageFault(pte.p);
390             break;
391         }
392         nextState = PAEPD;
393         break;
394       case PAEPD:
395         DPRINTF(PageTableWalker,
396                 "Got legacy mode PAE PD entry %#08x.\n", (uint32_t)pte);
397         doWrite = !pte.a;
398         pte.a = 1;
399         entry.writable = pte.w;
400         entry.user = pte.u;
401         if (badNX || !pte.p) {
402             doEndWalk = true;
403             fault = pageFault(pte.p);
404             break;
405         }
406         if (!pte.ps) {
407             // 4 KB page
408             entry.logBytes = 12;
409             nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.pael1 * dataSize;
410             nextState = PAEPTE;
411             break;
412         } else {
413             // 2 MB page
414             entry.logBytes = 21;
415             entry.paddr = (uint64_t)pte & (mask(31) << 21);
416             entry.uncacheable = uncacheable;
417             entry.global = pte.g;
418             entry.patBit = bits(pte, 12);
419             entry.vaddr = entry.vaddr & ~((2 * (1 << 20)) - 1);
420             doTLBInsert = true;
421             doEndWalk = true;
422             break;
...
443         break;
444       case PSEPD:
445         DPRINTF(PageTableWalker,
446                 "Got legacy mode PSE PD entry %#08x.\n", (uint32_t)pte);
447         doWrite = !pte.a;
448         pte.a = 1;
449         entry.writable = pte.w;
450         entry.user = pte.u;
451         if (!pte.p) {
452             doEndWalk = true;
453             fault = pageFault(pte.p);
454             break;
455         }
456         if (!pte.ps) {
457             // 4 KB page
458             entry.logBytes = 12;
459             nextRead =
460                 ((uint64_t)pte & (mask(20) << 12)) + vaddr.norml2 * dataSize;
461             nextState = PTE;
462             break;
463         } else {
464             // 4 MB page
465             entry.logBytes = 21;
466             entry.paddr = bits(pte, 20, 13) << 32 | bits(pte, 31, 22) << 22;
467             entry.uncacheable = uncacheable;
468             entry.global = pte.g;
469             entry.patBit = bits(pte, 12);
470             entry.vaddr = entry.vaddr & ~((4 * (1 << 20)) - 1);
471             doTLBInsert = true;
472             doEndWalk = true;
473             break;
474         }
475       case PD:
476         DPRINTF(PageTableWalker,
477                 "Got legacy mode PD entry %#08x.\n", (uint32_t)pte);
478         doWrite = !pte.a;
479         pte.a = 1;
480         entry.writable = pte.w;
481         entry.user = pte.u;
482         if (!pte.p) {
483             doEndWalk = true;
484             fault = pageFault(pte.p);
485             break;
486         }
487         // 4 KB page
488         entry.logBytes = 12;
489         nextRead = ((uint64_t)pte & (mask(20) << 12)) + vaddr.norml2 * dataSize;
490         nextState = PTE;
491         break;
492       case PTE:
493         DPRINTF(PageTableWalker,
494                 "Got legacy mode PTE entry %#08x.\n", (uint32_t)pte);
495         doWrite = !pte.a;
496         pte.a = 1;
497         entry.writable = pte.w;
498         entry.user = pte.u;
499         if (!pte.p) {
500             doEndWalk = true;
501             fault = pageFault(pte.p);
502             break;
503         }
504         entry.paddr = (uint64_t)pte & (mask(20) << 12);
505         entry.uncacheable = uncacheable;
506         entry.global = pte.g;
507         entry.patBit = bits(pte, 7);
508         entry.vaddr = entry.vaddr & ~((4 * (1 << 10)) - 1);
509         doTLBInsert = true;
510         doEndWalk = true;
511         break;
512       default:
513         panic("Unknown page table walker state %d!\n");
514     }
515     if (doEndWalk) {
516         if (doTLBInsert)
517             if (!functional)
518                 walker->tlb->insert(entry.vaddr, entry, tc);
519         endWalk();
520     } else {
521         PacketPtr oldRead = read;
522         //If we didn't return, we're setting up another read.
523         Request::Flags flags = oldRead->req->getFlags();
524         flags.set(Request::UNCACHEABLE, uncacheable);
525         RequestPtr request = std::make_shared<Request>(
526             nextRead, oldRead->getSize(), flags, walker->masterId);
527         read = new Packet(request, MemCmd::ReadReq);
528         read->allocate();
529         // If we need to write, adjust the read packet to write the modified
530         // value back to memory.
531         if (doWrite) {
532             write = oldRead;
533             write->setLE<uint64_t>(pte);
534             write->cmd = MemCmd::WriteReq;
535         } else {
536             write = NULL;
537             delete oldRead;
538         }
539     }
540     return fault;
541 }

Even though the function is very long, its job is simple: based on the current state, which represents the page table level accessed by the packet just received, it computes the next-level page table address and populates the corresponding packet.

Because we got here as a result of accessing the PML4 (the topmost level of the page table), lines 301-316 will be executed to prepare the information needed to access the next level. Note that nextState is set to the next page table level, LongPDP.

After setting the fields associated with the next page table level, it generates another read packet (lines 521-538) carrying all the information required to access that level. Note that the newly populated packet is assigned to the read field of the current WalkerState object; this read packet is what sendPackets later sends to access the further page table levels.
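As a concrete, made-up example of the address computed at line 304: suppose the PML4 entry read back is 0x123456027 and the PDP index taken from the virtual address (vaddr.longl3) is 5; the next read then targets bits 51:12 of the entry plus 5 * dataSize.

#include <cassert>
#include <cstdint>

int main()
{
    uint64_t pte = 0x0000000123456027ULL;          // hypothetical PML4 entry
    uint64_t mask40 = (1ULL << 40) - 1;            // gem5's mask(40)
    uint64_t tableBase = pte & (mask40 << 12);     // 0x123456000: PDP table base
    uint64_t longl3 = 5;                           // PDP index from the vaddr
    uint64_t nextRead = tableBase + longl3 * 8;    // dataSize == 8 in long mode
    assert(nextRead == 0x123456028ULL);
    return 0;
}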

These send and receive steps are repeated until the final PTE is read. When the leaf PTE arrives from the memory subsystem, stepWalk sets the doEndWalk and doTLBInsert flags, and when they are set a new TLB entry is inserted into the TLB (lines 515-519).
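For reference, the entry handed to walker->tlb->insert() roughly carries the fields assigned above; the struct below is only an approximation of that shape, not the actual X86ISA::TlbEntry declaration.

#include <cstdint>

// Approximate shape of the information stepWalk accumulates for the TLB
// insert; the real X86ISA::TlbEntry has additional members.
struct TlbEntrySketch
{
    uint64_t paddr;       // physical frame base from the leaf entry
    uint64_t vaddr;       // virtual address rounded down to the page base
    unsigned logBytes;    // 12 for 4 KB pages, 21 for 2 MB pages
    bool writable;        // AND of the W bits seen along the walk
    bool user;            // AND of the U/S bits seen along the walk
    bool uncacheable;     // from the PCD bit
    bool global;          // G bit of the leaf entry
    bool patBit;          // PAT selection bit
    bool noExec;          // NX bit (only meaningful when EFER.NXE is set)
};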

Finish TLB translation

After the translation has finished, whether it ended as a TLB hit, a TLB miss followed by a page table walk, or an unexpected fault, the finish function is invoked through the translation object.

gem5/src/arch/x86/tlb.cc

441 void
442 TLB::translateTiming(const RequestPtr &req, ThreadContext *tc,
443         Translation *translation, Mode mode)
444 {
445     bool delayedResponse;
446     assert(translation);
447     Fault fault =
448         TLB::translate(req, tc, translation, mode, delayedResponse, true);
449
450     if (!delayedResponse)
451         translation->finish(fault, req, tc, mode);
452     else
453         translation->markDelayed();
454 }

Translation object

Wait, what is the translation object? We haven’t dealt with it in detail yet. Let’s go back to the initiateMemRead function to understand what the translation object is.

DataTranslation class and finish method

 464         WholeTranslationState *state =
 465             new WholeTranslationState(req, new uint8_t[size], NULL, mode);
 466         DataTranslation<TimingSimpleCPU *> *translation
 467             = new DataTranslation<TimingSimpleCPU *>(this, state);
 468         thread->dtb->translateTiming(req, thread->getTC(), translation, mode);

At lines 464-468, we can see that it is an object of the DataTranslation class. To find the implementation of the finish function, let’s take a look at that class.

gem5/src/cpu/translation.hh

208 /**
209  * This class represents part of a data address translation.  All state for
210  * the translation is held in WholeTranslationState (above).  Therefore this
211  * class does not need to know whether the translation is split or not.  The
212  * index variable determines this but is simply passed on to the state class.
213  * When this part of the translation is completed, finish is called.  If the
214  * translation state class indicate that the whole translation is complete
215  * then the execution context is informed.
216  */
217 template <class ExecContextPtr>
218 class DataTranslation : public BaseTLB::Translation
219 {
220   protected:
221     ExecContextPtr xc;
222     WholeTranslationState *state;
223     int index;
224
225   public:
226     DataTranslation(ExecContextPtr _xc, WholeTranslationState* _state)
227         : xc(_xc), state(_state), index(0)
228     {
229     }
230
231     DataTranslation(ExecContextPtr _xc, WholeTranslationState* _state,
232                     int _index)
233         : xc(_xc), state(_state), index(_index)
234     {
235     }
236
237     /**
238      * Signal the translation state that the translation has been delayed due
239      * to a hw page table walk.  Split requests are transparently handled.
240      */
241     void
242     markDelayed()
243     {
244         state->delay = true;
245     }
246
247     /**
248      * Finish this part of the translation and indicate that the whole
249      * translation is complete if the state says so.
250      */
251     void
252     finish(const Fault &fault, const RequestPtr &req, ThreadContext *tc,
253            BaseTLB::Mode mode)
254     {
255         assert(state);
256         assert(mode == state->mode);
257         if (state->finish(fault, index)) {
258             if (state->getFault() == NoFault) {
259                 // Don't access the request if faulted (due to squash)
260                 req->setTranslateLatency();
261             }
262             xc->finishTranslation(state);
263         }
264         delete this;
265     }
266
267     bool
268     squashed() const
269     {
270         return xc->isSquashed();
271     }
272 };

We can see that the finish function is implemented in the DataTranslation class. It re-invokes another finish function through the state member field (line 257), and once that call indicates that the whole translation is complete, it calls the finishTranslation method of the execution context xc (line 262), regardless of whether a fault was raised during the TLB processing.

Looking at the initiateMemRead function again, a WholeTranslationState instance is passed to the DataTranslation constructor as the state parameter.

Therefore, the state->finish call inside DataTranslation invokes the WholeTranslationState::finish method. Note that WholeTranslationState holds the actual request (or the split requests) whose translation was just resolved by the TLB.

gem5/src/cpu/translation.hh

 51 /**
 52  * This class captures the state of an address translation.  A translation
 53  * can be split in two if the ISA supports it and the memory access crosses
 54  * a page boundary.  In this case, this class is shared by two data
 55  * translations (below).  Otherwise it is used by a single data translation
 56  * class.  When each part of the translation is finished, the finish
 57  * function is called which will indicate whether the whole translation is
 58  * completed or not.  There are also functions for accessing parts of the
 59  * translation state which deal with the possible split correctly.
 60  */
 61 class WholeTranslationState
 62 {
 63   protected:
 64     int outstanding;
 65     Fault faults[2];
 66
 67   public:
 68     bool delay;
 69     bool isSplit;
 70     RequestPtr mainReq;
 71     RequestPtr sreqLow;
 72     RequestPtr sreqHigh;
 73     uint8_t *data;
 74     uint64_t *res;
 75     BaseTLB::Mode mode;
 76
 77     /**
 78      * Single translation state.  We set the number of outstanding
 79      * translations to one and indicate that it is not split.
 80      */
 81     WholeTranslationState(const RequestPtr &_req, uint8_t *_data,
 82                           uint64_t *_res, BaseTLB::Mode _mode)
 83         : outstanding(1), delay(false), isSplit(false), mainReq(_req),
 84           sreqLow(NULL), sreqHigh(NULL), data(_data), res(_res), mode(_mode)
 85     {
 86         faults[0] = faults[1] = NoFault;
 87         assert(mode == BaseTLB::Read || mode == BaseTLB::Write);
 88     }
 89
 90     /**
 91      * Split translation state.  We copy all state into this class, set the
 92      * number of outstanding translations to two and then mark this as a
 93      * split translation.
 94      */
 95     WholeTranslationState(const RequestPtr &_req, const RequestPtr &_sreqLow,
 96                           const RequestPtr &_sreqHigh, uint8_t *_data,
 97                           uint64_t *_res, BaseTLB::Mode _mode)
 98         : outstanding(2), delay(false), isSplit(true), mainReq(_req),
 99           sreqLow(_sreqLow), sreqHigh(_sreqHigh), data(_data), res(_res),
100           mode(_mode)
101     {
102         faults[0] = faults[1] = NoFault;
103         assert(mode == BaseTLB::Read || mode == BaseTLB::Write);
104     }
105
106     /**
107      * Finish part of a translation.  If there is only one request then this
108      * translation is completed.  If the request has been split in two then
109      * the outstanding count determines whether the translation is complete.
110      * In this case, flags from the split request are copied to the main
111      * request to make it easier to access them later on.
112      */
113     bool
114     finish(const Fault &fault, int index)
115     {
116         assert(outstanding);
117         faults[index] = fault;
118         outstanding--;
119         if (isSplit && outstanding == 0) {
120
121             // For ease later, we copy some state to the main request.
122             if (faults[0] == NoFault) {
123                 mainReq->setPaddr(sreqLow->getPaddr());
124             }
125             mainReq->setFlags(sreqLow->getFlags());
126             mainReq->setFlags(sreqHigh->getFlags());
127         }
128         return outstanding == 0;
129     }

The finish function of WholeTranslationState stores the generated fault in its internal faults array (line 117). After the fault has been recorded, the remaining part of DataTranslation’s finish function invokes xc->finishTranslation(state). Note that finishTranslation takes the WholeTranslationState instance as its state argument.

To understand the details, we have to look at what the xc variable is. Because DataTranslation is declared as a template class and xc has the template parameter type, xc is an instance of whatever type the template was instantiated with.

Now it is the processor’s turn, not the TLB’s

DataTranslation as an interface to interact with CPU

 466         DataTranslation<TimingSimpleCPU *> *translation
 467             = new DataTranslation<TimingSimpleCPU *>(this, state);

Because the translation variable has been instantiated as DataTranslation&lt;TimingSimpleCPU *&gt;, the xc variable is a TimingSimpleCPU pointer. Therefore, when xc->finishTranslation(state) is called, it invokes the TimingSimpleCPU::finishTranslation function. Note that we are jumping back into the CPU code from the TLB module.
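The consequence is that the callback into the CPU is resolved at compile time, with no virtual dispatch from the translation object back to the CPU model. A stripped-down, hypothetical illustration of the same pattern (none of the names below are gem5 code):

#include <iostream>

struct FakeState {};                       // stands in for WholeTranslationState

// The "translation" calls back into whatever CPU type it was instantiated
// with; the call is resolved statically, like DataTranslation<TimingSimpleCPU *>.
template <class ExecContextPtr>
class TranslationSketch
{
    ExecContextPtr xc;
  public:
    explicit TranslationSketch(ExecContextPtr _xc) : xc(_xc) {}
    void finish(FakeState *state) { xc->finishTranslation(state); }
};

struct FakeTimingCPU
{
    void finishTranslation(FakeState *) { std::cout << "CPU resumes\n"; }
};

int main()
{
    FakeTimingCPU cpu;
    TranslationSketch<FakeTimingCPU *> t(&cpu);
    t.finish(nullptr);
    return 0;
}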

What the CPU has to do after the TLB finishes its job

 627 void
 628 TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
 629 {
 630     _status = BaseSimpleCPU::Running;
 631
 632     if (state->getFault() != NoFault) {
 633         if (state->isPrefetch()) {
 634             state->setNoFault();
 635         }
 636         delete [] state->data;
 637         state->deleteReqs();
 638         translationFault(state->getFault());
 639     } else {
 640         if (!state->isSplit) {
 641             sendData(state->mainReq, state->data, state->res,
 642                      state->mode == BaseTLB::Read);
 643         } else {
 644             sendSplitData(state->sreqLow, state->sreqHigh, state->mainReq,
 645                           state->data, state->mode == BaseTLB::Read);
 646         }
 647     }
 648
 649     delete state;
 650 }

When there is a translation fault, it ends up invoking the translationFault function of the CPU with the previously stored fault (line 638). Note that the state->getFault method returns the fault previously stored by WholeTranslationState’s finish. When the translation was triggered by a prefetch instruction, the generated fault is suppressed because it is not critical for execution.

However, when no fault was encountered during the translation, it invokes the sendData function, which we will cover later.

Let the CPU handle the TLB fault

 361 void
 362 TimingSimpleCPU::translationFault(const Fault &fault)
 363 {
 364     // fault may be NoFault in cases where a fault is suppressed,
 365     // for instance prefetches.
 366     updateCycleCounts();
 367     updateCycleCounters(BaseCPU::CPU_STATE_ON);
 368
 369     if (traceData) {
 370         // Since there was a fault, we shouldn't trace this instruction.
 371         delete traceData;
 372         traceData = NULL;
 373     }
 374
 375     postExecute();
 376
 377     advanceInst(fault);
 378 }

The translationFault function invokes the postExecute and advanceInst functions. From the function argument we can infer that advanceInst is what actually deals with the fault. The postExecute function does not do anything to advance the pipeline; it only updates processor statistics such as the power model state, load instruction counters, etc. Therefore, let’s jump into the advanceInst function.

advanceInst to process generated translation fault

gem5/src/cpu/simple/timing.cc

 734 void
 735 TimingSimpleCPU::advanceInst(const Fault &fault)
 736 {
 737     SimpleExecContext &t_info = *threadInfo[curThread];
 738
 739     if (_status == Faulting)
 740         return;
 741
 742     if (fault != NoFault) {
 743         DPRINTF(SimpleCPU, "Fault occured. Handling the fault\n");
 744
 745         advancePC(fault);
 746
 747         // A syscall fault could suspend this CPU (e.g., futex_wait)
 748         // If the _status is not Idle, schedule an event to fetch the next
 749         // instruction after 'stall' ticks.
 750         // If the cpu has been suspended (i.e., _status == Idle), another
 751         // cpu will wake this cpu up later.
 752         if (_status != Idle) {
 753             DPRINTF(SimpleCPU, "Scheduling fetch event after the Fault\n");
 754
 755             Tick stall = dynamic_pointer_cast<SyscallRetryFault>(fault) ?
 756                          clockEdge(syscallRetryLatency) : clockEdge();
 757             reschedule(fetchEvent, stall, true);
 758             _status = Faulting;
 759         }
 760
 761         return;
 762     }
 763
 764     if (!t_info.stayAtPC)
 765         advancePC(fault);
 766
 767     if (tryCompleteDrain())
 768         return;
 769
 770     if (_status == BaseSimpleCPU::Running) {
 771         // kick off fetch of next instruction... callback from icache
 772         // response will cause that instruction to be executed,
 773         // keeping the CPU running.
 774         fetch();
 775     }
 776 }

When there is a pending translation fault, advanceInst delegates the fault to the advancePC function, which actually controls the PC state of the CPU. Since TimingSimpleCPU inherits this function from BaseSimpleCPU, we will look at the BaseSimpleCPU class.

gem5/src/cpu/simple/base.cc

661 void
662 BaseSimpleCPU::advancePC(const Fault &fault)
663 {
664     SimpleExecContext &t_info = *threadInfo[curThread];
665     SimpleThread* thread = t_info.thread;
666
667     const bool branching(thread->pcState().branching());
668
669     //Since we're moving to a new pc, zero out the offset
670     t_info.fetchOffset = 0;
671     if (fault != NoFault) {
672         curMacroStaticInst = StaticInst::nullStaticInstPtr;
673         fault->invoke(threadContexts[curThread], curStaticInst);
674         thread->decoder.reset();
675     } else {
676         if (curStaticInst) {
677             if (curStaticInst->isLastMicroop())
678                 curMacroStaticInst = StaticInst::nullStaticInstPtr;
679             TheISA::PCState pcState = thread->pcState();
680             TheISA::advancePC(pcState, curStaticInst);
681             thread->pcState(pcState);
682         }
683     }
684
685     if (branchPred && curStaticInst && curStaticInst->isControl()) {
686         // Use a fake sequence number since we only have one
687         // instruction in flight at the same time.
688         const InstSeqNum cur_sn(0);
689
690         if (t_info.predPC == thread->pcState()) {
691             // Correctly predicted branch
692             branchPred->update(cur_sn, curThread);
693         } else {
694             // Mis-predicted branch
695             branchPred->squash(cur_sn, thread->pcState(), branching, curThread);
696             ++t_info.numBranchMispred;
697         }
698     }
699 }

In general, the advancePC function updates the current CPU context. Depending on whether a fault has been raised or not, it chooses different paths to handle the fault and redirect the PC. The invoke function called through the fault object handles the generated fault, usually with the help of pre-defined ROM microcode. It also resets the decoder and sets curMacroStaticInst to null, because execution has to continue from a new PC after the fault is handled.

On the other hand, on the usual path where no fault was raised during the current instruction’s execution, it simply advances the PC state of the processor to the next (micro)instruction (lines 676-682).

pre-defined ROM code handles generated fault!

Then let’s take a look at how the fault can be handled by the invoke function implemented in the fault class.

gem5/src/arch/x86/faults.cc

 53 namespace X86ISA
 54 {
 55     void X86FaultBase::invoke(ThreadContext * tc, const StaticInstPtr &inst)
 56     {
 57         if (!FullSystem) {
 58             FaultBase::invoke(tc, inst);
 59             return;
 60         }
 61
 62         PCState pcState = tc->pcState();
 63         Addr pc = pcState.pc();
 64         DPRINTF(Faults, "RIP %#x: vector %d: %s\n",
 65                 pc, vector, describe());
 66         using namespace X86ISAInst::RomLabels;
 67         HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
 68         MicroPC entry;
 69         if (m5reg.mode == LongMode) {
 70             if (isSoft()) {
 71                 entry = extern_label_longModeSoftInterrupt;
 72             } else {
 73                 entry = extern_label_longModeInterrupt;
 74             }
 75         } else {
 76             entry = extern_label_legacyModeInterrupt;
 77         }
 78         tc->setIntReg(INTREG_MICRO(1), vector);
 79         tc->setIntReg(INTREG_MICRO(7), pc);
 80         if (errorCode != (uint64_t)(-1)) {
 81             if (m5reg.mode == LongMode) {
 82                 entry = extern_label_longModeInterruptWithError;
 83             } else {
 84                 panic("Legacy mode interrupts with error codes "
 85                         "aren't implementde.\n");
 86             }
 87             // Software interrupts shouldn't have error codes. If one
 88             // does, there would need to be microcode to set it up.
 89             assert(!isSoft());
 90             tc->setIntReg(INTREG_MICRO(15), errorCode);
 91         }
 92         pcState.upc(romMicroPC(entry));
 93         pcState.nupc(romMicroPC(entry) + 1);
 94         tc->pcState(pcState);
 95     }

To understand the behavior of the fault’s invoke function, we have to look at the fault-related classes first. gem5 provides a base interface for every fault defined in the x86 architecture. x86 defines several types of events that can interrupt the execution flow: faults, aborts, traps, and interrupts. All of these events inherit from the base x86 fault class X86FaultBase, which provides the general interface and semantics of x86 fault events.

Depending on the type of event, classes inheriting from X86FaultBase can override the invoke function to define their own semantics. For example, the PageFault class inherits from X86FaultBase and overrides invoke to add its own page-fault-related behavior before calling the parent’s invoke provided by X86FaultBase.

invoke changes the current RIP to pre-defined microops

Basically, the invoke function makes the processor jump to a pre-defined microcode routine that implements the actual semantics of x86 fault handling. When a fault or interrupt is reported to the processor, it must first save the current context of the processor, and then transfer control flow to the designated fault handler, which is located through the IDT (pointed to by the IDTR register) in x86.

To jump to the pre-defined ROM code, the invoke function makes use of ROM labels that statically index into the sequence of x86 microops stored in the microcode ROM. All the available ROM labels are defined in the RomLabels namespace, as shown below.

gem5/build/X86/arch/x86/generated/decoder-ns.hh.inc

 4587 namespace RomLabels {
 4588 const static uint64_t label_longModeSoftInterrupt_stackSwitched = 92;
 4589 const static uint64_t label_longModeInterrupt_processDescriptor = 11;
 4590 const static uint64_t label_longModeInterruptWithError_cplStackSwitch = 152;
 4591 const static uint64_t label_longModeInterrupt_istStackSwitch = 28;
 4592 const static uint64_t label_jmpFarWork = 192;
 4593 const static uint64_t label_farJmpSystemDescriptor = 207;
 4594 const static uint64_t label_longModeSoftInterrupt_globalDescriptor = 71;
 4595 const static uint64_t label_farJmpGlobalDescriptor = 199;
 4596 const static uint64_t label_initIntHalt = 186;
 4597 const static uint64_t label_longModeInterruptWithError_istStackSwitch = 150;
 4598 const static uint64_t label_legacyModeInterrupt = 184;
 4599 const static uint64_t label_longModeInterruptWithError_globalDescriptor = 132;
 4600 const static uint64_t label_longModeSoftInterrupt_processDescriptor = 72;
 4601 const static uint64_t label_longModeInterruptWithError = 122;
 4602 const static uint64_t label_farJmpProcessDescriptor = 200;
 4603 const static uint64_t label_longModeSoftInterrupt = 61;
 4604 const static uint64_t label_longModeSoftInterrupt_istStackSwitch = 89;
 4605 const static uint64_t label_longModeInterrupt_globalDescriptor = 10;
 4606 const static uint64_t label_longModeInterrupt_cplStackSwitch = 30;
 4607 const static uint64_t label_longModeInterrupt = 0;
 4608 const static uint64_t label_longModeInterruptWithError_processDescriptor = 133;
 4609 const static uint64_t label_longModeInterruptWithError_stackSwitched = 153;
 4610 const static uint64_t label_longModeInterrupt_stackSwitched = 31;
 4611 const static uint64_t label_longModeSoftInterrupt_cplStackSwitch = 91;
 4612 const static MicroPC extern_label_initIntHalt = 186;
 4613 const static MicroPC extern_label_longModeInterruptWithError = 122;
 4614 const static MicroPC extern_label_longModeInterrupt = 0;
 4615 const static MicroPC extern_label_longModeSoftInterrupt = 61;
 4616 const static MicroPC extern_label_legacyModeInterrupt = 184;
 4617 const static MicroPC extern_label_jmpFarWork = 192;
 4618 }

PageFault handling ROM code

Although we have been calling it a translation fault, note that it is represented as a PageFault in x86.

gem5/src/arch/x86/faults.cc

137     void PageFault::invoke(ThreadContext * tc, const StaticInstPtr &inst)
138     {
139         if (FullSystem) {
140             /* Invalidate any matching TLB entries before handling the page fault */
141             tc->getITBPtr()->demapPage(addr, 0);
142             tc->getDTBPtr()->demapPage(addr, 0);
143             HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
144             X86FaultBase::invoke(tc);
145             /*
146              * If something bad happens while trying to enter the page fault
147              * handler, I'm pretty sure that's a double fault and then all
148              * bets are off. That means it should be safe to update this
149              * state now.
150              */
151             if (m5reg.mode == LongMode) {
152                 tc->setMiscReg(MISCREG_CR2, addr);
153             } else {
154                 tc->setMiscReg(MISCREG_CR2, (uint32_t)addr);
155             }
156         } else {
157             PageFaultErrorCode code = errorCode;
158             const char *modeStr = "";
159             if (code.fetch)
160                 modeStr = "execute";
161             else if (code.write)
162                 modeStr = "write";
163             else
164                 modeStr = "read";
165
166             // print information about what we are panic'ing on
167             if (!inst) {
168                 panic("Tried to %s unmapped address %#x.\n", modeStr, addr);
169             } else {
170                 panic("Tried to %s unmapped address %#x.\nPC: %#x, Instr: %s",
171                       modeStr, addr, tc->pcState().pc(),
172                       inst->disassemble(tc->pcState().pc(), debugSymbolTable));
173             }
174         }
175     }
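
The SE-mode branch above tells reads, writes, and instruction fetches apart by inspecting bits of errorCode. For reference, the PageFaultErrorCode bit-union is declared in gem5/src/arch/x86/faults.hh roughly as follows (abridged from the gem5 version I am reading; check your tree for the exact form):

BitUnion32(PageFaultErrorCode)
    Bitfield<0> present;
    Bitfield<1> write;
    Bitfield<2> user;
    Bitfield<3> reserved;
    Bitfield<4> fetch;
EndBitUnion(PageFaultErrorCode)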

Because most of the fault handling logic of the PageFault class overlaps with X86FaultBase, after handling the TLB-related work it simply calls the invoke function of the X86FaultBase class. Because the translation fault we are following occurs in long mode and is not a software interrupt, we will take a look at the ROM label named label_longModeInterrupt.

Pass arguments to the ROM code

Also, before jumping to the ROM label, invoke sets microarchitectural registers to pass the interrupt number and the PC address to the ROM code. Additionally, when the interrupt carries an error code, that code must be passed to the microcode as well.
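
The relevant part of X86FaultBase::invoke in gem5/src/arch/x86/faults.cc looks roughly like the excerpt below (abridged by me; the legacy-mode error-code panic is dropped, so the exact code may differ slightly in your gem5 version):

using namespace X86ISAInst::RomLabels;
HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
MicroPC entry;
if (m5reg.mode == LongMode) {
    entry = isSoft() ? extern_label_longModeSoftInterrupt
                     : extern_label_longModeInterrupt;
} else {
    entry = extern_label_legacyModeInterrupt;
}
tc->setIntReg(INTREG_MICRO(1), vector);           // interrupt/exception vector
tc->setIntReg(INTREG_MICRO(7), pc);               // RIP of the faulting instruction
if (errorCode != (uint64_t)(-1)) {
    entry = extern_label_longModeInterruptWithError;
    tc->setIntReg(INTREG_MICRO(15), errorCode);   // error code, if the fault has one
}
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);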

To pass the arguments to the microcode world, it invokes the setIntReg function defined in the thread context. The thread context is an instance of the SimpleThread class defined in cpu/simple_thread.hh (when you use the O3 out-of-order CPU model, you have to look at the O3ThreadContext class instead). Regardless of your processor model, both classes inherit from the ThreadContext class, which provides the generic register context and the interface for manipulating the registers.

gem5/src/cpu/simple_thread.hh

 98 class SimpleThread : public ThreadState, public ThreadContext
 99 {
100   protected:
101     typedef TheISA::MachInst MachInst;
102     using VecRegContainer = TheISA::VecRegContainer;
103     using VecElem = TheISA::VecElem;
104     using VecPredRegContainer = TheISA::VecPredRegContainer;
105   public:
106     typedef ThreadContext::Status Status;
107
108   protected:
109     std::array<RegVal, TheISA::NumFloatRegs> floatRegs;
110     std::array<RegVal, TheISA::NumIntRegs> intRegs;
111     std::array<VecRegContainer, TheISA::NumVecRegs> vecRegs;
112     std::array<VecPredRegContainer, TheISA::NumVecPredRegs> vecPredRegs;
113     std::array<RegVal, TheISA::NumCCRegs> ccRegs;
114     TheISA::ISA *const isa;    // one "instance" of the current ISA.
115
116     TheISA::PCState _pcState;

477     void
478     setIntReg(RegIndex reg_idx, RegVal val) override
479     {
480         int flatIndex = isa->flattenIntIndex(reg_idx);
481         assert(flatIndex < TheISA::NumIntRegs);
482         DPRINTF(IntRegs, "Setting int reg %d (%d) to %#x.\n",
483                 reg_idx, flatIndex, val);
484         setIntRegFlat(flatIndex, val);
485     }

Detour to TheISA namespace

Although the SimpleThread class can be seen as providing generic registers regardless of architecture, it declares ISA-dependent registers. The magic is the TheISA symbol: it is translated into an architecture-specific namespace depending on the architecture that Gem5 has been compiled for. Let's take a short detour and figure out how the TheISA namespace works.

When you don't know what the TheISA namespace is, you may want to grep for "namespace TheISA" to find the files that define it. Unfortunately, you will only find a very few places where the TheISA namespace is declared, with a handful of member functions. So where do the functions and variables of the TheISA namespace come from? To understand TheISA::, we should look at the build files, not the source files.

build/X86/config/the_isa.hh

  1 #ifndef __CONFIG_THE_ISA_HH__
  2 #define __CONFIG_THE_ISA_HH__
  3
  4 #define ALPHA_ISA 1
  5 #define ARM_ISA 2
  6 #define MIPS_ISA 3
  7 #define NULL_ISA 4
  8 #define POWER_ISA 5
  9 #define RISCV_ISA 6
 10 #define SPARC_ISA 7
 11 #define X86_ISA 8
 12
 13 enum class Arch {
 14   AlphaISA = ALPHA_ISA,
 15   ArmISA = ARM_ISA,
 16   MipsISA = MIPS_ISA,
 17   NullISA = NULL_ISA,
 18   PowerISA = POWER_ISA,
 19   RiscvISA = RISCV_ISA,
 20   SparcISA = SPARC_ISA,
 21   X86ISA = X86_ISA
 22 };
 23
 24 #define THE_ISA X86_ISA
 25 #define TheISA X86ISA
 26 #define THE_ISA_STR "x86"
 27
 28 #endif // __CONFIG_THE_ISA_HH__

Here we can see that TheISA is defined as X86ISA, because I compiled GEM5 with the X86 configuration. When we look at the SConscript files, we can also find a Python function named makeTheISA that generates the contents of the config/the_isa.hh file at build time.

Therefore, when TheISA is used in the CPU-related files, it is not an actual namespace called "TheISA" but the architecture-dependent ISA namespace. Consequently, when you encounter namespace TheISA, first check whether the config/the_isa.hh header has been included in your target source file; if it has, look at the architecture-dependent namespace defined under the gem5/src/arch/YOUR_ARCHITECTURE directory. In my case, because I use X86, it is the X86ISA namespace.
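
As a quick sanity check, the tiny sketch below (my own example, not from the gem5 tree; I am assuming the register constants live in arch/x86/registers.hh, as they do in the version I am reading) shows what the macro expansion means for an X86 build:

#include "config/the_isa.hh"      // generated at build time; contains "#define TheISA X86ISA"
#include "arch/x86/registers.hh"  // declares X86ISA::NumIntRegs

// After preprocessing, the next line is identical to
// "constexpr int regs = X86ISA::NumIntRegs;"
constexpr int regs = TheISA::NumIntRegs;

static_assert(THE_ISA == X86_ISA, "this sketch assumes an X86 build of gem5");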

SetIntReg with TheISA

Now let's go back to the SimpleThread class. In addition to the architecture-specific register context, it provides the setIntReg function, which lets the processor store a value into the intRegs array at the given index.

477     void
478     setIntReg(RegIndex reg_idx, RegVal val) override
479     {
480         int flatIndex = isa->flattenIntIndex(reg_idx);
481         assert(flatIndex < TheISA::NumIntRegs);
482         DPRINTF(IntRegs, "Setting int reg %d (%d) to %#x.\n",
483                 reg_idx, flatIndex, val);
484         setIntRegFlat(flatIndex, val);
485     }

618     void
619     setIntRegFlat(RegIndex idx, RegVal val) override
620     {
621         intRegs[idx] = val;
622     }

Note that val is stored in the intRegs array through the unified setIntReg interface. intRegs contains not only the architectural registers, such as rsi, rdi, and rcx in x86, but also the integer micro-registers used only by the microops.

Because x86 in GEM5 defines 16 integer micro-registers available to the microops (see NumMicroIntRegs in gem5/src/arch/x86/x86_traits.hh), up to 16 integer values can be passed to the microcode through the setIntReg function. As shown in the invoke function, micro-registers 1, 7, and 15 are used to pass the fault-related arguments to the microops.
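
The micro integer registers are simply indexed past the architectural ones. In the gem5 version I am reading, the mapping in gem5/src/arch/x86/regs/int.hh looks roughly like this (abridged; names may differ slightly across versions):

static inline X86IntRegIndex
INTREG_MICRO(int index)
{
    // micro register i lives right after the NUM_INTREGS architectural slots
    return (X86IntRegIndex)(NUM_INTREGS + index);
}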

Jump to the ROM code!

After the required parameters are set, the processor jumps to the ROM code pointed to by the label. This control flow transition is done by updating the _pcState member field of the SimpleThread class object.

gem5/src/arch/x86/faults.cc

 92         pcState.upc(romMicroPC(entry));
 93         pcState.nupc(romMicroPC(entry) + 1);
 94         tc->pcState(pcState);
 95     }

Looking at the above code from the invoke function of the X86FaultBase class, we can see that it updates the upc field of the pcState to the location of the ROM code.

gem5/src/base/types.hh

144 typedef uint16_t MicroPC;
145
146 static const MicroPC MicroPCRomBit = 1 << (sizeof(MicroPC) * 8 - 1);
147
148 static inline MicroPC
149 romMicroPC(MicroPC upc)
150 {
151     return upc | MicroPCRomBit;
152 }
153
154 static inline MicroPC
155 normalMicroPC(MicroPC upc)
156 {
157     return upc & ~MicroPCRomBit;
158 }
159
160 static inline bool
161 isRomMicroPC(MicroPC upc)
162 {
163     return MicroPCRomBit & upc;
164 }

Note that the romMicroPC function sets a flag bit to indicate that the upc points into the microcode ROM. The flag, MicroPCRomBit (the top bit of the 16-bit MicroPC), is simply bitwise ORed into the upc value.
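
As a standalone worked example (my own code, restating the helpers above so it compiles on its own), here is what the ROM bit does for label_longModeInterrupt, whose entry is 0 in the generated RomLabels shown earlier:

#include <cassert>
#include <cstdint>

typedef uint16_t MicroPC;
static const MicroPC MicroPCRomBit = 1 << (sizeof(MicroPC) * 8 - 1);        // 0x8000
static MicroPC romMicroPC(MicroPC upc)    { return upc | MicroPCRomBit; }
static MicroPC normalMicroPC(MicroPC upc) { return upc & ~MicroPCRomBit; }
static bool    isRomMicroPC(MicroPC upc)  { return MicroPCRomBit & upc; }

int main()
{
    const MicroPC entry = 0;                 // label_longModeInterrupt
    const MicroPC upc = romMicroPC(entry);   // 0 | 0x8000 == 0x8000
    assert(isRomMicroPC(upc));               // the fetch logic sees the ROM bit
    assert(normalMicroPC(upc) == entry);     // the ROM offset is still recoverable
    return 0;
}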

arch/generic/types.hh

193 // A PC and microcode PC.
194 template <class MachInst>
195 class UPCState : public SimplePCState<MachInst>
196 {
197   protected:
198     typedef SimplePCState<MachInst> Base;
199 
200     MicroPC _upc;
201     MicroPC _nupc;
202 
203   public:
204 
205     MicroPC upc() const { return _upc; }
206     void upc(MicroPC val) { _upc = val; }
207 
208     MicroPC nupc() const { return _nupc; }
209     void nupc(MicroPC val) { _nupc = val; }

After the ROM upc value is computed, the pcState has to be updated. Rather than modifying the thread context's PC state in place, the invoke function works on a local copy of the PCState: it calls upc() and nupc() on that copy and then writes it back with tc->pcState(pcState). This updates the _pcState member field of the thread context, so the processor will run from the new micro PC address when the next fetch happens.

However, note that this function only updates the _pcState member field of the ThreadContext. So who actually redirects the pipeline to fetch the new instructions from the ROM instead of from the faulting instruction? Let's go back to the advancePC function that called the invoke function.

Let’s go back to advancePC & advanceInst

gem5/src/cpu/simple/base.cc

673         fault->invoke(threadContexts[curThread], curStaticInst);
674         thread->decoder.reset();

After the invoke function is called as part of advancePC, the decoder is reset, which puts the decoder state back to ResetState.

gem5/src/cpu/simple/timing.cc

 730 void
 731 TimingSimpleCPU::advanceInst(const Fault &fault)
 732 {
 733     SimpleExecContext &t_info = *threadInfo[curThread];
 734
 735     if (_status == Faulting)
 736         return;
 737
 738     if (fault != NoFault) {
 739         DPRINTF(SimpleCPU, "Fault occured. Handling the fault\n");
 740
 741         advancePC(fault);
 746
 747         // A syscall fault could suspend this CPU (e.g., futex_wait)
 748         // If the _status is not Idle, schedule an event to fetch the next
 749         // instruction after 'stall' ticks.
 750         // If the cpu has been suspended (i.e., _status == Idle), another
 751         // cpu will wake this cpu up later.
 752         if (_status != Idle) {
 753             DPRINTF(SimpleCPU, "Scheduling fetch event after the Fault\n");
 754
 755             Tick stall = dynamic_pointer_cast<SyscallRetryFault>(fault) ?
 756                          clockEdge(syscallRetryLatency) : clockEdge();
 757             reschedule(fetchEvent, stall, true);
 758             _status = Faulting;
 759         }
 760
 761         return;
 762     }

After returning from advancePC, the advanceInst function checks the status of the current processor. When the processor is not in the idle state, it reschedules fetchEvent to run at the tick stored in stall (the next clock edge, or a later edge for a SyscallRetryFault). Also note that the status of the processor is changed to Faulting.

fetchEvent invokes fetch() function

By the way, what is fetchEvent? The constructor below (gem5/src/cpu/simple/timing.cc) shows how it is set up.

  79 TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams *p)
  80     : BaseSimpleCPU(p), fetchTranslation(this), icachePort(this),
  81       dcachePort(this), ifetch_pkt(NULL), dcache_pkt(NULL), previousCycle(0),
  82       fetchEvent([this]{ fetch(); }, name())
  83 {
  84     _status = Idle;
  85 }

fetchEvent is an EventFunctionWrapper, the type used for registering events in GEM5. The TimingSimpleCPU constructor binds it to a lambda that calls the fetch() function, so when the scheduled tick arrives, the event fires and fetch() is invoked to fetch the next instruction.
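
For intuition, here is a hypothetical, gem5-flavored sketch (my own code, not from the tree; MyObject, kick, and fetch are made-up names) of the same pattern TimingSimpleCPU uses: an EventFunctionWrapper member bound to a lambda and re-armed with reschedule:

#include "base/types.hh"
#include "sim/eventq.hh"
#include "sim/sim_object.hh"

class MyObject : public SimObject
{
    // Runs the bound lambda when the event is processed.
    EventFunctionWrapper fetchEvent;

  public:
    MyObject(const SimObjectParams *p)
        : SimObject(p), fetchEvent([this]{ fetch(); }, name())
    {}

    void fetch() { /* work to run when the event fires */ }

    void kick(Tick when)
    {
        // The third argument ("always") schedules the event even if it is
        // not currently pending, which is why advanceInst can call
        // reschedule() unconditionally after a fault.
        reschedule(fetchEvent, when, true);
    }
};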

Now start to fetch from the updated micro PC: the ROM code!

 653 void
 654 TimingSimpleCPU::fetch()
 655 {
 656     // Change thread if multi-threaded
 657     swapActiveThread();
 658
 659     SimpleExecContext &t_info = *threadInfo[curThread];
 660     SimpleThread* thread = t_info.thread;
 661
 662     DPRINTF(SimpleCPU, "Fetch\n");
 663
 664     if (!curStaticInst || !curStaticInst->isDelayedCommit()) {
 665         checkForInterrupts();
 666         checkPcEventQueue();
 667     }
 668
 669     // We must have just got suspended by a PC event
 670     if (_status == Idle)
 671         return;
 672
 673     TheISA::PCState pcState = thread->pcState();
 674     bool needToFetch = !isRomMicroPC(pcState.microPC()) &&
 675                        !curMacroStaticInst;
 676
 677     if (needToFetch) {
 678         _status = BaseSimpleCPU::Running;
 679         RequestPtr ifetch_req = std::make_shared<Request>();
 680         ifetch_req->taskId(taskId());
 681         ifetch_req->setContext(thread->contextId());
 682         setupFetchRequest(ifetch_req);
 683         DPRINTF(SimpleCPU, "Translating address %#x\n", ifetch_req->getVaddr());
 684         thread->itb->translateTiming(ifetch_req, thread->getTC(),
 685                 &fetchTranslation, BaseTLB::Execute);
 686     } else {
 687         _status = IcacheWaitResponse;
 688         completeIfetch(NULL);
 689
 690         updateCycleCounts();
 691         updateCycleCounters(BaseCPU::CPU_STATE_ON);
 692     }
 693 }

Remember that curMacroStaticInst has been set to StaticInst::nullStaticInstPtr by advancePC, and that the upc has been updated to the ROM address with the MicroPCRomBit flag set. As a result, needToFetch evaluates to false, so fetch() takes the else branch and calls completeIfetch(NULL): the microops are supplied directly from the microcode ROM rather than being fetched from memory.
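
Spelled out (my own annotation, not gem5 source), the condition evaluates like this when fetch() runs after PageFault::invoke:

// pcState.microPC() now has MicroPCRomBit set, so:
bool needToFetch = !isRomMicroPC(pcState.microPC())   // !true -> false
                   && !curMacroStaticInst;            // short-circuited away
// needToFetch == false: no icache request is issued; completeIfetch(NULL)
// runs and the decoder starts emitting microops from the microcode ROM.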
