Pagetable walking and pagefault handling in gem5 (X86 TLB)

Categories: GEM5, TLB

In this posting, we are going to take a look at how memory accesses are resolved through the TLB and page table walking.
Who initiates TLB access?
The TLB maintains virtual-to-physical address translation information to avoid walking the entire page table on every memory access. In other words, it is a cache of virtual-to-physical mappings, usually maintained by the processor. Then which part of the CPU logic initiates a TLB access, and what operations must the TLB component perform?
Interface between CPU pipeline and TLB component
425 Fault
426 TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
427 Request::Flags flags,
428 const std::vector<bool>& byte_enable)
429 {
430 SimpleExecContext &t_info = *threadInfo[curThread];
431 SimpleThread* thread = t_info.thread;
432
439 Fault fault;
440 const int asid = 0;
441 const Addr pc = thread->instAddr();
442 unsigned block_size = cacheLineSize();
443 BaseTLB::Mode mode = BaseTLB::Read;
444
445 if (traceData)
446 traceData->setMem(addr, size, flags);
447
448 RequestPtr req = std::make_shared<Request>(
449 asid, addr, size, flags, dataMasterId(), pc,
450 thread->contextId());
451 if (!byte_enable.empty()) {
452 req->setByteEnable(byte_enable);
453 }
454
455 req->taskId(taskId());
456
457 Addr split_addr = roundDown(addr + size - 1, block_size);
458 assert(split_addr <= addr || split_addr - addr < block_size);
459
460 _status = DTBWaitResponse;
461 if (split_addr > addr) {
462 RequestPtr req1, req2;
463 assert(!req->isLLSC() && !req->isSwap());
464 req->splitOnVaddr(split_addr, req1, req2);
465
466 WholeTranslationState *state =
467 new WholeTranslationState(req, req1, req2, new uint8_t[size],
468 NULL, mode);
469 DataTranslation<TimingSimpleCPU *> *trans1 =
470 new DataTranslation<TimingSimpleCPU *>(this, state, 0);
471 DataTranslation<TimingSimpleCPU *> *trans2 =
472 new DataTranslation<TimingSimpleCPU *>(this, state, 1);
473
474 thread->dtb->translateTiming(req1, thread->getTC(), trans1, mode);
475 thread->dtb->translateTiming(req2, thread->getTC(), trans2, mode);
476 } else {
477 WholeTranslationState *state =
478 new WholeTranslationState(req, new uint8_t[size], NULL, mode);
479 DataTranslation<TimingSimpleCPU *> *translation
480 = new DataTranslation<TimingSimpleCPU *>(this, state);
481 thread->dtb->translateTiming(req, thread->getTC(), translation, mode);
482 }
483
484 return NoFault;
485 }
One of the most fundamental capabilities of a processor is accessing memory. gem5 has each CPU model implement its own memory access building blocks as member functions of its class. We will look at a simple processor, TimingSimpleCPU, and its corresponding memory access function, initiateMemRead. Note that at the end of initiateMemRead, it creates a DataTranslation object and passes it to the translateTiming function defined in the processor's data TLB component. This translation object is used to process the current TLB access request. Also note that translateTiming needs a ThreadContext to perform the TLB access, and a RequestPtr object containing all the memory request information, such as the virtual address.
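The cache-line split check near the top of initiateMemRead is plain address arithmetic. Here is a minimal sketch of that check, assuming a power-of-two line size (round_down and needs_split are illustrative helpers, not gem5 functions):

```cpp
#include <cassert>
#include <cstdint>

// Round addr down to the nearest multiple of align (align must be a power of two).
inline uint64_t round_down(uint64_t addr, uint64_t align) {
    return addr & ~(align - 1);
}

// Mirrors the check in initiateMemRead: an access [addr, addr + size) must be
// split into two requests when its last byte lies in a different cache line
// than its first byte.
inline bool needs_split(uint64_t addr, unsigned size, uint64_t block_size) {
    uint64_t split_addr = round_down(addr + size - 1, block_size);
    return split_addr > addr;   // the second half starts at split_addr
}
```

For a 64-byte line, an 8-byte read at 0x103C ends at 0x1043 and therefore spans two lines, while the same read at 0x1000 does not.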
It's all about the TLB! No actual memory access to the virtual address yet!
The initiateMemRead function does not initiate the actual memory access; it only asks the TLB component to produce a virtual-to-physical mapping in its TLB cache.
The name initiateMemRead can be confusing, but the actual memory access can only occur after the TLB request has been successfully resolved. I will describe how the actual memory access happens in a separate posting. Keep in mind that here we focus only on the translation part!
//gem5/src/arch/x86/tlb.cc
441 void
442 TLB::translateTiming(const RequestPtr &req, ThreadContext *tc,
443 Translation *translation, Mode mode)
444 {
445 bool delayedResponse;
446 assert(translation);
447 Fault fault =
448 TLB::translate(req, tc, translation, mode, delayedResponse, true);
449
450 if (!delayedResponse)
451 translation->finish(fault, req, tc, mode);
452 else
453 translation->markDelayed();
454 }
As we assume gem5 is compiled for the X86 architecture, this invokes the X86 TLB implementation. Note that translateTiming is a member of the TLB class, which means we have now moved from the processor pipeline into the TLB component. The actual translation is done by the TLB::translate function. Depending on whether the target virtual address has previously been resolved and its mapping cached in the TLB, the function either retrieves the entry from the TLB or obtains it by walking the page table.
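The two completion paths, an immediate finish on a hit versus markDelayed on a timing-mode miss, can be modeled with a tiny state object (a sketch with illustrative names, not gem5's Translation interface):

```cpp
#include <cassert>

// Minimal model of the two completion paths in translateTiming: a TLB hit
// finishes the translation synchronously; a timing-mode miss only marks it
// delayed, and the page table walker calls finish() when the walk ends.
struct ToyTranslation {
    bool delayed = false;
    bool finished = false;
    void markDelayed() { delayed = true; }
    void finish() { finished = true; }
};

inline ToyTranslation translate_timing(bool hit) {
    ToyTranslation t;
    if (hit)
        t.finish();        // !delayedResponse: complete immediately
    else
        t.markDelayed();   // the walker will finish it on walk completion
    return t;
}
```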
// gem5/src/arch/x86/tlb.cc
277 Fault
278 TLB::translate(const RequestPtr &req,
279 ThreadContext *tc, Translation *translation,
280 Mode mode, bool &delayedResponse, bool timing)
281 {
282 Request::Flags flags = req->getFlags();
283 int seg = flags & SegmentFlagMask;
284 bool storeCheck = flags & (StoreCheck << FlagShift);
...
341 // If paging is enabled, do the translation.
342 if (m5Reg.paging) {
343 DPRINTF(TLB, "Paging enabled.\n");
344 // The vaddr already has the segment base applied.
345 TlbEntry *entry = lookup(vaddr);
346 if (mode == Read) {
347 rdAccesses++;
348 } else {
349 wrAccesses++;
350 }
351 if (!entry) {
352 DPRINTF(TLB, "Handling a TLB miss for "
353 "address %#x at pc %#x.\n",
354 vaddr, tc->instAddr());
355 if (mode == Read) {
356 rdMisses++;
357 } else {
358 wrMisses++;
359 }
360 if (FullSystem) {
361 Fault fault = walker->start(tc, translation, req, mode);
362 if (timing || fault != NoFault) {
363 // This gets ignored in atomic mode.
364 delayedResponse = true;
365 return fault;
366 }
367 entry = lookup(vaddr);
368 assert(entry);
369 } else {
The first step in the translate function is a TLB lookup, asking whether the required translation entry is present (line 345). If the entry is absent, the walk proceeds through the page table, which is stored in memory, to acquire the virtual-to-physical translation (lines 351 to 395). Since we are interested in full-system emulation, I will focus on the FullSystem parts of the TLB handling. In gem5's full-system mode, when a TLB miss occurs, the page table is walked using the pagetable_walker object (line 361). It's important to note that the req parameter is passed to the pagetable_walker because it contains all the essential information, such as the address and flags, needed to resolve the memory access correctly.
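The lookup/insert pair the translate function relies on behaves like a map keyed by the virtual page number. A toy sketch, assuming 4 KiB pages (gem5's real TlbEntry carries many more fields plus an eviction policy):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy software TLB keyed by virtual page number (4 KiB pages assumed).
struct TlbEntry { uint64_t page_paddr; bool writable; };

class TinyTlb {
    std::unordered_map<uint64_t, TlbEntry> entries_;
public:
    void insert(uint64_t vaddr, TlbEntry e) { entries_[vaddr >> 12] = e; }

    // Returns the full physical address on a hit, nothing on a miss.
    std::optional<uint64_t> lookup(uint64_t vaddr) const {
        auto it = entries_.find(vaddr >> 12);
        if (it == entries_.end())
            return std::nullopt;                        // miss: walk the page table
        return it->second.page_paddr | (vaddr & 0xfff); // keep the page offset
    }
};

// Small driver so the behavior is easy to check: one mapping, hit or miss.
inline uint64_t demo_translate(uint64_t vaddr) {
    TinyTlb tlb;
    tlb.insert(0x7f0000001000ULL, TlbEntry{0x200000ULL, true});
    auto pa = tlb.lookup(vaddr);
    return pa ? *pa : 0;
}
```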
Page table walking in TLB
In cases where it is either the first request or the previous TLB entry has been evicted from the TLB cache, it is required to traverse the page table and obtain the virtual to physical mapping. Let’s examine the process by which the TLB effectively navigates the page table and retrieves the final-level page table entry.
WalkerState per request
In contrast to simpler operations, it’s typically not possible to resolve TLB misses in a single cycle.
As the page table is structured with multiple levels, the page table walking demands numerous memory accesses. These accesses are essential for reaching the leaf page table entry that contains the virtual-to-physical mapping and other pertinent flags.
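In x86-64 long mode, each of those levels is selected by a 9-bit slice of the virtual address, which is why reaching a 4 KiB page's leaf entry takes four sequential memory accesses. A small sketch of the index extraction:

```cpp
#include <cassert>
#include <cstdint>

// x86-64 long mode splits a 48-bit virtual address into four 9-bit table
// indices plus a 12-bit page offset. Level 3 = PML4, 2 = PDP, 1 = PD, 0 = PT.
inline unsigned pt_index(uint64_t vaddr, int level) {
    return static_cast<unsigned>((vaddr >> (12 + 9 * level)) & 0x1ff);
}
```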
//gem5/src/arch/x86/pagetable_walker.cc
71 Fault
72 Walker::start(ThreadContext * _tc, BaseTLB::Translation *_translation,
73 const RequestPtr &_req, BaseTLB::Mode _mode)
74 {
75 // TODO: in timing mode, instead of blocking when there are other
76 // outstanding requests, see if this request can be coalesced with
77 // another one (i.e. either coalesce or start walk)
78 WalkerState * newState = new WalkerState(this, _translation, _req);
79 newState->initState(_tc, _mode, sys->isTimingMode());
80 if (currStates.size()) {
81 assert(newState->isTiming());
82 DPRINTF(PageTableWalker, "Walks in progress: %d\n", currStates.size());
83 currStates.push_back(newState);
84 return NoFault;
85 } else {
86 currStates.push_back(newState);
87 Fault fault = newState->startWalk();
88 if (!newState->isTiming()) {
89 currStates.pop_front();
90 delete newState;
91 }
92 return fault;
93 }
94 }
It is important to note that TLB misses can occur simultaneously, because multiple processors might try to access memory addresses whose virtual-to-physical mappings are not stored in the TLB cache. Additionally, since each request cannot be handled in a single clock cycle, the state of the page table walk must be stored for each request. The WalkerState is employed for this specific purpose, maintaining all the information needed for page table walking on a per-request basis.
The “currStates” keeps track of all the outstanding requests, which are those that have been requested previously but have not yet been resolved, in the form of a list. If there are any unresolved TLB misses, the current request is simply added to the list, and the system waits until the preceding requests have been resolved, as seen in lines 80-84. Once the outstanding request has been resolved, the pending requests are then processed one after another.
If there are no remaining requests in the list, as indicated in lines 85-92, the newly generated state is added and the startWalk function is called with it. For a timing CPU, there is no need to remove the current state from the currStates list here, because another stage of the timing model takes care of removing it; in the atomic case, lines 88-91 pop and delete the state as soon as the walk completes.
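The start-or-queue decision in lines 80-92 can be sketched as a toy model (the Walk struct and helper names are illustrative, not gem5's):

```cpp
#include <cassert>
#include <list>
#include <string>

// Toy model of the currStates policy: a new walk starts immediately only
// when no other walk is outstanding; otherwise it queues behind them.
struct Walk { bool started = false; };

inline std::string start_or_queue(std::list<Walk> &curr) {
    curr.emplace_back();
    if (curr.size() > 1)
        return "queued";          // wait until earlier walks are resolved
    curr.back().started = true;   // no outstanding walks: start right away
    return "started";
}

// What happens to the nth arriving request if none complete in between.
inline std::string demo(int nth) {
    std::list<Walk> curr;
    std::string last;
    for (int i = 0; i < nth; ++i)
        last = start_or_queue(curr);
    return last;
}
```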
startWalk, initiating page table walking
gem5/src/arch/x86/pagetable_walker.cc
229 Fault
230 Walker::WalkerState::startWalk()
231 {
232 Fault fault = NoFault;
233 assert(!started);
234 started = true;
235 setupWalk(req->getVaddr());
236 if (timing) {
237 nextState = state;
238 state = Waiting;
239 timingFault = NoFault;
240 sendPackets();
241 } else {
242 do {
243 walker->port.sendAtomic(read);
244 PacketPtr write = NULL;
245 fault = stepWalk(write);
246 assert(fault == NoFault || read == NULL);
247 state = nextState;
248 nextState = Ready;
249 if (write)
250 walker->port.sendAtomic(write);
251 } while (read);
252 state = Ready;
253 nextState = Waiting;
254 }
255 return fault;
256 }
Since the page table is stored in memory (or cache), whenever a TLB miss happens the walker must retrieve page table contents from the memory subsystem. To this end, it initiates memory requests through the sendPackets function.
multi-level page table walking process = multiple packets
661 void
662 Walker::WalkerState::sendPackets()
663 {
664 //If we're already waiting for the port to become available, just return.
665 if (retrying)
666 return;
667
668 //Reads always have priority
669 if (read) {
670 PacketPtr pkt = read;
671 read = NULL;
672 inflight++;
673 if (!walker->sendTiming(this, pkt)) {
674 retrying = true;
675 read = pkt;
676 inflight--;
677 return;
678 }
679 }
680 //Send off as many of the writes as we can.
681 while (writes.size()) {
682 PacketPtr write = writes.back();
683 writes.pop_back();
684 inflight++;
685 if (!walker->sendTiming(this, write)) {
686 retrying = true;
687 writes.push_back(write);
688 inflight--;
689 return;
690 }
691 }
692 }
With modern processors using multi-level page tables, it is impossible to determine which page table entry to access next before the access at the previous level has been resolved. Because of this interdependence among page table accesses, they must be carried out sequentially rather than in parallel. Consequently, the walk is structured into multiple stages, with each stage responsible for accessing one level of the page table.
Since the TLB must ask the memory subsystem to fetch each level of the page table one by one, it sends a different packet at each stage to access a specific level of the page table.
Looking at the sendPackets function, you will notice a familiar function name, sendTiming, which dispatches the page table access request packets to the memory subsystem (e.g., the cache or memory).
Initial page table access packet creation
When you take a look at the “sendPackets” function, you won’t observe any packet creation within it. However, you will notice that the “sendTiming” function receives a parameter named pkt. So, where does this pkt come from? The “setupWalk” function within the “startWalk” function is responsible for populating the appropriate request packet, which initiates the access to the page table.
551 void
552 Walker::WalkerState::setupWalk(Addr vaddr)
553 {
554 VAddr addr = vaddr;
555 CR3 cr3 = tc->readMiscRegNoEffect(MISCREG_CR3);
556 // Check if we're in long mode or not
557 Efer efer = tc->readMiscRegNoEffect(MISCREG_EFER);
558 dataSize = 8;
559 Addr topAddr;
560 if (efer.lma) {
561 // Do long mode.
562 state = LongPML4;
563 topAddr = (cr3.longPdtb << 12) + addr.longl4 * dataSize;
564 enableNX = efer.nxe;
565 } else {
566 // We're in some flavor of legacy mode.
567 CR4 cr4 = tc->readMiscRegNoEffect(MISCREG_CR4);
568 if (cr4.pae) {
569 // Do legacy PAE.
570 state = PAEPDP;
571 topAddr = (cr3.paePdtb << 5) + addr.pael3 * dataSize;
572 enableNX = efer.nxe;
573 } else {
574 dataSize = 4;
575 topAddr = (cr3.pdtb << 12) + addr.norml2 * dataSize;
576 if (cr4.pse) {
577 // Do legacy PSE.
578 state = PSEPD;
579 } else {
580 // Do legacy non PSE.
581 state = PD;
582 }
583 enableNX = false;
584 }
585 }
586
587 nextState = Ready;
588 entry.vaddr = vaddr;
589
590 Request::Flags flags = Request::PHYSICAL;
591 if (cr3.pcd)
592 flags.set(Request::UNCACHEABLE);
593
594 RequestPtr request = std::make_shared<Request>(
595 topAddr, dataSize, flags, walker->masterId);
596
597 read = new Packet(request, MemCmd::ReadReq);
598 read->allocate();
599 }
We’ve learned that the “sendPackets” function is employed to transmit multiple page table access requests, depending on the various stages of the page table walking process. So, how are the packets for the subsequent stages created and provided to the “sendPackets” function? Please bear with me as we progress through one complete step of page table walking; I will address this aspect shortly.
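In long mode, the very first read address computed by setupWalk is just CR3's page frame base plus the PML4 index times the entry size, mirroring line 563 above. A sketch with simplified field names:

```cpp
#include <cassert>
#include <cstdint>

// Long-mode version of setupWalk's first address computation: CR3 holds the
// page frame number of the PML4 table, and virtual address bits 47:39 select
// one of its 8-byte entries.
inline uint64_t pml4_entry_addr(uint64_t cr3_pml4_pfn, uint64_t vaddr) {
    const uint64_t data_size = 8;                 // PTE size in long mode
    uint64_t l4_index = (vaddr >> 39) & 0x1ff;    // addr.longl4
    return (cr3_pml4_pfn << 12) + l4_index * data_size;
}
```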
SendTiming function: sends request and save current state
Now, let’s explore how the “sendTiming” function transmits the generated page table access request packet to the memory subsystem via the designated port.
156 bool Walker::sendTiming(WalkerState* sendingState, PacketPtr pkt)
157 {
158 WalkerSenderState* walker_state = new WalkerSenderState(sendingState);
159 pkt->pushSenderState(walker_state);
160 if (port.sendTimingReq(pkt)) {
161 return true;
162 } else {
163 // undo the adding of the sender state and delete it, as we
164 // will do it again the next time we attempt to send it
165 pkt->popSenderState();
166 delete walker_state;
167 return false;
168 }
169
170 }
It’s worth noting that the “sendTiming” function initially generates a separate state called “WalkerSenderState.” This state variable is essential for handling the requested page table access and for processing the response from the memory subsystem once the page table access has been completed.
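The pushSenderState/popSenderState round trip can be sketched as a stack carried on the packet. gem5 stacks SenderState pointers; plain integer ids stand in for them in this toy model:

```cpp
#include <cassert>
#include <cstdint>
#include <stack>

// Toy model of pushSenderState/popSenderState: the requester tags the packet
// with its own state before sending, and the response handler pops it to find
// which walk the returning packet belongs to.
struct ToyPacket {
    std::stack<uint64_t> sender_states;
    void push_sender_state(uint64_t s) { sender_states.push(s); }
    uint64_t pop_sender_state() {
        uint64_t s = sender_states.top();
        sender_states.pop();
        return s;
    }
};

inline uint64_t round_trip(uint64_t walk_id) {
    ToyPacket pkt;
    pkt.push_sender_state(walk_id);   // in sendTiming, before sendTimingReq
    // ... the packet travels through the memory system and comes back ...
    return pkt.pop_sender_state();    // in recvTimingResp
}
```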
Handling return packet from memory sub-system
When the memory subsystem has successfully handled the page table access request, the pagetable_walker receives the result packet through its port. When the packet arrives at the port connecting the pagetable_walker and the memory subsystem, it invokes the walker's recvTimingResp function.
104 bool
105 Walker::WalkerPort::recvTimingResp(PacketPtr pkt)
106 {
107 return walker->recvTimingResp(pkt);
108 }
109
110 bool
111 Walker::recvTimingResp(PacketPtr pkt)
112 {
113 WalkerSenderState * senderState =
114 dynamic_cast<WalkerSenderState *>(pkt->popSenderState());
115 WalkerState * senderWalk = senderState->senderWalk;
116 bool walkComplete = senderWalk->recvPacket(pkt);
117 delete senderState;
118 if (walkComplete) {
119 std::list<WalkerState *>::iterator iter;
120 for (iter = currStates.begin(); iter != currStates.end(); iter++) {
121 WalkerState * walkerState = *(iter);
122 if (walkerState == senderWalk) {
123 iter = currStates.erase(iter);
124 break;
125 }
126 }
127 delete senderWalk;
128 // Since we block requests when another is outstanding, we
129 // need to check if there is a waiting request to be serviced
130 if (currStates.size() && !startWalkWrapperEvent.scheduled())
131 // delay sending any new requests until we are finished
132 // with the responses
133 schedule(startWalkWrapperEvent, clockEdge());
134 }
135 return true;
136 }
As we've seen before, WalkerSenderState wraps the WalkerState instance that was used to send the page table access request associated with the currently received packet.
recvPacket handles received packet and send another packet for next stage pagetable access
The retrieved WalkerState instance handles the received packet through its recvPacket function.
602 bool
603 Walker::WalkerState::recvPacket(PacketPtr pkt)
604 {
605 assert(pkt->isResponse());
606 assert(inflight);
607 assert(state == Waiting);
608 inflight--;
609 if (squashed) {
610 // if were were squashed, return true once inflight is zero and
611 // this WalkerState will be freed there.
612 return (inflight == 0);
613 }
614 if (pkt->isRead()) {
615 // should not have a pending read it we also had one outstanding
616 assert(!read);
617
618 // @todo someone should pay for this
619 pkt->headerDelay = pkt->payloadDelay = 0;
620
621 state = nextState;
622 nextState = Ready;
623 PacketPtr write = NULL;
624 read = pkt;
625 timingFault = stepWalk(write);
626 state = Waiting;
627 assert(timingFault == NoFault || read == NULL);
628 if (write) {
629 writes.push_back(write);
630 }
631 sendPackets();
632 } else {
633 sendPackets();
634 }
635 if (inflight == 0 && read == NULL && writes.size() == 0) {
636 state = Ready;
637 nextState = Waiting;
638 if (timingFault == NoFault) {
639 /*
640 * Finish the translation. Now that we know the right entry is
641 * in the TLB, this should work with no memory accesses.
642 * There could be new faults unrelated to the table walk like
643 * permissions violations, so we'll need the return value as
644 * well.
645 */
646 bool delayedResponse;
647 Fault fault = walker->tlb->translate(req, tc, NULL, mode,
648 delayedResponse, true);
649 assert(!delayedResponse);
650 // Let the CPU continue.
651 translation->finish(fault, req, tc, mode);
652 } else {
653 // There was a fault during the walk. Let the CPU know.
654 translation->finish(timingFault, req, tc, mode);
655 }
656 return true;
657 }
658
659 return false;
660 }
Because recvPacket has been invoked as the result of a memory read (the initial page table access), lines 614-634 will be executed. There are some functions we haven't seen yet, but eventually it invokes sendPackets again. Wait, why call sendPackets once more inside a receive function?
Remember! Page table walking is not a single memory access
Note that we are currently handling the result packet for the initial page table access request (accessing the first level of the page table). Therefore, the received packet contains the next level of page table information, not the leaf page table entry that actually holds the virtual-to-physical mapping. To acquire the last-level page table entry, additional memory accesses to the lower levels of the page table are required, each of which needs another sendPackets.
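Each of those follow-up reads is addressed the way stepWalk computes nextRead: take the frame number from the just-read entry and add the next 9-bit index times the entry size. A sketch (mask_bits is an illustrative stand-in for gem5's mask()):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for gem5's mask(): the low n bits set.
inline uint64_t mask_bits(int n) { return (n >= 64) ? ~0ULL : ((1ULL << n) - 1); }

// Mirrors stepWalk's nextRead computation: bits 51:12 of the just-read entry
// give the next table's physical base, and the next 9-bit index selects an
// 8-byte entry within it.
inline uint64_t next_read_addr(uint64_t pte, uint64_t next_index) {
    const uint64_t data_size = 8;
    uint64_t table_base = pte & (mask_bits(40) << 12);
    return table_base + next_index * data_size;
}
```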
Preparing packets for the next pagetable access requests
As the initial packet was generated with the help of setupWalk, the packets required for accessing further page table levels are prepared by the stepWalk function.
282 Fault
283 Walker::WalkerState::stepWalk(PacketPtr &write)
284 {
285 assert(state != Ready && state != Waiting);
286 Fault fault = NoFault;
287 write = NULL;
288 PageTableEntry pte;
289 if (dataSize == 8)
290 pte = read->getLE<uint64_t>();
291 else
292 pte = read->getLE<uint32_t>();
293 VAddr vaddr = entry.vaddr;
294 bool uncacheable = pte.pcd;
295 Addr nextRead = 0;
296 bool doWrite = false;
297 bool doTLBInsert = false;
298 bool doEndWalk = false;
299 bool badNX = pte.nx && mode == BaseTLB::Execute && enableNX;
300 switch(state) {
301 case LongPML4:
302 DPRINTF(PageTableWalker,
303 "Got long mode PML4 entry %#016x.\n", (uint64_t)pte);
304 nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl3 * dataSize;
305 doWrite = !pte.a;
306 pte.a = 1;
307 entry.writable = pte.w;
308 entry.user = pte.u;
309 if (badNX || !pte.p) {
310 doEndWalk = true;
311 fault = pageFault(pte.p);
312 break;
313 }
314 entry.noExec = pte.nx;
315 nextState = LongPDP;
316 break;
317 case LongPDP:
318 DPRINTF(PageTableWalker,
319 "Got long mode PDP entry %#016x.\n", (uint64_t)pte);
320 nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl2 * dataSize;
321 doWrite = !pte.a;
322 pte.a = 1;
323 entry.writable = entry.writable && pte.w;
324 entry.user = entry.user && pte.u;
325 if (badNX || !pte.p) {
326 doEndWalk = true;
327 fault = pageFault(pte.p);
328 break;
329 }
330 nextState = LongPD;
331 break;
332 case LongPD:
333 DPRINTF(PageTableWalker,
334 "Got long mode PD entry %#016x.\n", (uint64_t)pte);
335 doWrite = !pte.a;
336 pte.a = 1;
337 entry.writable = entry.writable && pte.w;
338 entry.user = entry.user && pte.u;
339 if (badNX || !pte.p) {
340 doEndWalk = true;
341 fault = pageFault(pte.p);
342 break;
343 }
344 if (!pte.ps) {
345 // 4 KB page
346 entry.logBytes = 12;
347 nextRead =
348 ((uint64_t)pte & (mask(40) << 12)) + vaddr.longl1 * dataSize;
349 nextState = LongPTE;
350 break;
351 } else {
352 // 2 MB page
353 entry.logBytes = 21;
354 entry.paddr = (uint64_t)pte & (mask(31) << 21);
355 entry.uncacheable = uncacheable;
356 entry.global = pte.g;
357 entry.patBit = bits(pte, 12);
358 entry.vaddr = entry.vaddr & ~((2 * (1 << 20)) - 1);
359 doTLBInsert = true;
360 doEndWalk = true;
361 break;
362 }
363 case LongPTE:
364 DPRINTF(PageTableWalker,
365 "Got long mode PTE entry %#016x.\n", (uint64_t)pte);
366 doWrite = !pte.a;
367 pte.a = 1;
368 entry.writable = entry.writable && pte.w;
369 entry.user = entry.user && pte.u;
370 if (badNX || !pte.p) {
371 doEndWalk = true;
372 fault = pageFault(pte.p);
373 break;
374 }
375 entry.paddr = (uint64_t)pte & (mask(40) << 12);
376 entry.uncacheable = uncacheable;
377 entry.global = pte.g;
378 entry.patBit = bits(pte, 12);
379 entry.vaddr = entry.vaddr & ~((4 * (1 << 10)) - 1);
380 doTLBInsert = true;
381 doEndWalk = true;
382 break;
383 case PAEPDP:
384 DPRINTF(PageTableWalker,
385 "Got legacy mode PAE PDP entry %#08x.\n", (uint32_t)pte);
386 nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.pael2 * dataSize;
387 if (!pte.p) {
388 doEndWalk = true;
389 fault = pageFault(pte.p);
390 break;
391 }
392 nextState = PAEPD;
393 break;
394 case PAEPD:
395 DPRINTF(PageTableWalker,
396 "Got legacy mode PAE PD entry %#08x.\n", (uint32_t)pte);
397 doWrite = !pte.a;
398 pte.a = 1;
399 entry.writable = pte.w;
400 entry.user = pte.u;
401 if (badNX || !pte.p) {
402 doEndWalk = true;
403 fault = pageFault(pte.p);
404 break;
405 }
406 if (!pte.ps) {
407 // 4 KB page
408 entry.logBytes = 12;
409 nextRead = ((uint64_t)pte & (mask(40) << 12)) + vaddr.pael1 * dataSize;
410 nextState = PAEPTE;
411 break;
412 } else {
413 // 2 MB page
414 entry.logBytes = 21;
415 entry.paddr = (uint64_t)pte & (mask(31) << 21);
416 entry.uncacheable = uncacheable;
417 entry.global = pte.g;
418 entry.patBit = bits(pte, 12);
419 entry.vaddr = entry.vaddr & ~((2 * (1 << 20)) - 1);
420 doTLBInsert = true;
421 doEndWalk = true;
422 break;
443 break;
444 case PSEPD: 445 DPRINTF(PageTableWalker,
446 "Got legacy mode PSE PD entry %#08x.\n", (uint32_t)pte);
447 doWrite = !pte.a;
448 pte.a = 1;
449 entry.writable = pte.w;
450 entry.user = pte.u;
451 if (!pte.p) {
452 doEndWalk = true;
453 fault = pageFault(pte.p);
454 break;
455 }
456 if (!pte.ps) {
457 // 4 KB page
458 entry.logBytes = 12;
459 nextRead =
460 ((uint64_t)pte & (mask(20) << 12)) + vaddr.norml2 * dataSize;
461 nextState = PTE;
462 break;
463 } else {
464 // 4 MB page
465 entry.logBytes = 21;
466 entry.paddr = bits(pte, 20, 13) << 32 | bits(pte, 31, 22) << 22;
467 entry.uncacheable = uncacheable;
468 entry.global = pte.g;
469 entry.patBit = bits(pte, 12);
470 entry.vaddr = entry.vaddr & ~((4 * (1 << 20)) - 1);
471 doTLBInsert = true;
472 doEndWalk = true;
473 break;
474 }
475 case PD:
476 DPRINTF(PageTableWalker,
477 "Got legacy mode PD entry %#08x.\n", (uint32_t)pte);
478 doWrite = !pte.a;
479 pte.a = 1;
480 entry.writable = pte.w;
481 entry.user = pte.u;
482 if (!pte.p) {
483 doEndWalk = true;
484 fault = pageFault(pte.p);
485 break;
486 }
487 // 4 KB page
488 entry.logBytes = 12;
489 nextRead = ((uint64_t)pte & (mask(20) << 12)) + vaddr.norml2 * dataSize;
490 nextState = PTE;
491 break;
492 case PTE:
493 DPRINTF(PageTableWalker,
494 "Got legacy mode PTE entry %#08x.\n", (uint32_t)pte);
495 doWrite = !pte.a;
496 pte.a = 1;
497 entry.writable = pte.w;
498 entry.user = pte.u;
499 if (!pte.p) {
500 doEndWalk = true;
501 fault = pageFault(pte.p);
502 break;
503 }
504 entry.paddr = (uint64_t)pte & (mask(20) << 12);
505 entry.uncacheable = uncacheable;
506 entry.global = pte.g;
507 entry.patBit = bits(pte, 7);
508 entry.vaddr = entry.vaddr & ~((4 * (1 << 10)) - 1);
509 doTLBInsert = true;
510 doEndWalk = true;
511 break;
512 default:
513 panic("Unknown page table walker state %d!\n");
514 }
515 if (doEndWalk) {
516 if (doTLBInsert)
517 if (!functional)
518 walker->tlb->insert(entry.vaddr, entry, tc);
519 endWalk();
520 } else {
521 PacketPtr oldRead = read;
522 //If we didn't return, we're setting up another read.
523 Request::Flags flags = oldRead->req->getFlags();
524 flags.set(Request::UNCACHEABLE, uncacheable);
525 RequestPtr request = std::make_shared<Request>(
526 nextRead, oldRead->getSize(), flags, walker->masterId);
527 read = new Packet(request, MemCmd::ReadReq);
528 read->allocate();
529 // If we need to write, adjust the read packet to write the modified
530 // value back to memory.
531 if (doWrite) {
532 write = oldRead;
533 write->setLE<uint64_t>(pte);
534 write->cmd = MemCmd::WriteReq;
535 } else {
536 write = NULL;
537 delete oldRead;
538 }
539 }
540 return fault;
541 }
Even though the function is very long, depending on the current state, which represents the page table level accessed by the currently received packet, the next level's address is computed and a corresponding packet is populated.
Because we arrive here as a result of accessing the PML4 (the first-level page table), lines 301-316 will be executed to prepare the information for accessing the next level. Note that nextState is set to the next page table level, LongPDP.
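Part of what lines 301-316 do is maintain the accessed bit: a write-back is scheduled only if the bit was previously clear (doWrite = !pte.a). A sketch of just that piece (bit 5 is the x86 accessed flag):

```cpp
#include <cassert>
#include <cstdint>

// Bit 5 of an x86 page table entry is the "accessed" flag. stepWalk sets it
// and schedules a write-back only when it was not already set (doWrite = !pte.a).
inline bool mark_accessed(uint64_t &pte) {
    const uint64_t kAccessed = 1ULL << 5;
    bool do_write = (pte & kAccessed) == 0;
    pte |= kAccessed;
    return do_write;
}

// Driver returning whether a write-back would be needed for this entry.
inline bool demo_mark(uint64_t pte) { return mark_accessed(pte); }
```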
After setting the fields associated with the next page table level access, it generates another read packet (lines 520-539) carrying all the information required to access the next level of the page table. Note that the newly populated packet is assigned to the read field of the current WalkerState object; this read packet is used by the sendPackets function to access further page table levels.
These send and receive steps are repeated until the final PTE is read. When the PTE is read from the memory subsystem, the doEndWalk and doTLBInsert flags are set; when they are, the new TLB entry is inserted into the TLB module (lines 515-520).
Finish TLB translation
After the translation has finished, whether it ended in a TLB hit, a TLB miss with page table walking, or an unexpected fault, the finish function is invoked through a translation object.
gem5/src/arch/x86/tlb.cc
441 void
442 TLB::translateTiming(const RequestPtr &req, ThreadContext *tc,
443 Translation *translation, Mode mode)
444 {
445 bool delayedResponse;
446 assert(translation);
447 Fault fault =
448 TLB::translate(req, tc, translation, mode, delayedResponse, true);
449
450 if (!delayedResponse)
451 translation->finish(fault, req, tc, mode);
452 else
453 translation->markDelayed();
454 }
Translation object
Wait, what is the translation object? We haven't dealt with it before. Let's go back to the initiateMemRead function to understand what it is.
DataTranslation class and finish method
464 WholeTranslationState *state =
465 new WholeTranslationState(req, new uint8_t[size], NULL, mode);
466 DataTranslation<TimingSimpleCPU *> *translation
467 = new DataTranslation<TimingSimpleCPU *>(this, state);
468 thread->dtb->translateTiming(req, thread->getTC(), translation, mode);
At lines 464-468, we can see that it is an object of the DataTranslation class. To find the implementation of the finish function, let's take a look at the DataTranslation class.
gem5/src/cpu/translation.hh
208 /**
209 * This class represents part of a data address translation. All state for
210 * the translation is held in WholeTranslationState (above). Therefore this
211 * class does not need to know whether the translation is split or not. The
212 * index variable determines this but is simply passed on to the state class.
213 * When this part of the translation is completed, finish is called. If the
214 * translation state class indicate that the whole translation is complete
215 * then the execution context is informed.
216 */
217 template <class ExecContextPtr>
218 class DataTranslation : public BaseTLB::Translation
219 {
220 protected:
221 ExecContextPtr xc;
222 WholeTranslationState *state;
223 int index;
224
225 public:
226 DataTranslation(ExecContextPtr _xc, WholeTranslationState* _state)
227 : xc(_xc), state(_state), index(0)
228 {
229 }
230
231 DataTranslation(ExecContextPtr _xc, WholeTranslationState* _state,
232 int _index)
233 : xc(_xc), state(_state), index(_index)
234 {
235 }
236
237 /**
238 * Signal the translation state that the translation has been delayed due
239 * to a hw page table walk. Split requests are transparently handled.
240 */
241 void
242 markDelayed()
243 {
244 state->delay = true;
245 }
246
247 /**
248 * Finish this part of the translation and indicate that the whole
249 * translation is complete if the state says so.
250 */
251 void
252 finish(const Fault &fault, const RequestPtr &req, ThreadContext *tc,
253 BaseTLB::Mode mode)
254 {
255 assert(state);
256 assert(mode == state->mode);
257 if (state->finish(fault, index)) {
258 if (state->getFault() == NoFault) {
259 // Don't access the request if faulted (due to squash)
260 req->setTranslateLatency();
261 }
262 xc->finishTranslation(state);
263 }
264 delete this;
265 }
266
267 bool
268 squashed() const
269 {
270 return xc->isSquashed();
271 }
272 };
We can see that the finish function is implemented in the DataTranslation class. It first invokes another finish function through the state member field (line 257). When that call indicates the whole translation is complete, it then invokes the finishTranslation method of the execution context, regardless of whether a fault was raised during TLB processing.
WholeTranslationState class object contains translation related info
When we look at the initiateMemRead function again, a WholeTranslationState instance is passed to the DataTranslation constructor as the state parameter.
Therefore, state->finish inside DataTranslation invokes the WholeTranslationState::finish method. Note that WholeTranslationState holds the actual memory request(s) being translated.
gem5/src/cpu/translation.hh
51 /**
52 * This class captures the state of an address translation. A translation
53 * can be split in two if the ISA supports it and the memory access crosses
54 * a page boundary. In this case, this class is shared by two data
55 * translations (below). Otherwise it is used by a single data translation
56 * class. When each part of the translation is finished, the finish
57 * function is called which will indicate whether the whole translation is
58 * completed or not. There are also functions for accessing parts of the
59 * translation state which deal with the possible split correctly.
60 */
61 class WholeTranslationState
62 {
63 protected:
64 int outstanding;
65 Fault faults[2];
66
67 public:
68 bool delay;
69 bool isSplit;
70 RequestPtr mainReq;
71 RequestPtr sreqLow;
72 RequestPtr sreqHigh;
73 uint8_t *data;
74 uint64_t *res;
75 BaseTLB::Mode mode;
76
77 /**
78 * Single translation state. We set the number of outstanding
79 * translations to one and indicate that it is not split.
80 */
81 WholeTranslationState(const RequestPtr &_req, uint8_t *_data,
82 uint64_t *_res, BaseTLB::Mode _mode)
83 : outstanding(1), delay(false), isSplit(false), mainReq(_req),
84 sreqLow(NULL), sreqHigh(NULL), data(_data), res(_res), mode(_mode)
85 {
86 faults[0] = faults[1] = NoFault;
87 assert(mode == BaseTLB::Read || mode == BaseTLB::Write);
88 }
89
90 /**
91 * Split translation state. We copy all state into this class, set the
92 * number of outstanding translations to two and then mark this as a
93 * split translation.
94 */
95 WholeTranslationState(const RequestPtr &_req, const RequestPtr &_sreqLow,
96 const RequestPtr &_sreqHigh, uint8_t *_data,
97 uint64_t *_res, BaseTLB::Mode _mode)
98 : outstanding(2), delay(false), isSplit(true), mainReq(_req),
99 sreqLow(_sreqLow), sreqHigh(_sreqHigh), data(_data), res(_res),
100 mode(_mode)
101 {
102 faults[0] = faults[1] = NoFault;
103 assert(mode == BaseTLB::Read || mode == BaseTLB::Write);
104 }
105
106 /**
107 * Finish part of a translation. If there is only one request then this
108 * translation is completed. If the request has been split in two then
109 * the outstanding count determines whether the translation is complete.
110 * In this case, flags from the split request are copied to the main
111 * request to make it easier to access them later on.
112 */
113 bool
114 finish(const Fault &fault, int index)
115 {
116 assert(outstanding);
117 faults[index] = fault;
118 outstanding--;
119 if (isSplit && outstanding == 0) {
120
121 // For ease later, we copy some state to the main request.
122 if (faults[0] == NoFault) {
123 mainReq->setPaddr(sreqLow->getPaddr());
124 }
125 mainReq->setFlags(sreqLow->getFlags());
126 mainReq->setFlags(sreqHigh->getFlags());
127 }
128 return outstanding == 0;
129 }
The finish function of WholeTranslationState stores the generated fault in its internal buffer when a fault has been raised during the translation process (line 117). It also decrements the outstanding counter and returns true only when no part of the translation is still pending. Once it returns true, the remaining part of DataTranslation's finish invokes xc->finishTranslation(state). Note that finishTranslation takes the WholeTranslationState instance as its state argument.
To understand the details, we have to find out what the xc variable is. Because DataTranslation is a template class and xc is declared with the template parameter type, xc will be an instance of whatever type the template was instantiated with.
Now it's the processor's turn, not the TLB's
DataTranslation as an interface to interact with CPU
466 DataTranslation<TimingSimpleCPU *> *translation
467 = new DataTranslation<TimingSimpleCPU *>(this, state);
Because the translation variable has been instantiated as DataTranslation<TimingSimpleCPU *>, the xc member is a pointer to the TimingSimpleCPU itself. Therefore, xc->finishTranslation resolves to TimingSimpleCPU::finishTranslation.
What the CPU has to do after the TLB finishes its job
627 void
628 TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
629 {
630 _status = BaseSimpleCPU::Running;
631
632 if (state->getFault() != NoFault) {
633 if (state->isPrefetch()) {
634 state->setNoFault();
635 }
636 delete [] state->data;
637 state->deleteReqs();
638 translationFault(state->getFault());
639 } else {
640 if (!state->isSplit) {
641 sendData(state->mainReq, state->data, state->res,
642 state->mode == BaseTLB::Read);
643 } else {
644 sendSplitData(state->sreqLow, state->sreqHigh, state->mainReq,
645 state->data, state->mode == BaseTLB::Read);
646 }
647 }
648
649 delete state;
650 }
When a translation fault exists, it ends up invoking the CPU's translationFault function with the previously stored fault (line 638). Note that the state->getFault method returns the fault stored earlier by WholeTranslationState::finish. When the translation was triggered by a prefetch instruction, the generated fault is suppressed because it is not critical for execution.
When no fault was encountered during the translation, it invokes the sendData function instead. We will cover this later.
Let the CPU handle the TLB fault
361 void
362 TimingSimpleCPU::translationFault(const Fault &fault)
363 {
364 // fault may be NoFault in cases where a fault is suppressed,
365 // for instance prefetches.
366 updateCycleCounts();
367 updateCycleCounters(BaseCPU::CPU_STATE_ON);
368
369 if (traceData) {
370 // Since there was a fault, we shouldn't trace this instruction.
371 delete traceData;
372 traceData = NULL;
373 }
374
375 postExecute();
376
377 advanceInst(fault);
378 }
The translationFault function invokes the postExecute and advanceInst functions. From the fault argument, we can infer that advanceInst actually deals with the fault. The postExecute function does not advance the pipeline; it only updates processor statistics such as the power model and load instruction counters. Therefore, let's jump into the advanceInst function.
advanceInst to process generated translation fault
gem5/src/cpu/simple/timing.cc
734 void
735 TimingSimpleCPU::advanceInst(const Fault &fault)
736 {
737 SimpleExecContext &t_info = *threadInfo[curThread];
738
739 if (_status == Faulting)
740 return;
741
742 if (fault != NoFault) {
743 DPRINTF(SimpleCPU, "Fault occured. Handling the fault\n");
744
745 advancePC(fault);
746
747 // A syscall fault could suspend this CPU (e.g., futex_wait)
748 // If the _status is not Idle, schedule an event to fetch the next
749 // instruction after 'stall' ticks.
750 // If the cpu has been suspended (i.e., _status == Idle), another
751 // cpu will wake this cpu up later.
752 if (_status != Idle) {
753 DPRINTF(SimpleCPU, "Scheduling fetch event after the Fault\n");
754
755 Tick stall = dynamic_pointer_cast<SyscallRetryFault>(fault) ?
756 clockEdge(syscallRetryLatency) : clockEdge();
757 reschedule(fetchEvent, stall, true);
758 _status = Faulting;
759 }
760
761 return;
762 }
763
764 if (!t_info.stayAtPC)
765 advancePC(fault);
766
767 if (tryCompleteDrain())
768 return;
769
770 if (_status == BaseSimpleCPU::Running) {
771 // kick off fetch of next instruction... callback from icache
772 // response will cause that instruction to be executed,
773 // keeping the CPU running.
774 fetch();
775 }
776 }
When there is a pending translation fault, advanceInst delegates fault handling to the advancePC function, which actually controls the PC register of the CPU. Because TimingSimpleCPU inherits this function from BaseSimpleCPU, we will look at the BaseSimpleCPU class.
gem5/src/cpu/simple/base.cc
661 void
662 BaseSimpleCPU::advancePC(const Fault &fault)
663 {
664 SimpleExecContext &t_info = *threadInfo[curThread];
665 SimpleThread* thread = t_info.thread;
666
667 const bool branching(thread->pcState().branching());
668
669 //Since we're moving to a new pc, zero out the offset
670 t_info.fetchOffset = 0;
671 if (fault != NoFault) {
672 curMacroStaticInst = StaticInst::nullStaticInstPtr;
673 fault->invoke(threadContexts[curThread], curStaticInst);
674 thread->decoder.reset();
675 } else {
676 if (curStaticInst) {
677 if (curStaticInst->isLastMicroop())
678 curMacroStaticInst = StaticInst::nullStaticInstPtr;
679 TheISA::PCState pcState = thread->pcState();
680 TheISA::advancePC(pcState, curStaticInst);
681 thread->pcState(pcState);
682 }
683 }
684
685 if (branchPred && curStaticInst && curStaticInst->isControl()) {
686 // Use a fake sequence number since we only have one
687 // instruction in flight at the same time.
688 const InstSeqNum cur_sn(0);
689
690 if (t_info.predPC == thread->pcState()) {
691 // Correctly predicted branch
692 branchPred->update(cur_sn, curThread);
693 } else {
694 // Mis-predicted branch
695 branchPred->squash(cur_sn, thread->pcState(), branching, curThread);
696 ++t_info.numBranchMispred;
697 }
698 }
699 }
In general, the advancePC function updates the current CPU context. Depending on whether a fault has been raised, it takes different paths to handle the fault and redirect the PC. The invoke function called through the fault object handles the generated fault, usually with the help of pre-defined ROM code. advancePC also resets the decoder and sets curMacroStaticInst to null, because execution must continue at a new PC after the fault is handled.
On the other hand, on the commonly taken path where no fault was raised during the current instruction's execution, it simply advances the processor's PC state to the next instruction (lines 676-682).
Pre-defined ROM code handles the generated fault!
Then let’s take a look at how the fault can be handled by the invoke function implemented in the fault class.
gem5/src/arch/x86/faults.cc
53 namespace X86ISA
54 {
55 void X86FaultBase::invoke(ThreadContext * tc, const StaticInstPtr &inst)
56 {
57 if (!FullSystem) {
58 FaultBase::invoke(tc, inst);
59 return;
60 }
61
62 PCState pcState = tc->pcState();
63 Addr pc = pcState.pc();
64 DPRINTF(Faults, "RIP %#x: vector %d: %s\n",
65 pc, vector, describe());
66 using namespace X86ISAInst::RomLabels;
67 HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
68 MicroPC entry;
69 if (m5reg.mode == LongMode) {
70 if (isSoft()) {
71 entry = extern_label_longModeSoftInterrupt;
72 } else {
73 entry = extern_label_longModeInterrupt;
74 }
75 } else {
76 entry = extern_label_legacyModeInterrupt;
77 }
78 tc->setIntReg(INTREG_MICRO(1), vector);
79 tc->setIntReg(INTREG_MICRO(7), pc);
80 if (errorCode != (uint64_t)(-1)) {
81 if (m5reg.mode == LongMode) {
82 entry = extern_label_longModeInterruptWithError;
83 } else {
84 panic("Legacy mode interrupts with error codes "
85 "aren't implementde.\n");
86 }
87 // Software interrupts shouldn't have error codes. If one
88 // does, there would need to be microcode to set it up.
89 assert(!isSoft());
90 tc->setIntReg(INTREG_MICRO(15), errorCode);
91 }
92 pcState.upc(romMicroPC(entry));
93 pcState.nupc(romMicroPC(entry) + 1);
94 tc->pcState(pcState);
95 }
To understand the behavior of the fault's invoke function, we have to look at the fault-related classes first. GEM5 provides a base interface for every fault defined in the x86 architecture. x86 defines several types of events that can intervene in the execution flow: faults, aborts, traps, and interrupts. All of these events inherit from the base x86 fault class X86FaultBase, which provides the general interface and semantics of x86 fault events.
Depending on the type of event, classes inheriting from X86FaultBase can override the invoke function to define their own fault semantics. For example, the PageFault class inherits from X86FaultBase and overrides invoke to add its own page-fault-related behavior before calling the parent's invoke provided by X86FaultBase.
Invoke changes the current RIP to pre-defined microops
Basically, the invoke function makes the processor jump to the pre-defined microcode routine that implements the actual semantics of x86 fault handling. When a fault or interrupt is reported to the processor, it must first store the current context of the processor, and then transfer control flow to the designated fault handler located through the IDT (whose base is held in the IDTR register) in x86.
To jump to the pre-defined ROM code, the invoke function makes use of ROM labels, which are static offsets into a sequence of x86 microops. All the available ROM labels are defined in the RomLabels namespace, as shown below.
gem5/build/X86/arch/x86/generated/decoder-ns.hh.inc
4587 namespace RomLabels {
4588 const static uint64_t label_longModeSoftInterrupt_stackSwitched = 92;
4589 const static uint64_t label_longModeInterrupt_processDescriptor = 11;
4590 const static uint64_t label_longModeInterruptWithError_cplStackSwitch = 152;
4591 const static uint64_t label_longModeInterrupt_istStackSwitch = 28;
4592 const static uint64_t label_jmpFarWork = 192;
4593 const static uint64_t label_farJmpSystemDescriptor = 207;
4594 const static uint64_t label_longModeSoftInterrupt_globalDescriptor = 71;
4595 const static uint64_t label_farJmpGlobalDescriptor = 199;
4596 const static uint64_t label_initIntHalt = 186;
4597 const static uint64_t label_longModeInterruptWithError_istStackSwitch = 150;
4598 const static uint64_t label_legacyModeInterrupt = 184;
4599 const static uint64_t label_longModeInterruptWithError_globalDescriptor = 132;
4600 const static uint64_t label_longModeSoftInterrupt_processDescriptor = 72;
4601 const static uint64_t label_longModeInterruptWithError = 122;
4602 const static uint64_t label_farJmpProcessDescriptor = 200;
4603 const static uint64_t label_longModeSoftInterrupt = 61;
4604 const static uint64_t label_longModeSoftInterrupt_istStackSwitch = 89;
4605 const static uint64_t label_longModeInterrupt_globalDescriptor = 10;
4606 const static uint64_t label_longModeInterrupt_cplStackSwitch = 30;
4607 const static uint64_t label_longModeInterrupt = 0;
4608 const static uint64_t label_longModeInterruptWithError_processDescriptor = 133;
4609 const static uint64_t label_longModeInterruptWithError_stackSwitched = 153;
4610 const static uint64_t label_longModeInterrupt_stackSwitched = 31;
4611 const static uint64_t label_longModeSoftInterrupt_cplStackSwitch = 91;
4612 const static MicroPC extern_label_initIntHalt = 186;
4613 const static MicroPC extern_label_longModeInterruptWithError = 122;
4614 const static MicroPC extern_label_longModeInterrupt = 0;
4615 const static MicroPC extern_label_longModeSoftInterrupt = 61;
4616 const static MicroPC extern_label_legacyModeInterrupt = 184;
4617 const static MicroPC extern_label_jmpFarWork = 192;
4618 }
PageFault handling ROM code
Although we have been calling it a translation fault, note that it is represented as a PageFault in x86.
gem5/src/arch/x86/faults.cc
137 void PageFault::invoke(ThreadContext * tc, const StaticInstPtr &inst)
138 {
139 if (FullSystem) {
140 /* Invalidate any matching TLB entries before handling the page fault */
141 tc->getITBPtr()->demapPage(addr, 0);
142 tc->getDTBPtr()->demapPage(addr, 0);
143 HandyM5Reg m5reg = tc->readMiscRegNoEffect(MISCREG_M5_REG);
144 X86FaultBase::invoke(tc);
145 /*
146 * If something bad happens while trying to enter the page fault
147 * handler, I'm pretty sure that's a double fault and then all
148 * bets are off. That means it should be safe to update this
149 * state now.
150 */
151 if (m5reg.mode == LongMode) {
152 tc->setMiscReg(MISCREG_CR2, addr);
153 } else {
154 tc->setMiscReg(MISCREG_CR2, (uint32_t)addr);
155 }
156 } else {
157 PageFaultErrorCode code = errorCode;
158 const char *modeStr = "";
159 if (code.fetch)
160 modeStr = "execute";
161 else if (code.write)
162 modeStr = "write";
163 else
164 modeStr = "read";
165
166 // print information about what we are panic'ing on
167 if (!inst) {
168 panic("Tried to %s unmapped address %#x.\n", modeStr, addr);
169 } else {
170 panic("Tried to %s unmapped address %#x.\nPC: %#x, Instr: %s",
171 modeStr, addr, tc->pcState().pc(),
172 inst->disassemble(tc->pcState().pc(), debugSymbolTable));
173 }
174 }
175 }
Because most of the fault handling logic of the PageFault class overlaps with X86FaultBase, after handling the TLB-related work (and recording the faulting address in CR2), it just calls the invoke function of the X86FaultBase class. Because a translation fault mainly happens in long mode, and the generated fault is not a software interrupt, we will take a look at the ROM label named label_longModeInterrupt.
Pass arguments to the ROM code
Also, before jumping to the ROM label, it sets micro-architectural registers to pass the interrupt vector number and PC address to the ROM code. Additionally, when the interrupt carries an error code, that code must also be passed to the microcode.
To pass the arguments to the microcode world, it invokes the setIntReg function defined on the thread context. Here the thread context is an instance of the SimpleThread class defined in cpu/simple_thread.hh (when you use the O3 out-of-order CPU model, you have to look at the O3ThreadContext class instead). Regardless of the processor model, both classes implement the ThreadContext interface, which provides a generic register context and an interface for manipulating the registers.
gem5/src/cpu/simple_thread.hh
98 class SimpleThread : public ThreadState, public ThreadContext
99 {
100 protected:
101 typedef TheISA::MachInst MachInst;
102 using VecRegContainer = TheISA::VecRegContainer;
103 using VecElem = TheISA::VecElem;
104 using VecPredRegContainer = TheISA::VecPredRegContainer;
105 public:
106 typedef ThreadContext::Status Status;
107
108 protected:
109 std::array<RegVal, TheISA::NumFloatRegs> floatRegs;
110 std::array<RegVal, TheISA::NumIntRegs> intRegs;
111 std::array<VecRegContainer, TheISA::NumVecRegs> vecRegs;
112 std::array<VecPredRegContainer, TheISA::NumVecPredRegs> vecPredRegs;
113 std::array<RegVal, TheISA::NumCCRegs> ccRegs;
114 TheISA::ISA *const isa; // one "instance" of the current ISA.
115
116 TheISA::PCState _pcState;
477 void
478 setIntReg(RegIndex reg_idx, RegVal val) override
479 {
480 int flatIndex = isa->flattenIntIndex(reg_idx);
481 assert(flatIndex < TheISA::NumIntRegs);
482 DPRINTF(IntRegs, "Setting int reg %d (%d) to %#x.\n",
483 reg_idx, flatIndex, val);
484 setIntRegFlat(flatIndex, val);
485 }
Detour to TheISA namespace
Although the SimpleThread class can be seen as providing generic registers regardless of architecture, it declares ISA-dependent registers. The magic is the TheISA symbol: it is translated to an architecture-specific namespace depending on the architecture GEM5 has been compiled for. Let's detour a little and figure out how the TheISA namespace works.
If you don't know what the TheISA namespace is, you may want to grep for "namespace TheISA" to find the files that define it. Unfortunately, you will only find a few places where a TheISA namespace is declared, with a handful of member functions. Then where do the functions and variables of the TheISA namespace come from? To understand TheISA::, we should look at the build files, not the source files.
build/X86/config/the_isa.hh
1 #ifndef __CONFIG_THE_ISA_HH__
2 #define __CONFIG_THE_ISA_HH__
3
4 #define ALPHA_ISA 1
5 #define ARM_ISA 2
6 #define MIPS_ISA 3
7 #define NULL_ISA 4
8 #define POWER_ISA 5
9 #define RISCV_ISA 6
10 #define SPARC_ISA 7
11 #define X86_ISA 8
12
13 enum class Arch {
14 AlphaISA = ALPHA_ISA,
15 ArmISA = ARM_ISA,
16 MipsISA = MIPS_ISA,
17 NullISA = NULL_ISA,
18 PowerISA = POWER_ISA,
19 RiscvISA = RISCV_ISA,
20 SparcISA = SPARC_ISA,
21 X86ISA = X86_ISA
22 };
23
24 #define THE_ISA X86_ISA
25 #define TheISA X86ISA
26 #define THE_ISA_STR "x86"
27
28 #endif // __CONFIG_THE_ISA_HH__
Here, we can easily see that TheISA is defined as X86ISA. When we look at the SConscript file, we can find a Python function named makeTheISA that actually generates the content of the config/the_isa.hh file. Because I compiled GEM5 with the X86 configuration, it defines TheISA as X86ISA.
Therefore, when TheISA is used in the CPU-related files, it is not an actual namespace called "TheISA" but the architecture-dependent ISA namespace. Consequently, when you encounter TheISA, first check whether the config/the_isa.hh header has been included in your target source file; if so, look at the architecture-dependent namespace defined in the gem5/src/arch/YOUR_ARCHITECTURE directory. In my case, because I use X86, it is the X86ISA namespace.
SetIntReg with TheISA
Now let's go back to the SimpleThread class. In addition to the architecture-specific register context, it provides the setIntReg function, which allows the processor to store data in the intRegs array at the given index.
477 void
478 setIntReg(RegIndex reg_idx, RegVal val) override
479 {
480 int flatIndex = isa->flattenIntIndex(reg_idx);
481 assert(flatIndex < TheISA::NumIntRegs);
482 DPRINTF(IntRegs, "Setting int reg %d (%d) to %#x.\n",
483 reg_idx, flatIndex, val);
484 setIntRegFlat(flatIndex, val);
485 }
618 void
619 setIntRegFlat(RegIndex idx, RegVal val) override
620 {
621 intRegs[idx] = val;
622 }
Note that val is stored in the intRegs array through the unified setIntReg interface. The intRegs array contains not only the architectural registers such as rsi, rdi, and rcx in x86, but also the integer micro-registers used only by the microops.
Because x86 in GEM5 defines 16 integer micro-registers available to the microops (see gem5/src/arch/x86/x86_traits.hh), up to 16 integer values can be passed to the microcode through the setIntReg function. As shown in the invoke function, micro-registers 1, 7, and 15 are used to pass the fault-related arguments to the microops.
Jump to the ROM code!
After setting the required parameters, the processor jumps to the ROM code pointed to by the label. This control flow transition is done by updating the _pcState member field of the SimpleThread object.
gem5/src/arch/x86/faults.cc
92 pcState.upc(romMicroPC(entry));
93 pcState.nupc(romMicroPC(entry) + 1);
94 tc->pcState(pcState);
95 }
Looking at the above code from X86FaultBase::invoke, we can see that it updates the upc field of the pcState to the location of the ROM code.
gem5/src/base/types.hh
144 typedef uint16_t MicroPC;
145
146 static const MicroPC MicroPCRomBit = 1 << (sizeof(MicroPC) * 8 - 1);
147
148 static inline MicroPC
149 romMicroPC(MicroPC upc)
150 {
151 return upc | MicroPCRomBit;
152 }
153
154 static inline MicroPC
155 normalMicroPC(MicroPC upc)
156 {
157 return upc & ~MicroPCRomBit;
158 }
159
160 static inline bool
161 isRomMicroPC(MicroPC upc)
162 {
163 return MicroPCRomBit & upc;
164 }
Note that the romMicroPC function sets a flag indicating that the upc points into the ROM. The flag is simply bitwise-ORed into the upc address (the top bit of the 16-bit MicroPC).
arch/generic/types.hh
193 // A PC and microcode PC.
194 template <class MachInst>
195 class UPCState : public SimplePCState<MachInst>
196 {
197 protected:
198 typedef SimplePCState<MachInst> Base;
199
200 MicroPC _upc;
201 MicroPC _nupc;
202
203 public:
204
205 MicroPC upc() const { return _upc; }
206 void upc(MicroPC val) { _upc = val; }
207
208 MicroPC nupc() const { return _nupc; }
209 void nupc(MicroPC val) { _nupc = val; }
After the ROM upc address is generated, the pcState variable must be updated to change the current upc. Although you could modify the current processor's pcState in place, the convention is to build a new pcState object and hand it to the thread context. The new pcState's upc setter updates its upc address, and then tc->pcState(pcState) updates the _pcState member field of the thread context, which makes the processor run from the updated micro PC address on the next fetch.
However, note that this function just updates the _pcState member field of the ThreadContext. Then who actually redirects the pipeline to fetch new instructions from the ROM instead of the faulting instruction? Let's go back to the advancePC function that called invoke.
Let’s go back to advancePC & advanceInst
gem5/src/cpu/simple/base.cc
673 fault->invoke(threadContexts[curThread], curStaticInst);
674 thread->decoder.reset();
After invoke is called as part of advancePC, the decoder is reset, which puts the decoder back into its ResetState.
gem5/src/cpu/simple/timing.cc
730 void
731 TimingSimpleCPU::advanceInst(const Fault &fault)
732 {
733 SimpleExecContext &t_info = *threadInfo[curThread];
734
735 if (_status == Faulting)
736 return;
737
738 if (fault != NoFault) {
739 DPRINTF(SimpleCPU, "Fault occured. Handling the fault\n");
740
741 advancePC(fault);
746
747 // A syscall fault could suspend this CPU (e.g., futex_wait)
748 // If the _status is not Idle, schedule an event to fetch the next
749 // instruction after 'stall' ticks.
750 // If the cpu has been suspended (i.e., _status == Idle), another
751 // cpu will wake this cpu up later.
752 if (_status != Idle) {
753 DPRINTF(SimpleCPU, "Scheduling fetch event after the Fault\n");
754
755 Tick stall = dynamic_pointer_cast<SyscallRetryFault>(fault) ?
756 clockEdge(syscallRetryLatency) : clockEdge();
757 reschedule(fetchEvent, stall, true);
758 _status = Faulting;
759 }
760
761 return;
762 }
After returning from advancePC, the advanceInst function checks the status of the current processor. When the processor is not idle, it reschedules fetchEvent to fire after stall ticks. Also note that the processor status is changed to Faulting.
fetchEvent invokes fetch() function
By the way, what is fetchEvent?
79 TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams *p)
80 : BaseSimpleCPU(p), fetchTranslation(this), icachePort(this),
81 dcachePort(this), ifetch_pkt(NULL), dcache_pkt(NULL), previousCycle(0),
82 fetchEvent([this]{ fetch(); }, name())
83 {
84 _status = Idle;
85 }
fetchEvent is an EventFunctionWrapper, the type used for registering events in GEM5. The TimingSimpleCPU constructor initializes it to invoke the fetch() function. Therefore, after the stall ticks have passed, the event fires and calls TimingSimpleCPU::fetch(), which resumes fetching from the updated PC state.
Now fetch from the updated RIP, the ROM code!
653 void
654 TimingSimpleCPU::fetch()
655 {
656 // Change thread if multi-threaded
657 swapActiveThread();
658
659 SimpleExecContext &t_info = *threadInfo[curThread];
660 SimpleThread* thread = t_info.thread;
661
662 DPRINTF(SimpleCPU, "Fetch\n");
663
664 if (!curStaticInst || !curStaticInst->isDelayedCommit()) {
665 checkForInterrupts();
666 checkPcEventQueue();
667 }
668
669 // We must have just got suspended by a PC event
670 if (_status == Idle)
671 return;
672
673 TheISA::PCState pcState = thread->pcState();
674 bool needToFetch = !isRomMicroPC(pcState.microPC()) &&
675 !curMacroStaticInst;
676
677 if (needToFetch) {
678 _status = BaseSimpleCPU::Running;
679 RequestPtr ifetch_req = std::make_shared<Request>();
680 ifetch_req->taskId(taskId());
681 ifetch_req->setContext(thread->contextId());
682 setupFetchRequest(ifetch_req);
683 DPRINTF(SimpleCPU, "Translating address %#x\n", ifetch_req->getVaddr());
684 thread->itb->translateTiming(ifetch_req, thread->getTC(),
685 &fetchTranslation, BaseTLB::Execute);
686 } else {
687 _status = IcacheWaitResponse;
688 completeIfetch(NULL);
689
690 updateCycleCounts();
691 updateCycleCounters(BaseCPU::CPU_STATE_ON);
692 }
693 }
Remember that curMacroStaticInst has been set to StaticInst::nullStaticInstPtr by advancePC, and that the upc has been updated to the ROM code address with the MicroPCRomBit flag set. Because isRomMicroPC now returns true, needToFetch evaluates to false, so the CPU skips the instruction cache access and takes the else branch: completeIfetch(NULL) executes the microops directly out of the microcode ROM.