O3 CPU IEW

IEW: Issue/Execute/Writeback

GEM5 performs both execute and writeback when an instruction's execute() function is called. Therefore, GEM5 combines the Issue, Execute, and Writeback stages into a single stage called IEW. This stage handles dispatching instructions to the instruction queue, telling the instruction queue to issue instructions, and executing and writing back instructions.

The GEM5 documentation provides a nice description of the IEW stage. It also lists the functions that are mainly responsible for those three operations:

Rename::tick()->Rename::RenameInsts()
IEW::tick()->IEW::dispatchInsts()
IEW::tick()->InstructionQueue::scheduleReadyInsts()
IEW::tick()->IEW::executeInsts()
IEW::tick()->IEW::writebackInsts()

In this posting, I will explain dispatch, schedule, execute, and writeback in detail; the commit stage will be studied in another posting. As in the other stages, the tick function is the main body of the IEW stage's execution, so I will explain each part of the stage following the tick implementation. The dispatch step tries to dispatch renamed instructions to the LSQ/IQ (note that the rename stage has already checked the availability of the LSQ and IQ), and instructions are issued every cycle. The execute latency is tied to the issue latency so that the IQ can do back-to-back scheduling without having to speculatively schedule instructions. The IEW separates memory instructions from non-memory instructions by issuing them to different queues (LSQ or IQ). The writeback portion of IEW completes instructions, wakes up any dependents, and marks the destination register as ready on the scoreboard. With that information, the IQ can tell which instructions can be woken up and issued.

Dispatch

1502 template<class Impl>
1503 void
1504 DefaultIEW<Impl>::tick()
1505 {
1506     wbNumInst = 0;
1507     wbCycle = 0;
1508 
1509     wroteToTimeBuffer = false;
1510     updatedQueues = false;
1511 
1512     ldstQueue.tick();
1513 
1514     sortInsts();
1515 
1516     // Free function units marked as being freed this cycle.
1517     fuPool->processFreeUnits();
1518 
1519     list<ThreadID>::iterator threads = activeThreads->begin();
1520     list<ThreadID>::iterator end = activeThreads->end();
1521 
1522     // Check stall and squash signals, dispatch any instructions.
1523     while (threads != end) {
1524         ThreadID tid = *threads++;
1525 
1526         DPRINTF(IEW,"Issue: Processing [tid:%i]\n",tid);
1527 
1528         checkSignalsAndUpdate(tid);
1529         dispatch(tid);
1530     }

As shown in the tick function, after checking signals such as block and squash, the first job done by the IEW is dispatching the renamed instructions. The main goal of dispatch is inserting each renamed instruction into the IQ and LSQ based on the instruction's type.

Dispatch implementation

 911 template<class Impl>
 912 void
 913 DefaultIEW<Impl>::dispatch(ThreadID tid)
 914 {
 915     // If status is Running or idle,
 916     //     call dispatchInsts()
 917     // If status is Unblocking,
 918     //     buffer any instructions coming from rename
 919     //     continue trying to empty skid buffer
 920     //     check if stall conditions have passed
 921 
 922     if (dispatchStatus[tid] == Blocked) {
 923         ++iewBlockCycles;
 924 
 925     } else if (dispatchStatus[tid] == Squashing) {
 926         ++iewSquashCycles;
 927     }
 928 
 929     // Dispatch should try to dispatch as many instructions as its bandwidth
 930     // will allow, as long as it is not currently blocked.
 931     if (dispatchStatus[tid] == Running ||
 932         dispatchStatus[tid] == Idle) {
 933         DPRINTF(IEW, "[tid:%i] Not blocked, so attempting to run "
 934                 "dispatch.\n", tid);
 935 
 936         dispatchInsts(tid);
 937     } else if (dispatchStatus[tid] == Unblocking) {
 938         // Make sure that the skid buffer has something in it if the
 939         // status is unblocking.
 940         assert(!skidsEmpty());
 941 
 942         // If the status was unblocking, then instructions from the skid
 943         // buffer were used.  Remove those instructions and handle
 944         // the rest of unblocking.
 945         dispatchInsts(tid);
 946 
 947         ++iewUnblockCycles;
 948 
 949         if (validInstsFromRename()) {
 950             // Add the current inputs to the skid buffer so they can be
 951             // reprocessed when this stage unblocks.
 952             skidInsert(tid);
 953         }
 954 
 955         unblock(tid);
 956     }
 957 }

The dispatch function is just a wrapper around dispatchInsts. Based on the current status of the dispatch stage, additional operations are executed around the main dispatch function, dispatchInsts. Because dispatchInsts is fairly complex, I will explain it piece by piece.
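Before diving in, it helps to keep the per-thread status machine in mind. The following is a condensed sketch; the states mirror the dispatchStatus values used in the code above, while the transition summary is my paraphrase:

```cpp
// Condensed sketch of the dispatch-side status machine; states mirror the
// dispatchStatus[tid] values used in DefaultIEW::dispatch() above.
enum class DispatchStatus { Running, Idle, Blocked, Unblocking };

// Per-cycle handling, in brief:
//   Running / Idle -> dispatchInsts() straight from rename's output
//   Blocked        -> count a blocked cycle; incoming instructions pile up
//                     in the skid buffer until the stall clears
//   Unblocking     -> dispatchInsts() from the skid buffer; new arrivals
//                     keep going to the skid buffer until it drains
```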

Checking availability of resources to dispatch instruction

959 template <class Impl>
 960 void
 961 DefaultIEW<Impl>::dispatchInsts(ThreadID tid)
 962 {
 963     // Obtain instructions from skid buffer if unblocking, or queue from rename
 964     // otherwise.
 965     std::queue<DynInstPtr> &insts_to_dispatch =
 966         dispatchStatus[tid] == Unblocking ?
 967         skidBuffer[tid] : insts[tid];
 968 
 969     int insts_to_add = insts_to_dispatch.size();
 970 
 971     DynInstPtr inst;
 972     bool add_to_iq = false;
 973     int dis_num_inst = 0;
 974 
 975     // Loop through the instructions, putting them in the instruction
 976     // queue.
 977     for ( ; dis_num_inst < insts_to_add &&
 978               dis_num_inst < dispatchWidth;
 979           ++dis_num_inst)
 980     {
 981         inst = insts_to_dispatch.front();
 982 
 983         if (dispatchStatus[tid] == Unblocking) {
 984             DPRINTF(IEW, "[tid:%i] Issue: Examining instruction from skid "
 985                     "buffer\n", tid);
 986         }
 987 
 988         // Make sure there's a valid instruction there.
 989         assert(inst);
 990 
 991         DPRINTF(IEW, "[tid:%i] Issue: Adding PC %s [sn:%lli] [tid:%i] to "
 992                 "IQ.\n",
 993                 tid, inst->pcState(), inst->seqNum, inst->threadNumber);
 994 
 995         // Be sure to mark these instructions as ready so that the
 996         // commit stage can go ahead and execute them, and mark
 997         // them as issued so the IQ doesn't reprocess them.
 998 
 999         // Check for squashed instructions.
1000         if (inst->isSquashed()) {
1001             DPRINTF(IEW, "[tid:%i] Issue: Squashed instruction encountered, "
1002                     "not adding to IQ.\n", tid);
1003 
1004             ++iewDispSquashedInsts;
1005 
1006             insts_to_dispatch.pop();
1007 
1008             //Tell Rename That An Instruction has been processed
1009             if (inst->isLoad()) {
1010                 toRename->iewInfo[tid].dispatchedToLQ++;
1011             }
1012             if (inst->isStore() || inst->isAtomic()) {
1013                 toRename->iewInfo[tid].dispatchedToSQ++;
1014             }
1015 
1016             toRename->iewInfo[tid].dispatched++;
1017    
1018             continue;
1019         }
1020  
1021         // Check for full conditions.
1022         if (instQueue.isFull(tid)) {
1023             DPRINTF(IEW, "[tid:%i] Issue: IQ has become full.\n", tid);
1024    
1025             // Call function to start blocking.
1026             block(tid);
1027    
1028             // Set unblock to false. Special case where we are using
1029             // skidbuffer (unblocking) instructions but then we still
1030             // get full in the IQ.
1031             toRename->iewUnblock[tid] = false;
1032    
1033             ++iewIQFullEvents;
1034             break;
1035         }
1036    
1037         // Check LSQ if inst is LD/ST
1038         if ((inst->isAtomic() && ldstQueue.sqFull(tid)) ||
1039             (inst->isLoad() && ldstQueue.lqFull(tid)) ||
1040             (inst->isStore() && ldstQueue.sqFull(tid))) {
1041             DPRINTF(IEW, "[tid:%i] Issue: %s has become full.\n",tid,
1042                     inst->isLoad() ? "LQ" : "SQ");
1043    
1044             // Call function to start blocking.
1045             block(tid);
1046    
1047             // Set unblock to false. Special case where we are using
1048             // skidbuffer (unblocking) instructions but then we still
1049             // get full in the IQ.
1050             toRename->iewUnblock[tid] = false;
1051 
1052             ++iewLSQFullEvents;
1053             break;
1054         }

First, it checks whether the current instruction has already been squashed. If so, it ignores the current instruction and jumps to the next one. If the instruction is not squashed, it checks the availability of the resources required for issuing it. Regardless of the instruction type, it requires one entry in the instruction queue. Also, if it is a memory instruction, it requires one entry in the load queue or store queue, depending on whether it is a load or a store.

Checking instruction type

1056         // Otherwise issue the instruction just fine.
1057         if (inst->isAtomic()) {
1058             DPRINTF(IEW, "[tid:%i] Issue: Memory instruction "
1059                     "encountered, adding to LSQ.\n", tid);
1060 
1061             ldstQueue.insertStore(inst);
1062 
1063             ++iewDispStoreInsts;
1064 
1065             // AMOs need to be set as "canCommit()"
1066             // so that commit can process them when they reach the
1067             // head of commit.
1068             inst->setCanCommit();
1069             instQueue.insertNonSpec(inst);
1070             add_to_iq = false;
1071 
1072             ++iewDispNonSpecInsts;
1073 
1074             toRename->iewInfo[tid].dispatchedToSQ++;
1075         } else if (inst->isLoad()) {
1076             DPRINTF(IEW, "[tid:%i] Issue: Memory instruction "
1077                     "encountered, adding to LSQ.\n", tid);
1078 
1079             // Reserve a spot in the load store queue for this
1080             // memory access.
1081             ldstQueue.insertLoad(inst);
1082 
1083             ++iewDispLoadInsts;
1084 
1085             add_to_iq = true;
1086 
1087             toRename->iewInfo[tid].dispatchedToLQ++;
1088         } else if (inst->isStore()) {
1089             DPRINTF(IEW, "[tid:%i] Issue: Memory instruction "
1090                     "encountered, adding to LSQ.\n", tid);
1091 
1092             ldstQueue.insertStore(inst);
1093 
1094             ++iewDispStoreInsts;
1095 
1096             if (inst->isStoreConditional()) {
1097                 // Store conditionals need to be set as "canCommit()"
1098                 // so that commit can process them when they reach the
1099                 // head of commit.
1100                 // @todo: This is somewhat specific to Alpha.
1101                 inst->setCanCommit();
1102                 instQueue.insertNonSpec(inst);
1103                 add_to_iq = false;
1104 
1105                 ++iewDispNonSpecInsts;
1106             } else {
1107                 add_to_iq = true;
1108             }
1109 
1110             toRename->iewInfo[tid].dispatchedToSQ++;
1111         } else if (inst->isMemBarrier() || inst->isWriteBarrier()) {
1112             // Same as non-speculative stores.
1113             inst->setCanCommit();
1114             instQueue.insertBarrier(inst);
1115             add_to_iq = false;
1116         } else if (inst->isNop()) {
1117             DPRINTF(IEW, "[tid:%i] Issue: Nop instruction encountered, "
1118                     "skipping.\n", tid);
1119 
1120             inst->setIssued();
1121             inst->setExecuted();
1122             inst->setCanCommit();
1123 
1124             instQueue.recordProducer(inst);
1125 
1126             iewExecutedNop[tid]++;
1127 
1128             add_to_iq = false;
1129         } else {
1130             assert(!inst->isExecuted());
1131             add_to_iq = true;
1132         }

Although the details will not be clear until we understand the internals of the instQueue and ldstQueue, the above code routes each instruction based on its type. For example, a load is pushed to the ldstQueue with the insertLoad function, and a store is inserted into the same queue through the insertStore function. Normal instructions are simply enqueued to the instQueue.

Issuing instruction

1134         if (add_to_iq && inst->isNonSpeculative()) {
1135             DPRINTF(IEW, "[tid:%i] Issue: Nonspeculative instruction "
1136                     "encountered, skipping.\n", tid);
1137 
1138             // Same as non-speculative stores.
1139             inst->setCanCommit();
1140 
1141             // Specifically insert it as nonspeculative.
1142             instQueue.insertNonSpec(inst);
1143 
1144             ++iewDispNonSpecInsts;
1145 
1146             add_to_iq = false;
1147         }
1148 
1149         // If the instruction queue is not full, then add the
1150         // instruction.
1151         if (add_to_iq) {
1152             instQueue.insert(inst);
1153         }
1154 
1155         insts_to_dispatch.pop();
1156 
1157         toRename->iewInfo[tid].dispatched++;
1158 
1159         ++iewDispatchedInsts;
1160 
1161 #if TRACING_ON
1162         inst->dispatchTick = curTick() - inst->fetchTick;
1163 #endif
1164         ppDispatch->notify(inst);
1165     }

After each instruction is handled by inserting it into the corresponding queue with the associated method, some instructions should also be inserted into the instruction queue. Note that the add_to_iq flag is set based on the instruction type; when this flag is set, the instruction is added to the instQueue (lines 1151-1153).

End of the dispatching

1167     if (!insts_to_dispatch.empty()) {
1168         DPRINTF(IEW,"[tid:%i] Issue: Bandwidth Full. Blocking.\n", tid);
1169         block(tid);
1170         toRename->iewUnblock[tid] = false;
1171     }
1172 
1173     if (dispatchStatus[tid] == Idle && dis_num_inst) {
1174         dispatchStatus[tid] = Running;
1175 
1176         updatedQueues = true;
1177     }
1178 
1179     dis_num_inst = 0;
1180 }

After dispatching the renamed instructions, it checks whether instructions still remain in the queue. When an instruction cannot be processed further because of throttling, the stage blocks and handles the rest of the instructions in the next cycle.

Instruction Queue and Load/Store Queue

Before moving on to the next stage, I'd like to cover some parts of the IQ and LSQ.

The instruction queue keeps several lists of instructions

The main job of the queue is managing instructions and providing interfaces to process the enqueued instructions.

gem5/src/cpu/o3/inst_queue.hh

311     //////////////////////////////////////
312     // Instruction lists, ready queues, and ordering
313     //////////////////////////////////////
314 
315     /** List of all the instructions in the IQ (some of which may be issued). */
316     std::list<DynInstPtr> instList[Impl::MaxThreads];
317 
318     /** List of instructions that are ready to be executed. */
319     std::list<DynInstPtr> instsToExecute;
320 
321     /** List of instructions waiting for their DTB translation to
322      *  complete (hw page table walk in progress).
323      */
324     std::list<DynInstPtr> deferredMemInsts;
325 
326     /** List of instructions that have been cache blocked. */
327     std::list<DynInstPtr> blockedMemInsts;
328 
329     /** List of instructions that were cache blocked, but a retry has been seen
330      * since, so they can now be retried. May fail again go on the blocked list.
331      */
332     std::list<DynInstPtr> retryMemInsts;

Insert new entries to the instruction queue

The insert function is a canonical example of those interfaces. It inserts a new entry into the instruction list managed by the instruction queue.

 578 template <class Impl>
 579 void
 580 InstructionQueue<Impl>::insert(const DynInstPtr &new_inst)
 581 {
 582     if (new_inst->isFloating()) {
 583         fpInstQueueWrites++;
 584     } else if (new_inst->isVector()) {
 585         vecInstQueueWrites++;
 586     } else {
 587         intInstQueueWrites++;
 588     }
 589     // Make sure the instruction is valid
 590     assert(new_inst);
 591 
 592     DPRINTF(IQ, "Adding instruction [sn:%llu] PC %s to the IQ.\n",
 593             new_inst->seqNum, new_inst->pcState());
 594 
 595     assert(freeEntries != 0);
 596 
 597     instList[new_inst->threadNumber].push_back(new_inst);
 598 
 599     --freeEntries;
 600 
 601     new_inst->setInIQ();
 602 
 603     // Look through its source registers (physical regs), and mark any
 604     // dependencies.
 605     addToDependents(new_inst);
 606 
 607     // Have this instruction set itself as the producer of its destination
 608     // register(s).
 609     addToProducers(new_inst);
 610 
 611     if (new_inst->isMemRef()) {
 612         memDepUnit[new_inst->threadNumber].insert(new_inst);
 613     } else {
 614         addIfReady(new_inst);
 615     }
 616 
 617     ++iqInstsAdded;
 618 
 619     count[new_inst->threadNumber]++;
 620 
 621     assert(freeEntries == (numEntries - countInsts()));
 622 }

Inserting the instruction into the list is done by a simple push_back operation. However, insert also invokes two important functions: addToProducers and addToDependents. These two functions record the producer and consumer dependencies among instructions' operands, i.e., registers. When one instruction waits until a specific register's value becomes ready (a consumer), that must be tracked by some hardware component. Also, when the data becomes ready as a result of one instruction's execution (a producer), it should be forwarded to the consumers waiting for the value. For that purpose, GEM5 utilizes the DependencyGraph. After recording dependencies for the unavailable registers, if the instruction references memory during its execution, it is specially handled by the memory dependency unit. The details will be explained together with the DependencyGraph later.
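Since the DependencyGraph details are deferred, here is a simplified mental model of what addToDependents and addToProducers maintain. This is only a sketch with hypothetical names (RegEntry, markProducer, and so on); gem5's actual structure keeps a linked list of dependents per physical register:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Inst;  // stand-in for gem5's DynInstPtr

// One slot per physical register.
struct RegEntry {
    Inst *producer = nullptr;       // instruction that will write this reg
    std::vector<Inst *> consumers;  // instructions waiting on this reg
    bool ready = true;              // value already in the register file?
};

struct DepGraphSketch {
    std::vector<RegEntry> regs;

    explicit DepGraphSketch(std::size_t num_phys_regs) : regs(num_phys_regs) {}

    // addToDependents(): for each source register that is not ready yet,
    // record the instruction as a consumer. Returns true if it must wait.
    bool addConsumer(Inst *inst, std::size_t src_reg) {
        if (regs[src_reg].ready)
            return false;
        regs[src_reg].consumers.push_back(inst);
        return true;
    }

    // addToProducers(): a destination register is not ready until the
    // producing instruction writes back.
    void markProducer(Inst *inst, std::size_t dst_reg) {
        regs[dst_reg].producer = inst;
        regs[dst_reg].ready = false;
    }

    // Writeback: mark the register ready and hand back the consumers that
    // can now be woken up (and possibly moved to the ready lists).
    std::vector<Inst *> complete(std::size_t dst_reg) {
        regs[dst_reg].ready = true;
        regs[dst_reg].producer = nullptr;
        return std::move(regs[dst_reg].consumers);
    }
};
```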

1450 template <class Impl>
1451 void
1452 InstructionQueue<Impl>::addIfReady(const DynInstPtr &inst)
1453 {
1454     // If the instruction now has all of its source registers
1455     // available, then add it to the list of ready instructions.
1456     if (inst->readyToIssue()) {
1457 
1458         //Add the instruction to the proper ready list.
1459         if (inst->isMemRef()) {
1460 
1461             DPRINTF(IQ, "Checking if memory instruction can issue.\n");
1462 
1463             // Message to the mem dependence unit that this instruction has
1464             // its registers ready.
1465             memDepUnit[inst->threadNumber].regsReady(inst);
1466 
1467             return;
1468         }
1469 
1470         OpClass op_class = inst->opClass();
1471 
1472         DPRINTF(IQ, "Instruction is ready to issue, putting it onto "
1473                 "the ready list, PC %s opclass:%i [sn:%llu].\n",
1474                 inst->pcState(), op_class, inst->seqNum);
1475 
1476         readyInsts[op_class].push(inst);
1477 
1478         // Will need to reorder the list if either a queue is not on the list,
1479         // or it has an older instruction than last time.
1480         if (!queueOnList[op_class]) {
1481             addToOrderList(op_class);
1482         } else if (readyInsts[op_class].top()->seqNum  <
1483                    (*readyIt[op_class]).oldestInst) {
1484             listOrder.erase(readyIt[op_class]);
1485             addToOrderList(op_class);
1486         }
1487     }
1488 }

At the end of the insert function, addIfReady adds the instruction to the readyInsts buffer if all of its source registers are available (line 1476). If the instruction is not ready, meaning some source registers are not yet available, it is not enqueued to the readyInsts buffer. Instructions waiting for a source register will be added to the readyInsts buffer when the instructions they depend on complete.
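Conceptually, readyInsts is one age-ordered queue per operation class. A minimal sketch of that idea (the types here are illustrative, not gem5's declarations):

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct InstSketch {
    std::uint64_t seqNum;  // program order: smaller == older
};

// Oldest instruction at the top, so each op class issues in age order.
struct OlderFirst {
    bool operator()(const InstSketch *a, const InstSketch *b) const {
        return a->seqNum > b->seqNum;
    }
};

using ReadyQueue =
    std::priority_queue<InstSketch *, std::vector<InstSketch *>, OlderFirst>;

// One ReadyQueue per OpClass (IntAlu, MemRead, FloatAdd, ...). On top of
// this, the code above keeps listOrder: the op classes sorted by the seqNum
// of their oldest entry, so scheduling can also walk classes in age order.
// That is what the queueOnList / addToOrderList bookkeeping at lines
// 1480-1486 maintains.
```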

Execute

To understand what happens after dispatching the instructions, let's go back to the tick function of the IEW stage.

1532     if (exeStatus != Squashing) {
1533         executeInsts();
1534 
1535         writebackInsts();
1536 
1537         // Have the instruction queue try to schedule any ready instructions.
1538         // (In actuality, this scheduling is for instructions that will
1539         // be executed next cycle.)
1540         instQueue.scheduleReadyInsts();
1541 
1542         // Also should advance its own time buffers if the stage ran.
1543         // Not the best place for it, but this works (hopefully).
1544         issueToExecQueue.advance();
1545     }

If the execute stage is not in the squashing state, it executes the instructions stored in the instQueue, particularly in the readyInsts queue. Here the execute() function of a compute instruction is invoked and the instruction is sent to commit. Note that execute() already writes the result to the destination registers; writebackInsts, called after executeInsts, marks those registers as ready on the scoreboard and wakes up any dependents, so instructions depending on the just-executed one can be added to the ready list for scheduling.

executeInsts

1205 template <class Impl>
1206 void
1207 DefaultIEW<Impl>::executeInsts()
1208 {
1209     wbNumInst = 0;
1210     wbCycle = 0;
1211 
1212     list<ThreadID>::iterator threads = activeThreads->begin();
1213     list<ThreadID>::iterator end = activeThreads->end();
1214 
1215     while (threads != end) {
1216         ThreadID tid = *threads++;
1217         fetchRedirect[tid] = false;
1218     }
1219 
1220     // Uncomment this if you want to see all available instructions.
1221     // @todo This doesn't actually work anymore, we should fix it.
1222     // printAvailableInsts();
1223 
1224     // Execute/writeback any instructions that are available.
1225     int insts_to_execute = fromIssue->size;
1226     int inst_num = 0;
1227     for (; inst_num < insts_to_execute;
1228           ++inst_num) {
1229 
1230         DPRINTF(IEW, "Execute: Executing instructions from IQ.\n");
1231 
1232         DynInstPtr inst = instQueue.getInstToExecute();
1233 
1234         DPRINTF(IEW, "Execute: Processing PC %s, [tid:%i] [sn:%llu].\n",
1235                 inst->pcState(), inst->threadNumber,inst->seqNum);
1236 
1237         // Notify potential listeners that this instruction has started
1238         // executing
1239         ppExecute->notify(inst);
1240 
1241         // Check if the instruction is squashed; if so then skip it
1242         if (inst->isSquashed()) {
1243             DPRINTF(IEW, "Execute: Instruction was squashed. PC: %s, [tid:%i]"
1244                          " [sn:%llu]\n", inst->pcState(), inst->threadNumber,
1245                          inst->seqNum);
1246 
1247             // Consider this instruction executed so that commit can go
1248             // ahead and retire the instruction.
1249             inst->setExecuted();
1250 
1251             // Not sure if I should set this here or just let commit try to
1252             // commit any squashed instructions.  I like the latter a bit more.
1253             inst->setCanCommit();
1254 
1255             ++iewExecSquashedInsts;
1256 
1257             continue;
1258         }

The executeInsts function executes as many instructions as it can afford, implemented as the loop at line 1227 and after. First it retrieves an instruction that can be executed by invoking the getInstToExecute function of the instQueue. Once an instruction is retrieved, it checks whether the instruction has been squashed. Although squashed instructions are not really executed, they are treated as executed so that they can be committed. After this check, the instruction is processed differently depending on its type.

Execute memory instruction

1259 
1260         Fault fault = NoFault;
1261 
1262         // Execute instruction.
1263         // Note that if the instruction faults, it will be handled
1264         // at the commit stage.
1265         if (inst->isMemRef()) {
1266             DPRINTF(IEW, "Execute: Calculating address for memory "
1267                     "reference.\n");
1268 
1269             // Tell the LDSTQ to execute this instruction (if it is a load).
1270             if (inst->isAtomic()) {
1271                 // AMOs are treated like store requests
1272                 fault = ldstQueue.executeStore(inst);
1273 
1274                 if (inst->isTranslationDelayed() &&
1275                     fault == NoFault) {
1276                     // A hw page table walk is currently going on; the
1277                     // instruction must be deferred.
1278                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1279                             "store.\n");
1280                     instQueue.deferMemInst(inst);
1281                     continue;
1282                 }
1283             } else if (inst->isLoad()) {
1284                 // Loads will mark themselves as executed, and their writeback
1285                 // event adds the instruction to the queue to commit
1286                 fault = ldstQueue.executeLoad(inst);
1287 
1288                 if (inst->isTranslationDelayed() &&
1289                     fault == NoFault) {
1290                     // A hw page table walk is currently going on; the
1291                     // instruction must be deferred.
1292                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1293                             "load.\n");
1294                     instQueue.deferMemInst(inst);
1295                     continue;
1296                 }
1297 
1298                 if (inst->isDataPrefetch() || inst->isInstPrefetch()) {
1299                     inst->fault = NoFault;
1300                 }
1301             } else if (inst->isStore()) {
1302                 fault = ldstQueue.executeStore(inst);
1303 
1304                 if (inst->isTranslationDelayed() &&
1305                     fault == NoFault) {
1306                     // A hw page table walk is currently going on; the
1307                     // instruction must be deferred.
1308                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1309                             "store.\n");
1310                     instQueue.deferMemInst(inst);
1311                     continue;
1312                 }
1313 
1314                 // If the store had a fault then it may not have a mem req
1315                 if (fault != NoFault || !inst->readPredicate() ||
1316                     !inst->isStoreConditional()) {
1317                     // If the instruction faulted, then we need to send it along
1318                     // to commit without the instruction completing.
1319                     // Send this instruction to commit, also make sure iew stage
1320                     // realizes there is activity.
1321                     inst->setExecuted();
1322                     instToCommit(inst);
1323                     activityThisCycle();
1324                 }
1325 
1326                 // Store conditionals will mark themselves as
1327                 // executed, and their writeback event will add the
1328                 // instruction to the queue to commit.
1329             } else {
1330                 panic("Unexpected memory type!\n");
1331             }
1332 
1333         } else {

A memory operation can be one of three instruction types: atomic, load, or store. The load/store queue is basically in charge of executing memory instructions, but based on the type of memory operation, it needs to handle each instruction differently. Let's take a look at how load and store instructions are processed.

Execute load instruction
1283             } else if (inst->isLoad()) {
1284                 // Loads will mark themselves as executed, and their writeback
1285                 // event adds the instruction to the queue to commit
1286                 fault = ldstQueue.executeLoad(inst);
1287
1288                 if (inst->isTranslationDelayed() &&
1289                     fault == NoFault) {
1290                     // A hw page table walk is currently going on; the
1291                     // instruction must be deferred.
1292                     DPRINTF(IEW, "Execute: Delayed translation, deferring "
1293                             "load.\n");
1294                     instQueue.deferMemInst(inst);
1295                     continue;
1296                 }
1297
1298                 if (inst->isDataPrefetch() || inst->isInstPrefetch()) {
1299                     inst->fault = NoFault;
1300                 }

The main execution of the load instruction is done by the executeLoad function of the ldstQueue. After the execution, it needs to check whether the translation is the bottleneck preventing progress on the load. Note that when the virtual-to-physical address resolution is delayed because of a long TLB latency, the load should be executed at a later clock cycle when the TLB is ready. Therefore, when the instruction cannot be executed at this moment, the current load instruction is marked as deferred (deferMemInst). Also, when the load operation is just a prefetch, any fault generated by it is ignored (lines 1298-1299). Let's take a look at our important function, executeLoad, in detail!

gem5/src/cpu/o3/lsq_impl.hh

 251 template<class Impl>
 252 Fault
 253 LSQ<Impl>::executeLoad(const DynInstPtr &inst)
 254 {
 255     ThreadID tid = inst->threadNumber;
 256
 257     return thread[tid].executeLoad(inst);
 258 }

gem5/src/cpu/o3/lsq.hh

  63 template <class Impl>
  64 class LSQ
  65 
  66 {
......
1104     /** Total Size of LQ Entries. */
1105     unsigned LQEntries;
1106     /** Total Size of SQ Entries. */
1107     unsigned SQEntries;
1108 
1109     /** Max LQ Size - Used to Enforce Sharing Policies. */
1110     unsigned maxLQEntries;
1111 
1112     /** Max SQ Size - Used to Enforce Sharing Policies. */
1113     unsigned maxSQEntries;
1114 
1115     /** Data port. */
1116     DcachePort dcachePort;
1117 
1118     /** The LSQ units for individual threads. */
1119     std::vector<LSQUnit> thread;
1120 
1121     /** Number of Threads. */
1122     ThreadID numThreads;
1123 };

gem5/src/cpu/o3/lsq_unit_impl.hh

 558 template <class Impl>
 559 Fault
 560 LSQUnit<Impl>::executeLoad(const DynInstPtr &inst)
 561 {  
 562     using namespace TheISA;
 563     // Execute a specific load.
 564     Fault load_fault = NoFault;
 565    
 566     DPRINTF(LSQUnit, "Executing load PC %s, [sn:%lli]\n",
 567             inst->pcState(), inst->seqNum);
 568    
 569     assert(!inst->isSquashed());
 570    
 571     load_fault = inst->initiateAcc();
 572 
 573     if (load_fault == NoFault && !inst->readMemAccPredicate()) {
 574         assert(inst->readPredicate());
 575         inst->setExecuted();
 576         inst->completeAcc(nullptr);
 577         iewStage->instToCommit(inst);
 578         iewStage->activityThisCycle();
 579         return NoFault;
 580     }
 581        
 582     if (inst->isTranslationDelayed() && load_fault == NoFault)
 583         return load_fault;
 584            
 585     if (load_fault != NoFault && inst->translationCompleted() &&
 586         inst->savedReq->isPartialFault() && !inst->savedReq->isComplete()) {
 587         assert(inst->savedReq->isSplit());
 588         // If we have a partial fault where the mem access is not complete yet
 589         // then the cache must have been blocked. This load will be re-executed
 590         // when the cache gets unblocked. We will handle the fault when the
 591         // mem access is complete.
 592         return NoFault;
 593     }  
 594        
 595     // If the instruction faulted or predicated false, then we need to send it
 596     // along to commit without the instruction completing.
 597     if (load_fault != NoFault || !inst->readPredicate()) {
 598         // Send this instruction to commit, also make sure iew stage
 599         // realizes there is activity.  Mark it as executed unless it
 600         // is a strictly ordered load that needs to hit the head of
 601         // commit.
 602         if (!inst->readPredicate())
 603             inst->forwardOldRegs();
 604         DPRINTF(LSQUnit, "Load [sn:%lli] not executed from %s\n",
 605                 inst->seqNum,
 606                 (load_fault != NoFault ? "fault" : "predication"));
 607         if (!(inst->hasRequest() && inst->strictlyOrdered()) ||
 608             inst->isAtCommit()) {
 609             inst->setExecuted();
 610         }
 611         iewStage->instToCommit(inst);
 612         iewStage->activityThisCycle();
 613     } else {
 614         if (inst->effAddrValid()) {
 615             auto it = inst->lqIt;
 616             ++it;
 617 
 618             if (checkLoads)
 619                 return checkViolations(it, inst);
 620         }
 621     }
 622 
 623     return load_fault;
 624 }

initiateAcc: handling TLB request

I already covered the initiateAcc of memory instructions before. However, compared to the simple processors, the O3 CPU processes initiateAcc in a different way.

147 template <class Impl>
148 Fault
149 BaseO3DynInst<Impl>::initiateAcc()
150 {    
151     // @todo: Pretty convoluted way to avoid squashing from happening
152     // when using the TC during an instruction's execution
153     // (specifically for instructions that have side-effects that use
154     // the TC).  Fix this.
155     bool no_squash_from_TC = this->thread->noSquashFromTC;
156     this->thread->noSquashFromTC = true;
157 
158     this->fault = this->staticInst->initiateAcc(this, this->traceData);
159 
160     this->thread->noSquashFromTC = no_squash_from_TC;
161 
162     return this->fault;
163 }    

Because the staticInst stored in the dynamic instruction is the class object of a specific micro-operation, it invokes the initiateAcc function of that micro load/store operation. For the memory read case, it invokes the initiateMemRead function on the architecture side, which ends up invoking the initiateMemRead function on the CPU side.
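To make the call chain concrete, the generated code for a load microop has roughly the following shape. This is only a sketch, not the actual generated code: computeEffectiveAddress() is a hypothetical stand-in for the inlined address computation, dataSize/memFlags stand for the microop's encoded access size and flags, and the gem5 types (Fault, ExecContext, and so on) are assumed:

```cpp
// Rough shape of a generated load microop's initiateAcc() (sketch only).
Fault
LdSketch::initiateAcc(ExecContext *xc, Trace::InstRecord *traceData) const
{
    // Compute the effective (virtual) address from the source registers.
    // computeEffectiveAddress() is a hypothetical stand-in.
    Addr EA = computeEffectiveAddress(xc);

    // Hand the access off to the CPU in timing mode; for the O3 CPU this
    // reaches BaseDynInst::initiateMemRead() -> cpu->pushRequest() -> LSQ.
    return initiateMemRead(xc, traceData, EA, dataSize, memFlags);
}
```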

 42 namespace X86ISA
 43 {
 44 
 45 /// Initiate a read from memory in timing mode.
 46 static Fault
 47 initiateMemRead(ExecContext *xc, Trace::InstRecord *traceData, Addr addr,
 48                 unsigned dataSize, Request::Flags flags)
 49 {
 50     return xc->initiateMemRead(addr, dataSize, flags);
 51 }
 962 template<class Impl>
 963 Fault
 964 BaseDynInst<Impl>::initiateMemRead(Addr addr, unsigned size,
 965                                    Request::Flags flags,
 966                                    const std::vector<bool>& byte_enable)
 967 {
 968     assert(byte_enable.empty() || byte_enable.size() == size);
 969     return cpu->pushRequest(
 970             dynamic_cast<typename DynInstPtr::PtrType>(this),
 971             /* ld */ true, nullptr, size, addr, flags, nullptr, nullptr,
 972             byte_enable);
 973 }

Because an instruction of the O3 CPU is an instance of BaseO3DynInst, which inherits from BaseDynInst, when the instruction implementation invokes initiateMemRead (called through the instruction's initiateAcc implementation), it invokes the corresponding method implemented in the BaseDynInst class.

pushRequest

713     /** CPU pushRequest function, forwards request to LSQ. */
714     Fault pushRequest(const DynInstPtr& inst, bool isLoad, uint8_t *data,
715                       unsigned int size, Addr addr, Request::Flags flags,
716                       uint64_t *res, AtomicOpFunctorPtr amo_op = nullptr,
717                       const std::vector<bool>& byte_enable =
718                           std::vector<bool>())
719 
720     {
721         return iew.ldstQueue.pushRequest(inst, isLoad, data, size, addr,
722                 flags, res, std::move(amo_op), byte_enable);
723     }

Instead of directly handling the load operation, initiateMemRead pushes the request to the load queue through the pushRequest function. This design may seem odd because the initiateAcc function was invoked by the LSQ in the first place, and the instruction forwards the request to the load/store queue once again; it could have been implemented as a simple function that handles the request directly without going through multiple units. In any case, initiateMemRead invokes the CPU-side pushRequest, which ends up invoking the pushRequest of the LSQ.

 693 template<class Impl>
 694 Fault
 695 LSQ<Impl>::pushRequest(const DynInstPtr& inst, bool isLoad, uint8_t *data,
 696                        unsigned int size, Addr addr, Request::Flags flags,
 697                        uint64_t *res, AtomicOpFunctorPtr amo_op,
 698                        const std::vector<bool>& byte_enable)
 699 {
 700     // This comming request can be either load, store or atomic.
 701     // Atomic request has a corresponding pointer to its atomic memory
 702     // operation
 703     bool isAtomic M5_VAR_USED = !isLoad && amo_op;
 704 
 705     ThreadID tid = cpu->contextToThread(inst->contextId());
 706     auto cacheLineSize = cpu->cacheLineSize();
 707     bool needs_burst = transferNeedsBurst(addr, size, cacheLineSize);
 708     LSQRequest* req = nullptr;
 709 
 710     // Atomic requests that access data across cache line boundary are
 711     // currently not allowed since the cache does not guarantee corresponding
 712     // atomic memory operations to be executed atomically across a cache line.
 713     // For ISAs such as x86 that supports cross-cache-line atomic instructions,
 714     // the cache needs to be modified to perform atomic update to both cache
 715     // lines. For now, such cross-line update is not supported.
 716     assert(!isAtomic || (isAtomic && !needs_burst));
 717 
 718     if (inst->translationStarted()) {
 719         req = inst->savedReq;
 720         assert(req);
 721     } else {
 722         if (needs_burst) {
 723             req = new SplitDataRequest(&thread[tid], inst, isLoad, addr,
 724                     size, flags, data, res);
 725         } else {
 726             req = new SingleDataRequest(&thread[tid], inst, isLoad, addr,
 727                     size, flags, data, res, std::move(amo_op));
 728         }
 729         assert(req);
 730         if (!byte_enable.empty()) {
 731             req->_byteEnable = byte_enable;
 732         }
 733         inst->setRequest();
 734         req->taskId(cpu->taskId());
 735 
 736         // There might be fault from a previous execution attempt if this is
 737         // a strictly ordered load
 738         inst->getFault() = NoFault;
 739 
 740         req->initiateTranslation();
 741     }
 742 
 743     /* This is the place were instructions get the effAddr. */
 744     if (req->isTranslationComplete()) {
 745         if (req->isMemAccessRequired()) {
 746             inst->effAddr = req->getVaddr();
 747             inst->effSize = size;
 748             inst->effAddrValid(true);
 749 
 750             if (cpu->checker) {
 751                 inst->reqToVerify = std::make_shared<Request>(*req->request());
 752             }
 753             Fault fault;
 754             if (isLoad)
 755                 fault = cpu->read(req, inst->lqIdx);
 756             else
 757                 fault = cpu->write(req, data, inst->sqIdx);
 758             // inst->getFault() may have the first-fault of a
 759             // multi-access split request at this point.
 760             // Overwrite that only if we got another type of fault
 761             // (e.g. re-exec).
 762             if (fault != NoFault)
 763                 inst->getFault() = fault;
 764         } else if (isLoad) {
 765             inst->setMemAccPredicate(false);
 766             // Commit will have to clean up whatever happened.  Set this
 767             // instruction as executed.
 768             inst->setExecuted();
 769         }
 770     }
 771 
 772     if (inst->traceData)
 773         inst->traceData->setMem(addr, size, flags);
 774 
 775     return inst->getFault();
 776 }

The dynamic instruction can track whether it has already started its TLB translation by checking a flag stored in the instruction, accessible through the translationStarted interface. When the instruction sets that flag, it means the instruction has already started the TLB access and is waiting for the response. In the delayed-TLB-response case, the instruction stores the request information in its instruction object, so it can retrieve the request previously sent to the TLB. However, if this is the first execution attempt, a new request must be generated. As shown in lines 722-728, if the request accesses two separate cache blocks, a SplitDataRequest object is generated; if it accesses only one block, a SingleDataRequest object is generated instead. After the request has been produced, the proper flag of the instruction object is set to indicate that the instruction initiated the TLB access (line 733). After that, the initiateTranslation function provided by the request object is invoked to actually generate accesses to the TLBs.
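As an aside, the cross-cache-line check that selects between the two request types boils down to a simple offset computation. Here is a self-contained sketch that is equivalent in spirit to the transferNeedsBurst helper used above:

```cpp
#include <cstdint>

using Addr = std::uint64_t;

// Does an access of `size` bytes starting at `addr` touch two cache lines?
// Equivalent in spirit to the transferNeedsBurst() check used above.
static inline bool
needsBurstSketch(Addr addr, unsigned size, unsigned cache_line_size)
{
    // Offset of the first byte within its line plus the access size: if it
    // runs past the end of the line, the access spans two lines.
    return (addr % cache_line_size) + size > cache_line_size;
}

// Example: with 64-byte lines, addr = 0x3c, size = 8 gives 60 + 8 = 68 > 64,
// so the LSQ would build a SplitDataRequest for it.
```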

 860 template<class Impl>
 861 void
 862 LSQ<Impl>::SingleDataRequest::initiateTranslation()
 863 {
 864     assert(_requests.size() == 0);
 865 
 866     this->addRequest(_addr, _size, _byteEnable);
 867 
 868     if (_requests.size() > 0) {
 869         _requests.back()->setReqInstSeqNum(_inst->seqNum);
 870         _requests.back()->taskId(_taskId);
 871         _inst->translationStarted(true);
 872         setState(State::Translation);
 873         flags.set(Flag::TranslationStarted);
 874 
 875         _inst->savedReq = this;
 876         sendFragmentToTranslation(0);
 877     } else {
 878         _inst->setMemAccPredicate(false);
 879     }
 880 }

The addRequest function just generates the packet(s) that need to be sent to the TLB unit. Although the current object can be interpreted as a request that could be sent directly to the TLB unit, it is really a wrapper for all the interfaces and data structures required to resolve the TLB access. For example, it includes the port connected to the TLB unit so that the generated request and its response can be communicated through that port. In any case, addRequest just generates the real packet understandable by the TLB unit.

 407         void
 408         addRequest(Addr addr, unsigned size,
 409                    const std::vector<bool>& byte_enable)
 410         {
 411             if (byte_enable.empty() ||
 412                 isAnyActiveElement(byte_enable.begin(), byte_enable.end())) {
 413                 auto request = std::make_shared<Request>(_inst->getASID(),
 414                         addr, size, _flags, _inst->masterId(),
 415                         _inst->instAddr(), _inst->contextId(),
 416                         std::move(_amo_op));
 417                 if (!byte_enable.empty()) {
 418                     request->setByteEnable(byte_enable);
 419                 }
 420             _requests.push_back(request);
 421             }
 422         }

The addRequest function of the LSQRequest class just generates the request and saves it in the _requests vector to be sent later. After the request packets are generated, initiateTranslation invokes sendFragmentToTranslation to send the generated packet(s) to the TLB.

 980 template<class Impl>
 981 void
 982 LSQ<Impl>::LSQRequest::sendFragmentToTranslation(int i)
 983 {
 984     numInTranslationFragments++;
 985     _port.dTLB()->translateTiming(
 986             this->request(i),
 987             this->_inst->thread->getTC(), this,
 988             this->isLoad() ? BaseTLB::Read : BaseTLB::Write);
 989 }

Remember that the SingleDataRequest has only one request packet, so it has only one entry in the _requests vector. This function sends the request stored in the _requests vector to the TLB; the argument is used to index the entries of the _requests vector. You can see that it invokes the translateTiming of the dTLB connected to the LSQ. The details of the TLB's translateTiming function were explained in a previous posting. Also, note that it passes this as the translation object parameter, because the translation object is used to invoke the finish function when the TLB access is resolved.
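Passing this works because the LSQ request class implements the TLB's translation callback interface. Paraphrased, the interface has roughly the following shape (see BaseTLB::Translation in gem5's generic TLB code; the gem5 types Fault, RequestPtr, ThreadContext, and BaseTLB::Mode are assumed):

```cpp
// Paraphrased shape of the translation callback interface. LSQRequest
// implements it, which is why `this` can be handed to translateTiming()
// and later receive the finish() call.
class TranslationSketch
{
  public:
    virtual ~TranslationSketch() {}

    // Signal that the translation could not complete immediately
    // (e.g., a hardware page table walk is in progress).
    virtual void markDelayed() = 0;

    // Called by the TLB once the translation is resolved; `req` now
    // carries the physical address, and `fault` reports any problem.
    virtual void finish(const Fault &fault, const RequestPtr &req,
                        ThreadContext *tc, BaseTLB::Mode mode) = 0;
};
```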

Response of LSQ for the TLB resolution

 778 template<class Impl>
 779 void
 780 LSQ<Impl>::SingleDataRequest::finish(const Fault &fault, const RequestPtr &req,
 781         ThreadContext* tc, BaseTLB::Mode mode)
 782 {
 783     _fault.push_back(fault);
 784     numInTranslationFragments = 0;
 785     numTranslatedFragments = 1;
 786     /* If the instruction has been squahsed, let the request know
 787      * as it may have to self-destruct. */
 788     if (_inst->isSquashed()) {
 789         this->squashTranslation();
 790     } else {
 791         _inst->strictlyOrdered(req->isStrictlyOrdered());
 792 
 793         flags.set(Flag::TranslationFinished);
 794         if (fault == NoFault) {
 795             _inst->physEffAddr = req->getPaddr();
 796             _inst->memReqFlags = req->getFlags();
 797             if (req->isCondSwap()) {
 798                 assert(_res);
 799                 req->setExtraData(*_res);
 800             }
 801             setState(State::Request);
 802         } else {
 803             setState(State::Fault);
 804         }
 805 
 806         LSQRequest::_inst->fault = fault;
 807         LSQRequest::_inst->translationCompleted(true);
 808     }
 809 }

When the translation is completed, the finish function provided by the request generated by the LSQ is invoked at the end of the translation. As shown in the above code, it first checks whether the instruction was squashed while the TLB processed the request. If it has not been squashed, it sets the flags indicating that the translation is complete for this instruction. Note that it sets various fields of the instruction that initiated the TLB request (_inst in lines 790-808). One of the most important fields changed by the finish function is the _state field of the request. This field indicates the current status of the TLB request and is changed with the setState function. Recall how the simple CPU starts the memory access after the TLB is resolved: it initiates the memory operation at the end of the finish function. However, the O3 CPU does not invoke any function to generate the actual memory request when the TLB is resolved. Then when and where does the O3 CPU, particularly the LSQ, initiate the memory operation? The answer is in pushRequest!

 693 template<class Impl>
 694 Fault
 695 LSQ<Impl>::pushRequest(const DynInstPtr& inst, bool isLoad, uint8_t *data,
 696                        unsigned int size, Addr addr, Request::Flags flags,
 697                        uint64_t *res, AtomicOpFunctorPtr amo_op,
 698                        const std::vector<bool>& byte_enable)
 699 {
 ......
 743     /* This is the place were instructions get the effAddr. */
 744     if (req->isTranslationComplete()) {
 745         if (req->isMemAccessRequired()) {
 746             inst->effAddr = req->getVaddr();
 747             inst->effSize = size;
 748             inst->effAddrValid(true);
 749
 750             if (cpu->checker) {
 751                 inst->reqToVerify = std::make_shared<Request>(*req->request());
 752             }
 753             Fault fault;
 754             if (isLoad)
 755                 fault = cpu->read(req, inst->lqIdx);
 756             else
 757                 fault = cpu->write(req, data, inst->sqIdx);
 758             // inst->getFault() may have the first-fault of a
 759             // multi-access split request at this point.
 760             // Overwrite that only if we got another type of fault
 761             // (e.g. re-exec).
 762             if (fault != NoFault)
 763                 inst->getFault() = fault;
 764         } else if (isLoad) {
 765             inst->setMemAccPredicate(false);
 766             // Commit will have to clean up whatever happened.  Set this
 767             // instruction as executed.
 768             inst->setExecuted();
 769         }
 770     }
 771
 772     if (inst->traceData)
 773         inst->traceData->setMem(addr, size, flags);
 774
 775     return inst->getFault();
 776 }

It first checks whether the TLB translation has finished by invoking isTranslationComplete.

 586         bool
 587         isInTranslation()
 588         {
 589             return _state == State::Translation;
 590         }
 591 
 592         bool
 593         isTranslationComplete()
 594         {
 595             return flags.isSet(Flag::TranslationStarted) &&
 596                    !isInTranslation();
 597         }

You might remember that the _state field was changed when the finish function of the TLB request was invoked. Therefore, if the TLB request has already been resolved, the isTranslationComplete function returns true, and then the actual memory read or write operation is made based on the instruction type. Because the request req now carries the physical address translated from the virtual address, it is passed to the operation: the memory operation must target the physical address, not the virtual address. Since we are currently dealing with the read operation, let's take a look at how the O3 CPU accesses the real memory.
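Putting the pieces together, the lifecycle behind these two checks can be condensed into a small state machine. The state names follow the LSQ request code shown above, but the helper itself is a runnable paraphrase, not the real implementation:

```cpp
// Condensed, runnable paraphrase of the request lifecycle behind
// isTranslationComplete(); state names follow the LSQ request code above.
enum class State { NotIssued, Translation, Request, Fault };

struct LSQRequestSketch
{
    State state = State::NotIssued;
    bool translationStarted = false;

    // pushRequest() -> initiateTranslation(): enter the Translation state.
    void initiateTranslation()
    {
        translationStarted = true;
        state = State::Translation;
    }

    // TLB callback: finish() moves the request to Request (paddr valid)
    // or Fault, which is what makes isTranslationComplete() flip to true.
    void finish(bool faulted)
    {
        state = faulted ? State::Fault : State::Request;
    }

    bool isTranslationComplete() const
    {
        return translationStarted && state != State::Translation;
    }
};
```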

CPU->read->LSQ::read->LSQUnit::read

725     /** CPU read function, forwards read to LSQ. */
726     Fault read(LSQRequest* req, int load_idx)
727     {
728         return this->iew.ldstQueue.read(req, load_idx);
729     }
1125 template <class Impl>
1126 Fault
1127 LSQ<Impl>::read(LSQRequest* req, int load_idx)
1128 {
1129     ThreadID tid = cpu->contextToThread(req->request()->contextId());
1130 
1131     return thread.at(tid).read(req, load_idx);
1132 }

The read function handles four different kinds of memory loads: LLSC (locked load/store), memory-mapped IPR (memory-mapped registers), store forwarding, and the plain memory load. I will cover the plain memory load, which tries to access the data from the cache and memory; the store-forwarding case will be handled in another posting.

621 LSQUnit<Impl>::read(LSQRequest *req, int load_idx)
622 {
623     LQEntry& load_req = loadQueue[load_idx];
624     const DynInstPtr& load_inst = load_req.instruction();
625 
626     load_req.setRequest(req);
627     assert(load_inst);
628 
629     assert(!load_inst->isExecuted());
630 
631     // Make sure this isn't a strictly ordered load
632     // A bit of a hackish way to get strictly ordered accesses to work
633     // only if they're at the head of the LSQ and are ready to commit
634     // (at the head of the ROB too).
635 
636     if (req->mainRequest()->isStrictlyOrdered() &&
637         (load_idx != loadQueue.head() || !load_inst->isAtCommit())) {
638         // Tell IQ/mem dep unit that this instruction will need to be
639         // rescheduled eventually
640         iewStage->rescheduleMemInst(load_inst);
641         load_inst->clearIssued();
642         load_inst->effAddrValid(false);
643         ++lsqRescheduledLoads;
644         DPRINTF(LSQUnit, "Strictly ordered load [sn:%lli] PC %s\n",
645                 load_inst->seqNum, load_inst->pcState());
646 
647         // Must delete request now that it wasn't handed off to
648         // memory.  This is quite ugly.  @todo: Figure out the proper
649         // place to really handle request deletes.
650         load_req.setRequest(nullptr);
651         req->discard();
652         return std::make_shared<GenericISA::M5PanicFault>(
653             "Strictly ordered load [sn:%llx] PC %s\n",
654             load_inst->seqNum, load_inst->pcState());
655     }
656 
657     DPRINTF(LSQUnit, "Read called, load idx: %i, store idx: %i, "
658             "storeHead: %i addr: %#x%s\n",
659             load_idx - 1, load_inst->sqIt._idx, storeQueue.head() - 1,
660             req->mainRequest()->getPaddr(), req->isSplit() ? " split" : "");
661 
662     if (req->mainRequest()->isLLSC()) {
663         // Disable recording the result temporarily.  Writing to misc
664         // regs normally updates the result, but this is not the
665         // desired behavior when handling store conditionals.
666         load_inst->recordResult(false);
667         TheISA::handleLockedRead(load_inst.get(), req->mainRequest());
668         load_inst->recordResult(true);
669     }
670 
671     if (req->mainRequest()->isMmappedIpr()) {
672         assert(!load_inst->memData);
673         load_inst->memData = new uint8_t[MaxDataBytes];
674 
675         ThreadContext *thread = cpu->tcBase(lsqID);
676         PacketPtr main_pkt = new Packet(req->mainRequest(), MemCmd::ReadReq);
677 
678         main_pkt->dataStatic(load_inst->memData);
679 
680         Cycles delay = req->handleIprRead(thread, main_pkt);
681 
682         WritebackEvent *wb = new WritebackEvent(load_inst, main_pkt, this);
683         cpu->schedule(wb, cpu->clockEdge(delay));
684         return NoFault;
685     }
686 
687     // Check the SQ for any previous stores that might lead to forwarding
......
840     // If there's no forwarding case, then go access memory
841     DPRINTF(LSQUnit, "Doing memory access for inst [sn:%lli] PC %s\n",
842             load_inst->seqNum, load_inst->pcState());
843 
844     // Allocate memory if this is the first time a load is issued.
845     if (!load_inst->memData) {
846         load_inst->memData = new uint8_t[req->mainRequest()->getSize()];
847     }
848 
849     // For now, load throughput is constrained by the number of
850     // load FUs only, and loads do not consume a cache port (only
851     // stores do).
852     // @todo We should account for cache port contention
853     // and arbitrate between loads and stores.
854 
855     // if we the cache is not blocked, do cache access
856     if (req->senderState() == nullptr) {
857         LQSenderState *state = new LQSenderState(
858                 loadQueue.getIterator(load_idx));
859         state->isLoad = true;
860         state->inst = load_inst;
861         state->isSplit = req->isSplit();
862         req->senderState(state);
863     }
864     req->buildPackets();
865     req->sendPacketToCache();
866     if (!req->isSent())
867         iewStage->blockMemInst(load_inst);
868 
869     return NoFault;
870 }

Execute store instruction

Execute non-memory instruction

1333         } else {
1334             // If the instruction has already faulted, then skip executing it.
1335             // Such case can happen when it faulted during ITLB translation.
1336             // If we execute the instruction (even if it's a nop) the fault
1337             // will be replaced and we will lose it.
1338             if (inst->getFault() == NoFault) {
1339                 inst->execute();
1340                 if (!inst->readPredicate())
1341                     inst->forwardOldRegs();
1342             }
1343 
1344             inst->setExecuted();
1345 
1346             instToCommit(inst);
1347         }
1348 
1349         updateExeInstStats(inst);

1351         // Check if branch prediction was correct, if not then we need
1352         // to tell commit to squash in flight instructions.  Only
1353         // handle this if there hasn't already been something that
1354         // redirects fetch in this group of instructions.
1355 
1356         // This probably needs to prioritize the redirects if a different
1357         // scheduler is used.  Currently the scheduler schedules the oldest
1358         // instruction first, so the branch resolution order will be correct.
1359         ThreadID tid = inst->threadNumber;
1360 
1361         if (!fetchRedirect[tid] ||
1362             !toCommit->squash[tid] ||
1363             toCommit->squashedSeqNum[tid] > inst->seqNum) {
1364 
1365             // Prevent testing for misprediction on load instructions,
1366             // that have not been executed.
1367             bool loadNotExecuted = !inst->isExecuted() && inst->isLoad();
1368 
1369             if (inst->mispredicted() && !loadNotExecuted) {
1370                 fetchRedirect[tid] = true;
1371 
1372                 DPRINTF(IEW, "[tid:%i] [sn:%llu] Execute: "
1373                         "Branch mispredict detected.\n",
1374                         tid,inst->seqNum);
1375                 DPRINTF(IEW, "[tid:%i] [sn:%llu] "
1376                         "Predicted target was PC: %s\n",
1377                         tid,inst->seqNum,inst->readPredTarg());
1378                 DPRINTF(IEW, "[tid:%i] [sn:%llu] Execute: "
1379                         "Redirecting fetch to PC: %s\n",
1380                         tid,inst->seqNum,inst->pcState());
1381                 // If incorrect, then signal the ROB that it must be squashed.
1382                 squashDueToBranch(inst, tid);
1383 
1384                 ppMispredict->notify(inst);
1385 
1386                 if (inst->readPredTaken()) {
1387                     predictedTakenIncorrect++;
1388                 } else {
1389                     predictedNotTakenIncorrect++;
1390                 }
1391             } else if (ldstQueue.violation(tid)) {
1392                 assert(inst->isMemRef());
1393                 // If there was an ordering violation, then get the
1394                 // DynInst that caused the violation.  Note that this
1395                 // clears the violation signal.
1396                 DynInstPtr violator;
1397                 violator = ldstQueue.getMemDepViolator(tid);
1398 
1399                 DPRINTF(IEW, "LDSTQ detected a violation. Violator PC: %s "
1400                         "[sn:%lli], inst PC: %s [sn:%lli]. Addr is: %#x.\n",
1401                         violator->pcState(), violator->seqNum,
1402                         inst->pcState(), inst->seqNum, inst->physEffAddr);
1403 
1404                 fetchRedirect[tid] = true;
1405 
1406                 // Tell the instruction queue that a violation has occured.
1407                 instQueue.violation(inst, violator);
1408 
1409                 // Squash.
1410                 squashDueToMemOrder(violator, tid);
1411 
1412                 ++memOrderViolationEvents;
1413             }
1414         } else {
1415             // Reset any state associated with redirects that will not
1416             // be used.
1417             if (ldstQueue.violation(tid)) {
1418                 assert(inst->isMemRef());
1419 
1420                 DynInstPtr violator = ldstQueue.getMemDepViolator(tid);
1421 
1422                 DPRINTF(IEW, "LDSTQ detected a violation. Violator PC: "
1423                         "%s, inst PC: %s. Addr is: %#x.\n",
1424                         violator->pcState(), inst->pcState(),
1425                         inst->physEffAddr);
1426                 DPRINTF(IEW, "Violation will not be handled because "
1427                         "already squashing\n");
1428 
1429                 ++memOrderViolationEvents;
1430             }
1431         }
1432     }
1433 
1434     // Update and record activity if we processed any instructions.
1435     if (inst_num) {
1436         if (exeStatus == Idle) {
1437             exeStatus = Running;
1438         }
1439 
1440         updatedQueues = true;
1441 
1442         cpu->activityThisCycle();
1443     }
1444 
1445     // Need to reset this in case a writeback event needs to write into the
1446     // iew queue. That way the writeback event will write into the correct
1447     // spot in the queue.
1448     wbNumInst = 0;
1449 
1450 }

Schedule

Schedule (InstructionQueue::scheduleReadyInsts()): the IQ manages the ready instructions (those whose operands are ready) in a ready list and schedules them to an available FU. The latency of the FU is set here, and the instruction is sent to execution when the FU finishes.
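To make that concrete, here is a small self-contained sketch of the scheduling idea: pop ready instructions oldest-first, claim a matching FU from a pool, and record the operation latency after which the result becomes available. All names and numbers are illustrative; the real logic lives in InstructionQueue::scheduleReadyInsts() and FUPool:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <queue>
#include <utility>
#include <vector>

enum class OpClass { IntAlu, MemRead, FloatAdd };

struct Inst {
    std::uint64_t seqNum;  // program order: smaller == older
    OpClass opClass;
};

// Toy FU pool: a few units and a fixed op latency per class.
struct FUPoolSketch {
    std::map<OpClass, int> freeUnits{{OpClass::IntAlu, 2},
                                     {OpClass::MemRead, 1},
                                     {OpClass::FloatAdd, 1}};
    std::map<OpClass, int> opLatency{{OpClass::IntAlu, 1},
                                     {OpClass::MemRead, 2},
                                     {OpClass::FloatAdd, 4}};

    bool claim(OpClass c) {
        if (freeUnits[c] == 0)
            return false;
        --freeUnits[c];
        return true;
    }
};

// Oldest (smallest seqNum) first.
struct OlderFirst {
    bool operator()(const Inst &a, const Inst &b) const {
        return a.seqNum > b.seqNum;
    }
};

int main()
{
    std::priority_queue<Inst, std::vector<Inst>, OlderFirst> ready;
    ready.push({12, OpClass::MemRead});
    ready.push({10, OpClass::IntAlu});
    ready.push({11, OpClass::IntAlu});

    FUPoolSketch fuPool;
    std::vector<std::pair<Inst, int>> issued;  // instruction + FU latency
    std::vector<Inst> stalled;                 // retried next cycle

    const std::size_t issueWidth = 4;
    while (issued.size() < issueWidth && !ready.empty()) {
        Inst inst = ready.top();
        ready.pop();
        if (!fuPool.claim(inst.opClass)) {
            stalled.push_back(inst);           // no free FU this cycle
            continue;
        }
        // The instruction completes (and wakes dependents) after
        // opLatency cycles on its functional unit.
        issued.emplace_back(inst, fuPool.opLatency[inst.opClass]);
    }
    return 0;
}
```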

This post is licensed under CC BY 4.0 by the author.
