# Performance and Simple Cache

TODO: Add an introduction to Zicntr and Zihpm, and ask students to check whether ecall and ebreak are counted by the instret CSR.

> #### Danger:: Update ysyxSoC
>
> We made a major update to the `ysyxSoC` project at 2024/04/21 18:00:00.
> We added the framework code related to memory latency, and refactored some of the code to make subsequent configuration easier.
> At the same time, because the number of possible configuration combinations is large, it is hard for the `ysyxSoC` project to provide pre-generated `ysyxSoCFull.v` files for every configuration,
> so you need to set up a Chisel build environment to generate this file; see below for details.
> However, if you choose to develop in Verilog, you still do not need to write any Chisel code.

> Because of the large-scale refactoring, the project has to be updated manually here.
> If you obtained the code of `ysyxSoC` before the above time, please do the following:

> ```bash
> cd ysyx-workbench
> mv ysyxSoC ysyxSoC.old
> git clone git@github.com:OSCPU/ysyxSoC.git
> # install mill, see https://mill-build.com/mill/Intro_to_Mill.html
> mill --version # check whether mill is installed
> cd ysyxSoC
> make dev-init # pull the rocket-chip project and initialize it
> make verilog # generate build/ysyxSoCFull.v
> cp ... ... # copy the files you have already written in ysyxSoC.old to the appropriate directories in ysyxSoC
> # For the Chisel code, we put it in the src/ directory and refactored the directory structure a bit,
> # but you should be able to find the new location of each file easily
> make verilog
> cd am-kernels/benchmarks/microbench
> make ARCH=riscv32e-ysyxsoc mainargs=test run # simulate to verify that the program still works correctly
> cd ysyx-workbench
> rm -rf ysyxSoC.old # make sure you have copied back all the code you wrote
> ```

After connecting to ysyxSoC, the NPC you designed can interact correctly with various devices, and from a functional point of view it is ready for tape-out. Following the "make it work first, then make it better" principle of system design, we can now discuss how to carry out performance optimization.

We talk about performance optimization, but that is only the end goal. In a complex system we face many choices, such as:
which parts are worth optimizing? Which optimization schemes should we adopt? What benefit do we expect? What are the costs of these schemes?
If we spend a lot of effort only to find that performance has improved by one part in ten thousand, that is certainly not what we expect. Therefore, rather than blindly writing code, we need a set of scientific methods to guide us in answering these questions:

1. Evaluate the current performance
2. Locate the performance bottleneck
3. Apply an appropriate optimization method
4. Evaluate the performance after optimization, and check whether the improvement meets expectations

## Performance evaluation

To talk about optimization, you first need to know how well the current system performs. Therefore, we need a quantitative measure of performance, rather than judging by a vague feeling of "it runs well". Evaluating the current system with such a metric is the first step of performance optimization.

Intuitively, "high performance" basically means "runs fast".
Therefore, a direct measure of performance is the execution time of a program, and evaluating the performance of a system means evaluating the execution time of programs on that system.

### Benchmark program selection

Which programs should be evaluated? There are far too many programs to evaluate them all, so we need to choose some representative ones. "Representative" means that the performance gains an optimization technique brings to these programs are consistent with the trend of gains it brings in real application scenarios.

The "application scenarios" mentioned here suggest that this trend is likely to differ across scenarios. Different application scenarios therefore require different representative programs, which leads to different benchmarks. For example, Linpack is used to represent supercomputing scenarios, MLPerf represents machine learning training scenarios, CloudSuite represents cloud computing scenarios, and Embench represents embedded scenarios. For general-purpose computing, the most famous benchmark is SPEC CPU, which is used to evaluate the general-purpose computing capability of a CPU.

SPEC (Standard Performance Evaluation Corporation) is an organization whose goal is to establish and maintain standardized benchmarks for evaluating computer systems; [it has defined and published benchmarks for a variety of scenarios](https://www.spec.org/benchmarks.html).
Besides SPEC CPU, there are benchmarks for graphics, workstations, high-performance computing, storage, power consumption, virtualization, and other scenarios.

A benchmark suite also typically consists of several sub-items. For example, the integer suite of SPEC CPU 2006 includes the following sub-items:

| Sub-item | Description |
| --- | --- |
| 400.perlbench | Spam detection written in Perl |
| 401.bzip2 | bzip compression algorithm |
| 403.gcc | The gcc compiler |
| 429.mcf | Combinatorial optimization for single-depot vehicle scheduling in large-scale public transportation |
| 445.gobmk | The game of Go, an AI search problem |
| 456.hmmer | Gene sequence search using hidden-Markov-model-based gene recognition |
| 458.sjeng | Chess, an AI search problem |
| 462.libquantum | Simulation of quantum computing for prime factorization |
| 464.h264ref | H.264 video encoding of YUV source files |
| 471.omnetpp | Simulation of a large Ethernet network with the CSMA/CD protocol |
| 473.astar | Pathfinding with the A\* algorithm |
| 483.xalancbmk | XML to HTML format conversion |

In addition to the integer suite, SPEC CPU 2006 also includes a floating-point suite, with programs covering fluid dynamics, quantum chemistry, biomolecules, finite element analysis, linear programming, image ray tracing, computational electromagnetics, weather forecasting, speech recognition, and so on.

Of course, benchmarks also need to keep up with the times in order to represent the programs of a new era. As of 2024, SPEC CPU has gone through 6 versions, released in 1989, 1992, 1995, 2000, 2006, and finally 2017. SPEC CPU 2017 adds a number of new programs to represent new application scenarios, such as biomedical imaging, 3D rendering and animation, and an artificial-intelligence Go program based on Monte Carlo tree search (very likely influenced by AlphaGo in 2016).
> #### Comment: CoreMark and Dhrystone are not good benchmarks
>
> CoreMark and Dhrystone are synthetic programs,
> that is, programs stitched together from several code fragments.
> For example, CoreMark consists of three code fragments: linked-list operations, matrix multiplication, and state-machine transitions;
> Dhrystone consists of code fragments for string operations.
>
> The biggest problem with synthetic programs as benchmarks is that they are not representative:
> which application scenarios do CoreMark and Dhrystone represent? Compared with the various real applications in SPEC CPU 2006,
> the code fragments in CoreMark can at best be considered C-course homework;
> Dhrystone is even further from real application scenarios: its code is very simple (it uses short string constants),
> and with a modern compiler the code fragments in the loop body are likely to be deeply optimized (recall `pattern_decode()` in NEMU),
> so the evaluation results are inflated and cannot objectively reflect the performance of the system.
> [This article](https://www.transputer.net/tn/27/tn27.html) analyzes the defects of Dhrystone as a benchmark in detail.
>
> Ironically, when releasing their products today, many CPU manufacturers
> still use CoreMark or Dhrystone results to characterize product performance,
> and some of them even claim to be products for high-performance scenarios.
> [Architecture guru and Turing Award winner David Patterson, when introducing Embench](https://www.sigarch.org/embench-recruiting-for-the-long-overdue-and-deserved-demise-of-dhrystone-as-a-benchmark-for-embedded-computing/), criticized Dhrystone as long outdated and argued that it should be retired. In fact, the first version of Dhrystone was released in 1984, and Dhrystone has not been maintained or upgraded since 1988.
> The computer world has changed dramatically since the 1980s:
> applications have been renewed, compilation technology has matured, and hardware computing power has improved enormously.
> Using a 40-year-old benchmark to evaluate today's computers is bound to be problematic.

However, the programs in SPEC CPU are a bit too realistic for a teaching-oriented project like "One Student One Chip":
on the one hand, they are very large and take hours to run even on a real x86 machine;
on the other hand, they need to run in a Linux environment, which means we would first have to design a CPU that can boot Linux before we could run the SPEC CPU benchmark at all.

Instead, we would like a set of benchmarks suitable for teaching scenarios that meets the following conditions:

* The scale is not too large: the execution time in an emulator, or even in the RTL simulation environment, is within 2 hours
* It can run on bare metal, without booting Linux
* The programs are representative, unlike the synthetic programs used by CoreMark and Dhrystone

In fact, the microbench integrated into am-kernels is a good choice.
First, microbench provides test sets of multiple scales: an emulator can use the `ref` scale, while the RTL simulation environment can use the `train` scale.
Second, microbench is an AM program, so it can run without booting Linux.
In addition, microbench contains 10 sub-items, covering sorting, bit manipulation, a language interpreter, matrix computation, prime number generation, the A\* algorithm, maximum network flow, data compression, MD5 checksum, and other scenarios.
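As a reminder of how these scales are selected in practice: when running microbench under AM, the scale is chosen through `mainargs`, in the same way as the `mainargs=test` example in the update instructions above (the exact set of scale names is defined by microbench itself, so treat the values below as an illustration):

```bash
cd am-kernels/benchmarks/microbench
# mainargs selects the test scale, e.g. test / train / ref
make ARCH=riscv32e-ysyxsoc mainargs=train run
```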
Therefore, when this handout later mentions performance evaluation without specifying a benchmark, it refers to the microbench `train` scale by default.

If the processor targets a specific application scenario, such as running Super Mario,
then you can simply use Super Mario as the benchmark; this amounts to taking the "Super Mario gaming experience" as the standard for "running well". Unlike microbench, Super Mario is a program that never terminates, so FPS, rather than running time, can be used as the quantitative metric.

## Find performance bottlenecks

### Performance formula and optimization direction

We can measure the running time of the benchmark to obtain the performance of the system.
But the running time is a single number, and it is hard to find the system's performance bottleneck from it alone, so we need some finer-grained data.

In fact, we can break the running time of a program down into the following three factors:

```
        time     inst    cycle    time
perf = ------ = ------ * ------ * ------
        prog     prog     inst    cycle
```

The goal of performance optimization is to reduce the running time of the program, that is, to reduce each of these factors, which also reveals the three directions of performance optimization.

The first direction is to reduce the number of instructions the program executes (i.e., the dynamic instruction count). Possible measures include:

 1. Modify the program and adopt a better algorithm.
 2. Adopt better compiler optimization strategies. Taking gcc as an example, besides general optimization-level flags such as `-O3` and `-Ofast`, gcc has about 600 compilation options related to the quality of the generated code. Choosing appropriate options can greatly reduce the dynamic instruction count of a program.
    For example, in one of yzh's projects, compiling CoreMark with only `-O3` gave about 3.12 million dynamic instructions for 10 rounds; with some additional targeted compilation options enabled, the dynamic instruction count dropped to about 2.25 million, a significant performance improvement.
 3. Design and use instruction sets with more complex behavior. We know that CISC instruction sets contain instructions with more complex behavior; if the compiler uses these complex instructions, the dynamic instruction count can be reduced. Alternatively, you can add custom specialized instructions to the processor and have the program use them.

The second direction is to reduce the average number of cycles per instruction, i.e., the CPI (Cycles Per Instruction), or equivalently to increase its reciprocal, the IPC (Instructions Per Cycle), i.e., the average number of instructions executed per cycle. This metric reflects the microarchitecture design of the processor: a powerful processor can execute more instructions per cycle. Therefore, IPC is usually improved by optimizing the processor's microarchitecture so that the processor finishes executing the program sooner. Microarchitecture optimization has several different directions, which we will briefly discuss below.

The third direction is to reduce the time per cycle, that is, to increase the number of cycles per unit time, which is the frequency of the circuit.
Possible optimization measures include:

 1. Optimize the front-end design of the digital circuit to reduce the logic delay of the critical path.
 2. Optimize the back-end design of the digital circuit to reduce the routing delay of the critical path.

If we can quantify these three factors, we can better assess the potential of each optimization direction, which guides us toward the performance bottleneck.
Fortunately, these metrics are not hard to obtain:

* The dynamic instruction count can be counted directly in the simulation environment
* With the dynamic instruction count and a count of the cycles, you can calculate the IPC
* The frequency of the circuit can be obtained from the synthesis report

> #### todo:: Count the IPC
>
> Try to count the IPC in the simulation environment.
>
> In fact, back when we implemented the bus, we already asked you to evaluate the running time of a program using the performance formula described above,
> except that at the time we computed it as "number of cycles needed to execute the program / frequency".
> The performance formula above merely breaks the number of cycles needed by the program further into two factors.
> But this split still gives us extra information:
> after all, the number of cycles needed to execute the program is related to both the program and the processor,
> while the average number of cycles per instruction is related only to the processing capability of the processor.

### Performance model for simple processors

Even though the IPC is not difficult to count, just as the running-time statistics do not tell us how to optimize the running time, the IPC statistics do not tell us how to optimize the IPC.
To find the performance bottleneck, we need to analyze the factors that affect the IPC, just as we analyzed the factors of the running time above.
To do this, we need to re-examine how the processor executes instructions.

```
   /--- frontend ---\   /-------- backend --------\
                              +-----+  <--- 2. computation efficiency
                         +--> | FU  | --+
   +-----+     +-----+   |    +-----+   |    +-----+
   | IFU | --> | IDU | --+              +--> | WBU |
   +-----+     +-----+   |    +-----+   |    +-----+
      ^                  +--> | LSU | --+
      |                       +-----+
1. instruction supply            ^
                                 |
                         3. data supply
```

The diagram above shows the structure of a simple processor. So far we have understood this diagram from a functional perspective; now we need to understand it from a performance perspective.
We can divide the processor into a front end and a back end: the front end covers instruction fetch and decode, while the remaining modules belong to the back end and are responsible for actually executing the instructions and changing the processor state.
Note that the processor's front end and back end are not the same as the front-end and back-end design of digital circuits mentioned above;
in fact, the processor's front-end design belongs to the front-end design of digital circuits.

To improve the execution efficiency of the processor, we need the following:

 1. The front end of the processor needs to ensure instruction supply. If the front end cannot fetch enough instructions, the computing power of the processor cannot be fully used. Because every instruction must be fetched before it can be executed, the instruction supply capability of the front end affects the execution efficiency of all instructions.
 2. 
The back-end of the processor needs to ensure computing efficiency and data supply + +* For most computation-class instructions, their execution efficiency depends on the computational efficiency of the corresponding functional unit. + For example, the execution efficiency of a multiplier and division instruction is also affected by the computational efficiency of the multiplier and divider. Similar to floating-point execution and floating-point processor unit FPU. +* For memory access class instructions, the execution efficiency depends on the memory access efficiency of the LSU. In particular, for the load instruction, the processor needs to wait for the memory to return data before it can write it back to the register pile. This means that the efficiency of the load instruction depends on the data supply capacity of the LSU and the memory. The store instruction is special because the store instruction does not need to be written to the register pile, in principle, the processor does not have to wait for data to be fully written to memory. In high-performance processors, a store buffer component is usually designed, after the processor writes the information of the store instruction to the store buffer, the store execution is considered to be complete. The store buffer component then controls the actual writing of the data to the memory. Of course, this adds complexity to the processor design, for example, the load instruction also needs to check if the latest data is in the store buffer. + + + +So, how can we quantitatively evaluate the instruction supply, computational efficiency, and data supply of a processor? +In other words, what we really want to know is whether modules such as IFU and LSU are working at full speed when the processor is running the specified benchmark.To do that, we need to collect more information. + + +### Performance events and performance counters + + + +In order to quantitatively evaluate a processor's instruction supply, computational efficiency, and data supply, we need to further understand the detailed factors that affect them. Take instruction supply as an example, how can instruction supply capacity be considered strong? +What can directly reflect the ability of the instruction supply is whether the IFU gets the instruction. To do this, we can consider "IFU fetch instruction "as an event and count the frequency of this event. If this event happens often, the command supply capacity is strong; Otherwise, the instruction supply capacity is weak. + + + +These events are called performance events, and through them, we can translate some of the more abstract performance indicators in the performance model into concrete events on the circuit. Similarly, we can measure the strength of data supply by counting the frequency of the event "LSU gets data "; count the frequency of the event "EXU complete calculation "to measure the efficiency of the calculation. + + + +To count the frequency of performance events, we just need to add some counters to the hardware, when a performance event is detected, the value of the counter is increased by 1. +These counters are called performance counters. With a performance counter, we can see where the time spent running a program on the processor is, it's like profiling the inside of the processor. + + + +It is not difficult to detect the occurrence of performance events on the circuit, we can use the handshake signal of the bus mechanism to detect. 
For example, the handshake of the R channel of the IFU finger fetch indicates that the IFU has received the data returned by the AXI bus, thus completing a finger fetch operation. So when the R-channel handshake occurs, we can increment the corresponding performance counter by 1. + + + +> #### todo:: Adds a performance counter +> +> Try to add some performance counters to the NPC, including at least one of the following performance events: +> +> * IFU fetch instruction +> * LSU gets data +> * EXU Complete the calculation +> * Decode various types of instructions, such as calculation class instructions, memory access instructions, CSR instructions, etc +> +> Performance counters are also essentially implemented by circuits. As the number of performance counters increases, they will take up more and more circuit area and may even affect the critical path in the circuit. +> So we don't require performance counters to participate in the flow sheet, you just need to use them in the simulation environment: +> You can use RTL to implement performance counters and output their values at the end of the simulation by means of '$display()' etc. +> Then choose not to instantiate them by way of configuration during synthesis; or the detection signal of the performance event is connected to the simulation environment through DPI-C to realize the performance counter in the simulation environment. +> This way, you can add performance counters as much as you want without worrying about affecting the area and frequency of the circuit. +> After implementation, try running a microbench test scale to collect results for performance counters. +> If your implementation is correct, there should be consistency between different semantically similar performance counters. +For example, the total number of instructions of different classes obtained by decoding should be the same as the number of instructions fetched by IFU, and also the number of dynamic instructions. +> Try to find more consistent relationships and check if those relationships hold up. + + + +Sometimes we care more about when and why events don't happen than when they do. For example, we care more about when IFU can't get instructions and why IFU can't get instructions, figuring out why this is happening helps us understand where the bottlenecks in instruction supply are, i t provides guidance for improving instruction supply of processor. We can define the "event does not occur" as a new event and add a performance counter for the new event. + + + +> #### todo:: Add a performance counter (2) +> +> Add more performance counters to NPCS and try to analyze the following issues: +> +> * What percentage of each type of instruction? How many cycles do they each need to execute on average? +> * What are the reasons why IFU can't get instructions? What are the chances that IFU can't fetch an instruction for these reasons? +> * What is the average memory access delay for LSU? + + + + +> #### comment: trace the performance counter +> +> The use of the performance counter described above is output and analysis after the simulation. +> If we output the value of the performance counter every cycle, we can get a trace of the performance counter! +> According to this trace, with some drawing tools (such as python's matplotlib drawing library), we can plot the value of the performance counter over time, +> Visualize how performance counters change during simulation, this helps us to better determine whether the performance counter changes as expected. 
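To make the todo items above concrete, here is a minimal Verilog sketch of two such performance counters driven by AXI R-channel handshakes. The port names (`ifu_rvalid` and so on) are placeholders for whatever signals your NPC actually exposes, and the `final` block is a SystemVerilog construct supported by Verilator; printing from the C++ harness or through DPI-C works just as well.

```verilog
module perf_counter (
  input clock,
  input reset,
  input ifu_rvalid, ifu_rready,  // R-channel handshake of the IFU fetch port
  input lsu_rvalid, lsu_rready   // R-channel handshake of the LSU load port
);
  reg [63:0] cycle_cnt, ifu_fetch_cnt, lsu_load_cnt;

  always @(posedge clock) begin
    if (reset) begin
      cycle_cnt     <= 64'd0;
      ifu_fetch_cnt <= 64'd0;
      lsu_load_cnt  <= 64'd0;
    end else begin
      cycle_cnt <= cycle_cnt + 64'd1;
      // an R-channel handshake means one fetch / one load has completed
      if (ifu_rvalid && ifu_rready) ifu_fetch_cnt <= ifu_fetch_cnt + 64'd1;
      if (lsu_rvalid && lsu_rready) lsu_load_cnt  <= lsu_load_cnt  + 64'd1;
    end
  end

  // dump the counters when the simulation finishes
  final begin
    $display("cycles = %0d, ifu fetches = %0d, lsu loads = %0d",
             cycle_cnt, ifu_fetch_cnt, lsu_load_cnt);
  end
endmodule
```

A counter like `ifu_fetch_cnt` should match the dynamic instruction count reported by your simulation environment, which is exactly the kind of consistency check suggested above.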
### Amdahl's law

Performance counters can provide quantitative guidance for optimizing the processor microarchitecture. So where is the performance bottleneck? Which optimizations are worth doing? What performance benefit should we expect from the optimization work?
We need to answer these questions before starting any concrete optimization, so that we can avoid optimizations with low expected gains and spend more time on work with high returns. It sounds like predicting the future, but Amdahl's law can give us the answer.

[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) was proposed by the computer scientist Gene Amdahl in 1967.
It states:

```
The overall performance improvement gained by optimizing a single part
of a system is limited by the fraction of time that the improved part
is actually used.
```

Suppose a certain part of the system accounts for a fraction `p` of the total running time, and that part is sped up by a factor of `s` after optimization; then the speedup of the whole system is `f(s) = 1 / (1 - p + p/s)`. This is the formula form of Amdahl's law.

For example, suppose the execution of a program consists of two independent parts A and B, where A accounts for 80% of the time and B for 20%:

* If B is sped up by a factor of 5, the speedup of the whole program is `1 / (0.8 + 0.2/5) = 1.1905`;
* If B is sped up by a factor of 5000, the speedup of the whole program is `1 / (0.8 + 0.2/5000) = 1.2499`;
* If A is sped up by a factor of 2, the speedup of the whole program is `1 / (0.2 + 0.8/2) = 1.6667`.

```
<------- A --------><-B->
++++++++++++++++++++ooooo  original program

++++++++++++++++++++o      B sped up by 5x

++++++++++++++++++++       B sped up by 5000x; its remaining time is negligible

++++++++++ooooo            A sped up by 2x
```

In general, a 5000x speedup takes far more effort than a 2x speedup, yet Amdahl's law tells us that speeding up B by a factor of 5000 is still not as good as speeding up A by a factor of 2. This counterintuitive result tells us that we cannot look only at the speedup of one part; we must also consider the fraction of time that part occupies, and evaluate the effect of an optimization from the perspective of the whole system.
Therefore, for processor performance optimization, it is important to measure in advance, through performance counters, how much time the object to be optimized actually accounts for.

> #### todo:: Find performance bottlenecks based on the performance counters
>
> Based on the statistics from the performance counters, try to identify some potential optimization targets,
> then use Amdahl's law to estimate the theoretical benefit each of them can achieve, in order to determine where the system's performance bottleneck lies.

> #### caution:: Professional quality in computer architecture work
>
> There is a piece of advice widely circulated in the software world:
> ```Talking about optimization without a workload is rogue.```
> It means that the choice of an optimization scheme must be based on how the workload actually behaves.
> This is especially true in the architecture field: we cannot optimize a processor design by intuition, changing whatever looks like an optimization opportunity.
+> Otherwise, it's easy to adopt a scheme that doesn't work, or even cause performance regression in real scenarios. On the contrary, it is scientific to adopt a suitable design plan based on the evaluation data. +> In fact, Amdahl's law is easy to understand, without regard to professional background, we can even package it as a math word problem for elementary school students to solve. +But we have also seen a lot of beginners "rogue ", to put it bluntly, or lack of relevant professional quality. +> Everyone to learn "One Student One Chip", not only to learn RTL coding, +> It is more important to learn scientific methods to solve problems, and exercise professional quality in this direction. +> So that when you encounter real problems in the future, you know how to solve them in the right way. + + + +> #### caution:: Top-down debugging method +> +> You've been debugging a lot of feature bugs, but there's actually a kind of bug called a performance bug, its performance is not a program error or crash, but the performance of the program is lower than expected. +> Of course, the process of debugging performance bugs is similar to the process of performance optimization, look for performance bottlenecks in the system. +> In fact, debugging bugs and debugging performance bugs are also similar. +> When debugging a bug, the first thing we see is that the program is faulty or crashes. +> But just reading such information, it is difficult to find bugs; Therefore, it is necessary to understand the behavior of the program through various levels of trace tools to find the specific performance of the program when the error; Then use tools such as gdb/ waveform for detailed analysis at the variable/signal level. +> When debugging a performance bug, the first thing we see is the running time of the program, but it doesn't tell us directly where the performance bottleneck is; +> Then the running time of the program is decomposed into 3 factors by the performance formula. +> Investigate the optimization potential from the three directions of compilation, microstructure and frequency; For microstructures, it is still difficult to find performance bottlenecks based on statistical IPC alone. +> So we need to analyze the factors that affect IPC, divide the processor into three parts, understand the process of processor execution from instruction supply, data supply and computational efficiency. +> But we need more concrete quantitative data; So we need to add a performance counter to count the occurrence of performance events in each module. +> Finally, find the real performance bottleneck through Amdahl's law. +> It's no coincidence that debugging both of these bugs uses a similar top-down analysis approach, it is the embodiment of abstract thinking in the field of computer systems: abstraction is the only way to understand complex systems. +> In fact, if you try to debug using gdb/ waveform at first, you will find it very difficult. +> This is because the amount of detail at the bottom is so large that it is difficult to provide a macro perspective. +> Therefore, we need to start at the high level semantics and trace down the appropriate path, +> Locate a small area at the bottom, which helps us quickly find the problem. + + +### Calibrate memory access delay + + + +After we connect NPC to ysyxSoC, modules such as SDRAM controller provide a more realistic memory access process. 
You can imagine if we counted performance counters before we plugged in ysyxSoC, due to the difference in memory access delay, the result will be very different from that after connecting to ysyxSoC. And the different statistical results will guide us to optimize in different directions, but if we aim for the stream, these different directions of optimization are likely to fail to achieve the desired effect. So the closer the simulated environment behaves to the real chip, the less error there is in the evaluation, the more real the performance gains from optimizations guided by performance counters. + + + +In fact, the previous ysyxSoC environment assumed that the processor and various peripherals were running on the same frequency: +A cycle of verilator emulation is both a cycle in the processor and a cycle in the peripherals. But in practice this is not the case: due to electrical characteristics, peripherals can usually only operate at low frequencies, for example, SDRAM particles usually only operate at around 100MHz, and too high a frequency can lead to timing violations, making SDRAM particles not work correctly; But on the other hand, processors using advanced processes are often able to run at higher frequencies, for example, a version of the yzh multi-period NPC reaches a frequency of about 1.2GHz in the nangate45 process provided by default by the 'yosys-sta' project. +In the above configuration, the SDRAM controller goes through 1 cycle, the NPC should go through 12 cycles, however, verilator does not perceive the frequency difference between the two, and still simulates under the assumption that the two frequencies are the same. +The simulation result is much more optimistic than the real chip, which may also make some optimization measures can not achieve the expected effect in the real chip. + + + +> #### Danger:: Update yosys-sta +> +> We updated the 'yosys-sta' project at 2024/04/09 08:30:00, adding the netlist optimization tool developed by the iEDA team. +> The comprehensive netlists generated by yosys have been significantly optimized to bring their timing evaluation results closer to commercial tools. +> If you obtained the code for 'yosys-sta' before the above time, delete the existing 'yosys-sta' project and clone it again. + + + +In order to obtain more accurate simulation results to guide us in more effective optimization, we need to conduct a calibration of the memory access delay. Calibration in one of two ways, one kind is to use simulators, support multiple clock domains such as VCS or [ICARUS verilog] (https://github.com/steveicarus/iverilog). Unlike verilator, which is implemented in a cycle-accurate model, this simulator is implemented in an event queue model. Consider every calculation in Verilog as an event, and you can maintain the latency of the event. Thus, the order of each calculation can be correctly maintained when different modules in the multi-clock domain work at different frequencies. However, in order to maintain the event queue model, such emulators usually run slower than verilator. + + + +The second way to calibrate is to modify the RTL code, insert a delay module into the ysyxSoC, responsible for delaying the request for several cycles to simulate the effect of the device operating at a low frequency. The number of cycles the NPC waits is close to the number of cycles it will wait in the future when running at high frequencies. 
This approach is not complicated to implement, and it can be simulated with the faster Verilator, so we choose it. In addition, this approach also works on an FPGA.

Naturally, to implement the delay module we just need the following behavior: when the delay module receives a reply from the device, it does not forward the reply to the upstream module immediately, but waits for a certain number of cycles before replying. How to calculate the number of cycles to wait, however, takes some thought.
Consider the yzh example above: if a request takes 6 cycles in the SDRAM controller, the NPC should wait a total of `6 * 12 = 72` cycles; if the SDRAM controller happens to be refreshing the SDRAM particles and the request takes 10 cycles in the SDRAM controller, the NPC should wait a total of `10 * 12 = 120` cycles; if the request goes to flash and takes 150 cycles in the SPI master, the NPC should wait a total of `150 * 12 = 1800` cycles.
As you can see, the number of cycles the delay module needs to wait depends on how long the device takes to service the request;
it is not a fixed constant, so it has to be computed dynamically inside the delay module.
Suppose a request takes `k` cycles in the device, and the processor-to-device frequency ratio is `r` (we should have `r >= 1`); then the delay module needs to compute the number of cycles the processor has to wait, `c = k * r`.
To perform this dynamic calculation, we need to consider two more questions:

1. How do we implement the multiplication cheaply?
2. If `r` is not an integer, how do we multiply by a fraction?
For example, if `yosys-sta` reports a frequency of 550MHz, then `r = 550/100 = 5.5`; but if 5.5 is treated as 5, a request that takes 6 cycles in the device introduces a 3-cycle error on the processor side. For a CPU running at high speed this error is too large, and the accumulated error will significantly distort the values of the performance counters, which in turn affects the optimization decisions.

Since the ysyxSoC code does not go through synthesis and tape-out, there are some easy ways out, for example using `*` to compute the product and fixed-point numbers to represent the fraction.
But as an exercise, we ask you to try a solution that would be synthesizable,
so that if you ever need to solve a similar problem in a synthesizable circuit in the future, you will know how to do it.

Let us first consider the multiplication when `r` is an integer.
Since the delay module itself also has to wait for the device's reply, the time it waits is exactly the `k` cycles the request spends in the device. So we can simply let the delay module maintain a counter and add `r` to it in every cycle it waits. For a given processor frequency and device frequency, `r` is a fixed value and can therefore be hardcoded directly into the RTL code. After the delay module receives the reply from the device, it enters a waiting state, decrements the counter by 1 per cycle, and forwards the reply to the upstream module when the counter reaches 0.

Now let us consider the case where `r` is not an integer.
The fractional part is inconvenient to handle directly, and simply truncating it introduces a large error, so we need to find a way to account for the fractional part as well.
In fact, we can introduce a scaling factor `s` and add `r * s` to the counter in every cycle. At the end of the accumulation the counter thus holds `y = r * s * k`; we then update the counter to `y / s` before entering the waiting state. A minimal sketch of this counting logic is shown below.
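The following fragment is only one possible way the accumulation and countdown described above might look in RTL; it is not the required implementation of `apb_delayer.v`. It assumes, purely for illustration, `r = 5.5` and `s = 16` (so `r * s = 88` can be hardcoded), and uses the placeholder signals `request_pending` and `device_done` to stand for the actual APB handshake state.

```verilog
// assumed example values: r = 5.5, s = 16, so r * s = 88 (truncated to an integer)
localparam R_MUL_S = 88;

reg [31:0] counter;
reg        waiting;  // 1: the device has replied, keep stalling the upstream reply

always @(posedge clock) begin
  if (reset) begin
    counter <= 32'd0;
    waiting <= 1'b0;
  end else if (!waiting) begin
    if (device_done) begin
      // after k device cycles the counter holds roughly k * r * s;
      // dividing by s is just a right shift when s is a power of two
      counter <= counter >> 4;
      waiting <= 1'b1;
    end else if (request_pending) begin
      counter <= counter + R_MUL_S;  // one more cycle spent in the device
    end
  end else begin
    if (counter <= 32'd1) begin
      counter <= 32'd0;
      waiting <= 1'b0;               // now forward the buffered reply upstream
    end else begin
      counter <= counter - 32'd1;    // stall until about k * r processor cycles have passed
    end
  end
end
```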
Since `s` is a constant, the product `r * s` can also be hardcoded directly into the RTL, just like `r` was.
Of course, `r * s` may still not be an integer, so we truncate it to one.
In theory this still introduces some error, but it can be shown that the error is much smaller than before.
The value of `y`, however, is computed dynamically and cannot be hardcoded into the RTL,
so for a general `s`, computing `y / s` would require a division.
You should quickly realize that we can choose a special `s`, namely a power of two, so that the division becomes a simple right shift, as the sketch above already does.
In this way, we reduce the error to about 1/s of what it was:
only when the error accumulated during the accumulation stage reaches `s` does the error of the new method grow by 1.

Back to the current ysyxSoC: SDRAM uses the APB interface, so we need to implement a delay module for APB. ysyxSoC already includes a framework for the APB delay module, integrated upstream of the APB Xbar, where it captures all APB access requests, including those to SDRAM. However, the framework does not provide a concrete implementation of the delay module, so by default there is no delay effect. To calibrate the SDRAM access latency in ysyxSoC, you still need to implement the functionality of the APB delay module yourself.

> #### todo:: Calibrate the memory access delay
>
> Implement the APB delay module in ysyxSoC as described above, to calibrate the memory access delay of the simulation environment.
> Specifically, if you develop in Verilog, implement the corresponding code in `ysyxSoC/perip/amba/apb_delayer.v`;
> if you develop in Chisel, implement the corresponding code in the `APBDelayerChisel` module in `ysyxSoC/src/amba/APBDelayer.scala`, and change the `Module(new apb_delayer)` in `ysyxSoC/src/amba/APBDelayer.scala` to instantiate the `APBDelayerChisel` module instead.

> #### todo:: Find the highest synthesis frequency
>
> In addition to frequency, area is another evaluation metric of a circuit.
> At the standard-cell level of the process library, the two essentially constrain each other:
> among cells with the same function, if you want a cell with lower logic delay, you need more transistors, and hence more area, to give it stronger drive strength.
> Given this trade-off between area and frequency, the synthesizer generally uses as little area as possible to reach a given target frequency; it does not separately explore the maximum frequency a given circuit could achieve.
> If the quality of your circuit is relatively high, you may observe that the frequency in the synthesis report keeps increasing as the target frequency increases,
> although the overall area of the circuit increases as well.
> Therefore, if you do not care about area cost for now, you can set a relatively high target frequency for the synthesizer to try to reach.
>
> In processor design, some factors set an upper bound on the processor frequency:
> 1. The read delay of the register file. The read operation of the register file usually has to complete within one cycle and cannot be split across multiple cycles, so the processor frequency will not exceed the maximum operating frequency of the register file.
> 2. The delay of an adder whose bit width equals the processor word length. 
Usually the addition operation of EXU needs to be completed in one cycle, if the addition operation takes multiple cycles to complete, it will significantly reduce the execution efficiency of all instructions containing the addition operation. Including addition instruction, subtraction instruction, memory access instruction (need to calculate the memory access address), branch instruction (need to calculate the target address), even 'PC + 4' calculation, which makes the IPC of the program significantly reduced. Therefore, the processor frequency will not exceed the maximum operating frequency of this adder. +> 3. Read/write delay of SRAM. As a fully customized unit, SRAM cannot be logically designed to optimize its read/write latency. +> Therefore, as long as SRAM is used, the processor's frequency will not exceed the maximum operating frequency of SRAM. +> +> You can write some simple small modules to individually evaluate the maximum operating frequency of these parts. +> In order to avoid the I/O ports, you need to insert some triggers in both the input and output of these parts. +> SRAM as a fully customized unit, its maximum operating frequency is usually recorded in the corresponding manual, and we do not use SRAM at present, you can not carry out the evaluation of SRAM. +> +> After evaluation, you can set the integrated target frequency of the processor higher than the maximum operating frequency of the above parts. +> To guide the synthesizer to synthesize results at higher frequencies as much as possible. +> Of course, you can also set the target frequency directly to a value that is difficult to achieve, such as 5000MHz, however, we recommend that you go through the above evaluation process to find out the maximum operating frequency of these components. + + +> +> #### comment: Programmable counter increments +> +> The 'r' mentioned above is a constant for RTL design, but not in complex processors that can be dynamically tuned. For this complex processor, we need to make 'r' programmable by storing it in a device register. After the software performs dynamic frequency modulation, the new 'r' is written to the device register. +> Therefore, we also need to map this device register to the address space of the processor. +> Make it accessible to the processor via the SoC. Of course, s can also be designed to be programmable. +> However, this requires a lot of changes to ysyxSoC, so we won't require you to implement the above programmable features. + + +> +> #### todo:: Rediscover the optimization bottleneck +> +> After adding the delay module, rerun some tests and collect the performance counter statistics, +> Then look for performance bottlenecks according to Amdahl's law. + + +> +> #### todo:: Evaluate the performance of NPCS +> +> After adding the delay module, run microbench train-scale tests, record various performance data, includes frequency information and various performance counters. +> +> After calibrating the memory access delay, running microbench train-scale tests in ysyxSoC is expected to take several hours. +> But we will get performance data very close to the streaming environment. +> Then you can re-evaluate and record performance data each time you add a feature, to help you tease out the performance benefits of each feature. + + + +> #### danger:: Records performance data +> +> Next, we ask you to record performance data after each evaluation. +> If you apply for the sixth streaming film, you will submit this part of the record. 
+> If the recorded situation does not match the actual development process, you may not get the streaming opportunity. +> We want to force you to recognize and understand how NPC behave in this way, exercise the basic literacy of processor architecture design, +> Instead of just translating architecture diagrams from reference books into RTL code. +> +> Specifically, you can record as follows: +| commit | comment | simulation cycle | instruction count | IPC | synthesize frequency | synthesize area | performance counter 1 | performance counter 2 | ... | +> | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +> | 0123456789abcdef | eg, complement cache | 200000 | 10000 | 0.05| 750MHz | 16000 | 3527 | 8573 | ... | +> +> Where: +> +> * We ask you to add a rule 'make perf' to the Makefile in the NPC project directory. +> Enable future execution +> +> ```bash +> commit in the git checkout table +> make perf +> ``` +> +> Then, the performance data in the corresponding table can be reproduced. +> If the reproduced situation is materially inconsistent with the data recorded in the table, it may be judged as a violation of academic integrity if it cannot be reasonably explained. +> +> * You can assume that you are conducting a scientific study and that you are responsible for the experimental data during the study: +> Experimental data need to be reproducible and able to stand up to scrutiny in public. +> * You can create a new worksheet in the learning record called 'NPC Performance Evaluation Results' to record these performance data +> * You can replace 'performance counter 1' and 'performance counter 2' with the actual names of the respective performance counters +> * You can record more performance counters according to the actual situation +> * You can also record your analysis of performance data in the description column +> * We encourage you to record as many entries of performance data as possible to help you quantify the performance changes of NPC. + + +> +> #### question: Is it worth optimizing the frequency? +> +> After calibrating the memory access delay against the main frequency, you will see a significant decrease in IPC. +> It can be expected that if the frequency increases further, the number of cycles of the access delay will also increase, resulting in a decrease in IPC. +> So, is the frequency worth optimizing? +> If so, where do the performance gains from optimizing the frequency come from? +> If it is not worth it, where is the performance regression caused by optimizing the frequency? +> Try analyzing your guess with performance counters. + + +> +> #### comment:: Calibrates the memory access delay on the FPGA +> +> Modern FPGas typically contain DDR memory controllers. +However, limited by the implementation principle of FPGA, the CPU frequency of PL part also has a big gap with the ASIC process. +> Even the CPU frequency is lower than the memory controller frequency. For example, a memory controller can run at 200MHz, but the CPU can only run at hundreds or even tens of MHz on the FPGA. In real chips, the CPU can usually run at more than 1GHz (for example, the third-generation Xiangshan is targeting 3GHz). Obviously, the performance data obtained in such an evaluation environment is significantly distorted for CPU performance tests targeting the stream film. +> Solving the above memory frequency inversion situation by calibrating the memory access delay, is an issue that a processor enterprise must address before using an FPGA to evaluate CPU performance. 
+> +> In fact, since real DDR is a complex system, even with the delay module scheme described above, there are also more questions to consider: +> +> * Due to the differences in analog circuit components, the phy module of the memory controller in the FPGA is different from the memory controller of the ASIC. +> This may affect access latency +> * Limited by the implementation principle of FPGA and the configurable range of PLL on FPGA, +> DDR controllers also operate less frequently than ASIC memory controllers, +> However, DDR particles cannot be reduced in equal proportion, resulting in inaccurate memory access delay +> * After the DDR controller downfrequency, its refresh rate and other parameters are also inconsistent with the ASIC memory controller +> +Therefore, to solve the memory frequency inversion problem well on the FPGA is also a big challenge in the industry. +For example, the Xiangshan team set up a group led by engineers to solve the problem. +> +> In fact, whether you need to calibrate the memory access delay of the FPGA depends on the usage scenario and the target of the FPGA: +> +> * Teaching: Use FPGas only as an environment for functional testing. At this point, the role of the FPGA is to accelerate the simulation process, no matter what the ratio of running frequencies between the processor and the memory controller is, it theoretically does not affect the results of the functional test. +> * Competition or research project: Use the FPGA as the environment for performance testing, but also as the target platform. +> The stream is not targeted at this time, so there is no need to calibrate the memory access delay. +> * Enterprise product development: Use the FPGA as the environment for performance testing, but at the same time, the stream film as the target. +> We would expect the performance data derived from the FPGA to be as consistent as possible with the real chip. +> In this case, calibrating the memory access delay of the FPGA will be indispensable. +> "A core for life" although a lot of simplification, but in general, I still hope that everyone can appreciate the general process of enterprise product development, +And given the engineering challenges of calibrating real DDR controllers, we don't require you to use FPGA. +> In contrast, calibrating memory access latency in a simulation environment is much easier than in an FPGA. Therefore, we recommend that you perform performance evaluation and optimization in a simulation environment. + + +> #### todo:: Improve the efficiency of functional testing +> +> It is appropriate to use the ysyxSoC simulation environment after the calibrated memory access delay for performance evaluation. But you will also feel that the simulation efficiency of this environment is significantly lower than that of the previous' riscv32e-npc ': +> From the running time of microbench's Train-scale tests, +> The simulation efficiency of 'riscv32e-npc' is tens or even hundreds of times that of 'riscv32e-ysyxsoc'. +> This reflects a trade-off: To get more accurate performance data, +> The more details you want to simulate (such as SDRAM controllers and SDRAM particles), +> Therefore, more time is spent in the process of simulation 1 cycle, resulting in lower simulation efficiency. +> Accordingly, the performance data obtained from 'riscv32e-npc' with higher simulation efficiency is inaccurate. +> Is' riscv32e-npc 'meaningless? +In fact, we can use 'riscv32e-npc' as a functional testing environment. 
+> If there is a bug in 'riscv32e-npc', +> Then this bug also has a high probability of existing in 'riscv32e-ysyxsoc', +But obviously debugging this bug in the more efficient simulation 'riscv32e-npc' is a more appropriate solution. +In this way, we can give full play to the advantages of the two simulation environments, learn from each other, +> Improve the overall efficiency of development and testing. +> Try to modify the relevant simulation flow to support NPC emulation in 'riscv32e-npc' and 'riscv32-ysyxsoc'. +> Where, 'riscv32e-npc' still takes' 0x8000_0000 'as the PC value at reset. + + +## 4 kinds of optimization methods for classical architecture + + + +Once we have identified the performance bottleneck, we can consider how to optimize it. +There are four main types of optimization methods in classical architecture: + + 1. Locality - Use the nature of data access to improve command supply and data supply capabilities. The typical technique is cache. + 2. Parallel - multiple instances work at the same time to improve the overall processing capacity of the system. There are many categories of parallel methods: + +* Instruction level parallelism - executing multiple instructions at the same time, related techniques include pipelining, multi-emission, VLIW, and out-of-order execution +* Data level parallelism - accessing multiple data at the same time, related techniques include SIMD, vector instruction/vector machine +* Task-level parallelism - executing multiple tasks at the same time, related technologies include multi-threading, multi-core, multi-processor, and multi-process; GPU belongs to SIMT, which is a parallel method between data level parallelism and task level parallelism + 3. Prediction - Perform a choice speculatively when you don't know the right choice, then check whether the choice is correct, and if the prediction is correct, you can reduce the delay in waiting, thereby achieving performance gains. If the prediction is wrong, it needs to be recovered through additional mechanisms. Typical techniques include branch prediction and cache prefetch + 4. Accelerators - The use of specialized hardware components to perform specific tasks, thereby improving the efficiency of the execution of that task, some examples include: + +* AI Accelerator - Acceleration of the calculation process of AI loads, usually accessed via a bus +* Custom extension instructions - The accelerator is integrated into the CPU and accessed through new custom extension instructions +* Multiplier and divider - can be seen as a class of accelerators, treating RVM as an extension of RVI. The special hardware module of multiplication and division is controlled by the multiplication and division command to accelerate the calculation process of multiplication and division + + + +> #### caution:: Re-examine the processor architecture design +> Many electronics majors are likely to start off thinking of processor architecture design as "developing a processor with RTL." But RTL coding is only one part of the processor design process, and strictly speaking does not belong to the scope of processor architecture design. +> +> In fact, a qualified processor architect should have the following abilities: +> +> 1. Understand how the program runs on the processor +> 2. For the features that support the operation of the program, it can determine whether they are suitable for implementation at the hardware level or the software level +> 3. 
For the features suitable for implementation at the hardware level, it can propose a set of design schemes that still meet the target requirements under the balance of various factors +> These capabilities reflect the fundamental purpose for which people use computers: to solve real needs through programs. If a feature added at the hardware level has a low benefit to the program, or even if the program does not use the feature at all, +> The decision makers of the relevant solutions are really not professional architects. +> +> In fact, these abilities need to be deliberately practiced. +> We have met many students who can translate the block diagram of the pipeline processor into RTL code according to some references. But there is no way to assess whether a program is running as expected, or how to further optimize or implement new requirements; +Some students have designed an out-of-order superscalar processor, but the performance is not as good as the textbook five-level pipeline. +> This shows that the ability to design architecture is not the same as the ability to code RTL. +> Maybe these students did understand the basic concepts of pipelining and out-of-order superscalar during the development process, but it lacks a global vision and understanding, and only focuses on improving the efficiency of back-end computing. +> Little or no attention to instruction and data supply, +> Lead to the memory access capacity of the processor is much lower than the computing capacity, and the overall performance is poor. Therefore, even if you can design a correct out-of-order superscalar processor, it is not a good processor. To some extent, these students also do not have the ability to design processor architecture. +> +> "A core for Life" tries to exercise your processor architecture design ability from another idea: +> First of all, run the program and understand every detail of program operation from the perspective of software and hardware collaboration; +Then learn the basic principles of processor performance evaluation and understand the microcosmic manifestation of program behavior at the hardware level; +Finally, learn the various architectural optimization methods and use scientific evaluation methods to understand the real benefits of these optimization methods on the operation of the program. +> +> This program of study is very different from textbooks. This is because processor architecture design skills can only be practiced. However, the theoretical classroom using textbooks is limited by the curriculum system and cannot examine the students' ability of architecture design. +> So it's easy to get started with a textbook or reference book, but if you want to become a professional in this direction, understand the limits of these books, +> Cross the boundaries of books at the necessary stages to develop real architecture design skills through targeted training. + + +## Memory hierarchy and locality principle + + + +After calibrating ysyxSoC's memory access delay, you should find that the performance bottleneck is the instruction supply: +Take an instruction to wait for tens of hundreds of cycles, the pipeline can not flow. +To improve the ability of instruction supply, the most suitable is to use caching technology. Before introducing caching, however, we need to understand the storage hierarchy and locality principles of computers. + + +### Memory Hierarchy + + + +There are different storage media in computers, such as registers, memory, hard disks and magnetic tape. 
They have different physical characteristics, so the various indicators are also different. They can be evaluated in terms of access time, capacity, and cost. + +``` +access time /\ capacity price + / \ + ~1ns / reg\ ~1KB $$$$$$ + +------+ + ~10ns / DRAM \ ~10GB $$$$ + +----------+ + ~10ms / disk \ ~1TB $$ + +--------------+ + ~10s / tape \ >10TB $ + +------------------+ +``` + + + +* Register. The access time of the register is very short, basically consistent with the main frequency of the CPU. Current commercial-grade high-performance cpus are clocked at about 3GHz, so register access times are less than 1ns. Registers are small in size, usually less than 1KB. For example, the RV32E has 16 32-bit registers and is 512b in size. +In addition, the manufacturing cost of registers is more expensive, if a large number of registers are used, it will occupy a lot of flow area. +* DRAM. The access time of DRAM is about 10ns, its capacity is much larger than the register, and it is often used in memory. Its cost is also much lower, the price of 16GB memory on an e-commerce platform is 329 yuan, about 20 yuan /GB. +* Mechanical hard drives. The access time of mechanical hard drives is limited by their mechanical components, such as platters that need to be rotated, usually in the order of 10ms. In contrast, mechanical hard drives also have a larger capacity, often up to several terabytes; +Its cost is also cheaper, the price of 4TB mechanical hard disk on an e-commerce platform is 569 yuan, about 0.139 yuan /GB. +* SSD. SSD is also a popular storage medium, and its storage unit uses NAND flash. +Works based on electrical properties, so access is much faster than mechanical hard drives, and its read latency is close to DRAM, +However, due to the characteristics of flash units, write latency is still much higher than DRAM. Its cost is slightly higher than the mechanical hard disk, the price of 1TB solid state disk on an e-commerce platform is 699 yuan, about 0.683 yuan /GB. +* Tape. The storage capacity of tape is very large, the cost is very low, but the access time is very long, about 10s. Therefore, it is rarely used at present and is usually used in data backup scenarios. The price of a 30TB tape drive on an e-commerce platform is 1000 yuan, about 0.033 yuan /GB. + +\ + +It can be seen that due to the limitations of the physical characteristics of the storage medium, no memory can meet various indicators of large capacity, fast speed and low cost at the same time. Therefore, computers usually integrate a variety of memories and organize them organically through certain technologies to form a storage hierarchy. +On the whole, it achieves the comprehensive index of large capacity, fast speed and low cost. It sounds a little crazy, but the key is how to organize the various kinds of memory. + + +### principle of locality + + + +In fact, the above way of organization is exquisite, and the secret is the principle of locality of the procedure. 
Computer architects have found that a program's accesses to memory over a period of time are usually concentrated in a small region:

* Temporal locality - after a storage location is accessed, it is likely to be accessed again within a short period of time
* Spatial locality - after a storage location is accessed, its neighboring locations are likely to be accessed within a short period of time



These phenomena are related to the structure and behavior of programs:

* Programs spend most of their time executing sequentially or in loops, which exhibits spatial locality and temporal locality, respectively
* When writing a program, related variables are placed near each other in the source code, or are organized into structures; the compiler will usually also allocate nearby storage locations for them, which exhibits spatial locality
* The number of variable accesses during program execution is usually no less than the number of variables (otherwise there would be unused variables), so some variables must be accessed multiple times, which exhibits temporal locality



> #### option:: Observe program locality
>
> Program locality is a property of memory accesses, so naturally we can observe it through mtrace!
> Run some programs in NEMU and collect their mtrace.
> After that, post-process the mtrace output and try to visualize your results with some plotting tool.



> #### question:: The locality of the linked list
>
> Is there locality when traversing a linked list?
> Try to compare which has better locality: accessing array elements or accessing linked list elements.



The principle of locality tells us that a program's accesses to memory are concentrated. Even though the capacity of the slow memory is large, the program only accesses a small portion of its data over a period of time. In this case, we can first move this data from the slow memory into the fast memory, and then access it in the fast memory.



This is the trick behind organizing the various memories into a hierarchy:
the memories are arranged in levels, where an upper level is fast but small, and a lower level is large but slow;
when accessing data, the faster upper level is accessed first. If the data is present at the current level (called a hit), it is accessed directly at that level;
otherwise (called a miss), it is looked up at the next level, which passes the target data and its neighbors up to the upper level. Here, "passing the target data to the upper level" exploits temporal locality, expecting the target data to hit in the fast memory the next time it is accessed;
"passing adjacent data to the upper level" exploits spatial locality, expecting the adjacent data to also hit in the fast memory the next time it is accessed.



For example, when accessing DRAM, if the data is not present, the mechanical hard disk is accessed and the target data together with its adjacent data is brought into DRAM.
The next time this data is accessed, it hits in DRAM and is accessed there directly. In this way, we approximate a memory whose access speed is close to DRAM and whose capacity is close to a mechanical hard disk!
In terms of cost, taking the quotations from the e-commerce platform above as an example, a 16GB memory module plus a 4TB mechanical hard disk costs less than 900 yuan in total; if instead you wanted to buy 4TB of DRAM, you would need `329 * (4TB / 16GB) = 84,224` yuan!



Of course, there is no free lunch: achieving the above effect is conditional, and the design of the computer system needs to respect the principle of locality.
On the one hand, the computer system needs to design and implement a memory hierarchy;
on the other hand, programmers also need to write programs with good locality, so that they perform well in the memory hierarchy. If a program has poor locality and its accesses are not concentrated, most accesses will miss in the fast memory, and the performance of the whole system will be close to that of accessing the slow memory.


## Simple cache

### Cache introduction



Going back to the performance bottleneck above: to optimize instruction supply, what we actually need to do is improve the efficiency of accessing DRAM. To do this, following the idea of the memory hierarchy, we only need to add a layer of memory between the registers and DRAM. This is the idea of the cache:
before accessing DRAM, access the cache first; on a hit, access the data directly; on a miss, read the data from DRAM into the cache and then access it from the cache.



The above is the cache in the narrow sense, namely the processor cache (CPU cache). In fact, a cache in the broad sense is not only a hardware module on the memory access path; in computer systems, caches are ubiquitous: the disk controller also contains a cache for data read from the disk; the row buffer in SDRAM that we introduced earlier is essentially a cache of the SDRAM storage array; the operating system also maintains a software cache for storage devices such as disks, used to store recently accessed data - this cache is essentially a large array of structures allocated in memory, and the operating system is responsible for moving data between disk and memory; caching is also critical for distributed systems: if the data to be accessed is not cached locally, the remote side has to be accessed, which is the case for the browser's web cache and for video content caches.


>
> #### comment:: cache visible to software programs
>
> In some processors, the cache is visible to software.
> For example, shared memory in the CUDA GPU programming model is at the same position in the hierarchy as the CPU cache:
> both are a layer of memory between the registers and DRAM. But unlike CPU caches, GPUs provide memory access instructions specifically for accessing shared memory, so a GPU program can use instructions to read data from memory into a specified location in shared memory.



For the sake of description, we will call the data read from DRAM a data block,
and the data blocks stored in the cache are called cache blocks (some textbooks also call them cache lines).
Naturally, designing a cache requires answering the following questions:

* What size should the data block be?
* How do we check whether a cache access request hits?
* The capacity of the cache is usually smaller than that of DRAM. How do we maintain the mapping between cache blocks and data blocks in DRAM? What do we do when the cache is full?
* The CPU may perform write operations that update the data in a data block. How should the cache handle this?


### Simple instruction cache



Let's start with the instruction cache (icache). The icache is read-only, because the IFU's instruction fetch never writes to memory, so for now we can ignore how to handle CPU writes to data blocks. For the block size, let's first take the length of one instruction, i.e. 4B. This may not be the best design, but anything smaller than 4B is certainly a bad design for an icache, because fetching a new instruction would then require multiple memory accesses. Whether something larger than 4B is better is something we will evaluate later.



To check whether an access request hits in the cache, the cache naturally needs to record some properties of each block in addition to the block data itself. The most direct way is to record a unique number for each block, and we also want this number to be easy to compute.
Since the blocks come from memory, we can number memory in units of the block size: the number of the block containing memory address `addr` is `addr / 4`, and this number is called the tag of the block. So we just compute the tag of the requested address and compare it with the tag of each cache block to know whether the target block is in the cache.



Next, consider how cache blocks are organized.
According to the memory hierarchy, the cache capacity cannot be as large as DRAM, but it is usually larger than a single cache block, so we need to decide which cache block a newly fetched data block should be placed into. Since there are multiple cache blocks, we can number them as well. The simplest organization is to always place a new data block into one fixed cache block; this is called direct-mapped.
For this we need to define the mapping from a memory address `addr` to a cache block number. If the cache can hold `k` cache blocks, a simple mapping is `cache block number = (addr / 4) % k`. That is, the data block containing memory address `addr` is read into the cache block numbered `(addr / 4) % k`.



Obviously, multiple data blocks may map to the same cache block. In that case we need to decide whether to keep the existing cache block or to read the new data block into it. According to the locality principle, the newly accessed block is more likely to be accessed again in the near future, so when a new data block is read in, it should replace the existing cache block. This allows the new block to hit in the cache during the following period of time.



We can treat all cache blocks as an array whose index is the cache block number, so the cache block number is also called the block index. For a direct-mapped cache with a block size of `b` bytes and `k` cache blocks in total, we have `tag = addr / b` and `index = (addr / b) % k`.
To simplify the computation, `b` and `k` are usually taken as powers of 2; assume `b = 2^m` and `k = 2^n`. For a 32-bit `addr`, we then have `tag = addr / 2^m = addr[31:m]` and `index = (addr / 2^m) % 2^n = addr[m+n-1:m]`. As you can see, the index is just the low `n` bits of the tag. In a direct-mapped cache, data blocks with different indices are necessarily mapped to different cache blocks, so two blocks that land in the same cache block must already agree on the index bits. Therefore, only the high bits `addr[31:m+n]` need to be recorded as the tag.
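

To make the bit fields above concrete, here is a small C sketch (purely illustrative, not part of any framework code) that decomposes a 32-bit address into the stored tag and the index derived above, together with the in-block offset (the low `m` bits) introduced next, for the configuration used below: `b = 2^2 = 4` bytes per block and `k = 2^4 = 16` blocks:

```c
#include <stdint.h>
#include <stdio.h>

#define M 2   /* block size  = 2^M = 4 bytes   */
#define N 4   /* block count = 2^N = 16 blocks */

/* Decompose addr = { tag = addr[31:M+N], index = addr[M+N-1:M], offset = addr[M-1:0] } */
static uint32_t addr_offset(uint32_t addr) { return addr & ((1u << M) - 1); }
static uint32_t addr_index (uint32_t addr) { return (addr >> M) & ((1u << N) - 1); }
static uint32_t addr_tag   (uint32_t addr) { return addr >> (M + N); }

int main(void) {
  uint32_t addr = 0x80001234;
  printf("addr = 0x%08x -> tag = 0x%x, index = %u, offset = %u\n",
         addr, addr_tag(addr), addr_index(addr), addr_offset(addr));
  return 0;
}
```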



A memory address can thus be divided into the following three parts: tag, index, offset.
The tag uniquely identifies the data block, the index selects the cache block that the data block is placed into, and the offset is the in-block offset indicating which part of the data within the block is to be accessed.

```
 31       m+n m+n-1     m m-1      0
+------------+-----------+--------+
|    tag     |   index   | offset |
+------------+-----------+--------+
```



Finally, right after reset there is no data in the cache, and all cache blocks are invalid. To indicate whether a cache block is valid, we add a valid bit to each cache block.
The valid bit and the tag together are referred to as the cache block's metadata, i.e. data used to manage data; here, the managed data is the cache block.



In summary, the icache workflow is as follows:

 1. The IFU sends an instruction fetch request to the icache
 2. After the icache receives the fetch address, it indexes a cache block using the index bits, and checks whether the tag of that cache block matches the tag of the requested address and whether the cache block is valid. If both conditions hold, the access hits; go to Step 5
 3. Otherwise, read the requested data block from DRAM through the bus
 4. Fill the data block into the corresponding cache block and update the metadata
 5. Return the fetched instruction to the IFU



Once you have sorted out the workflow, you should know how to implement the icache: a state machine! The workflow even includes a bus access, so the icache implementation can also be seen as an extension of the bus state machine. You are already familiar with implementing the bus, so we leave it to you to work out the icache state machine!


>
> #### todo:: Implements the icache
>
> Implement a simple icache with a block size of 4B and 16 cache blocks.
> In general, the cache's storage array (including data and metadata) is implemented with SRAM. However, using SRAM in an ASIC flow involves selecting and instantiating SRAM macros, and the chosen macro may constrain how data and metadata are stored.
> As this is your first cache exercise, for simplicity, implement the storage array with flip-flops, which gives you more flexibility.
> During implementation, you are advised to make the related parameters configurable, to facilitate the later evaluation of different configurations.
> After implementation, try to evaluate its performance.



> #### todo:: An address space suitable for caching
>
> Not all address spaces are suitable for caching; only memory-type address spaces are.
> Besides, the access latency of the SRAM address space is only 1 cycle, so it does not need to be cached either;
> leaving the cache blocks to other address spaces is a more appropriate choice.



> #### todo:: Estimate the ideal return of dcache
>
> Typically, the LSU is also paired with a cache, called the data cache (dcache).
> Before implementing a dcache, we can first estimate its performance benefit under ideal conditions.
> Assume that the dcache capacity is unlimited, the dcache hit ratio is 100%, and the dcache access latency is 1 cycle.
> Try to estimate the performance benefit of adding such a dcache based on your performance counters.
> If your estimate is correct, you should find that adding a dcache at this point is not worth it; we will come back to this issue below.
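

To recap the workflow in executable form, here is a minimal C behavioural model of the icache described above (an illustrative sketch only: the parameters match the exercise, and `mem_read()` is a made-up stand-in for the bus access, not a real ysyxSoC interface). A model like this can later double as a reference when checking or evaluating your RTL:

```c
#include <stdint.h>
#include <stdbool.h>

#define NR_BLOCK 16                    /* 16 cache blocks, 4B each            */

typedef struct {
  bool     valid;                      /* metadata: valid bit                 */
  uint32_t tag;                        /* metadata: tag = addr[31:6]          */
  uint32_t data;                       /* one 4-byte block = one instruction  */
} icache_block_t;

static icache_block_t icache[NR_BLOCK];

/* Stand-in for the bus access to DRAM (Step 3); replace with a real memory model. */
static uint32_t mem_read(uint32_t addr) { return addr ^ 0xdeadbeef; /* dummy data */ }

/* Steps 1-5: receive a fetch address, check for a hit, refill on a miss, return the instruction. */
uint32_t icache_fetch(uint32_t addr) {
  uint32_t index = (addr >> 2) % NR_BLOCK;
  uint32_t tag   = addr >> 6;          /* 2 offset bits + 4 index bits        */
  icache_block_t *b = &icache[index];

  if (!(b->valid && b->tag == tag)) {  /* miss: Steps 3 and 4                 */
    b->data  = mem_read(addr & ~0x3u);
    b->tag   = tag;
    b->valid = true;
  }
  return b->data;                      /* Step 5: reply to the IFU            */
}
```

In RTL this becomes a state machine wrapped around your existing bus state machine: an idle/compare state for Steps 1-2, the bus access states for Steps 3-4, and a response state for Step 5.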
+ + +## formal verification + + + +With DiffTest, it should be easy to ensure that a given program still runs correctly after you plug in icache. But how do you ensure that icache works correctly for any program? + + + +This seems like a very difficult question, and I'm sure you've encountered this situation: +The code runs a given test case correctly, but any day it runs a different test, something goes wrong. Whether it is analyzed in principle or summarized in practice, it is impossible to prove the correctness of a module by testing alone. Unless these test cases cover all programs, or all input cases of the module under test. The number of programs is infinite, and it's not practical to test them all. However, the input of the module under test is limited, and it is at least theoretically possible to iterate over all the inputs. + + + +If you want to iterate over all the inputs in a module, on the one hand, you want to generate a test set that covers all the input cases, +On the other hand, there needs to be a way to determine whether a specific input is correct. Even if this can be done, it takes a long time to run all the tests, which is often unbearable. The equivalence class testing method in software testing theory can classify the tests whose essential behavior is similar. Select one test from the equivalence class to represent the tests of the entire equivalence class, thereby reducing the size of the test set. However, how equivalence classes should be divided is decided manually according to the logic of the module under test. But according to another popular piece of software advice, any process that requires human intervention is at risk of going wrong. + + +### The basic principles of formal verification + + + +So, can we have tools that automatically find test cases for us? There is such a tool! +A Solver is a class of mathematical tools that find feasible solutions under given constraints. It is essentially similar to solving mathematical problems such as equations or linear programming. For example, [Z3][z3] is a solver for the [Satisfiablity Modulo Theories, SMT][smt] problem. It can solve whether a proposition containing real numbers, integers, bits, characters, arrays, strings and so on is true. +In fact, as long as the problem can be expressed as a subset of the first-order logic language, it can be handed over to the SMT solver to solve. Therefore, SMT solvers can also be used to solve complex problems like Sudoku. SMT solvers are widely used in automatic theorem proving, program analysis, program verification and software testing. +Here is an example of using Z3 to solve a system of equations in python. + +```python +#!/usr/bin/python +from z3 import * + +x = Real('x') # define variable +y = Real('y') +z = Real('z') +s = Solver() +s.add(3*x + 2*y - z == 1) # define constraints +s.add(2*x - 2*y - 4*z == -2) +s.add(-x + 0.5*y - z == 0) +print(s.check()) # Find out if there is a feasible solution: sat +print(s.model()) # Output feasible solution: [y = 14/25, x = 1/25, z = 6/25] +``` + +[z3]: https://github.com/Z3Prover/z3 +[smt]: https://en.wikipedia.org/wiki/Satisfiability_modulo_theories + + + +In the field of test verification, there is a class of solver based verification methods called formal verification. Its core idea is to take the design as the constraint condition, the input as the variable, and "at least one verification condition is not valid" as the solution goal. 
Express this in a first-order logic language, translate it into a language that the solver recognizes, and then try to get the solver to find out if there is a viable solution. For example, if a design has two validation conditions: assert(cond1) and assert(cond2), attempts to have the solver look for the presence of inputs such that '! cond1 || ! cond2 'is established. If a feasible solution exists, it means that the solver has found a test case that violates the verification condition, and this counterexample can help us debug and improve the design. If the feasible solution does not exist, it means that all the inputs do not violate the verification condition, thus proving the correctness of the design! It can be seen that whether the solver can find a viable solution is excellent news for the designer. + + +> +> #### caution:: Don't trust 100% coverage reports for UVM testing +> +> If you know UVM, you should know that the goal of UVM is to improve coverage. But if you think that improving coverage is the ultimate goal of test validation, you probably don't know much about test validation. +> +> In fact, the ultimate goal of test validation is to prove the correctness of the design, or to find all the bugs in the design. But experienced engineers know that even if 100% coverage is achieved, there may still be some bugs that are not detected. +> And it's impossible to estimate how many of these undetected bugs are out there. +> The goal of coverage is widely adopted by enterprises, in part because coverage is an easily quantifiable and statistical indicator. +> If you use strict language to describe "an event is covered ", it is +> ``` +> There is a test case that ran successfully, and the event was triggered during the run. +> ``` +> The event here can be execution of a line of code (line coverage), a signal flipping (flip coverage), +> The state of a state machine is transferred (state machine coverage), user-defined conditions are met (function coverage), and so on. +> and "coverage reaches 100%", yes +> ``` +> For each event, there is a test case that runs successfully and triggers the event during its run. +> ``` +> Note that we can override different events with different test cases. +> According to this definition, you only need to add some flags during the simulation to calculate the coverage. Even most RTL emulators (including Verilators) provide automatic coverage statistics, +> If you want to learn how to count coverage, just need RTFM. +> +> On the other hand, it can also be seen from the above definition that improving coverage is actually the lower limit of verification work, +> Too low coverage only indicates that the validation has not been done enough, which is consistent with "untested code is always wrong". +> But the ultimate goal of test validation is +> ``` +> Ran successfully for all test cases. +> ``` +> In contrast, "100% coverage" is actually a necessary but not sufficient condition for "correct design", which is very loose. +> Even we can easily cite a counter-example: +> A module has two functions, for each function, has been covered by its own test cases, then the function coverage reaches 100%; +But when you run test cases that require the two functions to interact, something goes wrong. +> +> Achieving higher coverage certainly improves the probability of a correct design compared to low-coverage verification efforts. But what we want to say is that even if coverage reaches 100%, it's still far from enough. 
Especially for complex systems, +> Some hidden bugs are usually triggered when multiple boundary conditions are met at the same time. +> Instead of insisting on 100% coverage as the verification goal, we encourage people to actively think about how to find more potential bugs through other methods and techniques. +> The practical significance of the verification work is also greater. + + +### Simple example of formal validation + + +#### Formal validation process based on Chisel + + + +Chisel's testing framework [chiseltest][chiseltest] has integrated formal validation capabilities. You can translate FIRRTL code into a certain language that is recognized by Z3 and have Z3 prove that a given 'assert()' is correct. If the counterexample can be found, it is very convenient to generate the waveform of the counterexample to assist debugging. With formal validation tools, we no longer have to worry about incomplete test case coverage, or even the need to write test cases. Nice! + +[chiseltest]: https://github.com/ucb-bar/chiseltest + + +Here is an example of formal validation of the Chisel module: + +```scala +import chisel3._ +import chisel3.util._ +import chiseltest._ +import chiseltest.formal._ +import org.scalatest.flatspec.AnyFlatSpec + +class Sub extends Module { + val io = IO(new Bundle { + val a = Input(UInt(4.W)) + val b = Input(UInt(4.W)) + val c = Output(UInt(4.W)) + }) + io.c := io.a + ~io.b + Mux(io.a === 2.U, 0.U, 1.U) + + val ref = io.a - io.b + assert(io.c === ref) +} + +class FormalTest extends AnyFlatSpec with ChiselScalatestTester with Formal { + "Test" should "pass" in { + verify(new Sub, Seq(BoundedCheck(1))) + } +} +``` + + + +> #### danger:: No longer use Utest +> +> As the Chisel version has evolved, Utest is no longer supported, so we also recommend that it not be used. +> If you get the code for 'chisel-playground' before 2024/04/11 01:00:00, +> Please modify your build.sc by referring to 'object test' in [new build.sc] [new build.sc]. + +[new build.sc]: https://github.com/OSCPU/chisel-playground/blob/master/build.sc + + + +The 'Sub' module of the above example implements the function of complement subtraction by "inverting and adding 1". In order to verify the correctness of the implementation of the 'Sub' module, the code compares the result of the "inverse plus 1" calculation with the result obtained by the subtraction operator. We expect that 'assert()' should hold for any input. To demonstrate the effect of formal validation, we injected a bug into the implementation of the 'Sub' module: +The "add 1" operation is not performed when 'io.a' is' 2 ', in which case the result of complement subtraction is incorrect. + + + +The code also needs to pass in a 'BoundedCheck(1)' parameter when calling the formal validation function provided by chiseltest. This parameter specifies the number of cycles that the SMT solver needs to prove. For example, 'BoundedCheck(4)' means asking the SMT solver to try to prove that the module under test has been reset for 4 cycles. 'assert() 'is not violated under any input signal. For combinational logic circuits, we only need to let the SMT solver solve in 1 cycle. 
+ + +Before running the above tests, you also need to install 'z3' : + +```bash +apt install z3 +``` + +After installation, run the test through mill-i __.test with the following output: + +``` +Assertion failed + at SubTest.scala:16 assert(io.c === ref) +- should pass *** FAILED *** + chiseltest.formal.FailedBoundedCheckException: [Sub] found an assertion violation 0 steps after reset! + at chiseltest.formal.FailedBoundedCheckException$.apply(Formal.scala:26) + at chiseltest.formal.backends.Maltese$.bmc(Maltese.scala:92) + at chiseltest.formal.Formal$.executeOp(Formal.scala:81) + at chiseltest.formal.Formal$.$anonfun$verify$2(Formal.scala:61) + at chiseltest.formal.Formal$.$anonfun$verify$2$adapted(Formal.scala:61) + at scala.collection.immutable.List.foreach(List.scala:333) + at chiseltest.formal.Formal$.verify(Formal.scala:61) + at chiseltest.formal.Formal.verify(Formal.scala:34) + at chiseltest.formal.Formal.verify$(Formal.scala:32) + at FormalTest.verify(SubTest.scala:19) + ... +``` + + +The above information indicates that the solver found a test case that violated 'assert()' on cycle 0 after the reset. Further, developers can aid debugging with the waveform file test_and_run/Test_should_pass/Sub.vcd. After fixing the error in the 'Sub' module, re-running the above test will no longer output the error message. The representation solver cannot find a counterexample, which proves the correctness of the code. + + +#### Formal verification process based on Verilog + + + +chiseltest's formal validation process is to convert FIRRTL code into a language that is recognized by Z3. Verilog is not involved, so the above process does not support projects developed on Verilog. If you're developing with Verilog, you can use a formal validation process based on Yosys. [SymbiYosys][symbiyosys] is the front-end tool for this process. + +[symbiyosys]: https://symbiyosys.readthedocs.io/en/latest/ + + +Here is an example of formal validation of the Verilog module: + +```verilog +// Sub.sv +`define FORMAL + +module Sub( + input [3:0] a, + input [3:0] b, + output [3:0] c +); + + assign c = a + ~b + (a == 4'd2 ? 1'b0 : 1'b1); + +`ifdef FORMAL + always @(*) begin + c_assert: assert(c == a - b); + end +`endif // FORMAL + +endmodule +``` + + + +The 'Sub' module of the above example implements the function of complement subtraction by "inverting and adding 1". In order to verify the correctness of the implementation of the 'Sub' module, the code compares the result of the "inverse plus 1" calculation with the result obtained by the subtraction operator. We expect that 'assert()' should hold for any input. To demonstrate the effect of formal validation, we injected a bug into the implementation of the 'Sub' module: +The "add 1" operation is not performed when 'a' is' 2 ', then the result of complement subtraction is wrong. + + + +After writing the above 'Sub.sv' file, you also need to write the SymbiYosys configuration file '*.sby', which is generally composed of the following parts: + +* task: Optional to specify the task to be executed +* options: A must for matching 'assert', 'cover', etc. 
statements in the code to the model +* engines: Required to specify the model to be solved +* script: A required item that contains the Yosys script required for testing +* files: Required to specify the files for testing + +Here is an example of the configuration file 'Sub.sby' : + +```sby +[tasks] +basic bmc +basic: default + +[options] +bmc: +mode bmc +depth 1 + +[engines] +smtbmc + +[script] +read -formal Sub.sv +prep -top Sub + +[files] +Sub.sv +``` + + + +The above configuration has a 'depth' option to specify the number of cycles the SMT solver needs to prove. For example, 'depth 4' means having the SMT solver try to prove that the module under test has been reset for four cycles. 'assert() 'is not violated under any input signal. For combinatorial logic circuits, we only need to let the SMT solver solve in 1 cycle. + + +Before formal verification, you will need to download the appropriate tool from [this link][oss release]. After decompression, run the command line 'path-to-oss-cad-suite/bin/sby -f Sub.sby' for formal verification, and the output information is as follows: + +[oss release]: https://github.com/YosysHQ/oss-cad-suite-build/releases + +``` +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 Checking assumptions in step 0.. +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 Checking assertions in step 0.. +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 BMC failed! +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 Assert failed in Sub: c_assert +SBY 16:52:19 [Sub_basic] engine_0: Status returned by engine: FAIL +SBY 16:52:19 [Sub_basic] summary: Elapsed clock time [H:MM:SS (secs)]: 0:00:00 (0) +SBY 16:52:19 [Sub_basic] summary: Elapsed process time [H:MM:SS (secs)]: 0:00:00 (0) +SBY 16:52:19 [Sub_basic] summary: engine_0 (smtbmc) returned FAIL +SBY 16:52:19 [Sub_basic] summary: counterexample trace: Sub_basic/engine_0/trace.vcd +SBY 16:52:19 [Sub_basic] summary: failed assertion Sub.c_assert at Sub.sv:11.9-11.37 in step 0 +SBY 16:52:19 [Sub_basic] DONE (FAIL, rc=2) +SBY 16:52:19 The following tasks failed: ['basic'] +``` + + + +The above information indicates that the solver found a test case that violated 'assert()' on cycle 0 after the reset. Further, developers can use the waveform file 'Sub_basic/engine_0/trace.vcd' to aid debugging. After correcting the error in the 'Sub' module, re-running the above command will output a success message. The representation solver cannot find a counterexample, which proves the correctness of the code. + + +### Test icache with formal validation + + + +Our goal was to prove icache's correctness through formal verification, +Therefore, first we need to design the corresponding REF and determine the correctness condition. Cache is a technology that improves memory access efficiency. Therefore, it should not affect the accuracy of memory access results. That is, the behavior of the access request should be the same regardless of whether the cache is present. +Therefore, we can use one of the simplest memory access systems as REF. It receives memory access requests from the CPU and then accesses memory directly. In contrast, the DUT lets the access request pass through the cache. For correctness conditions, we simply check that the results returned by the read request are consistent. + + +Based on the above analysis, it is easy to write pseudo-code that validates the top-level module. Here we use Chisel as pseudo-code, but if you are developing with Verilog, you can still borrow some ideas to write validation top-level modules. 


```scala
class CacheTest extends Module {
  val io = IO(new Bundle {
    val req = new ...
    val block = Input(Bool())
  })

  val memSize = 128 // bytes
  val mem = Mem(memSize / 4, UInt(32.W))
  val dut = Module(new Cache)

  dut.io.req <> io.req

  val dutData = dut.io.rdata
  val refData = mem(io.req.addr)
  when (dut.io.resp.valid) {
    assert(dutData === refData)
  }
}
```



The pseudo-code above only gives the general framework; you need to fill in some details according to your own implementation:

* Mask out write operations, e.g. by tying the write-enable-related signals to `0`
* On a cache miss, read the data from `mem`; since the test never generates write operations, the DUT and the REF can share the same memory
* The REF reads data from `mem` without any delay, while the DUT needs several cycles to return data from the cache, so the timing of the `assert()` needs to be synchronized:
after the REF has read its data, wait until the DUT returns its read result before checking, which is easy to do with a state machine
* Since the formal verification tool explores all input combinations in every cycle, the input signals may change every cycle, so you may need registers to temporarily hold some results
* Using the fact that the formal verification tool explores all input combinations in every cycle,
we can define some `block` signals at the top of the test to check whether the AXI-related code works under random delays,
for example `dut.io.axi.ar.ready := arready_ok & ~block1` and `dut.io.axi.r.valid := rvalid_ok & ~block2`


>
> #### option:: Tests the icache implementation through formal validation
>
> Although this is not required, we strongly recommend that you try this modern verification method
> and experience the exhilaration of "using the right tool to solve the problem".
> Regarding the number of cycles that the SMT solver needs to prove (`BoundedCheck()` or `depth`), pick a parameter
> that lets the cache process 3~4 requests within the proven window,
> so that you can test whether the cache handles arbitrary consecutive requests correctly.



Formal verification seems to have only advantages, but it actually has a fatal weakness: state space explosion. As the design grows and the number of proven cycles increases, the space that the solver needs to explore becomes larger and larger. In fact, first-order logic is theoretically undecidable, and even its decidable subsets are usually NP-hard in terms of algorithmic complexity. This means that the solver's run time is likely to grow exponentially with the size of the design. For this reason, formal verification is commonly used for unit-level testing.


>
> #### comment:: Using the latest version of Z3
>
> Z3 is an open source project whose versions keep iterating.
> For mature software like a solver, the direction of iteration is naturally performance optimization;
> that is, a new version of Z3 may bring a significant performance improvement over an old one.
> If you are using a Linux distribution from a few years ago, the Z3 installed through `apt` may be quite old.
> If you want to improve the efficiency of Z3, you can try building and installing it from the [code repository][z3]; please refer to RTFM.



## Cache optimization



Since caching is mainly used to improve memory access efficiency, it is natural to evaluate a cache with memory-access-related metrics. The AMAT (Average Memory Access Time) is usually used to evaluate cache performance. Assuming the cache hit ratio is `p`:

```
AMAT = p * access_time + (1 - p) * (access_time + miss_penalty)
     = access_time + (1 - p) * miss_penalty
```


Here `access_time` is the access time of the cache, i.e. the time from receiving the access request to obtaining the hit/miss result, and `miss_penalty` is the cost of a cache miss, in our case the time needed to access DRAM.



This equation provides a guideline for optimizing cache performance:
reduce the access time `access_time`, increase the hit ratio `p`, or reduce the miss penalty `miss_penalty`. In the current NPC, the access time leaves little room for architectural optimization and is mostly determined by the concrete implementation, such as the number of cycles and the critical path. Therefore, we will focus on optimizing the hit ratio and the miss penalty.

> #### todo:: Measure AMAT
>
> Add appropriate performance counters to the NPC to measure the AMAT of the icache.


### Optimize hit ratio



To improve the hit ratio, i.e. to reduce the miss ratio, we first need to understand why cache misses occur.


#### The 3C model of cache misses



Computer scientist [Mark Hill][mark hill] proposed the 3C model in his [PhD thesis][mark hill phd thesis] in 1987,
which classifies cache misses into three types:

 1. Compulsory miss - a miss that would still occur even in a cache of unlimited capacity. Concretely, it is the miss that occurs when a data block is accessed for the first time.
 2. Capacity miss - a miss that cannot be eliminated without enlarging the cache capacity. It occurs because the cache cannot hold all the data the program needs to access.
 3. Conflict miss - a miss that is not caused by the above two reasons. It occurs because multiple data blocks map to the same cache block and replace each other.

[mark hill]: https://pages.cs.wisc.edu/~markhill/
[mark hill phd thesis]: https://www2.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-381.pdf



With the 3C model, we can propose targeted solutions for each type of miss to reduce the corresponding miss ratio.


#### Reduce Compulsory miss



To reduce compulsory misses, in principle a data block has to be read into the cache before it is accessed. The cache workflow above does not support this, so a new mechanism, called prefetching, is needed. However, since compulsory misses by definition only occur on the first access to a data block, their proportion is low when there are many accesses. Therefore, we will not discuss how to reduce compulsory misses here; interested students can search for and read materials about prefetching.


#### Reduce Capacity miss



To reduce capacity misses, by definition, the only way is to enlarge the cache capacity so as to make better use of temporal locality. However, a larger cache is not always better. On the one hand, a larger cache means a larger die area, which increases the tape-out cost.
On the other hand, the larger the storage array, the higher its access latency, which increases the cache access time and degrades cache performance. Therefore, in real projects, blindly increasing the cache capacity is not a reasonable solution; it has to be balanced against various other factors.


#### Reduce Conflict miss



To reduce conflict misses, we need to consider how to reduce the mutual replacement of cache blocks. The direct-mapped organization described above allows each data block to be read into only one cache block with a fixed index; if multiple data blocks share the same index, a later one will replace an earlier one. So one idea for reducing replacements is to adopt an organization that allows a data block to be read into more than one possible cache block.



At one extreme, each data block may be stored in any cache block; this is called fully associative. As for which cache block to use: an invalid cache block is used first, and if all cache blocks are valid, the choice is made by the replacement algorithm. Different replacement algorithms affect the cache hit ratio.
In general, the replacement algorithm should select the cache block that is least likely to be accessed in the future. For a given memory access sequence, we could design an optimal replacement algorithm that minimizes conflict misses, but in practice we cannot know the future access sequence in advance, so designing a replacement algorithm becomes a problem of predicting the future from the past: we need to predict, based on how each cache block was accessed in the past, which one is least likely to be accessed in the future. Common replacement algorithms include:

* FIFO - first in, first out: replace the cache block that was read in earliest
* LRU - least recently used: replace the cache block that has not been accessed for the longest time
* random - replace a randomly chosen cache block



With a suitable replacement algorithm, the fully associative organization is more likely to replace a cache block that will not be accessed for a long time, and thus reduces conflict misses the most. However, because a fully associative organization may store a data block in any cache block, it comes with two costs.
First, the access address no longer has an index part, so everything except the offset is the tag. We therefore need to spend more storage in the array to hold the tag of each cache block.

```
 31                m m-1      0
+-------------------+--------+
|        tag        | offset |
+-------------------+--------+
```



Second, when checking for a hit, the tags of all cache blocks must be compared, which requires many comparators and increases the area overhead. Due to these costs, the fully associative organization is generally only used when the number of cache blocks is small.



Set-associative organization is a compromise between direct mapping and full associativity. The idea is to divide all cache blocks into groups (sets): a group is selected by direct mapping, and within the group a cache block is selected fully associatively. That is, each data block can be stored in any cache block of the group numbered `tag % number_of_groups`. If each group contains `w` cache blocks, the cache is called `w`-way set-associative.
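

To make the set-associative lookup and replacement concrete, here is a small C sketch of a `w`-way lookup with LRU replacement. The parameters are made up for illustration, it is not a required design, and a timestamp-based LRU is just one simple way to realize "not accessed for the longest time":

```c
#include <stdint.h>
#include <stdbool.h>

#define NR_WAY       4          /* w = 4 ways per group                      */
#define NR_GROUP     8          /* 8 groups (sets)                           */
#define BLOCK_SHIFT  4          /* 16-byte blocks                            */
#define GROUP_SHIFT  3          /* log2(NR_GROUP)                            */

typedef struct {
  bool     valid;
  uint32_t tag;                 /* high address bits; group index bits are not stored */
  uint64_t last_use;            /* timestamp used by LRU                     */
} way_t;

static way_t cache[NR_GROUP][NR_WAY];
static uint64_t now = 0;

/* Returns true on a hit; on a miss, chooses a victim way by LRU and refills its metadata. */
bool cache_access(uint32_t addr) {
  uint32_t group = (addr >> BLOCK_SHIFT) % NR_GROUP;
  uint32_t tag   =  addr >> (BLOCK_SHIFT + GROUP_SHIFT);
  way_t *g = cache[group];
  now++;

  for (int i = 0; i < NR_WAY; i++) {          /* compare the tags of all ways in the group */
    if (g[i].valid && g[i].tag == tag) { g[i].last_use = now; return true; }
  }

  int victim = 0;                             /* miss: prefer an invalid way, else the LRU way */
  for (int i = 0; i < NR_WAY; i++) {
    if (!g[i].valid) { victim = i; break; }
    if (g[i].last_use < g[victim].last_use) victim = i;
  }
  g[victim].valid = true;
  g[victim].tag = tag;
  g[victim].last_use = now;
  return false;
}
```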



In a set-associative organization, a memory address is again divided into three parts: tag, index, offset. The index is now the group index, so its bit width is `n = log2(number of groups)`; with `k` cache blocks organized into `w` ways, the number of groups is `k / w`.

```
 31       m+n m+n-1     m m-1      0
+------------+-----------+--------+
|    tag     |   index   | offset |
+------------+-----------+--------+
```



When checking for a hit, only the tags of the cache blocks within the selected group need to be compared. As long as `w` is not large, the area cost of the comparators is acceptable.



In fact, both full associativity and direct mapping can be regarded as special cases of set associativity: `w = 1` is direct mapping, and `w = total number of cache blocks` is full associativity. Modern CPUs usually use 8-way or 16-way set-associative caches.


#### Block size selection



The block size is a special parameter. A larger cache block, on the one hand, reduces the tag storage overhead; on the other hand, it stores more adjacent data, which captures the spatial locality of the program better and reduces conflict misses. However, reading in more adjacent data increases the miss penalty; at the same time, for a given cache capacity, a larger block size also means fewer cache blocks. If the program's spatial locality is not strong, more and smaller cache blocks would be preferable, and in that case a larger block size increases conflict misses.

```txt
// A program with poor spatial locality
// Two cache blocks of size 4
               1111       2222       cache
|--------------oooo-------oooo-----| memory, 'o' is the hot data that the program accesses

// One cache block of size 8
               11111111              cache
|--------------oooo-------oooo-----| memory



// A program with good spatial locality
// Two cache blocks of size 4
               11112222              cache
|--------------oooooooo------------| memory

// One cache block of size 8
               11111111              cache
|--------------oooooooo------------| memory
```


### Design space exploration



Many cache-related parameters have been mentioned above. Selecting an appropriate set of parameters that achieves better performance under given resource constraints belongs to the Design Space Exploration (DSE) of the cache. Currently, the performance we care about includes IPC, clock frequency, and area. The frequency and area can be evaluated quickly with the `yosys-sta` project, while the IPC is usually obtained by running complete programs in a ysyxSoC environment with calibrated memory access delays.



But there are so many parameter combinations that, if every one of them required hours of simulation to obtain its IPC, design space exploration would be very inefficient. As mentioned before, data accuracy and simulation efficiency are a trade-off, so one way to improve the efficiency of design space exploration is to sacrifice the statistical precision of the IPC and compute, at much lower cost, an indicator that reflects the trend of the IPC.



When we adjust the various cache parameters, what is directly affected is the AMAT,
so we can assume that the execution overhead of the rest of the CPU stays the same. By the definition of AMAT, adjusting these parameters does not affect the cache access time, so it can be treated as a constant.
What we really care about is therefore the total time the program spends waiting for cache misses; we call it the Total Miss Time (TMT).
In fact, TMT can represent the changing trend of IPC: the smaller the TMT, the smaller the number of cycles required for each instruction to access memory, and thus the larger the IPC. + + + +You should have counted TMT before when you counted AMAT through the performance counter, but you need to run the finished program in ysyxSoC. In order to calculate TMT at low cost, we start from the Angle of 'missing times * missing costs'. Consider how to count the number of misses and the cost of misses cheaply. + + + +For the statistics of missing times, we have the following observations: + +* The number of cache accesses for a given program is fixed. As long as you get the itrace of the program running and input it into the icache, you can simulate the process of icache working. This allows you to count the number of icache misses, so you don't have to emulate the entire ysyxSoC, not even NPC. +* When an NPC executes a program, it needs to know the next instruction to be executed by the execution result of the current instruction. But itrace already contains the complete instruction stream, so when counting TMT, we only need the PC value of the instruction stream, not the instruction itself. +* The data part of the icache is used as a command to reply to the IFU of the NPC. +However, since NPC are not required for TMT statistics, the data part of icache is not required, and only the metadata part is required. In fact, for a given access address sequence, the number of cache misses is independent of the access content. The correct number of misses can be counted by maintaining metadata. + + + +Therefore, in order to count the number of missing icache, we do not need to run the program completely every time. What we really need is a simple cache function emulator, which we'll call cachesim. Cachesim receives a sequence of PC instruction streams (a simplified version of itrace) and maintains metadata to count the number of missing PC sequences. As for the PC sequence of instruction flow, we can quickly generate it through NEMU. + + + +As for missing costs, cachesim does not contain the memory access details in ysyxSoC, so in principle we cannot get this value accurately. But as discussed above, only the block size parameter affects the missing cost. So we can calculate an average missing cost in ysyxSoC and use it as a constant to estimate TMT. + + + +> #### todo:: Implements the cachesim +> +> Implement a simple cache emulator as described above. +> +> With cachesim, we can perform a DiffTest performance test on the icache. +> Specifically, we can use cachesim as a REF for performance tests. +> Performance counter results from running a program in an NPC, +> It should be exactly the same as the number of misses obtained by cachesim from the corresponding PC sequence statistics. +> If not, there may be a performance bug in the RTL implementation. +> This performance bug cannot be found through DiffTest or formal validation of functional testing against NEMU. For example, even if the icache is missing all the time, the program will still run correctly on the NPC. +> Of course, it is also possible that the cachesim as REF has a bug. But in any case, can have REF as a comparison, it will not lose. +> +> But in order to get consistent itrace, you may need to make some changes to NEMU, +> Enable it to run 'riscv32e-ysyxsoc' image file. 
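

To make the idea concrete, here is a deliberately simplified sketch of what such a cachesim might look like. Everything in it is an assumption to be replaced by your own choices: it reads a plain-text trace with one hexadecimal PC per line, models a direct-mapped icache by metadata only, and uses a made-up constant `MISS_PENALTY` in place of the average miss cost you would measure in ysyxSoC:

```c
/* cachesim.c - count icache misses for a PC trace and estimate TMT */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NR_BLOCK     16      /* assumed: 16 blocks                             */
#define BLOCK_SHIFT  2       /* assumed: 4-byte blocks                         */
#define MISS_PENALTY 80      /* assumed average miss cost in cycles,           */
                             /* to be replaced by a value measured in ysyxSoC  */

typedef struct { bool valid; uint32_t tag; } meta_t;  /* metadata only, no data array */
static meta_t meta[NR_BLOCK];

static uint64_t nr_access = 0, nr_miss = 0;

static void access_once(uint32_t pc) {
  uint32_t index = (pc >> BLOCK_SHIFT) % NR_BLOCK;
  uint32_t tag   =  pc >> BLOCK_SHIFT;
  nr_access++;
  if (!(meta[index].valid && meta[index].tag == tag)) {
    nr_miss++;
    meta[index].valid = true;
    meta[index].tag = tag;
  }
}

int main(int argc, char *argv[]) {
  if (argc < 2) { fprintf(stderr, "usage: %s itrace.txt\n", argv[0]); return 1; }
  FILE *fp = fopen(argv[1], "r");
  if (fp == NULL) { perror("fopen"); return 1; }

  unsigned int pc;
  while (fscanf(fp, "%x", &pc) == 1) access_once(pc);
  fclose(fp);

  printf("accesses = %lu, misses = %lu, miss ratio = %.2f%%\n",
         (unsigned long)nr_access, (unsigned long)nr_miss,
         nr_access ? 100.0 * nr_miss / nr_access : 0);
  printf("estimated TMT = %lu cycles\n", (unsigned long)(nr_miss * MISS_PENALTY));
  return 0;
}
```

Turning the cache parameters into command-line arguments then makes it easy to launch many instances in parallel, as suggested below.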
+ + + +> #### option:: compresses the trace +> +> If you get a very large itrace, consider compressing it in the following ways: +> +> * Store itrace in binary format instead of text +> * Most of the time instructions are executed sequentially, for a continuous sequence of PCS, +> We can only record the first PC and the number of instructions executed consecutively +> * Further compress the generated itrace file with the 'bzip2' related tool, +> Then cachesim code by 'popen("bzcat compressed file path ", "r")' to get a readable file pointer. +> For use of 'popen()', please refer to RTFM + + + +> #### todo:: Design space exploration with cachesim +> +> With cachesim, we can quickly evaluate the expected benefits of different cache parameter combinations. +> Cachesim can be evaluated thousands or even tens of thousands of times faster than ysyxSoC for a single parameter combination. +> +> In addition, we can evaluate multiple parameter combinations simultaneously with multi-core: +> Specifically, first let cachesim pass in various cache arguments from the command line. +> Then use the script to start multiple cachesim and pass different parameter combinations. +> In this way, we can obtain evaluation results of dozens of parameter combinations in a matter of minutes. +> To help us quickly choose the right combination of parameters. +> +> Try to get through this quick evaluation process and evaluate several parameter combinations. +> However, we have not optimized the missing costs to evaluate a more reasonable TMT; +> In addition, we have not given a limit on the area size. All of these factors affect the final decision, so you don't have to make a final design choice for now. + + +### Optimization missing cost + + + +If the cache is missing, the data will be accessed to the next storage layer, so the cost of the missing is the access time of the next storage layer. There are many techniques to reduce the missing cost, and we will discuss one of them first: burst access over bus transmission. If the size of the cache block is the same as the data bit width of the bus, then only one bus transfer transaction is needed to complete the access to the data block, and the optimization space is not large. However, for larger cache blocks, in principle, multiple bus transfers are required to complete access to the data block, where there is an opportunity for optimization. + + +#### block size and missing cost + + + +With current cache designs, the next layer of storage is SDRAM. For further analysis, we first establish the following simple model for SDRAM access time. The SDRAM access time is divided into four segments, so the overhead for an independent bus-transmitted transaction is' a+b+c+d '. Assume that the data bit width of the bus is 4 bytes and the block size of the cache is 16 bytes. If four separate buses are used to transmit the transaction, the cost is' 4(a+b+c+d) '. + +```txt ++------------------------ arvalid set valid +| +-------------------- The AR channel shakes hands and receives read requests +| | +------------ The state machine transitions to the READ state and sends the READ command to the SDRAM particle +| | | +------ SDRAM particles return read data +| | | | +-- R-channel handshake, return read data +V a V b V c V d V +|---|-------|-----|---| +``` + + + +> #### todo:: Supports larger block sizes +> +> Change the block size in cachesim to four times the bus data bit width, and estimate the missing cost by using a separate bus transfer transaction. 
>
> After implementation, compare the results with your previous evaluation and try to explain the difference.



> #### todo:: Supports larger block sizes (2)
>
> Modify the implementation of the icache to support larger block sizes.
> When implementing it, you are advised to make the block size a configurable parameter, to facilitate the later evaluation.


#### AXI4 burst transmission



Similar to the SDRAM chip, the AXI bus also supports "burst transfers": one bus transaction contains several consecutive data transfers, and each data transfer is called a "beat". In the AXI bus protocol, the `arburst` signal of the AR channel indicates whether a transaction uses burst transfer, and if so, the `arlen` signal indicates the number of beats to be transferred.



Of course, support in the bus protocol alone is not enough: the AXI master has to actually issue burst transactions, and the slave has to be able to process them. The icache is the master here, and we leave issuing burst transactions to you as part of the exercise. On the other hand, the SDRAM controller provided by ysyxSoC does support burst transactions. The four bus transfers above can then be folded into a single bus transaction, which effectively reduces the overhead of reading out a complete data block:
first, with a burst, the `AR` channel only needs one handshake, saving `3a` compared with the scheme above;
second, the SDRAM controller can split one burst transaction into multiple READ commands sent to the SDRAM chip; when the SDRAM chip returns the read data and there is still a read command with a sequential address, the controller's state machine goes directly back to the READ state and keeps issuing READ commands, saving `3b` compared with the scheme above;
finally, the controller's reply on the `R` channel can overlap with the transmission of the next READ command, so the `R` channel handshake overhead is hidden, saving `3d` compared with the scheme above.

```txt
  a     b      c    d
|---|-------|-----|---|                   <-- The first beat
                  |-----|---|             <-- The second beat
                        |-----|---|       <-- The third beat
                              |-----|---| <-- The fourth beat
```



In summary, the cost of a burst access is `a+b+4c+d`, which saves `3(a+b+d)` compared with the scheme above. Generalizing, if the cache block size is `n` times the bus data width, burst transfer saves `(n-1)(a+b+d)` of overhead. Although `a`, `b`, `d` do not look large, remember that these are overheads seen from the SDRAM controller's perspective; after the memory access delay is calibrated, they can amount to tens or even hundreds of CPU cycles saved.


#### Implementation of burst transmission



As you can see, to reap the benefits of burst transfers, you first need to make the cache use larger blocks. But as discussed above, larger cache blocks may also increase conflict misses, and the net benefit may even be negative, depending on the spatial locality of the program. Which choice is better has to be evaluated with a benchmark.
To evaluate the TMT in cachesim, we also need to obtain the miss cost when burst transfer is used.
#### Implementing burst transfers


As you can see, to reap the benefit of burst transfers you first need to make the cache use larger blocks. But as discussed above, larger cache blocks can also increase conflict misses, which may even turn the benefit negative; it depends on the spatial locality of the program, and which effect wins is something the benchmark has to tell us.
To evaluate the TMT in cachesim, we also need the miss cost when burst transfers are used. To obtain it, we first have to make AXI burst transfers work in the ysyxSoC environment. This change involves quite a few details, so we will go through it in several steps.


To make testing easier, we previously connected the SDRAM controller through its APB interface. But the APB protocol does not support burst transfers, so to use them we first have to switch to the AXI-interface version of the SDRAM controller.


In addition, the AXI interface of this SDRAM controller has a 32-bit data width, while ysyxSoC uses a 64-bit AXI interconnect, so a data width conversion is needed.
ysyxSoC already contains the framework of an AXI data width conversion module, which is inserted upstream of the AXI SDRAM controller; however, the framework does not provide the actual width-conversion logic. You need to implement this AXI width conversion module in ysyxSoC so that AXI transactions can transfer data correctly.


Since the word length of the NPC is 32 bits, even though ysyxSoC uses a 64-bit AXI interface, in principle no transfer will ever carry a full 64 bits of data.
We can therefore simplify this width conversion module: on the one hand, only the 64-to-32-bit direction needs to be implemented, without considering conversions between other data widths; on the other hand, we can assume that no full 64-bit data transfer occurs, so there is no need to split one 64-bit transfer into two 32-bit transfers.


> #### todo:: Integrate the AXI-interface SDRAM controller
>
> You need to complete the following:
>
> 1. Implement the AXI data width conversion module. Specifically,
>    if you choose Verilog, implement the corresponding code in `ysyxSoC/perip/amba/axi_data_width_converter_64to32.v`;
>    if you choose Chisel, implement the corresponding code in the `AXI4DataWidthConverter64to32Chisel` module
>    in `ysyxSoC/src/amba/AXI4DataWidthConverter.scala`,
>    and change `Module(new AXI4DataWidthConverter64to32)` in `ysyxSoC/src/device/SDRAM.scala`
>    to instantiate the `AXI4DataWidthConverter64to32Chisel` module instead.
> 2. Change the variable `sdramUseAXI` to `true` in the `Config` object of `ysyxSoC/src/SoC.scala`,
>    then regenerate `ysyxSoCFull.v`.
>
> When you are done, run some tests to check that your implementation is correct.


> #### todo:: Let the icache use burst transfers
>
> Modify the implementation of the icache so that it uses burst transfers to access data blocks in the SDRAM.
>
> After getting it right, record a waveform of a burst transfer and compare it with a waveform without burst transfers.
> You should be able to observe that burst transfers do improve efficiency.


#### Calibrating the memory access latency of the AXI-interface SDRAM


Although burst transfers are now working, the corresponding memory access latency has not been calibrated yet, so the resulting performance numbers are inaccurate. Similar to the APB delay module implemented earlier, we need an AXI delay module to calibrate the access latency.


Because the AXI protocol is more complex than APB, the AXI delay module has to consider the following issues:

* The read and write channels of AXI are independent, so in principle separate counters are needed for read and write transactions to control their latency.
  However, the current NPC is a multicycle design and never issues a read and a write request at the same time,
  so for now a single set of counters is enough. When you implement the pipeline later, you will still need two separate sets of counters.
* AXI has a complete set of handshake signals, and waiting for a handshake also involves the state of the device, so this waiting time falls within the scope of calibration as well: the moment the valid signal is asserted should be taken as the start of the transaction.
* AXI supports burst transfers, so the transfer pattern differs from APB.
* For read transactions, one burst may return several pieces of data, and each of them has to be calibrated separately. Suppose an AXI burst read transaction starts at `t0` and the device side returns data at `t1` and `t2` respectively;
  the AXI delay module then returns the data upstream at `t1'` and `t2'` such that `(t1 - t0) * r = t1' - t0` and `(t2 - t0) * r = t2' - t0` (see the sketch after this list).
* A burst write transaction involves several data transfers, and since the device needs at least one cycle to accept each piece of data, each such cycle appears as `r` cycles to the CPU, so the data-sending times also need to be calibrated.
  However, we have not implemented a dcache yet and the LSU does not issue burst write transactions, so the calibration of burst writes can be postponed; single write transactions, though, still need to be calibrated.

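The read-transaction rule above is easier to see with a small numeric sketch; `r`, `t0` and the device response times below are placeholders, and `r` stands for whatever latency ratio you derived when calibrating the APB delayer.

```c
// Sketch of the calibration rule the AXI delayer enforces for a burst read:
// every beat returned by the device at time ti is released upstream at ti'
// such that (ti - t0) * r = ti' - t0, where t0 is the cycle in which arvalid
// was first asserted.
#include <stdio.h>

int main(void) {
  int r  = 5;                  // placeholder latency ratio
  int t0 = 100;                // cycle where arvalid is asserted
  int t_dev[2] = {108, 113};   // cycles where the device returns beat 0 and beat 1

  for (int i = 0; i < 2; i++) {
    int t_up = t0 + r * (t_dev[i] - t0);  // cycle where the beat is handed upstream
    printf("beat %d: device at %d -> upstream at %d\n", i, t_dev[i], t_up);
  }
  return 0;
}
```
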
> #### todo:: Implement the AXI delay module
>
> Implement the AXI delay module in ysyxSoC as described above.
> Specifically, if you choose Verilog, implement the corresponding code in `ysyxSoC/perip/amba/axi4_delayer.v`;
> if you choose Chisel, implement the corresponding code in the `AXI4DelayerChisel` module of `ysyxSoC/src/amba/AXI4Delayer.scala`,
> and change `Module(new axi4_delayer)` in `ysyxSoC/src/amba/AXI4Delayer.scala` to instantiate the `AXI4DelayerChisel` module instead.
>
> To simplify the implementation, for now you may assume that the number of beats in a burst will not exceed 8.
> After implementing it, try different values of `r` and check in the waveform whether the equations above hold.


> #### todo:: Evaluate the performance of the burst transfer mode
>
> After calibrating the memory access latency of the burst transfer mode, run the microbench train-scale test and compare the result with the one you recorded previously.


#### Quickly assessing the miss cost


According to the discussion above, the current icache miss cost depends only on the block size and the bus transfer mode, and is unrelated to the other cache parameters.
We can therefore evaluate the miss cost of every combination of block size and bus transfer mode in advance; later, by plugging these miss costs into cachesim, we can compute the TMT and estimate the expected benefit of different cache parameter combinations. Moreover, according to the analysis above, the burst transfer mode always performs better than independent transfers, so in fact we only need to evaluate the miss cost of the various block sizes under burst mode.


As for how to assess the miss cost in advance, there are two approaches:

1. Modeling.
   Based on the workflow of the SDRAM controller's state machine, fit a formula for the SDRAM access time; plugging in the block size then yields the corresponding miss cost. This approach is straightforward, but modeling accuracy is a challenge: for example, the SDRAM row buffer and refresh operations also affect the access time, and it is hard to characterize their contribution to a single SDRAM access.
2. Statistics. Use suitable performance counters to measure the time spent accessing the SDRAM on icache misses, and from that compute the average miss cost. Being a statistical method, it averages over samples and thus naturally accounts for hard-to-model factors such as the row buffer and refresh operations. But a row buffer is essentially a cache, and its behavior is affected by program locality,
   so the test program also needs to be representative:
   statistics gathered directly from the microbench train-scale test are the most representative; this takes more time, but it only has to be run once to obtain the average miss cost. The test-scale run does not behave exactly like the train-scale run, but is still reasonably representative and lets you obtain the average miss cost quickly.
   The hello program, however, is only weakly representative, and using its average miss cost for estimation may introduce a large error. A small sketch of this statistics approach follows below.


In real projects, given their complexity, modeling is rarely used, so we also recommend using the statistics approach to assess the miss cost in advance.

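As a minimal illustration of the statistics approach (the counter names and numbers below are made up; use whichever performance counters you have actually added to your design):

```c
// Sketch of the "statistics" approach: read two hardware performance counters
// after a benchmark run and derive the average miss cost from them.
#include <stdint.h>
#include <stdio.h>

int main(void) {
  // Values you would read from your performance counters after running,
  // e.g., the microbench train-scale test (placeholders below):
  uint64_t icache_miss_count  = 123456;    // number of icache misses
  uint64_t sdram_stall_cycles = 9876543;   // cycles spent waiting for SDRAM on those misses

  double avg_miss_cost = (double)sdram_stall_cycles / icache_miss_count;
  printf("average miss cost = %.1f cycles\n", avg_miss_cost);
  // Feed this number into cachesim for the matching block size,
  // instead of re-running the RTL simulation for every parameter combination.
  return 0;
}
```
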
> #### todo:: Quickly assess the miss cost
>
> Implement the quick assessment of the miss cost as described above.
> You can already count the number of misses with cachesim;
> later you will combine those counts with these miss costs to estimate the expected benefit of different cache parameter combinations.


### Memory layout of the program


A change in the memory layout of a program can also noticeably affect cache performance.
For example, when the icache block size exceeds 4 bytes, a hot loop in the program may happen not to be aligned to a cache block boundary, so the instructions of the hot loop occupy more cache blocks than necessary. Say the hot loop of a program is located at addresses `[0x1c, 0x34)` and the icache block size is 16 bytes; reading all of the hot loop's instructions into the icache then requires three cache blocks.

```txt
        [0x1c, 0x34)                   + 0x4 = [0x20, 0x38)
+------+------+------+------+      +------+------+------+------+
|      |      |      | 0x1c |      | 0x20 | 0x24 | 0x28 | 0x2c |
+------+------+------+------+      +------+------+------+------+
| 0x20 | 0x24 | 0x28 | 0x2c |      | 0x30 | 0x34 |      |      |
+------+------+------+------+      +------+------+------+------+
| 0x30 |      |      |      |
+------+------+------+------+
```


But if we pad a few blank bytes in front of the code, we can change the location of the hot loop so that its instructions occupy fewer cache blocks. In the example above, padding just 4 bytes in front of the program code moves the hot loop to `[0x20, 0x38)`; it then occupies only two cache blocks, and the cache block saved can be used to hold other instructions, improving the overall performance of the program.

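As one concrete way to experiment with this, here is a small sketch of two GCC/GNU-ld specific tricks; the function, the 16-byte alignment and the 4 bytes of padding are only examples, and whether either helps depends on your block size and on where the linker actually places the code, so check the disassembly.

```c
// Sketch: two ways to shift a hot loop relative to icache block boundaries.

// (1) In C: force the function containing the hot loop onto a 16-byte boundary.
__attribute__((aligned(16)))
int hot_sum(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; i++) s += a[i];  // the hot loop
  return s;
}

// (2) In the linker script: pad 4 bytes in front of the code, e.g.
//       .text : { . = . + 4; *(.text*) }
//     which shifts everything after it by 4 bytes, as in the [0x1c, 0x34) example above.
```
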
> #### todo:: Optimize the memory layout of the program
>
> Following the description above, try padding a few blank bytes in front of the program code.
> Specifically, you can do this either by modifying the code or by modifying the linker script.
> After implementing it, evaluate whether the padded bytes improve the performance of the program.


Admittedly, the cache in the example above is small, so the single saved cache block is a large fraction of the whole cache; in modern processors the cache capacity is relatively ample, and saving one cache block may not improve performance noticeably. The point, however, is that optimizing the program itself is also an important way to exploit the locality principle: some programs can even become several times faster after such optimization.


In industry, for the critical applications of a target scenario, engineering teams often use all kinds of methods to improve performance. If a company designs its own processors, then besides improving the processor itself, it will also customize the compiler for the processor's parameters: compared with an executable built with a stock compiler from the open-source community, such as gcc, an executable built with the customized compiler runs faster on the target processor. In particular, [SPEC CPU defines two metrics, `base` and `peak`][spec run rules]; under the `peak` rules, different subitems may be compiled with different optimization options, so the benchmark as a whole achieves better performance on the target platform than under `base`. To reach a higher `peak` score, software-level optimization is indispensable.

[spec run rules]: https://www.spec.org/cpu2006/Docs/runrules.html#rule_1.5


### Design space exploration (2)


We have covered quite a few cache parameters, and even the memory layout of the program; all of these affect the performance of programs running on the processor. Now we can consider these parameters together and pick a combination that performs well. Of course, the design space exploration also has to satisfy the area constraint.


> #### danger:: Area limit
>
> By default, your NPC, synthesized with the nangate45 process provided by the `yosys-sta` project,
> must not exceed 25,000 $um^2$ of synthesized area; this is also the area limit for the stage B tape-out.
> Considering that you still have to implement the pipeline in later tasks, we recommend keeping the synthesized area of the NPC below 23,000 $um^2$ for now.
>
> This is not a large area budget. On the one hand,
> this limit helps to highlight the contribution of the other parameters in the design space exploration;
> otherwise you could reach good performance simply by enlarging the cache,
> and the impact of the other parameters on performance would be hard to see.
> On the other hand, the project team hopes that more students can reach the stage B tape-out,
> and a smaller area helps the team reduce the tape-out cost.
>
> For now you do not have to meet the above area limit strictly: if you exceed it by less than 5%,
> you may choose to optimize after the pipeline is done;
> but if your current synthesized area far exceeds the limit, you may need to make major adjustments to your design,
> and we recommend starting some area-related optimization work immediately.
>
> As for the tape-out cost, we can make a simple, not very rigorous estimate.
> Suppose a fab offers a tape-out service for the nangate45 process where each die is `2mm x 3mm` and costs 500,000 RMB.
> Then the price per $um^2$ is `500,000 / (2 * 3 * 1,000,000) = 0.0834` yuan.
> Generally, the wire nets connecting the standard cells also occupy some area, and some gaps are left between standard cells to avoid congestion, so the final chip area is usually larger than the synthesized area; as a rule of thumb, the synthesized area is about 70% of the final chip area.
> By this estimate, a design with a synthesized area of 25,000 $um^2$
> ends up costing about `25,000 / 0.7 * 0.0834 = 2978` yuan.
> The biggest takeaway from this estimate is that adding features to a CPU is not free. This is very different from designing for FPGAs: in competitions that target FPGAs,
> contestants usually try their best to convert the resources on the FPGA into CPU performance,
> and there is no economic cost in doing so.
> A real tape-out is not like that. You can think of the goal of stage B as designing a low-cost embedded processor (rv32e is the base instruction set for embedded scenarios):
> suppose you are an architect at an embedded CPU vendor and need to improve the performance of your processor within a limited area budget. If the area exceeds expectations, the chip cost rises and the product becomes less competitive in the market.
> Under such conditions, you have to estimate the cost-effectiveness of adding a feature:
> if a feature brings a 10% performance improvement, is it worth paying 500 yuan more for it?


> #### todo:: Design space exploration for the icache
>
> Following the introduction above, explore the design space of the icache and determine a design that achieves good performance under the given constraints.
> Once the design is determined, implement it in RTL and evaluate its performance on ysyxSoC.


> #### hint:: Some ideas for optimizing area
>
> If your estimated area significantly exceeds the requirement above, you will most likely need to optimize your design.
> There is no silver bullet for area optimization, but overall you can consider the following directions:
>
> 1. Optimize logic overhead: consider which logic is redundant and can be merged with other existing logic
> 2. Optimize storage overhead: consider which storage elements are redundant
> 3. Trade logic overhead against storage overhead: sometimes, instead of storing a signal,
>    it is better to recompute it; this may also affect the critical path, so it has to be analyzed case by case
>
> The area limit above is unlikely to be met by writing code carelessly, but it is not impossible to achieve either.
> yzh's reference design has an area of 22,730 $um^2$ after adding the icache, and its frequency reaches 1081MHz.
> Running microbench on it gives a `Total time` of 4.49s.
> We set the above area limit, on the one hand, so that you get to practice optimizing your own design:
> if you are a beginner, you need a chance to get your hands on this kind of work at some point,
> and build up, through trial and error, a feeling for how each line of RTL code relates to area overhead.
>
> On the other hand, it is also meant to let you experience the goal of architecture design once more:
> performance and area constrain each other. If you find your design hard to optimize,
> the simplest way out is to reduce the cache capacity, but you pay for it with performance;
> if you want both area and performance, you need to save every bit of area you can,
> and then plan how to spend that area to improve performance.


> #### todo:: Evaluate the cost-effectiveness of a dcache
>
> Above, you have already estimated the performance benefit of a dcache under ideal conditions.
> Now let us also estimate the cost-effectiveness of that dcache:
> assuming the dcache has the same capacity as the icache, how much would it cost?
>
> Although you have not designed a dcache yet, a dcache has to support write operations, so its design is bound to be more complex than the icache, and its area larger than an icache of the same capacity.
> The cost-effectiveness estimated under these assumptions is therefore quite optimistic:
> once you account for the real performance benefit and the real area of a dcache, it will only look less cost-effective.
>
> Another direction worth considering is: if the area of the dcache were instead used to enlarge the icache, how much performance improvement could that bring?


> #### caution:: Real architecture design
>
> Although the icache design space exploration task above is greatly simplified compared with design space exploration in a real processor, for most students it is still the first contact with real processor architecture design.
> More than that, it is probably the first time most students experience the whole flow of designing a module:
> from requirements analysis, structural design and logic design, to functional verification, performance verification and performance optimization,
> and finally to circuit-level area estimation and timing analysis. Among these, logic design is what is usually called RTL coding.
>
> This task shows once again that architecture design does not equal RTL coding.
> The job of architecture design is to find, within the design space that satisfies the constraints, a set of design parameters that performs well.
> But the design space is usually very large, and fully evaluating the performance of even one set of parameters takes a long time. How to evaluate different design parameters quickly is therefore a crucial problem in architecture work,
> and this is why simulators are an important tool for architecture design.
> With a simulator, we do not have to simulate circuit-level behavior (we run cachesim instead of verilator),
> we only need to model the necessary parts (no need to model the cache data, only the cache metadata), and we do not even need a processor to drive it (instead of running the full program, we replay the corresponding itrace). It is exactly these differences that make a simulator orders of magnitude faster than RTL simulation.
> This allows us to quickly evaluate the expected performance of different design parameters and to quickly rule out clearly unsuitable ones.
>
> According to the experience of the Xiangshan team, running a certain program on verilator takes one week,
> while running the same program in the full-system simulator [gem5][gem5] takes only 2 hours. This means that in the time it takes to evaluate one set of design parameters in an RTL simulation environment,
> the simulator can explore the effects of 84 different sets of design parameters.
>
> Simulators are also a common platform for architecture research:
> ISCA has held several competitions based on the [ChampSim][champsim] simulator,
> including a [cache replacement championship][cache replacement] and a [data prefetching championship][data prefetch]. Researchers evaluate the performance of various algorithms in the simulator, so that they can quickly adjust an algorithm's overall design and its detailed parameters.
> A production-quality algorithm of course still has to be implemented and verified at the RTL level, but exploring the various algorithms at the RTL level from the very beginning would be extremely inefficient.
>
> So, when you truly understand that "architecture design != RTL coding",
> you have truly entered the field of architecture design.

[gem5]: https://www.gem5.org/
[champsim]: https://github.com/ChampSim/ChampSim
[cache replacement]: https://www.sigarch.org/call-contributions/the-2nd-cache-replacement-championship/
[data prefetch]: https://www.sigarch.org/call-contributions/third-data-prefetching-championship/


## Cache coherence


When a store instruction modifies the contents of a data block, the semantics of the program require that later reads from the corresponding address return the new data; otherwise the program will execute incorrectly. Because of caching, however, a piece of data in memory may have multiple copies in the system. How to guarantee that new data can be read through every copy is called the cache coherence problem.


In computer systems, from the caches inside a processor to distributed systems and the Internet, wherever there are copies of data there are consistency problems between them. Now that the NPC has an icache, we can reproduce a coherence problem with the following `smc.c`:

```c
// smc.c
int main() {
  asm volatile("li a0, 0;"
               "li a1, UART_TX;"    // change UART_TX to the correct address
               "li t1, 0x41;"       // 0x41 = 'A'
               "la a2, again;"
               "li t2, 0x00008067;" // 0x00008067 = ret
               "again:"
               "sb t1, (a1);"
               "sw t2, (a2);"
               "j again;"
              );
  return 0;
}
```


The program first assigns initial values to some registers, then at the label `again` writes the character `A` to the serial port, then overwrites the instruction at `again` with `ret`, and finally jumps back to `again` to execute it again. According to the program's semantics, it should output the character `A` once and then return from `main()` through the rewritten `ret` instruction. Code that modifies itself while running like this is called "self-modifying code".


> #### comment:: Self-modifying code in the history of computing
>
> In the old days when memory address space was very tight, self-modifying code was often used to improve memory utilization,
> letting a program do more work within limited memory.
> For example, in the 1980s, [the FC (Famicom) had only 64KB of address space][NES MMIO], of which the ROM in the cartridge occupied 32KB. Some cartridges also carried 8KB of RAM, but if a cartridge had none, the program could only use the 2KB of RAM integrated next to the CPU.
> To develop great games with such limited resources,
> developers used all kinds of tricks, and self-modifying code was one of them.
>
> With the development of memory technology, memory capacity is no longer as tight as it used to be. Together with the fact that self-modifying code is hard to read and maintain, it is now difficult to find traces of self-modifying code in modern programs.

[NES MMIO]: https://www.nesdev.org/wiki/CPU_memory_map


> #### todo:: Reproduce the cache coherence problem
>
> Compile the above program into AM and run it on the NPC. What problem do you find?
> Try to analyze it and verify your idea with waveforms.
> If you do not observe a problem, try increasing the capacity of the icache.


A straightforward solution to the problem above is to keep all copies in the system consistent at all times. For example, every time a store instruction executes, immediately check whether other copies exist in the system; if they do, either update them or invalidate them, so that whichever copy a later access hits,
it either reads the new data directly (with the update approach) or misses and reads the new data from the next level of storage (with the invalidate approach).
The x86 instruction set adopts this solution. Obviously, though, it increases the complexity of the CPU design. In particular, in some high-performance processors, while a store instruction is executing, other components are accessing the various caches in the system at the same time; preventing them from reading stale data before the store has updated or invalidated all copies is very challenging.


The other solution is more permissive: copies in the system are allowed to be inconsistent at certain moments, but before the program accesses such a data block, it must execute a special instruction that tells the hardware to deal with the stale copies. In this way the program still accesses the correct data, and the result is still consistent with the program's semantics. The RISC-V instruction set adopts this solution:
RISC-V provides a `fence.i` instruction whose semantics is that all instruction fetches after it can see the data written by store instructions before it. The `fence.i` instruction acts like a barrier, preventing fetches after it from crossing the barrier and reading stale instructions that a store has since modified. The RISC-V manual describes it as follows:

```txt
RISC-V does not guarantee that stores to instruction memory will be made
visible to instruction fetches on a RISC-V hart until that hart executes
a FENCE.I instruction.
```


That is, RISC-V allows a copy in the icache to be inconsistent with memory at certain moments, which matches the discussion above. For more details about `fence.i`, RTFM.

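From the software side, inserting the barrier only takes one instruction. Below is a minimal sketch of how a C program can emit it with a GCC-style toolchain; the helper name is made up, and depending on your toolchain version you may also need the Zifencei extension to appear in `-march` before `fence.i` assembles.

```c
// Sketch: emitting the fence.i barrier from C via inline assembly (GCC-style).
// Call it after the store that modifies code and before jumping to the modified
// code, so that later instruction fetches are guaranteed to see the new instruction.
static inline void sync_icache(void) {
  asm volatile("fence.i" ::: "memory");  // "memory" also stops the compiler from
                                         // moving the modifying store past the barrier
}
```
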
RISC-V only defines the semantics of `fence.i` at the instruction set level;
at the microarchitecture level there are several different ways to implement it:

| Scheme | When a store instruction executes | When `fence.i` executes | On a subsequent icache access |
| :-: | :-: | :-: | :-: |
| (1) | update the matching block in the icache | nop | hit |
| (2) | invalidate the matching block in the icache | nop | miss, fetch from memory |
| (3) | - | flush the icache | miss, fetch from memory |


In fact, schemes (1) and (2) are exactly the "keep all copies in the system consistent at all times" solution mentioned above, which can be seen as a special case of "allowing copies to be inconsistent at certain moments":
since all copies are already consistent when the store instruction executes, `fence.i` can be implemented as a `nop`. In scheme (3), the store instruction does not touch the copies in the icache, so those copies may be stale;
when `fence.i` executes, it therefore has to achieve its barrier effect by flushing the icache, so that subsequent icache accesses are guaranteed to miss and fetch the new data from memory.
However, no matter which implementation the hardware chooses, the program still has to insert the `fence.i` instruction in order to satisfy the requirements the RISC-V manual places on programs; otherwise it will not run correctly on a processor that uses scheme (3).


In fact, the difference between these implementations is simply whether the copy-consistency problem is handled in hardware or in software:
if the ISA specification requires the processor to handle it in hardware, the problem is transparent to software, and the programmer does not need to worry about where to insert instructions like `fence.i`, at the cost of a more complex hardware design; if the ISA specification requires software to handle it, the hardware is simpler, at the cost of a heavier burden on the programmer.


So the essence of the problem is a trade-off between hardware design complexity and the burden of program development.


> #### todo:: Implement the fence.i instruction
>
> Based on your understanding of `fence.i`, choose an implementation scheme you consider reasonable and implement the `fence.i` instruction in the NPC.
> After implementing it, add the `fence.i` instruction at the appropriate place in the `smc.c` above and run it on the NPC again.
> If your implementation is correct, you will see the program output the character `A` and then finish successfully.
>
> Hint: you may run into a compilation error related to `fence.i`; try to resolve it based on the error message.


In fact, the cache coherence problem shows up in many more forms in real computers;
we will say more about it as the processor becomes more complex.