# Performance and Simple Cache

TODO: Add an introduction to Zicntr and Zihpm, and ask students to check whether ecall and ebreak are counted by the instret CSR.

> #### Danger:: Update ysyxSoC
>
> We made a major update to the `ysyxSoC` project at 2024/04/21 18:00:00.
> We added the framework code related to memory latency, and refactored some of the code to make subsequent configuration easier.
> At the same time, because the number of possible configuration combinations is large, it is hard for the `ysyxSoC` project to provide pre-generated `ysyxSoCFull.v` files for every configuration,
> so you need to set up a Chisel build environment to generate this file; see below for details.
> However, if you choose to develop in Verilog, you still do not need to write any Chisel code.

> Because of the large-scale refactoring, the project has to be updated manually here.
> If you obtained the code of `ysyxSoC` before the above time, please do the following:

> ```bash
> cd ysyx-workbench
> mv ysyxSoC ysyxSoC.old
> git clone git@github.com:OSCPU/ysyxSoC.git
> # install mill, see https://mill-build.com/mill/Intro_to_Mill.html
> mill --version # check whether mill is installed
> cd ysyxSoC
> make dev-init # pull the rocket-chip project and initialize it
> make verilog # generate build/ysyxSoCFull.v
> cp ... ... # copy the files you have already written in ysyxSoC.old to the appropriate directories in ysyxSoC
> # For the Chisel code, we put it in the src/ directory and refactored the directory structure a bit,
> # but you should be able to find the new location of each file easily
> make verilog
> cd am-kernels/benchmarks/microbench
> make ARCH=riscv32e-ysyxsoc mainargs=test run # simulate to verify that the program still works correctly
> cd ysyx-workbench
> rm -rf ysyxSoC.old # make sure you have copied back all the code you wrote
> ```

After connecting to ysyxSoC, the NPC you designed can interact correctly with various devices, and from a functional point of view it is ready for tape-out. Following the "make it work first, then make it better" principle of system design, we can now discuss how to carry out performance optimization.

We talk about performance optimization, but that is only the end goal. In a complex system we face many choices, such as:
which parts are worth optimizing? Which optimization schemes should we adopt? What benefit do we expect? What are the costs of these schemes?
If we spend a lot of effort only to find that performance has improved by one part in ten thousand, that is certainly not what we expect. Therefore, rather than blindly writing code, we need a set of scientific methods to guide us in answering these questions:

1. Evaluate the current performance
2. Locate the performance bottleneck
3. Apply an appropriate optimization method
4. Evaluate the performance after optimization, and check whether the improvement meets expectations

## Performance evaluation

To talk about optimization, you first need to know how well the current system performs. Therefore, we need a quantitative measure of performance, rather than judging by a vague feeling of "it runs well". Evaluating the current system with such a metric is the first step of performance optimization.

Intuitively, "high performance" basically means "runs fast".
Therefore, a direct measure of performance is the execution time of a program, and evaluating the performance of a system means evaluating the execution time of programs on that system.

### Benchmark program selection

Which programs should be evaluated? There are far too many programs to evaluate them all, so we need to choose some representative ones. "Representative" means that the performance gains an optimization technique brings to these programs are consistent with the trend of gains it brings in real application scenarios.

The "application scenarios" mentioned here suggest that this trend is likely to differ across scenarios. Different application scenarios therefore require different representative programs, which leads to different benchmarks. For example, Linpack is used to represent supercomputing scenarios, MLPerf represents machine learning training scenarios, CloudSuite represents cloud computing scenarios, and Embench represents embedded scenarios. For general-purpose computing, the most famous benchmark is SPEC CPU, which is used to evaluate the general-purpose computing capability of a CPU.

SPEC (Standard Performance Evaluation Corporation) is an organization whose goal is to establish and maintain standardized benchmarks for evaluating computer systems; [it has defined and published benchmarks for a variety of scenarios](https://www.spec.org/benchmarks.html).
Besides SPEC CPU, there are benchmarks for graphics, workstations, high-performance computing, storage, power consumption, virtualization, and other scenarios.

A benchmark suite also typically consists of several sub-items. For example, the integer suite of SPEC CPU 2006 includes the following sub-items:

| Sub-item | Description |
| --- | --- |
| 400.perlbench | Spam detection written in Perl |
| 401.bzip2 | bzip compression algorithm |
| 403.gcc | The gcc compiler |
| 429.mcf | Combinatorial optimization for single-depot vehicle scheduling in large-scale public transportation |
| 445.gobmk | The game of Go, an AI search problem |
| 456.hmmer | Gene sequence search using hidden-Markov-model-based gene recognition |
| 458.sjeng | Chess, an AI search problem |
| 462.libquantum | Simulation of quantum computing for prime factorization |
| 464.h264ref | H.264 video encoding of YUV source files |
| 471.omnetpp | Simulation of a large Ethernet network with the CSMA/CD protocol |
| 473.astar | Pathfinding with the A\* algorithm |
| 483.xalancbmk | XML to HTML format conversion |

In addition to the integer suite, SPEC CPU 2006 also includes a floating-point suite, with programs covering fluid dynamics, quantum chemistry, biomolecules, finite element analysis, linear programming, image ray tracing, computational electromagnetics, weather forecasting, speech recognition, and so on.

Of course, benchmarks also need to keep up with the times in order to represent the programs of a new era. As of 2024, SPEC CPU has gone through 6 versions, released in 1989, 1992, 1995, 2000, 2006, and finally 2017. SPEC CPU 2017 adds a number of new programs to represent new application scenarios, such as biomedical imaging, 3D rendering and animation, and an artificial-intelligence Go program based on Monte Carlo tree search (very likely influenced by AlphaGo in 2016).
> #### Comment: CoreMark and Dhrystone are not good benchmarks
>
> CoreMark and Dhrystone are synthetic programs,
> that is, programs stitched together from several code fragments.
> For example, CoreMark consists of three code fragments: linked-list operations, matrix multiplication, and state-machine transitions;
> Dhrystone consists of code fragments for string operations.
>
> The biggest problem with synthetic programs as benchmarks is that they are not representative:
> which application scenarios do CoreMark and Dhrystone represent? Compared with the various real applications in SPEC CPU 2006,
> the code fragments in CoreMark can at best be considered C-course homework;
> Dhrystone is even further from real application scenarios: its code is very simple (it uses short string constants),
> and with a modern compiler the code fragments in the loop body are likely to be deeply optimized (recall `pattern_decode()` in NEMU),
> so the evaluation results are inflated and cannot objectively reflect the performance of the system.
> [This article](https://www.transputer.net/tn/27/tn27.html) analyzes the defects of Dhrystone as a benchmark in detail.
>
> Ironically, when releasing their products today, many CPU manufacturers
> still use CoreMark or Dhrystone results to characterize product performance,
> and some of them even claim to be products for high-performance scenarios.
> [Architecture guru and Turing Award winner David Patterson, when introducing Embench](https://www.sigarch.org/embench-recruiting-for-the-long-overdue-and-deserved-demise-of-dhrystone-as-a-benchmark-for-embedded-computing/), criticized Dhrystone as long outdated and argued that it should be retired. In fact, the first version of Dhrystone was released in 1984, and Dhrystone has not been maintained or upgraded since 1988.
> The computer world has changed dramatically since the 1980s:
> applications have been renewed, compilation technology has matured, and hardware computing power has improved enormously.
> Using a 40-year-old benchmark to evaluate today's computers is bound to be problematic.

However, the programs in SPEC CPU are a bit too realistic for a teaching-oriented project like "One Student One Chip":
on the one hand, they are very large and take hours to run even on a real x86 machine;
on the other hand, they need to run in a Linux environment, which means we would first have to design a CPU that can boot Linux before we could run the SPEC CPU benchmark at all.

Instead, we would like a set of benchmarks suitable for teaching scenarios that meets the following conditions:

* The scale is not too large: the execution time in an emulator, or even in the RTL simulation environment, is within 2 hours
* It can run on bare metal, without booting Linux
* The programs are representative, unlike the synthetic programs used by CoreMark and Dhrystone

In fact, the microbench integrated into am-kernels is a good choice.
First, microbench provides test sets of multiple scales: an emulator can use the `ref` scale, while the RTL simulation environment can use the `train` scale.
Second, microbench is an AM program, so it can run without booting Linux.
In addition, microbench contains 10 sub-items, covering sorting, bit manipulation, a language interpreter, matrix computation, prime number generation, the A\* algorithm, maximum network flow, data compression, MD5 checksum, and other scenarios.
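As a reminder of how these scales are selected in practice: when running microbench under AM, the scale is chosen through `mainargs`, in the same way as the `mainargs=test` example in the update instructions above (the exact set of scale names is defined by microbench itself, so treat the values below as an illustration):

```bash
cd am-kernels/benchmarks/microbench
# mainargs selects the test scale, e.g. test / train / ref
make ARCH=riscv32e-ysyxsoc mainargs=train run
```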
Therefore, when this handout later mentions performance evaluation without specifying a benchmark, it refers to the microbench `train` scale by default.

If the processor targets a specific application scenario, such as running Super Mario,
then you can simply use Super Mario as the benchmark; this amounts to taking the "Super Mario gaming experience" as the standard for "running well". Unlike microbench, Super Mario is a program that never terminates, so FPS, rather than running time, can be used as the quantitative metric.

## Find performance bottlenecks

### Performance formula and optimization direction

We can measure the running time of the benchmark to obtain the performance of the system.
But the running time is a single number, and it is hard to find the system's performance bottleneck from it alone, so we need some finer-grained data.

In fact, we can break the running time of a program down into the following three factors:

```
        time     inst    cycle    time
perf = ------ = ------ * ------ * ------
        prog     prog     inst    cycle
```

The goal of performance optimization is to reduce the running time of the program, that is, to reduce each of these factors, which also reveals the three directions of performance optimization.

The first direction is to reduce the number of instructions the program executes (i.e., the dynamic instruction count). Possible measures include:

 1. Modify the program and adopt a better algorithm.
 2. Adopt better compiler optimization strategies. Taking gcc as an example, besides general optimization-level flags such as `-O3` and `-Ofast`, gcc has about 600 compilation options related to the quality of the generated code. Choosing appropriate options can greatly reduce the dynamic instruction count of a program.
    For example, in one of yzh's projects, compiling CoreMark with only `-O3` gave about 3.12 million dynamic instructions for 10 rounds; with some additional targeted compilation options enabled, the dynamic instruction count dropped to about 2.25 million, a significant performance improvement.
 3. Design and use instruction sets with more complex behavior. We know that CISC instruction sets contain instructions with more complex behavior; if the compiler uses these complex instructions, the dynamic instruction count can be reduced. Alternatively, you can add custom specialized instructions to the processor and have the program use them.

The second direction is to reduce the average number of cycles per instruction, i.e., the CPI (Cycles Per Instruction), or equivalently to increase its reciprocal, the IPC (Instructions Per Cycle), i.e., the average number of instructions executed per cycle. This metric reflects the microarchitecture design of the processor: a powerful processor can execute more instructions per cycle. Therefore, IPC is usually improved by optimizing the processor's microarchitecture so that the processor finishes executing the program sooner. Microarchitecture optimization has several different directions, which we will briefly discuss below.

The third direction is to reduce the time per cycle, that is, to increase the number of cycles per unit time, which is the frequency of the circuit.
Possible optimization measures include:

 1. Optimize the front-end design of the digital circuit to reduce the logic delay of the critical path.
 2. Optimize the back-end design of the digital circuit to reduce the routing delay of the critical path.

If we can quantify these three factors, we can better assess the potential of each optimization direction, which guides us toward the performance bottleneck.
Fortunately, these metrics are not hard to obtain:

* The dynamic instruction count can be counted directly in the simulation environment
* With the dynamic instruction count and a count of the cycles, you can calculate the IPC
* The frequency of the circuit can be obtained from the synthesis report

> #### todo:: Count the IPC
>
> Try to count the IPC in the simulation environment.
>
> In fact, back when we implemented the bus, we already asked you to evaluate the running time of a program using the performance formula described above,
> except that at the time we computed it as "number of cycles needed to execute the program / frequency".
> The performance formula above merely breaks the number of cycles needed by the program further into two factors.
> But this split still gives us extra information:
> after all, the number of cycles needed to execute the program is related to both the program and the processor,
> while the average number of cycles per instruction is related only to the processing capability of the processor.

### Performance model for simple processors

Even though the IPC is not difficult to count, just as the running-time statistics do not tell us how to optimize the running time, the IPC statistics do not tell us how to optimize the IPC.
To find the performance bottleneck, we need to analyze the factors that affect the IPC, just as we analyzed the factors of the running time above.
To do this, we need to re-examine how the processor executes instructions.

```
   /--- frontend ---\   /-------- backend --------\
                              +-----+  <--- 2. computation efficiency
                         +--> | FU  | --+
   +-----+     +-----+   |    +-----+   |    +-----+
   | IFU | --> | IDU | --+              +--> | WBU |
   +-----+     +-----+   |    +-----+   |    +-----+
      ^                  +--> | LSU | --+
      |                       +-----+
1. instruction supply            ^
                                 |
                         3. data supply
```

The diagram above shows the structure of a simple processor. So far we have understood this diagram from a functional perspective; now we need to understand it from a performance perspective.
We can divide the processor into a front end and a back end: the front end covers instruction fetch and decode, while the remaining modules belong to the back end and are responsible for actually executing the instructions and changing the processor state.
Note that the processor's front end and back end are not the same as the front-end and back-end design of digital circuits mentioned above;
in fact, the processor's front-end design belongs to the front-end design of digital circuits.

To improve the execution efficiency of the processor, we need the following:

 1. The front end of the processor needs to ensure instruction supply. If the front end cannot fetch enough instructions, the computing power of the processor cannot be fully used. Because every instruction must be fetched before it can be executed, the instruction supply capability of the front end affects the execution efficiency of all instructions.
 2. 
The back-end of the processor needs to ensure computing efficiency and data supply + +* For most computation-class instructions, their execution efficiency depends on the computational efficiency of the corresponding functional unit. + For example, the execution efficiency of a multiplier and division instruction is also affected by the computational efficiency of the multiplier and divider. Similar to floating-point execution and floating-point processor unit FPU. +* For memory access class instructions, the execution efficiency depends on the memory access efficiency of the LSU. In particular, for the load instruction, the processor needs to wait for the memory to return data before it can write it back to the register pile. This means that the efficiency of the load instruction depends on the data supply capacity of the LSU and the memory. The store instruction is special because the store instruction does not need to be written to the register pile, in principle, the processor does not have to wait for data to be fully written to memory. In high-performance processors, a store buffer component is usually designed, after the processor writes the information of the store instruction to the store buffer, the store execution is considered to be complete. The store buffer component then controls the actual writing of the data to the memory. Of course, this adds complexity to the processor design, for example, the load instruction also needs to check if the latest data is in the store buffer. + + + +So, how can we quantitatively evaluate the instruction supply, computational efficiency, and data supply of a processor? +In other words, what we really want to know is whether modules such as IFU and LSU are working at full speed when the processor is running the specified benchmark.To do that, we need to collect more information. + + +### Performance events and performance counters + + + +In order to quantitatively evaluate a processor's instruction supply, computational efficiency, and data supply, we need to further understand the detailed factors that affect them. Take instruction supply as an example, how can instruction supply capacity be considered strong? +What can directly reflect the ability of the instruction supply is whether the IFU gets the instruction. To do this, we can consider "IFU fetch instruction "as an event and count the frequency of this event. If this event happens often, the command supply capacity is strong; Otherwise, the instruction supply capacity is weak. + + + +These events are called performance events, and through them, we can translate some of the more abstract performance indicators in the performance model into concrete events on the circuit. Similarly, we can measure the strength of data supply by counting the frequency of the event "LSU gets data "; count the frequency of the event "EXU complete calculation "to measure the efficiency of the calculation. + + + +To count the frequency of performance events, we just need to add some counters to the hardware, when a performance event is detected, the value of the counter is increased by 1. +These counters are called performance counters. With a performance counter, we can see where the time spent running a program on the processor is, it's like profiling the inside of the processor. + + + +It is not difficult to detect the occurrence of performance events on the circuit, we can use the handshake signal of the bus mechanism to detect. 
For example, the handshake of the R channel of the IFU finger fetch indicates that the IFU has received the data returned by the AXI bus, thus completing a finger fetch operation. So when the R-channel handshake occurs, we can increment the corresponding performance counter by 1. + + + +> #### todo:: Adds a performance counter +> +> Try to add some performance counters to the NPC, including at least one of the following performance events: +> +> * IFU fetch instruction +> * LSU gets data +> * EXU Complete the calculation +> * Decode various types of instructions, such as calculation class instructions, memory access instructions, CSR instructions, etc +> +> Performance counters are also essentially implemented by circuits. As the number of performance counters increases, they will take up more and more circuit area and may even affect the critical path in the circuit. +> So we don't require performance counters to participate in the flow sheet, you just need to use them in the simulation environment: +> You can use RTL to implement performance counters and output their values at the end of the simulation by means of '$display()' etc. +> Then choose not to instantiate them by way of configuration during synthesis; or the detection signal of the performance event is connected to the simulation environment through DPI-C to realize the performance counter in the simulation environment. +> This way, you can add performance counters as much as you want without worrying about affecting the area and frequency of the circuit. +> After implementation, try running a microbench test scale to collect results for performance counters. +> If your implementation is correct, there should be consistency between different semantically similar performance counters. +For example, the total number of instructions of different classes obtained by decoding should be the same as the number of instructions fetched by IFU, and also the number of dynamic instructions. +> Try to find more consistent relationships and check if those relationships hold up. + + + +Sometimes we care more about when and why events don't happen than when they do. For example, we care more about when IFU can't get instructions and why IFU can't get instructions, figuring out why this is happening helps us understand where the bottlenecks in instruction supply are, i t provides guidance for improving instruction supply of processor. We can define the "event does not occur" as a new event and add a performance counter for the new event. + + + +> #### todo:: Add a performance counter (2) +> +> Add more performance counters to NPCS and try to analyze the following issues: +> +> * What percentage of each type of instruction? How many cycles do they each need to execute on average? +> * What are the reasons why IFU can't get instructions? What are the chances that IFU can't fetch an instruction for these reasons? +> * What is the average memory access delay for LSU? + + + + +> #### comment: trace the performance counter +> +> The use of the performance counter described above is output and analysis after the simulation. +> If we output the value of the performance counter every cycle, we can get a trace of the performance counter! +> According to this trace, with some drawing tools (such as python's matplotlib drawing library), we can plot the value of the performance counter over time, +> Visualize how performance counters change during simulation, this helps us to better determine whether the performance counter changes as expected. 
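To make the todo items above concrete, here is a minimal Verilog sketch of two such performance counters driven by AXI R-channel handshakes. The port names (`ifu_rvalid` and so on) are placeholders for whatever signals your NPC actually exposes, and the `final` block is a SystemVerilog construct supported by Verilator; printing from the C++ harness or through DPI-C works just as well.

```verilog
module perf_counter (
  input clock,
  input reset,
  input ifu_rvalid, ifu_rready,  // R-channel handshake of the IFU fetch port
  input lsu_rvalid, lsu_rready   // R-channel handshake of the LSU load port
);
  reg [63:0] cycle_cnt, ifu_fetch_cnt, lsu_load_cnt;

  always @(posedge clock) begin
    if (reset) begin
      cycle_cnt     <= 64'd0;
      ifu_fetch_cnt <= 64'd0;
      lsu_load_cnt  <= 64'd0;
    end else begin
      cycle_cnt <= cycle_cnt + 64'd1;
      // an R-channel handshake means one fetch / one load has completed
      if (ifu_rvalid && ifu_rready) ifu_fetch_cnt <= ifu_fetch_cnt + 64'd1;
      if (lsu_rvalid && lsu_rready) lsu_load_cnt  <= lsu_load_cnt  + 64'd1;
    end
  end

  // dump the counters when the simulation finishes
  final begin
    $display("cycles = %0d, ifu fetches = %0d, lsu loads = %0d",
             cycle_cnt, ifu_fetch_cnt, lsu_load_cnt);
  end
endmodule
```

A counter like `ifu_fetch_cnt` should match the dynamic instruction count reported by your simulation environment, which is exactly the kind of consistency check suggested above.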
### Amdahl's law

Performance counters can provide quantitative guidance for optimizing the processor microarchitecture. So where is the performance bottleneck? Which optimizations are worth doing? What performance benefit should we expect from the optimization work?
We need to answer these questions before starting any concrete optimization, so that we can avoid optimizations with low expected gains and spend more time on work with high returns. It sounds like predicting the future, but Amdahl's law can give us the answer.

[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) was proposed by the computer scientist Gene Amdahl in 1967.
It states:

```
The overall performance improvement gained by optimizing a single part
of a system is limited by the fraction of time that the improved part
is actually used.
```

Suppose a certain part of the system accounts for a fraction `p` of the total running time, and that part is sped up by a factor of `s` after optimization; then the speedup of the whole system is `f(s) = 1 / (1 - p + p/s)`. This is the formula form of Amdahl's law.

For example, suppose the execution of a program consists of two independent parts A and B, where A accounts for 80% of the time and B for 20%:

* If B is sped up by a factor of 5, the speedup of the whole program is `1 / (0.8 + 0.2/5) = 1.1905`;
* If B is sped up by a factor of 5000, the speedup of the whole program is `1 / (0.8 + 0.2/5000) = 1.2499`;
* If A is sped up by a factor of 2, the speedup of the whole program is `1 / (0.2 + 0.8/2) = 1.6667`.

```
<------- A --------><-B->
++++++++++++++++++++ooooo  original program

++++++++++++++++++++o      B sped up by 5x

++++++++++++++++++++       B sped up by 5000x; its remaining time is negligible

++++++++++ooooo            A sped up by 2x
```

In general, a 5000x speedup takes far more effort than a 2x speedup, yet Amdahl's law tells us that speeding up B by a factor of 5000 is still not as good as speeding up A by a factor of 2. This counterintuitive result tells us that we cannot look only at the speedup of one part; we must also consider the fraction of time that part occupies, and evaluate the effect of an optimization from the perspective of the whole system.
Therefore, for processor performance optimization, it is important to measure in advance, through performance counters, how much time the object to be optimized actually accounts for.

> #### todo:: Find performance bottlenecks based on the performance counters
>
> Based on the statistics from the performance counters, try to identify some potential optimization targets,
> then use Amdahl's law to estimate the theoretical benefit each of them can achieve, in order to determine where the system's performance bottleneck lies.

> #### caution:: Professional quality in computer architecture work
>
> There is a piece of advice widely circulated in the software world:
> ```Talking about optimization without a workload is rogue.```
> It means that the choice of an optimization scheme must be based on how the workload actually behaves.
> This is especially true in the architecture field: we cannot optimize a processor design by intuition, changing whatever looks like an optimization opportunity.
+> Otherwise, it's easy to adopt a scheme that doesn't work, or even cause performance regression in real scenarios. On the contrary, it is scientific to adopt a suitable design plan based on the evaluation data. +> In fact, Amdahl's law is easy to understand, without regard to professional background, we can even package it as a math word problem for elementary school students to solve. +But we have also seen a lot of beginners "rogue ", to put it bluntly, or lack of relevant professional quality. +> Everyone to learn "One Student One Chip", not only to learn RTL coding, +> It is more important to learn scientific methods to solve problems, and exercise professional quality in this direction. +> So that when you encounter real problems in the future, you know how to solve them in the right way. + + + +> #### caution:: Top-down debugging method +> +> You've been debugging a lot of feature bugs, but there's actually a kind of bug called a performance bug, its performance is not a program error or crash, but the performance of the program is lower than expected. +> Of course, the process of debugging performance bugs is similar to the process of performance optimization, look for performance bottlenecks in the system. +> In fact, debugging bugs and debugging performance bugs are also similar. +> When debugging a bug, the first thing we see is that the program is faulty or crashes. +> But just reading such information, it is difficult to find bugs; Therefore, it is necessary to understand the behavior of the program through various levels of trace tools to find the specific performance of the program when the error; Then use tools such as gdb/ waveform for detailed analysis at the variable/signal level. +> When debugging a performance bug, the first thing we see is the running time of the program, but it doesn't tell us directly where the performance bottleneck is; +> Then the running time of the program is decomposed into 3 factors by the performance formula. +> Investigate the optimization potential from the three directions of compilation, microstructure and frequency; For microstructures, it is still difficult to find performance bottlenecks based on statistical IPC alone. +> So we need to analyze the factors that affect IPC, divide the processor into three parts, understand the process of processor execution from instruction supply, data supply and computational efficiency. +> But we need more concrete quantitative data; So we need to add a performance counter to count the occurrence of performance events in each module. +> Finally, find the real performance bottleneck through Amdahl's law. +> It's no coincidence that debugging both of these bugs uses a similar top-down analysis approach, it is the embodiment of abstract thinking in the field of computer systems: abstraction is the only way to understand complex systems. +> In fact, if you try to debug using gdb/ waveform at first, you will find it very difficult. +> This is because the amount of detail at the bottom is so large that it is difficult to provide a macro perspective. +> Therefore, we need to start at the high level semantics and trace down the appropriate path, +> Locate a small area at the bottom, which helps us quickly find the problem. + + +### Calibrate memory access delay + + + +After we connect NPC to ysyxSoC, modules such as SDRAM controller provide a more realistic memory access process. 
You can imagine if we counted performance counters before we plugged in ysyxSoC, due to the difference in memory access delay, the result will be very different from that after connecting to ysyxSoC. And the different statistical results will guide us to optimize in different directions, but if we aim for the stream, these different directions of optimization are likely to fail to achieve the desired effect. So the closer the simulated environment behaves to the real chip, the less error there is in the evaluation, the more real the performance gains from optimizations guided by performance counters. + + + +In fact, the previous ysyxSoC environment assumed that the processor and various peripherals were running on the same frequency: +A cycle of verilator emulation is both a cycle in the processor and a cycle in the peripherals. But in practice this is not the case: due to electrical characteristics, peripherals can usually only operate at low frequencies, for example, SDRAM particles usually only operate at around 100MHz, and too high a frequency can lead to timing violations, making SDRAM particles not work correctly; But on the other hand, processors using advanced processes are often able to run at higher frequencies, for example, a version of the yzh multi-period NPC reaches a frequency of about 1.2GHz in the nangate45 process provided by default by the 'yosys-sta' project. +In the above configuration, the SDRAM controller goes through 1 cycle, the NPC should go through 12 cycles, however, verilator does not perceive the frequency difference between the two, and still simulates under the assumption that the two frequencies are the same. +The simulation result is much more optimistic than the real chip, which may also make some optimization measures can not achieve the expected effect in the real chip. + + + +> #### Danger:: Update yosys-sta +> +> We updated the 'yosys-sta' project at 2024/04/09 08:30:00, adding the netlist optimization tool developed by the iEDA team. +> The comprehensive netlists generated by yosys have been significantly optimized to bring their timing evaluation results closer to commercial tools. +> If you obtained the code for 'yosys-sta' before the above time, delete the existing 'yosys-sta' project and clone it again. + + + +In order to obtain more accurate simulation results to guide us in more effective optimization, we need to conduct a calibration of the memory access delay. Calibration in one of two ways, one kind is to use simulators, support multiple clock domains such as VCS or [ICARUS verilog] (https://github.com/steveicarus/iverilog). Unlike verilator, which is implemented in a cycle-accurate model, this simulator is implemented in an event queue model. Consider every calculation in Verilog as an event, and you can maintain the latency of the event. Thus, the order of each calculation can be correctly maintained when different modules in the multi-clock domain work at different frequencies. However, in order to maintain the event queue model, such emulators usually run slower than verilator. + + + +The second way to calibrate is to modify the RTL code, insert a delay module into the ysyxSoC, responsible for delaying the request for several cycles to simulate the effect of the device operating at a low frequency. The number of cycles the NPC waits is close to the number of cycles it will wait in the future when running at high frequencies. 
This approach is not complicated to implement, and it can be simulated with the faster Verilator, so we choose it. In addition, this approach also works on an FPGA.

Naturally, to implement the delay module we just need the following behavior: when the delay module receives a reply from the device, it does not forward the reply to the upstream module immediately, but waits for a certain number of cycles before replying. How to calculate the number of cycles to wait, however, takes some thought.
Consider the yzh example above: if a request takes 6 cycles in the SDRAM controller, the NPC should wait a total of `6 * 12 = 72` cycles; if the SDRAM controller happens to be refreshing the SDRAM particles and the request takes 10 cycles in the SDRAM controller, the NPC should wait a total of `10 * 12 = 120` cycles; if the request goes to flash and takes 150 cycles in the SPI master, the NPC should wait a total of `150 * 12 = 1800` cycles.
As you can see, the number of cycles the delay module needs to wait depends on how long the device takes to service the request;
it is not a fixed constant, so it has to be computed dynamically inside the delay module.
Suppose a request takes `k` cycles in the device, and the processor-to-device frequency ratio is `r` (we should have `r >= 1`); then the delay module needs to compute the number of cycles the processor has to wait, `c = k * r`.
To perform this dynamic calculation, we need to consider two more questions:

1. How do we implement the multiplication cheaply?
2. If `r` is not an integer, how do we multiply by a fraction?
For example, if `yosys-sta` reports a frequency of 550MHz, then `r = 550/100 = 5.5`; but if 5.5 is treated as 5, a request that takes 6 cycles in the device introduces a 3-cycle error on the processor side. For a CPU running at high speed this error is too large, and the accumulated error will significantly distort the values of the performance counters, which in turn affects the optimization decisions.

Since the ysyxSoC code does not go through synthesis and tape-out, there are some easy ways out, for example using `*` to compute the product and fixed-point numbers to represent the fraction.
But as an exercise, we ask you to try a solution that would be synthesizable,
so that if you ever need to solve a similar problem in a synthesizable circuit in the future, you will know how to do it.

Let us first consider the multiplication when `r` is an integer.
Since the delay module itself also has to wait for the device's reply, the time it waits is exactly the `k` cycles the request spends in the device. So we can simply let the delay module maintain a counter and add `r` to it in every cycle it waits. For a given processor frequency and device frequency, `r` is a fixed value and can therefore be hardcoded directly into the RTL code. After the delay module receives the reply from the device, it enters a waiting state, decrements the counter by 1 per cycle, and forwards the reply to the upstream module when the counter reaches 0.

Now let us consider the case where `r` is not an integer.
The fractional part is inconvenient to handle directly, and simply truncating it introduces a large error, so we need to find a way to account for the fractional part as well.
In fact, we can introduce a scaling factor `s` and add `r * s` to the counter in every cycle. At the end of the accumulation the counter thus holds `y = r * s * k`; we then update the counter to `y / s` before entering the waiting state. A minimal sketch of this counting logic is shown below.
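The following fragment is only one possible way the accumulation and countdown described above might look in RTL; it is not the required implementation of `apb_delayer.v`. It assumes, purely for illustration, `r = 5.5` and `s = 16` (so `r * s = 88` can be hardcoded), and uses the placeholder signals `request_pending` and `device_done` to stand for the actual APB handshake state.

```verilog
// assumed example values: r = 5.5, s = 16, so r * s = 88 (truncated to an integer)
localparam R_MUL_S = 88;

reg [31:0] counter;
reg        waiting;  // 1: the device has replied, keep stalling the upstream reply

always @(posedge clock) begin
  if (reset) begin
    counter <= 32'd0;
    waiting <= 1'b0;
  end else if (!waiting) begin
    if (device_done) begin
      // after k device cycles the counter holds roughly k * r * s;
      // dividing by s is just a right shift when s is a power of two
      counter <= counter >> 4;
      waiting <= 1'b1;
    end else if (request_pending) begin
      counter <= counter + R_MUL_S;  // one more cycle spent in the device
    end
  end else begin
    if (counter <= 32'd1) begin
      counter <= 32'd0;
      waiting <= 1'b0;               // now forward the buffered reply upstream
    end else begin
      counter <= counter - 32'd1;    // stall until about k * r processor cycles have passed
    end
  end
end
```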
Since `s` is a constant, the product `r * s` can also be hardcoded directly into the RTL, just like `r` was.
Of course, `r * s` may still not be an integer, so we truncate it to one.
In theory this still introduces some error, but it can be shown that the error is much smaller than before.
The value of `y`, however, is computed dynamically and cannot be hardcoded into the RTL,
so for a general `s`, computing `y / s` would require a division.
You should quickly realize that we can choose a special `s`, namely a power of two, so that the division becomes a simple right shift, as the sketch above already does.
In this way, we reduce the error to about 1/s of what it was:
only when the error accumulated during the accumulation stage reaches `s` does the error of the new method grow by 1.

Back to the current ysyxSoC: SDRAM uses the APB interface, so we need to implement a delay module for APB. ysyxSoC already includes a framework for the APB delay module, integrated upstream of the APB Xbar, where it captures all APB access requests, including those to SDRAM. However, the framework does not provide a concrete implementation of the delay module, so by default there is no delay effect. To calibrate the SDRAM access latency in ysyxSoC, you still need to implement the functionality of the APB delay module yourself.

> #### todo:: Calibrate the memory access delay
>
> Implement the APB delay module in ysyxSoC as described above, to calibrate the memory access delay of the simulation environment.
> Specifically, if you develop in Verilog, implement the corresponding code in `ysyxSoC/perip/amba/apb_delayer.v`;
> if you develop in Chisel, implement the corresponding code in the `APBDelayerChisel` module in `ysyxSoC/src/amba/APBDelayer.scala`, and change the `Module(new apb_delayer)` in `ysyxSoC/src/amba/APBDelayer.scala` to instantiate the `APBDelayerChisel` module instead.

> #### todo:: Find the highest synthesis frequency
>
> In addition to frequency, area is another evaluation metric of a circuit.
> At the standard-cell level of the process library, the two essentially constrain each other:
> among cells with the same function, if you want a cell with lower logic delay, you need more transistors, and hence more area, to give it stronger drive strength.
> Given this trade-off between area and frequency, the synthesizer generally uses as little area as possible to reach a given target frequency; it does not separately explore the maximum frequency a given circuit could achieve.
> If the quality of your circuit is relatively high, you may observe that the frequency in the synthesis report keeps increasing as the target frequency increases,
> although the overall area of the circuit increases as well.
> Therefore, if you do not care about area cost for now, you can set a relatively high target frequency for the synthesizer to try to reach.
>
> In processor design, some factors set an upper bound on the processor frequency:
> 1. The read delay of the register file. The read operation of the register file usually has to complete within one cycle and cannot be split across multiple cycles, so the processor frequency will not exceed the maximum operating frequency of the register file.
> 2. The delay of an adder whose bit width equals the processor word length. 
Usually the addition operation of EXU needs to be completed in one cycle, if the addition operation takes multiple cycles to complete, it will significantly reduce the execution efficiency of all instructions containing the addition operation. Including addition instruction, subtraction instruction, memory access instruction (need to calculate the memory access address), branch instruction (need to calculate the target address), even 'PC + 4' calculation, which makes the IPC of the program significantly reduced. Therefore, the processor frequency will not exceed the maximum operating frequency of this adder. +> 3. Read/write delay of SRAM. As a fully customized unit, SRAM cannot be logically designed to optimize its read/write latency. +> Therefore, as long as SRAM is used, the processor's frequency will not exceed the maximum operating frequency of SRAM. +> +> You can write some simple small modules to individually evaluate the maximum operating frequency of these parts. +> In order to avoid the I/O ports, you need to insert some triggers in both the input and output of these parts. +> SRAM as a fully customized unit, its maximum operating frequency is usually recorded in the corresponding manual, and we do not use SRAM at present, you can not carry out the evaluation of SRAM. +> +> After evaluation, you can set the integrated target frequency of the processor higher than the maximum operating frequency of the above parts. +> To guide the synthesizer to synthesize results at higher frequencies as much as possible. +> Of course, you can also set the target frequency directly to a value that is difficult to achieve, such as 5000MHz, however, we recommend that you go through the above evaluation process to find out the maximum operating frequency of these components. + + +> +> #### comment: Programmable counter increments +> +> The 'r' mentioned above is a constant for RTL design, but not in complex processors that can be dynamically tuned. For this complex processor, we need to make 'r' programmable by storing it in a device register. After the software performs dynamic frequency modulation, the new 'r' is written to the device register. +> Therefore, we also need to map this device register to the address space of the processor. +> Make it accessible to the processor via the SoC. Of course, s can also be designed to be programmable. +> However, this requires a lot of changes to ysyxSoC, so we won't require you to implement the above programmable features. + + +> +> #### todo:: Rediscover the optimization bottleneck +> +> After adding the delay module, rerun some tests and collect the performance counter statistics, +> Then look for performance bottlenecks according to Amdahl's law. + + +> +> #### todo:: Evaluate the performance of NPCS +> +> After adding the delay module, run microbench train-scale tests, record various performance data, includes frequency information and various performance counters. +> +> After calibrating the memory access delay, running microbench train-scale tests in ysyxSoC is expected to take several hours. +> But we will get performance data very close to the streaming environment. +> Then you can re-evaluate and record performance data each time you add a feature, to help you tease out the performance benefits of each feature. + + + +> #### danger:: Records performance data +> +> Next, we ask you to record performance data after each evaluation. +> If you apply for the sixth streaming film, you will submit this part of the record. 
+> If the recorded situation does not match the actual development process, you may not get the streaming opportunity. +> We want to force you to recognize and understand how NPC behave in this way, exercise the basic literacy of processor architecture design, +> Instead of just translating architecture diagrams from reference books into RTL code. +> +> Specifically, you can record as follows: +| commit | comment | simulation cycle | instruction count | IPC | synthesize frequency | synthesize area | performance counter 1 | performance counter 2 | ... | +> | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +> | 0123456789abcdef | eg, complement cache | 200000 | 10000 | 0.05| 750MHz | 16000 | 3527 | 8573 | ... | +> +> Where: +> +> * We ask you to add a rule 'make perf' to the Makefile in the NPC project directory. +> Enable future execution +> +> ```bash +> commit in the git checkout table +> make perf +> ``` +> +> Then, the performance data in the corresponding table can be reproduced. +> If the reproduced situation is materially inconsistent with the data recorded in the table, it may be judged as a violation of academic integrity if it cannot be reasonably explained. +> +> * You can assume that you are conducting a scientific study and that you are responsible for the experimental data during the study: +> Experimental data need to be reproducible and able to stand up to scrutiny in public. +> * You can create a new worksheet in the learning record called 'NPC Performance Evaluation Results' to record these performance data +> * You can replace 'performance counter 1' and 'performance counter 2' with the actual names of the respective performance counters +> * You can record more performance counters according to the actual situation +> * You can also record your analysis of performance data in the description column +> * We encourage you to record as many entries of performance data as possible to help you quantify the performance changes of NPC. + + +> +> #### question: Is it worth optimizing the frequency? +> +> After calibrating the memory access delay against the main frequency, you will see a significant decrease in IPC. +> It can be expected that if the frequency increases further, the number of cycles of the access delay will also increase, resulting in a decrease in IPC. +> So, is the frequency worth optimizing? +> If so, where do the performance gains from optimizing the frequency come from? +> If it is not worth it, where is the performance regression caused by optimizing the frequency? +> Try analyzing your guess with performance counters. + + +> +> #### comment:: Calibrates the memory access delay on the FPGA +> +> Modern FPGas typically contain DDR memory controllers. +However, limited by the implementation principle of FPGA, the CPU frequency of PL part also has a big gap with the ASIC process. +> Even the CPU frequency is lower than the memory controller frequency. For example, a memory controller can run at 200MHz, but the CPU can only run at hundreds or even tens of MHz on the FPGA. In real chips, the CPU can usually run at more than 1GHz (for example, the third-generation Xiangshan is targeting 3GHz). Obviously, the performance data obtained in such an evaluation environment is significantly distorted for CPU performance tests targeting the stream film. +> Solving the above memory frequency inversion situation by calibrating the memory access delay, is an issue that a processor enterprise must address before using an FPGA to evaluate CPU performance. 
+> +> In fact, since real DDR is a complex system, even with the delay module scheme described above, there are also more questions to consider: +> +> * Due to the differences in analog circuit components, the phy module of the memory controller in the FPGA is different from the memory controller of the ASIC. +> This may affect access latency +> * Limited by the implementation principle of FPGA and the configurable range of PLL on FPGA, +> DDR controllers also operate less frequently than ASIC memory controllers, +> However, DDR particles cannot be reduced in equal proportion, resulting in inaccurate memory access delay +> * After the DDR controller downfrequency, its refresh rate and other parameters are also inconsistent with the ASIC memory controller +> +Therefore, to solve the memory frequency inversion problem well on the FPGA is also a big challenge in the industry. +For example, the Xiangshan team set up a group led by engineers to solve the problem. +> +> In fact, whether you need to calibrate the memory access delay of the FPGA depends on the usage scenario and the target of the FPGA: +> +> * Teaching: Use FPGas only as an environment for functional testing. At this point, the role of the FPGA is to accelerate the simulation process, no matter what the ratio of running frequencies between the processor and the memory controller is, it theoretically does not affect the results of the functional test. +> * Competition or research project: Use the FPGA as the environment for performance testing, but also as the target platform. +> The stream is not targeted at this time, so there is no need to calibrate the memory access delay. +> * Enterprise product development: Use the FPGA as the environment for performance testing, but at the same time, the stream film as the target. +> We would expect the performance data derived from the FPGA to be as consistent as possible with the real chip. +> In this case, calibrating the memory access delay of the FPGA will be indispensable. +> "A core for life" although a lot of simplification, but in general, I still hope that everyone can appreciate the general process of enterprise product development, +And given the engineering challenges of calibrating real DDR controllers, we don't require you to use FPGA. +> In contrast, calibrating memory access latency in a simulation environment is much easier than in an FPGA. Therefore, we recommend that you perform performance evaluation and optimization in a simulation environment. + + +> #### todo:: Improve the efficiency of functional testing +> +> It is appropriate to use the ysyxSoC simulation environment after the calibrated memory access delay for performance evaluation. But you will also feel that the simulation efficiency of this environment is significantly lower than that of the previous' riscv32e-npc ': +> From the running time of microbench's Train-scale tests, +> The simulation efficiency of 'riscv32e-npc' is tens or even hundreds of times that of 'riscv32e-ysyxsoc'. +> This reflects a trade-off: To get more accurate performance data, +> The more details you want to simulate (such as SDRAM controllers and SDRAM particles), +> Therefore, more time is spent in the process of simulation 1 cycle, resulting in lower simulation efficiency. +> Accordingly, the performance data obtained from 'riscv32e-npc' with higher simulation efficiency is inaccurate. +> Is' riscv32e-npc 'meaningless? +In fact, we can use 'riscv32e-npc' as a functional testing environment. 
+> If there is a bug in 'riscv32e-npc', +> Then this bug also has a high probability of existing in 'riscv32e-ysyxsoc', +But obviously debugging this bug in the more efficient simulation 'riscv32e-npc' is a more appropriate solution. +In this way, we can give full play to the advantages of the two simulation environments, learn from each other, +> Improve the overall efficiency of development and testing. +> Try to modify the relevant simulation flow to support NPC emulation in 'riscv32e-npc' and 'riscv32-ysyxsoc'. +> Where, 'riscv32e-npc' still takes' 0x8000_0000 'as the PC value at reset. + + +## 4 kinds of optimization methods for classical architecture + + + +Once we have identified the performance bottleneck, we can consider how to optimize it. +There are four main types of optimization methods in classical architecture: + + 1. Locality - Use the nature of data access to improve command supply and data supply capabilities. The typical technique is cache. + 2. Parallel - multiple instances work at the same time to improve the overall processing capacity of the system. There are many categories of parallel methods: + +* Instruction level parallelism - executing multiple instructions at the same time, related techniques include pipelining, multi-emission, VLIW, and out-of-order execution +* Data level parallelism - accessing multiple data at the same time, related techniques include SIMD, vector instruction/vector machine +* Task-level parallelism - executing multiple tasks at the same time, related technologies include multi-threading, multi-core, multi-processor, and multi-process; GPU belongs to SIMT, which is a parallel method between data level parallelism and task level parallelism + 3. Prediction - Perform a choice speculatively when you don't know the right choice, then check whether the choice is correct, and if the prediction is correct, you can reduce the delay in waiting, thereby achieving performance gains. If the prediction is wrong, it needs to be recovered through additional mechanisms. Typical techniques include branch prediction and cache prefetch + 4. Accelerators - The use of specialized hardware components to perform specific tasks, thereby improving the efficiency of the execution of that task, some examples include: + +* AI Accelerator - Acceleration of the calculation process of AI loads, usually accessed via a bus +* Custom extension instructions - The accelerator is integrated into the CPU and accessed through new custom extension instructions +* Multiplier and divider - can be seen as a class of accelerators, treating RVM as an extension of RVI. The special hardware module of multiplication and division is controlled by the multiplication and division command to accelerate the calculation process of multiplication and division + + + +> #### caution:: Re-examine the processor architecture design +> Many electronics majors are likely to start off thinking of processor architecture design as "developing a processor with RTL." But RTL coding is only one part of the processor design process, and strictly speaking does not belong to the scope of processor architecture design. +> +> In fact, a qualified processor architect should have the following abilities: +> +> 1. Understand how the program runs on the processor +> 2. For the features that support the operation of the program, it can determine whether they are suitable for implementation at the hardware level or the software level +> 3. 
For the features suitable for implementation at the hardware level, it can propose a set of design schemes that still meet the target requirements under the balance of various factors +> These capabilities reflect the fundamental purpose for which people use computers: to solve real needs through programs. If a feature added at the hardware level has a low benefit to the program, or even if the program does not use the feature at all, +> The decision makers of the relevant solutions are really not professional architects. +> +> In fact, these abilities need to be deliberately practiced. +> We have met many students who can translate the block diagram of the pipeline processor into RTL code according to some references. But there is no way to assess whether a program is running as expected, or how to further optimize or implement new requirements; +Some students have designed an out-of-order superscalar processor, but the performance is not as good as the textbook five-level pipeline. +> This shows that the ability to design architecture is not the same as the ability to code RTL. +> Maybe these students did understand the basic concepts of pipelining and out-of-order superscalar during the development process, but it lacks a global vision and understanding, and only focuses on improving the efficiency of back-end computing. +> Little or no attention to instruction and data supply, +> Lead to the memory access capacity of the processor is much lower than the computing capacity, and the overall performance is poor. Therefore, even if you can design a correct out-of-order superscalar processor, it is not a good processor. To some extent, these students also do not have the ability to design processor architecture. +> +> "A core for Life" tries to exercise your processor architecture design ability from another idea: +> First of all, run the program and understand every detail of program operation from the perspective of software and hardware collaboration; +Then learn the basic principles of processor performance evaluation and understand the microcosmic manifestation of program behavior at the hardware level; +Finally, learn the various architectural optimization methods and use scientific evaluation methods to understand the real benefits of these optimization methods on the operation of the program. +> +> This program of study is very different from textbooks. This is because processor architecture design skills can only be practiced. However, the theoretical classroom using textbooks is limited by the curriculum system and cannot examine the students' ability of architecture design. +> So it's easy to get started with a textbook or reference book, but if you want to become a professional in this direction, understand the limits of these books, +> Cross the boundaries of books at the necessary stages to develop real architecture design skills through targeted training. + + +## Memory hierarchy and locality principle + + + +After calibrating ysyxSoC's memory access delay, you should find that the performance bottleneck is the instruction supply: +Take an instruction to wait for tens of hundreds of cycles, the pipeline can not flow. +To improve the ability of instruction supply, the most suitable is to use caching technology. Before introducing caching, however, we need to understand the storage hierarchy and locality principles of computers. + + +### Memory Hierarchy + + + +There are different storage media in computers, such as registers, memory, hard disks and magnetic tape. 
They have different physical characteristics, so the various indicators are also different. They can be evaluated in terms of access time, capacity, and cost. + +``` +access time /\ capacity price + / \ + ~1ns / reg\ ~1KB $$$$$$ + +------+ + ~10ns / DRAM \ ~10GB $$$$ + +----------+ + ~10ms / disk \ ~1TB $$ + +--------------+ + ~10s / tape \ >10TB $ + +------------------+ +``` + + + +* Register. The access time of the register is very short, basically consistent with the main frequency of the CPU. Current commercial-grade high-performance cpus are clocked at about 3GHz, so register access times are less than 1ns. Registers are small in size, usually less than 1KB. For example, the RV32E has 16 32-bit registers and is 512b in size. +In addition, the manufacturing cost of registers is more expensive, if a large number of registers are used, it will occupy a lot of flow area. +* DRAM. The access time of DRAM is about 10ns, its capacity is much larger than the register, and it is often used in memory. Its cost is also much lower, the price of 16GB memory on an e-commerce platform is 329 yuan, about 20 yuan /GB. +* Mechanical hard drives. The access time of mechanical hard drives is limited by their mechanical components, such as platters that need to be rotated, usually in the order of 10ms. In contrast, mechanical hard drives also have a larger capacity, often up to several terabytes; +Its cost is also cheaper, the price of 4TB mechanical hard disk on an e-commerce platform is 569 yuan, about 0.139 yuan /GB. +* SSD. SSD is also a popular storage medium, and its storage unit uses NAND flash. +Works based on electrical properties, so access is much faster than mechanical hard drives, and its read latency is close to DRAM, +However, due to the characteristics of flash units, write latency is still much higher than DRAM. Its cost is slightly higher than the mechanical hard disk, the price of 1TB solid state disk on an e-commerce platform is 699 yuan, about 0.683 yuan /GB. +* Tape. The storage capacity of tape is very large, the cost is very low, but the access time is very long, about 10s. Therefore, it is rarely used at present and is usually used in data backup scenarios. The price of a 30TB tape drive on an e-commerce platform is 1000 yuan, about 0.033 yuan /GB. + +\ + +It can be seen that due to the limitations of the physical characteristics of the storage medium, no memory can meet various indicators of large capacity, fast speed and low cost at the same time. Therefore, computers usually integrate a variety of memories and organize them organically through certain technologies to form a storage hierarchy. +On the whole, it achieves the comprehensive index of large capacity, fast speed and low cost. It sounds a little crazy, but the key is how to organize the various kinds of memory. + + +### principle of locality + + + +In fact, the above way of organization is exquisite, and the secret is the principle of locality of the procedure. 
Computer architects have found that a program's accesses to memory over a period of time are usually concentrated in a small region:

* Temporal locality - after a storage location is accessed, it is likely to be accessed again within a short period of time
* Spatial locality - after a storage location is accessed, its neighboring locations are likely to be accessed within a short period of time



These phenomena are related to the structure and behavior of programs:

* Programs spend most of their time executing sequentially or in loops, which exhibits spatial locality and temporal locality, respectively
* When writing a program, related variables are placed near each other in the source code, or are organized into structures; the compiler will usually also allocate nearby storage locations for them, which exhibits spatial locality
* The number of variable accesses during program execution is usually no less than the number of variables (otherwise there would be unused variables), so some variables must be accessed multiple times, which exhibits temporal locality



> #### option:: Observe program locality
>
> Program locality is a property of memory accesses, so naturally we can observe it through mtrace!
> Run some programs in NEMU and collect their mtrace.
> After that, post-process the mtrace output and try to visualize your results with some plotting tool.



> #### question:: The locality of the linked list
>
> Is there locality when traversing a linked list?
> Try to compare which has better locality: accessing array elements or accessing linked list elements.



The principle of locality tells us that a program's accesses to memory are concentrated. Even though the capacity of the slow memory is large, the program only accesses a small portion of its data over a period of time. In this case, we can first move this data from the slow memory into the fast memory, and then access it in the fast memory.



This is the trick behind organizing the various memories into a hierarchy:
the memories are arranged in levels, where an upper level is fast but small, and a lower level is large but slow;
when accessing data, the faster upper level is accessed first. If the data is present at the current level (called a hit), it is accessed directly at that level;
otherwise (called a miss), it is looked up at the next level, which passes the target data and its neighbors up to the upper level. Here, "passing the target data to the upper level" exploits temporal locality, expecting the target data to hit in the fast memory the next time it is accessed;
"passing adjacent data to the upper level" exploits spatial locality, expecting the adjacent data to also hit in the fast memory the next time it is accessed.



For example, when accessing DRAM, if the data is not present, the mechanical hard disk is accessed and the target data together with its adjacent data is brought into DRAM.
The next time this data is accessed, it hits in DRAM and is accessed there directly. In this way, we approximate a memory whose access speed is close to DRAM and whose capacity is close to a mechanical hard disk!
In terms of cost, taking the quotations from the e-commerce platform above as an example, a 16GB memory module plus a 4TB mechanical hard disk costs less than 900 yuan in total; if instead you wanted to buy 4TB of DRAM, you would need `329 * (4TB / 16GB) = 84,224` yuan!



Of course, there is no free lunch: achieving the above effect is conditional, and the design of the computer system needs to respect the principle of locality.
On the one hand, the computer system needs to design and implement a memory hierarchy;
on the other hand, programmers also need to write programs with good locality, so that they perform well in the memory hierarchy. If a program has poor locality and its accesses are not concentrated, most accesses will miss in the fast memory, and the performance of the whole system will be close to that of accessing the slow memory.


## Simple cache

### Cache introduction



Going back to the performance bottleneck above: to optimize instruction supply, what we actually need to do is improve the efficiency of accessing DRAM. To do this, following the idea of the memory hierarchy, we only need to add a layer of memory between the registers and DRAM. This is the idea of the cache:
before accessing DRAM, access the cache first; on a hit, access the data directly; on a miss, read the data from DRAM into the cache and then access it from the cache.



The above is the cache in the narrow sense, namely the processor cache (CPU cache). In fact, a cache in the broad sense is not only a hardware module on the memory access path; in computer systems, caches are ubiquitous: the disk controller also contains a cache for data read from the disk; the row buffer in SDRAM that we introduced earlier is essentially a cache of the SDRAM storage array; the operating system also maintains a software cache for storage devices such as disks, used to store recently accessed data - this cache is essentially a large array of structures allocated in memory, and the operating system is responsible for moving data between disk and memory; caching is also critical for distributed systems: if the data to be accessed is not cached locally, the remote side has to be accessed, which is the case for the browser's web cache and for video content caches.


>
> #### comment:: cache visible to software programs
>
> In some processors, the cache is visible to software.
> For example, shared memory in the CUDA GPU programming model is at the same position in the hierarchy as the CPU cache:
> both are a layer of memory between the registers and DRAM. But unlike CPU caches, GPUs provide memory access instructions specifically for accessing shared memory, so a GPU program can use instructions to read data from memory into a specified location in shared memory.



For the sake of description, we will call the data read from DRAM a data block,
and the data blocks stored in the cache are called cache blocks (some textbooks also call them cache lines).
Naturally, designing a cache requires answering the following questions:

* What size should the data block be?
* How do we check whether a cache access request hits?
* The capacity of the cache is usually smaller than that of DRAM. How do we maintain the mapping between cache blocks and data blocks in DRAM? What do we do when the cache is full?
* The CPU may perform write operations that update the data in a data block. How should the cache handle this?


### Simple instruction cache



Let's start with the instruction cache (icache). The icache is read-only, because the IFU's instruction fetch never writes to memory, so for now we can ignore how to handle CPU writes to data blocks. For the block size, let's first take the length of one instruction, i.e. 4B. This may not be the best design, but anything smaller than 4B is certainly a bad design for an icache, because fetching a new instruction would then require multiple memory accesses. Whether something larger than 4B is better is something we will evaluate later.



To check whether an access request hits in the cache, the cache naturally needs to record some properties of each block in addition to the block data itself. The most direct way is to record a unique number for each block, and we also want this number to be easy to compute.
Since the blocks come from memory, we can number memory in units of the block size: the number of the block containing memory address `addr` is `addr / 4`, and this number is called the tag of the block. So we just compute the tag of the requested address and compare it with the tag of each cache block to know whether the target block is in the cache.



Next, consider how cache blocks are organized.
According to the memory hierarchy, the cache capacity cannot be as large as DRAM, but it is usually larger than a single cache block, so we need to decide which cache block a newly fetched data block should be placed into. Since there are multiple cache blocks, we can number them as well. The simplest organization is to always place a new data block into one fixed cache block; this is called direct-mapped.
For this we need to define the mapping from a memory address `addr` to a cache block number. If the cache can hold `k` cache blocks, a simple mapping is `cache block number = (addr / 4) % k`. That is, the data block containing memory address `addr` is read into the cache block numbered `(addr / 4) % k`.



Obviously, multiple data blocks may map to the same cache block. In that case we need to decide whether to keep the existing cache block or to read the new data block into it. According to the locality principle, the newly accessed block is more likely to be accessed again in the near future, so when a new data block is read in, it should replace the existing cache block. This allows the new block to hit in the cache during the following period of time.



We can treat all cache blocks as an array whose index is the cache block number, so the cache block number is also called the block index. For a direct-mapped cache with a block size of `b` bytes and `k` cache blocks in total, we have `tag = addr / b` and `index = (addr / b) % k`.
To simplify the computation, `b` and `k` are usually taken as powers of 2; assume `b = 2^m` and `k = 2^n`. For a 32-bit `addr`, we then have `tag = addr / 2^m = addr[31:m]` and `index = (addr / 2^m) % 2^n = addr[m+n-1:m]`. As you can see, the index is just the low `n` bits of the tag. In a direct-mapped cache, data blocks with different indices are necessarily mapped to different cache blocks, so two blocks that land in the same cache block must already agree on the index bits. Therefore, only the high bits `addr[31:m+n]` need to be recorded as the tag.
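

To make the bit fields above concrete, here is a small C sketch (purely illustrative, not part of any framework code) that decomposes a 32-bit address into the stored tag and the index derived above, together with the in-block offset (the low `m` bits) introduced next, for the configuration used below: `b = 2^2 = 4` bytes per block and `k = 2^4 = 16` blocks:

```c
#include <stdint.h>
#include <stdio.h>

#define M 2   /* block size  = 2^M = 4 bytes   */
#define N 4   /* block count = 2^N = 16 blocks */

/* Decompose addr = { tag = addr[31:M+N], index = addr[M+N-1:M], offset = addr[M-1:0] } */
static uint32_t addr_offset(uint32_t addr) { return addr & ((1u << M) - 1); }
static uint32_t addr_index (uint32_t addr) { return (addr >> M) & ((1u << N) - 1); }
static uint32_t addr_tag   (uint32_t addr) { return addr >> (M + N); }

int main(void) {
  uint32_t addr = 0x80001234;
  printf("addr = 0x%08x -> tag = 0x%x, index = %u, offset = %u\n",
         addr, addr_tag(addr), addr_index(addr), addr_offset(addr));
  return 0;
}
```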



A memory address can thus be divided into the following three parts: tag, index, offset.
The tag uniquely identifies the data block, the index selects the cache block that the data block is placed into, and the offset is the in-block offset indicating which part of the data within the block is to be accessed.

```
 31       m+n m+n-1     m m-1      0
+------------+-----------+--------+
|    tag     |   index   | offset |
+------------+-----------+--------+
```



Finally, right after reset there is no data in the cache, and all cache blocks are invalid. To indicate whether a cache block is valid, we add a valid bit to each cache block.
The valid bit and the tag together are referred to as the cache block's metadata, i.e. data used to manage data; here, the managed data is the cache block.



In summary, the icache workflow is as follows:

 1. The IFU sends an instruction fetch request to the icache
 2. After the icache receives the fetch address, it indexes a cache block using the index bits, and checks whether the tag of that cache block matches the tag of the requested address and whether the cache block is valid. If both conditions hold, the access hits; go to Step 5
 3. Otherwise, read the requested data block from DRAM through the bus
 4. Fill the data block into the corresponding cache block and update the metadata
 5. Return the fetched instruction to the IFU



Once you have sorted out the workflow, you should know how to implement the icache: a state machine! The workflow even includes a bus access, so the icache implementation can also be seen as an extension of the bus state machine. You are already familiar with implementing the bus, so we leave it to you to work out the icache state machine!


>
> #### todo:: Implements the icache
>
> Implement a simple icache with a block size of 4B and 16 cache blocks.
> In general, the cache's storage array (including data and metadata) is implemented with SRAM. However, using SRAM in an ASIC flow involves selecting and instantiating SRAM macros, and the chosen macro may constrain how data and metadata are stored.
> As this is your first cache exercise, for simplicity, implement the storage array with flip-flops, which gives you more flexibility.
> During implementation, you are advised to make the related parameters configurable, to facilitate the later evaluation of different configurations.
> After implementation, try to evaluate its performance.



> #### todo:: An address space suitable for caching
>
> Not all address spaces are suitable for caching; only memory-type address spaces are.
> Besides, the access latency of the SRAM address space is only 1 cycle, so it does not need to be cached either;
> leaving the cache blocks to other address spaces is a more appropriate choice.



> #### todo:: Estimate the ideal return of dcache
>
> Typically, the LSU is also paired with a cache, called the data cache (dcache).
> Before implementing a dcache, we can first estimate its performance benefit under ideal conditions.
> Assume that the dcache capacity is unlimited, the dcache hit ratio is 100%, and the dcache access latency is 1 cycle.
> Try to estimate the performance benefit of adding such a dcache based on your performance counters.
> If your estimate is correct, you should find that adding a dcache at this point is not worth it; we will come back to this issue below.
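

To recap the workflow in executable form, here is a minimal C behavioural model of the icache described above (an illustrative sketch only: the parameters match the exercise, and `mem_read()` is a made-up stand-in for the bus access, not a real ysyxSoC interface). A model like this can later double as a reference when checking or evaluating your RTL:

```c
#include <stdint.h>
#include <stdbool.h>

#define NR_BLOCK 16                    /* 16 cache blocks, 4B each            */

typedef struct {
  bool     valid;                      /* metadata: valid bit                 */
  uint32_t tag;                        /* metadata: tag = addr[31:6]          */
  uint32_t data;                       /* one 4-byte block = one instruction  */
} icache_block_t;

static icache_block_t icache[NR_BLOCK];

/* Stand-in for the bus access to DRAM (Step 3); replace with a real memory model. */
static uint32_t mem_read(uint32_t addr) { return addr ^ 0xdeadbeef; /* dummy data */ }

/* Steps 1-5: receive a fetch address, check for a hit, refill on a miss, return the instruction. */
uint32_t icache_fetch(uint32_t addr) {
  uint32_t index = (addr >> 2) % NR_BLOCK;
  uint32_t tag   = addr >> 6;          /* 2 offset bits + 4 index bits        */
  icache_block_t *b = &icache[index];

  if (!(b->valid && b->tag == tag)) {  /* miss: Steps 3 and 4                 */
    b->data  = mem_read(addr & ~0x3u);
    b->tag   = tag;
    b->valid = true;
  }
  return b->data;                      /* Step 5: reply to the IFU            */
}
```

In RTL this becomes a state machine wrapped around your existing bus state machine: an idle/compare state for Steps 1-2, the bus access states for Steps 3-4, and a response state for Step 5.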
+ + +## formal verification + + + +With DiffTest, it should be easy to ensure that a given program still runs correctly after you plug in icache. But how do you ensure that icache works correctly for any program? + + + +This seems like a very difficult question, and I'm sure you've encountered this situation: +The code runs a given test case correctly, but any day it runs a different test, something goes wrong. Whether it is analyzed in principle or summarized in practice, it is impossible to prove the correctness of a module by testing alone. Unless these test cases cover all programs, or all input cases of the module under test. The number of programs is infinite, and it's not practical to test them all. However, the input of the module under test is limited, and it is at least theoretically possible to iterate over all the inputs. + + + +If you want to iterate over all the inputs in a module, on the one hand, you want to generate a test set that covers all the input cases, +On the other hand, there needs to be a way to determine whether a specific input is correct. Even if this can be done, it takes a long time to run all the tests, which is often unbearable. The equivalence class testing method in software testing theory can classify the tests whose essential behavior is similar. Select one test from the equivalence class to represent the tests of the entire equivalence class, thereby reducing the size of the test set. However, how equivalence classes should be divided is decided manually according to the logic of the module under test. But according to another popular piece of software advice, any process that requires human intervention is at risk of going wrong. + + +### The basic principles of formal verification + + + +So, can we have tools that automatically find test cases for us? There is such a tool! +A Solver is a class of mathematical tools that find feasible solutions under given constraints. It is essentially similar to solving mathematical problems such as equations or linear programming. For example, [Z3][z3] is a solver for the [Satisfiablity Modulo Theories, SMT][smt] problem. It can solve whether a proposition containing real numbers, integers, bits, characters, arrays, strings and so on is true. +In fact, as long as the problem can be expressed as a subset of the first-order logic language, it can be handed over to the SMT solver to solve. Therefore, SMT solvers can also be used to solve complex problems like Sudoku. SMT solvers are widely used in automatic theorem proving, program analysis, program verification and software testing. +Here is an example of using Z3 to solve a system of equations in python. + +```python +#!/usr/bin/python +from z3 import * + +x = Real('x') # define variable +y = Real('y') +z = Real('z') +s = Solver() +s.add(3*x + 2*y - z == 1) # define constraints +s.add(2*x - 2*y - 4*z == -2) +s.add(-x + 0.5*y - z == 0) +print(s.check()) # Find out if there is a feasible solution: sat +print(s.model()) # Output feasible solution: [y = 14/25, x = 1/25, z = 6/25] +``` + +[z3]: https://github.com/Z3Prover/z3 +[smt]: https://en.wikipedia.org/wiki/Satisfiability_modulo_theories + + + +In the field of test verification, there is a class of solver based verification methods called formal verification. Its core idea is to take the design as the constraint condition, the input as the variable, and "at least one verification condition is not valid" as the solution goal. 
Express this in a first-order logic language, translate it into a language that the solver recognizes, and then try to get the solver to find out if there is a viable solution. For example, if a design has two validation conditions: assert(cond1) and assert(cond2), attempts to have the solver look for the presence of inputs such that '! cond1 || ! cond2 'is established. If a feasible solution exists, it means that the solver has found a test case that violates the verification condition, and this counterexample can help us debug and improve the design. If the feasible solution does not exist, it means that all the inputs do not violate the verification condition, thus proving the correctness of the design! It can be seen that whether the solver can find a viable solution is excellent news for the designer. + + +> +> #### caution:: Don't trust 100% coverage reports for UVM testing +> +> If you know UVM, you should know that the goal of UVM is to improve coverage. But if you think that improving coverage is the ultimate goal of test validation, you probably don't know much about test validation. +> +> In fact, the ultimate goal of test validation is to prove the correctness of the design, or to find all the bugs in the design. But experienced engineers know that even if 100% coverage is achieved, there may still be some bugs that are not detected. +> And it's impossible to estimate how many of these undetected bugs are out there. +> The goal of coverage is widely adopted by enterprises, in part because coverage is an easily quantifiable and statistical indicator. +> If you use strict language to describe "an event is covered ", it is +> ``` +> There is a test case that ran successfully, and the event was triggered during the run. +> ``` +> The event here can be execution of a line of code (line coverage), a signal flipping (flip coverage), +> The state of a state machine is transferred (state machine coverage), user-defined conditions are met (function coverage), and so on. +> and "coverage reaches 100%", yes +> ``` +> For each event, there is a test case that runs successfully and triggers the event during its run. +> ``` +> Note that we can override different events with different test cases. +> According to this definition, you only need to add some flags during the simulation to calculate the coverage. Even most RTL emulators (including Verilators) provide automatic coverage statistics, +> If you want to learn how to count coverage, just need RTFM. +> +> On the other hand, it can also be seen from the above definition that improving coverage is actually the lower limit of verification work, +> Too low coverage only indicates that the validation has not been done enough, which is consistent with "untested code is always wrong". +> But the ultimate goal of test validation is +> ``` +> Ran successfully for all test cases. +> ``` +> In contrast, "100% coverage" is actually a necessary but not sufficient condition for "correct design", which is very loose. +> Even we can easily cite a counter-example: +> A module has two functions, for each function, has been covered by its own test cases, then the function coverage reaches 100%; +But when you run test cases that require the two functions to interact, something goes wrong. +> +> Achieving higher coverage certainly improves the probability of a correct design compared to low-coverage verification efforts. But what we want to say is that even if coverage reaches 100%, it's still far from enough. 
Especially for complex systems, +> Some hidden bugs are usually triggered when multiple boundary conditions are met at the same time. +> Instead of insisting on 100% coverage as the verification goal, we encourage people to actively think about how to find more potential bugs through other methods and techniques. +> The practical significance of the verification work is also greater. + + +### Simple example of formal validation + + +#### Formal validation process based on Chisel + + + +Chisel's testing framework [chiseltest][chiseltest] has integrated formal validation capabilities. You can translate FIRRTL code into a certain language that is recognized by Z3 and have Z3 prove that a given 'assert()' is correct. If the counterexample can be found, it is very convenient to generate the waveform of the counterexample to assist debugging. With formal validation tools, we no longer have to worry about incomplete test case coverage, or even the need to write test cases. Nice! + +[chiseltest]: https://github.com/ucb-bar/chiseltest + + +Here is an example of formal validation of the Chisel module: + +```scala +import chisel3._ +import chisel3.util._ +import chiseltest._ +import chiseltest.formal._ +import org.scalatest.flatspec.AnyFlatSpec + +class Sub extends Module { + val io = IO(new Bundle { + val a = Input(UInt(4.W)) + val b = Input(UInt(4.W)) + val c = Output(UInt(4.W)) + }) + io.c := io.a + ~io.b + Mux(io.a === 2.U, 0.U, 1.U) + + val ref = io.a - io.b + assert(io.c === ref) +} + +class FormalTest extends AnyFlatSpec with ChiselScalatestTester with Formal { + "Test" should "pass" in { + verify(new Sub, Seq(BoundedCheck(1))) + } +} +``` + + + +> #### danger:: No longer use Utest +> +> As the Chisel version has evolved, Utest is no longer supported, so we also recommend that it not be used. +> If you get the code for 'chisel-playground' before 2024/04/11 01:00:00, +> Please modify your build.sc by referring to 'object test' in [new build.sc] [new build.sc]. + +[new build.sc]: https://github.com/OSCPU/chisel-playground/blob/master/build.sc + + + +The 'Sub' module of the above example implements the function of complement subtraction by "inverting and adding 1". In order to verify the correctness of the implementation of the 'Sub' module, the code compares the result of the "inverse plus 1" calculation with the result obtained by the subtraction operator. We expect that 'assert()' should hold for any input. To demonstrate the effect of formal validation, we injected a bug into the implementation of the 'Sub' module: +The "add 1" operation is not performed when 'io.a' is' 2 ', in which case the result of complement subtraction is incorrect. + + + +The code also needs to pass in a 'BoundedCheck(1)' parameter when calling the formal validation function provided by chiseltest. This parameter specifies the number of cycles that the SMT solver needs to prove. For example, 'BoundedCheck(4)' means asking the SMT solver to try to prove that the module under test has been reset for 4 cycles. 'assert() 'is not violated under any input signal. For combinational logic circuits, we only need to let the SMT solver solve in 1 cycle. 
+ + +Before running the above tests, you also need to install 'z3' : + +```bash +apt install z3 +``` + +After installation, run the test through mill-i __.test with the following output: + +``` +Assertion failed + at SubTest.scala:16 assert(io.c === ref) +- should pass *** FAILED *** + chiseltest.formal.FailedBoundedCheckException: [Sub] found an assertion violation 0 steps after reset! + at chiseltest.formal.FailedBoundedCheckException$.apply(Formal.scala:26) + at chiseltest.formal.backends.Maltese$.bmc(Maltese.scala:92) + at chiseltest.formal.Formal$.executeOp(Formal.scala:81) + at chiseltest.formal.Formal$.$anonfun$verify$2(Formal.scala:61) + at chiseltest.formal.Formal$.$anonfun$verify$2$adapted(Formal.scala:61) + at scala.collection.immutable.List.foreach(List.scala:333) + at chiseltest.formal.Formal$.verify(Formal.scala:61) + at chiseltest.formal.Formal.verify(Formal.scala:34) + at chiseltest.formal.Formal.verify$(Formal.scala:32) + at FormalTest.verify(SubTest.scala:19) + ... +``` + + +The above information indicates that the solver found a test case that violated 'assert()' on cycle 0 after the reset. Further, developers can aid debugging with the waveform file test_and_run/Test_should_pass/Sub.vcd. After fixing the error in the 'Sub' module, re-running the above test will no longer output the error message. The representation solver cannot find a counterexample, which proves the correctness of the code. + + +#### Formal verification process based on Verilog + + + +chiseltest's formal validation process is to convert FIRRTL code into a language that is recognized by Z3. Verilog is not involved, so the above process does not support projects developed on Verilog. If you're developing with Verilog, you can use a formal validation process based on Yosys. [SymbiYosys][symbiyosys] is the front-end tool for this process. + +[symbiyosys]: https://symbiyosys.readthedocs.io/en/latest/ + + +Here is an example of formal validation of the Verilog module: + +```verilog +// Sub.sv +`define FORMAL + +module Sub( + input [3:0] a, + input [3:0] b, + output [3:0] c +); + + assign c = a + ~b + (a == 4'd2 ? 1'b0 : 1'b1); + +`ifdef FORMAL + always @(*) begin + c_assert: assert(c == a - b); + end +`endif // FORMAL + +endmodule +``` + + + +The 'Sub' module of the above example implements the function of complement subtraction by "inverting and adding 1". In order to verify the correctness of the implementation of the 'Sub' module, the code compares the result of the "inverse plus 1" calculation with the result obtained by the subtraction operator. We expect that 'assert()' should hold for any input. To demonstrate the effect of formal validation, we injected a bug into the implementation of the 'Sub' module: +The "add 1" operation is not performed when 'a' is' 2 ', then the result of complement subtraction is wrong. + + + +After writing the above 'Sub.sv' file, you also need to write the SymbiYosys configuration file '*.sby', which is generally composed of the following parts: + +* task: Optional to specify the task to be executed +* options: A must for matching 'assert', 'cover', etc. 
statements in the code to the model +* engines: Required to specify the model to be solved +* script: A required item that contains the Yosys script required for testing +* files: Required to specify the files for testing + +Here is an example of the configuration file 'Sub.sby' : + +```sby +[tasks] +basic bmc +basic: default + +[options] +bmc: +mode bmc +depth 1 + +[engines] +smtbmc + +[script] +read -formal Sub.sv +prep -top Sub + +[files] +Sub.sv +``` + + + +The above configuration has a 'depth' option to specify the number of cycles the SMT solver needs to prove. For example, 'depth 4' means having the SMT solver try to prove that the module under test has been reset for four cycles. 'assert() 'is not violated under any input signal. For combinatorial logic circuits, we only need to let the SMT solver solve in 1 cycle. + + +Before formal verification, you will need to download the appropriate tool from [this link][oss release]. After decompression, run the command line 'path-to-oss-cad-suite/bin/sby -f Sub.sby' for formal verification, and the output information is as follows: + +[oss release]: https://github.com/YosysHQ/oss-cad-suite-build/releases + +``` +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 Checking assumptions in step 0.. +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 Checking assertions in step 0.. +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 BMC failed! +SBY 16:52:19 [Sub_basic] engine_0: ## 0:00:00 Assert failed in Sub: c_assert +SBY 16:52:19 [Sub_basic] engine_0: Status returned by engine: FAIL +SBY 16:52:19 [Sub_basic] summary: Elapsed clock time [H:MM:SS (secs)]: 0:00:00 (0) +SBY 16:52:19 [Sub_basic] summary: Elapsed process time [H:MM:SS (secs)]: 0:00:00 (0) +SBY 16:52:19 [Sub_basic] summary: engine_0 (smtbmc) returned FAIL +SBY 16:52:19 [Sub_basic] summary: counterexample trace: Sub_basic/engine_0/trace.vcd +SBY 16:52:19 [Sub_basic] summary: failed assertion Sub.c_assert at Sub.sv:11.9-11.37 in step 0 +SBY 16:52:19 [Sub_basic] DONE (FAIL, rc=2) +SBY 16:52:19 The following tasks failed: ['basic'] +``` + + + +The above information indicates that the solver found a test case that violated 'assert()' on cycle 0 after the reset. Further, developers can use the waveform file 'Sub_basic/engine_0/trace.vcd' to aid debugging. After correcting the error in the 'Sub' module, re-running the above command will output a success message. The representation solver cannot find a counterexample, which proves the correctness of the code. + + +### Test icache with formal validation + + + +Our goal was to prove icache's correctness through formal verification, +Therefore, first we need to design the corresponding REF and determine the correctness condition. Cache is a technology that improves memory access efficiency. Therefore, it should not affect the accuracy of memory access results. That is, the behavior of the access request should be the same regardless of whether the cache is present. +Therefore, we can use one of the simplest memory access systems as REF. It receives memory access requests from the CPU and then accesses memory directly. In contrast, the DUT lets the access request pass through the cache. For correctness conditions, we simply check that the results returned by the read request are consistent. + + +Based on the above analysis, it is easy to write pseudo-code that validates the top-level module. Here we use Chisel as pseudo-code, but if you are developing with Verilog, you can still borrow some ideas to write validation top-level modules. 


```scala
class CacheTest extends Module {
  val io = IO(new Bundle {
    val req = new ...
    val block = Input(Bool())
  })

  val memSize = 128 // bytes
  val mem = Mem(memSize / 4, UInt(32.W))
  val dut = Module(new Cache)

  dut.io.req <> io.req

  val dutData = dut.io.rdata
  val refData = mem(io.req.addr)
  when (dut.io.resp.valid) {
    assert(dutData === refData)
  }
}
```



The pseudo-code above only gives the general framework; you need to fill in some details according to your own implementation:

* Mask out write operations, e.g. by tying the write-enable-related signals to `0`
* On a cache miss, read the data from `mem`; since the test never generates write operations, the DUT and the REF can share the same memory
* The REF reads data from `mem` without any delay, while the DUT needs several cycles to return data from the cache, so the timing of the `assert()` needs to be synchronized:
after the REF has read its data, wait until the DUT returns its read result before checking, which is easy to do with a state machine
* Since the formal verification tool explores all input combinations in every cycle, the input signals may change every cycle, so you may need registers to temporarily hold some results
* Using the fact that the formal verification tool explores all input combinations in every cycle,
we can define some `block` signals at the top of the test to check whether the AXI-related code works under random delays,
for example `dut.io.axi.ar.ready := arready_ok & ~block1` and `dut.io.axi.r.valid := rvalid_ok & ~block2`


>
> #### option:: Tests the icache implementation through formal validation
>
> Although this is not required, we strongly recommend that you try this modern verification method
> and experience the exhilaration of "using the right tool to solve the problem".
> Regarding the number of cycles that the SMT solver needs to prove (`BoundedCheck()` or `depth`), pick a parameter
> that lets the cache process 3~4 requests within the proven window,
> so that you can test whether the cache handles arbitrary consecutive requests correctly.



Formal verification seems to have only advantages, but it actually has a fatal weakness: state space explosion. As the design grows and the number of proven cycles increases, the space that the solver needs to explore becomes larger and larger. In fact, first-order logic is theoretically undecidable, and even its decidable subsets are usually NP-hard in terms of algorithmic complexity. This means that the solver's run time is likely to grow exponentially with the size of the design. For this reason, formal verification is commonly used for unit-level testing.


>
> #### comment:: Using the latest version of Z3
>
> Z3 is an open source project whose versions keep iterating.
> For mature software like a solver, the direction of iteration is naturally performance optimization;
> that is, a new version of Z3 may bring a significant performance improvement over an old one.
> If you are using a Linux distribution from a few years ago, the Z3 installed through `apt` may be quite old.
> If you want to improve the efficiency of Z3, you can try building and installing it from the [code repository][z3]; please refer to RTFM.



## Cache optimization



Since caching is mainly used to improve memory access efficiency, it is natural to evaluate a cache with memory-access-related metrics. The AMAT (Average Memory Access Time) is usually used to evaluate cache performance. Assuming the cache hit ratio is `p`:

```
AMAT = p * access_time + (1 - p) * (access_time + miss_penalty)
     = access_time + (1 - p) * miss_penalty
```


Here `access_time` is the access time of the cache, i.e. the time from receiving the access request to obtaining the hit/miss result, and `miss_penalty` is the cost of a cache miss, in our case the time needed to access DRAM.



This equation provides a guideline for optimizing cache performance:
reduce the access time `access_time`, increase the hit ratio `p`, or reduce the miss penalty `miss_penalty`. In the current NPC, the access time leaves little room for architectural optimization and is mostly determined by the concrete implementation, such as the number of cycles and the critical path. Therefore, we will focus on optimizing the hit ratio and the miss penalty.

> #### todo:: Measure AMAT
>
> Add appropriate performance counters to the NPC to measure the AMAT of the icache.


### Optimize hit ratio



To improve the hit ratio, i.e. to reduce the miss ratio, we first need to understand why cache misses occur.


#### The 3C model of cache misses



Computer scientist [Mark Hill][mark hill] proposed the 3C model in his [PhD thesis][mark hill phd thesis] in 1987,
which classifies cache misses into three types:

 1. Compulsory miss - a miss that would still occur even in a cache of unlimited capacity. Concretely, it is the miss that occurs when a data block is accessed for the first time.
 2. Capacity miss - a miss that cannot be eliminated without enlarging the cache capacity. It occurs because the cache cannot hold all the data the program needs to access.
 3. Conflict miss - a miss that is not caused by the above two reasons. It occurs because multiple data blocks map to the same cache block and replace each other.

[mark hill]: https://pages.cs.wisc.edu/~markhill/
[mark hill phd thesis]: https://www2.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-381.pdf



With the 3C model, we can propose targeted solutions for each type of miss to reduce the corresponding miss ratio.


#### Reduce Compulsory miss



To reduce compulsory misses, in principle a data block has to be read into the cache before it is accessed. The cache workflow above does not support this, so a new mechanism, called prefetching, is needed. However, since compulsory misses by definition only occur on the first access to a data block, their proportion is low when there are many accesses. Therefore, we will not discuss how to reduce compulsory misses here; interested students can search for and read materials about prefetching.


#### Reduce Capacity miss



To reduce capacity misses, by definition, the only way is to enlarge the cache capacity so as to make better use of temporal locality. However, a larger cache is not always better. On the one hand, a larger cache means a larger die area, which increases the tape-out cost.
On the other hand, the larger the storage array, the higher its access latency, which increases the cache access time and degrades cache performance. Therefore, in real projects, blindly increasing the cache capacity is not a reasonable solution; it has to be balanced against various other factors.


#### Reduce Conflict miss



To reduce conflict misses, we need to consider how to reduce the mutual replacement of cache blocks. The direct-mapped organization described above allows each data block to be read into only one cache block with a fixed index; if multiple data blocks share the same index, a later one will replace an earlier one. So one idea for reducing replacements is to adopt an organization that allows a data block to be read into more than one possible cache block.



At one extreme, each data block may be stored in any cache block; this is called fully associative. As for which cache block to use: an invalid cache block is used first, and if all cache blocks are valid, the choice is made by the replacement algorithm. Different replacement algorithms affect the cache hit ratio.
In general, the replacement algorithm should select the cache block that is least likely to be accessed in the future. For a given memory access sequence, we could design an optimal replacement algorithm that minimizes conflict misses, but in practice we cannot know the future access sequence in advance, so designing a replacement algorithm becomes a problem of predicting the future from the past: we need to predict, based on how each cache block was accessed in the past, which one is least likely to be accessed in the future. Common replacement algorithms include:

* FIFO - first in, first out: replace the cache block that was read in earliest
* LRU - least recently used: replace the cache block that has not been accessed for the longest time
* random - replace a randomly chosen cache block



With a suitable replacement algorithm, the fully associative organization is more likely to replace a cache block that will not be accessed for a long time, and thus reduces conflict misses the most. However, because a fully associative organization may store a data block in any cache block, it comes with two costs.
First, the access address no longer has an index part, so everything except the offset is the tag. We therefore need to spend more storage in the array to hold the tag of each cache block.

```
 31                m m-1      0
+-------------------+--------+
|        tag        | offset |
+-------------------+--------+
```



Second, when checking for a hit, the tags of all cache blocks must be compared, which requires many comparators and increases the area overhead. Due to these costs, the fully associative organization is generally only used when the number of cache blocks is small.



Set-associative organization is a compromise between direct mapping and full associativity. The idea is to divide all cache blocks into groups (sets): a group is selected by direct mapping, and within the group a cache block is selected fully associatively. That is, each data block can be stored in any cache block of the group numbered `tag % number_of_groups`. If each group contains `w` cache blocks, the cache is called `w`-way set-associative.
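

To make the set-associative lookup and replacement concrete, here is a small C sketch of a `w`-way lookup with LRU replacement. The parameters are made up for illustration, it is not a required design, and a timestamp-based LRU is just one simple way to realize "not accessed for the longest time":

```c
#include <stdint.h>
#include <stdbool.h>

#define NR_WAY       4          /* w = 4 ways per group                      */
#define NR_GROUP     8          /* 8 groups (sets)                           */
#define BLOCK_SHIFT  4          /* 16-byte blocks                            */
#define GROUP_SHIFT  3          /* log2(NR_GROUP)                            */

typedef struct {
  bool     valid;
  uint32_t tag;                 /* high address bits; group index bits are not stored */
  uint64_t last_use;            /* timestamp used by LRU                     */
} way_t;

static way_t cache[NR_GROUP][NR_WAY];
static uint64_t now = 0;

/* Returns true on a hit; on a miss, chooses a victim way by LRU and refills its metadata. */
bool cache_access(uint32_t addr) {
  uint32_t group = (addr >> BLOCK_SHIFT) % NR_GROUP;
  uint32_t tag   =  addr >> (BLOCK_SHIFT + GROUP_SHIFT);
  way_t *g = cache[group];
  now++;

  for (int i = 0; i < NR_WAY; i++) {          /* compare the tags of all ways in the group */
    if (g[i].valid && g[i].tag == tag) { g[i].last_use = now; return true; }
  }

  int victim = 0;                             /* miss: prefer an invalid way, else the LRU way */
  for (int i = 0; i < NR_WAY; i++) {
    if (!g[i].valid) { victim = i; break; }
    if (g[i].last_use < g[victim].last_use) victim = i;
  }
  g[victim].valid = true;
  g[victim].tag = tag;
  g[victim].last_use = now;
  return false;
}
```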



In a set-associative organization, a memory address is again divided into three parts: tag, index, offset. The index is now the group index, so its bit width is `n = log2(number of groups)`; with `k` cache blocks organized into `w` ways, the number of groups is `k / w`.

```
 31       m+n m+n-1     m m-1      0
+------------+-----------+--------+
|    tag     |   index   | offset |
+------------+-----------+--------+
```



When checking for a hit, only the tags of the cache blocks within the selected group need to be compared. As long as `w` is not large, the area cost of the comparators is acceptable.



In fact, both full associativity and direct mapping can be regarded as special cases of set associativity: `w = 1` is direct mapping, and `w = total number of cache blocks` is full associativity. Modern CPUs usually use 8-way or 16-way set-associative caches.


#### Block size selection



The block size is a special parameter. A larger cache block, on the one hand, reduces the tag storage overhead; on the other hand, it stores more adjacent data, which captures the spatial locality of the program better and reduces conflict misses. However, reading in more adjacent data increases the miss penalty; at the same time, for a given cache capacity, a larger block size also means fewer cache blocks. If the program's spatial locality is not strong, more and smaller cache blocks would be preferable, and in that case a larger block size increases conflict misses.

```txt
// A program with poor spatial locality
// Two cache blocks of size 4
               1111       2222       cache
|--------------oooo-------oooo-----| memory, 'o' is the hot data that the program accesses

// One cache block of size 8
               11111111              cache
|--------------oooo-------oooo-----| memory



// A program with good spatial locality
// Two cache blocks of size 4
               11112222              cache
|--------------oooooooo------------| memory

// One cache block of size 8
               11111111              cache
|--------------oooooooo------------| memory
```


### Design space exploration



Many cache-related parameters have been mentioned above. Selecting an appropriate set of parameters that achieves better performance under given resource constraints belongs to the Design Space Exploration (DSE) of the cache. Currently, the performance we care about includes IPC, clock frequency, and area. The frequency and area can be evaluated quickly with the `yosys-sta` project, while the IPC is usually obtained by running complete programs in a ysyxSoC environment with calibrated memory access delays.



But there are so many parameter combinations that, if every one of them required hours of simulation to obtain its IPC, design space exploration would be very inefficient. As mentioned before, data accuracy and simulation efficiency are a trade-off, so one way to improve the efficiency of design space exploration is to sacrifice the statistical precision of the IPC and compute, at much lower cost, an indicator that reflects the trend of the IPC.



When we adjust the various cache parameters, what is directly affected is the AMAT,
so we can assume that the execution overhead of the rest of the CPU stays the same. By the definition of AMAT, adjusting these parameters does not affect the cache access time, so it can be treated as a constant.
What we really care about is therefore the total time the program spends waiting for cache misses; we call it the Total Miss Time (TMT).
In fact, TMT can represent the changing trend of IPC: the smaller the TMT, the smaller the number of cycles required for each instruction to access memory, and thus the larger the IPC. + + + +You should have counted TMT before when you counted AMAT through the performance counter, but you need to run the finished program in ysyxSoC. In order to calculate TMT at low cost, we start from the Angle of 'missing times * missing costs'. Consider how to count the number of misses and the cost of misses cheaply. + + + +For the statistics of missing times, we have the following observations: + +* The number of cache accesses for a given program is fixed. As long as you get the itrace of the program running and input it into the icache, you can simulate the process of icache working. This allows you to count the number of icache misses, so you don't have to emulate the entire ysyxSoC, not even NPC. +* When an NPC executes a program, it needs to know the next instruction to be executed by the execution result of the current instruction. But itrace already contains the complete instruction stream, so when counting TMT, we only need the PC value of the instruction stream, not the instruction itself. +* The data part of the icache is used as a command to reply to the IFU of the NPC. +However, since NPC are not required for TMT statistics, the data part of icache is not required, and only the metadata part is required. In fact, for a given access address sequence, the number of cache misses is independent of the access content. The correct number of misses can be counted by maintaining metadata. + + + +Therefore, in order to count the number of missing icache, we do not need to run the program completely every time. What we really need is a simple cache function emulator, which we'll call cachesim. Cachesim receives a sequence of PC instruction streams (a simplified version of itrace) and maintains metadata to count the number of missing PC sequences. As for the PC sequence of instruction flow, we can quickly generate it through NEMU. + + + +As for missing costs, cachesim does not contain the memory access details in ysyxSoC, so in principle we cannot get this value accurately. But as discussed above, only the block size parameter affects the missing cost. So we can calculate an average missing cost in ysyxSoC and use it as a constant to estimate TMT. + + + +> #### todo:: Implements the cachesim +> +> Implement a simple cache emulator as described above. +> +> With cachesim, we can perform a DiffTest performance test on the icache. +> Specifically, we can use cachesim as a REF for performance tests. +> Performance counter results from running a program in an NPC, +> It should be exactly the same as the number of misses obtained by cachesim from the corresponding PC sequence statistics. +> If not, there may be a performance bug in the RTL implementation. +> This performance bug cannot be found through DiffTest or formal validation of functional testing against NEMU. For example, even if the icache is missing all the time, the program will still run correctly on the NPC. +> Of course, it is also possible that the cachesim as REF has a bug. But in any case, can have REF as a comparison, it will not lose. +> +> But in order to get consistent itrace, you may need to make some changes to NEMU, +> Enable it to run 'riscv32e-ysyxsoc' image file. 
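

To make the idea concrete, here is a deliberately simplified sketch of what such a cachesim might look like. Everything in it is an assumption to be replaced by your own choices: it reads a plain-text trace with one hexadecimal PC per line, models a direct-mapped icache by metadata only, and uses a made-up constant `MISS_PENALTY` in place of the average miss cost you would measure in ysyxSoC:

```c
/* cachesim.c - count icache misses for a PC trace and estimate TMT */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NR_BLOCK     16      /* assumed: 16 blocks                             */
#define BLOCK_SHIFT  2       /* assumed: 4-byte blocks                         */
#define MISS_PENALTY 80      /* assumed average miss cost in cycles,           */
                             /* to be replaced by a value measured in ysyxSoC  */

typedef struct { bool valid; uint32_t tag; } meta_t;  /* metadata only, no data array */
static meta_t meta[NR_BLOCK];

static uint64_t nr_access = 0, nr_miss = 0;

static void access_once(uint32_t pc) {
  uint32_t index = (pc >> BLOCK_SHIFT) % NR_BLOCK;
  uint32_t tag   =  pc >> BLOCK_SHIFT;
  nr_access++;
  if (!(meta[index].valid && meta[index].tag == tag)) {
    nr_miss++;
    meta[index].valid = true;
    meta[index].tag = tag;
  }
}

int main(int argc, char *argv[]) {
  if (argc < 2) { fprintf(stderr, "usage: %s itrace.txt\n", argv[0]); return 1; }
  FILE *fp = fopen(argv[1], "r");
  if (fp == NULL) { perror("fopen"); return 1; }

  unsigned int pc;
  while (fscanf(fp, "%x", &pc) == 1) access_once(pc);
  fclose(fp);

  printf("accesses = %lu, misses = %lu, miss ratio = %.2f%%\n",
         (unsigned long)nr_access, (unsigned long)nr_miss,
         nr_access ? 100.0 * nr_miss / nr_access : 0);
  printf("estimated TMT = %lu cycles\n", (unsigned long)(nr_miss * MISS_PENALTY));
  return 0;
}
```

Turning the cache parameters into command-line arguments then makes it easy to launch many instances in parallel, as suggested below.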
+ + + +> #### option:: compresses the trace +> +> If you get a very large itrace, consider compressing it in the following ways: +> +> * Store itrace in binary format instead of text +> * Most of the time instructions are executed sequentially, for a continuous sequence of PCS, +> We can only record the first PC and the number of instructions executed consecutively +> * Further compress the generated itrace file with the 'bzip2' related tool, +> Then cachesim code by 'popen("bzcat compressed file path ", "r")' to get a readable file pointer. +> For use of 'popen()', please refer to RTFM + + + +> #### todo:: Design space exploration with cachesim +> +> With cachesim, we can quickly evaluate the expected benefits of different cache parameter combinations. +> Cachesim can be evaluated thousands or even tens of thousands of times faster than ysyxSoC for a single parameter combination. +> +> In addition, we can evaluate multiple parameter combinations simultaneously with multi-core: +> Specifically, first let cachesim pass in various cache arguments from the command line. +> Then use the script to start multiple cachesim and pass different parameter combinations. +> In this way, we can obtain evaluation results of dozens of parameter combinations in a matter of minutes. +> To help us quickly choose the right combination of parameters. +> +> Try to get through this quick evaluation process and evaluate several parameter combinations. +> However, we have not optimized the missing costs to evaluate a more reasonable TMT; +> In addition, we have not given a limit on the area size. All of these factors affect the final decision, so you don't have to make a final design choice for now. + + +### Optimization missing cost + + + +If the cache is missing, the data will be accessed to the next storage layer, so the cost of the missing is the access time of the next storage layer. There are many techniques to reduce the missing cost, and we will discuss one of them first: burst access over bus transmission. If the size of the cache block is the same as the data bit width of the bus, then only one bus transfer transaction is needed to complete the access to the data block, and the optimization space is not large. However, for larger cache blocks, in principle, multiple bus transfers are required to complete access to the data block, where there is an opportunity for optimization. + + +#### block size and missing cost + + + +With current cache designs, the next layer of storage is SDRAM. For further analysis, we first establish the following simple model for SDRAM access time. The SDRAM access time is divided into four segments, so the overhead for an independent bus-transmitted transaction is' a+b+c+d '. Assume that the data bit width of the bus is 4 bytes and the block size of the cache is 16 bytes. If four separate buses are used to transmit the transaction, the cost is' 4(a+b+c+d) '. + +```txt ++------------------------ arvalid set valid +| +-------------------- The AR channel shakes hands and receives read requests +| | +------------ The state machine transitions to the READ state and sends the READ command to the SDRAM particle +| | | +------ SDRAM particles return read data +| | | | +-- R-channel handshake, return read data +V a V b V c V d V +|---|-------|-----|---| +``` + + + +> #### todo:: Supports larger block sizes +> +> Change the block size in cachesim to four times the bus data bit width, and estimate the missing cost by using a separate bus transfer transaction. 
>
> After implementation, compare the results with your previous evaluation and try to explain the difference.



> #### todo:: Supports larger block sizes (2)
>
> Modify the implementation of the icache to support larger block sizes.
> When implementing it, you are advised to make the block size a configurable parameter, to facilitate the later evaluation.


#### AXI4 burst transmission



Similar to the SDRAM chip, the AXI bus also supports "burst transfers": one bus transaction contains several consecutive data transfers, and each data transfer is called a "beat". In the AXI bus protocol, the `arburst` signal of the AR channel indicates whether a transaction uses burst transfer, and if so, the `arlen` signal indicates the number of beats to be transferred.



Of course, support in the bus protocol alone is not enough: the AXI master has to actually issue burst transactions, and the slave has to be able to process them. The icache is the master here, and we leave issuing burst transactions to you as part of the exercise. On the other hand, the SDRAM controller provided by ysyxSoC does support burst transactions. The four bus transfers above can then be folded into a single bus transaction, which effectively reduces the overhead of reading out a complete data block:
first, with a burst, the `AR` channel only needs one handshake, saving `3a` compared with the scheme above;
second, the SDRAM controller can split one burst transaction into multiple READ commands sent to the SDRAM chip; when the SDRAM chip returns the read data and there is still a read command with a sequential address, the controller's state machine goes directly back to the READ state and keeps issuing READ commands, saving `3b` compared with the scheme above;
finally, the controller's reply on the `R` channel can overlap with the transmission of the next READ command, so the `R` channel handshake overhead is hidden, saving `3d` compared with the scheme above.

```txt
  a     b      c    d
|---|-------|-----|---|                   <-- The first beat
                  |-----|---|             <-- The second beat
                        |-----|---|       <-- The third beat
                              |-----|---| <-- The fourth beat
```



In summary, the cost of a burst access is `a+b+4c+d`, which saves `3(a+b+d)` compared with the scheme above. Generalizing, if the cache block size is `n` times the bus data width, burst transfer saves `(n-1)(a+b+d)` of overhead. Although `a`, `b`, `d` do not look large, remember that these are overheads seen from the SDRAM controller's perspective; after the memory access delay is calibrated, they can amount to tens or even hundreds of CPU cycles saved.


#### Implementation of burst transmission



As you can see, to reap the benefits of burst transfers, you first need to make the cache use larger blocks. But as discussed above, larger cache blocks may also increase conflict misses, and the net benefit may even be negative, depending on the spatial locality of the program. Which choice is better has to be evaluated with a benchmark.
To evaluate the TMT in cachesim, we also need to obtain the miss cost when burst transfer is used.
#### Implementing burst transfers


As you can see, to reap the benefit of burst transfers you first need to make the cache use larger blocks. But as discussed above, larger cache blocks can also increase conflict misses, which may even turn the benefit negative; it depends on the spatial locality of the program, and which effect wins is something the benchmark has to tell us.
To evaluate the TMT in cachesim, we also need the miss cost when burst transfers are used. To obtain it, we first have to make AXI burst transfers work in the ysyxSoC environment. This change involves quite a few details, so we will go through it in several steps.


To make testing easier, we previously connected the SDRAM controller through its APB interface. But the APB protocol does not support burst transfers, so to use them we first have to switch to the AXI-interface version of the SDRAM controller.


In addition, the AXI interface of this SDRAM controller has a 32-bit data width, while ysyxSoC uses a 64-bit AXI interconnect, so a data width conversion is needed.
ysyxSoC already contains the framework of an AXI data width conversion module, which is inserted upstream of the AXI SDRAM controller; however, the framework does not provide the actual width-conversion logic. You need to implement this AXI width conversion module in ysyxSoC so that AXI transactions can transfer data correctly.


Since the word length of the NPC is 32 bits, even though ysyxSoC uses a 64-bit AXI interface, in principle no transfer will ever carry a full 64 bits of data.
We can therefore simplify this width conversion module: on the one hand, only the 64-to-32-bit direction needs to be implemented, without considering conversions between other data widths; on the other hand, we can assume that no full 64-bit data transfer occurs, so there is no need to split one 64-bit transfer into two 32-bit transfers.


> #### todo:: Integrate the AXI-interface SDRAM controller
>
> You need to complete the following:
>
> 1. Implement the AXI data width conversion module. Specifically,
>    if you choose Verilog, implement the corresponding code in `ysyxSoC/perip/amba/axi_data_width_converter_64to32.v`;
>    if you choose Chisel, implement the corresponding code in the `AXI4DataWidthConverter64to32Chisel` module
>    in `ysyxSoC/src/amba/AXI4DataWidthConverter.scala`,
>    and change `Module(new AXI4DataWidthConverter64to32)` in `ysyxSoC/src/device/SDRAM.scala`
>    to instantiate the `AXI4DataWidthConverter64to32Chisel` module instead.
> 2. Change the variable `sdramUseAXI` to `true` in the `Config` object of `ysyxSoC/src/SoC.scala`,
>    then regenerate `ysyxSoCFull.v`.
>
> When you are done, run some tests to check that your implementation is correct.


> #### todo:: Let the icache use burst transfers
>
> Modify the implementation of the icache so that it uses burst transfers to access data blocks in the SDRAM.
>
> After getting it right, record a waveform of a burst transfer and compare it with a waveform without burst transfers.
> You should be able to observe that burst transfers do improve efficiency.


#### Calibrating the memory access latency of the AXI-interface SDRAM


Although burst transfers are now working, the corresponding memory access latency has not been calibrated yet, so the resulting performance numbers are inaccurate. Similar to the APB delay module implemented earlier, we need an AXI delay module to calibrate the access latency.


Because the AXI protocol is more complex than APB, the AXI delay module has to consider the following issues:

* The read and write channels of AXI are independent, so in principle separate counters are needed for read and write transactions to control their latency.
  However, the current NPC is a multicycle design and never issues a read and a write request at the same time,
  so for now a single set of counters is enough. When you implement the pipeline later, you will still need two separate sets of counters.
* AXI has a complete set of handshake signals, and waiting for a handshake also involves the state of the device, so this waiting time falls within the scope of calibration as well: the moment the valid signal is asserted should be taken as the start of the transaction.
* AXI supports burst transfers, so the transfer pattern differs from APB.
* For read transactions, one burst may return several pieces of data, and each of them has to be calibrated separately. Suppose an AXI burst read transaction starts at `t0` and the device side returns data at `t1` and `t2` respectively;
  the AXI delay module then returns the data upstream at `t1'` and `t2'` such that `(t1 - t0) * r = t1' - t0` and `(t2 - t0) * r = t2' - t0` (see the sketch after this list).
* A burst write transaction involves several data transfers, and since the device needs at least one cycle to accept each piece of data, each such cycle appears as `r` cycles to the CPU, so the data-sending times also need to be calibrated.
  However, we have not implemented a dcache yet and the LSU does not issue burst write transactions, so the calibration of burst writes can be postponed; single write transactions, though, still need to be calibrated.

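The read-transaction rule above is easier to see with a small numeric sketch; `r`, `t0` and the device response times below are placeholders, and `r` stands for whatever latency ratio you derived when calibrating the APB delayer.

```c
// Sketch of the calibration rule the AXI delayer enforces for a burst read:
// every beat returned by the device at time ti is released upstream at ti'
// such that (ti - t0) * r = ti' - t0, where t0 is the cycle in which arvalid
// was first asserted.
#include <stdio.h>

int main(void) {
  int r  = 5;                  // placeholder latency ratio
  int t0 = 100;                // cycle where arvalid is asserted
  int t_dev[2] = {108, 113};   // cycles where the device returns beat 0 and beat 1

  for (int i = 0; i < 2; i++) {
    int t_up = t0 + r * (t_dev[i] - t0);  // cycle where the beat is handed upstream
    printf("beat %d: device at %d -> upstream at %d\n", i, t_dev[i], t_up);
  }
  return 0;
}
```
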
> #### todo:: Implement the AXI delay module
>
> Implement the AXI delay module in ysyxSoC as described above.
> Specifically, if you choose Verilog, implement the corresponding code in `ysyxSoC/perip/amba/axi4_delayer.v`;
> if you choose Chisel, implement the corresponding code in the `AXI4DelayerChisel` module of `ysyxSoC/src/amba/AXI4Delayer.scala`,
> and change `Module(new axi4_delayer)` in `ysyxSoC/src/amba/AXI4Delayer.scala` to instantiate the `AXI4DelayerChisel` module instead.
>
> To simplify the implementation, for now you may assume that the number of beats in a burst will not exceed 8.
> After implementing it, try different values of `r` and check in the waveform whether the equations above hold.


> #### todo:: Evaluate the performance of the burst transfer mode
>
> After calibrating the memory access latency of the burst transfer mode, run the microbench train-scale test and compare the result with the one you recorded previously.


#### Quickly assessing the miss cost


According to the discussion above, the current icache miss cost depends only on the block size and the bus transfer mode, and is unrelated to the other cache parameters.
We can therefore evaluate the miss cost of every combination of block size and bus transfer mode in advance; later, by plugging these miss costs into cachesim, we can compute the TMT and estimate the expected benefit of different cache parameter combinations. Moreover, according to the analysis above, the burst transfer mode always performs better than independent transfers, so in fact we only need to evaluate the miss cost of the various block sizes under burst mode.


As for how to assess the miss cost in advance, there are two approaches:

1. Modeling.
   Based on the workflow of the SDRAM controller's state machine, fit a formula for the SDRAM access time; plugging in the block size then yields the corresponding miss cost. This approach is straightforward, but modeling accuracy is a challenge: for example, the SDRAM row buffer and refresh operations also affect the access time, and it is hard to characterize their contribution to a single SDRAM access.
2. Statistics. Use suitable performance counters to measure the time spent accessing the SDRAM on icache misses, and from that compute the average miss cost. Being a statistical method, it averages over samples and thus naturally accounts for hard-to-model factors such as the row buffer and refresh operations. But a row buffer is essentially a cache, and its behavior is affected by program locality,
   so the test program also needs to be representative:
   statistics gathered directly from the microbench train-scale test are the most representative; this takes more time, but it only has to be run once to obtain the average miss cost. The test-scale run does not behave exactly like the train-scale run, but is still reasonably representative and lets you obtain the average miss cost quickly.
   The hello program, however, is only weakly representative, and using its average miss cost for estimation may introduce a large error. A small sketch of this statistics approach follows below.


In real projects, given their complexity, modeling is rarely used, so we also recommend using the statistics approach to assess the miss cost in advance.

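As a minimal illustration of the statistics approach (the counter names and numbers below are made up; use whichever performance counters you have actually added to your design):

```c
// Sketch of the "statistics" approach: read two hardware performance counters
// after a benchmark run and derive the average miss cost from them.
#include <stdint.h>
#include <stdio.h>

int main(void) {
  // Values you would read from your performance counters after running,
  // e.g., the microbench train-scale test (placeholders below):
  uint64_t icache_miss_count  = 123456;    // number of icache misses
  uint64_t sdram_stall_cycles = 9876543;   // cycles spent waiting for SDRAM on those misses

  double avg_miss_cost = (double)sdram_stall_cycles / icache_miss_count;
  printf("average miss cost = %.1f cycles\n", avg_miss_cost);
  // Feed this number into cachesim for the matching block size,
  // instead of re-running the RTL simulation for every parameter combination.
  return 0;
}
```
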
> #### todo:: Quickly assess the miss cost
>
> Implement the quick assessment of the miss cost as described above.
> You can already count the number of misses with cachesim;
> later you will combine those counts with these miss costs to estimate the expected benefit of different cache parameter combinations.


### Memory layout of the program


A change in the memory layout of a program can also noticeably affect cache performance.
For example, when the icache block size exceeds 4 bytes, a hot loop in the program may happen not to be aligned to a cache block boundary, so the instructions of the hot loop occupy more cache blocks than necessary. Say the hot loop of a program is located at addresses `[0x1c, 0x34)` and the icache block size is 16 bytes; reading all of the hot loop's instructions into the icache then requires three cache blocks.

```txt
        [0x1c, 0x34)                   + 0x4 = [0x20, 0x38)
+------+------+------+------+      +------+------+------+------+
|      |      |      | 0x1c |      | 0x20 | 0x24 | 0x28 | 0x2c |
+------+------+------+------+      +------+------+------+------+
| 0x20 | 0x24 | 0x28 | 0x2c |      | 0x30 | 0x34 |      |      |
+------+------+------+------+      +------+------+------+------+
| 0x30 |      |      |      |
+------+------+------+------+
```


But if we pad a few blank bytes in front of the code, we can change the location of the hot loop so that its instructions occupy fewer cache blocks. In the example above, padding just 4 bytes in front of the program code moves the hot loop to `[0x20, 0x38)`; it then occupies only two cache blocks, and the cache block saved can be used to hold other instructions, improving the overall performance of the program.

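As one concrete way to experiment with this, here is a small sketch of two GCC/GNU-ld specific tricks; the function, the 16-byte alignment and the 4 bytes of padding are only examples, and whether either helps depends on your block size and on where the linker actually places the code, so check the disassembly.

```c
// Sketch: two ways to shift a hot loop relative to icache block boundaries.

// (1) In C: force the function containing the hot loop onto a 16-byte boundary.
__attribute__((aligned(16)))
int hot_sum(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; i++) s += a[i];  // the hot loop
  return s;
}

// (2) In the linker script: pad 4 bytes in front of the code, e.g.
//       .text : { . = . + 4; *(.text*) }
//     which shifts everything after it by 4 bytes, as in the [0x1c, 0x34) example above.
```
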
> #### todo:: Optimize the memory layout of the program
>
> Following the description above, try padding a few blank bytes in front of the program code.
> Specifically, you can do this either by modifying the code or by modifying the linker script.
> After implementing it, evaluate whether the padded bytes improve the performance of the program.


Admittedly, the cache in the example above is small, so the single saved cache block is a large fraction of the whole cache; in modern processors the cache capacity is relatively ample, and saving one cache block may not improve performance noticeably. The point, however, is that optimizing the program itself is also an important way to exploit the locality principle: some programs can even become several times faster after such optimization.


In industry, for the critical applications of a target scenario, engineering teams often use all kinds of methods to improve performance. If a company designs its own processors, then besides improving the processor itself, it will also customize the compiler for the processor's parameters: compared with an executable built with a stock compiler from the open-source community, such as gcc, an executable built with the customized compiler runs faster on the target processor. In particular, [SPEC CPU defines two metrics, `base` and `peak`][spec run rules]; under the `peak` rules, different subitems may be compiled with different optimization options, so the benchmark as a whole achieves better performance on the target platform than under `base`. To reach a higher `peak` score, software-level optimization is indispensable.

[spec run rules]: https://www.spec.org/cpu2006/Docs/runrules.html#rule_1.5


### Design space exploration (2)


We have covered quite a few cache parameters, and even the memory layout of the program; all of these affect the performance of programs running on the processor. Now we can consider these parameters together and pick a combination that performs well. Of course, the design space exploration also has to satisfy the area constraint.


> #### danger:: Area limit
>
> By default, your NPC, synthesized with the nangate45 process provided by the `yosys-sta` project,
> must not exceed 25,000 $um^2$ of synthesized area; this is also the area limit for the stage B tape-out.
> Considering that you still have to implement the pipeline in later tasks, we recommend keeping the synthesized area of the NPC below 23,000 $um^2$ for now.
>
> This is not a large area budget. On the one hand,
> this limit helps to highlight the contribution of the other parameters in the design space exploration;
> otherwise you could reach good performance simply by enlarging the cache,
> and the impact of the other parameters on performance would be hard to see.
> On the other hand, the project team hopes that more students can reach the stage B tape-out,
> and a smaller area helps the team reduce the tape-out cost.
>
> For now you do not have to meet the above area limit strictly: if you exceed it by less than 5%,
> you may choose to optimize after the pipeline is done;
> but if your current synthesized area far exceeds the limit, you may need to make major adjustments to your design,
> and we recommend starting some area-related optimization work immediately.
>
> As for the tape-out cost, we can make a simple, not very rigorous estimate.
> Suppose a fab offers a tape-out service for the nangate45 process where each die is `2mm x 3mm` and costs 500,000 RMB.
> Then the price per $um^2$ is `500,000 / (2 * 3 * 1,000,000) = 0.0834` yuan.
> Generally, the wire nets connecting the standard cells also occupy some area, and some gaps are left between standard cells to avoid congestion, so the final chip area is usually larger than the synthesized area; as a rule of thumb, the synthesized area is about 70% of the final chip area.
> By this estimate, a design with a synthesized area of 25,000 $um^2$
> ends up costing about `25,000 / 0.7 * 0.0834 = 2978` yuan.
> The biggest takeaway from this estimate is that adding features to a CPU is not free. This is very different from designing for FPGAs: in competitions that target FPGAs,
> contestants usually try their best to convert the resources on the FPGA into CPU performance,
> and there is no economic cost in doing so.
> A real tape-out is not like that. You can think of the goal of stage B as designing a low-cost embedded processor (rv32e is the base instruction set for embedded scenarios):
> suppose you are an architect at an embedded CPU vendor and need to improve the performance of your processor within a limited area budget. If the area exceeds expectations, the chip cost rises and the product becomes less competitive in the market.
> Under such conditions, you have to estimate the cost-effectiveness of adding a feature:
> if a feature brings a 10% performance improvement, is it worth paying 500 yuan more for it?


> #### todo:: Design space exploration for the icache
>
> Following the introduction above, explore the design space of the icache and determine a design that achieves good performance under the given constraints.
> Once the design is determined, implement it in RTL and evaluate its performance on ysyxSoC.


> #### hint:: Some ideas for optimizing area
>
> If your estimated area significantly exceeds the requirement above, you will most likely need to optimize your design.
> There is no silver bullet for area optimization, but overall you can consider the following directions:
>
> 1. Optimize logic overhead: consider which logic is redundant and can be merged with other existing logic
> 2. Optimize storage overhead: consider which storage elements are redundant
> 3. Trade logic overhead against storage overhead: sometimes, instead of storing a signal,
>    it is better to recompute it; this may also affect the critical path, so it has to be analyzed case by case
>
> The area limit above is unlikely to be met by writing code carelessly, but it is not impossible to achieve either.
> yzh's reference design has an area of 22,730 $um^2$ after adding the icache, and its frequency reaches 1081MHz.
> Running microbench on it gives a `Total time` of 4.49s.
> We set the above area limit, on the one hand, so that you get to practice optimizing your own design:
> if you are a beginner, you need a chance to get your hands on this kind of work at some point,
> and build up, through trial and error, a feeling for how each line of RTL code relates to area overhead.
>
> On the other hand, it is also meant to let you experience the goal of architecture design once more:
> performance and area constrain each other. If you find your design hard to optimize,
> the simplest way out is to reduce the cache capacity, but you pay for it with performance;
> if you want both area and performance, you need to save every bit of area you can,
> and then plan how to spend that area to improve performance.


> #### todo:: Evaluate the cost-effectiveness of a dcache
>
> Above, you have already estimated the performance benefit of a dcache under ideal conditions.
> Now let us also estimate the cost-effectiveness of that dcache:
> assuming the dcache has the same capacity as the icache, how much would it cost?
>
> Although you have not designed a dcache yet, a dcache has to support write operations, so its design is bound to be more complex than the icache, and its area larger than an icache of the same capacity.
> The cost-effectiveness estimated under these assumptions is therefore quite optimistic:
> once you account for the real performance benefit and the real area of a dcache, it will only look less cost-effective.
>
> Another direction worth considering is: if the area of the dcache were instead used to enlarge the icache, how much performance improvement could that bring?


> #### caution:: Real architecture design
>
> Although the icache design space exploration task above is greatly simplified compared with design space exploration in a real processor, for most students it is still the first contact with real processor architecture design.
> More than that, it is probably the first time most students experience the whole flow of designing a module:
> from requirements analysis, structural design and logic design, to functional verification, performance verification and performance optimization,
> and finally to circuit-level area estimation and timing analysis. Among these, logic design is what is usually called RTL coding.
>
> This task shows once again that architecture design does not equal RTL coding.
> The job of architecture design is to find, within the design space that satisfies the constraints, a set of design parameters that performs well.
> But the design space is usually very large, and fully evaluating the performance of even one set of parameters takes a long time. How to evaluate different design parameters quickly is therefore a crucial problem in architecture work,
> and this is why simulators are an important tool for architecture design.
> With a simulator, we do not have to simulate circuit-level behavior (we run cachesim instead of verilator),
> we only need to model the necessary parts (no need to model the cache data, only the cache metadata), and we do not even need a processor to drive it (instead of running the full program, we replay the corresponding itrace). It is exactly these differences that make a simulator orders of magnitude faster than RTL simulation.
> This allows us to quickly evaluate the expected performance of different design parameters and to quickly rule out clearly unsuitable ones.
>
> According to the experience of the Xiangshan team, running a certain program on verilator takes one week,
> while running the same program in the full-system simulator [gem5][gem5] takes only 2 hours. This means that in the time it takes to evaluate one set of design parameters in an RTL simulation environment,
> the simulator can explore the effects of 84 different sets of design parameters.
>
> Simulators are also a common platform for architecture research:
> ISCA has held several competitions based on the [ChampSim][champsim] simulator,
> including a [cache replacement championship][cache replacement] and a [data prefetching championship][data prefetch]. Researchers evaluate the performance of various algorithms in the simulator, so that they can quickly adjust an algorithm's overall design and its detailed parameters.
> A production-quality algorithm of course still has to be implemented and verified at the RTL level, but exploring the various algorithms at the RTL level from the very beginning would be extremely inefficient.
>
> So, when you truly understand that "architecture design != RTL coding",
> you have truly entered the field of architecture design.

[gem5]: https://www.gem5.org/
[champsim]: https://github.com/ChampSim/ChampSim
[cache replacement]: https://www.sigarch.org/call-contributions/the-2nd-cache-replacement-championship/
[data prefetch]: https://www.sigarch.org/call-contributions/third-data-prefetching-championship/


## Cache coherence


When a store instruction modifies the contents of a data block, the semantics of the program require that later reads from the corresponding address return the new data; otherwise the program will execute incorrectly. Because of caching, however, a piece of data in memory may have multiple copies in the system. How to guarantee that new data can be read through every copy is called the cache coherence problem.


In computer systems, from the caches inside a processor to distributed systems and the Internet, wherever there are copies of data there are consistency problems between them. Now that the NPC has an icache, we can reproduce a coherence problem with the following `smc.c`:

```c
// smc.c
int main() {
  asm volatile("li a0, 0;"
               "li a1, UART_TX;"    // change UART_TX to the correct address
               "li t1, 0x41;"       // 0x41 = 'A'
               "la a2, again;"
               "li t2, 0x00008067;" // 0x00008067 = ret
               "again:"
               "sb t1, (a1);"
               "sw t2, (a2);"
               "j again;"
              );
  return 0;
}
```


The program first assigns initial values to some registers, then at the label `again` writes the character `A` to the serial port, then overwrites the instruction at `again` with `ret`, and finally jumps back to `again` to execute it again. According to the program's semantics, it should output the character `A` once and then return from `main()` through the rewritten `ret` instruction. Code that modifies itself while running like this is called "self-modifying code".


> #### comment:: Self-modifying code in the history of computing
>
> In the old days when memory address space was very tight, self-modifying code was often used to improve memory utilization,
> letting a program do more work within limited memory.
> For example, in the 1980s, [the FC (Famicom) had only 64KB of address space][NES MMIO], of which the ROM in the cartridge occupied 32KB. Some cartridges also carried 8KB of RAM, but if a cartridge had none, the program could only use the 2KB of RAM integrated next to the CPU.
> To develop great games with such limited resources,
> developers used all kinds of tricks, and self-modifying code was one of them.
>
> With the development of memory technology, memory capacity is no longer as tight as it used to be. Together with the fact that self-modifying code is hard to read and maintain, it is now difficult to find traces of self-modifying code in modern programs.

[NES MMIO]: https://www.nesdev.org/wiki/CPU_memory_map


> #### todo:: Reproduce the cache coherence problem
>
> Compile the above program into AM and run it on the NPC. What problem do you find?
> Try to analyze it and verify your idea with waveforms.
> If you do not observe a problem, try increasing the capacity of the icache.


A straightforward solution to the problem above is to keep all copies in the system consistent at all times. For example, every time a store instruction executes, immediately check whether other copies exist in the system; if they do, either update them or invalidate them, so that whichever copy a later access hits,
it either reads the new data directly (with the update approach) or misses and reads the new data from the next level of storage (with the invalidate approach).
The x86 instruction set adopts this solution. Obviously, though, it increases the complexity of the CPU design. In particular, in some high-performance processors, while a store instruction is executing, other components are accessing the various caches in the system at the same time; preventing them from reading stale data before the store has updated or invalidated all copies is very challenging.


The other solution is more permissive: copies in the system are allowed to be inconsistent at certain moments, but before the program accesses such a data block, it must execute a special instruction that tells the hardware to deal with the stale copies. In this way the program still accesses the correct data, and the result is still consistent with the program's semantics. The RISC-V instruction set adopts this solution:
RISC-V provides a `fence.i` instruction whose semantics is that all instruction fetches after it can see the data written by store instructions before it. The `fence.i` instruction acts like a barrier, preventing fetches after it from crossing the barrier and reading stale instructions that a store has since modified. The RISC-V manual describes it as follows:

```txt
RISC-V does not guarantee that stores to instruction memory will be made
visible to instruction fetches on a RISC-V hart until that hart executes
a FENCE.I instruction.
```


That is, RISC-V allows a copy in the icache to be inconsistent with memory at certain moments, which matches the discussion above. For more details about `fence.i`, RTFM.

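From the software side, inserting the barrier only takes one instruction. Below is a minimal sketch of how a C program can emit it with a GCC-style toolchain; the helper name is made up, and depending on your toolchain version you may also need the Zifencei extension to appear in `-march` before `fence.i` assembles.

```c
// Sketch: emitting the fence.i barrier from C via inline assembly (GCC-style).
// Call it after the store that modifies code and before jumping to the modified
// code, so that later instruction fetches are guaranteed to see the new instruction.
static inline void sync_icache(void) {
  asm volatile("fence.i" ::: "memory");  // "memory" also stops the compiler from
                                         // moving the modifying store past the barrier
}
```
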
RISC-V only defines the semantics of `fence.i` at the instruction set level;
at the microarchitecture level there are several different ways to implement it:

| Scheme | When a store instruction executes | When `fence.i` executes | On a subsequent icache access |
| :-: | :-: | :-: | :-: |
| (1) | update the matching block in the icache | nop | hit |
| (2) | invalidate the matching block in the icache | nop | miss, fetch from memory |
| (3) | - | flush the icache | miss, fetch from memory |


In fact, schemes (1) and (2) are exactly the "keep all copies in the system consistent at all times" solution mentioned above, which can be seen as a special case of "allowing copies to be inconsistent at certain moments":
since all copies are already consistent when the store instruction executes, `fence.i` can be implemented as a `nop`. In scheme (3), the store instruction does not touch the copies in the icache, so those copies may be stale;
when `fence.i` executes, it therefore has to achieve its barrier effect by flushing the icache, so that subsequent icache accesses are guaranteed to miss and fetch the new data from memory.
However, no matter which implementation the hardware chooses, the program still has to insert the `fence.i` instruction in order to satisfy the requirements the RISC-V manual places on programs; otherwise it will not run correctly on a processor that uses scheme (3).


In fact, the difference between these implementations is simply whether the copy-consistency problem is handled in hardware or in software:
if the ISA specification requires the processor to handle it in hardware, the problem is transparent to software, and the programmer does not need to worry about where to insert instructions like `fence.i`, at the cost of a more complex hardware design; if the ISA specification requires software to handle it, the hardware is simpler, at the cost of a heavier burden on the programmer.


So the essence of the problem is a trade-off between hardware design complexity and the burden of program development.


> #### todo:: Implement the fence.i instruction
>
> Based on your understanding of `fence.i`, choose an implementation scheme you consider reasonable and implement the `fence.i` instruction in the NPC.
> After implementing it, add the `fence.i` instruction at the appropriate place in the `smc.c` above and run it on the NPC again.
> If your implementation is correct, you will see the program output the character `A` and then finish successfully.
>
> Hint: you may run into a compilation error related to `fence.i`; try to resolve it based on the error message.


In fact, the cache coherence problem shows up in many more forms in real computers;
we will say more about it as the processor becomes more complex.