Virtual Memory¶

Background¶

only part of the program needs to be in memory for execution
allow memory to be shared by several processes - better IPC performance
Code needs to be in memory to execute, but entire program rarely needed or used at the same time
Virtual memory is larger than physical memory. 需要注意的是虚拟地址只是范围，并不能真正的存储数据，数据只能存在物理空间里。

Virtual-address Space

共享libc

Demand Paging¶

Demand paging brings a page into memory only when it is demanded

只有用的时候才分配，省内存但这时使用就会浪费时间

用时间换空间

demand means access (read/write)
if page is invalid (error) ➠ abort the operation
if page is valid but not in memory ➠ bring it to memory
- page fault
What causes page fault？
User space program accesses an address
Which hardware issues page fault?
MMU
Who handles page fault？
Operating system
Demand paging needs hardware support
page table entries with valid / invalid bit
backing storage (usually disks)
instruction restart

What causes page fault?¶

malloc 最后也会调用 brk()，增长堆的大小。 VMA 是 Virtual Memory Area，brk() 只是增大了 VMA 的大小（修改 vm_end），但是并没有真正的分配内存，只有当我们真正访问这个地址的时候，会触发 page fault，然后找一个空闲帧真正分配内存，并做了映射。

蓝色代表virtual address，绿色代表物理帧
在virtual address中，upper part存着kernel，lower part存着user
on 32-bit, the split is at 0xC0000000，高于此就是kernel，低的就是user，3G for user，1G for kernel

查的是VMA
Each page table entry has a valid–invalid (present) bit
$v$ means frame mapped, $i$ means frame not mapped
一开始所有entry的这一位都被设置为$i$
during address translation, if the entry is invalid, it will trigger a page fault

Who handles page fault?¶

Linux implementation

Page Fault 出现有两种情况，一种是地址本身超过了 vma 的范围，或者落在 Heap 内但权限不对，这种情况操作系统会杀死进程；一种是落在 Heap 里，而且权限也正确，那么这个时候 OS 就会分配一个空闲帧，然后把这个页映射到这个帧上。

为了判断地址是否落在 vma 里，Linux 使用了红黑树来加速查找。

Page Fault Handling

Page Fault¶

分成major page fault和minor page fault
major: 需要访问硬盘
minor: 不在虚拟地址空间，但是在物理内存
First reference to a non-present page will trap to kernel: page fault.
Operating system looks at memory mapping to decide:
invalid reference -> deliver an exception to the process
- Via check vma in Linux
注意这里的 valid 不是指 page table 的有效，而是访问的地址在 vma 里，而且权限正确。
valid but not in physical memory -> bring in
- get an empty physical frame
- bring page into frame via disk operation
- set page table entry to indicate the page is now in memory
- restart the instruction that caused the page fault
extreme case: 开始进程时memory内没有frame，又叫pure demand paging
OS sets instruction pointer to first instruction of process
- invalid page ➠ page fault
every page is paged in on first access
- program locality reduces the overhead
- an instruction could access multiple pages ➠ multiple page

Swapper¶

Lazy swapper: never swaps a page in memory unless it will be needed
the swapper that deals with pages is also caller a pager
Pre-Paging: pre-page all or some of pages a process will need, before they are referenced
it can reduce the number of page faults during execution
if pre-paged pages are unused, I/O and memory was wasted
- although it reduces page faults, total I/O# likely is higher

Get Free Frame¶

Most operating systems maintain a free-frame list -- a pool of free frames for satisfying such requests
Operating system typically allocate free frames using a technique known as zero-fill-on-demand -- the content of the frames zeroed-out before being allocated

为了防止信息泄露，在分配时把帧的所有位都置 0。

Stages in Demand Paging – Worse Case¶

Trap to the operating system.
Save the user registers and process state. (pt_regs)
Determine that the interrupt was a page fault.
Check that the page reference was legal and determine the location of the page on the disk.
Issue a read from the disk to a free frame: - Wait in a queue for this device until the read request is serviced. - Wait for the device seek and/or latency time. - Begin the transfer of the page to a free frame.
While waiting, allocate the CPU to other process. 等待读取的过程中，CPU分配给其他process
Receive an interrupt from the disk I/O subsystem. (I/O completed) - Determine that the interrupt was from the disk. - Mark page fault process ready. 第一次mark为ready
Handle page fault: wait for the CPU to be allocated to this process again. 等待第二次 - Save registers and process state for other process. - Context switch to page fault process. - Correct the page table and other tables to show page is now in memory.
Return to user: restore the user registers, process state, and new page table, and then resume the interrupted instruction.

Demand Paging: EAT¶

page fault rate: $0\leq p\leq1$
Effective Access Time(EAT): $$ \begin{aligned} (1-p)\times\text{memory access} + p\times(\text{page fault overhead} + \text{swap page out} + \newline \text{swap page in} + \text{instruction restart overhead}) \end{aligned} $$

Example

Demand Paging Optimizations¶

Swap space I/O faster than file system I/O even if on the same device.
Swap allocated in larger chunks, less management needed than file system.
Copy entire process image from disk to swap space at process load time
Then page in and out of swap space
Used in older BSD Unix
Demand page in from program binary on disk, but discard rather than paging out when freeing frame (and reload from disk next time)
Following cases still need to write to swap space
- Pages not associated with a file (like stack and heap) – anonymous memory
- Pages modified in memory but not yet written back to the file system 脏数据

Copy-on-Write¶

Copy-on-write (COW) allows parent and child processes to initially share the same

pages in memory

the page is shared as long as no process modifies it
if either process modifies a shared page, only then is the page copied

被写后才真正的进行copy，提升fork的效率

vfork syscall optimizes the case that child calls exec immediately after fork.

parent is suspend until child exits or calls exec.
child shares the parent resource, including the heap and the stack.
child cannot return from the function or call exit, should call _exit.

共享堆和栈，所以如果调用 exit 就会弄乱父进程的堆和栈，因此只能调用 _exit。
vfork could be fragile, it is invented when COW has not been implemented.

Page Replacement¶

没有空闲的物理帧时应该怎么办呢？我们可以交换出去一整个进程从而释放它的所有帧；更常见地，我们找到一个当前不在使用的帧，并释放它。

Page replacement – find some page in memory, but not really in use, page it out.

与物理地址无关

Page Fault Handler¶

To page in a page:
find the location of the desired page on disk
find a free frame:
- if there is a free frame, use it
- if there is none, use a page replacement policy to pick a victim frame, write victim frame to disk if dirty
bring the desired page into the free frame; update the page tables
restart the instruction that caused the trap

Page Replacement Algorithms¶

Page-replacement algorithm should have lowest page-fault rate on both first access and re-access
FIFO, optimal, LRU, LFU, MFU…
evaluate：用一串 memory reference string，每个数字都是一个页号，给出物理页的数量，看有多少个 page faults。

FIFO¶

replace the first page loaded

以上第二个序列，3个frame只有9个page fault，但4个frame却有10个。这个问题称为Belady's Anomaly

Optimal Algorithm¶

Optimal : replace page that will not be used for the longest time
但实际我们完全不知道哪一页不用的时间最久，因为我们不能预测未来，用来衡量算法的性能

Least Recently Used(LRU)¶

LRU replaces pages that have not been used for the longest time
LRU and OPT do NOT have Belady’s Anomaly.

implement:
counter-based: 花销太大了
stack-based
- keep a stack of page numbers (in double linked list)
- when a page is referenced, move it to the top of the stack
- each update is more expensive, but no need to search for replacement 栈更新很耗时间
approximation implementation: LRU approximation with a reference bit
- associate with each page a reference bit, initially set to 0
- when page is referenced, set the bit to 1 (done by the hardware)
- replace any page with reference bit = 0 (if one exists)
Additional-Reference-Bits Algorithm 添加到8位
- Suppose we have 8-bits byte for each page
- During a time interval (100ms), shifts bit rights by 1 bit, sets the high bit if used, and then discards the low-order bits
每隔一定时间右移1位，若这次的间隔内访问了就将最高位置为1，否则置0
- 00000000 => has not been used in 8 time intervals
- 11111111 => has been used in all time intervals
- 11000100 vs 01110111 : which one is used more recently?
- 左边的，因为高位是1
Second-chance algorithm
- Clock replacement
- If page to be replaced has
- Reference bit = 0 -> replace it
- reference bit = 1 then: 是1的就给两次机会
  - set reference bit 0, leave page in memory
  - replace next page, subject to same rules
Enhanced Second-Chance Algorithm
- Improve algorithm by using reference bit and modify bit (if available) in concert.
- Take ordered pair (reference, modify):
- (0, 0) neither recently used not modified – best page to replace.
- (0, 1) not recently used but modified – not quite as good, must write out before replacement.
- (1, 0) recently used but clean – probably will be used again soon.
- (1, 1) recently used and modified – probably will be used again soon and need to write out before replacement.

Counting-based Page Replacement¶

Keep the number of references made to each page
Least Frequently Used (LFU) replaces page with the smallest counter
A page is heavily used during process initialization and then never used
Most Frequently Used (MFU) replaces page with the largest counter
based on the argument that page with the smallest count was probably just brought in and has yet to be used
LFU and MFU are not common

Page-Buffering Algorithms¶

Keep a pool of free frames, always
frame available when needed, no need to find at fault time.

维持一个空闲帧的池子，当需要的时候直接从池子里取一个即可。
When convenient, evict victim.

系统不繁忙的时候，就把一些 victime frame 释放掉。（写回到磁盘，这样帧可以加到 free list 里）
Possibly, keep list of modified pages.
Possibly, keep free frame contents intact and note what is in them - a kind of cache.
If referenced again before reused, no need to load contents again from disk
- cache hit

Memory intensive applications can cause double buffering - a waste of memory. User 和 OS 都缓存了同一份内容，导致一个文件占用了两个帧。

Operating system can given direct access to the disk, getting out of the way of the applications

Raw disk mode. 赋予操作系统直接访问磁盘的权限。

Allocation of Frames¶

Each process needs minimum number of frames -according to instructions semantics
type
Equal allocation
Proportional allocation
- Allocate according to the size of process
- Dynamic as degree of multiprogramming, process sizes change
global vs. local allocation
global: process selects a replacement frame from the set of all frames; one process can take a frame from another
- Reclaiming Pages: page replacement is triggered when the list falls below a certain threshold.
如果 free list 里的帧数低于阈值，就触发 page replacement。这个策略希望保证这里有充足的自由内存来满足新的需求。
local: each process selects from only its own set of allocated frames

只能从自己的帧里选。
Major and minor page faults
Major: page is referenced but not in memory
Minor: mapping does not exist, but the page is in memory
- Shared library
- Reclaimed and not freed yet

Non-Uniform Memory Access¶

不同 CPU 距离不同的内存的距离不同，因此访问时间也不同。

Thrashing¶

Thrashing: a process is busy swapping pages in and out.

如果我们的进程一直在换进换出页，那么 CPU 使用率反而会降低。进程越多，可能发生一个进程的页刚加载进来又被另一个进程换出去，最后大部分进程都在 sleep。

解决方法直接kill掉一些进程就好

Why does demand paging work?
process memory access has high locality.
process migrates from one locality to another, localities may overlap.
Why does thrashing occur?
total memory size < total size of locality

一个 locality 大小比内存大，因此我们不得不一直换进换出页。

Resolve thrashing¶

Limit thrashing effects by using local page replacement - One process starts thrashing does not affect others -> it cannot cause other processes thrashing - Select swap out page from the same process

Working-Set Model

Working-set window($\Delta$): a fixed number of page references

把所有的 locality 称为一个工作集，windows 是一个关于时间的窗口，代表最近 $\Delta$ 次对页面的访问。

Working set of process $p_i$ (WSSi): total number of pages referenced in the most recent $\Delta$ (varies in time)
Total working sets: $D = \sum$ WSSi

确定一个进程频繁访问的页面，保证这些页面不被换出；需要调页时从剩余的页面进行交换。如果频繁访问的页面数已经大于了当前进程可用的页面数，操作系统就应当把整个进程换出，以防止出现抖动现象。

$\Delta=10$

怎样找到工作集？ an interval timer + a reference bit，访问了这个界面就把 reference bit 置 1，当定时器到了之后，我们就可以根据 reference bit 来将这个页面放到工作集中。

Other Considerations¶

kmalloc: 要求虚拟地址连续，物理地址也连续
vmalloc: 要求虚拟地址连续，物理地址可以不连续
prepaging: To reduce the large number of page faults that occurs at process startup

Prepage all or some of the pages a process will need, before they are referenced
But if prepaged pages are unused, I/O and memory was wasted

page size:

Fragmentation -> small page size
Page table size -> large page size
Resolution -> small page size
I/O overhead -> large page size
Number of page faults -> large page size
Locality -> small page size
TLB size and effectiveness -> large page size

现在倾向于large，因为内存便宜了

TLB reach: the amount of memory accessible from the TLB

= TLB size x page size
Increase the page size to reduce TLB pressure

program structure can affect page faults (如多层for循环的顺序)

I/O interlock: Pages that are used for copying a file from a device must be locked from being selected for eviction by a page replacement algorithm

Linux Buddy System¶

Buddy system 从物理连续的段上分配内存；每次分配内存大小是 2 的幂次方，例如请求是 11KB，则分配 16KB。

split the unit into two “buddies” until a proper sized chunk is available

当分配时，从物理段上切分出对应的大小（每次切分都是平分）。

它被释放时，会合并 (coalesce) 相邻的块形成更大的块供之后使用。

advantage: it can quickly merge unused chunks into larger chunk.

可以迅速组装成大内存（释放后即可合并）。

disadvantage: internal fragmentation

Slab Allocation¶

要分配很多 task_struct，如何迅速分配。我们把多个连续的页面放到一起，将 objects 统一分配到这些页面上。

Slab allocator is a cache of objects.

a cache in a slab allocator consists of one or more slabs
a slab contains one or more pages, divided into equal-sized objects

Upon request, slab allocator

Uses free struct in partial slab
If none, takes one from empty slab
If no empty slab, create new empty

A 12k slab (3 pages) can store 4 3k objects.

mm_struct里存的是user page table

Large physical RAM¶

how about large physical RAM >= 1GB, even larger than the kernel AS
For large memory systems (more than ~1GBRAM), not all of the physical RAM can be linearly mapped into the kernel's address space.
Kernel address space is the top 1GB of virtual address space, by default.
Further, 128 MB is reserved at the top of the kernel's memory space for non-contiguous allocations
kernel logical address部分是linear mapping的，virtual address部分是nonlinear mapping

ARM64 每次进程切换换的是TTBR0

only the bottom part of physical RAM has a kernel logical address
只有低896MB可以被映射

RMAP: maps physical frames to virtual addresses