Linux VM and swapping

Preface

RAM has always been a critical resource. Swapping is a mechanism that allows the system to keep operating when the working data set is larger than the available RAM.

Most people think swap is bad because it slows the system down. This opinion is both right and wrong: when the system operates near the limits of its RAM, swap actually lets it run more efficiently. Let's examine the inner workings and lifecycles of VM (virtual memory).

Alloc VM lifecycle

A process has started and is running, and now it needs more RAM. What happens here? There are two phases - committing memory and allocating memory. Committing is the action of assigning some number of virtual pages to the process; allocation happens when physical pages are assigned to those virtual pages.
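
To make the two phases visible, here is a minimal sketch (sizes are arbitrary, assuming glibc on Linux): malloc() only commits address space, and physical pages appear only when each page is first touched. Comparing VmSize and VmRSS in /proc/self/status before and after the touch loop shows the difference.

    /* Sketch: phase 1 (commit) vs phase 2 (allocate on first touch). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        size_t size = 256UL * 1024 * 1024;      /* 256 MiB, arbitrary */
        long page = sysconf(_SC_PAGESIZE);

        char *buf = malloc(size);               /* phase 1: commit virtual pages */
        if (buf == NULL) {
            perror("malloc");
            return 1;
        }
        /* Here RSS has barely grown: pages are committed, not yet backed. */

        for (size_t off = 0; off < size; off += (size_t)page)
            buf[off] = 1;                       /* phase 2: each page fault
                                                   allocates a physical page */

        puts("all pages touched");              /* RSS has grown by ~256 MiB */
        free(buf);
        return 0;
    }
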
The caveat: what happens if the kernel has NO free pages?

MMAP VM lifecycle

A process has started and is running, and now it wants to map some pages of a file and access them. There are two phases - mapping and access.

The mapping phase: the kernel creates a virtual memory area (VMA) describing which part of the file is mapped to which range of the process address space. No physical pages are allocated at this point.

The access phase: the first access to a mapped page raises a page fault; the kernel reads the corresponding file page into the page cache and maps that physical page into the process address space.
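
A minimal sketch of both phases (the file path is just an example, any readable file works): mmap() only builds the mapping, and the data is actually read on the first access.

    /* Sketch: mapping phase (mmap) vs access phase (first read faults
       the page in from the page cache / disk). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/etc/hostname", O_RDONLY);   /* example file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

        /* Mapping phase: a VMA is created, no physical pages yet. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Access phase: this read triggers a page fault; the kernel reads
           the file page into the page cache and maps it. */
        printf("first byte: %c\n", p[0]);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }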

The caveat: again, what happens if the kernel has NO free pages?

In-kernel allocations

In the previous sections we referred to kernel allocation in the "caveat" notes. How does the kernel deal with allocation requests? The kernel accounts for all pages and keeps lists of used and free pages.

As background activity, the kernel also flushes dirty pages to make them clean. There are special cases when dirty pages are not flushed, but they are rare enough. The kernel also tries to keep a certain number of pages on the free list so that it can satisfy specific highest-priority allocations.
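
These counters can be observed in /proc/meminfo. A small sketch that prints the fields relevant here (the selection of fields is just an illustration):

    /* Sketch: print the kernel's page accounting from /proc/meminfo
       (values are in kB). */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL) { perror("fopen"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "MemTotal:", 9) == 0 ||
                strncmp(line, "MemFree:", 8) == 0 ||
                strncmp(line, "MemAvailable:", 13) == 0 ||
                strncmp(line, "Dirty:", 6) == 0)
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }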

Acting near the RAM boundaries

When almost all RAM is in use, we can get an interesting case: a program that actively uses RAM (like a browser or an image editor) has many documents (or objects) created or opened for reading. But in fact these objects (and their pages) are not modified and may not even be accessed. This means that a lot of memory is committed and allocated, but not used.

In this case an allocation cannot be satisfied by taking a new free page (there is not enough memory), and it cannot be satisfied by flushing the page cache (there are no dirty pages, because we don't modify files on disk). So the allocation can only be completed by swapping something out or by dropping a clean page-cache page.

And finally, if you have no swap enabled, the only way is to drop clean page-cache pages.

The problem is that these cached pages mostly contain valuable shared data and shared library code (shared libraries are mmapped into the process address space). When the next round starts and the code that has just been dropped from the cache must be executed again, the system has to read it back from the filesystem.
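
Whether the pages of a mapped file are still resident in the page cache (and therefore whether the next access will cost a filesystem read) can be checked with mincore(). A sketch, using an arbitrary file-backed mapping as the example:

    /* Sketch: count how many pages of a file-backed mapping are currently
       resident in the page cache. */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "/bin/sh";               /* example file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return 1; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(npages);

        if (mincore(p, st.st_size, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                if (vec[i] & 1)
                    resident++;
            printf("%zu of %zu pages resident\n", resident, npages);
        }

        free(vec);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }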

The system could swap out the least recently used data of the image editor and the browser and keep latency under control, because swapping out rarely used data is better than dropping the frequently used page cache - but you took that option away yourself when you disabled swap.

The conclusion: always keep some amount of swap space.

Overcommitting

Overcommitting allows the kernel to commit more virtual memory than it can actually serve (the kernel can really back only RAM + swap pages). But many applications try to allocate a large memory region and then use only part of it, or only use it much later. This means that, with some luck, the kernel may commit more virtual memory to applications than it has, in the hope that by the time a page must be allocated it will already be available - i.e. some applications will have finished, or other applications will have released their allocated pages.
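
A minimal sketch of how this looks from user space (sizes are arbitrary, assuming a 64-bit system with the default heuristic overcommit): a large allocation is typically committed without problems even though it may never be fully backed, and only the pages that are actually touched consume RAM.

    /* Sketch: commit far more memory than is used; with the default
       heuristic overcommit this normally succeeds. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t size = 8UL * 1024 * 1024 * 1024;     /* commit 8 GiB ... */
        char *buf = malloc(size);
        if (buf == NULL) {
            puts("commit refused (strict overcommit or limit reached)");
            return 1;
        }
        memset(buf, 1, 64UL * 1024 * 1024);         /* ... touch only 64 MiB */
        puts("committed 8 GiB, allocated ~64 MiB");
        free(buf);
        return 0;
    }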

OOM killer and guaranteed allocation

But if the hope we mentioned before fails, you will meet the OOM killer (out-of-memory killer). The OOM killer kills a process when an allocation cannot be satisfied, and often the victim is not the process that requested too much RAM. Oops.

To avoid the OOM condition you can disable overcommit and limit committed memory to some size, such as "swap size + 90% of RAM size". With overcommit disabled, an application will not be allowed to commit memory if that commit could not be backed by an allocation. Applications smart enough to handle a failed allocation will flush some unused data to disk and reuse previously committed and allocated memory.

In most cases, though, an application whose memory commit (sbrk) fails simply crashes - but at least it happens to the application that tried to allocate the memory, and it has a chance to handle the failure gracefully, i.e. flush its internal caches and then die.
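
A minimal sketch of that graceful path (flush_internal_caches() is a hypothetical placeholder for whatever cleanup a real application would do):

    /* Sketch: check the allocation result and degrade gracefully instead
       of crashing later. */
    #include <stdio.h>
    #include <stdlib.h>

    static void flush_internal_caches(void) {
        /* hypothetical: write out in-memory state, drop internal caches */
        fputs("flushing internal caches before exiting\n", stderr);
    }

    int main(void) {
        size_t size = 1UL * 1024 * 1024 * 1024;     /* 1 GiB, arbitrary */
        void *buf = malloc(size);
        if (buf == NULL) {
            /* With strict overcommit the failure shows up here, instead of
               an OOM kill later at page-fault time. */
            flush_internal_caches();
            return 1;
        }
        /* ... use the memory ... */
        free(buf);
        return 0;
    }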

Tuning the system

To tune VM behaviour, you may need to adjust the following sysctl options:

vm.swappiness - controls the swapping tendency:
0 - avoid swapping as much as possible
100 - swap whenever there is a choice
vm.dirty_ratio - controls the percentage of memory that may be filled with dirty pages before flushing to disk starts
vm.overcommit_memory - controls overcommitting (0 - heuristic overcommit, the default; 1 - always overcommit; 2 - strict accounting, overcommit disabled)
vm.overcommit_ratio - controls the commit factor under strict accounting; the commit limit is SWAP_size + <factor>% * RAM_size
vm.min_free_kbytes - the minimum amount of free memory the kernel tries to keep
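
These knobs are plain files under /proc/sys/vm/ and are normally set with the sysctl tool or /etc/sysctl.conf. A small sketch that just reads the current values:

    /* Sketch: print the current values of the sysctl knobs listed above. */
    #include <stdio.h>

    static void show(const char *path) {
        char value[64];
        FILE *f = fopen(path, "r");
        if (f == NULL)
            return;
        if (fgets(value, sizeof value, f))
            printf("%s = %s", path, value);
        fclose(f);
    }

    int main(void) {
        show("/proc/sys/vm/swappiness");
        show("/proc/sys/vm/dirty_ratio");
        show("/proc/sys/vm/overcommit_memory");
        show("/proc/sys/vm/overcommit_ratio");
        show("/proc/sys/vm/min_free_kbytes");
        return 0;
    }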

Artemy Kapitula, software developer at Mail.RU Cloud Solutions
artemy.kapitula@gmail.com