Overview

An O/S provides abstraction of hardware resources, including

Virtualization: resource management, isolation, sharing
Interface: system call, process abstraction

Some features includes

Concurrency: synchronization
Persistence: file system

L1

OS Themes

Virtualization: sharing physical resources with a powerful interface
Concurrency: Coordinate multiple activities
Persistence: data to survive crashes

Limits direct execution

Interrupts

Hardware singal that causes CPU to jump to interrupt handlers.

Software interrupt (provide service, handle errors)
Hardware interrupt (regain control)

Startup process

Runs BIOS
BIOS calls entry point of kernel
Kernel initializes internal data
Creates first process (init)
Dispatch to init
Wait for events (OS is event driven)

Process

Abstraction for a program in execution.

Memory layout:

top of memory
OS
Stack
Heap
Static (global data)
Code

A process contains

An address space
Program counter
GP registers
OS resources
A kernel thread and stack
PID

PCB: OS data to store the above info, and

state (ready, running, blocks)
statistics

OS has a stat queues for each state.

New process

Creates PCB, allocate memory
Load executable
Dispatch process (to scheduler)

Context switch

Switch CPU to another process, by saving the current and restore the new process.

On yield, system call, or timer interrupt.

Zombies

When a process exists, most of its resources are deallocated, expect some OS data structures (for example, for its parents).

API

fork: creates a new PCB and copies the address space. Returns twice.
exec: replaces current address space with executable.

L2

Function call

Save caller-saved registers
Pass parameters (via registers/stack)
CALL
The return value is in ax. Restore current context.

Interrupts

Caused by hardware or software. For events, errors, traps.

Jumps to interrupt handler (kernel mode).

Priviledged instructions

IO
Memory management
Mode bits
Halt

System calls

User program calls C library function
C library function passes arguments to OS, and calls INT or SYSCALL
Syscall handler receives arguments, and performs operation.

Need to verify passed arguments.

Return result in ax.

Tracing: using strace and ptrace().

Dispatch

A jump table sys_call_table that points to functions.

Pass up to 6 parameters bx, cx, dx, si, di, bp.

Process

Process is independent if cannot interact with other processes. Otherwise cooperating.

Processes are inefficient if communication is required.

Process includes: address space, execution state, OS resources.

Threads

Threads shares address space and OS resources. Lighter weight and faster switches. Defined as a single control flow through a program.

Kernel-level thread: managed by kernel (considered by the scheduler)

User-level threads: managed by user library, smaller overhead. However, poor scheduling decisions.

Hybird: associate user-level threads with kernel-level threads.

IPC

Shared memory
Massage passing (direct, by process or indirect, by ports)

L3

Synchronization

Enforce single use
Control order of execution

Requirements:

Mutal exclusion
Progress: If no threads in critical section, other threads that wants to enter CS must be allowed.
Bounded waiting: there is a limit of time other threads can enter CS before this thread can.
Performance: overhead

Peterson’s Algorithm

int turn;
int flags[2];

{
    flag[id] = true;
    turn = id;
    while (turn == 1 - id && flag[1 - id]) {
        /* not my turn */
    }
    /* critical section */
    flag[id] = false;
}

Bakery Algorithm

Each customer gets some number.
Customer with lowest number is served next. In case a tie, thread with lower id enters.

Semaphores

An integer variable.

Operations:

P: wait: Blocks (if count is 0), and decrement
V: signal: Increment

When max count is 1, is a lock (mutex). Could have an owner for error checking.

Atomic instruction

int TAS(int *lock) {
    if (*lock == 0) {
        *lock = 1;
        return 0;
    }
    return 1;
}

Or CAS.

Spinlock

void spin_lock(int *lock) {
    while (TAS(lock)) {}
}

void spin_unlock(int *lock) {
    *lock = 0;
}

Sleep locks

blocks while waiting. Managed by kernel (wait queues).

Shared memory

Stack is private (unless shared explicitly)
Global variable and static objects are shared
Dynamic object and heap are shared

L4

Conditional variables

Conditions must check outside semaphores.
Checking requires mutex.

Pattern: check condition, block and release mutex, re-acquire mutex.

pthread_cond_wait(&cv, &mutex)
pthread_cond_signal(&cv)
pthread_cond_broadcast(&cv)

Monitors

ADT with the restriction that only one process can be active within the monitor.

Hoare monitors

signal() immediates switch to waiting thread. The condition that waiter is blocked on is guarenteed to be true.

Mesa monitors

signal() places waiter on ready queue, but signaler continues inside monitor. Must check condition again.

Deadlock are starvation

Threads are deadlocked if every process is waiting for an event that can be caused only by another process in the set.

A thread gets starvation if a thread is waiting indefinitely because other threads are in some way preferred.

L5

Scheduling

Allocation of processor to threads over time. (multiprogramming)

Increase CPU utilization and job throughput by overlapping IO and computation.

Mechanism: thread states, thread queues.

Policies: choosing the next thread to run.

Thread life cycle

Thread repeateadly alternate between computation and IO.

During IO, CPU is not needed.

Goals

Fairness - each thread got fair share of CPU
Avoid starvation
Balance - all parts of the system should be busy

Batch systems:

Throughput - maximize job completed per time
Turnaround time - minimize time between submission and completion
CPU utilization - keep cpu buzy

Interactive system:

Response time - minimize the time between starting request and starting to produce output (delay)
Proportionality - simple task complete quickly

Real-time system:

Meet deadlines
Predicitability

Type of scheduling

Long term scheduling (job scheduling, decides which process from job queue to proceed) - used in batch system, not common

Medium-term scheduling (memory scheduling) - infrequent, decides which process are swapped out to disk.

Short-term scheduling (dispatching) - frequent

Dispatching

(Context switch)

Select next thread from ready queue
Save currently running thread
Restore the next thread

Trigger

Thread blocks / exit
Thread enters ready
Thread yields
Clock interrupt

Type

Non-preemptive: keeps CPU until thread terminates
Preemptive: CPU can be taken from a running thread

FCFS

First come, first served. Non-preemptive.

Choose thread from a FIFO queue.

Long wait time.

Shortest-Job-First

Choose thread with shortest processing time.

Shortest time to complete first: preemptive version

Optimal average wait time.

Round robin

Pre-emptive. Circular ready queue.

Each thread is allowed to run for a time quantum q before being preempted.

Time quantum should be large w.r.t. context switch time.

Priority scheduling

Highest priority job is selected from ready queue

Might cause starvation and priority inversion.

Multi-Level queue scheduling

Multiple ready queues, each queue has own scheduling algorithm, and another scheduler to decide which queue to choose next.

Feedback scheduling

Adjust criteria based on past history.

Boost priority of threads that waited for a long time.
Prefer thread that do not use full quantum.
Boost priority following a user-input.
Adjust expected next-CPU burst.

MLFQ: A number of queues with different priority level.

The scheduler chooses to run jobs with highest priority.
If a job uses entire time slice, its priority gets reduced. Otherwise stays.

Ensure each group receives a proportional share of the CPU.
Priority depend on threads’ own priority and history of groups.
Lottery scheduling - each group is assigned tickets according to its share.

Unix CPU Scheduling

Interactive threads are favored.
More CPU time implies lower priority.
Aging to prevent starvation.
Reschedule every 0.1s and recompute priority every 1s.
MLFQ with RR

Virtual memory

Limited physical memory
Virtualization: process owns infinite memory (illusion)
Isolation: process does not access memory from other process

Goals:

Efficiency
Transparency
Protection and sharing

Give process its own view of memory, and decouple the physical layout.

L6

Address translation

Requirements:

Relocation: programmers does not know what memory are available.
Logical organization: memory is just 1d array, but programmers view memory in seperate parts.
Physical organization: flow of info between memory and disk must be managed.

Address binding (linking):

Programs needs to be loaded into its address space.
Address in program must be translated into real address.

Compile-time

Absiolute address, no relocation possible.

Load-time (static relocation)

Compiler translates symbolic addresses to logical, relocatable address within source file.

Linker traslates address from obj file to logical, absolute address.

Loader translates logical, absolute address to physical address.

Dynamic relocation

Executable file contains logical address for entire program.

Translated to a real physical address during execution.

OS just sets the base (and bound) register.

Problems: external fragmentation (contiguous free space to small).

Solution: compaction.

Placing algorithms:

Best fit: smallest left-over fragment, similar to first-fit
First fit: fastest and most efficient. May leave small fragments
Worst fit: ?
Quick fit: fast allocation

Requires contiguous memory.

Paging

Physical memory are partitioned into equal, fixed-size chunks (frames)
Divide process’s memory into chunk of the same size (pages)
Map pages to frames (no external fragmentation, low internal fragmentation)
Page frame size are restricted into power of 2.

Just need to know reference to page table.

VADDR = VPN | OFF
VPN = VPN1 | VPN2 | ...

PP = PDBR[VPN1][VPN2]...

PTE has special bits, like Dirty, Reference, Valid, …

Paging limitations

Linear page table can be large.

Hierachical page table
Hash page table
Inverted page table

Use 3 page table, one per logical segment.

Base and bound register.

Might be wasteful (sparse memory), external fragmentation.

Use multi-level page table: saves space but more memory access. TLB speeds up.

On larger address space, use hashed page table.
Inverted page tables: maps physical frame to virtual page number.

TLB

Fully-associative hardware cache of recently used PTEs. Tags are virtual page numbers (all of them).

Hardware-loaded (MMU loads PTE into TLB)
Software-loaded (OS loads PTE into TLB)

OS ensures TLB is consistent. Invalidate/reload PTE.

Page faults

When a page is evicted, OS invalidates PTE and store the location on swap file in PTE.
When the evicted page is accessed, traps to OS and OS restores the page from swap.

L7

TLB lookup

TLB lookup the page number of address.
If a matching entry is found, returns the PTE.
Validate PTE protection.
MMU combines physical frame and offset to physcal address.
Reads from addr, returns value to CPU.

TLB does not have PTE

Minor page fault

MMU looks up PTE into TLB, or
Traps to OS, OS loads PTE to TLB

Protection is violated

Protection fault (operation not permitted / invalid)

Traps to OS

Operation not permitted (fault, or advanced vm operation)
Invalid (segmentation fault, or swapped out (major page fault))

Policies

Fetch - when to fetch a page
Placement - where to put page
Replacement - what page to evict

Fetch policy

Pages are evicted when memory is full, and loaded again when referenced (TLB miss). OS allocate a physical frame and reads from disk.

For new process, all pages are invalid and fetched on-demand.

Prepaging: predict future page use.

Placement policy

MMU can translate any virtual-to-physical mapping equally well.

NUMA: local memory is faster.
Cache performance: choose physical pages to minimize cache conflicts.

Replacement policy

Goal: reduce fault rate

Belady’s Algorithm

Evict the page that will not be used for the longest period of time. This is optimal.

Need to use the future perfectly.

FIFO

Evict the one broughted in the longest ago.

The one brought ago might be not frequently used.

LRU

Evict the page that has not been used for the longest time in the path.

The one that is used recently might be frequently used.

CLOCK

Approximation of LRU with a reference bit.

Initially referenced is 0. If a page get referenced then referenced bit become 1.

Maintain a clock hand. If the page has R bit 1, reset it to 0. Otherwise evict the page and move to the next page.

Low overhead, and can utilize hardware bits.

Count-based

Evict the page used least often. LFU
Evict the page used most often. MFU

ARC

Ghost cache.

Scan resistent
Adapative

L8

Paging

Page buffering

Page replacement policy is costly.

Maintain a pool of free page
Run replacement algorithm when pool is too small

Thrashing

The case where OS pages data back and forth.

The system is overcommitted/oversubscribed.

Swapping: write out all pages of a process and suspend it
Out-of-memory watchdog
More memory

CPU utilization

Heavy IO implies low utilization. If page fault is frequent, not a good idea to increase multiprogramming.

Page tables

Put table in physical memory: easy to address.

Put table in VM: cold page table can be aged out, but page table itself requires translation. Thus, do not page outer page table. Need to wire special code and data.

Managing swap

Use raw disk or file.

When to allocate/free swap.

Advanced paging

Share memory: share data, could be same or different vaddr.
Copy-on-Write: defer copy as long as possible. Example: fork.
Memory mapped files: all pages are invalid, and OS transforms memory access to file access.

File systems

Longterm storage, that provides:

Large storage
Persistent
Concurrent access

Requirements:

Store data
Keep track of free space
Keep track of metadata

Files

Create
Write
Read
Seek
Delete
Truncate

General purpose access:

Sequential
Direct

Database access:

Record
Index

Linkds

Hard link: second directory entry identical to the first
Symbolic link: file that holds the path to linked file

Layout

Allocate blocks for file:

Contiguous
Chained
Indexed (direct, single indirect, double indirect, triple indirect)

Very Simple File System

Structure

Superblock
Inode bitmap
Block bitmap
Inode table
Data blocks

Inode:

Metadata
Attributes
Block locations

Inode to data:

Direct blocks
Indirect blocks
Double indirect blocks

Directory:

List of (name, inode) mappings

Hard links: multiple directory entries to same inode

Soft link: data contains path to target

Improvements

Block groups: related files and directories are closed together.

L10

File buffer cache

Programs exhibit locality for read/write. Cache file blocks in memory.

Static partitioning: allocate fixed-size cache
Dynamic partitioning: integrite VM pages and FS pages into unified page cache.

Use a buffer to buffer writes.

Combine multiple writes
Lazy updates
Reduce writes

Speed vs durability

Crash: buffered writes not written
Durability: sync to disk more frequently

Mitigation

Delay writes with specific amount of time
Write-behind (maintain a queue)
NVRAM (battery backed-up RAM)
Log-structured file system

Read-ahead

Predicts next request (prefetching).

Magnetic disk

Disk, cylinder, track, sector.

Seek: move disk arm to correct cylinder
Rotation: wait for sector to rotate under head
Transfer: actual data transfer

Hardware optimization

Track skew: arms move slowly, may miss the next sector outside. Thus skew the track location.
Outer tracks are larger by geometry, so hold more sectors.
Cache: aware of disk geometry, cache the whole track.

OS

Program - File path, offset
File system - Partition, block
IO controller: Disk, sector
Disk: Cylinder, track, sector

SCSI: high-level interface.

File system

Allocation algorithm
Request scheduling

Goals:

Closeness: reduce seek time
Amortization: amortize delay

Fragmentation result in more seeking.

Traverse between data block and meta block.

FFS

Disk partitioned into group of cylinders.

Data blocks in the same file is in the same group.
Files in same dir is in the same group.
Inode is in the same group as blocks.

Free space requirement

Disk must have free space scattered across sylinders.
For large files, break into chunks.
Allocate from a nearby group if preferred group is full.

Disk scheduling

First Come First Serve: No reording
Shortest Seek Time First: Minimizes arm movement, maximize request rate
Scan: Service request in one direction
C-Scan: Scan but only one direction
Look/C-Look: Like Scan but only go as far as last request

Reliability

fsck - check consistency of filesystem (only metadata)

Superblock
Free blocks
Inode state
Inode links
Duplicates
Bad blocks
Directory checks

Very slow

Journaling

Writes to journal first, then writes to actual position. (ext3)

Transaction:

TxBegin
Updated inode
Updated bitmap
Updated data
TxEnd

Note: TxEnd must be written after all data.

Logs can be discarded after committed. Use circular log.

Full journaling requires at least double writes. Journal metadata instead.

Write data
Metadata journal write
Metadata journal commit

L11

Redundancy

RAID

Redundant array of independent disks.

RAID0: Striping
RAID1: Mirror each disk
RAID5: Store pairity on one additional disk

FFS

Files laid out with spatial locality in mind. Read performs well, writes are not well-clustered.

But reads are cached in memory.

LFS

Treat storage as a circular log.

Write throughput improved.
Cache recovery is simpler.

Inodes are scattered over all the disk.

Inode map to find inodes. FS must have some fixed and known location on disk to begin lookup.

TODO:

Crash recovery
Garbage collection

VFS Concept

Abstraction of file and collections, seperate from implementations.

read <-> sys_read <-> vfs_read <-> [file system]_read

SSD

Page: unit of read/program (write).

Block: unit of earse.

Multiple channels so data can be striped across blocks (like RAID5).

Wear leveling:

Always write to new location
Map logical FS block to physical block
Old versions are marked stale

Garbage collection:

Reclaim stale pages by erase.

Flash Translation Layer

Translate logical blocks into physical blocks.
Reduce extra copying to deal with erases.
Implement wear leveling.

FS goals

Translate file name info number
Sequential and random access
Small/large files
Metadata
Crash recovery
Free space
Locality

Modern file systems

ZFS: Pooled storage, integrity, performance.

BTRFS: Copy-on-write

ZFS

RAID-Z1: 2+ disk for storage and 1 for parity. Dynamic stripe width.

Integrity: use checksum

Write caching

ZFS Intent Log (ZIL) on SLOG: writes are sent to ZIL, then flushed to long-term storage.

SLOG device: SSD / flash.

Read caching

Multiple layers
ARC

BTRFS

Blocks with active data are reallocated instead of overwritten.

Allows snapshots: image of entire data, can be restored. (No extra overhead.)

Space efficient: all unchanged data is shared.

Clones: writable snapshots.

Disk blocks managed in extents, less fragmentation.
Changes accumulated in memory and written back. Superblock is modified to point to the new checkpoint.
Needs defragmentation.
Good concurrency: traverse using read lock. If write lock is needed, COW the block.
Multiple device support.

Trees

Subvolumn: user-visible files and dirs
Extent: tracks allocated extent items (free-space map)
Checksum: One checksum item per allocated extent

Snapshot is just a new subvolume that shares data with other subvolumes.

L12

Non-deadlock bugs

Atomicity violation bugs (++ and --)
Order violation bugs (wrong order assumptions)

Deadlocks

Mutal blocking of a set of threads.

Communication deadlocks
Resource deadlocks

Deadlock prevention

Necessary conditions

Mutal exclusion

Only one process may use a resource at a time.

Prevention: use lock-free operations

int CAS(int *addr, int expected, int new);

void AtomicAdd(int *val, int a) {
    do {
        int old = *val;
    } while (CAS(val, old, old + a) == 0);
}

Hold-and-wait

A process may hold allocated resource while awaiting assignment of others.

Prevention:

Either process gets all resource, or none of them. (Long wait, might not know resources a priori.)
trylock() (Might result in livelock.)

No preemption

No resource can be forcibly removed from a process holding it.

Prevention:

Allow forcibly remove of resource, might require rollback

Sufficient conditions

Circular wait

A closed chain of processes exists, such that each process holds some resources needed by the next process.

Prevention:

Ensure total / partial order

Deadlock avoidance

Allows first three conditions, but ensures circular wait cannot occur.

Do not start a process if total resources required exceeds resources available.
Do not grant individual resource request if future paths results in deadlock.

Restrictions

Maximum resource requirements must be known a priori.
Process must be independent.
There must be a fixed number of resources.

Banker’s algorithm

Each thread states its maximum resource requirements, acquires and releases resources incrementally.

Runtime system delays granting some requests to ensure the system never deadlocks.

For every request:

Check if the resource can be granted.
Assume request is granted, check if the new state is safe.

A state is safe if there exists some sequence which every request is successful.

Deadlock detection

Trigger:

On every allocation request
Fixed period
System utilization drops below a threshold

Using a resource allocation graph:

Nodes are processes and resources
Arcs from resource to process represent allocations
Arcs from process to resource represent ungranted requests

Note: not valid when there are multiple instances of same resource.

Deadlock recovery

Kill all deadlocked processes
Back up and restart
Selectively kill process
Selectively preempt

Ostrich Algorithm

Ignore the problem and hope that it doesn’t happen often.

Modern O/S virtualizes physical resources.
Only dead lock on logical resources.

Communication deadlocks

Example: packet loss

Solution: timeout, resend, protocol

Transaction and Atomicity

Transaction

A collection of operations that performs a single logical function.

Either commit (successful, cannot be undone), or abort (rollbacked).

Logs are time-consuming and large. Use checkpoints instead.

Concurrent tranactions

All transactions are protected with a mutex.
Allow overlap operations, as long as they don’t conflict.

Conflict if writes to the same data. Order does not matter.

Two-phase locking

Each data has their own locks.

Each transaction has growing (acquire only) and shrinking (release only) phase.

Use abort and retry if any lock is unavailable.

Timestamp

Each transaction gets unique timestamp, and each data has two timestamp:

Largest timestamp of transactions that reads.
Largest timestamp of transactions that writes.

Read: if tranaction’s timestamp is earlier than last write, then needs to restart.

Write: if transaction’s timestamp is earlier than last access, then needs to restart.

Overview

L1

OS Themes

Interrupts

Startup process

Process

New process

Context switch

Zombies

API

L2

Function call

Interrupts

Priviledged instructions

System calls

Dispatch

Process

Threads

IPC

L3

Synchronization

Peterson’s Algorithm

Bakery Algorithm

Semaphores

Atomic instruction

Spinlock

Sleep locks

Shared memory

L4

Conditional variables

Monitors

Hoare monitors

Mesa monitors

Deadlock are starvation

L5

Scheduling

Thread life cycle

Goals

Type of scheduling

Dispatching

Trigger

Type

FCFS

Shortest-Job-First

Round robin

Priority scheduling

Multi-Level queue scheduling

Feedback scheduling

Fair share scheduling

Unix CPU Scheduling

Virtual memory

L6

Address translation

Paging

Paging limitations

TLB

Page faults

L7

TLB lookup

Policies

Fetch policy

Placement policy

Replacement policy

Belady’s Algorithm

FIFO

LRU

CLOCK

Count-based

ARC

L8

Paging

Page buffering

Thrashing

CPU utilization

Page tables

Managing swap

Advanced paging

File systems

Files

Directory