Overview

An O/S provides abstraction of hardware resources, including

  • Virtualization: resource management, isolation, sharing
  • Interface: system call, process abstraction

Some features includes

  • Concurrency: synchronization
  • Persistence: file system

L1

OS Themes

  • Virtualization: sharing physical resources with a powerful interface
  • Concurrency: Coordinate multiple activities
  • Persistence: data to survive crashes

Limits direct execution

Interrupts

Hardware singal that causes CPU to jump to interrupt handlers.

  • Software interrupt (provide service, handle errors)
  • Hardware interrupt (regain control)

Startup process

  1. Runs BIOS
  2. BIOS calls entry point of kernel
  3. Kernel initializes internal data
  4. Creates first process (init)
  5. Dispatch to init
  6. Wait for events (OS is event driven)

Process

Abstraction for a program in execution.

Memory layout:

  • top of memory
  • OS
  • Stack
  • Heap
  • Static (global data)
  • Code

A process contains

  • An address space
  • Program counter
  • GP registers
  • OS resources
  • A kernel thread and stack
  • PID

PCB: OS data to store the above info, and

  • state (ready, running, blocks)
  • statistics

OS has a stat queues for each state.

New process

  1. Creates PCB, allocate memory
  2. Load executable
  3. Dispatch process (to scheduler)

Context switch

Switch CPU to another process, by saving the current and restore the new process.

On yield, system call, or timer interrupt.

Zombies

When a process exists, most of its resources are deallocated, expect some OS data structures (for example, for its parents).

API

  • fork: creates a new PCB and copies the address space. Returns twice.
  • exec: replaces current address space with executable.

L2

Function call

  1. Save caller-saved registers
  2. Pass parameters (via registers/stack)
  3. CALL
  4. The return value is in ax. Restore current context.

Interrupts

Caused by hardware or software. For events, errors, traps.

Jumps to interrupt handler (kernel mode).

Priviledged instructions

  • IO
  • Memory management
  • Mode bits
  • Halt

System calls

  1. User program calls C library function
  2. C library function passes arguments to OS, and calls INT or SYSCALL
  3. Syscall handler receives arguments, and performs operation.

Need to verify passed arguments.

Return result in ax.

Tracing: using strace and ptrace().

Dispatch

A jump table sys_call_table that points to functions.

Pass up to 6 parameters bx, cx, dx, si, di, bp.

Process

Process is independent if cannot interact with other processes. Otherwise cooperating.

Processes are inefficient if communication is required.

Process includes: address space, execution state, OS resources.

Threads

Threads shares address space and OS resources. Lighter weight and faster switches. Defined as a single control flow through a program.

Kernel-level thread: managed by kernel (considered by the scheduler)

User-level threads: managed by user library, smaller overhead. However, poor scheduling decisions.

Hybird: associate user-level threads with kernel-level threads.

IPC

  • Shared memory
  • Massage passing (direct, by process or indirect, by ports)

L3

Synchronization

  • Enforce single use
  • Control order of execution

Requirements:

  • Mutal exclusion
  • Progress: If no threads in critical section, other threads that wants to enter CS must be allowed.
  • Bounded waiting: there is a limit of time other threads can enter CS before this thread can.
  • Performance: overhead

Peterson’s Algorithm

int turn;
int flags[2];

{
    flag[id] = true;
    turn = id;
    while (turn == 1 - id && flag[1 - id]) {
        /* not my turn */
    }
    /* critical section */
    flag[id] = false;
}

Bakery Algorithm

  1. Each customer gets some number.
  2. Customer with lowest number is served next. In case a tie, thread with lower id enters.

Semaphores

An integer variable.

Operations:

  • P: wait: Blocks (if count is 0), and decrement
  • V: signal: Increment

When max count is 1, is a lock (mutex). Could have an owner for error checking.

Atomic instruction

int TAS(int *lock) {
    if (*lock == 0) {
        *lock = 1;
        return 0;
    }
    return 1;
}

Or CAS.

Spinlock

void spin_lock(int *lock) {
    while (TAS(lock)) {}
}

void spin_unlock(int *lock) {
    *lock = 0;
}

Sleep locks

blocks while waiting. Managed by kernel (wait queues).

Shared memory

  • Stack is private (unless shared explicitly)
  • Global variable and static objects are shared
  • Dynamic object and heap are shared

L4

Conditional variables

  • Conditions must check outside semaphores.
  • Checking requires mutex.

Pattern: check condition, block and release mutex, re-acquire mutex.

pthread_cond_wait(&cv, &mutex)
pthread_cond_signal(&cv)
pthread_cond_broadcast(&cv)

Monitors

ADT with the restriction that only one process can be active within the monitor.

Hoare monitors

signal() immediates switch to waiting thread. The condition that waiter is blocked on is guarenteed to be true.

Mesa monitors

signal() places waiter on ready queue, but signaler continues inside monitor. Must check condition again.

Deadlock are starvation

Threads are deadlocked if every process is waiting for an event that can be caused only by another process in the set.

A thread gets starvation if a thread is waiting indefinitely because other threads are in some way preferred.

L5

Scheduling

Allocation of processor to threads over time. (multiprogramming)

  • Increase CPU utilization and job throughput by overlapping IO and computation.

Mechanism: thread states, thread queues.

Policies: choosing the next thread to run.

Thread life cycle

Thread repeateadly alternate between computation and IO.

During IO, CPU is not needed.

Goals

  • Fairness - each thread got fair share of CPU
  • Avoid starvation
  • Balance - all parts of the system should be busy

Batch systems:

  • Throughput - maximize job completed per time
  • Turnaround time - minimize time between submission and completion
  • CPU utilization - keep cpu buzy

Interactive system:

  • Response time - minimize the time between starting request and starting to produce output (delay)
  • Proportionality - simple task complete quickly

Real-time system:

  • Meet deadlines
  • Predicitability

Type of scheduling

Long term scheduling (job scheduling, decides which process from job queue to proceed) - used in batch system, not common

Medium-term scheduling (memory scheduling) - infrequent, decides which process are swapped out to disk.

Short-term scheduling (dispatching) - frequent

Dispatching

(Context switch)

  1. Select next thread from ready queue
  2. Save currently running thread
  3. Restore the next thread

Trigger

  1. Thread blocks / exit
  2. Thread enters ready
  3. Thread yields
  4. Clock interrupt

Type

  • Non-preemptive: keeps CPU until thread terminates
  • Preemptive: CPU can be taken from a running thread

FCFS

First come, first served. Non-preemptive.

Choose thread from a FIFO queue.

Long wait time.

Shortest-Job-First

Choose thread with shortest processing time.

  • Shortest time to complete first: preemptive version

Optimal average wait time.

Round robin

Pre-emptive. Circular ready queue.

Each thread is allowed to run for a time quantum q before being preempted.

Time quantum should be large w.r.t. context switch time.

Priority scheduling

Highest priority job is selected from ready queue

Might cause starvation and priority inversion.

Multi-Level queue scheduling

Multiple ready queues, each queue has own scheduling algorithm, and another scheduler to decide which queue to choose next.

Feedback scheduling

Adjust criteria based on past history.

  • Boost priority of threads that waited for a long time.
  • Prefer thread that do not use full quantum.
  • Boost priority following a user-input.
  • Adjust expected next-CPU burst.

MLFQ: A number of queues with different priority level.

  • The scheduler chooses to run jobs with highest priority.
  • If a job uses entire time slice, its priority gets reduced. Otherwise stays.

Fair share scheduling

  • Ensure each group receives a proportional share of the CPU.
  • Priority depend on threads’ own priority and history of groups.
  • Lottery scheduling - each group is assigned tickets according to its share.

Unix CPU Scheduling

  • Interactive threads are favored.
  • More CPU time implies lower priority.
  • Aging to prevent starvation.
  • Reschedule every 0.1s and recompute priority every 1s.
  • MLFQ with RR

Virtual memory

  • Limited physical memory
  • Virtualization: process owns infinite memory (illusion)
  • Isolation: process does not access memory from other process

Goals:

  1. Efficiency
  2. Transparency
  3. Protection and sharing

Give process its own view of memory, and decouple the physical layout.

L6

Address translation

Requirements:

  • Relocation: programmers does not know what memory are available.
  • Logical organization: memory is just 1d array, but programmers view memory in seperate parts.
  • Physical organization: flow of info between memory and disk must be managed.

Address binding (linking):

  • Programs needs to be loaded into its address space.
  • Address in program must be translated into real address.
  1. Compile-time

Absiolute address, no relocation possible.

  1. Load-time (static relocation)

Compiler translates symbolic addresses to logical, relocatable address within source file.

Linker traslates address from obj file to logical, absolute address.

Loader translates logical, absolute address to physical address.

  1. Dynamic relocation

Executable file contains logical address for entire program.

Translated to a real physical address during execution.

OS just sets the base (and bound) register.

Problems: external fragmentation (contiguous free space to small).

Solution: compaction.

Placing algorithms:

  • Best fit: smallest left-over fragment, similar to first-fit
  • First fit: fastest and most efficient. May leave small fragments
  • Worst fit: ?
  • Quick fit: fast allocation

Requires contiguous memory.

Paging

  • Physical memory are partitioned into equal, fixed-size chunks (frames)
  • Divide process’s memory into chunk of the same size (pages)
  • Map pages to frames (no external fragmentation, low internal fragmentation)
  • Page frame size are restricted into power of 2.

Just need to know reference to page table.

VADDR = VPN | OFF
VPN = VPN1 | VPN2 | ...

PP = PDBR[VPN1][VPN2]...

PTE has special bits, like Dirty, Reference, Valid, …

Paging limitations

Linear page table can be large.

  • Hierachical page table
  • Hash page table
  • Inverted page table

Use 3 page table, one per logical segment.

Base and bound register.

Might be wasteful (sparse memory), external fragmentation.

Use multi-level page table: saves space but more memory access. TLB speeds up.

  • On larger address space, use hashed page table.
  • Inverted page tables: maps physical frame to virtual page number.

TLB

Fully-associative hardware cache of recently used PTEs. Tags are virtual page numbers (all of them).

  • Hardware-loaded (MMU loads PTE into TLB)
  • Software-loaded (OS loads PTE into TLB)

OS ensures TLB is consistent. Invalidate/reload PTE.

Page faults

  1. When a page is evicted, OS invalidates PTE and store the location on swap file in PTE.

  2. When the evicted page is accessed, traps to OS and OS restores the page from swap.

L7

TLB lookup

  1. TLB lookup the page number of address.
  2. If a matching entry is found, returns the PTE.
  3. Validate PTE protection.
  4. MMU combines physical frame and offset to physcal address.
  5. Reads from addr, returns value to CPU.
  • TLB does not have PTE

Minor page fault

  1. MMU looks up PTE into TLB, or
  2. Traps to OS, OS loads PTE to TLB
  • Protection is violated

Protection fault (operation not permitted / invalid)

Traps to OS

  1. Operation not permitted (fault, or advanced vm operation)
  2. Invalid (segmentation fault, or swapped out (major page fault))

Policies

  • Fetch - when to fetch a page
  • Placement - where to put page
  • Replacement - what page to evict

Fetch policy

Pages are evicted when memory is full, and loaded again when referenced (TLB miss). OS allocate a physical frame and reads from disk.

For new process, all pages are invalid and fetched on-demand.

Prepaging: predict future page use.

Placement policy

MMU can translate any virtual-to-physical mapping equally well.

  • NUMA: local memory is faster.
  • Cache performance: choose physical pages to minimize cache conflicts.

Replacement policy

Goal: reduce fault rate

Belady’s Algorithm

Evict the page that will not be used for the longest period of time. This is optimal.

Need to use the future perfectly.

FIFO

Evict the one broughted in the longest ago.

The one brought ago might be not frequently used.

LRU

Evict the page that has not been used for the longest time in the path.

The one that is used recently might be frequently used.

CLOCK

Approximation of LRU with a reference bit.

Initially referenced is 0. If a page get referenced then referenced bit become 1.

Maintain a clock hand. If the page has R bit 1, reset it to 0. Otherwise evict the page and move to the next page.

Low overhead, and can utilize hardware bits.

Count-based

  • Evict the page used least often. LFU
  • Evict the page used most often. MFU

ARC

Ghost cache.

  • Scan resistent
  • Adapative

L8

Paging

Page buffering

Page replacement policy is costly.

  • Maintain a pool of free page
  • Run replacement algorithm when pool is too small

Thrashing

The case where OS pages data back and forth.

The system is overcommitted/oversubscribed.

  • Swapping: write out all pages of a process and suspend it
  • Out-of-memory watchdog
  • More memory

CPU utilization

Heavy IO implies low utilization. If page fault is frequent, not a good idea to increase multiprogramming.

Page tables

Put table in physical memory: easy to address.

Put table in VM: cold page table can be aged out, but page table itself requires translation. Thus, do not page outer page table. Need to wire special code and data.

Managing swap

Use raw disk or file.

When to allocate/free swap.

Advanced paging

  • Share memory: share data, could be same or different vaddr.
  • Copy-on-Write: defer copy as long as possible. Example: fork.
  • Memory mapped files: all pages are invalid, and OS transforms memory access to file access.

File systems

Longterm storage, that provides:

  • Large storage
  • Persistent
  • Concurrent access

Requirements:

  • Store data
  • Keep track of free space
  • Keep track of metadata

Files

  • Create
  • Write
  • Read
  • Seek
  • Delete
  • Truncate

General purpose access:

  • Sequential
  • Direct

Database access:

  • Record
  • Index

Directory

Hierarchy of directory /usr/local/

  • Search entry
  • Create entry
  • Delete entry
  • List
  • Update

Linkds

  • Hard link: second directory entry identical to the first
  • Symbolic link: file that holds the path to linked file

Layout

Allocate blocks for file:

  • Contiguous
  • Chained
  • Indexed (direct, single indirect, double indirect, triple indirect)

Very Simple File System

Structure

  • Superblock
  • Inode bitmap
  • Block bitmap
  • Inode table
  • Data blocks

Inode:

  • Metadata
  • Attributes
  • Block locations

Inode to data:

  • Direct blocks
  • Indirect blocks
  • Double indirect blocks

Directory:

  • List of (name, inode) mappings

Hard links: multiple directory entries to same inode

Soft link: data contains path to target

Improvements

Block groups: related files and directories are closed together.

L10

File buffer cache

Programs exhibit locality for read/write. Cache file blocks in memory.

  • Static partitioning: allocate fixed-size cache
  • Dynamic partitioning: integrite VM pages and FS pages into unified page cache.

Use a buffer to buffer writes.

  • Combine multiple writes
  • Lazy updates
  • Reduce writes

Speed vs durability

  • Crash: buffered writes not written
  • Durability: sync to disk more frequently

Mitigation

  • Delay writes with specific amount of time
  • Write-behind (maintain a queue)
  • NVRAM (battery backed-up RAM)
  • Log-structured file system

Read-ahead

  • Predicts next request (prefetching).

Magnetic disk

Disk, cylinder, track, sector.

  • Seek: move disk arm to correct cylinder
  • Rotation: wait for sector to rotate under head
  • Transfer: actual data transfer

Hardware optimization

  • Track skew: arms move slowly, may miss the next sector outside. Thus skew the track location.
  • Outer tracks are larger by geometry, so hold more sectors.
  • Cache: aware of disk geometry, cache the whole track.

OS

  • Program - File path, offset
  • File system - Partition, block
  • IO controller: Disk, sector
  • Disk: Cylinder, track, sector

SCSI: high-level interface.

File system

  • Allocation algorithm
  • Request scheduling

Goals:

  • Closeness: reduce seek time
  • Amortization: amortize delay

Fragmentation result in more seeking.

Traverse between data block and meta block.

FFS

Disk partitioned into group of cylinders.

  • Data blocks in the same file is in the same group.
  • Files in same dir is in the same group.
  • Inode is in the same group as blocks.

Free space requirement

  • Disk must have free space scattered across sylinders.
  • For large files, break into chunks.
  • Allocate from a nearby group if preferred group is full.

Disk scheduling

  • First Come First Serve: No reording
  • Shortest Seek Time First: Minimizes arm movement, maximize request rate
  • Scan: Service request in one direction
  • C-Scan: Scan but only one direction
  • Look/C-Look: Like Scan but only go as far as last request

Reliability

fsck - check consistency of filesystem (only metadata)

  1. Superblock
  2. Free blocks
  3. Inode state
  4. Inode links
  5. Duplicates
  6. Bad blocks
  7. Directory checks

Very slow

Journaling

Writes to journal first, then writes to actual position. (ext3)

Transaction:

  1. TxBegin
  2. Updated inode
  3. Updated bitmap
  4. Updated data
  5. TxEnd

Note: TxEnd must be written after all data.

Logs can be discarded after committed. Use circular log.

Full journaling requires at least double writes. Journal metadata instead.

  1. Write data
  2. Metadata journal write
  3. Metadata journal commit

L11

Redundancy

RAID

Redundant array of independent disks.

  • RAID0: Striping
  • RAID1: Mirror each disk
  • RAID5: Store pairity on one additional disk

FFS

Files laid out with spatial locality in mind. Read performs well, writes are not well-clustered.

But reads are cached in memory.

LFS

Treat storage as a circular log.

  • Write throughput improved.
  • Cache recovery is simpler.

Inodes are scattered over all the disk.

Inode map to find inodes. FS must have some fixed and known location on disk to begin lookup.

TODO:

  • Crash recovery
  • Garbage collection

VFS Concept

Abstraction of file and collections, seperate from implementations.

read <-> sys_read <-> vfs_read <-> [file system]_read

SSD

Page: unit of read/program (write).

Block: unit of earse.

Multiple channels so data can be striped across blocks (like RAID5).

Wear leveling:

  • Always write to new location
  • Map logical FS block to physical block
  • Old versions are marked stale

Garbage collection:

  • Reclaim stale pages by erase.

Flash Translation Layer

  • Translate logical blocks into physical blocks.
  • Reduce extra copying to deal with erases.
  • Implement wear leveling.

FS goals

  • Translate file name info number
  • Sequential and random access
  • Small/large files
  • Metadata
  • Crash recovery
  • Free space
  • Locality

Modern file systems

ZFS: Pooled storage, integrity, performance.

BTRFS: Copy-on-write

ZFS

RAID-Z1: 2+ disk for storage and 1 for parity. Dynamic stripe width.

Integrity: use checksum

Write caching

ZFS Intent Log (ZIL) on SLOG: writes are sent to ZIL, then flushed to long-term storage.

SLOG device: SSD / flash.

Read caching

  • Multiple layers
  • ARC

BTRFS

Blocks with active data are reallocated instead of overwritten.

  • Allows snapshots: image of entire data, can be restored. (No extra overhead.)

Space efficient: all unchanged data is shared.

Clones: writable snapshots.

  • Disk blocks managed in extents, less fragmentation.

  • Changes accumulated in memory and written back. Superblock is modified to point to the new checkpoint.

  • Needs defragmentation.

  • Good concurrency: traverse using read lock. If write lock is needed, COW the block.

  • Multiple device support.

Trees

  • Subvolumn: user-visible files and dirs
  • Extent: tracks allocated extent items (free-space map)
  • Checksum: One checksum item per allocated extent

Snapshot is just a new subvolume that shares data with other subvolumes.

L12

Non-deadlock bugs

  • Atomicity violation bugs (++ and --)
  • Order violation bugs (wrong order assumptions)

Deadlocks

Mutal blocking of a set of threads.

  • Communication deadlocks
  • Resource deadlocks

Deadlock prevention

Necessary conditions

  1. Mutal exclusion

Only one process may use a resource at a time.

Prevention: use lock-free operations

int CAS(int *addr, int expected, int new);

void AtomicAdd(int *val, int a) {
    do {
        int old = *val;
    } while (CAS(val, old, old + a) == 0);
}
  1. Hold-and-wait

A process may hold allocated resource while awaiting assignment of others.

Prevention:

  • Either process gets all resource, or none of them. (Long wait, might not know resources a priori.)
  • trylock() (Might result in livelock.)
  1. No preemption

No resource can be forcibly removed from a process holding it.

Prevention:

Allow forcibly remove of resource, might require rollback

Sufficient conditions

  1. Circular wait

A closed chain of processes exists, such that each process holds some resources needed by the next process.

Prevention:

Ensure total / partial order

Deadlock avoidance

Allows first three conditions, but ensures circular wait cannot occur.

  1. Do not start a process if total resources required exceeds resources available.
  2. Do not grant individual resource request if future paths results in deadlock.

Restrictions

  1. Maximum resource requirements must be known a priori.
  2. Process must be independent.
  3. There must be a fixed number of resources.

Banker’s algorithm

Each thread states its maximum resource requirements, acquires and releases resources incrementally.

Runtime system delays granting some requests to ensure the system never deadlocks.

For every request:

  1. Check if the resource can be granted.
  2. Assume request is granted, check if the new state is safe.

A state is safe if there exists some sequence which every request is successful.

Deadlock detection

Trigger:

  1. On every allocation request
  2. Fixed period
  3. System utilization drops below a threshold

Using a resource allocation graph:

  • Nodes are processes and resources
  • Arcs from resource to process represent allocations
  • Arcs from process to resource represent ungranted requests

Note: not valid when there are multiple instances of same resource.

Deadlock recovery

  • Kill all deadlocked processes
  • Back up and restart
  • Selectively kill process
  • Selectively preempt

Ostrich Algorithm

Ignore the problem and hope that it doesn’t happen often.

  • Modern O/S virtualizes physical resources.
  • Only dead lock on logical resources.

Communication deadlocks

Example: packet loss

Solution: timeout, resend, protocol

Transaction and Atomicity

Transaction

A collection of operations that performs a single logical function.

Either commit (successful, cannot be undone), or abort (rollbacked).

Logs are time-consuming and large. Use checkpoints instead.

Concurrent tranactions

  • All transactions are protected with a mutex.
  • Allow overlap operations, as long as they don’t conflict.

Conflict if writes to the same data. Order does not matter.

Two-phase locking

Each data has their own locks.

Each transaction has growing (acquire only) and shrinking (release only) phase.

Use abort and retry if any lock is unavailable.

Timestamp

Each transaction gets unique timestamp, and each data has two timestamp:

  • Largest timestamp of transactions that reads.
  • Largest timestamp of transactions that writes.

Read: if tranaction’s timestamp is earlier than last write, then needs to restart.

Write: if transaction’s timestamp is earlier than last access, then needs to restart.