CSC369 Review
Overview
An O/S provides abstraction of hardware resources, including
- Virtualization: resource management, isolation, sharing
- Interface: system call, process abstraction
Some features include
- Concurrency: synchronization
- Persistence: file system
L1
OS Themes
- Virtualization: sharing physical resources with a powerful interface
- Concurrency: Coordinate multiple activities
- Persistence: data to survive crashes
Limited direct execution
Interrupts
Hardware signal that causes the CPU to jump to an interrupt handler.
- Software interrupt (provide service, handle errors)
- Hardware interrupt (regain control)
Startup process
- Runs BIOS
- BIOS calls entry point of kernel
- Kernel initializes internal data
- Creates first process (init)
- Dispatch to init
- Wait for events (OS is event driven)
Process
Abstraction for a program in execution.
Memory layout (from top of memory down):
- OS
- Stack
- Heap
- Static (global data)
- Code
A process contains
- An address space
- Program counter
- GP registers
- OS resources
- A kernel thread and stack
- PID
PCB: OS data to store the above info, and
- state (ready, running, blocked)
- statistics
The OS has a queue for each state.
New process
- Create PCB, allocate memory
- Load executable
- Dispatch process (to scheduler)
Context switch
Switch the CPU to another process by saving the current process's state and restoring the new process's state.
On yield, system call, or timer interrupt.
Zombies
When a process exits, most of its resources are deallocated, except some OS data structures (for example, the exit status kept for its parent).
API
- fork: creates a new PCB and copies the address space. Returns twice.
- exec: replaces current address space with executable.
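A minimal sketch of the fork/exec pattern (assuming POSIX; "ls" is just an example program):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();                   /* returns twice: 0 in child, child PID in parent */
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);  /* replaces the child's address space */
        perror("execlp");                 /* reached only if exec fails */
        return 1;
    }
    waitpid(pid, NULL, 0);                /* reap the child so it does not stay a zombie */
    return 0;
}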
L2
Function call
- Save caller-saved registers
- Pass parameters (via registers/stack)
- CALL
- The return value is in ax. Restore current context.
Interrupts
Caused by hardware or software. For events, errors, traps.
Jumps to interrupt handler (kernel mode).
Privileged instructions
- IO
- Memory management
- Mode bits
- Halt
System calls
- User program calls C library function
- C library function passes arguments to OS, and calls INT or SYSCALL
- Syscall handler receives arguments, verifies them, and performs the operation.
- Result is returned in ax.
Tracing: use strace and ptrace().
Dispatch
A jump table, sys_call_table, points to the handler functions.
Up to 6 parameters are passed in bx, cx, dx, si, di, bp.
Process
A process is independent if it cannot interact with other processes; otherwise it is cooperating.
Cooperating processes are inefficient when communication is required.
Process includes: address space, execution state, OS resources.
Threads
Threads share an address space and OS resources. Lighter weight and faster to switch. A thread is a single control flow through a program.
Kernel-level thread: managed by kernel (considered by the scheduler)
User-level threads: managed by a user library, smaller overhead. However, the kernel is unaware of them and can make poor scheduling decisions.
Hybrid: associate user-level threads with kernel-level threads.
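A small pthread example (on Linux, pthreads are kernel-level threads; both threads share globals and the heap):

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}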
IPC
- Shared memory
- Message passing (direct, by process, or indirect, by ports)
L3
Synchronization
- Enforce single use
- Control order of execution
Requirements:
- Mutual exclusion
- Progress: if no thread is in the critical section, a thread that wants to enter the CS must eventually be allowed to.
- Bounded waiting: there is a bound on how many times other threads can enter the CS before this thread can.
- Performance: overhead
Peterson’s Algorithm
int turn;
int flag[2];

void enter_and_leave(int id) {   /* id is 0 or 1 */
    flag[id] = 1;                /* announce intent to enter */
    turn = 1 - id;               /* defer to the other thread */
    while (turn == 1 - id && flag[1 - id]) {
        /* not my turn */
    }
    /* critical section */
    flag[id] = 0;
}
Bakery Algorithm
- Each customer gets some number.
- Customer with the lowest number is served next. In case of a tie, the thread with the lower id enters.
Semaphores
An integer variable.
Operations:
- P (wait): blocks if the count is 0, then decrements
- V (signal): increments
When the max count is 1, it is a lock (mutex). Could have an owner for error checking.
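A sketch of P/V using POSIX semaphores (sem_wait/sem_post are the P/V operations; an initial count of 1 makes it a lock):

#include <semaphore.h>

sem_t sem;

void critical_section(void) {
    sem_wait(&sem);    /* P: blocks while the count is 0, then decrements */
    /* ... critical section ... */
    sem_post(&sem);    /* V: increments, waking a waiter if any */
}

int main(void) {
    sem_init(&sem, 0, 1);   /* initial count 1 => binary semaphore (mutex) */
    critical_section();
    sem_destroy(&sem);
    return 0;
}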
Atomic instruction
/* Test-and-set: the hardware executes this atomically. */
int TAS(int *lock) {
    if (*lock == 0) {
        *lock = 1;    /* acquired */
        return 0;
    }
    return 1;         /* already held */
}
Or CAS (compare-and-swap).
Spinlock
void spin_lock(int *lock) {
    while (TAS(lock)) { /* spin until the lock is free */ }
}
void spin_unlock(int *lock) {
    *lock = 0;    /* release */
}
Sleep locks block while waiting. Managed by the kernel (wait queues).
Shared memory
- Stack is private (unless shared explicitly)
- Global variables and static objects are shared
- Dynamic objects and heap memory are shared
L4
Condition variables
- A condition must be checked separately; semaphore operations alone cannot express it.
- Checking the condition requires a mutex.
Pattern: check the condition; if it does not hold, block and release the mutex; re-acquire the mutex when woken.
pthread_cond_wait(&cv, &mutex)
pthread_cond_signal(&cv)
pthread_cond_broadcast(&cv)
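The pattern in code (Mesa semantics, so the condition is rechecked in a loop; the ready flag is a stand-in for any shared condition):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int ready = 0;    /* hypothetical shared condition */

void consumer(void) {
    pthread_mutex_lock(&mutex);
    while (!ready)                        /* re-check after every wakeup */
        pthread_cond_wait(&cv, &mutex);   /* releases mutex while blocked, re-acquires on return */
    /* ... use the shared state ... */
    pthread_mutex_unlock(&mutex);
}

void producer(void) {
    pthread_mutex_lock(&mutex);
    ready = 1;
    pthread_cond_signal(&cv);             /* wake one waiter */
    pthread_mutex_unlock(&mutex);
}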
Monitors
ADT with the restriction that only one process can be active within the monitor.
Hoare monitors
signal() immediately switches to the waiting thread. The condition the waiter is blocked on is guaranteed to be true.
Mesa monitors
signal() places the waiter on the ready queue, but the signaler continues inside the monitor. The waiter must check the condition again.
Deadlock and starvation
A set of threads is deadlocked if every thread in the set is waiting for an event that can be caused only by another thread in the set.
A thread starves if it waits indefinitely because other threads are in some way preferred.
L5
Scheduling
Allocation of processor to threads over time. (multiprogramming)
- Increase CPU utilization and job throughput by overlapping IO and computation.
Mechanism: thread states, thread queues.
Policies: choosing the next thread to run.
Thread life cycle
Threads repeatedly alternate between computation and IO.
During IO, CPU is not needed.
Goals
- Fairness - each thread gets a fair share of the CPU
- Avoid starvation
- Balance - all parts of the system should be busy
Batch systems:
- Throughput - maximize jobs completed per unit time
- Turnaround time - minimize time between submission and completion
- CPU utilization - keep the CPU busy
Interactive system:
- Response time - minimize the time between starting request and starting to produce output (delay)
- Proportionality - simple tasks should complete quickly
Real-time system:
- Meet deadlines
- Predictability
Type of scheduling
Long-term scheduling (job scheduling; decides which process from the job queue to admit) - used in batch systems, not common
Medium-term scheduling (memory scheduling) - infrequent; decides which processes are swapped out to disk.
Short-term scheduling (dispatching) - frequent
Dispatching
(Context switch)
- Select next thread from ready queue
- Save currently running thread
- Restore the next thread
Trigger
- Thread blocks / exit
- Thread enters ready
- Thread yields
- Clock interrupt
Type
- Non-preemptive: a thread keeps the CPU until it blocks or terminates
- Preemptive: CPU can be taken from a running thread
FCFS
First come, first served. Non-preemptive.
Choose thread from a FIFO queue.
Can cause long wait times (convoy effect).
Shortest-Job-First
Choose thread with shortest processing time.
- Shortest time to complete first: preemptive version
Optimal average wait time.
Round robin
Preemptive. Circular ready queue.
Each thread is allowed to run for a time quantum q before being preempted.
Time quantum should be large w.r.t. context switch time.
Priority scheduling
The highest-priority job is selected from the ready queue.
Might cause starvation and priority inversion.
Multi-Level queue scheduling
Multiple ready queues, each queue has own scheduling algorithm, and another scheduler to decide which queue to choose next.
Feedback scheduling
Adjust criteria based on past history.
- Boost the priority of threads that have waited a long time.
- Prefer threads that do not use their full quantum.
- Boost priority following user input.
- Adjust expected next-CPU burst.
MLFQ: A number of queues with different priority level.
- The scheduler chooses to run jobs with highest priority.
- If a job uses its entire time slice, its priority is reduced; otherwise it stays the same.
Fair share scheduling
- Ensure each group receives a proportional share of the CPU.
- Priority depends on the thread's own priority and the group's history.
- Lottery scheduling - each group is assigned tickets according to its share.
Unix CPU Scheduling
- Interactive threads are favored.
- More CPU time implies lower priority.
- Aging to prevent starvation.
- Reschedule every 0.1s and recompute priority every 1s.
- MLFQ with RR
Virtual memory
- Limited physical memory
- Virtualization: process owns infinite memory (illusion)
- Isolation: a process cannot access another process's memory
Goals:
- Efficiency
- Transparency
- Protection and sharing
Give process its own view of memory, and decouple the physical layout.
L6
Address translation
Requirements:
- Relocation: programmers do not know what memory will be available.
- Logical organization: memory is a 1-D array, but programmers view it as separate parts.
- Physical organization: flow of info between memory and disk must be managed.
Address binding (linking):
- A program needs to be loaded into its address space.
- Address in program must be translated into real address.
- Compile-time
Absolute addresses; no relocation possible.
- Load-time (static relocation)
Compiler translates symbolic addresses to logical, relocatable addresses within the source file.
Linker translates addresses from object files to logical, absolute addresses.
Loader translates logical, absolute addresses to physical addresses.
- Dynamic relocation
Executable file contains logical address for entire program.
Translated to a real physical address during execution.
OS just sets the base (and bound) register.
Problems: external fragmentation (contiguous free spaces too small).
Solution: compaction.
Placing algorithms:
- Best fit: smallest left-over fragment, similar to first-fit
- First fit: fastest and most efficient. May leave small fragments
- Worst fit: ?
- Quick fit: fast allocation
Requires contiguous memory.
Paging
- Physical memory is partitioned into equal, fixed-size chunks (frames)
- Divide the process's memory into chunks of the same size (pages)
- Map pages to frames (no external fragmentation, low internal fragmentation)
- Page/frame sizes are restricted to powers of 2.
Translation just needs a reference to the page table (e.g., the page table base register).
VADDR = VPN | OFF
VPN = VPN1 | VPN2 | ...
PP = PDBR[VPN1][VPN2]...
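The split in C, assuming a single-level table with 4 KB pages (PAGE_SHIFT = 12 is an assumption, not from the notes):

#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)

uint64_t vpn(uint64_t vaddr) { return vaddr >> PAGE_SHIFT; }   /* index into page table */
uint64_t off(uint64_t vaddr) { return vaddr & PAGE_MASK; }     /* byte within the page */

uint64_t paddr(uint64_t pfn, uint64_t vaddr) {
    return (pfn << PAGE_SHIFT) | off(vaddr);   /* frame number from PTE + same offset */
}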
PTE has special bits, like Dirty, Reference, Valid, …
Paging limitations
Linear page table can be large.
- Hierarchical page table
- Hash page table
- Inverted page table
Hybrid with segmentation: use 3 page tables, one per logical segment; base and bound registers locate each table.
Might be wasteful (sparse memory); external fragmentation.
Use multi-level page tables: saves space, but more memory accesses. The TLB speeds this up.
- On larger address spaces, use hashed page tables.
- Inverted page tables: map physical frames to virtual page numbers.
TLB
Fully-associative hardware cache of recently used PTEs. Tags are the (full) virtual page numbers.
- Hardware-loaded (MMU loads PTE into TLB)
- Software-loaded (OS loads PTE into TLB)
OS ensures TLB is consistent. Invalidate/reload PTE.
Page faults
- When a page is evicted, the OS invalidates the PTE and stores the page's location in the swap file in the PTE.
- When the evicted page is accessed, the access traps to the OS, which restores the page from swap.
L7
TLB lookup
- TLB looks up the page number of the address.
- If a matching entry is found, it returns the PTE.
- PTE protection is validated.
- MMU combines the physical frame and offset into a physical address.
- Memory is read at that address and the value returned to the CPU.
If the TLB does not have the PTE: minor page fault
- MMU loads the PTE into the TLB (hardware-managed), or
- traps to the OS, which loads the PTE into the TLB (software-managed)
If protection is violated: protection fault (operation not permitted / invalid).
Traps to OS:
- Operation not permitted (fault, or advanced VM operation)
- Invalid (segmentation fault, or swapped out: major page fault)
Policies
- Fetch - when to fetch a page
- Placement - where to put page
- Replacement - what page to evict
Fetch policy
Pages are evicted when memory is full, and loaded again when referenced (TLB miss). The OS allocates a physical frame and reads the page from disk.
For a new process, all pages are invalid and fetched on demand.
Prepaging: predict future page use.
Placement policy
MMU can translate any virtual-to-physical mapping equally well.
- NUMA: local memory is faster.
- Cache performance: choose physical pages to minimize cache conflicts.
Replacement policy
Goal: reduce fault rate
Belady’s Algorithm
Evict the page that will not be used for the longest period of time. This is optimal, but requires perfect knowledge of the future.
FIFO
Evict the page brought in longest ago.
But the oldest page might still be frequently used.
LRU
Evict the page that has not been used for the longest time in the past.
A page used recently is likely to be used again (locality).
CLOCK
Approximation of LRU with a reference bit.
Initially the reference bit is 0; when a page is referenced, its bit is set to 1.
Maintain a clock hand. If the page under the hand has R bit 1, reset it to 0 and advance. Otherwise, evict that page and advance.
Low overhead, and can utilize hardware bits.
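A sketch of the eviction step (ref_bit stands for the hardware reference bit; frames and the hand are assumed data structures):

struct frame { int ref_bit; };

int clock_evict(struct frame *frames, int nframes, int *hand) {
    for (;;) {
        if (frames[*hand].ref_bit) {
            frames[*hand].ref_bit = 0;      /* give a second chance */
            *hand = (*hand + 1) % nframes;
        } else {
            int victim = *hand;             /* R bit is 0: evict this frame */
            *hand = (*hand + 1) % nframes;
            return victim;
        }
    }
}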
Count-based
- LFU: evict the page used least often.
- MFU: evict the page used most often.
ARC
Ghost cache.
- Scan-resistant
- Adaptive
L8
Paging
Page buffering
Running the replacement policy on each fault is costly.
- Maintain a pool of free pages
- Run the replacement algorithm when the pool is too small
Thrashing
The case where OS pages data back and forth.
The system is overcommitted/oversubscribed.
- Swapping: write out all pages of a process and suspend it
- Out-of-memory watchdog
- More memory
CPU utilization
Heavy IO implies low CPU utilization. If page faults are frequent, increasing multiprogramming is not a good idea.
Page tables
Put table in physical memory: easy to address.
Put table in VM: cold page table pages can be aged out, but the page table itself then requires translation. Thus, do not page the outer page table, and wire down special code and data.
Managing swap
Use raw disk or file.
When to allocate/free swap.
Advanced paging
- Shared memory: share data; mappings can be at the same or different virtual addresses.
- Copy-on-write: defer copying as long as possible. Example: fork.
- Memory-mapped files: all pages start invalid, and the OS turns memory accesses into file accesses.
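A sketch of reading a (hypothetical) file through a memory mapping; error checks omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    char first = p[0];     /* first touch faults; the OS reads the file block in */
    (void)first;
    munmap(p, st.st_size);
    close(fd);
    return 0;
}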
File systems
Long-term storage that provides:
- Large storage
- Persistence
- Concurrent access
Requirements:
- Store data
- Keep track of free space
- Keep track of metadata
Files
- Create
- Write
- Read
- Seek
- Delete
- Truncate
General purpose access:
- Sequential
- Direct
Database access:
- Record
- Index
Directory
Hierarchy of directory /usr/local/
- Search entry
- Create entry
- Delete entry
- List
- Update
Links
- Hard link: second directory entry identical to the first
- Symbolic link: file that holds the path to linked file
Layout
Allocate blocks for file:
- Contiguous
- Chained
- Indexed (direct, single indirect, double indirect, triple indirect)
Very Simple File System
Structure
- Superblock
- Inode bitmap
- Block bitmap
- Inode table
- Data blocks
Inode:
- Metadata
- Attributes
- Block locations
Inode to data:
- Direct blocks
- Indirect blocks
- Double indirect blocks
Directory:
- List of (name, inode) mappings
Hard links: multiple directory entries to same inode
Soft link: data contains path to target
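The two link types via POSIX calls (file names are hypothetical):

#include <unistd.h>

int main(void) {
    link("target.txt", "hard.txt");      /* second directory entry for the same inode */
    symlink("target.txt", "soft.txt");   /* new file whose data is the path string */
    return 0;
}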
Improvements
Block groups: related files and directories are placed close together.
L10
File buffer cache
Programs exhibit locality for read/write. Cache file blocks in memory.
- Static partitioning: allocate fixed-size cache
- Dynamic partitioning: integrate VM pages and FS pages into a unified page cache.
Buffer writes in memory to:
- Combine multiple writes
- Lazy updates
- Reduce writes
Speed vs durability
- Crash: buffered writes not written
- Durability: sync to disk more frequently
Mitigation
- Delay writes by a bounded amount of time
- Write-behind (maintain a queue)
- NVRAM (battery backed-up RAM)
- Log-structured file system
Read-ahead
- Predicts next request (prefetching).
Magnetic disk
Disk, cylinder, track, sector.
- Seek: move disk arm to correct cylinder
- Rotation: wait for sector to rotate under head
- Transfer: actual data transfer
Hardware optimization
- Track skew: the head moves slowly between tracks, so sequential sectors would be missed at a track boundary; skew the sector layout across tracks.
- Outer tracks are larger by geometry, so hold more sectors.
- Cache: aware of disk geometry; caches the whole track.
OS
- Program: file path, offset
- File system: partition, block
- IO controller: disk, sector
- Disk: cylinder, track, sector
SCSI: high-level interface.
File system
- Allocation algorithm
- Request scheduling
Goals:
- Closeness: reduce seek time
- Amortization: amortize delay
Fragmentation results in more seeking.
Traversal alternates between data blocks and metadata blocks.
FFS
Disk is partitioned into groups of cylinders.
- Data blocks of the same file are in the same group.
- Files in the same directory are in the same group.
- Inodes are in the same group as their data blocks.
Free space requirement
- Disk must have free space scattered across cylinders.
- For large files, break into chunks.
- Allocate from a nearby group if preferred group is full.
Disk scheduling
- First Come First Served: no reordering
- Shortest Seek Time First: Minimizes arm movement, maximize request rate
- Scan (elevator): service requests in the current direction of arm movement, then reverse
- C-Scan: like Scan, but service requests in only one direction
- Look/C-Look: Like Scan but only go as far as last request
Reliability
fsck - check consistency of filesystem (only metadata)
- Superblock
- Free blocks
- Inode state
- Inode links
- Duplicates
- Bad blocks
- Directory checks
Very slow
Journaling
Writes to journal first, then writes to actual position. (ext3)
Transaction:
- TxBegin
- Updated inode
- Updated bitmap
- Updated data
- TxEnd
Note: TxEnd must be written after all data.
Logs can be discarded after the committed writes reach their final locations. Use a circular log.
Full journaling requires at least double writes. Journal metadata instead.
- Write data
- Metadata journal write
- Metadata journal commit
L11
Redundancy
RAID
Redundant array of independent disks.
- RAID0: Striping
- RAID1: Mirror each disk
- RAID5: Striping with parity (one disk's worth, rotated across the disks)
FFS
Files are laid out with spatial locality in mind. Reads perform well; writes are not well-clustered.
But reads are cached in memory.
LFS
Treat storage as a circular log.
- Write throughput improved.
- Crash recovery is simpler.
Inodes are scattered across the disk.
An inode map locates them. The FS must have some fixed, known location on disk to begin lookup.
TODO:
- Crash recovery
- Garbage collection
VFS Concept
Abstraction of files and file collections, separate from implementations.
read <-> sys_read <-> vfs_read <-> [file system]_read
SSD
Page: unit of read/program (write).
Block: unit of erase.
Multiple channels so data can be striped across blocks (like RAID5).
Wear leveling:
- Always write to new location
- Map logical FS block to physical block
- Old versions are marked stale
Garbage collection:
- Reclaim stale pages by erase.
Flash Translation Layer
- Translate logical blocks into physical blocks.
- Reduce extra copying to deal with erases.
- Implement wear leveling.
FS goals
- Translate file name into file number
- Sequential and random access
- Small/large files
- Metadata
- Crash recovery
- Free space
- Locality
Modern file systems
ZFS: Pooled storage, integrity, performance.
BTRFS: Copy-on-write
ZFS
RAID-Z1: 2+ disks for storage and 1 for parity. Dynamic stripe width.
Integrity: use checksum
Write caching
ZFS Intent Log (ZIL) on SLOG: writes are sent to ZIL, then flushed to long-term storage.
SLOG device: SSD / flash.
Read caching
- Multiple layers
- ARC
BTRFS
Blocks with active data are reallocated instead of overwritten.
- Allows snapshots: image of entire data, can be restored. (No extra overhead.)
Space efficient: all unchanged data is shared.
Clones: writable snapshots.
- Disk blocks managed in extents; less fragmentation.
- Changes accumulate in memory and are written back; the superblock is modified to point to the new checkpoint.
- Needs defragmentation.
- Good concurrency: traverse using read locks. If a write lock is needed, COW the block.
- Multiple device support.
Trees
- Subvolume: user-visible files and dirs
- Extent: tracks allocated extent items (free-space map)
- Checksum: One checksum item per allocated extent
Snapshot is just a new subvolume that shares data with other subvolumes.
L12
Non-deadlock bugs
- Atomicity violation bugs (e.g., interleaved ++ and --)
- Order violation bugs (wrong ordering assumptions)
Deadlocks
Mutual blocking of a set of threads.
- Communication deadlocks
- Resource deadlocks
Deadlock prevention
Necessary conditions
- Mutual exclusion
Only one process may use a resource at a time.
Prevention: use lock-free operations
int CAS(int *addr, int expected, int new);   /* returns nonzero on success */

void AtomicAdd(int *val, int a) {
    int old;
    do {
        old = *val;
    } while (CAS(val, old, old + a) == 0);
}
- Hold-and-wait
A process may hold allocated resource while awaiting assignment of others.
Prevention:
- Either the process gets all resources, or none of them. (Long waits; resources might not be known a priori.)
- trylock() (might result in livelock)
- No preemption
No resource can be forcibly removed from a process holding it.
Prevention:
Allow forcible removal of resources; might require rollback.
Sufficient conditions
- Circular wait
A closed chain of processes exists, such that each process holds some resources needed by the next process.
Prevention:
Ensure a total/partial order on resource acquisition (see the sketch below).
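A sketch of ordering by address (the account struct and fields are hypothetical):

#include <pthread.h>
#include <stdint.h>

struct account { pthread_mutex_t lock; int balance; };

void transfer(struct account *a, struct account *b, int amount) {
    /* Always lock the lower-addressed account first: a fixed global order. */
    struct account *first  = (uintptr_t)a < (uintptr_t)b ? a : b;
    struct account *second = (uintptr_t)a < (uintptr_t)b ? b : a;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    a->balance -= amount;
    b->balance += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}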
Deadlock avoidance
Allows first three conditions, but ensures circular wait cannot occur.
- Do not start a process if its total resource requirements exceed the resources available.
- Do not grant an individual resource request if some future path results in deadlock.
Restrictions
- Maximum resource requirements must be known a priori.
- Processes must be independent.
- There must be a fixed number of resources.
Banker’s algorithm
Each thread states its maximum resource requirements, acquires and releases resources incrementally.
Runtime system delays granting some requests to ensure the system never deadlocks.
For every request:
- Check if the resource can be granted.
- Assume request is granted, check if the new state is safe.
A state is safe if there exists some sequence in which every request succeeds, as in the sketch below.
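A sketch of the safety check (N threads and M resource types are assumed constants; need[i][j] = max[i][j] - alloc[i][j]):

#define N 4   /* threads (assumed) */
#define M 3   /* resource types (assumed) */

int is_safe(int avail[M], int alloc[N][M], int need[N][M]) {
    int work[M], done[N] = {0}, progress = 1;
    for (int j = 0; j < M; j++) work[j] = avail[j];
    while (progress) {
        progress = 0;
        for (int i = 0; i < N; i++) {
            if (done[i]) continue;
            int ok = 1;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { ok = 0; break; }
            if (ok) {                      /* thread i can finish and release */
                for (int j = 0; j < M; j++) work[j] += alloc[i][j];
                done[i] = progress = 1;
            }
        }
    }
    for (int i = 0; i < N; i++)
        if (!done[i]) return 0;            /* some thread can never finish: unsafe */
    return 1;
}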
Deadlock detection
Trigger:
- On every allocation request
- Fixed period
- System utilization drops below a threshold
Using a resource allocation graph:
- Nodes are processes and resources
- Arcs from resource to process represent allocations
- Arcs from process to resource represent ungranted requests
Note: with multiple instances of the same resource, a cycle is necessary but not sufficient for deadlock.
Deadlock recovery
- Kill all deadlocked processes
- Back up and restart
- Selectively kill process
- Selectively preempt
Ostrich Algorithm
Ignore the problem and hope that it doesn’t happen often.
- Modern O/S virtualize physical resources.
- Deadlock only occurs on logical resources.
Communication deadlocks
Example: packet loss
Solution: timeout, resend, protocol
Transaction and Atomicity
Transaction
A collection of operations that performs a single logical function.
Either commit (successful, cannot be undone) or abort (rolled back).
Logs are time-consuming and large. Use checkpoints instead.
Concurrent transactions
- All transactions are protected with a mutex.
- Allow overlap operations, as long as they don’t conflict.
Two operations conflict if they access the same data and at least one is a write. The order of non-conflicting operations does not matter.
Two-phase locking
Each datum has its own lock.
Each transaction has growing (acquire only) and shrinking (release only) phase.
Use abort and retry if any lock is unavailable.
Timestamp
Each transaction gets a unique timestamp, and each datum has two timestamps:
- Largest timestamp of transactions that read it.
- Largest timestamp of transactions that wrote it.
Read: if tranaction’s timestamp is earlier than last write, then needs to restart.
Write: if transaction’s timestamp is earlier than last access, then needs to restart.