LearnedOS-OSR19.pdf
Opportunities for ML in OS
- ML can be used to dynamically set configurations
- timing-related: frequency of interrupts on a CPU core
- more frequent interrupts → better throughput, but context-switching overhead grows
- size-related: cache size / disk prefetching amount / swap prefetching
- large buffer cache → better performance for storage systems but reduces available memory for user applications
- A model can also dynamically generate new configurations to adapt to workload and environmental changes → reinforcement learning (see the sketch below)
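Not from the paper, just a minimal sketch of the RL idea: an epsilon-greedy bandit that keeps re-picking an interrupt-frequency configuration as throughput rewards arrive. `CANDIDATE_HZ` and the simulated `measure_throughput` are made-up stand-ins (a real agent would read hardware counters instead).

```python
import random

# Illustrative candidate interrupt frequencies (Hz); not from the paper.
CANDIDATE_HZ = [100, 250, 500, 1000]

def measure_throughput(hz):
    # Stand-in for a real measurement; simulates the trade-off in the notes:
    # more interrupts help up to a point, then switching overhead dominates.
    return hz / (1.0 + (hz / 400.0) ** 2) + random.gauss(0, 5)

def tune(rounds=2000, epsilon=0.1):
    """Epsilon-greedy bandit: a minimal form of reinforcement learning."""
    totals = {hz: 0.0 for hz in CANDIDATE_HZ}
    counts = {hz: 0 for hz in CANDIDATE_HZ}
    for _ in range(rounds):
        if random.random() < epsilon or 0 in counts.values():
            hz = random.choice(CANDIDATE_HZ)                        # explore
        else:
            hz = max(CANDIDATE_HZ, key=lambda h: totals[h] / counts[h])  # exploit
        totals[hz] += measure_throughput(hz)    # observed throughput = reward
        counts[hz] += 1
    return max(CANDIDATE_HZ, key=lambda h: totals[h] / counts[h])

print("chosen interrupt frequency:", tune(), "Hz")
```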
- ML can be used to generate policies based on application and hardware properties
- space allocation policy → decide which space to free when an application requests memory
- train on historical traces of how much space applications requested, what the OS allocated, and how efficiently the space was utilized
- scheduling: ML could potentially save metadata memory space
- e.g. the Linux CFS scheduler uses a red-black tree that requires O(log N) time per scheduling decision → could be faster with ML
- cache management (cache eviction strategies)
- ML can learn from past memory access patterns to predict eviction candidates and appropriate cache sizes (see the sketch below)
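A minimal sketch of a learned eviction policy, assuming an exponentially weighted average of per-page reuse gaps as the "learned" signal (a stand-in for a real model); it evicts the page predicted to be reused furthest in the future:

```python
class LearnedEvictor:
    """Toy online learner: predict each page's next use as
    last_access + (exponentially weighted average inter-access gap),
    then evict the page predicted to be needed furthest in the future."""

    def __init__(self, capacity, alpha=0.3):
        self.capacity = capacity
        self.alpha = alpha          # learning rate for the gap estimate
        self.last_seen = {}         # page -> time of last access
        self.avg_gap = {}           # page -> learned average reuse gap
        self.clock = 0
        self.cache = set()
        self.hits = self.misses = 0

    def access(self, page):
        self.clock += 1
        if page in self.last_seen:  # online update of the reuse-gap estimate
            gap = self.clock - self.last_seen[page]
            self.avg_gap[page] = ((1 - self.alpha) * self.avg_gap.get(page, gap)
                                  + self.alpha * gap)
        self.last_seen[page] = self.clock
        if page in self.cache:
            self.hits += 1
            return
        self.misses += 1
        if len(self.cache) >= self.capacity:
            # Pages seen only once predict "never reused" (inf) and go first.
            victim = max(self.cache,
                         key=lambda p: self.last_seen[p]
                                       + self.avg_gap.get(p, float("inf")))
            self.cache.remove(victim)
        self.cache.add(page)

ev = LearnedEvictor(capacity=3)
for p in [0, 1, 2, 0, 1, 3, 0, 1, 2, 4] * 50:   # synthetic access trace
    ev.access(p)
print("miss rate:", ev.misses / (ev.hits + ev.misses))
```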
- ML can be used to build certain OS mechanisms
- the above (configurations, policies) do not require precise results → mechanisms do!
- mapping from virtual address to physical memory
- can be done with ML rather than page table
- mapping from file name and offset to disk logical block address
- can be done with ML rather than multi-level index structure
- an ML-based mapping can run inference for any size and offset of memory space (instead of using fixed-size memory pages)
- an ML model can potentially be smaller and run faster than a multi-level index structure (see the sketch below)
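A toy learned-index-style sketch of the idea above: fit one line from file offset to position in a sorted mapping table, then do a bounded local search within the model's maximum error instead of walking a multi-level index. The `offsets`/`blocks` arrays are synthetic stand-ins for a real extent map.

```python
# Synthetic sorted mapping table: file offset -> disk logical block.
offsets = list(range(0, 4096 * 1000, 4096))
blocks = [o // 4096 + 100 for o in offsets]

# Fit one least-squares line: offset -> position in the table.
n = len(offsets)
mx = sum(offsets) / n
my = (n - 1) / 2
slope = (sum((x - mx) * (i - my) for i, x in enumerate(offsets))
         / sum((x - mx) ** 2 for x in offsets))
intercept = my - slope * mx

# The model's max error bounds the local search window (learned-index style).
err = max(abs(i - (slope * x + intercept)) for i, x in enumerate(offsets))

def lookup(offset):
    """Predict the table position, then search a small bounded window,
    instead of traversing a multi-level index structure."""
    guess = int(round(slope * offset + intercept))
    lo = max(0, guess - int(err) - 1)
    hi = min(n - 1, guess + int(err) + 1)
    for i in range(lo, hi + 1):
        if offsets[i] == offset:
            return blocks[i]
    return None   # offset not mapped

print(lookup(4096 * 42))   # -> 142
```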
Challenges and Solutions
- Model Building: need better evaluation criteria
- e.g. use buffer cache miss rate instead of end-to-end application performance (see the miss-rate sketch after this list)
- fine-grained model vs coarse-grained model
- fine-grained models can achieve more accurate predictions through customization and specialization, but require more resources
- learning multiple correlated tasks (buffer cache, flushing frequency, eviction policy → all related)
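A sketch of the proxy-metric idea from Model Building: score candidate cache configurations (or models) by replaying a recorded trace and measuring miss rate, with no application re-run. LRU and the random trace here are illustrative stand-ins.

```python
import random
from collections import OrderedDict

def lru_miss_rate(trace, capacity):
    """Replay a recorded block trace through a simulated LRU buffer cache
    and report the miss rate, a cheap proxy for application performance."""
    cache, misses = OrderedDict(), 0
    for blk in trace:
        if blk in cache:
            cache.move_to_end(blk)              # refresh recency
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)       # evict least-recently used
            cache[blk] = True
    return misses / len(trace)

# Candidate configurations can be ranked by this metric alone,
# without re-running (or even involving) the foreground application.
trace = [random.randrange(50) for _ in range(10_000)]
for size in (8, 16, 32):
    print(f"cache={size:2d}  miss rate={lru_miss_rate(trace, size):.3f}")
```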
- Training:
- collecting training data must incur minimal cost to foreground applications
- e.g. not feasible to trace every memory access to build models for predicting eviction candidates
- offline: models can be trained on fine-grained data offline
- online: collect only coarse-grained data so foreground application performance is not disturbed
- validation sets
- no theoretical best answer to compare against (the optimum requires knowledge of the future)
- CPU scheduling → the fastest turnaround time is only known in hindsight
- page replacement → the best victim is the page referenced furthest in the future
- solution: use reinforcement learning (a reward function) instead of a validation set (see the sketch below)
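A sketch of such a reward function for page replacement, assuming a made-up `REUSE_WINDOW` threshold: the agent is scored by what actually happened to each evicted page, so no oracle labels are needed.

```python
REUSE_WINDOW = 100   # assumption: re-reference within 100 accesses = bad eviction

def eviction_reward(evicted_at, next_ref_at):
    """Reward signal that replaces a labeled validation set: it needs only
    the observed future of the evicted page, not an unknowable optimum."""
    if next_ref_at is None or next_ref_at - evicted_at > REUSE_WINDOW:
        return 1.0    # victim stayed cold: good choice
    return -1.0       # victim was needed again soon: penalize

def total_reward(trace, evictions):
    """Score a policy's (time, page) eviction decisions against a trace;
    an RL agent would instead consume these rewards as they arrive."""
    score = 0.0
    for t, page in evictions:
        nxt = next((i for i, p in enumerate(trace[t + 1:], t + 1) if p == page),
                   None)
        score += eviction_reward(t, nxt)
    return score

trace = [0, 1, 2, 3] * 10
print(total_reward(trace, [(5, 0), (6, 3)]))   # both victims return soon: -2.0
```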
- Inference:
- two categories:
- learning/inference only needs to run once (not on the critical path)
- foreground application operations do not wait for any of the inference results
- e.g. configuration and policy fall into this category
- OS decisions that must be made quickly
- e.g. thread scheduling
- Memory overhead for storing ML models
- could use model memory-reuse techniques in RNNs (see the sketch below)
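One possible reading of the memory-reuse point (an assumption on my part, not the paper's claim): an RNN shares one set of weights across all timesteps, so consuming a long history costs no extra model memory, unlike a model with separate weights per history step. The dimensions below are arbitrary.

```python
def rnn_param_bytes(input_dim, hidden_dim, dtype_bytes=4):
    """Vanilla RNN: W_xh, W_hh, and a bias are the whole model, and the
    SAME weights are reused at every timestep, so the footprint is
    independent of how much history the model consumes."""
    return (input_dim * hidden_dim
            + hidden_dim * hidden_dim
            + hidden_dim) * dtype_bytes

def per_step_mlp_param_bytes(input_dim, hidden_dim, timesteps, dtype_bytes=4):
    """A model with separate weights per history step grows with the window."""
    return timesteps * (input_dim * hidden_dim + hidden_dim) * dtype_bytes

print(rnn_param_bytes(16, 32))               # 6272 bytes, any history length
print(per_step_mlp_param_bytes(16, 32, 64))  # 139264 bytes for a 64-step window
```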
- Security: an attacker could train the ML model to always evict other applications' memory and launch a DoS attack …