LearnedOS-OSR19.pdf

Opportunities for ML in OS

  1. ML can be used to dynamically set configurations
    1. timing-related: frequency of interrupts in CPU core
      1. more frequent interrupts → better throughput/responsiveness, but higher context-switching overhead
    2. size-related: cache-size / disk prefetching amount / swap prefetching
      1. large buffer cache → better storage-system performance, but less memory available for user applications
    3. Model can also be used to dynamically generate new configurations to adapt to workload and environmental changes → reinforcement learning
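The configuration-tuning idea above can be sketched as a tiny reinforcement-learning loop. This is an illustrative toy, not from the paper: the candidate cache sizes, the epsilon-greedy strategy, and the simulated `measure_reward` function are all assumptions standing in for a real measured metric.

```python
import random

# Hypothetical reward signal: in a real OS this would be a measured metric
# (throughput, hit rate) observed after running with the chosen setting.
def measure_reward(cache_mb):
    # Toy simulation: performance peaks at 256 MB, with measurement noise.
    return -abs(cache_mb - 256) / 256 + random.gauss(0, 0.05)

def tune_cache_size(candidates, rounds=500, epsilon=0.1):
    """Epsilon-greedy bandit over candidate buffer-cache sizes."""
    totals = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}
    for _ in range(rounds):
        if random.random() < epsilon:
            choice = random.choice(candidates)  # explore a random setting
        else:                                   # exploit best average so far
            choice = max(candidates,
                         key=lambda c: totals[c] / counts[c]
                         if counts[c] else float("inf"))
        r = measure_reward(choice)
        totals[choice] += r
        counts[choice] += 1
    return max(candidates, key=lambda c: totals[c] / counts[c])

random.seed(0)
best = tune_cache_size([64, 128, 256, 512, 1024])
print(best)  # converges to 256, the simulated optimum
```

A full RL formulation would also carry workload state (as the notes suggest, to adapt to workload and environmental changes); a stateless bandit is the simplest member of that family.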
  2. ML can be used to generate policies based on application and hardware properties
    1. space allocation policy → decide which space to free when an application requests memory
      1. collect historical traces of how much space applications requested, what the OS allocated, and how efficiently the space was utilized, as ML training data
    2. scheduling: ML could potentially save metadata memory space
      1. e.g. the Linux CFS scheduler uses a red-black tree that takes O(log N) time per scheduling decision → could be faster with ML
    3. cache management (cache eviction strategies)
      1. ML can learn from past memory-access patterns and learn the appropriate cache size
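A minimal sketch of a learned eviction policy along these lines, assuming access frequency over a historical trace as the only learned feature (a real learned policy would use richer features such as recency, strides, or the requesting code path):

```python
from collections import defaultdict

def train_reuse_model(trace):
    """Estimate from a historical access trace how likely each page is to be
    re-referenced: here simply its empirical access frequency."""
    counts = defaultdict(int)
    for page in trace:
        counts[page] += 1
    total = len(trace)
    return {p: c / total for p, c in counts.items()}

def evict_candidate(resident_pages, model):
    # Evict the resident page the model scores least likely to be reused.
    return min(resident_pages, key=lambda p: model.get(p, 0.0))

history = [1, 2, 1, 3, 1, 2, 4, 1, 2, 3]  # made-up page-access trace
model = train_reuse_model(history)
print(evict_candidate({1, 2, 3, 4}, model))  # -> 4, the least-accessed page
```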
  3. ML can be used to build certain OS mechanisms
    1. the configurations and policies above tolerate imprecision → mechanisms require exact results!
    2. mapping from virtual address to physical memory address
      1. can be done with ML rather than a page table
    3. mapping from file name and offset to disk logical block address
      1. can be done with ML rather than a multi-level index structure
    4. an ML-based mapping can infer mappings for memory regions of any size and offset (instead of using fixed-size memory pages)
      1. the ML model can potentially be smaller and run faster than a multi-level index structure
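The learned-mapping idea can be illustrated with a learned-index-style sketch: fit a simple model that predicts where a key lives in a sorted mapping table, then correct the guess within a small error window. The extent table, the linear model, and the error bound below are all illustrative assumptions, not the paper's design.

```python
def fit_linear(keys, idxs):
    """Least-squares line through (key, array index) pairs: the simplest
    stand-in for a learned replacement of a multi-level index."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_i = sum(idxs) / n
    cov = sum((k - mean_k) * (i - mean_i) for k, i in zip(keys, idxs))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_i - slope * mean_k

def lookup(key, keys, values, model, err=2):
    """Predict an array position, then scan a bounded error window."""
    slope, intercept = model
    guess = round(slope * key + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys) - 1, guess + err)
    for i in range(lo, hi + 1):
        if keys[i] == key:
            return values[i]
    return None  # a real system would fall back to an exact structure

# Hypothetical file-extent table: sorted offsets -> logical block addresses
offsets = [0, 4096, 8192, 12288, 16384]
blocks = [70, 71, 72, 90, 91]
model = fit_linear(offsets, list(range(len(offsets))))
print(lookup(8192, offsets, blocks, model))  # -> 72
```

Note the precision requirement from the notes above is met not by the model alone but by the exact search inside the error window; the model only has to be approximately right.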

Challenges and Solutions

  1. Model Building: need better evaluation criteria
    1. e.g. use buffer cache miss rate instead of application performance
    2. fine-grained model vs coarse-grained model
      1. fine-grained models can achieve more accurate predictions through customization and specialization, but require more resources
    3. learning multiple correlated tasks (buffer cache, flushing frequency, eviction policy → all related)
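As a concrete stand-in for such an evaluation criterion, a buffer-cache miss rate can be computed cheaply and repeatably from a block-access trace, without running the full application. This toy harness simulates LRU; the trace and capacity are arbitrary examples.

```python
from collections import OrderedDict

def miss_rate(trace, capacity):
    """Proxy evaluation metric: buffer-cache miss rate under LRU, cheaper
    and more repeatable than an end-to-end application benchmark."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)      # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
        cache[block] = True
    return misses / len(trace)

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(miss_rate(trace, 3))  # -> 0.6
```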
  2. Training:
    1. training data set must incur minimal cost to foreground applications
      1. not feasible to trace every memory access to build models for predicting eviction candidates
        1. offline: fine-grained data can be trained offline
        2. online: collect coarse-grained data to not disturb foreground application performance
    2. validation sets
      1. no practical "theoretical best" to compare against, since the optimum requires knowing the future
        1. CPU scheduling → fastest turnaround time (needs job lengths in advance)
        2. page replacement → evict the page referenced furthest in the future (Belady's optimal)
      2. solution: use reinforcement learning (reward function) instead of validation set
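A reward function for page replacement could look like the following sketch: instead of comparing against the unknowable offline optimum, an eviction is scored in hindsight, penalized if the evicted page is re-referenced soon afterwards. The `window` size and penalty shape are assumptions for illustration, not from the paper.

```python
def eviction_reward(evicted_page, subsequent_trace, window=5):
    """Hindsight reward replacing a labeled validation set: a bad eviction
    is one whose page comes back quickly."""
    for i, page in enumerate(subsequent_trace[:window]):
        if page == evicted_page:
            return -(window - i)  # sooner reuse -> larger penalty
    return 1.0                    # page stayed cold: good eviction

print(eviction_reward(7, [3, 9, 7, 2, 1]))  # 7 reused at step 2 -> -3
print(eviction_reward(5, [3, 9, 7, 2, 1]))  # never reused soon -> 1.0
```

The key property is that this signal is observable online, a short delay after each decision, which is exactly what a reinforcement learner needs in place of a validation set.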
  3. Inference:
    1. Two categories:
      1. decisions that can tolerate slow, asynchronous inference
        1. foreground application operations do not wait for any of the inference results
        2. e.g. configuration and policy generation fall into this category
      2. OS decisions that must be made quickly
        1. e.g. thread scheduling
    2. Memory overhead for storing ML models
      1. could use model memory-reuse techniques as in RNNs
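The memory-reuse point can be made concrete by comparing parameter counts: an RNN reuses one set of weight matrices across all timesteps, while an unrolled per-step model grows linearly with the window it looks at. The dimensions below are arbitrary illustrative choices.

```python
def mlp_params(input_dim, hidden, steps):
    """Feed-forward model over a length-`steps` window: one weight matrix
    (plus biases) per step, so memory grows with the window length."""
    return steps * (input_dim * hidden + hidden)

def rnn_params(input_dim, hidden):
    """Recurrent model: the same input and recurrent matrices are reused
    at every timestep, so memory is independent of sequence length."""
    return input_dim * hidden + hidden * hidden + hidden

print(mlp_params(32, 64, steps=100))  # prints 211200
print(rnn_params(32, 64))             # prints 6208
```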
  4. Security: an attacker could train the ML model to always evict other applications’ memory and launch a DoS attack …