Mar 5, 2024: FarfetchFusion
Challenges:
Proposed Design: Disentangled Fusion
Separate static (ears, nose, forehead) and dynamic (mouth, eyes) facial information
Combine static info accumulated over many frames, while fusing only the dynamic parts from recent frames
⇒ leverage spatio-temporal redundancy in multi-view video streams
⇒ reduce processing time of static information
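The static/dynamic split above can be sketched with a simple masking scheme. This is an illustrative toy, not the paper's actual method: `dynamic_mask` and the plain averaging are assumptions standing in for whatever region segmentation and fusion FarfetchFusion really uses.

```python
import numpy as np

def disentangled_fuse(frames, dynamic_mask, recent=2):
    """Fuse static regions over all frames, dynamic regions over recent ones.

    frames: sequence of (H, W) depth/feature maps; dynamic_mask: (H, W) bool
    array marking mouth/eye regions (both are illustrative assumptions).
    """
    frames = np.asarray(frames, dtype=float)     # (T, H, W)
    static_est = frames.mean(axis=0)             # long-term average (static parts)
    dynamic_est = frames[-recent:].mean(axis=0)  # short-term average (dynamic parts)
    return np.where(dynamic_mask, dynamic_est, static_est)

# 5 toy frames whose values equal their timestamp 0..4
frames = [np.full((4, 4), t, dtype=float) for t in range(5)]
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                            # pretend "mouth/eye" region
fused = disentangled_fuse(frames, mask, recent=2)
print(fused[0, 0])   # static region: mean of 0..4 = 2.0
print(fused[1, 1])   # dynamic region: mean of 3, 4 = 3.5
```

The point of the split: the static estimate can be computed once (or rarely) and reused, so per-frame work shrinks to the dynamic region only.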
TSDF (Truncated Signed Distance Function: the signed distance from each voxel to the nearest surface, truncated to a narrow band) - stored in a hash map
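A hashed TSDF can be sketched as a dict keyed by integer voxel coordinates, storing a running (tsdf, weight) average and skipping voxels outside the truncation band. Class and method names are illustrative, not from the paper.

```python
import numpy as np

class HashedTSDF:
    """Minimal sketch: only voxels near the surface are stored in the hash map."""

    def __init__(self, voxel_size=0.01, trunc=0.03):
        self.voxel_size = voxel_size
        self.trunc = trunc              # truncation distance (meters)
        self.voxels = {}                # (i, j, k) -> (tsdf, weight)

    def integrate(self, point, signed_dist):
        """Fold one signed-distance observation into the voxel containing `point`."""
        if abs(signed_dist) > self.trunc:
            return                      # outside the truncation band: never stored
        key = tuple(np.round(np.asarray(point) / self.voxel_size).astype(int))
        tsdf = signed_dist / self.trunc                     # normalize to [-1, 1]
        old_t, old_w = self.voxels.get(key, (0.0, 0.0))
        w = old_w + 1.0
        self.voxels[key] = ((old_t * old_w + tsdf) / w, w)  # weighted running average

vol = HashedTSDF()
vol.integrate((0.05, 0.0, 0.0), 0.015)   # 1.5 cm in front of the surface
vol.integrate((0.05, 0.0, 0.0), 0.009)   # second observation, same voxel
print(len(vol.voxels))                   # only 1 voxel allocated
```

Hashing keeps memory proportional to the surface area rather than the full volume, which is why sparse TSDF systems use it.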
Volumetric Fusion
Alignment (feature extraction + registration)
Fusion
Rasterization
→ highly optimized, but requires converting TSDF voxels to a point cloud first
→ Marching Cubes algorithm (extracts the surface from TSDF voxels as a mesh/point cloud)
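The core idea behind this conversion can be shown with a stripped-down zero-crossing search: wherever the TSDF changes sign between neighboring voxels, the surface lies between them. This is a simplified sketch of the principle, not full Marching Cubes (which checks all 12 cube edges and emits triangles); the function name and sphere test scene are my own.

```python
import numpy as np

def zero_crossings_x(tsdf, voxel_size=1.0):
    """Return interpolated surface points where the TSDF changes sign along x."""
    pts = []
    a, b = tsdf[:-1, :, :], tsdf[1:, :, :]
    for i, j, k in zip(*np.where(a * b < 0)):       # sign change between i and i+1
        t = a[i, j, k] / (a[i, j, k] - b[i, j, k])  # linear interpolation weight
        pts.append(((i + t) * voxel_size, j * voxel_size, k * voxel_size))
    return pts

# Test scene: sphere of radius 3 centered in an 8^3 grid (tsdf < 0 inside).
idx = np.indices((8, 8, 8)).astype(float)
tsdf = np.sqrt(((idx - 3.5) ** 2).sum(axis=0)) - 3.0
pts = zero_crossings_x(tsdf)
print(len(pts) > 0)   # surface points were recovered
```

The recovered points lie close to the radius-3 sphere, confirming the zero-level set of the TSDF is the surface.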
Raycasting: generate rendered images directly, without converting TSDF voxels to explicit 3D data (point cloud)
→ find surface voxels (TSDF values closest to zero) by casting rays from the viewer's POV
→ the color stored at each surface voxel forms the corresponding 2D pixel
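The ray-marching step can be sketched as follows. Assumptions: fixed-step marching with linear interpolation at the sign change, and an analytic `sample_tsdf` (a sphere) standing in for a trilinear lookup into the voxel grid; real renderers also use the TSDF value itself to take adaptive steps.

```python
import numpy as np

def raycast(origin, direction, sample_tsdf, step=0.01, max_depth=5.0):
    """March along a ray; return the depth where the TSDF first crosses zero."""
    direction = np.asarray(direction) / np.linalg.norm(direction)
    prev_d, prev_t = None, 0.0
    t = 0.0
    while t < max_depth:
        d = sample_tsdf(np.asarray(origin) + t * direction)
        if prev_d is not None and prev_d > 0 >= d:       # crossed the surface
            # interpolate between the last two samples for sub-step accuracy
            return prev_t + step * prev_d / (prev_d - d)
        prev_d, prev_t = d, t
        t += step
    return None                                          # ray missed the surface

# "Scene": unit sphere at the origin, TSDF = distance to its surface.
sphere = lambda p: np.linalg.norm(p) - 1.0
depth = raycast(origin=(0.0, 0.0, -3.0), direction=(0.0, 0.0, 1.0),
                sample_tsdf=sphere)
print(round(depth, 3))   # ≈ 2.0: camera sits 3 units away, sphere radius is 1
```

Doing this once per pixel yields a depth (and, with stored colors, an RGB) image straight from the TSDF, which is exactly why raycasting avoids the Marching Cubes conversion.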
System Design: