A pig walking through a sorting hallway with a single RGB-D camera mounted on the ceiling overhead, looking straight down.
Capture setup. A single RGB-D camera looks straight down over a sorting hallway the pigs already walk through — no restraint, no extra handling.
Top-down view of a pig walking under the camera: synchronized color frame on the left and colorized depth frame on the right.
Model input. As the animal passes underneath, the camera records synchronized color and depth. Only the depth stream is used at inference.
B-mode ultrasound image at the last rib with the skin-fat interface, fat-loin interface, and last rib annotated.
Ground truth. Slaughter-lab ultrasound at the 12th rib gives the backfat and loin-depth labels the model is trained to predict.

Abstract

Sow body condition matters to growers because it strongly affects lactation performance and piglet survival. However, the body condition measures used in production, such as visual scoring and calipers, correlate poorly with underlying tissue composition. Ultrasound scans measure subcutaneous backfat thickness and loin muscle depth directly, but they are labor-intensive and do not scale to production use. We present PigFormer, an end-to-end two-stage system that takes raw depth frames from a ceiling-mounted RGB-D camera and predicts subcutaneous backfat thickness, loin muscle depth, and total tissue thickness at the last rib. Stage 1 is a geometric front-end that converts raw depth into a standardized height map via SAM3-to-MaskDINO segmentation distillation, ground-plane removal, and orientation normalization. Stage 2 is a Slice Attention Encoder that treats each height map as a sequence of cross-sectional slices and captures spatial relationships along the full dorsal surface. On a multi-site dataset of 319 sow and gilt instances (6,705 frames) from two facilities, PigFormer achieves 2.43 mm backfat MAE and 3.87 mm overall MAE. It outperforms strong single-stage ResNet-18 and ViT-small baselines that feed raw depth directly to a pretrained backbone, isolating the contribution of Stage 1.

Method

PigFormer two-stage pipeline: ceiling-mounted depth camera → Stage 1 geometric front-end (SAM3-to-MaskDINO segmentation + ground-plane removal + orientation normalization) → 96×224 height map → Stage 2 Slice Attention Encoder regressing fat, loin, and total tissue depth at the last rib.
The two-stage pipeline. A ceiling-mounted depth camera captures raw frames. Stage 1 (geometric front-end) segments the pig, removes the ground plane, and rotates the height map so each pig faces right. Stage 2 (Slice Attention Encoder) tokenizes the 224 cross-sectional slices of the height map and regresses backfat, loin muscle, and total tissue depth at the last rib.
Stage 1

Geometric front-end

Raw depth frames are turned into a canonical 96 × 224 height map. A depth-only MaskDINO segmenter — distilled from SAM3-on-RGB pseudo-labels so it needs no color or text prompt at deployment — isolates the pig. RANSAC removes the ground plane, and a minimum-area-rectangle long axis rotates every animal to face right. The result is a metric, top-down height image that is invariant to where the pig stood under the camera.
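The ground-plane step can be illustrated with a minimal NumPy sketch. It assumes the depth frame has already been back-projected to an (N, 3) point cloud in camera coordinates with the camera at the origin; the function names, iteration count, and inlier tolerance are hypothetical and not taken from the PigFormer code. Segmentation and the minimum-area-rectangle rotation are not shown.

```python
import numpy as np

def fit_ground_plane_ransac(points, n_iters=200, inlier_tol=0.01, seed=0):
    """Fit a plane n.p + d = 0 to an (N, 3) point cloud with a basic RANSAC loop.

    Returns (normal, d) with the unit normal oriented toward the camera
    (assumed to sit at the origin), so heights above the floor are positive.
    """
    rng = np.random.default_rng(seed)
    best_count, best_plane = -1, None
    for _ in range(n_iters):
        # Sample three distinct points and build the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # collinear sample, no unique plane
            continue
        normal = normal / norm
        d = -normal @ p0
        # Count points within inlier_tol (metres, assumed) of the candidate plane.
        count = int((np.abs(points @ normal + d) < inlier_tol).sum())
        if count > best_count:
            best_count, best_plane = count, (normal, d)
    normal, d = best_plane
    if d < 0:  # flip so the camera (origin) lies on the positive side
        normal, d = -normal, -d
    return normal, d

def height_above_plane(points, plane):
    """Signed distance of each point from the plane, positive toward the camera."""
    normal, d = plane
    return points @ normal + d
```

Under these assumptions, the floor (usually the largest planar region in a top-down view) wins the inlier vote, and the signed distances of the pig's points become the metric heights that fill the height map.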

Stage 2

Slice Attention Encoder

The height map is read as a sequence of 224 cross-sectional slices along the spine, each a 96-d token. A single RoPE transformer encoder layer (8 heads) relates slices along the full dorsal surface; a concatenated mean + max pool feeds a small MLP head that regresses backfat, loin, and total tissue depth at the last rib. One layer is optimal at this data scale — deeper encoders overfit.
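The data flow described above can be sketched as a simplified NumPy forward pass with random weights. This is an illustration of the shapes involved, not the authors' implementation: layer normalization and the feed-forward sublayer of a full transformer layer are omitted, and the hidden width, RoPE base, and all function and parameter names are assumptions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding for (seq, dim) features; dim must be even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq)[:, None] * freqs[None, :]       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by a position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(rng, dim=96, hidden=64):
    """Random weights for the sketch; hidden=64 is an arbitrary choice."""
    s = 1.0 / np.sqrt(dim)
    return {
        "wq": rng.normal(0, s, (dim, dim)), "wk": rng.normal(0, s, (dim, dim)),
        "wv": rng.normal(0, s, (dim, dim)), "wo": rng.normal(0, s, (dim, dim)),
        "w1": rng.normal(0, s, (2 * dim, hidden)),
        "w2": rng.normal(0, s, (hidden, 3)),
    }

def slice_attention_forward(height_map, params, n_heads=8):
    """Map a (96, 224) height map to 3 outputs: backfat, loin, total."""
    tokens = height_map.T                       # (224, 96): one token per slice
    seq, dim = tokens.shape
    head_dim = dim // n_heads                   # 96 / 8 = 12
    q, k, v = (tokens @ params[n] for n in ("wq", "wk", "wv"))
    heads = []
    for h in range(n_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        qh, kh = rope(q[:, s]), rope(k[:, s])   # positions enter via rotation
        attn = softmax(qh @ kh.T / np.sqrt(head_dim))       # (224, 224)
        heads.append(attn @ v[:, s])
    x = tokens + np.concatenate(heads, axis=1) @ params["wo"]  # residual
    pooled = np.concatenate([x.mean(axis=0), x.max(axis=0)])   # (192,)
    hidden = np.maximum(pooled @ params["w1"], 0.0)            # ReLU MLP
    return hidden @ params["w2"]
```

Because every slice attends to every other slice, a prediction at the last rib can draw on surface shape anywhere along the back, which is the property the encoder is built around.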

Dataset

PigFormer is trained and evaluated on a multi-site collection of 319 sow and gilt instances (6,705 depth frames) recorded with ceiling-mounted Azure Kinect / Orbbec cameras at two independent facilities — Michigan State University (116 instances) and the University of Nebraska–Lincoln (203 instances). Ground-truth backfat and loin depths come from slaughter-lab ultrasound at the 12th rib. Splits are made at the animal level: roughly 20% of unique IDs are held out as a fixed test set and the remainder is divided into four cross-validation folds, so no animal ever appears in more than one split.
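An animal-level split of this kind can be sketched with the Python standard library. The function name, the seed, and the round-robin fold assignment are illustrative assumptions; the source only states that about 20% of unique IDs are held out and the rest form four folds.

```python
import random

def animal_level_splits(animal_ids, test_frac=0.2, n_folds=4, seed=0):
    """Split by unique animal ID so all frames of an animal stay together.

    animal_ids: one ID per frame (repeats expected).
    Returns (test_ids, folds), where folds is a list of disjoint ID sets.
    """
    unique_ids = sorted(set(animal_ids))   # sort first for reproducibility
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(test_frac * len(unique_ids))
    test_ids = set(unique_ids[:n_test])
    remaining = unique_ids[n_test:]
    # Deal the remaining IDs round-robin into n_folds disjoint folds.
    folds = [set(remaining[i::n_folds]) for i in range(n_folds)]
    return test_ids, folds

def frames_in(ids, frame_to_animal):
    """Select the frame keys whose animal belongs to the given ID set."""
    return [frame for frame, animal in frame_to_animal.items() if animal in ids]
```

Splitting on IDs rather than frames matters here because each animal contributes many near-identical frames; a frame-level split would leak test animals into training.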

Results

Held-out test results on 79 sow / gilt instances. MAE in mm. Per-frame inference measured on an A100 with batch = 1 (MaskDINO Stage 1 in fp16; UNet Stage 1, single-stage backbones, and PigFormer Stage 2 in fp32). Single-stage baselines feed raw depth directly to an ImageNet-pretrained backbone and predict fat and loin only (total is computed as ŷ_f + ŷ_l at evaluation). PigFormer numbers are 4-fold cross-validation ensembles with output aggregation. Best MAE in bold.

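| Method | Backbone | Inference (ms / frame): Stage 1 + Stage 2 | Fat MAE (mm) ↓ | Loin MAE (mm) ↓ | Total MAE (mm) ↓ | Overall MAE (mm) ↓ |
|---|---|---|---|---|---|---|
| ViT-small (single-stage) | ViT-S/16 | 4.98 (single stage) | 3.57 | 7.29 | 8.16 | 6.34 |
| ResNet-18 (single-stage) | ResNet-18 | 2.88 (single stage) | 2.88 | 6.10 | 5.81 | 4.93 |
| PigFormer | MaskDINO (R50-300q-9L) | 106.92 + 0.50 | 2.43 | **5.01** | **4.19** | **3.87** |
| PigFormer | Pruned MaskDINO (R18-50q-5L) | 52.73 + 0.50 | **2.34** | 5.27 | 4.20 | 3.94 |
| PigFormer | UNet (MobileNetV3-Small) | 6.58 + 0.50 | 2.40 | 5.20 | 4.26 | 3.95 |
| Human Ultrasound Std | n/a | n/a | 1.30 | 2.02 | 2.29 | 1.87 |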
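The evaluation arithmetic behind the table can be sketched as follows. Two points are assumptions rather than statements from the source: the fold ensemble is taken to be a plain mean of per-fold outputs, and the Overall column is taken to be the mean of the fat, loin, and total MAEs, which matches the reported rows (for example, (3.57 + 7.29 + 8.16) / 3 ≈ 6.34). Function names are hypothetical.

```python
from statistics import mean

def mae(pred, target):
    """Mean absolute error between two equal-length sequences (mm)."""
    return mean(abs(p - t) for p, t in zip(pred, target))

def ensemble_mean(fold_predictions):
    """Average per-sample predictions across folds (assumed aggregation rule)."""
    return [mean(values) for values in zip(*fold_predictions)]

def evaluate(fat_pred, loin_pred, fat_true, loin_true, total_true):
    """Score a model that predicts fat and loin only.

    Total is derived as fat + loin, as done for the single-stage baselines.
    """
    total_pred = [f + l for f, l in zip(fat_pred, loin_pred)]
    scores = {
        "fat": mae(fat_pred, fat_true),
        "loin": mae(loin_pred, loin_true),
        "total": mae(total_pred, total_true),
    }
    # Assumption: overall is the unweighted mean of the three MAEs.
    scores["overall"] = mean(scores.values())
    return scores
```

Deriving total from the two predicted depths means errors in fat and loin can either cancel or compound, which is one reason the total MAE of a fat-plus-loin model need not sit between its fat and loin MAEs.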

End-to-end PigFormer with the UNet Stage 1 runs in ≈ 7 ms / frame on a single A100, fast enough for real-time monitoring on an installed-camera stream. The pruned MaskDINO retains the detection-style inductive bias for out-of-distribution content (handlers, empty pens) at half the latency of the original.

What does the model look at?

A natural worry is that PigFormer simply regresses global statistics — body volume or mean height — rather than reading local anatomy. We test this with a SmoothGrad × Input attribution summed per spine column, normalized, and averaged over all 79 test animals in shared body coordinates.
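A minimal NumPy sketch of this attribution procedure follows. It assumes a caller-supplied `grad_fn` that returns the gradient of one model output with respect to the input height map (in practice this would come from an autodiff framework). The sample count, the noise scale, and the use of absolute values before the per-column sum are assumptions, not details given in the source.

```python
import numpy as np

def smoothgrad_x_input(x, grad_fn, n_samples=32, noise_frac=0.1, seed=0):
    """SmoothGrad x Input: average gradients over noisy copies, then scale by x.

    x: (H, W) height map. grad_fn: maps an (H, W) array to d(output)/d(input).
    noise_frac sets the Gaussian noise std as a fraction of the input range.
    """
    rng = np.random.default_rng(seed)
    sigma = noise_frac * (x.max() - x.min())
    grad_sum = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        grad_sum += grad_fn(noisy)
    return (grad_sum / n_samples) * x

def column_importance(attribution):
    """Sum |attribution| over each spine column and normalize to sum to 1."""
    per_column = np.abs(attribution).sum(axis=0)
    total = per_column.sum()
    return per_column / total if total > 0 else per_column
```

For a 96 × 224 height map the result is a 224-element curve per animal; averaging those curves over animals in shared body coordinates gives a population-level importance profile of the kind discussed here.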

Population attribution curve over relative body position from tail (0) to head (1), showing a tail-rump peak near 0.1 and a broad rib-shoulder plateau from 0.4 to 0.9, with the anatomical last rib marked at 0.4. Curves for fat, loin, and total share the same shape.
The aggregate importance curve is strongly non-uniform — a sharp tail/rump peak (rel. ≈ 0.1) and a broad rib–shoulder plateau (rel. 0.4–0.9), with the anatomical last rib (dashed) on the plateau's rising edge. A global-statistic regressor would produce a flat curve. To pin this down quantitatively, a ridge regressor on four global features (mean height, body volume, area, max height) reaches only 9.13 mm MAE — 2.3× worse than PigFormer — confirming the encoder localizes to anatomically meaningful structure even with a single transformer layer.
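The global-feature baseline can be sketched in NumPy with a closed-form ridge fit. The four features follow the list above, but their exact definitions (body pixels taken as height > 0, volume as summed height times pixel area), the regularization strength, and the function names are assumptions made for illustration.

```python
import numpy as np

def global_features(height_map, pixel_area=1.0):
    """Return [mean height, body volume, area, max height] for one height map."""
    body = height_map > 0                    # assumed: background is exactly 0
    heights = height_map[body]
    area = body.sum() * pixel_area
    volume = heights.sum() * pixel_area
    return np.array([heights.mean(), volume, area, heights.max()])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on standardized features.

    Solves (Xs^T Xs + alpha I) w = Xs^T (y - mean(y)); the intercept is mean(y).
    """
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    Xs = (X - mu) / sd
    y_mean = y.mean()
    gram = Xs.T @ Xs + alpha * np.eye(Xs.shape[1])
    w = np.linalg.solve(gram, Xs.T @ (y - y_mean))
    return mu, sd, w, y_mean

def predict_ridge(model, X):
    """Apply a fitted ridge model to an (N, 4) feature matrix."""
    mu, sd, w, y_mean = model
    return ((X - mu) / sd) @ w + y_mean
```

A baseline like this sees only whole-body summaries, so any gap between its error and the encoder's is evidence that local surface shape carries information the summaries miss.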

Citation

@inproceedings{bashar2026pigformer,
  title     = {What's Under the Skin? Estimating Swine Body Condition},
  author    = {Bashar, Mk and Bhatti, Kuljit and Rohrer, Gary
               and Benjamin, Madonna and Brown-Brandl, Tami
               and Morris, Daniel},
  booktitle = {CV4Animals Workshop, IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  year      = {2026},
  eprint    = {2606.05611},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}