Overview
Real-world data distributions evolve over time, inducing temporal distribution shift that can substantially degrade the reliability of deployed machine learning systems.
We present a systematic empirical comparison of temporal robustness across three heterogeneous, time-indexed domains encompassing image classification, multi-label text classification, and text regression tasks. Using a unified evaluation framework based on temporal drift matrices, we train models on cumulative historical data and evaluate their performance on both earlier and later time periods.
Collectively, the results characterize how performance degrades as the train-test gap widens across domains, drift scenarios, and model families, offering practical guidance for selecting architectures for dynamic systems. Across all settings, inductive biases aligned with domain structure yield markedly greater temporal robustness.
Paper Contributions
(I) Cross-Domain Temporal Study
Systematic evaluation of temporal robustness across heterogeneous domains and task types, rather than a single benchmark.
(II) Unified Temporal Evaluation Framework
Model-agnostic protocol based on cumulative historical training and temporal drift matrices, yielding comparable robustness measurements across datasets, tasks, and metrics.
(III) Architectural Inductive Bias Analysis
Large-scale comparison of diverse neural architectures to quantify how differing inductive biases shape robustness under temporal shift.
Types of Temporal Drift
Following Gama et al. (2014), we distinguish three axes along which temporal distribution shift manifests:
Original Distribution
Circles and stars indicate the two label classes and the curve represents the decision boundary.
Covariate Shift
P(X) changes, P(Y|X) fixed
Input distribution evolves (e.g., hairstyles, image quality) while the labeling function remains stable.
Concept Drift
P(Y|X) changes
The relationship between inputs and labels shifts; the same text may indicate different categories over time.
Label Shift
P(Y) changes
Class prevalence evolves as research topics or product categories rise and fall in popularity.
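To make the three axes concrete, here is an illustrative Python sketch (our own toy construction, not from the paper) that generates a one-dimensional binary task and perturbs exactly one factor at a time:

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Original distribution: P(X) = N(0, 1), labeling rule P(Y|X) = 1[x > 0]
X0 = rng.normal(0.0, 1.0, n)
y0 = (X0 > 0.0).astype(int)

# Covariate shift: P(X) moves (e.g., hairstyles, image quality),
# while the labeling rule P(Y|X) stays fixed
X1 = rng.normal(1.5, 1.0, n)
y1 = (X1 > 0.0).astype(int)

# Concept drift: P(X) unchanged, but the labeling rule itself moves,
# so the same input can now receive a different label
X2 = rng.normal(0.0, 1.0, n)
y2 = (X2 > 0.8).astype(int)

# Label shift: class prevalence P(Y) changes while the class-conditional
# P(X|Y) = N(2y - 1, 1) stays fixed (topics rise and fall in popularity)
y3 = rng.binomial(1, 0.8, n)
X3 = rng.normal(2.0 * y3 - 1.0, 1.0)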
Datasets
1905–2013
Yearbook
Standardized portraits of U.S. high-school seniors with binary sex labels. Pronounced covariate and concept drift from evolving hairstyles, clothing, and photographic artefacts makes it a canonical long-range benchmark.
1986–2025
arXiv
2.87M article titles annotated with 176 subject categories; we restrict to the five most frequent categories and use a sample of 360,265 articles. Topic prevalence and terminology change over decades, inducing label and vocabulary drift in this multi-label setting.
1996–2023
Amazon Reviews
571M reviews spanning 33 categories; we uniformly sample 100,000 reviews across seven categories to capture covariate and concept drift in consumer language. Strong rating skew motivates balanced error metrics.
Models
Image Models (Yearbook)
MLPs
Two to five fully connected layers (199K–1.1M parameters) trained from scratch on flattened 32×32 pixels, providing a low-capacity reference with no spatial inductive bias.
CNNs
The Wild-Time baseline plus VGG-style, AlexNet-style, deeper, and wider variants (29K–2.9M parameters) explore locality, depth, and receptive-field width under convolutional inductive biases.
ResNets
Residual networks with two or three stages (660K–2.8M parameters) trained from scratch; skip connections stabilize deeper hierarchies for small portraits.
Pretrained Backbones
DINOv2-S, DINOv3-S, CLIP-B/32, ConvNeXt-S, EVA02-B, and SigLIP-B serve as frozen feature extractors (770–3K trainable classifier-head parameters); we train only linear heads to isolate the benefit of large-scale pretraining.
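As an illustration of this frozen-backbone setup, here is a minimal linear-probe sketch (our own simplification; `backbone` stands for any pretrained extractor, and the 384-dimensional feature width is an assumption in the style of DINOv2-S, for which a 2-class head has 384 × 2 + 2 = 770 parameters, the low end of the stated range):

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen pretrained backbone + trainable linear classification head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int = 2):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False                  # backbone stays frozen
        self.head = nn.Linear(feat_dim, n_classes)   # only these weights train

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)                 # pretrained features, no grads
        return self.head(feats)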
Text Models (arXiv, Amazon)
MLP (Bag-of-Embeddings)
Frozen RoBERTa embeddings are mean-pooled and fed into a two-layer MLP (39M trainable parameters) with varying dropout (baseline, low, high); this design discards word order entirely (a minimal sketch of the pipeline appears after this list).
TextCNN
Convolutional text classifiers with kernels spanning 2–5 tokens over learnable embeddings (6.6M–39M trainable parameters), capturing local n-gram statistics without long-range context.
RNNs
Bidirectional GRU and LSTM models (6.6M–15.5M trainable parameters) with learned embeddings process sequences token by token; the largest adds attention pooling over hidden states.
Transformers
Lightweight Transformer encoders with 2, 3, or 6 layers (6.9M trainable parameters) use learned positional embeddings and mean pooling to capture long-range dependencies.
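For concreteness, a minimal sketch of the bag-of-embeddings pipeline described above, assuming the Hugging Face transformers package and the roberta-base checkpoint (the exact checkpoint, head widths, and dropout rate here are our assumptions):

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base").eval()
for p in enc.parameters():
    p.requires_grad = False                      # RoBERTa stays frozen

def mean_pooled_embeddings(texts: list[str]) -> torch.Tensor:
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state  # [B, T, 768]
    mask = batch["attention_mask"].unsqueeze(-1) # zero out padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)  # order-agnostic mean pool

# Two-layer MLP head; hidden width is an illustrative choice
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                     nn.Dropout(0.1), nn.Linear(256, 5))
logits = head(mean_pooled_embeddings(["Deep learning for protein folding"]))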
Training Setup
All text models use the Adam optimizer with cumulative temporal training: weighted binary cross-entropy for arXiv and rating-balanced mean squared error for Amazon. A sketch of these two objectives follows.
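The function names and weighting scheme below are our own hedged reconstruction; in practice the per-category and per-rating weights would be estimated from the training data:

import torch
import torch.nn.functional as F

def weighted_bce(logits, targets, pos_weight):
    """arXiv: multi-label BCE where rare categories are up-weighted.
    pos_weight has one entry per label, e.g. inverse positive frequency."""
    return F.binary_cross_entropy_with_logits(logits, targets,
                                              pos_weight=pos_weight)

def rating_balanced_mse(pred, rating):
    """Amazon: each star rating contributes equally to the loss, so rare
    1- and 2-star reviews are not drowned out by the 5-star majority."""
    per_rating = [F.mse_loss(pred[rating == r], rating[rating == r].float())
                  for r in rating.unique()]
    return torch.stack(per_rating).mean()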
Temporal Drift Matrices
We adopt a cumulative-historical protocol: train on all data up to period \(i\), then evaluate on every period \(j\).
Each entry \(M_{ij}\) measures performance of model \(f_i\) (trained through period \(i\)) when evaluated on period \(j\). This reveals both forward generalization (\(j > i\)) and backward consistency (\(j < i\)), where backward evaluation uses only held-out validation and test splits to prevent data leakage.
Matrix Interpretation
For evaluation periods already included in training (\(j \le i\)), we use only held-out validation and test splits to avoid leakage; for \(j > i\), every available sample becomes out-of-distribution test data so forward estimates remain high-signal.
Cumulative models \(f_i\) are trained with Adam on all data up to slice \(i\). Yearbook minimizes cross-entropy, arXiv uses class-weighted binary cross-entropy per label, and Amazon minimizes rating-weighted mean squared error to counter strong label imbalance.
Binary accuracy populates Yearbook drift matrices, macro-AUC (uniformly averaged over five categories) summarizes arXiv, and balanced MSE averages per-rating errors so rare 1- and 2-star reviews remain visible.
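Putting the protocol together, here is a schematic sketch of how a drift matrix is populated. train_fn and eval_fn are placeholders for dataset-specific training and for the metrics above, and splits are assumed to be concatenable lists of examples:

import numpy as np

def drift_matrix(slices, train_fn, eval_fn):
    """M[i, j] = score of the model trained on all data through slice i,
    evaluated on slice j.

    slices   -- list of (train_data, heldout_data) per time period
    train_fn -- fits a model on a list of training splits
    eval_fn  -- scores a model on one split (accuracy, macro-AUC, ...)
    """
    T = len(slices)
    M = np.empty((T, T))
    for i in range(T):
        model = train_fn([slices[t][0] for t in range(i + 1)])  # cumulative
        for j in range(T):
            if j <= i:
                # slice j was seen in training: score held-out splits only,
                # preventing leakage in backward evaluation
                M[i, j] = eval_fn(model, slices[j][1])
            else:
                # strictly future slice: every sample is out-of-distribution
                M[i, j] = eval_fn(model, slices[j][0] + slices[j][1])
    return M

# Forward decay of model i: average drop relative to its diagonal entry,
# e.g. np.mean(M[i, i] - M[i, i + 1:])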
Example of an average accuracy drift matrix computed over a group of Yearbook models.
Image Classification (Yearbook)
Average faces from the Yearbook dataset across decades.
Performance
ResNets
Most robust scratch-trained models. Diagonal accuracy above 93% with the smallest forward decay; hierarchical features remain stable across the mid-century fashion transition.
CNNs
High diagonal accuracy (mid-90% range), but a 10–20-point drop on future decades. Performance collapses sharply post-1970s as hairstyles and accessories change.
MLPs
Largest degradation, with over a 15-point loss on future data. These models rely on global pixel statistics: hair silhouette, background intensity.
Pretrained Backbones
Frozen backbones (self-supervised and vision-language) yield lower peak accuracy yet single-digit forward decay, trading specialization for smoother temporal performance.
Saliency Analysis
We compute gradient-based saliency maps to visualize which image regions each architecture attends to, explaining the observed robustness differences; a minimal sketch of the computation follows the panels below.
MLP / CNN
Broad, noisy attribution over hair, clothing, and background: features that change significantly across decades.
ResNet
Concentrated on inner facial regions (eyes, nose, mouth) that stay stable across decades; peripheral regions are down-weighted.
Architectures that focus on temporally stable semantic features (facial structure) exhibit higher cross-temporal robustness.
(a) ResNet-Small
(b) MLP-Small
(c) CNN-Baseline
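Maps like those above can be reproduced in spirit with vanilla gradient attribution. The text specifies only "gradient-based" saliency, so this minimal sketch is one plausible instantiation rather than the exact method used; model and image stand for any trained classifier and input tensor:

import torch

def saliency_map(model, image, target_class):
    """|d logit_target / d pixel|, maxed over color channels -> [H, W] map."""
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)   # [1, C, H, W], track grads
    score = model(x)[0, target_class]             # logit of the target class
    score.backward()                              # gradients w.r.t. pixels
    return x.grad.abs().squeeze(0).max(dim=0).values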
Text Classification & Regression
Across arXiv (macro-AUC) and Amazon Reviews (balanced MSE), models trained on early slices degrade as the evaluation window moves forward: less dramatically than on Yearbook, yet enough to expose architectural differences.
arXiv (Multi-label, Macro-AUC)
MLP (Bag-of-Embeddings)
Order-agnostic pooling exhibits the steepest forward decay: models trained on 2000-era data lose more than 0.05 macro-AUC when evaluated post-2015, because new terminology shifts the embedding statistics.
TextCNN
Local convolutional filters over embeddings remain surprisingly brittle: despite competitive diagonal scores, the drift matrices show sharp drops for models trained before 2010.
RNNs
Bidirectional GRU/LSTM variants achieve the flattest drift matrices: sequential inductive biases capture syntactic structure that changes slowly, yielding small deviations even far from the training slice.
Transformers
Lightweight Transformers match the RNNs' diagonal performance but show moderate forward decay, landing between recurrent robustness and the drift sensitivity of TextCNN.
Figure 3: Drift Matrix — arXiv (Transformer, Macro-AUC)
Amazon Reviews (Regression, Balanced MSE)
MLP (Bag-of-Embeddings)
Bag-of-embeddings baselines underperform both diagonally and forward in time; ignoring word order makes them sensitive to shifts in review phrasing and platform-specific jargon.
TextCNN
Convolutional models show high variance across configurations and suffer pronounced future degradation despite strong local feature extraction.
RNNs
GRU, LSTM, and LSTM-Attn variants deliver the lowest balanced MSE overall and the most stable forward trajectories; temporal drift barely nudges their error maps.
Transformers
Transformer encoders benefit from parallel self-attention yet still exhibit more drift than recurrent counterparts, underscoring the utility of an explicit sequential inductive bias.
Figure 4: Drift Matrix — Amazon (Transformer, Balanced MSE)
Central Observation
Architectures whose inductive biases match domain structure (spatial locality for faces, sequential modeling for text, broad coverage from pretraining) retain accuracy far longer under temporal distribution shift.
Summary
In-distribution ≠ Temporal Robustness
High held-out accuracy does not predict cross-temporal generalization. Evaluation must explicitly measure forward drift.
Inductive Biases Determine Robustness
Architectures whose biases match domain structure—spatial hierarchy for images, sequential modeling for text—degrade more gracefully.
ResNets Most Robust for Images
Among scratch-trained models, ResNets show smallest forward decay; MLPs are most drift-sensitive due to reliance on global statistics.
Pretraining Smooths Degradation
Large-scale pretraining yields lower peak accuracy but flatter decay curves—trading specialization for temporal stability.