SAFETYLIT WEEKLY UPDATE

We compile citations and summaries of about 400 new articles every week.
RSS Feed

HELP: Tutorials | FAQ
CONTACT US: Contact info

Search Results

Journal Article

Citation

Fithian W, Hastie T. Ann. Stat. 2014; 42(5): 1693-1724.

Affiliation

Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305-4065, USA.

Copyright

(Copyright © 2014, Institute of Mathematical Statistics)

DOI

10.1214/14-AOS1220

PMID

25492979

Abstract

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE-even if the selected subsample comprises a miniscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to [Formula: see text] if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly [Formula: see text] times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.


Language: en

NEW SEARCH


All SafetyLit records are available for automatic download to Zotero & Mendeley
Print