SAFETYLIT WEEKLY UPDATE


Journal Article

Citation

Coley RY, Walker RL, Cruz M, Simon GE, Shortreed SM. Biom. J. 2021; ePub(ePub): ePub.

Copyright

(Copyright © 2021, John Wiley and Sons)

DOI

10.1002/bimj.202000199

PMID

unavailable

Abstract

Clinical visit data are clustered within people, which complicates prediction modeling. Cluster size is often informative because people receiving more care are less healthy and at higher risk of poor outcomes. We used data from seven health systems on 1,518,968 outpatient mental health visits from January 1, 2012 to June 30, 2015 to predict suicide attempt within 90 days. We evaluated true performance of prediction models using a prospective validation set of 4,286,495 visits from October 1, 2015 to September 30, 2017. We examined dividing clustered data at the person level or the visit level for model training and cross-validation, and considered a within-cluster resampling approach for model estimation. We evaluated optimism by comparing estimated performance from a left-out testing dataset to performance in the prospective dataset. We used two prediction methods: logistic regression with least absolute shrinkage and selection operator (LASSO) and random forest. The random forest model using a visit-level split for model training and testing was optimistic; it overestimated discrimination (area under the curve, AUC = 0.95 in testing versus 0.84 in prospective validation) and classification accuracy (sensitivity = 0.48 in testing versus 0.19 in prospective validation, 95th percentile cut-off). Logistic regression and random forest models using a person-level split performed well, accurately estimating prospective discrimination and classification: estimated AUCs ranged from 0.85 to 0.87 in testing versus 0.85 in prospective validation, and sensitivity ranged from 0.15 to 0.20 in testing versus 0.17 to 0.19 in prospective validation. Within-cluster resampling did not improve performance. We recommend dividing clustered data at the person level, rather than the visit level, to ensure strong performance in prospective use and accurate estimation of future performance at the time of model development.
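
The abstract contrasts person-level and visit-level splits and mentions within-cluster resampling, and a small illustration may help. The following is a minimal sketch, not the authors' code: the record layout (`person_id`, `visit_id`), the function names, and the toy data are all assumptions made for illustration. It shows one way to keep every visit from a given person on the same side of a train/test split, and one way to draw a single visit per person, which is the basic step that within-cluster resampling repeats many times.

```python
import random
from collections import defaultdict


def person_level_split(visits, test_fraction=0.3, seed=0):
    """Split visits so that all visits from one person land on the same side.

    `visits` is a list of dicts with a hypothetical "person_id" key.
    """
    person_ids = sorted({v["person_id"] for v in visits})
    rng = random.Random(seed)
    rng.shuffle(person_ids)
    # Hold out a fraction of people (not of visits), at least one person.
    n_test = max(1, round(test_fraction * len(person_ids)))
    test_people = set(person_ids[:n_test])
    train = [v for v in visits if v["person_id"] not in test_people]
    test = [v for v in visits if v["person_id"] in test_people]
    return train, test


def one_visit_per_person(visits, seed=0):
    """Draw one visit at random from each person (a single resample)."""
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for v in visits:
        by_person[v["person_id"]].append(v)
    return [rng.choice(by_person[p]) for p in sorted(by_person)]


# Toy data (hypothetical): four people with unequal numbers of visits.
visits = [
    {"person_id": p, "visit_id": f"{p}-{k}"}
    for p, n in [("a", 5), ("b", 1), ("c", 3), ("d", 2)]
    for k in range(n)
]

train, test = person_level_split(visits, test_fraction=0.25, seed=1)
resample = one_visit_per_person(visits, seed=1)
```

A visit-level split would instead shuffle the visit records themselves, so the same person could appear in both training and testing data; the abstract reports that this led to optimistic performance estimates for the random forest model.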


Language: en

Keywords

machine learning; correlated data; electronic health records; nonignorable cluster size; predictive analytics
