Currently there is little practical advice on which treatment effect estimator to use when trying to adjust for observable differences. A recent suggestion is to compare the performance of estimators in simulations designed to mimic the empirical context. Two ways to run such 'empirical Monte Carlo studies' (EMCS) have been proposed. We show theoretically that neither is likely to be informative except under restrictive conditions that are unlikely to be satisfied in many contexts. To assess their empirical relevance, we also apply the approaches to a real-world setting where estimator performance is known. We find that in our setting both EMCS approaches are worse than random at selecting estimators that minimise absolute bias. They perform better when selecting estimators that minimise mean squared error. However, a simple bootstrap performs at least as well and often better. For now, researchers would be best advised to use a range of estimators and compare estimates for robustness.