Can publicly available, web-scraped data be used to identify promising business startups at an early stage? To answer this question, we use such textual and non-textual information about the names of Danish firms and their addresses as well as their business purpose statements (BPSs), supplemented by core accounting information along with founder and initial startup characteristics, to forecast the performance of newly started enterprises over a five-year time horizon. The performance outcomes we consider are involuntary exit, above-average employment growth, a return on assets above 20 percent, new patent applications and participation in an innovation subsidy program. Our first key finding is that our models predict startup performance with either high or very high accuracy, with the exception of high returns on assets, where predictive power remains poor. Our second key finding is that the data requirements for predicting performance outcomes with such accuracy are low. To forecast the two innovation-related performance outcomes well, we only need to include a set of variables derived from the BPS texts, while an accurate prediction of startup survival and high employment growth needs the combination of (i) information derived from the names of the startups, (ii) data on elementary founder-related characteristics and (iii) either variables describing the initial characteristics of the startup (to predict startup survival) or business purpose statement information (to predict high employment growth). These sets of variables are easily obtainable since reporting the underlying information is mandatory upon business registration. The substantial accuracy of our predictions for survival, employment growth, new patents and participation in innovation subsidy programs indicates ample scope for algorithmic scoring models as an additional pillar of funding and innovation support decisions.
The document is publicly accessible within the scope of German copyright law.