This study examines the accuracy and consistency of Generative AI (GenAI) by testing ChatGPT's ability to estimate the accuracy of 719 business research hypotheses. For critical tasks, we find GenAI performance inadequate in both accuracy and consistency. Accuracy improved only marginally, from 76.5% (GPT-3.5, 2024) to 80% (GPT-5 mini, 2025), yielding an effective chance-adjusted accuracy of only 60%. Moreover, accuracy drops sharply for statistically insignificant hypotheses, reaching only 16.4% in 2025. Crucially, consistency across ten identical prompts was poor, with over a quarter of cases containing at least one incorrect estimate. We conclude that GenAI's linguistic fluency is not yet backed by commensurate conceptual intelligence and frequently yields unreliable output, necessitating vigilant human oversight.
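The chance-adjusted figure can be sketched as follows, assuming a binary significant/insignificant judgment with a 50% chance baseline (the formula and the baseline are illustrative assumptions; the abstract does not state the adjustment method):

```python
def chance_adjusted_accuracy(raw_acc, chance=0.5):
    """Kappa-style correction: fraction of the above-chance range achieved.

    raw_acc: observed accuracy; chance: accuracy of random guessing
    (assumed 0.5 here for a binary significant/insignificant call).
    """
    return (raw_acc - chance) / (1 - chance)

# GPT-5 mini (2025): 80% raw accuracy
print(round(chance_adjusted_accuracy(0.80), 2))  # → 0.6
```

Under the same assumed baseline, the 2024 figure of 76.5% would correspond to a chance-adjusted accuracy of 53%, underscoring how modest the year-over-year gain is once guessing is accounted for.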