AI groups rush to redesign model testing and create new benchmarks (Financial Times)