It can be difficult to compare the performance of different models if you run each one on different data. Pretty happy to clearly see the performance comparison between different model versions with Weights & Biases. The graphs shown below indicate the average performance of the different models on different metrics.