Using temperature, per-species accuracy, and the number of samples per species, design a formula to display a calibrated confidence from the softmax output or raw logits.
Background: when a model outputs a prediction of a moth species, we don't know how confident the model is about that prediction. We use one common metric, the softmax score, as a proxy for confidence, but this value does not account for how much the model knows (or doesn't know) about each species, and softmax scores cannot be compared across different models. In most applied settings the softmax score is never displayed; the interface may only show "very sure" or "unsure" for a prediction. We need a way to compute a reliable threshold for when a prediction can be trusted, and when it should be questioned (or rolled up to a higher taxon rank).
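A minimal sketch of one possible way to combine the three signals into a single displayed confidence, assuming a calibrated temperature `T` is already available (temperature calibration itself is discussed in the comment below). All function names and the shrinkage constant `k` here are hypothetical illustrations, not part of any existing pipeline or a settled design: the temperature-scaled softmax of the top class is discounted by a per-species reliability term that shrinks per-species validation accuracy toward the global accuracy when a species has few samples.

```python
import numpy as np

def species_reliability(acc_s: float, n_s: int, acc_global: float, k: int = 50) -> float:
    """Shrink per-species validation accuracy toward the global accuracy.
    `k` (hypothetical) controls how many samples a species needs before its
    own accuracy estimate dominates the global prior."""
    return (n_s * acc_s + k * acc_global) / (n_s + k)

def calibrated_confidence(logits: np.ndarray, T: float,
                          acc_s: float, n_s: int, acc_global: float) -> float:
    """Temperature-scaled softmax probability of the top class, discounted by
    how reliably the model has historically identified that species."""
    z = logits / T
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()      # temperature-scaled softmax
    return float(p.max() * species_reliability(acc_s, n_s, acc_global))
```

A single threshold on this combined score (e.g., "trust above 0.7, otherwise roll up to genus") could then be tuned on held-out data; the exact thresholds would need to be validated per deployment.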
Temperature calibration. Setting all the configuration details aside, it is known that NNs tend to be "too confident" when predicting classes [GPSW17, MDR+21]. Here, confidence means the probability that the prediction is correct (e.g., the softmax probability of the predicted class). The deep learning literature has developed methods to calibrate this confidence, i.e., to bring it closer to the true probability. The simplest and most common calibration factor is the temperature T, a single scalar that "softens" the softmax by dividing the logits; when the fitted T is close to 1, the estimated output probability is already close to the true probability, cf. [GPSW17].
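As a concrete illustration of temperature scaling in the sense of [GPSW17], the sketch below fits a single scalar T on held-out validation logits by minimizing the negative log-likelihood, then reuses it at inference time. `val_logits` (N x C), `val_labels` (N,), and `test_logits` are assumed to come from an existing validation/inference pipeline; the names are placeholders.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Row-wise softmax with temperature T applied to the logits."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray) -> float:
    """Find T > 0 that minimizes the validation negative log-likelihood."""
    def nll(T: float) -> float:
        probs = softmax(val_logits, T)
        return -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
    res = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded")
    return float(res.x)

# Usage:
#   T_hat = fit_temperature(val_logits, val_labels)
#   calibrated = softmax(test_logits, T_hat)  # calibrated per-class confidences
```

Because T is fit per model, the resulting calibrated scores are also more comparable across models than raw softmax values, which is one of the concerns raised in the issue description.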