Detecting objects in the environment and forming a sense of confidence in these detection decisions typically involve multisensory processing. We sought to characterize how humans form amodal and modality-specific confidence judgments during audiovisual detection. We found that participants made more accurate detection and confidence judgments for audiovisual than for unimodal stimuli. To explain these results, we extended a Bayesian evidence accumulation model to audiovisual detection and successfully reproduced both unimodal and audiovisual detection judgments. Despite being fitted to decisions and decision times alone, our model accurately reproduced modality-specific confidence. It failed, however, to account for amodal confidence, suggesting that the latter might not arise from optimal signal integration in detection contexts. Our results indicate that, in the presence of audiovisual signals, different integration rules apply to perceptual and metacognitive decisions.