I need to:
(a) gather more data
(b) establish its "ground truth" labels
(c) measure inter-rater or intra-rater reliability (one rater could be myself or my colleague Jakob, and the other could be a radiologist)
There are different methods of collecting ground truth. One strategy would be to ask a rater to sort the 100 images (or image patches) in order of density. For intra-rater reliability, I would then shuffle the deck and have the rater repeat the task. For inter-rater reliability, I would simply compare the sorted lists between raters. In fact, there are a variety of distance measures for ranked lists that I can use when the time comes.
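As a rough sketch of what that comparison could look like, one of those distance measures is Kendall's tau, which counts how many pairs of images the two orderings rank in the same direction. The image IDs and orderings below are hypothetical, just to show the mechanics:

```python
# A minimal sketch: compare two raters' sorted orderings of the same images
# using Kendall's tau (+1 = identical order, -1 = completely reversed).
from scipy.stats import kendalltau

def rank_positions(ordering):
    """Map each image ID to its position (rank) in a rater's sorted list."""
    return {image_id: rank for rank, image_id in enumerate(ordering)}

def rank_agreement(ordering_a, ordering_b):
    """Kendall's tau between two orderings of the same set of images."""
    pos_a = rank_positions(ordering_a)
    pos_b = rank_positions(ordering_b)
    images = sorted(pos_a)  # same set of image IDs in both orderings
    ranks_a = [pos_a[i] for i in images]
    ranks_b = [pos_b[i] for i in images]
    tau, p_value = kendalltau(ranks_a, ranks_b)
    return tau, p_value

# Hypothetical example: three images sorted from least to most dense by two raters
tau, p = rank_agreement(["img07", "img12", "img03"], ["img12", "img07", "img03"])
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")
```

The same function works for intra-rater reliability: just pass in the same rater's two orderings from before and after the shuffle.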
An alternative:
Now, if sorting the entire set of images is too much work for a rater, I could use an approximation: pick a subset of random pairs of images and just have the rater say which image in each pair is denser.
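A quick sketch of that pairwise version, assuming the judgments are recorded as "which image did the rater call denser" for each sampled pair (the image IDs and judgments here are made up for illustration):

```python
# A minimal sketch of the pairwise approximation: sample random image pairs
# and compute the fraction of pairs on which two raters agree.
import random

def sample_pairs(image_ids, n_pairs, seed=0):
    """Draw n_pairs distinct random pairs of images."""
    rng = random.Random(seed)
    all_pairs = [(a, b) for i, a in enumerate(image_ids) for b in image_ids[i + 1:]]
    return rng.sample(all_pairs, n_pairs)

def pairwise_agreement(judgments_a, judgments_b):
    """Fraction of sampled pairs where both raters picked the same image as denser."""
    assert len(judgments_a) == len(judgments_b)
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return matches / len(judgments_a)

# Hypothetical example: judgments for five sampled pairs from two raters
pairs = sample_pairs([f"img{i:02d}" for i in range(100)], n_pairs=5)
rater_1 = ["img07", "img12", "img33", "img02", "img58"]
rater_2 = ["img07", "img12", "img40", "img02", "img58"]
print(f"Agreement on {len(pairs)} pairs: {pairwise_agreement(rater_1, rater_2):.2f}")
```

For intra-rater reliability the same pairs would simply be shown to the same rater twice, in shuffled order, and the two rounds of judgments compared.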