Addressing level of expertise


The question/answer model for classifying coins is reasonable for applications where the person who trained the system and the person using it have equal expertise. For instance, in the bird dataset presented by [1], the questions and answers were mostly understood by the users, and thus the “human” element could be used to increase classification accuracy. However, the paper notes that the deterministic experiments yielded higher accuracy than the tests with actual humans. Was this due to an incomplete understanding of the questions and answers? The paper addressed this issue by allowing users to rate their confidence in their answers. While this is a reasonable method, there is little work done on the accuracy of self-reported confidence from the psychological standpoint. Is it fair to be confident about confidence?

Back to the coins, there are a few intuitions which seem important to note. First, coins are ultimately annotated by students who copy from primary sources written by experts. Second, ideally the expert uses a more precise vocabulary than the student (where “precise” is a metric not yet defined). Third, ideally the entropy present in the answers given by experts is far higher than that of students. That is, students answer questions in more general terms than experts, such that an expert’s answer is far more helpful in classifying a coin than a student’s answer. The problem arises because the annotations ultimately come from experts and no graceful method for generalizing the questions/answers exists implicitly in the dataset. The questions trivially present in the database, such as “Name the figure on the reverse of the coin from this list”, have no clear general form. That is, not without some linguistic knowledge, since the general form of the question could be “Is the figure on the reverse of the coin a male or female or other?”
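The third intuition can be made concrete with Shannon entropy. The sketch below is only an illustration: the two answer lists are hypothetical, invented to show how specific (expert-style) answers carry more bits per answer than general (student-style) answers for the same eight coins.

```python
import math
from collections import Counter

def shannon_entropy(answers):
    """Return the Shannon entropy, in bits, of a list of categorical answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical answers to "who is on the obverse?" for the same eight coins.
expert_answers = ["Zeus", "Ares", "Apollo", "Hermes", "Luna", "Jupiter", "Mars", "Diana"]
student_answers = ["man", "man", "man", "man", "woman", "man", "man", "woman"]

expert_bits = shannon_entropy(expert_answers)    # eight distinct answers
student_bits = shannon_entropy(student_answers)  # only two distinct answers

print(f"expert: {expert_bits:.3f} bits, student: {student_bits:.3f} bits")
```

Under these assumed answers, the expert vocabulary separates all eight coins, while the student vocabulary only splits them into two groups, which is why an expert’s answer narrows down the coin much faster.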

One way of generalizing questions/answers is to use WordNet. This is not a trivial issue, but it has been researched at length. Instead of asking the question “Figure on the obverse of coin” and offering over 100 answers to choose from, more abstract questions could potentially be asked by abstracting the answers, with the answer then refined further as needed. Below is a tree of the abstraction of answers, where the green circles represent actual answers encoded in the dataset and the black circles represent abstract answers.

Many terms are not included in this chart, and many of the generalized answers are not intuitive; nor is it clear what questions can be posed such that the answers make sense. These are points for further research. What does become apparent is that several answers which are encoded in the dataset can be generalized into much broader terms. For example, instead of having the answers “Ares” and “Luna” side by side, it is more intuitive (a metric that needs further defining) for a student to first be asked to choose between “Greek deities” and “Roman deities”.
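The generalization step can be sketched with a small hand-built taxonomy. The parent links below are hypothetical and only stand in for WordNet hypernym relations; they are not taken from the coin dataset or from WordNet itself, and the function names are illustrative.

```python
# Hypothetical child -> parent links standing in for WordNet hypernym relations.
PARENT = {
    "Ares": "Greek deity",
    "Zeus": "Greek deity",
    "Luna": "Roman deity",
    "Jupiter": "Roman deity",
    "Greek deity": "deity",
    "Roman deity": "deity",
    "eagle": "bird",
    "owl": "bird",
    "bird": "animal",
    "deity": "figure",
    "animal": "figure",
}

def path_to_root(term):
    """Return the chain [term, parent, ..., root]."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def generalize(answer, levels_up=1):
    """Replace an answer with its ancestor `levels_up` steps higher, stopping at the root."""
    path = path_to_root(answer)
    return path[min(levels_up, len(path) - 1)]

def group_answers(answers, levels_up=1):
    """Group concrete dataset answers under their shared abstract answer."""
    groups = {}
    for answer in answers:
        groups.setdefault(generalize(answer, levels_up), []).append(answer)
    return groups

coarse = group_answers(["Ares", "Luna", "Zeus", "Jupiter", "eagle", "owl"])
print(coarse)
```

A student could first be offered the keys of `coarse` as choices, and only then be asked to pick among the concrete answers inside the chosen group.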

This is not a new idea and has been partially explored in [2], where image classification is improved by the use of WordNet ontologies. What has not been done, however, is to add the interactive nature of question and answer, which is crucial for difficult datasets such as coins.
  1. “Visual Recognition with Humans in the Loop”, Branson, Wah, Schroff, Babenko, Welinder, Perona, Belongie. 2010
  2. “Exploiting ontologies for automatic image annotation”, Srikanth, Varner, Bowden, Moldovan. SIGIR 2005.

Answer similarity


In the question/answer model for coin identification, similarity is an important metric. Prior question/answer systems have shown good results by using image similarity and across-question similarity; however, the similarity between answers has mostly been ignored.

Let’s assume that for any given question there is a discrete set of possible answers, one of which is the “correct” one. In prior work, a wrong answer can drastically skew the accuracy of the results. This is particularly the case in the coin dataset. If a coin is annotated as having a “young male” on the obverse side but the user mistakes it for “Zeus”, identifying the coin becomes almost impossible. Current systems recover from such situations by asking the user more questions so as to dampen the effect of the incorrect answer. This is clumsy, however: the more questions are asked, the more chances there are for incorrect answers to be given (or rather, there is a fixed error rate for all questions).
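The effect of a fixed per-question error rate can be illustrated with a short calculation. The sketch assumes that answers are wrong independently of one another and uses a hypothetical error rate of 10%; neither assumption comes from measured data.

```python
def prob_any_wrong(error_rate, num_questions):
    """Probability that at least one of `num_questions` independent answers is wrong."""
    return 1.0 - (1.0 - error_rate) ** num_questions

# Hypothetical 10% chance of a wrong answer on every question.
for n in (1, 5, 10, 20):
    print(n, round(prob_any_wrong(0.1, n), 3))
```

Under these assumptions the chance of at least one wrong answer grows steadily with the number of questions, so asking more questions to dampen one mistake also invites new mistakes.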

For this reason, it seems useful to have a metric that measures how similar answers are, so that “young male” confused with “Zeus” is not considered completely incorrect. Several issues arise, each of which is interesting in its own right as a possible research path:
  • “young male” vs. “boy”. WordNet can be used to find similarity of single words, not phrases. How to address phrase similarity is a problem.
  • “man” vs. “Zeus”. WordNet has only limited coverage of proper nouns. This seems a problem suited to ontological studies.
  • “Zeus” vs. “Jupiter”. This is the most curious issue because it cuts across several problems. First, it is similar to the ontological problem, as both could be thought of as “man” and thus share some similarity. Alternatively, some equivalence could exist between them, as with a pseudonym. Second, and much more interesting, there exists no historical/mythical genealogical database similar to WordNet. It would be very helpful if the similarity between “Henry II” and “Henry III” could be calculated, because the two are much more similar to each other than either is to “Buddha”, who shares no historical or mythical connection with them.
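One possible shape for such a metric is a Wu-Palmer-style score over a taxonomy, where two answers are more similar the deeper their lowest common ancestor lies. The sketch below is an assumption-laden illustration: the taxonomy is hand-built and hypothetical, phrases such as “young male” are treated as single nodes (which sidesteps the phrase problem rather than solving it), and proper nouns are placed by hand.

```python
# Hypothetical child -> parent links; "entity" is the single root.
PARENT = {
    "being": "entity",
    "object": "entity",
    "male figure": "being",
    "female figure": "being",
    "young male": "male figure",
    "male deity": "male figure",
    "Zeus": "male deity",
    "Jupiter": "male deity",
    "Luna": "female figure",
    "wreath": "object",
}

def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(term):
    """Depth of a term, counting the root as depth 1."""
    return len(ancestors(term))

def similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(lowest common ancestor) / (depth(a) + depth(b))."""
    ancestors_b = set(ancestors(b))
    # Walking upward from `a`, the first node shared with `b` is the lowest common ancestor.
    # This relies on the taxonomy having a single root.
    common = next(node for node in ancestors(a) if node in ancestors_b)
    return 2 * depth(common) / (depth(a) + depth(b))

for pair in [("Zeus", "Jupiter"), ("young male", "Zeus"), ("young male", "Luna"), ("young male", "wreath")]:
    print(pair, round(similarity(*pair), 3))
```

With a score like this, answering “Zeus” for a coin annotated “young male” would count as a partial match rather than a complete miss, while an unrelated answer such as “wreath” would still score low.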

Questions to ask


One method for dealing with multi-class classification/image retrieval is to incorporate the user into the task. Particularly interesting is the approach of asking the user questions which he/she can readily answer but the computer cannot.


A simple example would be the “Animals with Attributes” dataset, where every picture has a set of attributes which are all equally challenging. These include questions such as: is it X (where X is the color of the animal), does it have stripes, does it like water, does it eat X, etc. These are complex questions that require prior knowledge, but they can also quickly improve classification/retrieval accuracy. [1]


Coins are very similar to the “Animals with Attributes” dataset. Some basic questions include: does the obverse contain an animal, what is the pose of the obverse figure, how many beings are on the obverse side, etc. As can be seen, while questions can be formulated, they are no longer boolean. They could be reduced to boolean questions, but doing so would create dependencies between the questions and drastically increase the total number of attributes. For this reason, a good system would have to incorporate non-boolean questions to be practical. Also, many of the answers to these questions are similar; for instance, some figures are said to be “charging” while others are “running”. If not accounted for, these similarities would make any system impractical. Some understanding of attribute similarity is important to allow for both the user errors and the ground-truth errors that may exist.
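The cost of reducing multiple-choice questions to boolean ones can be sketched as follows. The questions and answer sets are hypothetical examples, not the actual coin attributes; the point is only that each answer becomes its own yes/no attribute and that the resulting attributes are mutually exclusive within a question.

```python
# Hypothetical multiple-choice questions and their answer sets.
QUESTIONS = {
    "obverse pose": ["standing", "seated", "running", "charging"],
    "number of beings on obverse": ["0", "1", "2", "3 or more"],
    "obverse contains an animal": ["yes", "no"],
}

def to_boolean_attributes(questions):
    """Expand each multiple-choice question into one yes/no attribute per answer."""
    return [
        f"{question} = {answer}?"
        for question, answers in questions.items()
        for answer in answers
    ]

def encode(question, answer, questions):
    """One-hot encode an answer; exactly one entry is True, so the booleans are dependent."""
    return [candidate == answer for candidate in questions[question]]

boolean_attributes = to_boolean_attributes(QUESTIONS)
print(len(QUESTIONS), "questions become", len(boolean_attributes), "boolean attributes")
print(encode("obverse pose", "running", QUESTIONS))
```

Here three questions already expand into ten boolean attributes, and knowing that one attribute of a question is true fixes all the others to false, which is the dependency noted above. A question with over 100 possible answers would expand accordingly.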

[1] Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer. Lampert, Nickisch, Harmeling. 2009