Muddiest Point:
I am fine with this class, and I do not have a muddiest point this week.
Reading Notes:
8.1 Information retrieval system evaluation
In this chapter, the author discusses how to measure the effectiveness of
IR systems.
8.2 Standard test collections
Here is a list of the standard test collections:
The Cranfield collection: precise but too small
Text Retrieval Conference (TREC): topics are specified in detailed text passages. It is the
largest collection, and its topics are more consistent.
GOV2: the largest Web
collection easily available for research purposes.
NTCIR: focused on East Asian languages and cross-language information retrieval.
CLEF: concentrated on European languages and cross-language information retrieval.
REUTERS: its scale and rich annotation make it a better basis for
future research.
20 NEWSGROUPS: It consists of 1000 articles from each of 20 Usenet
newsgroups (the newsgroup name being regarded as the category).
8.3 Evaluation of unranked retrieval sets
Precision and recall: precision is the fraction of retrieved documents that are relevant,
and recall is the fraction of relevant documents that are retrieved.
A single measure that trades off precision versus recall
is the F measure, which is the weighted harmonic mean of precision and
recall: F = (β² + 1)PR / (β²P + R), with the balanced F1 = 2PR / (P + R).
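As a quick illustration, here is a minimal Python sketch of these unranked measures; the document-ID sets are made up for the example, not taken from the book:

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, balanced F1) for two sets of document IDs."""
    tp = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f1)  # precision = 0.5, recall ≈ 0.667, F1 ≈ 0.571
```

The harmonic mean punishes imbalance: a system with perfect recall but near-zero precision still gets a near-zero F1.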

8.4 Evaluation of ranked retrieval results
Entire precision-recall curve
The traditional way to examine the entire precision-recall
curve is 11-point interpolated average precision.
Other measures have become more common:
Mean Average Precision (MAP)
Precision at k: precision measured at fixed low levels of retrieved
results, such as 10 or 30 documents
R-precision
Break-even point
ROC curve (plotting sensitivity against 1 − specificity)
NDCG (normalized discounted cumulative gain)
8.5 Assessing relevance
This section discusses how to develop reliable and informative test
collections.
8.6 A broader perspective: System quality and user utility
This section covers other system aspects that allow quantitative
evaluation, and the issue of user utility.
System issues: all the criteria apart from query language expressiveness
are straightforwardly measurable: we can quantify speed or size.
User utility: quantifying aggregate user happiness, based on the
relevance, speed, and user interface of a system.
Refining a deployed system
The most common approach is A/B testing. The basis of A/B testing is
running a series of single-variable tests.
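A minimal sketch of how one such single-variable test might be analyzed. The click counts are invented, and the two-proportion z-test on clickthrough rate is a standard statistical choice, not something the chapter prescribes:

```python
from math import sqrt

def ab_test_z(clicks_a, views_a, clicks_b, views_b):
    """Return (ctr_a, ctr_b, z): clickthrough rates for the control (A) and
    test (B) buckets, plus the two-proportion z statistic for their difference."""
    ctr_a = clicks_a / views_a
    ctr_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)  # pooled CTR under H0
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return ctr_a, ctr_b, (ctr_b - ctr_a) / se

ctr_a, ctr_b, z = ab_test_z(clicks_a=200, views_a=10_000,
                            clicks_b=260, views_b=10_000)
print(round(ctr_a, 3), round(ctr_b, 3), round(z, 2))  # z ≈ 2.83
```

A |z| above roughly 1.96 corresponds to significance at the 5% level, so in this made-up example the variant's higher CTR would be judged a real improvement.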
8.7 Results snippets
Summaries can be static or dynamic.
A static summary generally consists of a subset of the document,
metadata associated with the document, or both.
Dynamic summaries display one or more “windows” on the
document, aiming to present the pieces that have the most utility to the user
in evaluating the document with respect to their information need.
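A toy sketch of that windowing idea: score every fixed-width token window by how many query terms it contains and return the best one. The whitespace tokenizer and window size are simplifying assumptions, not the book's method:

```python
def best_snippet(doc_text, query_terms, window=10):
    """Return the window-token span of doc_text containing the most query terms."""
    tokens = doc_text.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(tokens) - window + 1)):
        hits = sum(1 for t in tokens[start:start + window] if t.lower() in terms)
        if hits > best_hits:  # keep the earliest best-scoring window
            best_start, best_hits = start, hits
    return " ".join(tokens[best_start:best_start + window])

doc = ("information retrieval systems are evaluated with test collections "
       "and relevance judgments made by human assessors")
print(best_snippet(doc, ["relevance", "judgments"], window=5))
```

Real snippet generators add refinements (sentence boundaries, term highlighting, multiple windows), but the core is this same window-scoring step.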