Friday, February 14, 2014

Unit 6 Muddiest Points and Reading notes

Muddiest Point:
I am fine with this class, and I do not have a muddiest point this week.

Reading Notes:

8.1 Information retrieval system evaluation
In this chapter, the author discusses how to measure the effectiveness of IR systems.
8.2 Standard test collections
Here is a list of the standard test collections:
The Cranfield collection: precise relevance judgments, but too small
Text Retrieval Conference (TREC): information needs ("topics") specified in detailed text passages; the largest of these collections, with relatively consistent topics
GOV2: the largest Web collection easily available for research purposes
NTCIR: focused on East Asian languages and cross-language information retrieval
CLEF: concentrated on European languages and cross-language information retrieval
REUTERS: its scale and rich annotation make it a better basis for future research
20 NEWSGROUPS: consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category)
8.3 Evaluation of unranked retrieval sets
precision and recall
A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:
F = ((β² + 1) · P · R) / (β² · P + R), where β² = (1 − α)/α; with β = 1 this is the balanced F1 = 2PR / (P + R).
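Precision, recall, and the F measure for an unranked result set can be sketched as below; the example document-id sets are invented for illustration.

```python
# Sketch: precision, recall, and the weighted F measure (harmonic mean)
# for unranked retrieval sets. The example sets are made up.

def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                      # relevant docs retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision == 0 and recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# 2 of 4 retrieved are relevant -> P = 0.5; 2 of 3 relevant found -> R = 2/3
p, r, f1 = precision_recall_f({1, 2, 3, 4}, {2, 4, 5})
```

Setting beta below 1 weights precision more heavily; beta above 1 favors recall.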
8.4 Evaluation of ranked retrieval results
Entire precision-recall curve
The traditional way to summarize the entire precision-recall curve is 11-point interpolated average precision.
Other measures have become more common, notably Mean Average Precision (MAP).
Another approach is measuring precision at fixed low numbers of retrieved results, such as 10 or 30 documents; this is referred to as "precision at k" (P@k).
R-PRECISION: precision at rank R, where R is the number of relevant documents for the query
BREAK-EVEN POINT: the point where precision equals recall
ROC CURVE: plots sensitivity (true positive rate) against 1 − specificity
SENSITIVITY: another name for recall
SPECIFICITY: the fraction of nonrelevant documents that are correctly not retrieved
NDCG: normalized discounted cumulative gain, used with graded relevance judgments
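A few of the ranked-retrieval measures above (precision at k, the per-query average precision that MAP averages, and NDCG) can be sketched as follows; the ranked lists and relevance judgments are invented for illustration.

```python
# Sketch of three ranked-retrieval measures. Document ids and judgments
# are invented; a real evaluation would use a test collection's qrels.
import math

def precision_at_k(ranked, relevant, k):
    """Precision over the top k results of a ranked list."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each relevant document;
    relevant documents never retrieved contribute zero. MAP is the mean
    of this value over all queries."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, gains, k):
    """NDCG with graded relevance: DCG normalized by the ideal DCG."""
    def dcg(scores):
        return sum(g / math.log2(i + 1) for i, g in enumerate(scores, start=1))
    actual = dcg([gains.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

ranked, relevant = ['d1', 'd2', 'd3', 'd4'], {'d1', 'd3', 'd5'}
p2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
ndcg = ndcg_at_k(ranked, {'d1': 3, 'd3': 1, 'd5': 2}, 3)
```

Note that `average_precision` divides by the total number of relevant documents, so missing a relevant document pulls the score down even if everything retrieved was relevant.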
8.5 Assessing relevance
This section discusses how to develop reliable and informative test collections.
8.6 A broader perspective: System quality and user utility
This section covers other system aspects that allow quantitative evaluation, as well as the issue of user utility.
System issues: all the criteria apart from query language expressiveness are straightforwardly measurable; we can quantify speed or size.
User utility: quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system
Refining a deployed system
The most common version of this is A/B testing. The basis of A/B testing is running a series of single-variable tests.
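One way a single-variable A/B test might be judged is with a two-proportion z-test on click-through counts from the two arms; the sketch below uses invented counts, and a real deployment would log clicks and impressions per arm.

```python
# Sketch: comparing click-through rates of a control arm (A) and a test
# arm (B) with a two-proportion z-test. The counts are invented.
from math import sqrt, erf

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two_sided_p) comparing click-through rates of two arms."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)        # pooled CTR under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal tail
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(clicks_a=120, n_a=2400, clicks_b=156, n_b=2400)
```

With these invented numbers the test arm's higher click-through rate (6.5% vs 5.0%) is significant at the 5% level; smaller samples would usually not be.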
8.7 Results snippets
static and dynamic
A static summary generally comprises a subset of the document, metadata associated with the document, or both.
Dynamic summaries display one or more “windows” on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need.
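A minimal sketch of how such a window might be chosen: slide a fixed-size window over the document and keep the one containing the most query terms. The function and parameter names here are illustrative, not from the chapter, and real snippet generation would respect sentence boundaries and highlighting.

```python
# Sketch of a dynamic summary: pick the fixed-size word window that
# contains the most query terms. Names are illustrative.

def best_window(doc_text, query_terms, window=10):
    """Return the window of `window` words with the most query-term hits."""
    words = doc_text.split()
    query = {t.lower() for t in query_terms}
    best_score, best_start = -1, 0
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        score = sum(1 for w in span if w.lower().strip('.,') in query)
        if score > best_score:
            best_score, best_start = score, start
    return ' '.join(words[best_start:best_start + window])

snippet = best_window(
    "Evaluation of retrieval systems uses precision and recall. "
    "Dynamic summaries show query terms in context near each other.",
    ["dynamic", "summaries", "query"])
```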
