Friday, February 7, 2014

Unit 5 Muddiest Point and Reading Notes

Muddiest Point:
1. Why is document ranking related to the angle between the query vector and the document vector?

Reading Notes
11.2 The Probability Ranking Principle
11.2.1 The 1/0 loss case
    The 1/0 loss case is the simplest case of the PRP: you lose a point either for returning a nonrelevant document or for failing to return a relevant document (such a binary situation, where you are evaluated on your accuracy, is called 1/0 loss).
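The 1/0 loss above can be sketched as a simple count of mistakes. The document IDs and relevance sets below are hypothetical, just for illustration:

```python
# Sketch of 1/0 loss for a binary retrieval decision: one point of loss
# per nonrelevant document returned (false positive) plus one per
# relevant document missed (false negative). Data here is made up.

def zero_one_loss(retrieved, relevant):
    """Count mistakes under 1/0 loss."""
    retrieved, relevant = set(retrieved), set(relevant)
    false_positives = len(retrieved - relevant)
    false_negatives = len(relevant - retrieved)
    return false_positives + false_negatives

loss = zero_one_loss(retrieved={"d1", "d2"}, relevant={"d2", "d3"})
print(loss)  # d1 is a false positive, d3 is a false negative -> 2
```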
11.2.2 The PRP with retrieval costs
Let C1 be the cost of not retrieving a relevant document and C0 the cost of retrieving a nonrelevant document. The Probability Ranking Principle then says that if, for a specific document d and for all documents d′ not yet retrieved,
C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)
then d is the next document to be retrieved.
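This amounts to ranking documents by expected cost. A minimal sketch, where the cost values and the P(R = 1|d) estimates are invented for illustration:

```python
# Rank by expected cost per the PRP with retrieval costs:
# the next document to retrieve minimizes C0*P(R=0|d) - C1*P(R=1|d).
# C0, C1 and the probabilities below are illustrative assumptions.

C0, C1 = 1.0, 3.0                               # cost of a false hit / a miss

p_relevant = {"d1": 0.8, "d2": 0.3, "d3": 0.6}  # hypothetical P(R=1|d)

def expected_cost(d):
    p = p_relevant[d]
    return C0 * (1 - p) - C1 * p

# Lowest expected cost first:
ranking = sorted(p_relevant, key=expected_cost)
print(ranking)  # -> ['d1', 'd3', 'd2']
```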
11.3 The Binary Independence Model
11.3.1 Deriving a ranking function for query terms


The ct terms are log odds ratios for the terms in the query. We have the odds of the term appearing if the document is relevant, pt/(1 − pt), and the odds of the term appearing if the document is nonrelevant, ut/(1 − ut).
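Putting those two odds together, the term weight is ct = log[pt/(1 − pt)] + log[(1 − ut)/ut]. A small sketch with illustrative (not book-derived) values of pt and ut:

```python
import math

# BIM term weight c_t: the log odds ratio comparing how likely a query
# term is in relevant vs. nonrelevant documents. p_t and u_t below are
# illustrative estimates.

def c_t(p_t, u_t):
    return math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)

# A term much more likely in relevant documents gets a large positive weight:
print(c_t(p_t=0.6, u_t=0.1))   # log(1.5) + log(9) = log(13.5)
# A term equally likely in both gets weight 0:
print(c_t(p_t=0.5, u_t=0.5))   # -> 0.0
```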

11.3.2 Probability estimates in theory

11.3.3 Probability estimates in practice
11.3.4 Probabilistic approaches to relevance feedback
1. Assume initial estimates for pt and ut as above.
2. Determine a guess for the size of the relevant document set. If unsure, a conservative (too small) guess is likely to be best. This motivates use of a fixed-size set V of highest-ranked documents.
3. Improve our guesses for pt and ut. We choose from the methods of Equations (11.23) and (11.25) for re-estimating pt, except now based on the set V instead of VR. If we let Vt be the subset of documents in V containing xt and use add-1/2 smoothing, we get:

pt = (|Vt| + 1/2) / (|V| + 1)

and if we assume that documents that are not retrieved are nonrelevant, then we can update our ut estimates as:

ut = (dft − |Vt| + 1/2) / (N − |V| + 1)

4. Go to step 2 until the ranking of the returned results converges.
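The loop above can be sketched end to end. The toy corpus, query, and choice of |V| = 2 are assumptions for illustration; the re-estimates use the add-1/2 smoothed forms pt = (|Vt| + 1/2)/(|V| + 1) and ut = (dft − |Vt| + 1/2)/(N − |V| + 1):

```python
import math

# Pseudo-relevance feedback sketch: score by the BIM log-odds weights,
# treat the top-|V| results as relevant, re-estimate p_t and u_t, and
# repeat until the ranking stops changing. Corpus and query are toy data.

docs = {
    "d1": {"gold", "silver", "truck"},
    "d2": {"shipment", "gold", "fire"},
    "d3": {"delivery", "silver", "arrived"},
    "d4": {"gold", "truck", "arrived"},
}
query = {"gold", "silver"}
N = len(docs)
df = {t: sum(t in d for d in docs.values()) for t in query}

p = {t: 0.5 for t in query}        # step 1: initial p_t
u = {t: df[t] / N for t in query}  # step 1: initial u_t

def rsv(terms):
    """Retrieval status value: sum of c_t over query terms in the doc."""
    return sum(
        math.log(p[t] / (1 - p[t])) + math.log((1 - u[t]) / u[t])
        for t in query if t in terms
    )

V_size = 2                          # step 2: fixed-size guess for |V|
prev = None
for _ in range(10):                 # step 4: loop until convergence
    ranking = sorted(docs, key=lambda d: rsv(docs[d]), reverse=True)
    if ranking == prev:
        break
    prev = ranking
    V = ranking[:V_size]            # top-|V| documents assumed relevant
    for t in query:                 # step 3: add-1/2 smoothed re-estimates
        Vt = sum(t in docs[d] for d in V)
        p[t] = (Vt + 0.5) / (V_size + 1)
        u[t] = (df[t] - Vt + 0.5) / (N - V_size + 1)

print(prev)  # converged ranking
```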
11.4 An appraisal and some extensions
12 Language models for information retrieval
12.1 Language models
   It introduces several basic concepts of language models:
(1) finite automata and language models
Σt∈V P(t) = 1 (the probabilities assigned to the terms of the vocabulary sum to 1)
 (2) unigram language model:
Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
(3) Bigram language models, which condition on the previous term,
Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
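The unigram and bigram models can be sketched with MLE counts over a toy text (the corpus below is an assumption, not from the book):

```python
from collections import Counter

# Unigram and bigram language models estimated by MLE from a toy corpus.
# P_uni multiplies per-term probabilities; P_bi conditions each term on
# the previous one.

tokens = "the quick fox saw the lazy fox".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def p_uni(t):
    return unigrams[t] / total

def p_bi(t, prev):
    return bigrams[(prev, t)] / unigrams[prev]

def seq_prob_uni(seq):
    prob = 1.0
    for t in seq:
        prob *= p_uni(t)
    return prob

def seq_prob_bi(seq):
    prob = p_uni(seq[0])
    for prev, t in zip(seq, seq[1:]):
        prob *= p_bi(t, prev)
    return prob

print(seq_prob_uni(["the", "fox"]))   # (2/7) * (2/7)
print(seq_prob_bi(["lazy", "fox"]))   # P(lazy) * P(fox|lazy) = (1/7) * 1
```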

12.2 The query likelihood model
It describes the basic and most commonly used language modeling approach to IR.
(1)  query likelihood model
P(q|Md) = Kq ∏t∈V P(t|Md)^tft,d
The approach is to
1. Infer a LM for each document.

2. Estimate P(q|Mdi), the probability of generating the query according to each of these document models.
3. Rank the documents according to these probabilities.
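The three steps above can be sketched as follows. The documents and query are toy assumptions, and no smoothing is applied yet, so a query term absent from a document zeroes out that document's score:

```python
from collections import Counter

# Query likelihood sketch: infer an MLE model M_d per document, score
# P(q|M_d) as a product over query terms, and rank by that score.
# Toy data; unsmoothed, so unseen query terms give probability 0.

docs = {
    "d1": "xyzzy info info retrieval retrieval retrieval".split(),
    "d2": "information retrieval is fun".split(),
    "d3": "click here for cat pictures".split(),
}
query = ["retrieval", "fun"]

def query_likelihood(doc_tokens):
    model = Counter(doc_tokens)      # step 1: infer M_d by MLE
    length = len(doc_tokens)
    score = 1.0
    for t in query:                  # step 2: P(q|M_d) = prod of P(t|M_d)
        score *= model[t] / length
    return score

# step 3: rank by the query likelihood
ranking = sorted(docs, key=lambda d: query_likelihood(docs[d]), reverse=True)
print(ranking)  # d2 contains both query terms, so it ranks first
```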
(2)  Estimating the query generation probability
In both cases the probability estimate for a word present in the document combines a discounted MLE and a fraction of the estimate of its prevalence in the whole collection, while for words not present in a document, the estimate is just a fraction of the estimate of the prevalence of the word in the whole collection.
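One common way to realize that mixture is linear (Jelinek-Mercer) interpolation between the document model and the collection model. The lambda value and the counts below are illustrative choices, not values from the text:

```python
# Smoothed term probability as a mixture of a discounted document MLE
# and a fraction of the collection-wide estimate. lam and all counts
# are illustrative assumptions.

lam = 0.5

def p_smoothed(tf_td, doc_len, cf_t, collection_len):
    """lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c)."""
    return lam * (tf_td / doc_len) + (1 - lam) * (cf_t / collection_len)

# A term present in the document mixes both components:
print(p_smoothed(tf_td=2, doc_len=10, cf_t=30, collection_len=1000))  # 0.115
# A term absent from the document falls back on the collection fraction:
print(p_smoothed(tf_td=0, doc_len=10, cf_t=30, collection_len=1000))  # 0.015
```

This is why smoothing matters for query likelihood ranking: without the collection component, a single unseen query term would zero out a document's score.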

(3)  Ponte and Croft’s Experiments
Ponte and Croft argued strongly for the effectiveness of the term weights that come from the language modeling approach over traditional tf-idf weights.
12.3 Language modeling versus other approaches in IR
It introduces some comparisons between the language modeling approach and other approaches to IR
The approaches that treat queries and documents as objects of the same type are also among the most successful. On the other hand, as with all IR models, objections can be raised against it. The model also has significant relations to traditional tf-idf models.
12.4 Extended language modeling approaches
In this section we briefly mention some of the work that extends the basic language modeling approach.








