Muddiest Point:
1. Why is the ranking of a document related to the angle between the query vector and the document vector?
11.2 The Probability Ranking Principle
11.2.1 The 1/0 loss case
The 1/0 loss case is the simplest case of the PRP. You lose a point for either returning a non-relevant document or failing to return a relevant document (such a binary situation where you are evaluated on your accuracy is called 1/0 loss).
11.2.2 The PRP with retrieval costs
Let C1 be the cost of not retrieving a relevant document and C0 the cost of retrieving a non-relevant document. Then the Probability Ranking Principle says that if, for a specific document d and for all documents d′ not yet retrieved,

C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)

then d is the next document to be retrieved.
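A minimal sketch of this criterion in Python (the costs and relevance probabilities are made up for illustration): rank documents by their differential expected cost, lowest first.

```python
# Sketch of cost-based PRP ranking (all numbers are made up for illustration).
# Rank by the differential expected cost C0*P(R=0|d) - C1*P(R=1|d), ascending:
# the document with the lowest expected cost is retrieved first.

C1 = 2.0  # assumed cost of not retrieving a relevant document
C0 = 1.0  # assumed cost of retrieving a non-relevant document

docs = {"d1": 0.9, "d2": 0.4, "d3": 0.7}  # hypothetical P(R=1|d) estimates

def expected_cost(p_rel):
    return C0 * (1 - p_rel) - C1 * p_rel

ranking = sorted(docs, key=lambda d: expected_cost(docs[d]))
print(ranking)
```

Since the expected cost decreases as P(R = 1|d) grows, this reduces to ranking by probability of relevance, which is the plain PRP.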
11.3 The Binary Independence Model
11.3.1 Deriving a ranking function for query terms
The ct terms are log odds ratios for the terms in the query. We have the odds of the term appearing if the document is relevant (pt/(1 − pt)) and the odds of the term appearing if the document is non-relevant (ut/(1 − ut)):

ct = log [pt(1 − ut)] / [ut(1 − pt)]
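The log odds ratio ct = log[pt(1 − ut)/(ut(1 − pt))] can be sketched as follows, with made-up pt and ut estimates; a document's retrieval status value (RSV) is the sum of ct over the query terms it contains.

```python
import math

# Sketch of the Binary Independence Model term weight c_t, with made-up
# estimates p_t = P(term present | relevant), u_t = P(term present | non-relevant).

def c_t(p, u):
    # log odds ratio: odds if relevant, p/(1-p), divided by odds if non-relevant, u/(1-u)
    return math.log((p * (1 - u)) / (u * (1 - p)))

p = {"probabilistic": 0.6, "retrieval": 0.5}  # assumed p_t values
u = {"probabilistic": 0.1, "retrieval": 0.3}  # assumed u_t values

# A document's RSV is the sum of c_t over the query terms it contains.
doc_terms = {"probabilistic", "retrieval", "model"}
rsv = sum(c_t(p[t], u[t]) for t in p if t in doc_terms)
print(round(rsv, 3))
```

A term contributes positively exactly when pt > ut, i.e. when it is more likely in relevant documents than in non-relevant ones.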
11.3.2 Probability estimates in theory
11.3.3 Probability estimates in practice
11.3.4 Probabilistic approaches to relevance feedback
1. Assume initial estimates for pt and ut as above.
2. Determine a guess for the size of the relevant document set. If unsure, a conservative (too small) guess is likely to be best. This motivates use of a fixed size set V of highest ranked documents.
3. Improve our guesses for pt and ut. We choose from the methods of Equations (11.23) and (11.25) for re-estimating pt, except now based on the set V instead of VR. If we let Vt be the subset of documents in V containing xt and use add ½ smoothing, we get:

pt = (|Vt| + ½) / (|V| + 1)

and if we assume that documents that are not retrieved are non-relevant, then we can update our ut estimates as:

ut = (dft − |Vt|) / (N − |V|)

4. Go to step 2 until the ranking of the returned results converges.
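The loop above can be sketched on a toy collection. Everything here is made up for illustration (documents, query, and the set size k), and the unsmoothed ut update assumes every query term also occurs in documents outside V.

```python
import math

# Toy sketch of the pseudo relevance feedback loop above. The collection,
# query, and set size k are made up; the unsmoothed u_t update assumes every
# query term also occurs in documents outside V.

docs = {
    "d1": {"apple", "fruit"}, "d2": {"apple", "pie"}, "d3": {"fruit"},
    "d4": {"pie"}, "d5": {"apple", "fruit", "pie"},
    "d6": {"car"}, "d7": {"car"}, "d8": {"bike"}, "d9": {"bike"}, "d10": {"car"},
}
N = len(docs)
query = ["apple", "fruit"]
df = {t: sum(t in terms for terms in docs.values()) for t in query}

def rank(p, u):
    def rsv(terms):  # sum of log odds ratios c_t over query terms present
        return sum(math.log((p[t] * (1 - u[t])) / (u[t] * (1 - p[t])))
                   for t in query if t in terms)
    return sorted(docs, key=lambda d: rsv(docs[d]), reverse=True)

# Step 1: initial estimates p_t = 0.5, u_t = df_t / N.
p = {t: 0.5 for t in query}
u = {t: df[t] / N for t in query}

k = 2  # Step 2: fixed-size set V of highest ranked documents
ranking = rank(p, u)
while True:
    V = ranking[:k]
    Vt = {t: sum(t in docs[d] for d in V) for t in query}
    # Step 3: re-estimate p_t with add-1/2 smoothing, u_t from the unretrieved rest.
    p = {t: (Vt[t] + 0.5) / (k + 1) for t in query}
    u = {t: (df[t] - Vt[t]) / (N - k) for t in query}
    new_ranking = rank(p, u)
    if new_ranking == ranking:  # Step 4: stop once the ranking converges
        break
    ranking = new_ranking
print(ranking[:4])
```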
11.4 An appraisal and some extensions
12 Language models for information retrieval
12.1 Language models
This section introduces several kinds of language models:
(1) Finite automata and language models, where the probabilities assigned to all terms in the vocabulary sum to one:

∑t∈V P(t) = 1
(2) Unigram language model:

Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
(3) Bigram language model, which conditions on the previous term:

Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
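Both models can be estimated by maximum likelihood from a tiny made-up corpus and used to score a four-term sequence as in the two equations above:

```python
from collections import Counter

# MLE estimation of the unigram and bigram models from a tiny made-up corpus,
# then scoring a four-term sequence as in the two equations above.
corpus = "the cat sat on the mat the cat ran".split()

uni = Counter(corpus)
total = len(corpus)
bi = Counter(zip(corpus, corpus[1:]))

def p_uni(t):
    return uni[t] / total

def p_bi(t, prev):
    # conditional MLE P(t | prev); unsmoothed, so unseen bigrams get probability 0
    return bi[(prev, t)] / uni[prev]

t1, t2, t3, t4 = "the cat sat on".split()
p_unigram = p_uni(t1) * p_uni(t2) * p_uni(t3) * p_uni(t4)
p_bigram = p_uni(t1) * p_bi(t2, t1) * p_bi(t3, t2) * p_bi(t4, t3)
print(p_unigram, p_bigram)
```

The bigram score is much higher here because the sequence follows word order seen in the corpus, which the unigram model ignores.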
12.2 The query likelihood model
This section describes the basic and most commonly used language modeling approach to IR.
(1) The query likelihood model:

P(q|Md) = Kq ∏t∈V P(t|Md)^tft,d
The approach is to:
1. Infer a LM for each document.
2. Estimate P(q|Mdi), the probability of generating the query according to each of these document models.
3. Rank the documents according to these probabilities.
(2) Estimating the query generation probability
In both smoothing methods, the probability estimate for a word present in the document combines a discounted MLE and a fraction of the estimate of its prevalence in the whole collection, while for words not present in a document, the estimate is just a fraction of the estimate of the prevalence of the word in the whole collection.
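A sketch of query-likelihood scoring with linear interpolation smoothing, P(t|d) = λ · tft,d/Ld + (1 − λ) · cft/T (the documents, query, and λ are made up for illustration):

```python
import math

# Sketch of query-likelihood ranking with linear interpolation smoothing:
# P(t|d) = lam * tf_{t,d}/L_d + (1 - lam) * cf_t/T. Documents, query, and
# lam are made up for illustration.

docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "click metal with glittering shears".split(),
}
lam = 0.5

collection = [t for toks in docs.values() for t in toks]
T = len(collection)
cf = {t: collection.count(t) for t in set(collection)}

def log_p_query(query, toks):
    score = 0.0
    for t in query:
        p_doc = toks.count(t) / len(toks)  # MLE from the document, discounted by lam
        p_col = cf.get(t, 0) / T           # prevalence of t in the whole collection
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

query = ["shears", "click"]
ranked = sorted(docs, key=lambda d: log_p_query(query, docs[d]), reverse=True)
print(ranked)
```

Words absent from a document still get a nonzero probability through the collection term, which is what keeps the product (here, the sum of logs) well defined.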
(3) Ponte and Croft’s Experiments
Ponte and Croft argued
strongly for the effectiveness of the term weights that come from the language
modeling approach over traditional tf-idf weights.
12.3 Language modeling versus other approaches in IR
This section compares the language modeling approach with other approaches to IR. The approaches that assume queries and documents are objects of the same type are also among the most successful. On the other hand, as with all IR models, one can raise objections to this model. The model nonetheless has significant relations to traditional tf-idf models.
12.4 Extended language modeling approaches
In this section we briefly
mention some of the work that extends the basic language modeling approach.