Saturday, April 5, 2014

Muddiest Point

Muddiest Point:
   I am not quite understand the  Darwish's Probabilistic structured query.

Friday, February 14, 2014

Friday, February 7, 2014

Unit5 Muddiest point and Reading notes

Muddiest Point:
1. Why the rank document related with the angle between query and document

Wednesday, January 22, 2014

Unit3 Reading notes.

4.Index construction

4.1 Hardware basics
Catching, seek time, buffer increase the efficiency and decrease the time.
4.2 Block sort-based indexing
It describes methods for large collections that require the use of secondary storage.
Important definitions: termID, reuters-RCV1, external sorting algorithm (increasing the memory sufficiency) blocked sorting based index algorithm, inversion (sorting, collecting) posting
4.3 Single-pass in-memory indexing
It uses term instead of termID, repeating on the token stream until the entire collection has been processed, saving time but wasting memory.
4.4 Distributer indexing
Mapreduce:divide the work up into chunk
           4.5 Dynamic indexing
Problem: the new terms need to be added to the dictionary, and posting list need to be updated for existing terms
Solution: periodically reconstruct the index from scratch
Problem: there is a requirement that new document should be included quickly
     Solution: to maintain two indexes: a large main and a small auxiliary index that stores new documents.
4.6 Other types of indexes
  

5.Index compression

  It employs a number of compression techniques for dictionary and inverted indexes that are essential for efficient IR system.
5.1 Statistical properties of terms in the IR
    Ÿ rule of 30: rule of 30 states that the 30 most common words account for 30% of the tokens in written text
  Ÿ  lossless lossy compression                               

  Ÿ  Heap’s law: M = kTb
Ÿ  Zipf ‘s law: cfi 1
5.2 Dictionary compression:
  Although the dictionary is small compared to the posting files , it can determine the responsible time in IR system.
Ÿ Dictionary as a string: The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries
  Ÿ  Blocked storage: We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block
 5.3 Posting file compression
 5.3.1 Variable byte codes
Variable byte (VB) encoding uses an integral number of bytes to encode a gap. For most IR systems variable byte codes offer an excellent tradeoff between time and space.
  5.3.2 γ codes
   Do not understand.







Thursday, January 16, 2014

Unit 2 muddiest point and Unit 1 reading notes

Unit 2 
Muddiest point:
1.     Why does the stemming never lower recall?
2.     In what condition does the WSD not work?

Friday, January 10, 2014

Unit1 Muddiest point, Unit2 Reading Notes

Unit 1 : Muddiest point:

I am a little confused about the Whole View of System Oriented IR”,what is the role of the index processing ,and why it must have , I think retrieval and ranking process with queries is enough.