Wednesday, January 22, 2014

Unit3 Reading notes.

4.Index construction

4.1 Hardware basics
Catching, seek time, buffer increase the efficiency and decrease the time.
4.2 Block sort-based indexing
It describes methods for large collections that require the use of secondary storage.
Important definitions: termID, reuters-RCV1, external sorting algorithm (increasing the memory sufficiency) blocked sorting based index algorithm, inversion (sorting, collecting) posting
4.3 Single-pass in-memory indexing
It uses term instead of termID, repeating on the token stream until the entire collection has been processed, saving time but wasting memory.
4.4 Distributer indexing
Mapreduce:divide the work up into chunk
           4.5 Dynamic indexing
Problem: the new terms need to be added to the dictionary, and posting list need to be updated for existing terms
Solution: periodically reconstruct the index from scratch
Problem: there is a requirement that new document should be included quickly
     Solution: to maintain two indexes: a large main and a small auxiliary index that stores new documents.
4.6 Other types of indexes
  

5.Index compression

  It employs a number of compression techniques for dictionary and inverted indexes that are essential for efficient IR system.
5.1 Statistical properties of terms in the IR
    Ÿ rule of 30: rule of 30 states that the 30 most common words account for 30% of the tokens in written text
  Ÿ  lossless lossy compression                               

  Ÿ  Heap’s law: M = kTb
Ÿ  Zipf ‘s law: cfi 1
5.2 Dictionary compression:
  Although the dictionary is small compared to the posting files , it can determine the responsible time in IR system.
Ÿ Dictionary as a string: The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries
  Ÿ  Blocked storage: We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block
 5.3 Posting file compression
 5.3.1 Variable byte codes
Variable byte (VB) encoding uses an integral number of bytes to encode a gap. For most IR systems variable byte codes offer an excellent tradeoff between time and space.
  5.3.2 γ codes
   Do not understand.







Thursday, January 16, 2014

Unit 2 muddiest point and Unit 1 reading notes

Unit 2 
Muddiest point:
1.     Why does the stemming never lower recall?
2.     In what condition does the WSD not work?

Friday, January 10, 2014

Unit1 Muddiest point, Unit2 Reading Notes

Unit 1 : Muddiest point:

I am a little confused about the Whole View of System Oriented IR”,what is the role of the index processing ,and why it must have , I think retrieval and ranking process with queries is enough.