Blog for Class IS2140: 2014

Wednesday, January 22, 2014

Unit3 Reading notes.

4.Index construction

4.1 Hardware basics

Catching, seek time, buffer increase the efficiency and decrease the time.

4.2 Block sort-based indexing

It describes methods for large collections that require the use of secondary storage.

Important definitions: termID, reuters-RCV1, external sorting algorithm (increasing the memory sufficiency) blocked sorting based index algorithm, inversion (sorting, collecting) posting

4.3 Single-pass in-memory indexing

It uses term instead of termID, repeating on the token stream until the entire collection has been processed, saving time but wasting memory.

4.4 Distributer indexing

Mapreduce:divide the work up into chunk

4.5 Dynamic indexing

Problem: the new terms need to be added to the dictionary, and posting list need to be updated for existing terms

Solution: periodically reconstruct the index from scratch

Problem: there is a requirement that new document should be included quickly

Solution: to maintain two indexes: a large main and a small auxiliary index that stores new documents.

4.6 Other types of indexes

5.Index compression

It employs a number of compression techniques for dictionary and inverted indexes that are essential for efficient IR system.

5.1 Statistical properties of terms in the IR

Ÿ rule of 30: rule of 30 states that the 30 most common words account for 30% of the tokens in written text

Ÿ lossless lossy compression

Ÿ Heap’s law: M = kTb

Ÿ Zipf ‘s law: cfi ∝ 1

5.2 Dictionary compression:

Although the dictionary is small compared to the posting files , it can determine the responsible time in IR system.

Ÿ Dictionary as a string: The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries

Ÿ Blocked storage: We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block

5.3 Posting file compression

5.3.1 Variable byte codes

Variable byte (VB) encoding uses an integral number of bytes to encode a gap. For most IR systems variable byte codes offer an excellent tradeoff between time and space.

5.3.2 γ codes

Do not understand.

Friday, January 10, 2014

Unit1 Muddiest point, Unit2 Reading Notes

Unit 1 : Muddiest point:

I am a little confused about the “Whole View of System Oriented IR”，what is the role of the index processing ,and why it must have , I think retrieval and ranking process with queries is enough.

Blog for Class IS2140

Saturday, April 5, 2014

Muddiest Point

Friday, March 28, 2014

Unit 10 Muddiest point and Reading notes.

Muddiest point:

I am fine with this class.

Friday, March 21, 2014

Unit 9 Muddiest point and Reading notes

Friday, February 28, 2014

Unit 8 Muddiet Point and Reading Notes

Friday, February 21, 2014

Unit 7 Muddiest point and Reading notes

Friday, February 14, 2014

Unit 6 Muddiest Points and Reading notes

Friday, February 7, 2014

Unit5 Muddiest point and Reading notes

Thursday, January 30, 2014

Unit4 Muddiest Point and Reading Notes