Blog for Class IS2140: January 2014

4.Index construction

4.1 Hardware basics

Catching, seek time, buffer increase the efficiency and decrease the time.

4.2 Block sort-based indexing

It describes methods for large collections that require the use of secondary storage.

Important definitions: termID, reuters-RCV1, external sorting algorithm (increasing the memory sufficiency) blocked sorting based index algorithm, inversion (sorting, collecting) posting

4.3 Single-pass in-memory indexing

It uses term instead of termID, repeating on the token stream until the entire collection has been processed, saving time but wasting memory.

4.4 Distributer indexing

Mapreduce:divide the work up into chunk

4.5 Dynamic indexing

Problem: the new terms need to be added to the dictionary, and posting list need to be updated for existing terms

Solution: periodically reconstruct the index from scratch

Problem: there is a requirement that new document should be included quickly

Solution: to maintain two indexes: a large main and a small auxiliary index that stores new documents.

4.6 Other types of indexes

5.Index compression

It employs a number of compression techniques for dictionary and inverted indexes that are essential for efficient IR system.

5.1 Statistical properties of terms in the IR

Ÿ rule of 30: rule of 30 states that the 30 most common words account for 30% of the tokens in written text

Ÿ lossless lossy compression

Ÿ Heap’s law: M = kTb

Ÿ Zipf ‘s law: cfi ∝ 1

5.2 Dictionary compression:

Although the dictionary is small compared to the posting files , it can determine the responsible time in IR system.

Ÿ Dictionary as a string: The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries

Ÿ Blocked storage: We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block

5.3 Posting file compression

5.3.1 Variable byte codes

Variable byte (VB) encoding uses an integral number of bytes to encode a gap. For most IR systems variable byte codes offer an excellent tradeoff between time and space.

5.3.2 γ codes

Do not understand.

Blog for Class IS2140

Thursday, January 30, 2014

Unit4 Muddiest Point and Reading Notes

Wednesday, January 22, 2014

Unit3 Reading notes.

4.Index construction

5.Index compression

Thursday, January 16, 2014

Unit 2 muddiest point and Unit 1 reading notes

Friday, January 10, 2014

Unit1 Muddiest point, Unit2 Reading Notes