4.Index construction
4.1 Hardware basics
Catching, seek time, buffer
increase the efficiency and decrease the time.
4.2 Block sort-based indexing
It describes methods for large
collections that require the use of secondary storage.
Important definitions: termID, reuters-RCV1, external
sorting algorithm (increasing the memory sufficiency) blocked sorting based
index algorithm, inversion (sorting, collecting) posting
4.3 Single-pass in-memory
indexing
It uses term instead of termID,
repeating on the token stream until the entire collection has been processed,
saving time but wasting memory.
4.4 Distributer indexing
Mapreduce:divide the work up into
chunk
4.5 Dynamic indexing
Problem: the new terms need to be
added to the dictionary, and posting list need to be updated for existing terms
Solution: periodically
reconstruct the index from scratch
Problem: there is a requirement that
new document should be included quickly
Solution: to maintain two indexes: a large
main and a small auxiliary index that stores new documents.
4.6 Other types of indexes
5.Index compression
It employs a number
of compression techniques for dictionary and inverted indexes that are
essential for efficient IR system.
5.1 Statistical properties of
terms in the IR
rule of 30: rule of 30 states
that the 30 most common words account for 30% of the tokens in written text
lossless lossy compression
Heap’s law: M = kTb
Zipf ‘s law: cfi ∝ 1
5.2 Dictionary compression:
Although the dictionary is small compared to
the posting files , it can determine the responsible time in IR system.
Dictionary
as a string: The simplest data structure for the dictionary
is to sort the vocabulary lexicographically and store it in an array of
fixed-width entries
Blocked storage: We
can further compress the dictionary by grouping terms in the string into blocks
of size k and keeping a term pointer only for the first term of each
block
5.3 Posting file compression
5.3.1 Variable byte codes
Variable byte (VB) encoding
uses an integral number of bytes to encode a gap. For most IR systems variable
byte codes offer an excellent tradeoff between time and space.
5.3.2 γ codes
Do not understand.