Muddiest Point:
I do not quite understand Darwish's probabilistic structured queries.
Saturday, April 5, 2014
Friday, March 28, 2014
Friday, March 21, 2014
Friday, February 28, 2014
Friday, February 21, 2014
Friday, February 14, 2014
Unit 6 Muddiest Points and Reading notes
Muddiest Point:
I am fine with this class, and I do not have a muddiest point.
Friday, February 7, 2014
Unit 5 Muddiest Point and Reading Notes
Muddiest Point:
1. Why is a document's rank related to the angle between the query and the document?
Thursday, January 30, 2014
Wednesday, January 22, 2014
Unit 3 Reading Notes
4. Index construction
4.1 Hardware basics
Key concepts: caching, seek time, and buffering.
Caching and buffering increase efficiency and reduce the time spent on disk access.
4.2 Blocked sort-based indexing
It describes indexing methods for large collections that require the use of secondary storage.
Important definitions: termID, Reuters-RCV1, external sorting algorithm (makes efficient use of limited memory), blocked sort-based indexing (BSBI) algorithm, inversion (sorting the pairs, then collecting them into postings lists).
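The idea behind blocked sort-based indexing can be sketched roughly as follows. This is only a minimal in-memory illustration, not the book's implementation: the names `invert_block` and `bsbi_index` and the `block_size` parameter are my own assumptions, and the sorted runs are kept in a list where a real system would write them to disk.

```python
import heapq
from itertools import groupby

def invert_block(pairs):
    """Sort one block of (termID, docID) pairs and drop duplicates."""
    return sorted(set(pairs))

def bsbi_index(pair_stream, block_size):
    """Build {termID: [docIDs]} from a stream of (termID, docID) pairs."""
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:
            runs.append(invert_block(block))  # a real system writes this run to disk
            block = []
    if block:
        runs.append(invert_block(block))

    index = {}
    # heapq.merge performs the multi-way merge of the already sorted runs.
    for term_id, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        doc_ids = []
        for _, doc_id in group:
            if not doc_ids or doc_ids[-1] != doc_id:  # skip cross-block duplicates
                doc_ids.append(doc_id)
        index[term_id] = doc_ids
    return index
```

Each block is small enough to sort in memory, and the final merge only needs to look at the head of every run at a time.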
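4.3 Single-pass in-memory indexing
It uses terms instead of termIDs and is called repeatedly on the token stream until the entire collection has been processed. This saves time because no sorting of pairs is needed, though some memory is wasted because postings lists grow dynamically.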
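A rough sketch of the single-pass idea, under my own assumptions: `spimi_invert` is a hypothetical name, and `max_postings` is a stand-in for the "memory is full" check, which a real indexer would base on actual memory use before writing each block to disk.

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings):
    """Yield sorted blocks [(term, [docIDs])] from a stream of (term, docID) tokens."""
    dictionary = defaultdict(list)
    used = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]          # new terms are added on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # postings are appended directly, no pair sorting
            used += 1
        if used >= max_postings:             # assumed stand-in for "memory is full"
            yield sorted(dictionary.items()) # terms are sorted only when the block is written
            dictionary = defaultdict(list)
            used = 0
    if dictionary:
        yield sorted(dictionary.items())
```

The blocks produced this way would still have to be merged into one index afterwards.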
4.4 Distributed indexing
MapReduce: divide the work into chunks that can be processed in parallel.
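To make the map and reduce phases concrete, here is a small single-machine sketch. The function names (`map_phase`, `reduce_phase`, `mapreduce_index`) and the whitespace tokenizer are assumptions for illustration only; in a real cluster the chunks would be handed to different machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Parser: turn one chunk of (docID, text) pairs into (term, docID) pairs."""
    return [(term, doc_id) for doc_id, text in chunk for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter: collect all docIDs for each term into a sorted postings list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def mapreduce_index(docs, chunk_size):
    """Split docs into chunks, map each chunk, then reduce all pairs."""
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    # Each chunk is independent, so this loop is the part that would run in parallel.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(mapped)
```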
4.5 Dynamic indexing
Problem: new terms need to be added to the dictionary, and postings lists need to be updated for existing terms.
Solution: periodically reconstruct the index from scratch.
Problem: new documents must be included quickly.
Solution: maintain two indexes, a large main index and a small auxiliary index that stores new documents.
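The main-plus-auxiliary solution could look something like the sketch below. The class name `DynamicIndex` and the `merge_threshold` parameter are hypothetical, and both indexes are plain dictionaries here, whereas the main index would normally live on disk.

```python
class DynamicIndex:
    """Large main index plus a small auxiliary index for new documents."""

    def __init__(self, merge_threshold):
        self.main = {}                          # term -> sorted list of docIDs
        self.aux = {}                           # same layout, holds only recent documents
        self.merge_threshold = merge_threshold  # assumed: merge after this many new docs
        self.aux_docs = 0

    def add_document(self, doc_id, terms):
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Fold the auxiliary index into the main index and reset it."""
        for term, postings in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, [])) | set(postings))
        self.aux = {}
        self.aux_docs = 0

    def search(self, term):
        """A query must consult both indexes and combine the results."""
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))
```

New documents are searchable right away through the auxiliary index, and the expensive update of the main index happens only at merge time.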
4.6 Other types of indexes
5. Index compression
This chapter covers a number of compression techniques for the dictionary and the inverted index that are essential for efficient IR systems.
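5.1 Statistical properties of terms in IR
Rule of 30: the 30 most common words account for about 30% of the tokens in written text.
Lossless vs. lossy compression
Heaps' law: M = kT^b, where M is the vocabulary size (number of distinct terms) and T is the number of tokens in the collection.
Zipf's law: cf_i ∝ 1/i, where cf_i is the collection frequency of the i-th most common term.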
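As a quick numeric illustration of the two laws, here is a small sketch. The function names are my own, and the default values k = 44 and b = 0.49 are the parameters I recall the textbook reporting for Reuters-RCV1, so treat them as assumptions that will differ for other collections.

```python
def heaps_vocabulary_size(num_tokens, k=44, b=0.49):
    """Heaps' law: estimated number of distinct terms M = k * T^b."""
    return k * num_tokens ** b

def zipf_frequency(rank, top_frequency):
    """Zipf's law: the i-th most common term occurs about 1/i as often as the top term."""
    return top_frequency / rank

# With the assumed parameters, about one million tokens give roughly 38,000 distinct terms.
estimated_terms = heaps_vocabulary_size(1_000_020)
```

The point of Heaps' law is that the vocabulary keeps growing with collection size, so dictionary compression stays important for large collections.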
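5.2 Dictionary compression
Although the dictionary is small compared to the postings file, it can determine the response time of an IR system, so it helps to keep it compact enough to fit in main memory.
Dictionary as a string: The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries. Storing the terms as one long string of characters, with a pointer to the start of each term, avoids the space wasted by fixed-width entries.
Blocked storage: We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block. The length of each term is stored in the string as an extra byte at its start.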
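Blocked storage can be sketched as below. The function names `build_blocked_dictionary` and `read_block` are hypothetical, and the length prefix is stored as a single character via `chr`, which is my simplification of the one-byte length field and only works for reasonably short terms.

```python
def build_blocked_dictionary(terms, k):
    """Pack sorted terms into one string; keep a pointer only per block of k terms."""
    chunks, pointers, offset = [], [], 0
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(offset)      # pointer to the first term of each block
        entry = chr(len(term)) + term    # length prefix marks where the term ends
        chunks.append(entry)
        offset += len(entry)
    return "".join(chunks), pointers

def read_block(string, pointers, block_no, k):
    """Decode the (up to) k terms of one block by walking the length prefixes."""
    pos, terms = pointers[block_no], []
    while len(terms) < k and pos < len(string):
        length = ord(string[pos])
        terms.append(string[pos + 1:pos + 1 + length])
        pos += 1 + length
    return terms
```

A lookup would binary-search over the block pointers and then scan linearly inside one block, so a larger k saves more pointer space but makes lookups slower.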
5.3 Posting file compression
5.3.1 Variable byte codes
Variable byte (VB) encoding
uses an integral number of bytes to encode a gap. For most IR systems variable
byte codes offer an excellent tradeoff between time and space.
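A minimal sketch of variable byte encoding, assuming the usual convention in which the low 7 bits of each byte carry payload and the high bit is set only on the last byte of a gap. The function names are my own.

```python
def vb_encode_number(n):
    """Encode one non-negative gap as a sequence of bytes."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)   # take the lowest 7 bits, most significant chunk first
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128               # set the high bit on the last byte as a terminator
    return bytes(chunks)

def vb_encode(gaps):
    """Encode a list of gaps as one byte string."""
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    """Decode a VB byte string back into the list of gaps."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte                      # more bytes of this gap follow
        else:
            numbers.append(128 * n + (byte - 128))  # last byte of this gap
            n = 0
    return numbers
```

Small gaps take one byte and large gaps take more, and because everything is byte-aligned, decoding is cheap.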
5.3.2 γ codes
I do not understand this part.
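For reference, here is a small sketch of γ encoding as it is usually defined: the offset is the gap in binary with the leading 1 removed, and the length of the offset is written in unary (that many 1s followed by a 0) in front of it. The function names are hypothetical, and bits are kept as a text string of "0" and "1" purely for readability.

```python
def gamma_encode(n):
    """Return the gamma code of a positive integer as a string of bits."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(n)[3:]               # bin(n) is '0b1...'; drop '0b' and the leading 1
    length = "1" * len(offset) + "0"  # unary code for the offset length
    return length + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    numbers, pos = [], 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "1":       # read the unary length
            length += 1
            pos += 1
        pos += 1                      # skip the terminating 0
        offset = bits[pos:pos + length]
        numbers.append(int("1" + offset, 2))  # put the leading 1 back
        pos += length
    return numbers
```

For example, 13 is 1101 in binary, so the offset is 101, its length 3 is 1110 in unary, and the full code is 1110101.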
Thursday, January 16, 2014
Unit 2 muddiest point and Unit 1 reading notes
Unit 2
Muddiest point:
1. Why does stemming never lower recall?
2. Under what conditions does WSD not work?
Friday, January 10, 2014
Unit1 Muddiest point, Unit2 Reading Notes
Unit 1: Muddiest point:
I am a little confused about the “Whole View of System Oriented IR”. What is the role of index processing, and why is it required? I think the retrieval and ranking process with queries is enough.