Unit 1: Muddiest point
I am a little confused about the “Whole View of System-Oriented IR”: what is the role of index processing, and why is it required? I would have thought that retrieval and ranking against the queries would be enough.
Unit 2 Reading Notes
Section 1.2: A first take at building an inverted index
It introduces the major steps in inverted index construction. We should know:
1. How to draw the inverted index built for a given document collection.
2. How to draw the term-document incidence matrix for a given document collection, and how to draw the inverted index representation for the same collection.
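To make the two representations concrete, here is a minimal sketch that builds both for a tiny collection. The three documents, the whitespace tokenizer, and the function names are my own assumptions for illustration, not something taken from the book.

```python
from collections import defaultdict

# Toy collection (made up for illustration); the dict keys are the docIDs.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

def build_inverted_index(docs):
    """Map each term to a sorted list of docIDs (its postings list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization
            postings[term].add(doc_id)
    # Sort the dictionary of terms and each postings list by docID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def build_incidence_matrix(docs):
    """Map each term to a 0/1 row with one column per document."""
    index = build_inverted_index(docs)
    doc_ids = sorted(docs)
    return {term: [1 if d in ids else 0 for d in doc_ids]
            for term, ids in index.items()}

index = build_inverted_index(docs)
matrix = build_incidence_matrix(docs)
print(index["home"])    # postings list for "home"
print(matrix["july"])   # incidence row for "july"
```

The matrix stores a cell for every term-document pair, while the inverted index stores only the documents in which a term actually occurs, which is why the index is the more practical structure for large, sparse collections.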
Chapter 2: The term vocabulary and postings lists
2.1. How the basic unit of a document is defined, and how the character sequence it comprises is determined.
2.2. How to determine the term vocabulary (tokenization, stop words, normalization, stemming and lemmatization).
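A rough sketch of how these steps chain together is below. The stop list is a tiny made-up one, and `crude_stem` is a toy suffix stripper that I wrote for the example; it is not the Porter stemmer or any other real stemming algorithm.

```python
import re

# A tiny, made-up stop list; real systems use longer lists or none at all.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def crude_stem(token):
    """Strip a few common suffixes. NOT the Porter stemmer, just a toy."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def terms(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)            # tokenization
    tokens = [t.lower() for t in tokens]                  # normalization (case folding)
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [crude_stem(t) for t in tokens]                # stemming

print(terms("The cats are chasing the mice in the gardens."))
```

The stemmer only chops characters off the end, so "chasing" becomes "chas". A lemmatizer would instead use a vocabulary and morphological analysis to return dictionary forms such as "chase" and "mouse".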
2.3. It further explores the postings list data structure and how to make intersecting postings lists more efficient (skip pointers, which are easy to build effectively if the index is relatively static).
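The skip-pointer idea can be sketched as follows. I am assuming plain Python lists of sorted docIDs and placing a skip pointer every sqrt(P) entries for a list of length P; the function name and the list-based layout are my own simplifications, not the book's pseudocode.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, using evenly spaced skip pointers."""
    skip1 = max(1, int(math.sqrt(len(p1))))   # skip length for p1
    skip2 = max(1, int(math.sqrt(len(p2))))   # skip length for p2
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer exists only at positions that are multiples of skip1.
            # Follow it only if the target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))
```

A skip is taken only when the entry it lands on is still no larger than the current docID in the other list, so no possible match is jumped over.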
2.4. Biword indexes, positional indexes, and combination schemes (I did not get the point of this section).
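One way to see what these indexes are for is a phrase query, where the words must appear next to each other and not merely in the same document. Below is a minimal sketch of a positional index answering such a query. The two documents and the function names are made up for illustration.

```python
from collections import defaultdict

# Toy collection (made up); both documents contain both query words.
docs = {
    1: "stanford university is in california",
    2: "the university near stanford street",
}

def build_positional_index(docs):
    """term -> {docID: [positions where the term occurs]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return docIDs where the words of `phrase` occur consecutively."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must sit at position start + k.
            if all(start + k in index[w].get(doc_id, [])
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

pos_index = build_positional_index(docs)
print(phrase_query(pos_index, "stanford university"))
```

A plain Boolean AND of the two words would return both documents, but only document 1 contains the phrase. A biword index attacks the same problem differently, by storing every pair of consecutive words as a dictionary term of its own.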
Chapter 3: Dictionaries and tolerant retrieval
3.1. Finding a data structure to help search for terms in the vocabulary of an inverted index (hashing or search trees).
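The trade-off between the two choices shows up quickly in code. In this sketch a Python `set` plays the role of the hash table, and a sorted list searched with `bisect` stands in for a search tree; that substitution, the vocabulary, and the `prefix_range` helper are my own assumptions for the example.

```python
import bisect

vocab = sorted(["moon", "monday", "money", "month", "sickle", "sidney"])
hashed = set(vocab)   # hash-based dictionary: exact lookups only

def prefix_range(sorted_vocab, prefix):
    """All terms starting with `prefix`, found by two binary searches."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    # "\uffff" sorts after ordinary characters, so this marks the end of the range.
    hi = bisect.bisect_left(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi]

print("moon" in hashed)             # exact match works well with hashing
print(prefix_range(vocab, "mon"))   # prefix search needs an ordered structure
```

Hashing answers exact lookups quickly but gives no help with queries like "all terms beginning with mon", because terms that are close alphabetically are scattered by the hash function. An ordered structure keeps them adjacent.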
3.2. An idea about “wildcard queries” (such as *a*e*i*o*u*, which seeks documents containing any term that includes all five vowels in sequence).
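As a sketch of one technique for wildcard queries, here is a small permuterm index. It handles only queries with exactly one `*`; a query with several stars, like the vowel example, would need an extra filtering step that I leave out. The vocabulary and function names are made up, and a sorted list with `bisect` stands in for the tree that would normally hold the rotations.

```python
import bisect

def build_permuterm(vocab):
    """Sorted list of (rotation, term) for every rotation of term + '$'."""
    entries = []
    for term in vocab:
        marked = term + "$"               # '$' marks the end of the term
        for i in range(len(marked)):
            entries.append((marked[i:] + marked[:i], term))
    return sorted(entries)

def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*', such as 'm*n' or 'mo*'."""
    head, tail = query.split("*")         # raises ValueError unless exactly one '*'
    prefix = tail + "$" + head            # rotate so the '*' falls at the end
    lo = bisect.bisect_left(permuterm, (prefix,))
    matches = set()
    for rotation, term in permuterm[lo:]:
        if not rotation.startswith(prefix):
            break                         # rotations with this prefix are contiguous
        matches.add(term)
    return sorted(matches)

permuterm = build_permuterm(["man", "moon", "mean", "moron", "name"])
print(wildcard_lookup(permuterm, "m*n"))
```

Rotating the query turns every single-star wildcard into a prefix search, which an ordered dictionary can answer directly. The cost is a much larger dictionary, since each term is stored once per rotation.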
3.3. Some techniques for handling spelling errors in queries.
Two steps to solving the problem: the first based on edit distance, the second on k-gram overlap.
Two basic principles underlie most spelling correction algorithms: among the alternative correct spellings, choose the nearest one; when two candidates are tied, choose the more common one.
Two forms of spelling correction: isolated-term and context-sensitive.
Two techniques for addressing isolated-term correction: edit distance and k-gram overlap.
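Both techniques are short enough to sketch. The edit distance below is the Levenshtein variant (insertions, deletions, substitutions, each costing 1), and the overlap measure is the Jaccard coefficient over bigrams. Padding each term with `$` at both ends and the function names are my own choices for the example.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))        # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                        # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def kgrams(term, k=2):
    """Set of k-grams of the term, with '$' marking both boundaries."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b, k=2):
    """Jaccard coefficient of the two terms' k-gram sets."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    return len(ga & gb) / len(ga | gb)

print(edit_distance("cat", "dog"))
print(jaccard("bord", "board"))
```

Edit distance is accurate but costly to compute against a whole vocabulary, so k-gram overlap is useful as a cheap first filter that narrows the candidates before edit distance is applied.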
3.4. Phonetic correction: generate a “phonetic hash” for each term, so that similar-sounding terms hash to the same value.
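Soundex is the classic example of such a hash. The sketch below follows the simple letter-to-digit scheme (vowels and H, W, Y map to 0, then repeated digits are collapsed and zeros dropped). Soundex has several variants that differ in small details, so treat this as one illustrative version and not a reference implementation.

```python
# Letter classes and their digits; letters that sound alike share a digit.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    """Four-character phonetic hash: first letter plus three digits."""
    letters = [ch for ch in term.upper() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters[1:]]
    # Collapse runs of identical consecutive digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop zeros, pad with trailing zeros, keep first letter + three digits.
    body = "".join(d for d in collapsed if d != "0")
    return (letters[0] + body + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))
```

Because both spellings of the name hash to the same code, a query for one can retrieve documents containing the other.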