Muddiest point:
I am fine with the guest speaker's speech .
19 Web search basics
19.1 Background
and history:
Four things Make web search basics
different: The Web is unprecedented in many ways:
unprecedented in scale, unprecedented in the almost-complete lack of
coordination in its creation, and unprecedented in the diversity of backgrounds
and motives of its participants.
19.2 Web characteristics
Decentralized content publishing with essentially no
central control of authorship
(1)The web graph:two nodes A and B
from the web graph, each corre- sponding to a web page, with a hyperlink from A
to B. We refer to the set of all such nodes and directed edges as the web
graph.
(2)Spam:It is the
manipulation of web page content for the purpose of appearing high up in search
results for selected keywords.
19.3 Advertising as the economic model
Several aspects of Goto’s model are worth highlighting.
First, a user typing the query q into Goto’s search interface was
actively expressing an interest and intent related to the query q. For
instance, a user typing golf clubs is
more likely to be imminently purchasing a set than one who is simply browsing
news on golf. Second, Goto only got compensated when a user actually ex-
pressed interest in an advertisement – as evinced by the user clicking the ad-
vertisement. Taken together, these created a powerful mechanism by which to
connect advertisers to consumers, quickly raising the annual revenues of
Goto/Overture into hundreds of millions of dollars. This style of search en-
gine came to be known variously as sponsored search or search
advertising.
19.4 The search user experience
Here Google identified two principles that helped it grow
at the expense of its competitors:
(1) a focus on
relevance, specifically precision rather than recall in the first few re-
sults; (2) a user experience that is lightweight, meaning that both the search
query page and the search results page are uncluttered and almost entirely
textual, with very few graphical elements.
19.4.1 User query needs
There appear to be three broad categories into which
common web search queries can be grouped: (i) informational, (ii) navigational
and (iii) transac- tional.
19.5 Index size and estimation
1. In response to queries a search engine can return web
pages whose con- tents it has not (fully or even partially) indexed. For one
thing, search engines generally index only the first few thousand words in a
web page. In some cases, a search engine is aware of a page p that is linked
to by pages it has indexed, but has not indexed p itself. As we will
see in Chapter 21, it is still possible to meaningfully return p in
search results.
2. Search engines generally organize their indexes in
various tiers and parti- tions, not all of which are examined on every search
(recall tiered indexes from Section 7.2.1). For instance, a web page deep
inside a website may be indexed but not retrieved on general web searches; it
is however retrieved as a result on a search that a user has explicitly
restricted to that website (such site-specific search is offered by most web
search engines).
19.6 Near-duplicates and shingling
The simplest approach to detecting duplicates is to
compute, for each web page, a fingerprint that is a succinct (say
64-bit) digest of the characters on that page. Then, whenever the fingerprints
of two web pages are equal, we test whether the pages themselves are equal and
if so declare one of them to be a duplicate copy of the other.
21.Link analysis
21.2 PageRank
The PageRank of a node will depend on the link structure
of the web graph. Given a query, a web search engine computes a composite score
for each web page that combines hundreds of features such as cosine similarity
(Sec- tion 6.3) and term proximity (Section 7.2.2), together with the PageRank
score. This composite score, developed using the methods of Section 15.4.1, is
used to provide a ranked list of results for the query.
21.2.1 Markov chains
∀i,j,Pij ∈[0,1]
and
N ∀i,∑Pij =1.
1. If a row of A has no 1’s, then replace each element
by 1/N. For all other rows proceed as follows.
2. Divide each 1 in A by the number of 1’s in its
row. Thus, if there is a row with three 1’s, then each of them is replaced by
1/3.
A simple Markov chain with three states; the numbers on the links
3. Multiply the resulting matrix by 1 − α.
4. Add α/N to every entry of the resulting matrix, to
obtain P.
21.2.2 The PageRank computation
Recall the definition of a left eigenvector from
Equation 18.2; the left eigenvectors of the transition probability matrix P are
N-vectors ⃗π such that
⃗π P = λ ⃗π .
21.3 Hubs and Authorities

No comments:
Post a Comment