Blog for Class IS2140: Unit 9 Muddiest point and Reading notes

Muddiest point:

I am fine with the guest speaker's speech .

19 Web search basics

19.1 Background and history:

Four things Make web search basics different: The Web is unprecedented in many ways: unprecedented in scale, unprecedented in the almost-complete lack of coordination in its creation, and unprecedented in the diversity of backgrounds and motives of its participants.

19.2 Web characteristics

Decentralized content publishing with essentially no central control of authorship

（1）The web graph：two nodes A and B from the web graph, each corre- sponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph.

（2）Spam：It is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords.

19.3 Advertising as the economic model

Several aspects of Goto’s model are worth highlighting. First, a user typing the query q into Goto’s search interface was actively expressing an interest and intent related to the query q. For instance, a user typing golf clubs is more likely to be imminently purchasing a set than one who is simply browsing news on golf. Second, Goto only got compensated when a user actually ex- pressed interest in an advertisement – as evinced by the user clicking the ad- vertisement. Taken together, these created a powerful mechanism by which to connect advertisers to consumers, quickly raising the annual revenues of Goto/Overture into hundreds of millions of dollars. This style of search en- gine came to be known variously as sponsored search or search advertising.

19.4 The search user experience

Here Google identified two principles that helped it grow at the expense of its competitors:

(1) a focus on relevance, specifically precision rather than recall in the first few re- sults; (2) a user experience that is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements.

19.4.1 User query needs

There appear to be three broad categories into which common web search queries can be grouped: (i) informational, (ii) navigational and (iii) transac- tional.

19.5 Index size and estimation

1. In response to queries a search engine can return web pages whose con- tents it has not (fully or even partially) indexed. For one thing, search engines generally index only the first few thousand words in a web page. In some cases, a search engine is aware of a page p that is linked to by pages it has indexed, but has not indexed p itself. As we will see in Chapter 21, it is still possible to meaningfully return p in search results.

2. Search engines generally organize their indexes in various tiers and parti- tions, not all of which are examined on every search (recall tiered indexes from Section 7.2.1). For instance, a web page deep inside a website may be indexed but not retrieved on general web searches; it is however retrieved as a result on a search that a user has explicitly restricted to that website (such site-specific search is offered by most web search engines).

19.6 Near-duplicates and shingling

The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other.

21.Link analysis

21.2 PageRank

The PageRank of a node will depend on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as cosine similarity (Sec- tion 6.3) and term proximity (Section 7.2.2), together with the PageRank score. This composite score, developed using the methods of Section 15.4.1, is used to provide a ranked list of results for the query.

21.2.1 Markov chains

∀i,j,Pij ∈[0,1]

and

N ∀i,∑Pij =1.

1. If a row of A has no 1’s, then replace each element by 1/N. For all other rows proceed as follows.

2. Divide each 1 in A by the number of 1’s in its row. Thus, if there is a row with three 1’s, then each of them is replaced by 1/3.

A simple Markov chain with three states; the numbers on the links

3. Multiply the resulting matrix by 1 − α.

4. Add α/N to every entry of the resulting matrix, to obtain P.

21.2.2 The PageRank computation

Recall the definition of a left eigenvector from Equation 18.2; the left eigenvectors of the transition probability matrix P are N-vectors ⃗π such that

⃗π P = λ ⃗π .

21.3 Hubs and Authorities

Blog for Class IS2140

Friday, March 21, 2014

Unit 9 Muddiest point and Reading notes

No comments:

Post a Comment