Friday, July 25, 2008

Google "Knows" About 1 Trillion Web Items

How big is Google's web index? Google hasn't said for years and still isn't saying -- but it is blogging today about how it "knows" of 1 trillion web-based items. Joy, we haven't had a search engine talk about pages it merely "knows" about to confuse things since the Lycos days. Glad to have Google firing the first shot to take us back into the search engine size wars of old.

As a short refresher, Google used to list the number of pages it had indexed on its home page. It dropped that count back in September 2005, after Yahoo for a short period had claimed to have indexed more. Both search engines swapped PR blows over who was bigger, then we got detente when the count went away.

That was good. Very good. This is because search engine size has long been used as a substitute for a relevancy metric that doesn't exist. If a search engine wants to seem twice as good as a competitor, they need only trot out a bar chart showing they have twice as many documents as their competitor. Plus, you toss in the famous haystack metaphor. You can't find the needle in the haystack if you're searching only half the haystack!

But more documents doesn't mean better relevancy. Indeed, more documents can make a search engine worse, if it doesn't have good relevancy. My turn on the haystack metaphor has always been to say that if I dump the entire haystack on your head, does that help you find the needle? Chances are, you just get overwhelmed by a bunch of hay.

Still, size has long been an appealing stat that the search engines would go to -- which in turn would cause search engines to find ways to inflate the size figures they'd report. Way back in 1996, Lycos talked about the number of pages it "knew" about, even though these weren't actually indexed or made accessible to searchers. Excite was so annoyed that it pushed back with a page on how to count URLs, as you can see archived here: http://web.archive.org/web/19961121225924/http:/www.excite.com/ice/counting.html

Now we've got Google talking about "knowing" 1 trillion items of content out there:

Recently, even our search engineers stopped in awe about just how big the web is these days -- when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

It's easy to come away with the idea that Google lets you search against 1 trillion documents. That's not the case, as the post does explain:

We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers. But we're proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world's data.

All I really want is the last part -- that Google has what it believes to be a comprehensive index of the web. I don't even want them or anyone saying that they have the "most" comprehensive, given that this is so difficult to verify. Indeed, consider this from Google's own post:

So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite -- for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We're not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what's a useful page, and there is no exact answer.

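Google's calendar example describes what crawler engineers call a "crawler trap": auto-generated links that produce an endless chain of technically unique URLs with no new content. A minimal sketch of the idea (the `/cal/...` URL scheme is hypothetical, not Google's actual crawler logic):

```python
from datetime import date, timedelta

def next_day_link(url):
    """Given a hypothetical calendar URL like '/cal/2008-07-25',
    return its 'next day' link -- each one a 'new' unique URL."""
    day = date.fromisoformat(url.rsplit("/", 1)[-1])
    return "/cal/" + (day + timedelta(days=1)).isoformat()

def crawl(start, limit=5):
    """A naive crawler could follow this chain forever; real crawlers
    cap traversal of auto-generated URL spaces. Here we stop at `limit`."""
    seen = []
    url = start
    while len(seen) < limit:
        seen.append(url)
        url = next_day_link(url)
    return seen

# Every URL in the result is unique, yet none adds useful content --
# which is why "unique URLs known" says little about index quality.
print(crawl("/cal/2008-07-25"))
```

This is why a count of "known" URLs is effectively unbounded, and why any such number depends on where a crawler decides to stop.
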
Right, there's no exact answer to what's a useful page -- and so in turn, there's no one exact answer to who has the "most" of them collected. Tell me you have a good chunk of the web, and I'm fine. But when Google or any search engine starts making size claims, my hackles go way up. There are better things to focus on.
