Free-Conversant Support / Technical Info: Recent Changes Related to the Indexers
Subject Technical Info: Recent Changes Related to the Indexers
Posted 6/7/2001; 5:08 PM by Seth Dillingham
Warning: A lot of this message is quite technical. Please don't feel that you must read it. I'm posting it here so that anyone who wants to know can see what we're doing about the load "problem" introduced by the indexers (the search engine bots).

As many of you know already, the sites hosted on the Free-Conversant servers become less responsive each month while the search engines are indexing us.

The reason for that is simple: nearly everything served through Conversant's web server is generated dynamically, so Conversant can't serve pages as quickly as a static server can. The indexers (most notably Googlebot from google.com, Slurp from Inktomi, and FAST-WebCrawler from alltheweb.com) request EVERY PAGE from EVERY SITE, every time they index us.

Unfortunately, they can't tell by looking at two different URLs that both lead to the same page, so they request both URLs and index the same information twice.

Some services, such as editthispage.com, blocked the bots from their sites completely for a month. That had the unfortunate effect of removing those sites from the search engines. They've let the bots back in now, and the sites that haven't been reindexed yet will be whenever the search engines get around to it.

Most people find sites through search engines. If you're not there, you're not being found.

We've taken a different approach based on our own experience, on information found all over the net, and on comments by professionals who have been dealing with this longer than we have. We're tweaking Conversant to handle the search engines, and (more importantly) to guide them through the sites we host.

What did we do?

The first steps, finished over a month ago, included the following:

  • hide the DG Calendar sort links from the bots
  • change the DG Calendar sort links to use search args
  • when redirecting to a logon page, use search args ("?") to indicate the return URL
  • include "Last-Modified" HTTP headers in most responses

This week, we added support for the "If-Modified-Since" HTTP request header. (More about that at the end of this message.)

Why did we do those things?

We hid the DG Calendar sort links so that bots wouldn't even see them. Every bot used to request every DG Calendar once, and then again for every way the list could be sorted. Now that the bots never see those links, they never request the sorted views.
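For anyone curious what "hiding links from the bots" might look like, here is a rough sketch in Python. It is not Conversant's actual code: the function names, the way the User-Agent is matched, and the "subj" and "date" sort keys are all assumptions made up for illustration (only "sort=auth" and the three bot names come from this message).

```python
# Substrings to look for in the User-Agent header. The three bot names are
# the ones mentioned in this message; matching by substring is an assumption.
BOT_MARKERS = ("googlebot", "slurp", "fast-webcrawler")


def is_indexer(user_agent: str) -> bool:
    """Return True if the User-Agent looks like one of the known indexers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def render_sort_links(user_agent: str, base_url: str) -> str:
    """Return the sort links as HTML for people, and nothing for indexers."""
    if is_indexer(user_agent):
        return ""  # the bot never sees the links, so it never follows them
    # "auth" appears in this message; "subj" and "date" are hypothetical keys.
    fields = [("auth", "Author"), ("subj", "Subject"), ("date", "Date")]
    return " | ".join(
        f'<a href="{base_url}?sort={key}">{label}</a>' for key, label in fields
    )


if __name__ == "__main__":
    print(repr(render_sort_links("Googlebot/2.1", "/dg")))
    print(render_sort_links("Mozilla/4.0 (compatible)", "/dg"))
```

The trade-off with this kind of check is that it only helps against bots that identify themselves honestly, which is why it is paired with the URL changes described next.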

We changed the sort links to use search args (that is, we put something like "?sort=auth" in the URL to sort the list of messages by author), because most of the bots won't request URLs containing search args. The old URLs used "path args", which just meant that a $ was used instead of a ?. Now if an old URL is requested, a redirect to the new URL is returned, and (as I said) most of the bots won't request it.
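The redirect from an old path-arg URL to the new search-arg URL amounts to swapping the first $ for a ?. A minimal sketch in Python follows, assuming the old URLs looked like "/dg$sort=auth" (the exact old format and the function name are assumptions, and the HTTP status used for the redirect isn't stated here, so the sketch only computes the target):

```python
from typing import Optional


def redirect_target(path: str) -> Optional[str]:
    """Return the search-arg URL for an old path-arg URL.

    Returns None when the path is not an old-style URL, in which case the
    request would be handled normally instead of being redirected.
    """
    if "?" in path:
        return None  # already uses search args
    base, sep, args = path.partition("$")
    if not sep:
        return None  # no path arg at all
    return f"{base}?{args}"


if __name__ == "__main__":
    for path in ("/dg$sort=auth", "/dg?sort=auth", "/dg"):
        print(path, "->", redirect_target(path))
```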

We changed the logon page to use ? in the redirects, like this:

http://support.free-conversant.com/logon?returnto=http:%3A%2F%2Fsupport.free-conversant.com%2F

Since the bots were trying to reply to every message in every discussion group, they were being redirected to the logon page: that's two requests per message, per site. With the new format, most of them no longer request the logon page, which cuts out a huge number of hits. (Try replying to any message on a site that you're not logged on to, and you'll see that change in action.)
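Here is a small Python sketch of how a logon redirect like that could be built. The "returnto" parameter name comes from the URL above; the function names are made up, and the sketch uses standard percent-encoding, so its output may differ slightly from what Conversant actually generates.

```python
from urllib.parse import parse_qs, quote, urlsplit


def logon_redirect(logon_url: str, return_to: str) -> str:
    """Build a logon URL that carries the return address as a search arg."""
    # safe="" makes quote() encode ":" and "/" too, so the whole return
    # address travels as a single parameter value.
    return f"{logon_url}?returnto={quote(return_to, safe='')}"


def return_address(url: str) -> str:
    """Recover the return address from a logon URL built above."""
    return parse_qs(urlsplit(url).query)["returnto"][0]


if __name__ == "__main__":
    url = logon_redirect(
        "http://support.free-conversant.com/logon",
        "http://support.free-conversant.com/",
    )
    print(url)
    print(return_address(url))
```

Because the return address lives after the "?", a bot that skips URLs with search args never follows the redirect in the first place.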

Finally, and most importantly, we now include the Last-Modified header in the HTTP responses for URL-bound pages, weblog pages, message pages, linked javascripts, and linked stylesheets. This is a tiny bit of information, not displayed anywhere (it's not part of the HTML), that tells the browser or indexing robot (the User-Agent) when the page was last modified.

That's probably the most significant improvement of all, because the next time the bots index the page, some of them are smart enough to specify in the request, "only send this page to me if it's been updated since the last time I indexed it" (that's done with the If-Modified-Since request header). If the page hasn't been updated since the specified date, then Conversant doesn't have to build and send the page; it only has to answer that nothing has changed. So a page that may have taken a second or two will now take practically no time at all (specifically, about 4 ticks, versus 60 - 120 ticks).
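The exchange described above is the standard HTTP conditional GET. The sketch below shows the general idea in Python; it is not Conversant's implementation, and the function name, the return shape, and the render callback are all assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since, render):
    """Return (status, headers, body) for a page request.

    last_modified     -- aware UTC datetime of the page's last change
    if_modified_since -- raw If-Modified-Since header value, or None
    render            -- callable that builds the page (the expensive part)
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, b""  # Not Modified: render() is skipped
    return 200, headers, render()


if __name__ == "__main__":
    changed = datetime(2001, 6, 1, 12, 0, tzinfo=timezone.utc)
    status, headers, _ = respond(changed, None, lambda: b"<html>...</html>")
    print(status, headers)
    # A bot that remembered the date sends it back on its next visit.
    status, _, body = respond(changed, headers["Last-Modified"], lambda: b"x")
    print(status, body)
```

The saving comes from never calling the render step when the 304 branch is taken.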

Multiply those savings by thousands of pages, and you can probably see the benefits.
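As a back-of-the-envelope illustration in Python: the tick figures are the ones quoted above, but the page count of 10,000 is purely hypothetical, and 60 ticks per second is an assumption (it matches "a second or two" being 60 - 120 ticks).

```python
TICKS_PER_SECOND = 60  # assumption, consistent with the figures quoted above


def seconds_saved(pages: int, full_ticks: int, not_modified_ticks: int = 4) -> float:
    """Seconds saved when `pages` unchanged pages get the short answer."""
    return pages * (full_ticks - not_modified_ticks) / TICKS_PER_SECOND


if __name__ == "__main__":
    pages = 10_000  # hypothetical number of unchanged pages in one indexing run
    low = seconds_saved(pages, 60)
    high = seconds_saved(pages, 120)
    print(f"between {low / 60:.0f} and {high / 60:.0f} minutes saved")
```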

There were numerous other changes, most of them very small, and nearly all of them designed to increase Conversant's performance under heavy indexer load. In other words, you won't see a difference under normal circumstances, but hopefully the indexers won't hit us as hard when they come back in a few weeks.

That's all the technical info. My next message will be more general, and will ask for everybody's help.

Seth

This site managed with Conversant, © Copyright 2010 Macrobyte Resources