How search engines and their spiders work
Do you know how many web pages are available to the public?
An educated guess puts the number of available pages at approximately 5
billion. To put that huge number in perspective: if you were to read
one web page per second, day and night, without ever taking a break, you would
be busy for about 150 years! Even if you could somehow catch up on all the
existing pages, you would quickly fall behind again, because new pages are
being added to the web at a rate of several hundred per second.
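For readers who like to check the sums, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above (5 billion pages, one page per second) and assumes a plain 365-day year.

```python
# Rough check of the reading-time estimate, using the figures from the text.
PAGES = 5_000_000_000                    # estimated number of public pages
SECONDS_PER_PAGE = 1                     # one page per second, no breaks
SECONDS_PER_YEAR = 60 * 60 * 24 * 365    # assumes a 365-day year

years = PAGES * SECONDS_PER_PAGE / SECONDS_PER_YEAR
print(f"Non-stop reading time: roughly {years:.0f} years")
```

The result lands a little above 150 years, so "about 150 years" is a fair round number.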
The internet, and more specifically the world wide web (www), is the biggest
repository of knowledge in the history of mankind - which presents the
researcher with a significant problem: how to find relevant information in
the largest haystack ever conceived?
Having ruled out the approach where you read all pages on the net as a bit
tedious, the only other option is to use an internet search engine to help
you find the information you need.
What is a search engine
A search engine is a software program that collects information on web pages
from the internet, categorizes and indexes that information, and then stores
the result in a huge database where it can be searched quickly. The
search engine then provides a web page through which you can search the
database.
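The "collect, index, store, search" idea can be illustrated with a very small Python sketch. The three pages and their addresses are made up, and the index is only a dictionary in memory; a real search engine's database is vastly larger and more sophisticated, so treat this as a toy model rather than a description of any actual engine.

```python
from collections import defaultdict

# Hypothetical pages standing in for what a spider has collected.
PAGES = {
    "example.co.za/hosting": "affordable web hosting in south africa",
    "example.co.za/domains": "domain registration for south africa",
    "example.co.za/poems": "old afrikaans poems and verses",
}

def build_index(pages):
    """Map every word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(PAGES)
print(sorted(search(INDEX, "south africa")))
print(sorted(search(INDEX, "web hosting")))
```

Because the index is built once, up front, each search only has to look up a few words instead of re-reading every page. That is the reason a lookup can be fast even when the collection is huge.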
Examples of the most used search engines are Google, Yahoo, MSN, Alta Vista
and Dogpile, and there are many thousands of smaller and specialized search
engines. (Please see the 01 April edition of this newsletter, where we covered
a few of the best specialist search engines.)
Crawling a spider's web
The world wide web is called "the web" because web pages contain links
pointing to other web pages - a spider's web of pages.
If you were to draw a diagram of a web site, it would resemble a cross
between a hierarchical (organizational) diagram and the web of a slightly
deranged spider. If you were to extend this diagram to, say, all the
web sites in South Africa, and draw lines between pages linking to one
another, the end result would look like a spider's web as spun by a
seriously hung-over spider, i.e. it would lack any form of symmetry or order -
just a huge number of lines in a seemingly random pattern.
A search engine's first task is to "crawl" every link in this web, in other
words to look at each web page, extract all the links on it and then follow
those links to the next page and repeat the process. This process is
called "spidering" or "crawling" and it is performed by automatic computer
programs called "search engine spiders" or "bots" (for robots).
Every search engine has its own spider, with its own rules.
It is the spider that makes one engine better or weaker than its
competition.
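The crawling process described above (look at a page, extract its links, follow them, repeat) can be sketched in a few lines of Python. To keep the example self-contained, the "web" here is a made-up dictionary of five tiny pages rather than real sites, and the names FAKE_WEB, LinkExtractor and crawl are invented for this illustration. A real spider fetches pages over the network and follows many more rules.

```python
from collections import deque
from html.parser import HTMLParser

# A tiny made-up "web": page address -> HTML. Real spiders fetch pages over HTTP.
FAKE_WEB = {
    "/home": '<a href="/hosting">Hosting</a> <a href="/domains">Domains</a>',
    "/hosting": '<a href="/home">Home</a> <a href="/prices">Prices</a>',
    "/domains": '<a href="/home">Home</a>',
    "/prices": "No links here",
    "/orphan": '<a href="/home">Home</a>',  # no other page links to this one
}

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, web):
    """Visit every page reachable from start, following links breadth-first."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # broken link: nothing to read
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(web[url])
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("/home", FAKE_WEB))
```

Note that "/orphan" is never visited: no other page links to it, so a spider that starts at "/home" has no way of discovering it.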
Indexing for relevance
Every page encountered by the search engine spider has to be classified and
indexed, so that one can type in the phrase "web hosting south africa" and
get a list of matches within a reasonable amount of time. (Remember
that the big search engines handle many thousands of searches per
second, so the search process must be very fast to be useful.)
In order to classify and index a page, the spider will look at the content
(i.e. the words) on the page and try to decide how to index it.
For instance, if the spider finds that the title of the page is "web
hosting" and it contains the same phrase a few times, it will categorize the
page as one dealing with "web hosting".
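Here is a minimal Python sketch of that idea. The rule it applies (phrase in the title, plus at least two mentions in the body) is invented purely for illustration, as are the names PageText, read_page and is_about and the two sample pages. It also ignores attributes such as image descriptions, so it shows the principle rather than any real spider's behaviour.

```python
from html.parser import HTMLParser

class PageText(HTMLParser):
    """Collect the title and the visible words of a page.
    Tags such as <img> carry no text, so they add nothing."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip().lower()
        else:
            self.words.extend(data.lower().split())

def read_page(html):
    """Return (title, list of body words) for a piece of HTML."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return parser.title, parser.words

def is_about(html, phrase):
    """Hypothetical rule: phrase in the title and at least twice in the body."""
    title, words = read_page(html)
    body = " ".join(words)
    return phrase in title and body.count(phrase) >= 2

TEXT_PAGE = ("<title>Web hosting</title><h1>Web hosting plans</h1>"
             "<p>Our web hosting is fast.</p>")
GRAPHIC_PAGE = ('<title>Welcome</title><img src="hosting-banner.jpg">'
                '<img src="prices.jpg">')

print(is_about(TEXT_PAGE, "web hosting"))     # page built from words
print(is_about(GRAPHIC_PAGE, "web hosting"))  # page built from pictures
```

The page made of pictures gives the sketch no words at all to work with, so it cannot be classified under any phrase.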
Remember that a computer program (spider) cannot understand or derive
meaning from a picture, so only the words on the page are used to classify the
page. This is why people are often disappointed that
their web pages are not indexed "correctly" by search engines - they
insist on using a lot of graphics (pictures) and very little content
(words). The former are ignored by spiders and the latter are their
"food". :-)
To make matters more "interesting" for the search engine spider, a typical
web page can be indexed in multiple ways: for instance, the
Cozahost home page contains
information about "web hosting" and "domain registration" and "south africa"
- and so on.
The spider is now faced with the question of how to index the page - is it
primarily about "web hosting" or primarily about "domain registration"?
The easiest approach would be to index the page under all the key phrases it
finds, but here lies the big issue for search engines: Relevance.
The holy grail. Relevance is more important to a search engine than finding
WMD in Iraq was for Bush; or a gold credit card was for Zuma; or farm land
was for Mugabe.
You get the point: it's VERY important.
Think about it: if you search for "web hosting", you want the engine to
return ONLY pages that are relevant to that search term AND you want
the results returned in order of RELEVANCE.
Unless a search engine can produce relevant and well ranked results quickly,
it will die. People will stop using it.
If it does this well, it will turn into a billion dollar company virtually
overnight - ask the two guys who created Google.
A quest to demonstrate the relevance of relevance
A vivid example of the importance of relevance is my attempt to find an
old Afrikaans
poem a few days ago:
I learnt about the poem in school, but it was much later, as an adult, that I
first appreciated its simple beauty and wisdom. All I could remember
about the poem was that it contained the verse "glasies by elektriese skyn"
(glasses lit by electric light),
and I remembered that the poem was titled "Repos Allieurs" (rest is
elsewhere), but I did not know if my spelling of "Allieurs" was correct.
(It was not.)
I also had a theory that the poem was written by CJ Langenhoven (which it
was not), probably because "Langenhoven" is one of the very few old
Afrikaans poets I remember. :-)
Anyway, off to Ananzi to find the poem,
secure in the knowledge that it must be on the internet somewhere. Ananzi
is supposed to be a search engine specializing in South African
content, so it must be listed there, right?
Not a chance.
After searching Ananzi for a while, I realized that the only thing I knew
for sure about the poem was that little snippet I remembered: "glasies by
elektriese skyn", so I searched for that. My reward was a list of
electrical contractors and miscellaneous junk that had no relevance to my
quest whatsoever. (Click here to see the
junk returned by Ananzi)
After this rather rude reminder of how bad the Ananzi search relevance
actually is, I went to the big boss - Google. With one search I found
the poem I was looking for - listed first on the page of results. I
learnt that the poem was written by Totius (not Langenhoven) and that the
verse I remembered does not appear in the poem at all - it is actually
made up of snippets from other verses.
Despite my very inexact search, Google was able to find the poem for
me - after making me wait for 0.13 seconds while it searched its database
of 4.2 billion web pages. (Click here to see the
results returned by Google)
(By the way:
Google found the poem here:
http://users.skynet.be/meeuws/tdierus.html - if you can read Afrikaans,
I really recommend that you read the poem. I think it is beautifully
simple and it teaches a great wisdom.)
Back to the relevant point: (pun intended :-)
Unless a search engine can return relevant results quickly, it loses users.
After my experience with Ananzi, do you think I will use it again? Not
likely.
Not all spiders are that bad
The above example illustrates a) how bad the Ananzi spider is and b) how
astoundingly brilliant the Google engine is.
Google's untouchable status as the best search engine on the net is to a
large extent thanks to its spider.
Now don't get me wrong. I don't like spiders. In fact, those big black hairy things can make me scream like a girl.
Search engine spiders are fortunately an entirely different matter. First of
all they are neither black nor hairy, and secondly they are invisible and
they never climb on you.
Google's spider specifically is a thing of beauty: Not only was the
spider (and supporting machinery) clever enough to index that page in such a
way that my inaccurate search could be satisfied, but it allowed the engine
to locate the correct results in less than a quarter of a second!
Magical spiders
Arthur C. Clarke once said that any sufficiently advanced technology is
indistinguishable from magic. In Google's case, that observation is
very true - but fortunately we do have some insight into how the spiders
work. (Search engines consider the inner workings of
their spiders to be trade secrets, but we can deduce some of those
workings by observing their results.)
As I've said earlier - spiders use the content (words) on a web page to
determine how a page should be ranked and indexed. Here are some of
the other rules used by spiders:
Title of the web page
(A phrase or word that appears in the page title is
considered more important than words in the body of the page)
Headings
(HTML headings (i.e. H1, H2, etc.) and the words in them are
considered to be more important than the rest of the page)
Links and titles on a page
(Keywords that appear in the text of links
are more important)
Frequency (density) and prominence
(Words that appear more often and
earlier in sentences are given more weight)
Links from other pages
(The more pages linking TO a page the higher the page
ranking. The principle is that if a lot of pages link TO a page, then that
page must contain important information)
"Freshness" of information
("Younger" pages will rank higher than older pages if all things are
equal)
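To make the rules above a little more concrete, here is a minimal Python sketch that combines most of them into a single score. Everything specific in it is an assumption made for illustration: the Page fields, the weights, the 30-day freshness cut-off and the two sample pages are invented, and prominence (how early a word appears) is not modelled at all. Real search engines keep their formulas secret, so this is not how Google or any other engine actually ranks pages.

```python
from dataclasses import dataclass

@dataclass
class Page:
    title: str
    headings: str
    link_text: str
    body: str
    inbound_links: int  # number of other pages linking TO this page
    age_days: int

# Illustrative weights only: title counts most, body text least.
WEIGHTS = {"title": 5.0, "headings": 3.0, "link_text": 2.0, "body": 1.0}

def count(keyword, text):
    """Number of times the keyword phrase occurs in the text."""
    return text.lower().count(keyword.lower())

def score(page, keyword):
    """Combine the signals from the list above into one number."""
    on_page = (
        WEIGHTS["title"] * count(keyword, page.title)
        + WEIGHTS["headings"] * count(keyword, page.headings)
        + WEIGHTS["link_text"] * count(keyword, page.link_text)
        + WEIGHTS["body"] * count(keyword, page.body)
    )
    if on_page == 0:
        return 0.0  # the page never mentions the keyword
    popularity = 0.5 * page.inbound_links            # links from other pages
    freshness = 1.0 if page.age_days < 30 else 0.0   # small bonus for new pages
    return on_page + popularity + freshness

focused = Page(
    title="Web hosting",
    headings="Web hosting plans",
    link_text="web hosting prices",
    body="We offer web hosting. Our web hosting is fast.",
    inbound_links=10,
    age_days=5,
)
vague = Page(
    title="Welcome",
    headings="About us",
    link_text="click here",
    body="We also do web hosting.",
    inbound_links=2,
    age_days=400,
)

ranked = sorted([vague, focused], key=lambda p: score(p, "web hosting"), reverse=True)
print([p.title for p in ranked])
```

With these made-up numbers the focused page easily outranks the vague one, because the phrase appears in its title, headings and link text, and because more pages link to it.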
THE golden rule
The short list above is only a small sample of the methods used by spiders
to classify a web page, and it gets a lot more complicated quite quickly.
At Cozahost we use sophisticated software to analyze our clients' web pages
to make sure that they achieve the highest possible ranking for their chosen
keywords; but the golden rule remains - as solid and immutable as the law of
gravity:
No matter what you do - you will only achieve significant search engine
rankings if your pages contain large amounts of quality, original content.
Ignore this rule and you will not benefit from the thousands of potential
clients a good search engine can send to your site.
A web site is not the same as a paper advertisement, or a flyer, or a TV
ad. No quality content = no quality leads. Simple.
Immutable. (For more info on how to make sure your pages are ranked
well, please read "website
optimization with doorway pages" for practical tips on how to improve
your site's search engine ranking)
Search engine optimization is both a science and a black art. Books
have been written on the topic, and there is no easy way out - so don't be
fooled by the "marketing fundis" who tell you they can get your site ranked
first on Google, or get you listed on a gazillion search engines.
There is no such thing as a free lunch.
The ONLY way is to write quality content for your site and to make sure that
your content is structured in a way that is search engine spider friendly.
No shortcuts. No get-out-of-jail-free card.
Spiders on your site
Given that MILLIONS of potential customers use search engines every day, it
is plainly obvious that you need to have your web site crawled regularly if
you hope to make a success of your internet business.
Remember that internet search engines are independent entities (companies)
with no obligation whatsoever to spider or index your web site. You
cannot force them to spider your site, although, in many cases, you can pay
them to do it.
Spiders are not stupid. If they were, the search engine that sent them
would not last very long.
If a search engine spider thinks that you are trying to fool it, for
instance by repeating your keywords too many times, or by using any other
technique (like invisible text) intended to mislead, the search engine will
simply drop your site from its index. If the attempt at fooling it
(called "spamdexing") is blatant enough, it may even blacklist your site
forever.
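One simple way to picture "repeating your keywords too many times" is to measure keyword density: the share of a page's words taken up by the keyword phrase. The Python sketch below does just that. The 30% cut-off and the function names are invented for illustration; no search engine publishes such a figure, and real spam detection is certainly far more subtle than a single threshold.

```python
import re

def keyword_density(text, keyword):
    """Share of the words on a page taken up by the keyword phrase."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    phrase = keyword.lower().split()
    n = len(phrase)
    # Count every position where the whole phrase appears word for word.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)
    return hits * n / len(words)

# Hypothetical cut-off, chosen only to make the example work.
SUSPICIOUS_DENSITY = 0.30

def looks_stuffed(text, keyword):
    """True if the keyword makes up a suspiciously large share of the text."""
    return keyword_density(text, keyword) > SUSPICIOUS_DENSITY

natural = ("We offer reliable web hosting for small businesses in "
           "South Africa, with friendly support.")
stuffed = "web hosting web hosting cheap web hosting best web hosting"

print(looks_stuffed(natural, "web hosting"))
print(looks_stuffed(stuffed, "web hosting"))
```

Text written for human readers tends to come out well below any sensible cut-off, which is one more reason to write for your visitors rather than for the spider.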
The point is that the search engine does you a favor by listing your site
and sending visitors to it. In return, you have to provide the search
engine's customers with quality, relevant information. Not a bad deal
if you ask me.
To get your site spidered, you need to update your content often by adding
more pages and making sure that all information on existing pages is
accurate and up to date.
Because spiders have BILLIONS of web pages to visit, they take anything from
6 to 12 weeks to return to your site, or even to visit it for the first
time. (If your site is not listed at all, you should contact Cozahost
so that we can submit your site to the major search engines on your behalf)
You can encourage repeat visits by spiders by frequently updating your web
site - spiders LOVE fresh new juicy content. :-)
Please note that all rights are
reserved for this article but you may copy and publish this article on your web
site provided that you make no changes to the page at all - and that includes
all of the hyperlinks and this notice. We ask that you
contact us if you are re-publishing this article
on your web site so that we can notify you when we update the article.
(c) Cozahost, 2006. All rights reserved.