Cozahost Newsletter Archive


Cozahost newsletter - 16 November 2004
Hi!  
Here is your Cozahost newsletter:

Search engines make the world wide web the powerful information tool it is today.  We explain how the engines work, and how you can Google yourself.

Please take a moment now to recommend the Cozahost Newsletter to friends who would like it too. Thanks!
 

..:: In This Issue ::..

Hello
Internet search engines explained
Google yourself
Your smile for the day
Services and products
Subscribe to this newsletter
..::  Hello :-)
  
"When you find something, it will be in the last place you looked"

That sounded like a very negative statement to me, until I realized that it was perfectly true.  Duh!  Bit slow on the uptake. :-)

Finding things on the internet can be very time consuming... I mean, there are literally billions of places to look.  If it were not for internet search engines...

In this issue we discuss how the search engines work.  Without the likes of Google, the web would be practically useless.  If you have your own site, you already know that these engines have the power to turn your site into a profitable investment or a complete waste of time.

On average, search engines will generate 30% to 50% of traffic to a web site, and this may translate into hundreds of new leads / sales per month.  Whether you are a webmaster, business owner or internet user - understanding the basics of how search engines operate is essential.

We also show you how you can Google yourself.  :-)  It's a free software program that enables you to run your own Google search engine on your PC to search your own files, emails and web pages. 

It's so brilliant, it made my jaw drop.  More about that later in this edition, but first, let's get to grips with the brains behind the web...
  
 

..:: Internet search engines explained


Do you know how many web pages are available to the public?

An educated guess puts the number of available pages at approximately 5 billion.  To put that huge number in perspective: if you were to read one web page per second, day and night, without ever taking a break, you would be busy for about 150 years!  Even if you could somehow catch up on all the existing pages, you would quickly fall behind again, because new pages are being added to the web at a rate of several hundred per second.
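For the curious, here is a quick back-of-the-envelope check of that figure.  It is only a sketch: the 5 billion page count is the estimate quoted above, and the variable names are made up for the illustration.  The result comes out a little above 150 years, which is in line with the rough figure.

```python
# Rough check of the reading-time estimate, using the 5 billion page guess above.
PAGES = 5_000_000_000                      # estimated number of pages (assumption from the text)
SECONDS_PER_YEAR = 60 * 60 * 24 * 365      # ignoring leap years

# One page per second, day and night, no breaks.
years_to_read = PAGES / SECONDS_PER_YEAR

print(f"About {years_to_read:.1f} years of non-stop reading")
```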

The internet, and more specifically the world wide web (www) is the biggest repository of knowledge in the history of mankind - which presents the researcher with a significant problem: how to find relevant information in the largest haystack ever conceived?

Having ruled out the approach where you read all pages on the net as a bit tedious, the only other option is to use an internet search engine to help you find the information you need.

What is a search engine

A search engine is a software program that collects information on web pages from the internet, categorizes and indexes the information, and then stores the result in a huge database where it can be quickly searched.  The search engine then provides a web page to search the database.

The most used search engines include Google, Yahoo, MSN, Alta Vista and Dogpile, and there are many thousands of smaller and specialized search engines.  (Please see the 01 April edition of this newsletter where we covered a few of the best specialist search engines).

Crawling a spider's web

The world wide web is called "the web" because web pages contain links pointing to other web pages - a spider's web of pages.

If you were to draw a diagram of a web site, it would resemble a mixture between a hierarchical (organizational) diagram and the web of a slightly deranged spider.  If you were to extend this diagram to, say, all the web sites in South Africa, and draw lines between pages linking to one another, the end result would look like a spider's web as spun by a seriously hung-over spider, i.e. it would lack any form of symmetry or order - just a huge number of lines in a seemingly random pattern.

A search engine's first task is to "crawl" every link in this web, in other words to look at each web page, extract all the links on it and then follow those links to the next page and repeat the process.  This process is called "spidering" or "crawling" and it is performed by automatic computer programs called "search engine spiders" or "bots" (for robots).

Every search engine has its own spider, with its own rules.

It is the spider that makes one engine better or weaker than its competition.
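To make the crawling idea a bit more concrete, here is a minimal sketch in Python.  It does not touch the real internet: the tiny "web" in LINKS is invented for the example, and real spiders follow their own (secret) rules.  It only shows the basic loop described above: look at a page, extract its links, follow each link you have not seen yet, and repeat.

```python
from collections import deque

# Hypothetical miniature "web": each page name maps to the pages it links to.
LINKS = {
    "home": ["hosting", "domains"],
    "hosting": ["home", "pricing"],
    "domains": ["pricing"],
    "pricing": [],
    "orphan": ["home"],   # no page links TO this one
}

def crawl(start, links):
    """Breadth-first crawl: visit a page, collect its links, follow the unseen ones."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:      # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("home", LINKS)
print(visited)
```

Note that the "orphan" page is never visited, because no other page links to it.  A spider that only follows links can find a page only if something already points to it.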

Indexing for relevance

Every page encountered by the search engine spider has to be classified and indexed, so that one can type in the phrase "web hosting south africa" and get a list of matches within a reasonable amount of time.  (Remember that the big search engines handle hundreds of thousands of searches per second, so the search process must be very fast to be useful). 

In order to classify and index a page, the spider will look at the content (i.e. the words) on the page and will try to decide how to index it.

For instance, if the spider finds that the title of the page is "web hosting" and it contains the same phrase a few times, it will categorize the page as one dealing with "web hosting". 
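One common way to make such lookups fast is an "inverted index": a table that maps each word to the pages containing it, so a search never has to re-read the pages themselves.  The sketch below is an illustration only, not a description of how any particular engine works.  The page addresses and text in PAGES are invented, and the function names are my own.

```python
import re
from collections import defaultdict

# Hypothetical pages: address -> visible text on the page.
PAGES = {
    "a.example/hosting": "Web hosting in South Africa. Affordable web hosting plans.",
    "b.example/domains": "Domain registration for South Africa.",
    "c.example/recipes": "Recipes for a quick dinner.",
}

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Build the inverted index: word -> set of pages that contain the word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(PAGES)
print(search(index, "web hosting south africa"))
```

The index is built once, up front.  After that, each search is just a handful of set lookups, which is why the answer can come back in a fraction of a second.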

Remember that a computer program (spider) cannot understand or derive meaning from a picture, so only the words on the page are used to classify the page.  This is why people are often disappointed that their web pages are not indexed "correctly" by search engines - because they insist on using a lot of graphics (pictures) and very little content (words).  The former is ignored by spiders and the latter is their "food". :-)

To make matters more "interesting" for the search engine spider, a typical web page can be indexed in multiple ways: for instance, the Cozahost home page contains information about "web hosting" and "domain registration" and "south africa" - and so on. 

The spider is now faced with the question of how to index the page - is it primarily about "web hosting" or primarily about "domain registration"?

The easiest approach would be to index the page under all the key phrases it finds, but here lies the big issue for search engines:  Relevance.  The holy grail. Relevance is more important to a search engine than finding WMD in Iraq was for Bush; or a gold credit card was for Zuma; or farm land was for Mugabe. 

You get the point: it's VERY important.

Think about it: if you search for "web hosting", you want the engine to return ONLY pages that are relevant to that search term AND you want the results returned in order of RELEVANCE. 

Unless a search engine can produce relevant and well ranked results quickly, it will die.  People will stop using it.

If it does this well, it will turn into a billion dollar company virtually overnight - ask the two guys who created Google.

A quest to demonstrate the relevance of relevance

A vivid example of the importance of relevance is when I tried to find an old Afrikaans poem a few days ago:

I learnt about the poem in school, but it was much later as an adult that I first appreciated its simple beauty and wisdom.  All I could remember about the poem was that it contained the verse "glasies by elektriese skyn" (glasses lit by electric light), and I remembered that the poem was titled "Repos Allieurs" (rest is elsewhere), but I did not know if my spelling of "Allieurs" was correct.  (It was not)

I also had a theory that the poem was written by CJ Langenhoven (which it was not), probably because "Langenhoven" is one of the very few old Afrikaans poets I remember. :-)

Anyway, off to Ananzi to find the poem - secure in the knowledge that it must be on the internet somewhere, and since Ananzi is supposed to be a search engine specializing in South African content, it must be listed there, right?

Not a chance. 

After searching Ananzi for a while, I realized that the only thing I knew for sure about the poem was that little snippet I remembered: "glasies by elektriese skyn", so I searched for that.  My reward was a list of electrical contractors and miscellaneous junk that had no relevance to my quest whatsoever.  (Click here to see the junk returned by Ananzi)

After this rather rude reminder of how bad the Ananzi search relevance actually is, I went to the big boss - Google.  With one search I found the poem I was looking for - listed first on the page of results.  I learnt that the poem was written by Totius (not Langenhoven) and that the verse I remembered does not appear in the poem at all - it is actually made up of snippets from other verses.

Despite my very inexact search, Google was able to find the poem for me - after making me wait for 0.13 seconds while it searched its database of 4.2 billion web pages.  (Click here to see the results returned by Google)

(By the way: Google found the poem here: http://users.skynet.be/meeuws/tdierus.html - if you can read Afrikaans, I really recommend that you read the poem.  I think it is beautifully simple and it teaches a great wisdom.)

Back to the relevant point: (pun intended :-)

Unless a search engine can return relevant results quickly, it loses users.  After my experience with Ananzi, do you think I will use it again?  Not likely.

Not all spiders are that bad

The above example illustrates a) how bad the Ananzi spider is and b) how astoundingly brilliant the Google engine is. 

Google's untouchable status as the best search engine on the net is to a large extent thanks to its spider.

Now don't get me wrong.  I don't like spiders.  In fact, those big black hairy things can make me scream like a girl.

Search engine spiders are fortunately an entirely different matter. First of all they are neither black nor hairy, and secondly they are invisible and they never climb on you. 

Google's spider specifically is a thing of beauty:  Not only was the spider (and supporting machinery) clever enough to index that page in such a way that my inaccurate search could be satisfied, but it allowed the engine to locate the correct results in less than a quarter of a second!

Magical spiders

Someone once said that any sufficiently sophisticated technology is indistinguishable from magic.  In Google's case, that observation is very true - but fortunately we do have some insight into how the spiders work.  (Search engines consider the inner workings of their spiders to be trade secrets, but we can deduce some of those workings by observing the results)

As I've said earlier - spiders use the content (words) on a web page to determine how a page should be ranked and indexed.  Here are some of the other rules used by spiders:

  • Title of the web page
    (A phrase or word that appears in the page title is considered more important than words in the body of the page)
  • Headings
    (HTML headings (ie H1, H2, etc) and the words in them are considered to be more important than the rest of the page)
  • Links and titles on a page
    (Keywords that appear in the text of links are more important)
  • Frequency (density) and prominence
    (Words that appear more often and earlier in sentences are given more weight)
  • Links from other pages
    (The more pages linking TO a page the higher the page ranking.  The principle is that if a lot of pages link TO a page, then that page must contain important information)
  • "Freshness" of information
    ("Younger" pages will rank higher than older pages if all things are equal)
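The rules in the list above can be sketched as a simple scoring function.  Please treat this as a toy: the weights (WEIGHTS and INBOUND_LINK_BONUS) are numbers I made up, the three sample pages are invented, and the "freshness" and "prominence" rules are left out to keep it short.  Real search engines keep their formulas secret and they are far more involved.

```python
import re

# Assumed weights, chosen only for illustration: title beats headings beats body.
WEIGHTS = {"title": 5.0, "headings": 3.0, "body": 1.0}
INBOUND_LINK_BONUS = 0.5   # assumed bonus per page linking TO this page

def count_phrase(phrase, text):
    """Count how many times the phrase appears in the text, ignoring case."""
    return len(re.findall(re.escape(phrase.lower()), text.lower()))

def score(page, phrase):
    """Weighted phrase count, plus a bonus for inbound links."""
    total = sum(WEIGHTS[field] * count_phrase(phrase, page[field]) for field in WEIGHTS)
    if total == 0:
        return 0.0   # links alone do not make an irrelevant page relevant
    return total + INBOUND_LINK_BONUS * page["inbound_links"]

# Hypothetical pages.
pages = [
    {"url": "one", "title": "Web hosting", "headings": "Web hosting plans",
     "body": "We offer web hosting.", "inbound_links": 2},
    {"url": "two", "title": "Welcome", "headings": "About us",
     "body": "We also do web hosting.", "inbound_links": 4},
    {"url": "three", "title": "Gardening", "headings": "Roses",
     "body": "Pruning tips.", "inbound_links": 50},
]

# Keep only relevant pages, best score first.
ranked = sorted(
    (p for p in pages if score(p, "web hosting") > 0),
    key=lambda p: score(p, "web hosting"),
    reverse=True,
)
print([p["url"] for p in ranked])
```

Page "one" wins because the phrase appears in its title and heading, not just in the body.  Page "three" has by far the most inbound links, but it is not about web hosting at all, so it does not appear in the results.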

THE golden rule

The short list above is only a small sample of the methods used by spiders to classify a web page, and it gets a lot more complicated quite quickly.  At Cozahost we use sophisticated software to analyze our clients' web pages to make sure that they achieve the highest possible ranking for their chosen keywords; but the golden rule remains - as solid and immutable as the law of gravity:

No matter what you do - you will only achieve significant search engine rankings if your pages contain large amounts of quality, original content.

Ignore this rule and you will not benefit from thousands of potential clients a good search engine can send to your site.

A web site is not the same as a paper advertisement, or a flyer, or a TV ad.  No quality content = no quality leads.  Simple.  Immutable.  (For practical tips on how to improve your site's search engine ranking, please read "website optimization with doorway pages")

Search engine optimization is both a science and a black art.  Books have been written on the topic, and there is no easy way out - so don't be fooled by the "marketing fundis" who tell you they can get your site ranked first on Google, or get you listed on a gazillion search engines.  There is no such thing as a free lunch.

The ONLY way is to write quality content for your site and to make sure that your content is structured in a way that is search engine spider friendly.  No shortcuts.  No get out of jail free card.

Spiders on your site

Given that MILLIONS of potential customers use search engines every day, it is plainly obvious that you need to have your web site crawled regularly if you hope to make a success of your internet business.  

Remember that internet search engines are independent entities (companies) with no obligation whatsoever to spider or index your web site.  You cannot force them to spider your site, although, in many cases, you can pay them to do it.

Spiders are not stupid.  If they were, the search engine that sent them would not last very long.  If a search engine spider thinks that you are trying to fool it, for instance by repeating your keywords too many times, or by using any other technique (like invisible text) intended to mislead, the search engine will simply drop your site from its index.  If the attempt at fooling it (called "spamdexing") is blatant enough, it may even blacklist your site forever.
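How might a spider notice keyword stuffing?  One simple signal is keyword density: the share of all words on a page taken up by a single keyword.  The sketch below is a guess at the general idea, not the actual test used by any search engine.  The MAX_DENSITY threshold is an arbitrary number I picked, and the two sample texts are invented.  It handles single-word keywords only.

```python
import re

MAX_DENSITY = 0.2   # assumed threshold, purely for illustration

def keyword_density(text, keyword):
    """Fraction of the words in the text that are the given (single-word) keyword."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

def looks_stuffed(text, keyword):
    """Flag text where one keyword makes up a suspiciously large share of the words."""
    return keyword_density(text, keyword) > MAX_DENSITY

natural = "We offer reliable hosting with daily backups and friendly support."
stuffed = "hosting hosting cheap hosting best hosting hosting deals hosting"

print(looks_stuffed(natural, "hosting"))
print(looks_stuffed(stuffed, "hosting"))
```

Text written for human readers tends to land well below a threshold like this without any effort, which is one more reason to write for people first and spiders second.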

The point is that the search engine does you a favor by listing your site and sending visitors to it.  In return, you have to provide the search engine's customers with quality, relevant information.  Not a bad deal if you ask me.

To get your site spidered, you need to update your content often by adding more pages and making sure that all information on existing pages is accurate and up to date.

Because spiders have BILLIONS of web pages to visit, they take anything from 6 to 12 weeks to return to your site, or even to visit it for the first time.  (If your site is not listed at all, you should contact Cozahost so that we can submit your site to the major search engines on your behalf)

You can encourage repeat visits by spiders by frequently updating your web site - spiders LOVE fresh new juicy content. :-)

And this brings us to the subject of running your very own search engine...
 

..:: Google yourself
 
How much time do you waste trying to find an old email, or to remember the URL of that web site you visited a while ago?

Does it irk you that it is faster to find information on the internet than on your own PC?  Nuts, isn't it?

Well, help is at hand.

Google released a beta (test) version of a search engine that runs on your own PC. 

After you download and install it, it starts indexing web pages, files and email on your computer in the background (it takes about a day for the process to complete) and after that you can find any information as quickly and as conveniently on your own PC as you can on the internet.
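To give a feel for what "indexing your own files" means, here is a small sketch.  I must stress that this is NOT how Google's program works (its internals are not public).  It is just the inverted index idea applied to a folder of plain text files.  The function names, the ".txt" filter and the two demo files are all assumptions made for the example.

```python
import os
import re
import tempfile
from collections import defaultdict

def index_folder(root):
    """Map each word to the set of .txt files under root that contain it."""
    index = defaultdict(set)
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if not name.endswith(".txt"):
                continue   # this sketch only reads plain text files
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for word in re.findall(r"[a-z0-9]+", handle.read().lower()):
                    index[word].add(path)
    return index

def find(index, word):
    """Return the files containing the word, sorted by path."""
    return sorted(index.get(word.lower(), set()))

# Demo on a throwaway folder so the example does not touch your real files.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "invoice.txt"), "w", encoding="utf-8") as f:
        f.write("Invoice for domain registration")
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("Remember to renew the domain")
    index = index_folder(root)
    matches = [os.path.basename(path) for path in find(index, "domain")]

print(matches)
```

The slow part (reading every file) happens once, in the background.  After that, finding a word is a single lookup, which is why searching an indexed PC can feel as quick as searching the web.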

I am not kidding. 

After the indexing is complete, you can search your PC for information and get RELEVANT, RANKED information - in less than a second.  It's stupefyingly brilliant, and even though I should know better, I think that the author of this software must have sold his soul to the devil, or sacrificed black cats on a moonlit night, or something like that... :-)

In all my time in the IT industry, I have seldom seen a more spectacularly brilliant piece of software.  My advice is to download and install it right away. 

You can download a free copy here...
 

..:: Services and products


Here are some quick links to CozaHost services and products:

About us - Background information on Cozahost: who we are, why we are here and what we aim to do.
Contact us - Use this link if you need to contact us for help, advice or support.
Register a domain name - Get an instant no obligation quote to register a domain name. (With optional email or dialup access.)
Faster modem, ISDN or ADSL - Cozahost offers faster modem, ISDN and ADSL internet access at heavily discounted rates to our clients.
About web hosting - Article on how business can use a web site to gain new customers or become more competitive.
Fax to email service - Frequently asked questions about the fax to email service which allows you to receive your faxes privately, hassle free and anywhere in the world - via email

..:: Your smile for the day - Things to do when you are bored


1. At lunch time, sit in your parked car with sunglasses on and point a hair dryer at passing cars. See if they slow down.

2. Page yourself over the intercom. Don't disguise your voice.

3. Finish all your sentences with "in accordance with the prophecy."

4. Ask people what sex they are. Laugh hysterically after they answer.

5. Bark like a dog when you are in a lift with other people.

6. Put mosquito netting around your work area. Play a tape of jungle sounds all day.

7. Have your coworkers address you by your wrestling name, Rock Hard Kim.

8. When the money comes out of the ATM, scream "I won! I won! 3rd time this week!!!!!"

9. Tell your children over dinner: "Due to the economy, we are going to have to let one of you go."

10. Every time someone asks you to do something, ask if they want fries with that.
 

..:: Subscribe

If you like this newsletter, please do us a favor and ask your friends to subscribe here: http://www.cozahost.com/news/
 
..:: Goodbye! :-)


Thanks for reading this newsletter and we hope you enjoyed it!  Please contact us if you have comments, suggestions or questions - we would love to hear from you!
 

(c) Cozahost 2004, All rights reserved.

