It is very important to know how something works before you can manipulate it for your own benefit. The same applies to Google: you can't manipulate your SERPs effectively unless you know what is really going on. That is why I have put together this post to explain what actually happens when Google indexes a site.
The first step involves Google finding your page, which only happens when some site links to yours or when you 'suggest' the URL to Google yourself. I would go with the first option to get my site indexed, because the second one can take weeks. That said, due to the excessive amount of 'SPLOGS' (Web 2.0 properties filled with spun gibberish) getting indexed, Google has become a bitch when it comes to indexing new sites unless you have some really strong links pointing to them. With normal social bookmarking links, my site took more than 7-8 days to get indexed, even though Googlebot was visiting it right from day 1.
The second step involves fetching and processing the data on the web page. I am pretty sure Google knows which content platforms it is dealing with (at least the popular ones); it knows when a site is a blog or a forum. The crawler first fetches the HTML code for the site. The next step is to get rid of all the script tags and break the whole thing down into sheer links and content.
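To make that step concrete, here is a minimal sketch of how you could strip script tags and split a page into plain text and outgoing links using Python's standard `html.parser`. This is purely illustrative; Google's actual parser is far more sophisticated, and the sample HTML is made up.

```python
# Toy sketch of the "fetch and break down" step: ignore <script>/<style>
# content, then separate the page into visible text and outgoing links.
from html.parser import HTMLParser

class PageBreaker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []        # every href found on the page
        self.text = []         # visible text chunks
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_script = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.text.append(data.strip())

html = ('<html><head><script>var x=1;</script></head>'
        '<body><h1>Hello</h1><a href="/about">About us</a></body></html>')
p = PageBreaker()
p.feed(html)
print(p.links)   # ['/about']
print(p.text)    # ['Hello', 'About us']
```

Notice that the script content (`var x=1;`) never makes it into the extracted text, which is exactly the "get rid of all the script tags" step described above.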
Let's talk about the content first. The text is first purified by removing stop words like 'a', 'of', 'is', and other words that don't really add any value to the text. The next step is to assign a pointer to each piece of text, which gives the mainframe an idea of where each keyword was during the final processing of the data. For example, one particular keyword was found in the title, another was found in the <h1> tag. Basically, it's similar to assigning coordinates on a globe. Once this processing is done, the whole data set is relayed back to the mainframe for further processing.
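A toy version of that text-processing step might look like this: drop the stop words, then record which part of the page (title, h1, body) each remaining keyword appeared in. The stop-word list and field names here are my own simplification, not Google's actual ones.

```python
# Toy sketch: purify text by dropping stop words, then map each keyword
# to the page fields it appeared in (its "coordinates" on the page).
STOP_WORDS = {"a", "an", "of", "is", "the", "and", "to"}

def index_fields(fields):
    """fields: dict mapping a field name ('title', 'h1', 'body') to its
    raw text. Returns keyword -> list of fields containing it."""
    index = {}
    for field, text in fields.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue               # purify: strip low-value words
            index.setdefault(word, []).append(field)
    return index

page = {
    "title": "The history of tea",
    "h1": "Tea history",
    "body": "Tea is a drink",
}
print(index_fields(page))
# 'tea' appears in ['title', 'h1', 'body'], 'history' in ['title', 'h1']
```

A keyword found in the title or <h1> would presumably be weighted more heavily later on, which is why recording these positions matters.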
Through the links on the homepage it just fetched, Googlebot crawls further into the site, repeats the same process, and also measures the depth of its crawl. By the end of the crawling process, Google knows what your pages are about and which pages link to which. Google also notes down the date on which each page was indexed.
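The "crawl deeper and measure depth" idea can be sketched as a breadth-first walk over the site's link graph. To keep it self-contained I use a hypothetical in-memory link graph instead of live HTTP fetches.

```python
# Toy sketch: breadth-first crawl from the homepage, recording how many
# clicks deep each page sits (its crawl depth).
from collections import deque

site = {                      # page -> pages it links to (made-up site)
    "/": ["/blog", "/about"],
    "/blog": ["/blog/post-1"],
    "/about": [],
    "/blog/post-1": ["/"],
}

def crawl(start):
    """Return a dict mapping each reachable page to its depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in site.get(page, []):
            if link not in depth:          # don't re-crawl known pages
                depth[link] = depth[page] + 1
                queue.append(link)
    return depth

print(crawl("/"))
# {'/': 0, '/blog': 1, '/about': 1, '/blog/post-1': 2}
```

This also shows why flat site structures tend to get crawled faster: pages buried many links deep are simply reached later.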
This on-page data plays a major role in the calculation of your SERPs.
Another important parameter to talk about here is the crawl rate. It can be tracked from 'Google Webmasters' or the 'Awstats' tool available in cPanel (I personally prefer Google Webmasters). It basically means the number of times Googlebot comes to your site to check for new content.
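If you'd rather eyeball the crawl rate yourself, you can count Googlebot hits per day straight from your server's access log, which is roughly what Awstats summarises for you. The log lines below are made up for illustration, and real Googlebot detection should also verify the IP, since the user-agent string is trivially faked.

```python
# Rough sketch: count Googlebot visits per day from Apache-style
# access log lines (sample lines are fabricated for this example).
from collections import Counter

log_lines = [
    '66.249.66.1 - - [10/Mar/2011:06:25:24 +0000] "GET / HTTP/1.1" 200 Googlebot',
    '66.249.66.1 - - [10/Mar/2011:09:12:01 +0000] "GET /blog HTTP/1.1" 200 Googlebot',
    '10.0.0.5 - - [10/Mar/2011:09:13:44 +0000] "GET / HTTP/1.1" 200 Firefox',
    '66.249.66.1 - - [11/Mar/2011:02:05:19 +0000] "GET / HTTP/1.1" 200 Googlebot',
]

def googlebot_hits_per_day(lines):
    hits = Counter()
    for line in lines:
        if "Googlebot" in line:
            # timestamp looks like [10/Mar/2011:06:25:24 +0000]
            day = line.split("[", 1)[1].split(":", 1)[0]
            hits[day] += 1
    return dict(hits)

print(googlebot_hits_per_day(log_lines))
# {'10/Mar/2011': 2, '11/Mar/2011': 1}
```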
I see crawl rate as dependent on 2 main factors: the authority of your site and how often you update it. Big newspaper sites get indexed in a matter of minutes, whereas for normal sites it takes a couple of hours. It's also been observed that the crawl rate for sites that update their content frequently is way higher than for sites that update once a week or month.
We could go deeper into the whole indexing process, but that would require deep knowledge of logic and maths. So I guess this much is enough for today.
PS: Do let me know what you think about the post!