
Please Feed the Spiders

I think we all forget just how amazing the Internet is. In the developed world, instant access to information is almost ubiquitous, and it is easy to take it for granted.

On 6 August 1991 Tim Berners-Lee made the World Wide Web publicly available. Since then it has transformed the way we live and work, but an ocean of information isn’t much use without a way to discover what is in it. Search engines like Google and Bing are the tools that have helped to tame this ocean and find the nuggets of information that we actually want.

Spiders and The Long Tail

Take a look at the log files of a web server and you’ll see vast numbers of requests made by spiders and bots as they traverse sites, following links and indexing the information that they find on each page.

Every website owner wants to drive visitors to their website, and the best way to ensure visitors is to make sure that your site is regularly indexed by search engines. If a page doesn’t get visited by a search engine spider and its content doesn’t get indexed, the page might as well not exist. If a tree falls in a forest and no one is around to hear it, does it make a sound?

Search engines make sense of the huge amounts of unstructured content that makes up the majority of the internet. In an effort to extract and organise more information, standards like schema.org are becoming more and more important – it is in the interests of website owners to apply these standards on their websites because at the end of the day they want traffic.

Most people working with the web are aware of the idea, popularised by Chris Anderson in his 2006 book ‘The Long Tail’, that there are a lot of business opportunities serving the needs of people with minority interests or looking for niche products in the tail. About half the world’s population now have some access to the internet, so even a niche interest presents vast opportunities.

The problem is that in order to make use of the long tail, your information needs to be found – in the internet age the gatekeepers of knowledge are the big search engines like Google and Bing. Search engines provide their services to us, the end users, for free, but these services are not free: they cost an enormous amount of money to operate and are ultimately funded by advertising.

Spidering and then indexing a page costs money. It takes time and it uses energy – in fact it’s estimated that in the US data centres account for approximately 2% of energy usage. The millions of servers owned by Google and Microsoft need to be built and managed, as do the servers of the websites being spidered. There is bandwidth to be paid for… the list goes on and on. It’s hard to get an exact cost but we can get glimpses:

In terms of greenhouse gases, one Google search is equivalent to about 0.2 grams of CO2. The current EU standard for tailpipe emissions calls for 140 grams of CO2 per kilometer driven, but most cars don’t reach that level yet. Thus, the average car driven for one kilometer (0.6 miles for those in the U.S.) produces as many greenhouse gases as a thousand Google searches.

Because searching and indexing does cost money, not all content actually does get indexed – it is clearly in the interest of a search engine to focus on the most popular content, the content that will appear in the most searches and generate the most advertising revenue, so pages in the long tail are less likely to be indexed or are indexed less frequently.

As an internet user I want searches that return exactly what I’m looking for. As somebody responsible for websites I want my end users to find exactly what they are looking for as easily as possible.

Spidering is a mechanism for making sense of large amounts of unstructured data, but it doesn’t work so well with highly structured data with lots of variations and filters.

The problem with filters

This is not a problem about having too little information, the problem is that there is too much to organise.

It is easy to build a navigation structure that will generate pages for as many variations and filters as necessary, and of course a dynamically generated web page doesn’t use any resources until it is requested. The trouble is that spiders will not index all your pages. For want of a better way to describe it, spiders get bored, and you run into esoteric issues like crawl budgets, faceted navigation and infinite spaces – all clever ways of acknowledging that indexing costs money. The practical manifestation of this is that filters often get ignored.

It’s easy to demonstrate just how quickly dynamically generated pages add up and therefore why this is a problem.

Imagine a website with a search facility that allows you to filter your results. A good example is a website that relates to the physical world and contains information about real things with attributes or properties – the facilities at a leisure centre or the type of food served in a restaurant.

You can choose to filter by any, all or none of these properties. Each of these filter combinations is a landing page, and would ideally be the page that appears at the top of Google’s or Bing’s search results when somebody searches for something specific like ‘Vegan restaurant near Luton’, for example. At the moment some landing pages will have been indexed and others not – the experience can be a bit hit and miss. The point is, if an end user is looking for something specific (rather than just restaurants), then just providing a list of restaurants is an unsatisfying experience.

Of course it is easy enough to implement filtering on your own website and there are some great examples like the pub search facility on Useyourlocal.com, but how much better would it be to always have the opportunity to go directly to an exact landing page from the search engine results page?

Pascal’s Triangle and calculating combinations

Let’s think about how quickly combinations grow in size. If we have 2 properties to filter by [ a ] and [ b ] there are 4 possible combinations:

[  ] (no filters)

[ a ] (just a)

[ b ] (just b)

[ a b ] (a and b)

If we have 3 properties there are 8 possible combinations:

[  ]

[ a ]

[ b ]

[ c ]

[ a b ]

[ a c ]

[ b c ]

[ a b c ]

If we have 4 properties we get 16 possible combinations:

[  ]

[ a ]

[ b ]

[ c ]

[ d ]

[ a b ]

[ a c ]

[ a d ]

[ b c ]

[ b d ]

[ c d ]

[ a b c ]

[ a b d ]

[ a c d ]

[ b c d ]

[ a b c d ]

Each time we add another property, the number of possible combinations doubles: 6 properties give 64 possible combinations, 8 properties give 256, and so on.
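You don’t have to take the doubling on trust – a few lines of Python will enumerate every combination for you (a throwaway sketch using the same placeholder property names as above):

from itertools import combinations

def all_filter_combinations(properties):
    # Every possible selection of filters, from 'no filters' up to 'all of them'.
    combos = []
    for size in range(len(properties) + 1):
        combos.extend(combinations(properties, size))
    return combos

for props in (["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]):
    print(len(props), "properties ->", len(all_filter_combinations(props)), "combinations")

# 2 properties -> 4 combinations
# 3 properties -> 8 combinations
# 4 properties -> 16 combinations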

Rather than typing out all the possible combinations you can use Pascal’s Triangle.

                       Item count
Row    0    1    2    3    4    5    6    7    8     Total
  0    1                                                 1
  1    1    1                                            2
  2    1    2    1                                       4
  3    1    3    3    1                                  8
  4    1    4    6    4    1                            16
  5    1    5   10   10    5    1                       32
  6    1    6   15   20   15    6    1                  64
  7    1    7   21   35   35   21    7    1            128
  8    1    8   28   56   70   56   28    8    1       256

To calculate the possible number of non-repeating combinations of items:

  • On the Y axis, go to the row equivalent to the number of items you have (the first row of the triangle is row 0)
  • On the X axis, each column corresponds to the number of items chosen (again starting at 0), and the value in that cell is the number of ways of choosing them

So confirming what we previously worked out, the number of unique combinations of 3 items:

1 + 3 + 3 + 1 = 8

1 combination of 0 items [ ]

3 combinations of 1 item [ a ] [ b ] [ c ]

3 combinations of 2 items [ a b ] [ a c ] [ b c ]

1 combination of 3 items [ a b c ]
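The same numbers drop straight out of the binomial coefficient, so you don’t need to write the triangle out by hand – Python’s math.comb (Python 3.8+) reproduces any row and its total directly:

from math import comb

def pascal_row(n):
    # Row n of Pascal's Triangle: the number of ways to choose 0, 1, ... n items.
    return [comb(n, k) for k in range(n + 1)]

print(pascal_row(3))       # [1, 3, 3, 1]
print(sum(pascal_row(3)))  # 8
print(sum(pascal_row(6)))  # 64
print(sum(pascal_row(8)))  # 256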

As you can see the number of possible pages soon becomes vast.

Imagine you have a website for cafes and restaurants. There are approximately 45,000 towns and villages in the UK. For the sake of argument let’s say that 20,000 of those have a cafe or restaurant. Imagine that we store information about the following properties or facilities and allow users to filter by them:

[ a ] Free parking

[ b ] Air conditioning

[ c ] Vegan

[ d ] Organic

[ e ] Fair trade

[ f ] Gluten Free

For 6 filters, we need to look at row 6: 1 + 6 + 15 + 20 + 15 + 6 + 1 = 64

We have 64 possible combinations of filters.

We can easily create a system that allows us to filter by location and optionally by one or more properties (e.g. Free Parking). With 6 properties and 20,000 locations we now have 64 x 20,000 = 1,280,000 possible location filter pages (excluding pagination and the 20,000 potential detail pages themselves). It is unlikely that all these pages will ever be indexed, and if they are it certainly won’t be regularly.

If we decide to restrict ourselves to only a single filter and then look at row 6, we can see that the new number of possible combinations is 1 + 6 = 7 (1 combination of no filter chosen, and 6 possible combinations of just a single filter). This is much more manageable but still results in 140,000 landing pages.
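Putting rough numbers on both scenarios (using the 20,000 locations and six properties assumed above):

from math import comb

locations = 20_000   # towns and villages assumed to have a cafe or restaurant
properties = 6       # free parking, air conditioning, vegan, organic, fair trade, gluten free

all_combinations = 2 ** properties             # 64: any subset of the six filters
at_most_one_filter = 1 + comb(properties, 1)   # 7: no filter, or exactly one filter

print(locations * all_combinations)    # 1280000 possible location filter pages
print(locations * at_most_one_filter)  # 140000 landing pages if only one filter is allowed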

It is easy to say that this is a contrived example, and to an extent it is – but if that information is available, why shouldn’t it be possible to have that information appear directly in search results?

Even if a spider did index all 1.3 million pages it would use a lot of resources (time and energy), and a lot of bandwidth would be needlessly consumed (and paid for) – the average web page today is comfortably over 2MB. On the other hand you would be improving the experience of many, many users – individually each search result may only be of interest to a handful of people, but in the vast spaces of the long tail the small numbers add up and soon become huge.

A solution. Don’t pull, push

Instead of waiting for a spider to come along and slowly crawl millions of pages, for highly structured data it would be far more efficient either to push the data to search engines or to supply it on demand in a single chunk.

The data for all 20,000 of our hypothetical cafes and restaurants, including addresses and descriptions, could probably fit into a single 2MB file in a text format like JSON, CSV or XML.

In order to implement this we need three things: first, our structured data; second, a schema that defines the data and how to construct the URL for any possible landing page; and third, a way to get the data to the search engine.
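To make that a little more concrete, here is a hypothetical sketch of what one feed record and the URL-construction rule might look like – the field names, URL pattern and domain are invented for illustration, not an existing standard:

import json

# A hypothetical feed record for one venue; the accompanying schema would define the fields.
record = {
    "name": "The Corner Cafe",
    "location": "luton",
    "description": "Family-run cafe in the centre of Luton.",
    "filters": ["vegan", "gluten-free", "free-parking"],
}

def landing_page_url(location, filters):
    # An invented convention: /search/<location>/<filters sorted and joined with '+'>.
    # The schema supplied with the feed would spell out this rule for the search engine.
    return f"https://example-cafes.test/search/{location}/{'+'.join(sorted(filters))}"

print(landing_page_url(record["location"], ["vegan"]))
# https://example-cafes.test/search/luton/vegan
print(json.dumps(record))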

The necessary technology already exists and there are precedents for this push approach already. From time to time we hear about deals like the 2015 deal between Google and Twitter that enabled tweets to instantly appear in Google’s Search results. Far more common are examples like Google Product Feeds that let shop owners upload or provide feeds of product data for Google Shopping searches.

Google Search already makes use of structured data (though the data is all harvested by Googlebot), and with the Sitelinks Searchbox it can send searches directly to search results pages on websites. The OpenAPI / Swagger specification is used to define REST APIs, but could equally be used to define the filters and URL structure of a search results page.
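The Sitelinks Searchbox is a useful illustration of the general idea: a site publishes a small piece of schema.org JSON-LD telling Google how to construct a search URL. Roughly, the markup has the shape below (built with Python here only to keep the sketches in one language; example.com is a placeholder):

import json

# schema.org WebSite / SearchAction markup of the kind used for the Sitelinks Searchbox.
sitelinks_searchbox = {
    "@context": "https://schema.org",
    "@type": "WebSite",
    "url": "https://example.com/",
    "potentialAction": {
        "@type": "SearchAction",
        "target": "https://example.com/search?q={search_term_string}",
        "query-input": "required name=search_term_string",
    },
}

# The resulting JSON-LD is embedded in the page inside a <script type="application/ld+json"> tag.
print(json.dumps(sitelinks_searchbox, indent=2))

A schema that described filter properties as well as a plain search term would only be a small step beyond this.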

So, when shall we get started?

This article was originally published at www.ab-uk.com

Meta description is the new black

Meta description is one of those areas of SEO that seem to drift in and out of popular attention. I don’t think it has ever made sense to ignore the value of meta description, although I have rarely given it the love it deserves.

Right now is a good time to think about this tag, and think about it carefully. The meta description tag is frequently overlooked for a number of reasons: it is old and seems to have been around forever, it is not perceived as exciting, and it is not a quick fix – writing a good description takes both thought and time. The ubiquity of content management systems does nothing to help either – metadata is too often left to be generated automatically, or somehow just gets forgotten, buried under a backwater tab or lost at the bottom of a rarely used menu.

Just imagine

Everything has slotted into place: your site now has optimal URLs, the content is good (but still getting better, of course), and your page titles are spot on. Your site has reached page 1 of Google for the terms you are focussing on… everything is going to be rosy and you can start thinking seriously about your new improved life spent sipping cocktails on the beach whilst your website just hums away.

Wait a minute, there’s a problem. People see your site in the search results, but nobody clicks through. This is where meta description comes in. Think of everything you have done so far as setting the stage, getting your product into that shop front in the prime position on the high street. The trouble is your shop window just doesn’t appeal and nobody bothers to come in.

The shop window

Meta description is your shop window: it is a key component of the snippets that Google shows on its results page. There is no guarantee that Google will use your description word for word, but the chances are it will use at least part of it.

As usual, Webmaster Tools is a great reference (http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35624) and it pays to re-read the published advice from time to time.

Write for people

Too many people just use the meta description as an opportunity to place keywords and key phrases. There is clearly a place for this, but it’s no good just targeting machines. If the snippet that appears in the search results is your shop window, then it is also your chance to engage the viewer.

Don’t just treat your description as an opportunity to get one up on the system; treat it as a ‘call to action’ – grab the viewer’s attention and make sure they want to click through to your website.
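If you want a quick sanity check while writing, a few lines of Python will flag descriptions that are likely to be cut short in the snippet – the 160-character figure is just a commonly cited rule of thumb, not a limit Google publishes or guarantees, and the URLs and descriptions here are made up:

# Rough snippet-length check; ~160 characters is a rule of thumb, not a hard limit.
MAX_SNIPPET_CHARS = 160

descriptions = {
    "/": "Family-run cafe in the centre of town serving vegan and gluten-free food, "
         "with free parking right outside the door. Book a table online.",
    "/menu": "Our full menu, updated every season.",
}

for url, text in descriptions.items():
    status = "ok" if len(text) <= MAX_SNIPPET_CHARS else "likely truncated"
    print(f"{url}: {len(text)} characters ({status})")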


Speed matters

To my unending amazement, many people I’ve worked with simply don’t believe that speed matters as a factor in Search Engine rankings. I don’t understand why. It’s not like Google (and let’s face it, we are talking about Google) are being coy about it (see the links at the end of this post).

When you work in an office with stupidly fast broadband, maybe you just forget what a slow connection feels like – but you shouldn’t assume that the people looking at your websites have the same luxury.

The bit that really bugs me is this peculiar ‘head in the sand’ attitude. It doesn’t take more than a few minutes of thinking to come up with any number of plausible reasons why speed is important for Search Engine Optimisation (aside from the fact that Google says it is). Try it – take a few seconds to think.

For me, though, the reasons that scream out are user experience and money.

User experience

First off – I admit this isn’t really an SEO factor, is it? People are busy, they don’t want to spend valuable time waiting for a web page to load, at least I know I don’t. I have better things to do, and I’m not that weird.

But why would Google care? I can rationalise that they want people to look at Ads. And then click on them. Probably a good idea to make websites faster so people don’t get bored and do something else, or find the bit of content they are looking for before the ads have loaded (because the ads come last).

Time = Money

Google must index billions of pages per day. It stands to reason that the faster a site loads, the faster it can be indexed. I’m guessing that it costs about the same to run a server whether it is indexing a million pages per day or 10 million – but if your business is based on data, the faster you can chew through it the better – and therefore it also makes sense for Google to promote an internet that lets them index more, faster.

But we don’t have the time, it’s too hard…

No it isn’t. Of course, as with any work there is an optimum point in terms of cost vs. benefit (and that will be different for every site).

My gut feeling (and I don’t have any firm evidence for this) is that speed becomes more important the more traffic you get. If your site gets 50 visitors per day, then in a sense it is what it is – what you need to focus on is link building and content. However, if your site gets 5000 visitors per day, then the chances are you’ve got quality links and good content – so how do you improve? I think the answer is speed and user experience.

A fast website either means no traffic, no load on your server, no pictures, no JavaScript – in short, a bit of old school HTML text… or, if it is a big site, it means you care. It means that you have taken the time to think about users and spent a bit of money on hardware or code. It means that you are a quality business, and therefore it is a good bet that you have better quality data on your site. It means that people will pay more to be associated with your site.

How to make your site fast

Hell, that’s a bold statement. It’s different for everybody. Think about the usual suspects: too much JavaScript, un-optimised images, sloooow queries with no indexes.

There are a myriad of solutions, and you don’t have to have Prototype, jQuery and MooTools all running on the same page. Choose a single framework at a time – you will be amazed.

  • Use a CDN.
  • Use a cache server like Varnish.
  • If you’re a bit lazy – try out mod_pagespeed (sometimes it’s fab, sometimes it’ll melt your server)
  • Be brave, leave the Apache comfort zone and use Nginx or Lighttpd
  • Use minify

Fringe benefits

Chances are you’ll end up using some kind of caching and you’ll start setting sensible expires headers for your static content. Just think: you’ll never have to tell somebody at the end of the phone to clear their browser cache again. And remember, the client is not an idiot just because their browser hasn’t magically detected that a style sheet or an image has changed and the website they have paid for looks a bit rubbish.
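One simple way to have both long expiry times and no ‘please clear your cache’ phone calls is to put a short content hash in each static file’s URL, so the URL changes whenever the file does. A minimal sketch (the paths are placeholders):

import hashlib
from pathlib import Path

def versioned_url(path):
    # Append a short hash of the file's contents to its URL, so browsers holding a
    # far-future expires header still fetch a new copy when the file actually changes.
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()[:8]
    return f"/static/{Path(path).name}?v={digest}"

# e.g. <link rel="stylesheet" href="/static/site.css?v=3f2a9c1b">
print(versioned_url("static/site.css"))  # placeholder path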

(by the way I don’t always practice what I’ve just preached, but at least I feel bad about it)

http://googlewebmastercentral.blogspot.com/2010/04/using-site-speed-in-web-search-ranking.html
http://www.mattcutts.com/blog/site-speed/
etc.