Photo by Rúben Marques on Unsplash

Please Feed the Spiders

I think we all forget just how amazing the Internet is. In the developed world, instant access to information is almost ubiquitous, and it is easy to take it for granted.

On 6 August 1991, Tim Berners-Lee made the World Wide Web publicly available. Since then it has transformed the way we live and work, but an ocean of information isn’t much use without a way to discover what is in it. Search engines like Google and Bing are the tools that have helped to tame this ocean and find the nuggets of information that we actually want.

Spiders and The Long Tail

Take a look at the log files of a web server and you’ll see vast numbers of requests made by spiders and bots as they traverse sites, following links and indexing the information that they find on each page.

Every website owner wants to drive visitors to their website, and the best way to ensure visitors is to make sure that your site is regularly indexed by search engines. If a page doesn’t get visited by a search engine spider and its content doesn’t get indexed, the page might as well not exist. If a tree falls in a forest and no one is around to hear it, does it make a sound?

Search engines make sense of the huge amounts of unstructured content that make up the majority of the internet. In an effort to extract and organise more information, standards like schema.org are becoming more and more important – it is in the interests of website owners to apply these standards to their websites because, at the end of the day, they want traffic.

Most people working with the web are aware of the idea, popularised by Chris Anderson in his 2006 book ‘The Long Tail’, that there are a lot of business opportunities serving the needs of people with minority interests or looking for niche products in the tail. About half the world’s population now have some access to the internet, so even a niche interest presents vast opportunities.

The problem is that in order to make use of the long tail, your information needs to be found – in the internet age the gatekeepers of knowledge are the big search engines like Google and Bing. Search engines provide their services to us, the end users, for free, but these services are not really free: they cost an enormous amount of money to operate and are ultimately funded by advertising.

Spidering and then indexing a page costs money. It takes time and it uses energy – in fact, it’s estimated that data centres account for approximately 2% of energy usage in the US. The millions of servers owned by Google and Microsoft need to be built and managed, as do the servers of the websites being spidered. There is bandwidth to be paid for… the list goes on and on. It’s hard to get an exact cost but we can get glimpses:

In terms of greenhouse gases, one Google search is equivalent to about 0.2 grams of CO2. The current EU standard for tailpipe emissions calls for 140 grams of CO2 per kilometer driven, but most cars don’t reach that level yet. Thus, the average car driven for one kilometer (0.6 miles for those in the U.S.) produces as many greenhouse gases as a thousand Google searches.

Because searching and indexing do cost money, not all content actually gets indexed. It is clearly in a search engine’s interest to focus on the most popular content – the content that will appear in the most searches and generate the most advertising revenue – so pages in the long tail are less likely to be indexed, or are indexed less frequently.

As an internet user I want searches that return exactly what I’m looking for. As somebody responsible for websites I want my end users to find exactly what they are looking for as easily as possible.

Spidering is a mechanism for making sense of large amounts of unstructured data, but it doesn’t work so well with highly structured data that has lots of variations and filters.

The problem with filters

This is not a problem of having too little information; the problem is that there is too much to organise.

It is easy to build a navigation structure that will generate pages for as many variations and filters as necessary, and of course a dynamically generated web page doesn’t use any resources until it is requested. The trouble is that spiders will not index all your pages. For want of a better way to describe it, spiders get bored, and you run into esoteric issues like Crawl Budgets, Faceted Navigation and Infinite Spaces – all clever ways of acknowledging that indexing costs money. The practical manifestation of this is that filtered pages often get ignored.

It’s easy to demonstrate just how quickly dynamically generated pages add up and therefore why this is a problem.

Imagine a website with a search facility that allows you to filter your results. A good example is a website that relates to the physical world and contains information about real things with attributes or properties – the facilities at a leisure centre, or the type of food served in a restaurant.

You can choose to filter by any, all or none of these properties. Each combination of filters is a landing page and would ideally be the page that appears at the top of Google’s or Bing’s search results when somebody searches for something specific like ‘Vegan restaurant near Luton’. At the moment such landing pages will sometimes have been indexed and sometimes not – the experience can be a bit hit and miss. The point is, if an end user is looking for something specific (rather than just restaurants), then just providing a list of restaurants is an unsatisfying experience.

Of course it is easy enough to implement filtering on your own website and there are some great examples like the pub search facility on Useyourlocal.com, but how much better would it be to always have the opportunity to go directly to an exact landing page from the search engine results page?

Pascal’s Triangle and calculating combinations

Let’s think about how quickly combinations grow in size. If we have 2 properties to filter by, [ a ] and [ b ], there are 4 possible combinations:

[  ] (no filters)

[ a ] (just a)

[ b ] (just b)

[ a b ] (a and b)

If we have 3 properties there are 8 possible combinations:

[  ]

[ a ]

[ b ]

[ c ]

[ a b ]

[ a c ]

[ b c ]

[ a b c ]

If we have 4 properties we get 16 possible combinations:

[  ]

[ a ]

[ b ]

[ c ]

[ d ]

[ a b ]

[ a c ]

[ a d ]

[ b c ]

[ b d ]

[ c d ]

[ a b c ]

[ a b d ]

[ a c d ]

[ b c d ]

[ a b c d ]

Each time we add another property, the number of possible combinations doubles, so 6 properties give 64 possible combinations, 8 properties give 256, and so on – n properties give 2^n combinations.
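
If you want to check these counts by brute force rather than listing combinations by hand, a few lines of Python will enumerate them (a throwaway sketch – the single-letter property names are just placeholders):

```python
from itertools import combinations

# Enumerate every non-repeating combination of the given properties,
# from choosing none of them up to choosing all of them.
def all_filter_combinations(properties):
    return [combo
            for r in range(len(properties) + 1)
            for combo in combinations(properties, r)]

for props in (["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]):
    print(f"{len(props)} properties -> {len(all_filter_combinations(props))} combinations")

# 2 properties -> 4 combinations
# 3 properties -> 8 combinations
# 4 properties -> 16 combinations
```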

Rather than typing out all the possible combinations, you can use Pascal’s Triangle.

Item count across the top, row number down the left:

Row         0    1    2    3    4    5    6    7    8   Total
0           1                                               1
1           1    1                                          2
2           1    2    1                                     4
3           1    3    3    1                                8
4           1    4    6    4    1                          16
5           1    5   10   10    5    1                     32
6           1    6   15   20   15    6    1                64
7           1    7   21   35   35   21    7    1          128
8           1    8   28   56   70   56   28    8    1     256

To calculate the number of possible non-repeating combinations of a set of items:

  • On the Y axis, go to the row equal to the number of items you have (the first row of the triangle is row 0)
  • On the X axis, each column gives the item count – how many items are chosen – and the entry in that column is the number of ways of choosing that many items (again, the first column is 0)

So, confirming what we worked out earlier, the number of unique combinations of 3 items is:

1 + 3 + 3 + 1 = 8

1 combination of 0 items [ ]

3 combinations of 1 item [ a ] [ b ] [ c ]

3 combinations of 2 items [ a b ] [ a c ] [ b c ]

1 combination of 3 items [ a b c ]

As you can see the number of possible pages soon becomes vast.
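
For larger numbers of properties there is no need to draw the triangle at all: Python’s math.comb (the binomial coefficient) reproduces any row and its total. A quick sketch:

```python
from math import comb

# Row n of Pascal's Triangle is C(n, 0) .. C(n, n); its sum is 2**n,
# the total number of filter combinations for n properties.
def pascal_row(n):
    return [comb(n, k) for k in range(n + 1)]

for n in (3, 6, 8):
    row = pascal_row(n)
    print(n, row, sum(row))

# 3 [1, 3, 3, 1] 8
# 6 [1, 6, 15, 20, 15, 6, 1] 64
# 8 [1, 8, 28, 56, 70, 56, 28, 8, 1] 256
```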

Imagine you have a website for cafes and restaurants. There are approximately 45,000 towns and villages in the UK. For the sake of argument, let’s say that 20,000 of those have a cafe or restaurant. Imagine that we store information about the following properties or facilities and allow users to filter by them:

[ a ] Free parking

[ b ] Air conditioning

[ c ] Vegan

[ d ] Organic

[ e ] Fair trade

[ f ] Gluten Free

For 6 filters, we need to look at row 6: 1 + 6 + 15 + 20 + 15 + 6 + 1 = 64

We have 64 possible combinations of filters.

We can easily create a system that allows us to filter by location and, optionally, by one or more properties (e.g. Free Parking). With 6 properties and 20,000 locations we now have 64 x 20,000 = 1,280,000 possible location filter pages (excluding pagination and the 20,000 potential detail pages themselves). It is unlikely that all of these pages will ever be indexed, and if they are, it certainly won’t be regularly.

If we decide to restrict ourselves to only a single filter, and then look at row 6, we can see that the new number of possible combinations is 1 + 6 = 7 (1 combination with no filter chosen, and 6 possible combinations of just a single filter). This is much more manageable but still results in 140,000 landing pages.
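
The arithmetic for the cafe example is easy to repeat in code – a throwaway sketch using the assumed figures of 20,000 locations and 6 filters:

```python
from math import comb

LOCATIONS = 20_000   # assumed number of UK towns/villages with a cafe or restaurant
FILTERS = 6          # free parking, air conditioning, vegan, organic, fair trade, gluten free

all_combinations = 2 ** FILTERS          # 64: any mix of the six filters
at_most_one      = 1 + comb(FILTERS, 1)  # 7: no filter, or exactly one filter

print("all filter combinations:", all_combinations * LOCATIONS)  # 1,280,000 landing pages
print("at most one filter:     ", at_most_one * LOCATIONS)       # 140,000 landing pages
```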

It is easy to say that this is a contrived example, and to an extent it is – but if the information is available, why shouldn’t it be possible for it to appear directly in search results?

Even if a spider did index all 1.3 million pages, it would use a lot of resources (time and energy), and a lot of bandwidth would be needlessly consumed (and paid for) – the average web page today is comfortably over 2MB. On the other hand, you would be improving the experience of many, many users – individually, each search result may only be of interest to a handful of people, but in the vast spaces of the long tail the small numbers add up and soon become huge.

A solution: don’t pull, push

Instead of waiting for a spider to come along and slowly crawl millions of pages for highly structured data, it would be far more efficient to be able either to push the data to search engines or to supply it on demand in a single chunk of data.

The data for all 20,000 of our hypothetical cafes and restaurants, including addresses and descriptions, could probably fit into a single 2MB file in a text format like JSON, CSV or XML.
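
Whether it really fits depends on how verbose each record is, but the size is easy to estimate – a rough sketch, with an entirely made-up record format:

```python
import json

# A hypothetical record -- the field names and values are illustrative only.
sample = {
    "name": "The Example Cafe",
    "address": "1 High Street, Luton, LU1 1AA",
    "filters": ["vegan", "gluten-free", "free-parking"],
    "description": "A small independent cafe serving vegan and gluten-free food.",
}

bytes_per_record = len(json.dumps(sample).encode("utf-8"))
total_mb = bytes_per_record * 20_000 / (1024 * 1024)
print(f"~{bytes_per_record} bytes per record, ~{total_mb:.1f} MB for 20,000 records")
```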

In order to implement this we need three things: first, our structured data; second, a schema that defines the data and how to construct the URL for any possible landing page; and third, a way to get the data to the search engine.
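
As a sketch of the second requirement, the schema could be as simple as a canonical URL template plus a deterministic ordering of filter slugs. Everything below (the domain, path pattern and slug names) is hypothetical, but it shows the idea:

```python
from itertools import combinations
from urllib.parse import quote

# Hypothetical filter slugs and URL pattern -- not a real site or standard.
FILTER_SLUGS = ["air-conditioning", "fair-trade", "free-parking",
                "gluten-free", "organic", "vegan"]
URL_TEMPLATE = "https://example.com/restaurants/{location}/{filters}"

def landing_page_url(location, chosen):
    """Build the canonical landing-page URL for a location and a set of filters."""
    filter_part = "+".join(sorted(chosen)) if chosen else "all"
    return URL_TEMPLATE.format(location=quote(location.lower()), filters=filter_part)

# Every landing page for one town: 2**6 = 64 URLs.
urls = [landing_page_url("Luton", combo)
        for r in range(len(FILTER_SLUGS) + 1)
        for combo in combinations(FILTER_SLUGS, r)]

print(len(urls))   # 64
print(urls[1])     # https://example.com/restaurants/luton/air-conditioning
```

A search engine that knew the template and the list of slugs could construct, and link directly to, any of the 1,280,000 landing pages without ever having to crawl them.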

The necessary technology already exists, and there are precedents for this push approach. From time to time we hear about deals like the one struck between Google and Twitter in 2015 that enabled tweets to appear instantly in Google’s search results. Far more common are examples like Google Product Feeds, which let shop owners upload or provide feeds of product data for Google Shopping searches.

Google Search already makes use of structured data (though the data is all harvested by Googlebot), and with the Sitelinks Searchbox it can send searches directly to the search results pages of individual websites. The OpenAPI / Swagger specification is used to define REST APIs, but it could equally be used to define the filters and URL structure of a search results page.

So, when shall we get started?

This article was originally published at www.ab-uk.com