We were promised Strong AI, but instead we got metadata analysis

How simple structured data trumps clever machine learning

a photo of a scale model of the Notre-Dame cathedral

The late nineties dream of search engines was that they would use grand-scale Artificial Intelligence to find everything, understand most of it and help us retrieve the best of it. Not much of that has really come true.

AI to rule them all! AI to find them?

Google has always performed a wide crawl of the entire web. But few webmasters are so naive as to assume their pages will be found this way. Even this website, which has fewer than 20 pages, has had problems with Google finding all of them. Relying solely on the general crawl has proved unworkable for most.

Google introduced the Sitemap standard in 2005 to allow webmasters to eliminate the confusion by just providing a list of all their pages. Most websites now provide sitemap files instead of relying on the general crawl.

A sitemap file is, in short, a big XML file full of links to your site's pages. I think it says something that, even with this seemingly foolproof data interchange format, Google still have to provide tooling to help webmasters debug issues. That said, it's a huge improvement compared to trying to riddle out why their general crawl did or did not find certain pages. Or found them multiple times.
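
To give a feel for how little there is to the format, here is a minimal sketch in Python that writes and then reads back a sitemap using only the standard library. The page URLs are made up, the helper names build_sitemap and read_sitemap are my own, and a real sitemap may carry optional fields such as lastmod that this ignores.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def read_sitemap(xml_text):
    """Return the list of page URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]


# Hypothetical pages on a hypothetical site.
pages = ["https://example.com/", "https://example.com/about"]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
print(read_sitemap(sitemap_xml))
```

That is more or less the whole idea: a list of links, wrapped in angle brackets.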

AI to mine them?

After a search engine finds a page the next step is to read it and understand it. How well does this work in practice? Again, relatively few websites expect Google to manage this on their own. Instead they provide copious metadata to help Google understand what a page is about and how it sits relative to other pages.

Google gave up at some point trying to work out which of two similar pages is the original. Instead there is now a piece of metadata which you add to let Google know which page is the "canonical" version. This is so they know which one to put in the search results, for example, and don't wrongly divvy up one page's "link juice" into multiple buckets.
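
The canonical hint is just a link tag in the page head, so reading it takes very little code. The sketch below uses Python's built-in HTML parser. The page and its URL are invented for illustration, and it makes no claim about how Google itself processes the tag.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared by a page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


# A hypothetical printer-friendly duplicate pointing at its original.
page = """
<html><head>
  <title>Blue widgets (printer friendly)</title>
  <link rel="canonical" href="https://example.com/widgets/blue">
</head><body>...</body></html>
"""
print(find_canonical(page))
```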

Google also gave up trying to divine who the author is. While Google+ was a goer, they tried to encourage webmasters to attach metadata referring to the author's Google+ profile. Now that Google+ has been abandoned they instead read metadata from Facebook's OpenGraph specification, particularly for things other than the main set of Google search results (for example in the news stories they show to Android users). For other data they parse JSON-LD metadata tags, "microformats" and probably much more.
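
JSON-LD in particular is about as undemanding as metadata gets: a script tag containing a JSON object. As a rough sketch, assuming a page that embeds a schema.org Article (the headline and author below are invented), pulling out the author looks something like this:

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every JSON-LD script block on a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than fail the whole page.


def extract_jsonld(html_text):
    """Return a list of the JSON-LD objects embedded in the page."""
    collector = JsonLdCollector()
    collector.feed(html_text)
    return collector.items


# A hypothetical article page; the names and values are made up.
page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "An example headline",
 "author": {"@type": "Person", "name": "Jane Example"}}
</script>
</head><body>...</body></html>
"""
items = extract_jsonld(page)
print(items[0]["author"]["name"])
```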

Google doesn't just search web documents, they also have a product search, Google Shopping (originally "Froogle"). How does Google deduce the product data for an item from the product description page? This is, after all, a really hard AI problem. The answer is that they simply don't - they require sellers to provide that information in a structured format, ready for them to consume.

Google of course do do text analysis, as they have always done, but it's often forgotten that their original leg up over other search engines was not better natural language processing but a metadata trick: using backlinks as a proxy for notability. The process is detailed in the original academic paper and in the PageRank paper.

Backlink analysis was a huge step forward, but PageRank is not about understanding what is on the page and indeed early on Google returned pages in the search results that it had not yet even downloaded. Instead PageRank judges the merit of a page based on what other pages link to it. That is, based on metadata.
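
To make the point concrete, here is a toy version of the idea in Python. It is a sketch of the commonly taught, normalised form of the calculation (power iteration with a damping factor of 0.85), not a reproduction of whatever Google actually ran, and the four-page link graph is invented. Note that the function never looks at a single word of page content, only at who links to whom.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this example "c" comes out on top because everything links to it, and "a" comes second purely because "c" vouches for it.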

...and in the darkness, combine them?

And how well, after all this, does the Artificial Intelligence do at coming up with the relevant documents in response to search queries? Not so well that showing structured data lifted from Wikipedia's infoboxes on the right hand side wasn't a major improvement. So many searches are now resolved by the "sidebar" and "zero click results" that traffic to Wikipedia has materially fallen.

The remaining search results themselves are increasingly troubled. My own personal experience is that they are now often made up of superficial commercial "content" from sites that are experts in setting their page metadata correctly and the other dark arts required to exploit the latest revision of Google's algorithm. There's also a huge number of adverts.

Perhaps the best measure of this problem is how often I have to append the search terms "reddit" or "site:reddit.com" to a query. Increasingly this is the only way to find the opinions of people who aren't being paid to give them. I do wonder why Reddit never seems to rank particularly well for any keyword that commercial "content sites" cover.

Perhaps the bigger illusion is that when you search with Google you are somehow searching the sum total of human knowledge. Of course, you aren't. The accumulated knowledge of human civilisation is still mostly in books. Humanity wrote books for thousands of years and has only written web pages for a few decades. When you search, you are really just searching the sum total of things that people have put, and managed to keep, on the web since about 1995. Perhaps this is one reason why commercial "content sites" appear often in searches: they put a lot of stuff on the web.

Metadata tends to displace Artificial Intelligence

The phenomenon of metadata replacing AI isn't just limited to web search. Manually attached metadata trumps machine learning in many fields once they mature - especially in fields where progress is faster than it is in internet search engines.

When your elected government snoops on you, they famously prefer the metadata of who you emailed, phoned or chatted with over the content of the messages themselves. It seems to be much more tractable to flag people of interest to the security services based on who their friends are and what websites they visit than to do clever AI on the messages they send. Once they're flagged, a human can always read their email anyway.

There are woolly intimations that self-driving cars will read road signs to work out what the speed limit is for any stretch of road but the truth seems to be that they use the current GPS co-ordinates to access manually entered data on speed limits. You can live in the future right now, if you use the right mobile app as your satnav.

One of the earliest commercial applications of neural nets was to detect fraudulent credit card transactions. The neural nets worked very well, but not well enough to not be a nuisance, locking you out of your account when you went on holiday or bought a coffee in a new place. American Express now use the combination of a cardholder provided whitelist of merchants and text message codes in preference to allowing the AI models to run free.

A general pattern seems to be that Artificial Intelligence is used when first doing some new thing. Then, once the value of doing that thing is established, society will find a way to provide the necessary data in a machine readable format, obviating (and improving on) the AI models.

I'm sure there's someone out there working tirelessly to perfect all the disparate technologies - computer vision, control systems, depth perception, etc - required in order for a Tesla to successfully navigate a McDonald's drive through. Just as they get it sorted and demonstrate its utility, McDonald's will probably just calculate and provide those routes as public information. After all, why bother with the maths and machine vision when you can just write it down in an XML file?

Metadata conflict

Of course, this all only works when you can trust that the metadata is right. That's not always the case and this is the primary reason why Google no longer indexes meta description strings. Those dastardly webmasters keep entering lies!

But you don't always have to use metadata from the owner of a thing. The metadata might be provided by some neutral third party, as a matter of public record or just the accumulated weight of numerous uncorrelated data points. This is what happens when Google shows Wikipedia data on search engine result pages. Or business addresses. It's also how PageRank works.

The virtues of metadata

Google never publish what they have inferred about a web page with their clever AI techniques. Even webmasters are only given access to a very small portion of the data about their own sites to allow them to debug issues. The whole system is stunningly opaque.

The best argument for metadata is that it's open and there for anyone to read. Anyone who wants to can easily write a parser for the OpenGraph tags. They don't need gads of AI models or cloud computing or whatever to understand something simple about a web page.
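
As a rough demonstration of how easy, here is a small OpenGraph reader built on Python's standard library HTML parser. The sample page is invented, the class and function names are my own, and for simplicity it keeps only the first value when a property is repeated.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from a page's <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        content = attributes.get("content")
        if name.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(name, content)


def read_opengraph(html_text):
    """Return a dict of the OpenGraph properties declared by a page."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


# A hypothetical page head; the values are made up.
page = """
<html><head>
  <meta property="og:title" content="An example article" />
  <meta property="og:type" content="article" />
  <meta property="og:url" content="https://example.com/article" />
  <meta name="viewport" content="width=device-width" />
</head><body>...</body></html>
"""
for name, value in read_opengraph(page).items():
    print(name, "=", value)
```

No models, no training data and no cloud bill: a couple of dozen lines and you know the title, type and address of the page.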

It's important, though, that the metadata sits on or near the thing itself, and that if it doesn't, that there isn't a requirement for lots of interaction or co-operation to get it. Having to plead for access to or pay for metadata usually ends up empowering monopolies or creating needless data middlemen (who drone on and on about how "data is the new oil"). At best it creates little barriers to getting started. Finance in particular is riddled with this problem.

The vices of the AI myth

Google themselves say loudly and often that webmasters should "forget metadata and focus on content". This feeds into the Google mythos that they have some godlike power to algorithmically understand web pages. It also misleads the public into thinking that metadata is somehow ancillary and that search engines will work it all out on their own. This discourages webmasters from bothering with the basic things that will help people discover their pages, like OpenGraph tags or Twitter cards. The enormous number of people with "SEO" as their job title really should put the lie to the idea that metadata doesn't matter and that Google is a fair system.

Over-confidence in either an extant (but mysterious) or forthcoming piece of Artificial Intelligence often discourages people from seeking out simpler solutions. You feel like an idiot suggesting something as diminutive as an XML tag when others make wild (and wildly confident) claims about what the burgeoning Strong AI will achieve. After all, with all these recaptchas I'm filling out, the machines must be getting really good at recognising palm trees.

But "machine readable" strictly dominates machine learning. And worse yet for the data scientists, as soon as they establish the viability of doing something new with a computer, people will rush to apply metadata to make the process more reliable and explainable. An ounce of markup saves a pound of TensorFlow.


See also

Peter Thiel writes fairly convincingly in a chapter of his book about how humans will work together with machine learning for a long time. It's just a shame he's talking about surveillance of the public by their local council.

Thiel is also the source of the "We wanted flying cars, but instead we got 140 characters" quote which has long since been memed into oblivion by its send up: "We wanted flying cars but all we got were pocket-sized black squares holding all of human knowledge". If only it were true.

Cory Doctorow wrote an article titled Metacrap long ago. I think it's a great argument against being too ambitious about the possibilities of metadata (which the Semantic Web people were) but it does conclude with the thought that on-page metadata is a fundamentally good idea.

Larry Page and Sergey Brin were originally pretty negative about search engines that sold ads. Appendix A in their original paper says:

we expect that advertising-funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers

and that

we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm

Another blog post could be written on the incredible growth of the other kind of web metadata: that present for security reasons. X-Content-Type-Options, X-Frame-Options and X-XSS-Protection are all pretty baffling and probably mostly mis-set or ignored. How many sites set Content-Security-Policy correctly, or at all? If you're interested in this, I highly recommend the book The Tangled Web. Even if it is slowly getting dated it remains a great source of intuition on all the things that can go wrong in network protocols and "sandboxed" code execution.
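
For the curious, checking whether a response sets these headers at all is simple enough. The sketch below takes a dict of response headers (however you fetched them) and reports which ones from a short checklist are absent. The checklist is my own illustrative pick from the headers named above, not an authoritative recommendation, and being present is a long way from being set correctly.

```python
# Illustrative checklist only; which headers a site needs is a judgment call.
CHECKLIST = (
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
)


def missing_security_headers(headers):
    """Return the checklist headers that are absent from a response.

    headers is a mapping of header name to value. Header names are
    compared case-insensitively, as HTTP header names are.
    """
    present = {name.lower() for name in headers}
    return [name for name in CHECKLIST if name.lower() not in present]


# Hypothetical response headers from a hypothetical site.
response_headers = {
    "content-type": "text/html; charset=utf-8",
    "content-security-policy": "default-src 'self'",
    "X-Frame-Options": "DENY",
}
print(missing_security_headers(response_headers))
```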