Good books and Stack Overflow

Quick post, mostly to recommend two books:

  • Search Patterns: as you may know from my previous post, we are brainstorming about search at my work, and this book is a great place to get the discussion started. It is not very technical; its clear goal is to make you think about search in novel ways, not to teach you how to copy what has already been done. Great read!
  • Programming Pig: to get started with this higher-level language for Hadoop, written by one of the original engineers working on Pig at Yahoo. It is slowly getting out of date, but still a good place to learn the fundamentals of Pig.

Otherwise, after years of using it, I finally started contributing to Stack Overflow. I hadn’t realized how exciting it is to win reputation… and how competitive it gets: when a question has an easy answer, you only have a few minutes to write it!
My only concern: so far my answers have mostly been hacks (e.g. this format switcher) instead of deep, insightful comments about a language or program. But if it’s useful to someone, great! I’m glad I can finally upvote answers and write comments.

Finally, I’ve discovered Reddit AWW and can’t get enough 😉
Cheers!

Investigating Google Site Search

The company I work at is wondering how search can be improved on our site (the current solution is not state-of-the-art). To explore our options, I spent a few days building a minimum viable product (MVP) with Google Site Search (GSS). This is what I learned in the process: how to get started, how well it works, and its limitations. First takeaway: Google’s documentation is terrible!

What is Google Site Search?

Here is the website. You can think of it as the paid version of Google’s Custom Search Engine (CSE). A CSE is similar to the regular Google search, but you can specify which pages to index (or not to index), and you have some control over the ranking (e.g. by boosting some pages when they’re relevant).

How to get started?

A very simple (free) example: in CSE, create a new custom search engine (give it a cool name, like SearchMan or SuperSearch), then in ‘edit search engine’ / ‘setup’ / ‘basic’, under site to search, add: http://www.birchbox.com/ and that’s it! You have a search engine that only returns results from birchbox.com. Try it by typing ‘shampoo’ in the right panel. You can also get a public link to it where anyone can use the search engine; it looks like a modified Google homepage.

GSS uses your custom search engine; you pay to be able to query it more (and to get rid of ads and get an XML feed, more on this later). So, before continuing any further, spend time with your CSE to make sure you like the results: play with indexing only parts of your website or excluding some, specify autocompletion words, etc. Pass it to other people around your organization for feedback. If you don’t like the results, they won’t change with the paid version!

Is GSS a viable option?

I assume you have a CSE you like and you’re thinking “wouldn’t it be great if those were the results on my website, but with a different UI?”. At the same time, GSS is a paid service, as opposed to self-hosted solutions such as Solr or ElasticSearch. So… not an obvious choice! GSS requires little maintenance, it’s ready to work from the get-go, and the service should not fail. The highest published price used to be (last week) $12K/year for up to 3M queries. Now it says $2K/year for <500K queries. You have to contact them in case of more traffic. Still, assuming you have a reasonably successful website at 1M queries/year, $12K might be worth it if you factor in the extra engineering time for a good Solr implementation and the additional risks.

Now, price is not the only factor. To make GSS do everything you want, you might have to modify your frontend code (to add tagging and labels that Google crawlers will pick up). There are also other customization limits we'll discuss later. But GSS deserves an MVP!

Getting started with the XML feed

GSS comes with its own search box and search engine result page (SERP) implementations, but chances are you’ll want to fully customize those things to fit your website’s look and feel. Don’t bother looking at Google CSE themes; they won’t get you there. You want to call GSS through an API and deal with the result yourself. For that, you need to move away from basic CSE and buy the 200K queries/year option for $100 (using Google Wallet).

Don’t look at the JSON API! The pricing is different and doesn’t change if you buy GSS. There’s also a hard limit of 10K queries/day that can easily be reached on a high-traffic day.

So, XML API it is. You query your public custom search engine using this call:
http://www.google.com/cse?&cx=CSE_ID&q=shampoo&num=20&output=xml_no_dtd&client=google-csbe&gl=us
where CSE_ID is a set of digits + ‘:’ + letter/digit hash identifying your CSE. You can find it in your public URL. Here is a sample result.
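
As a rough illustration, the call above can be assembled in Python with the standard library so that the query is always percent-encoded. This is only a sketch: the function name build_gss_url and its argument names are mine, the parameters are simply the ones from the URL above, and "CSE_ID" is still a placeholder for your own engine ID.

```python
from urllib.parse import urlencode

# Endpoint and fixed parameters copied from the call shown above.
GSS_ENDPOINT = "http://www.google.com/cse"


def build_gss_url(cse_id, query, num=20, country="us"):
    """Return the XML-feed URL for one search query (hypothetical helper)."""
    params = {
        "cx": cse_id,            # your CSE identifier
        "q": query,              # the user's search terms
        "num": num,              # number of results requested
        "output": "xml_no_dtd",  # ask for the XML feed
        "client": "google-csbe",
        "gl": country,
    }
    # urlencode escapes spaces, '&', ':' and so on in every value.
    return GSS_ENDPOINT + "?" + urlencode(params)


url = build_gss_url("CSE_ID", "dry shampoo & conditioner")
print(url)
```

Encoding matters as soon as users type characters such as ‘&’ or ‘#’, which would otherwise be read as part of the URL syntax.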

Side note: to deal with XML, I use Hash.from_xml in Ruby (my MVP uses Rails, no comments…) or xmltodict in Python. Note that in Ruby, it seems that different Ruby versions return slightly different Hashes from the same XML. So… stick to one version? Don’t use Rails?

Anatomy of an XML GSS response

Read the full documentation; this is just a cheat sheet of the few XML objects I found useful. A sample result is available.

  • TM is the server response time (in seconds, according to Google’s XML reference).
  • Q is the query.
  • RES contains the results and the number of results as an attribute.
  • R is an individual result, containing the fields below.
  • U is the page link.
  • T is the page title.
  • S is the summary.
  • PageMap contains info crawled from the website in a DataObjects element.
  • cse_thumbnail, inside a DataObject, contains the thumbnail link hosted by Google.

Title, link, thumbnail, summary… that should be enough to get started with your MVP!
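
To make the cheat sheet concrete, here is a minimal sketch that pulls those fields out of a response with Python’s built-in ElementTree (instead of xmltodict, to avoid a dependency). The sample XML is hand-written for illustration, not a real GSS response: the GSP root element, the example.com URLs, and the exact nesting of the thumbnail (a DataObject of type cse_thumbnail holding an Attribute named src) are assumptions to check against a real feed.

```python
import xml.etree.ElementTree as ET

# Hand-written sample using the element names from the cheat sheet above.
# The root element and the thumbnail nesting are assumptions.
SAMPLE_XML = """<GSP>
  <TM>0.12</TM>
  <Q>shampoo</Q>
  <RES SN="1" EN="1">
    <R N="1">
      <U>http://www.example.com/shampoo</U>
      <T>Example shampoo page</T>
      <S>A short summary of the page.</S>
      <PageMap>
        <DataObject type="cse_thumbnail">
          <Attribute name="src" value="http://www.example.com/thumb.jpg"/>
        </DataObject>
      </PageMap>
    </R>
  </RES>
</GSP>"""


def find_thumbnail(result):
    """Return the thumbnail URL of one R element, or None if there is none."""
    for obj in result.iter("DataObject"):
        if obj.get("type") == "cse_thumbnail":
            for attr in obj.iter("Attribute"):
                if attr.get("name") == "src":
                    return attr.get("value")
    return None


def parse_results(xml_text):
    """Turn an XML response into a plain dict that a SERP template can use."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "link": r.findtext("U"),
            "title": r.findtext("T"),
            "summary": r.findtext("S"),
            "thumbnail": find_thumbnail(r),
        })
    return {"query": root.findtext("Q"), "results": results}


parsed = parse_results(SAMPLE_XML)
print(parsed["query"], len(parsed["results"]))
```

Results without a PageMap simply get a thumbnail of None, so the template can fall back to a default image.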

Customizing the search box

So far I have assumed that your app is responsible for the search box: you query GSS using the link above and display a result based on the XML answer. Problem is, some of the cool GSS features are only available through their widget, the main one being autocompletion. But if you install the widget, it also displays the results…

One way to go is to install the widget using the code provided in the CSE panel. Use version 1 as it is more customizable. It will give you a widget with autocompletion. Now, use the following callback to prevent the widget from actually searching and presenting the results:

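customSearchControl.setSearchStartingCallback({}, function() {
  // Read what the user typed in the widget's search box.
  var q = customSearchControl.getInputQuery();
  // Load our own results page; encode the query so '&', '#', etc. survive.
  window.location = '/search?q=' + encodeURIComponent(q);
});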

What’s happening is that, when the user presses search, my function is called. It gets the query and loads another page (/search) with the query as argument. This new page (which can be the one you’re currently on!) will see the ‘q’ param, call GSS, and display the results. That’s it! It’s hacky, but it works: you have the power of the GSS search box with a fully customized SERP.

Another trick: to remove the Google logo from the search box, add this to your CSS:
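
On the server side, the /search page first has to read the ‘q’ param back. My MVP does this in Rails; the sketch below shows the same step in Python with the standard library only, and the helper name extract_query is made up for the example.

```python
from urllib.parse import urlsplit, parse_qs


def extract_query(request_target):
    """Return the 'q' parameter of a target like '/search?q=dry+shampoo'.

    Returns None when the parameter is missing or empty.
    """
    params = parse_qs(urlsplit(request_target).query)
    values = params.get("q")
    return values[0].strip() if values else None


print(extract_query("/search?q=salt%20%26%20pepper"))
print(extract_query("/search?q=salt&pepper"))
```

The second call shows why it is worth percent-encoding the query in the browser (for instance with encodeURIComponent): with a raw ‘&’, everything after it is treated as another parameter and the search term is cut short.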

.gsc-input{
background:none;
}

That said, it doesn’t seem that easy to fully modify the look & feel of the search box, but I’m not a CSS expert. Just expect to spend some time on this if it’s important to you.

Limitations

Now that you have an MVP similar to mine below:

[Screenshot: example of a quick Google Site Search MVP]

What’s wrong with it? For me, these are some of the issues:

  • Indexing: if you don’t want to index a whole website, and there is no easy way to tell Google not to index unwanted parts (e.g. with -www.website.com/garbage/*), you will want to specify every part of the website you want as a whitelist. You can do that by uploading an annotation file. However, GSS only accepts 5K entries, which can contain wildcards. If your site is well structured and you want to annotate all posts like this: www.mywebsite.com/posts/*, it’s perfect. But if your website contains a lot of URL slugs that can vary and there’s no way to include all of them without including bad ones (and some CRMs do create a lot of URLs!), 5K can be reached fast.
  • Autocompletion: it requires using and hacking the Google search box to customize it. Also, the auto-detected words for autocompletion are poor in my case. You can provide your own list, but there’s a size limit. I haven’t pinned it down, but it’s less than 1K (I successfully uploaded ~70 words), which can be a small limit.
  • Thumbnail: the thumbnail provided by Google is not always the right picture for a page. I’m sure it’s a website issue (the site not being crawler-friendly), but you don’t necessarily control that.
  • Tag individual pages: I don’t know if tag is the right word, but following the previous issue, if every result could come with a page ID, or a product ID for e-commerce pages, I could build the full SERP out of that. Problem is, there is no way of passing that information to Google as a dictionary of URL->ID. The only way I can think of is to include that information in the webpage in a way Google can pick it up: see structured data. I believe it is a great solution, but it might require a lot of frontend engineering time!
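
For the indexing limitation, one cheap sanity check before uploading an annotation file is to test your wildcard whitelist locally against a list of known URLs (from a sitemap, for example). The sketch below approximates wildcard matching with Python’s fnmatch, which is an assumption on my side: Google’s own matching rules may differ, and fnmatch also gives special meaning to ‘?’ and ‘[’. The function name and the sample URLs are made up.

```python
from fnmatch import fnmatchcase

MAX_PATTERNS = 5000  # the 5K entry limit mentioned above


def check_whitelist(patterns, urls):
    """Split urls into (covered, missed) according to wildcard patterns.

    Local approximation only: '*' matches any run of characters,
    including '/'. Raises ValueError if the whitelist is over the limit.
    """
    if len(patterns) > MAX_PATTERNS:
        raise ValueError("too many patterns: %d > %d" % (len(patterns), MAX_PATTERNS))
    covered, missed = [], []
    for url in urls:
        if any(fnmatchcase(url, pattern) for pattern in patterns):
            covered.append(url)
        else:
            missed.append(url)
    return covered, missed


covered, missed = check_whitelist(
    ["www.mywebsite.com/posts/*"],
    ["www.mywebsite.com/posts/hello-world", "www.mywebsite.com/garbage/tmp"],
)
print(len(covered), "covered,", len(missed), "missed")
```

If the missed list contains pages you care about and the only fix is one pattern per page, that is a good early sign the 5K limit will hurt.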

Conclusion
Google Site Search let me build a search MVP in 3 days for $100. It’s great: it shows me what I should expect from our site search, and I can pass the MVP around the company for feedback and guided brainstorming. If you have a small website and need something that works fast, GSS might be your solution. If you find the Google SERP OK, you can be done in a few hours.

But to push it to a fully customized, professional site search, it would require serious engineering time, and I’m afraid of limitations I don’t control. My current thinking: the $12K or more per year would be better invested in a long-term, in-house Solr or ElasticSearch solution.