Setting up the Google Mini
When you first log in to the Google Mini, you are presented with a web-based interface for configuring the general parameters. These settings include the device IP address, DNS server, admin e-mail address and time zone. Nothing fancy here, although we did have to configure an internal DNS server due to some firewall routing issues.
Configuring your first Collection
As with most search products, the first task is to create a collection of what you want searched. The Google Mini supports one collection, while its larger brother, the Google Search Appliance, supports an unlimited number of collections. Collections can contain sub-collections (which I’ll explain a bit later).
Once you’ve created your first collection, the next step is to edit the collection parameters and set it up for indexing. We started with URLs to Crawl, which contains three parameters: Starting URLs from which to crawl, Follow and Crawl certain URLs or parts thereof, and Do Not Crawl URLs matching certain patterns. This was probably where we spent 99% of our time configuring the Mini. The Mini allows for 100,000 documents/URLs to be stored in a collection, and AnandTech contains approximately 40,000 articles, news and blog entries.
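To make the interplay of those three settings concrete, here is a minimal Python sketch of how a crawler might decide whether to fetch a URL. It is an illustration only, not the Mini's actual matching logic: the pattern lists, the prefix and substring matching rules, and the example URLs are all assumptions.

```python
# Hypothetical pattern lists; the Mini's own pattern syntax may differ.
START_URLS = ["http://www.anandtech.com/it/"]
FOLLOW_PREFIXES = ["http://www.anandtech.com/"]
DO_NOT_CRAWL_SUBSTRINGS = ["/print/", "sessionid="]


def should_crawl(url: str) -> bool:
    """Return True if the URL matches a follow prefix and no exclusion."""
    if not any(url.startswith(prefix) for prefix in FOLLOW_PREFIXES):
        return False
    return not any(fragment in url for fragment in DO_NOT_CRAWL_SUBSTRINGS)


if __name__ == "__main__":
    # Every start URL should itself pass the filter, or the crawl never begins.
    for start in START_URLS:
        print(start, should_crawl(start))
```

The point of the sketch is that the follow and exclusion lists act together: a URL has to pass both before it counts against the document limit.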
When we first set up the Mini, we told it to start in each of the website’s sections (for example, http://www.anandtech.com/it/) and in the web news area. The Mini considers any unique URL string to be a unique document, which makes sense (but is a bit surprising the first time that you run an index).
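The effect is easy to underestimate, so here is a small Python sketch of it. The URLs and the query parameter names (`i`, `ref`) are made up for illustration; the idea is simply that several distinct URL strings can point at the same page, and each string counts as its own document.

```python
from urllib.parse import urlsplit, parse_qsl


def page_key(url, significant_params=("i",)):
    """Reduce a URL to host, path and the parameters that select content."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query)
        if name in significant_params
    )
    return (parts.netloc.lower(), parts.path.lower(), tuple(kept))


# Hypothetical URLs: the first three differ as strings but show one article.
urls = [
    "http://www.anandtech.com/it/showdoc.aspx?i=100",
    "http://www.anandtech.com/it/showdoc.aspx?i=100&ref=home",
    "http://www.anandtech.com/it/showdoc.aspx?ref=rss&i=100",
    "http://www.anandtech.com/it/showdoc.aspx?i=101",
]

unique_strings = len(set(urls))                   # what a string-based crawler counts
unique_pages = len({page_key(u) for u in urls})   # what a reader would count

print(unique_strings, unique_pages)  # prints "4 2"
```

Multiply that ratio across a whole site and a 100,000-document limit disappears much faster than the article count alone would suggest.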
After four hours of indexing, the Mini had managed to reach its document limit and we had to improvise. After several attempts at filtering out various URL patterns and restricting the crawling as much as we could, we ended up writing some code. We created a file to which a link to every article, news post and blog post that has been published on the site would be dumped. That file is cached for a few hours, as we update the index 3 times a week. We then configured the Mini to start at those URLs and restricted it only to URLs ending in showdoc.aspx, shownews.aspx and a few others. It worked: the next index was around 38,000 documents. A word to the wise: don’t let the Mini crawl your entire site without keeping a close eye on it.
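For readers who want to try the same approach, a rough Python sketch of the idea follows. It is not the code we run on the site: the record format, the `i` query parameter and the URL layout are assumptions, and a real version would pull its records from the content database and cache the result.

```python
# Page names taken from the text above; the full list on the site is longer.
ALLOWED_PAGES = ("showdoc.aspx", "shownews.aspx")


def build_seed_page(records):
    """Render a plain HTML page that links to each content item once.

    `records` is an iterable of (path, item_id) pairs, for example
    ("it/showdoc.aspx", 100). Both fields are hypothetical.
    """
    seen = set()
    links = []
    for path, item_id in records:
        url = f"http://www.anandtech.com/{path}?i={item_id}"
        if url not in seen:
            seen.add(url)
            links.append(f'<a href="{url}">{url}</a>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>"


def is_allowed(url):
    """Accept only URLs whose path (query string removed) ends in an allowed page."""
    path = url.split("?", 1)[0]
    return path.endswith(ALLOWED_PAGES)


if __name__ == "__main__":
    sample = [("it/showdoc.aspx", 100), ("shownews.aspx", 200)]
    print(build_seed_page(sample))
```

Pointing the crawler at one generated page like this, and refusing everything that fails `is_allowed`, keeps the index close to the true number of content items.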
Sub-collections
Before you let the Google Mini go off and crawl to its heart’s content, consider creating some sub-collections if they are required. Sub-collections are simply small collections containing specific fragments of your site. For instance, on AnandTech, we have Articles, News, Blogs and FAQs as sub-collections. Each of these can be searched separately within the collection, which lets us offer targeted searches within the various sections of the web site.
KeyMatch/Synonyms
Like the google.com search, the Google Mini supports key matches, which let you place links at the top of the search results for keywords that you enter in the Google Mini interface. Another useful feature is Synonyms, which allows you to enter synonyms for various search terms. We have created a few. Try typing “iram” into our search, and you’ll notice that it suggests “i-ram” as a possible search.
Look and feel integration
The last thing that we worked on was making the Mini look like it is part of AnandTech.com. There are two ways to go about this in the Mini admin. One is to use the built-in page layout helper, which allows you to wrap the search screens with a custom header and footer. The other way (which we prefer) is to use the XSLT Stylesheet editor and modify the stylesheet to meet your needs.
All in all, our integration went fairly smoothly, and the Mini has made it far easier to find content on AnandTech.com.
Screen Shots
Settings | Collection | Output | Subcollections
48 Comments
zmagaw - Tuesday, September 6, 2005 - link
there are a few methods - including creating separate collections by user type filtering out or in urls by pattern matching
Brickster - Wednesday, September 7, 2005 - link
Actually, I found you can achieve that with an upgrade... of course :)
http://www.google.com/enterprise/feature_compariso...
Here is the Appliance feature:
Secure Content API - Search across secure content using Google's Authorization API to integrate into existing access control systems.
Looks like the Mini doesn't support secured content.
zmagaw - Tuesday, September 6, 2005 - link
we signed a non-disclosure that said we couldn't open the google search appliances... although the hardware looks simple and run of the mill... the software is not... a lot of open source stuff on that puppy but execution is everything... the support we got though was horrible... 2 day response times... so not easy because the software is full of bugs that are not easily diagnosed... hardware failures - disks... and speedy google working with large corporations has been seen as a daunting task for the bright people at Google
nadirshakur - Tuesday, September 6, 2005 - link
What is the warranty on these puppies? Hey, didn't AnandTech void theirs by opening it like that and showing the whole world they did?
flatblastard - Tuesday, September 6, 2005 - link
That's okay, if the RAM/CPU goes bad, I'll sell them my old P3 450MHz system I got laying around for spare parts. Heck, I'll even give them a sweet deal.....$1999.95 and I'll even throw in Windows 98 (not SE)..... ;)
deathwalker - Tuesday, September 6, 2005 - link
Is this really important when it comes to your experience when visiting the AnandTech website? I guess I'll get blasted for that statement!! So much really good stuff that could be the news of the day...this article is just cannon fodder...something to fill the need for a new article to read on this day.
Jason Clark - Tuesday, September 6, 2005 - link
Guys, we don't crawl every day :) It crawls 3-4 times a week, since large articles are on the front page, searching for them is pretty unnecessary.
Gooberslot - Tuesday, September 6, 2005 - link
Is it normal for servers to give you no control like that? I wouldn't want anything that had a BIOS password that I couldn't change.
I'm also surprised that you can even get a P3 or a P3 motherboard anymore.
zmagaw - Tuesday, September 6, 2005 - link
i think when you are buying appliances yes... the reason google does this... you would be able to decompile their software on that hard-drive which is formatted in a google HD format - or so I have heard
smn198 - Tuesday, September 6, 2005 - link
You could remove the BIOS password, but then you lose all the BIOS info as well, and maybe they are doing something special there.