
Configuring a Dedicated “Crawl Front End” with Request Management

posted on Wednesday, July 24, 2013 12:06 AM

I keep getting asked about how to use Request Management in SharePoint 2013 to configure a dedicated “crawl front end”. In other words, how to use RM to ensure that your search crawl traffic gets sent to a specific machine or machines in the farm which do not serve end user requests.

Hopefully you already know that by simply turning on RM on the Web Servers in your farm, with no additional configuration, you get health based routing for free. And this is health based routing that actually works, unlike the default configuration of the most popular “intelligent” load balancers, all of which require additional scripting to make them “intelligent”. But you may want to leverage RM to do other things, and “crawl front ends” are the best example of a reasonably common requirement.
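If you want to see the raw material that health based routing works from, each web server advertises its current health score in a response header. A quick sketch, where the URL is a placeholder for one of your own web servers:

# Check the health score a web server is currently reporting (0 = healthy, 10 = overloaded)
# The URL is a placeholder; hit each web server directly rather than the load balanced name
$response = Invoke-WebRequest -Uri "http://server1/" -UseDefaultCredentials
$response.Headers["X-SharePointHealthScore"]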

If you are not familiar with RM, head on over to my Request Management article series and find out about it before doing anything with the scripts in this post!

It’s all very simple, but the number of times I’ve been asked about it means it makes sense to do a quick post on the required configuration. Now, before we get into the details I want to make a couple of things very clear.

  1. Just because you can, doesn’t mean you should™!
    Whilst I will show you how to set it up, this does not mean “Spence says you should use RM for configuring a crawl front end." RM is simply one of the tools available to us in SharePoint 2013 that we can choose to use. The call as to whether to use it or not depends to a large degree on the specifics of your deployment. Furthermore if you are not comfortable with RM, then this is not the approach you are looking for. Remember, it’s a tool. Another tool for the same job is the hosts file on your server file system. It’s your decision to make, not mine! :)
     
  2. Complexity is the enemy of everything!
    The vast majority of on-premises customers don’t need a dedicated crawl front end in the first place. Dedicated crawl front ends make the farm topology more complicated and more expensive, and they increase the operational service management burden. None of those things are good! “Architects” that over-complicate farm designs for customers by deploying such things really get on my wick. Only deploy dedicated crawl front end(s) if you really need them. Some folks do. Most don’t, and would be better off increasing the resources available to their existing web servers. Oh, and whilst I'm in rant mood, just say no to the idea of “crawling yourself” by running the Web Application Service on the box running the crawler. That’s just stupid. Period.
     
  3. Remember that a “front end” is complete and utter claptrap!
    Yup, “Web Front End”, “Crawl Front End”. Stupid terms which simply won’t die. Everything is a web server. That’s it. Nothing more to see here. I was always very disappointed with Microsoft when they didn’t ship a role called “Web Back End” in SharePoint 2007! :)  “Web Front End” is a term made up by people that don’t (or didn’t) understand HTTP. But of course the trouble is everyone uses them (even me, when I'm tired or otherwise not concentrating). The stupidest term in SharePoint (at least since the “Single Sign On” service got renamed :).

 

Okay, cool. Disclaimers are always fun. With that out of the way let’s get into some details.

 

Host Named Site Collections

The fundamental thing to understand here is that Request Management only really works with Host Named Site Collections. That means a single Web Application in the farm, hosting all Site Collections. This is Microsoft’s preferred logical architecture design for SharePoint 2013, and incidentally the way we should have been doing it all along if the product had been up to scratch.
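If you have not set up Host Named Site Collections before, creating one is a single cmdlet call against the web application. A minimal sketch, where the URLs, owner and template are placeholders for your own values:

# Create a host named site collection in the single web application
# All of the values here are illustrative; substitute your own
$wa = Get-SPWebApplication "http://webapp/"
New-SPSite -Url "http://cool.fabrikam.com" -HostHeaderWebApplication $wa -OwnerAlias "FABRIKAM\spadmin" -Name "Cool" -Template "STS#0"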

Whilst it is possible to hack things together to make RM work with Host Header Web Applications, I will NOT be detailing that configuration, ever. It’s not something you should even consider.

So, Host Named Site Collections it is. If you are not using them, this approach can NOT be used to configure your crawl front ends, and you will need to look at the alternative approaches, such as the properties in the Search Service Application or, better yet, good old name resolution.
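For reference, the name resolution approach boils down to hosts file entries on the server running the crawler, pointing the web application host names at the dedicated web server. A rough sketch, where the IP address and host names are entirely made up:

# On the server running the crawler, point the web app host names at the dedicated crawl web server
# The IP address and host names below are made up; substitute your own (run elevated)
$hostsFile = "$env:windir\System32\drivers\etc\hosts"
Add-Content -Path $hostsFile -Value "192.168.10.21  webapp"
Add-Content -Path $hostsFile -Value "192.168.10.21  cool.fabrikam.com"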

 

Starting the Request Management Service Instance on Web Servers

Request Management should be started on all Web Servers in the farm. You can do this via Services on Server or with the Start-SPServiceInstance Windows PowerShell cmdlet. I created a helper script which basically does all the work, regardless of the Farm you are deploying to.

# Start the RM service instance on all machines in the farm which run the WA service instance
# we could of course simply pass in an array of servers

# Gets a list of server names for servers running a given service instance type
function GetServersRunningServiceInstance ($svcType) {
    return Get-SPServiceInstance | Where-Object {$_.TypeName -eq $svcType -and $_.Status -eq "Online"} | ForEach-Object {$_.Server.Name}
}

# Starts a service instance on a given server and waits until it reports Online
function StartServiceInstance ($svcType, $server) {
    Write-Host("Starting " + $svcType + " on " + $server + "...")
    $svc = Get-SPServiceInstance -Server $server | Where-Object {$_.TypeName -eq $svcType}
    $svc | Start-SPServiceInstance
    while ($svc.Status -ne "Online") {
        Start-Sleep 2
        # re-fetch the service instance to pick up its current status
        $svc = Get-SPServiceInstance $svc
    }
    Write-Host("Started " + $svcType + " on " + $server + "!")
}

$waSvcType = "Microsoft SharePoint Foundation Web Application"
$rmSvcType = "Request Management"

# loop through the servers running the WA service instance and start RM on each
GetServersRunningServiceInstance $waSvcType | ForEach-Object {StartServiceInstance $rmSvcType $_}

 

Configuring Request Management for dedicated crawl front end(s)

The first step is to get the RM Settings for our Web Application. We will pass this into other cmdlets later on. I like to output the objects to validate things are working.

# grab and display the settings
$waUrl = "http://webapp/"
$wa = Get-SPWebApplication $waUrl 
$rmSettings = $wa | Get-SPRequestManagementSettings
$rmSettings 

Obviously the Web App address is NOT http://webapp/ unless we have modified the Internal URL. It will be a server name, and you should update that variable to reflect your setup.
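While you have the settings object handy, it is also worth confirming that routing is actually switched on before adding any rules. A hedged sketch; I am assuming the RoutingEnabled property and the Set-SPRequestManagementSettings parameter name here, so verify them with Get-Help on your build:

# Make sure routing is enabled on the web application's RM settings
# (property and parameter names are assumptions; check Get-Help Set-SPRequestManagementSettings)
if (-not $rmSettings.RoutingEnabled) {
    $rmSettings | Set-SPRequestManagementSettings -RoutingEnabled $true
}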

Now we will configure a Machine Pool which includes the servers we wish to act as crawl front ends. We do this by creating an array of server names, and then pass that into the Add-SPRoutingMachinePool cmdlet.

# Create a Machine Pool for the dedicated crawl front end
$crawlFrontEnds = @("Server1", "Server2")
$mpName = "Crawl Front End Machine Pool"
$mp1 = Add-SPRoutingMachinePool -RequestManagementSettings $rmSettings -Name $mpName -MachineTargets $crawlFrontEnds
# validate settings
Get-SPRoutingMachinePool -RequestManagementSettings $rmSettings -Name $mpName  

Now we need to create a Rule Criteria which will be evaluated to see if the incoming request is from the search crawler. This is the “magic” here (if it can be called such!). We will match the HTTP User Agent against that provided by the crawler.

$userAgent = "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)"
$criteria = New-SPRequestManagementRuleCriteria -Property UserAgent -MatchType Equals -Value $userAgent

Note the $userAgent variable there. It must match the crawler’s User Agent, and the one above is the default for SharePoint 2013. It’s stored in a Registry Key, and there are plenty of reasons why it may need to be changed, so make sure to check yours. It’s at:

HKLM\SOFTWARE\Microsoft\Office Server\15.0\Search\Global\Gathering Manager\UserAgent

Of course, you could read the value directly in Windows PowerShell and shove it into the -Value parameter.
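Something along these lines would do it; a quick sketch, assuming you run it on the server that holds the registry key above:

# Read the crawler's User Agent straight from the registry and build the criteria from it
$gmKey = "HKLM:\SOFTWARE\Microsoft\Office Server\15.0\Search\Global\Gathering Manager"
$userAgent = (Get-ItemProperty -Path $gmKey -Name UserAgent).UserAgent
$criteria = New-SPRequestManagementRuleCriteria -Property UserAgent -MatchType Equals -Value $userAgent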

Lastly we create a Routing Rule using the above Criteria and Machine Pool. I also add this rule to Execution Group 0 to ensure that it will always fire.

$ruleName = "Serve Crawler requests from Crawl Front End Machine Pool"
$rule = Add-SPRoutingRule -RequestManagementSettings $rmSettings -Name $ruleName -ExecutionGroup 0 -MachinePool $mp1 -Criteria $criteria
# validate
$rule
$rule.Criteria 

That’s it! One Machine Pool, one Routing Rule. It doesn’t get much simpler. Well unless you are editing a hosts file :).

Now we can kick off a Search Crawl and see it working by viewing the RM ULS (remember to view the ULS on the Web Servers and not the Crawler!). You will see a whole bunch of URI Mappings happening across those boxes, for example:

Mapping URI from 'http://webapp:80/robots.txt' to 'http://server2/robots.txt'

Mapping URI from 'http://webapp:80/_vti_bin/sitedata.asmx' to 'http://server2/_vti_bin/sitedata.asmx'

Mapping URI from 'http://cool.fabrikam.com:80/robots.txt' to 'http://server2/robots.txt'

Mapping URI from 'http://lame.fabrikam.com:80/robots.txt' to 'http://server2/robots.txt'

Mapping URI from 'http://cool.fabrikam.com:80/_vti_bin/sitedata.asmx' to 'http://server1/_vti_bin/sitedata.asmx'

Mapping URI from 'http://lame.fabrikam.com:80/_vti_bin/sitedata.asmx' to 'http://server2/_vti_bin/sitedata.asmx'

Mapping URI from 'http://cool.fabrikam.com:80/_vti_bin/publishingservice.asmx' to 'http://server1/_vti_bin/publishingservice.asmx'
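If you would rather not scroll through the raw log files, you can pull those entries out of the local ULS with Windows PowerShell. A rough sketch, run on one of the Web Servers; the message filter simply matches the mapping entries shown above:

# Pull recent RM mapping entries out of the local ULS log (run on a web server, not the crawler)
Get-SPLogEvent -StartTime (Get-Date).AddMinutes(-15) |
    Where-Object { $_.Message -like "Mapping URI*" } |
    Select-Object Timestamp, Message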

So we can sit staring at the ULS forever whilst the crawler does its thing, but the important thing is that it’s working.

But wait. On my demo rig I have a couple of other web apps there, cool and lame, which are host header web apps, and it’s mapping those requests as well. That’s not so cool, because the response will be for the HNSC, not the Path Based Sites. See, I told you RM was designed for HNSC only. Don’t mix and match; weird shit will happen. You’ll get weird search results and users won’t be able to find a whole bunch of stuff.
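If you do end up deciding RM routing is not the right tool for your farm after all, backing the configuration out is just as quick. A rough sketch using the objects created above; double check the Remove-* cmdlet names on your build:

# Back the whole thing out: remove the routing rule, then the machine pool
Get-SPRoutingRule -RequestManagementSettings $rmSettings -Name $ruleName | Remove-SPRoutingRule
Get-SPRoutingMachinePool -RequestManagementSettings $rmSettings -Name $mpName | Remove-SPRoutingMachinePool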


 

Conclusion

There you have it: configuring dedicated crawl front ends using Request Management. If you are in a Host Named Site Collection world, life is good, and it’s very simple. If you are still stuck in the world of Host Headers, just say no to Request Management, period. Enjoy managing your crawler requests!

s.