Category Archives: Tracking

Live search referrer spamming

I don’t usually take much notice of Live search traffic to this blog because Live rarely drives much traffic, if any at all. But I decided to look today and noticed a few weird entries in Google Analytics: a bunch of referring keywords from live.com, each with 1 visit, 2 page views and 0 time on site. Click on the thumbnail to see the full size image:

Live Search Referrer Spam

When you see patterns like that, you have to assume that it’s a bot that’s hitting your site. I downloaded my logs for the past week and took a look. To my surprise there were entries like this:

65.55.165.26 - -
[29/Sep/2007:12:12:45 -0400]
"GET /black-people-on-ebay-again/ HTTP/1.0"
200 25905 www.reubenyau.com
"http://search.live.com/results.aspx?q=people&mrt=en-us&FORM=LIVSOP"
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)" "-"

If you’re not used to reading raw logs, the important items here are the IP address in the first line and the referrer in the 5th line.
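If you would rather not pick those two fields out by eye, here is a minimal sketch of doing it with a script. It assumes the entry has been joined back into a single line and that the fields come in the same order as the sample above; parseLogEntry is a made-up helper name, not part of any log tool.

```javascript
// Hypothetical helper: pull the client IP and the referrer out of one log entry.
// Assumes the field order of the sample entry shown above.
function parseLogEntry(line) {
  // The IP address is the first whitespace-delimited token.
  const ip = line.split(/\s+/)[0];
  // Quoted fields, in order: request, referrer, user agent, trailing field.
  const quoted = line.match(/"[^"]*"/g) || [];
  const referrer = quoted.length > 1 ? quoted[1].slice(1, -1) : null;
  return { ip: ip, referrer: referrer };
}

// The sample entry from above, joined into one line.
const sampleEntry = [
  '65.55.165.26 - -',
  '[29/Sep/2007:12:12:45 -0400]',
  '"GET /black-people-on-ebay-again/ HTTP/1.0"',
  '200 25905 www.reubenyau.com',
  '"http://search.live.com/results.aspx?q=people&mrt=en-us&FORM=LIVSOP"',
  '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)" "-"'
].join(' ');

const parsedEntry = parseLogEntry(sampleEntry);
console.log(parsedEntry.ip);       // 65.55.165.26
console.log(parsedEntry.referrer); // http://search.live.com/results.aspx?q=people&mrt=en-us&FORM=LIVSOP
```

Other servers may log the fields in a different order, so the index of the referrer among the quoted fields would need adjusting.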

Using whois.arin.net you can see that the IP address belongs to Microsoft:

OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US

NetRange: 65.52.0.0 - 65.55.255.255
CIDR: 65.52.0.0/14
NetName: MICROSOFT-1BLK
NetHandle: NET-65-52-0-0-1
Parent: NET-65-0-0-0-0
NetType: Direct Assignment
NameServer: NS1.MSFT.NET
NameServer: NS5.MSFT.NET
NameServer: NS2.MSFT.NET
NameServer: NS3.MSFT.NET
NameServer: NS4.MSFT.NET
Comment:
RegDate: 2001-02-14
Updated: 2004-12-09

RTechHandle: ZM23-ARIN
RTechName: Microsoft Corporation
RTechPhone: +1-425-882-8080
RTechEmail: noc@microsoft.com

OrgAbuseHandle: ABUSE231-ARIN
OrgAbuseName: Abuse
OrgAbusePhone: +1-425-882-8080
OrgAbuseEmail: abuse@microsoft.com

OrgAbuseHandle: HOTMA-ARIN
OrgAbuseName: Hotmail Abuse
OrgAbusePhone: +1-425-882-8080
OrgAbuseEmail: abuse@hotmail.com

OrgAbuseHandle: MSNAB-ARIN
OrgAbuseName: MSN ABUSE
OrgAbusePhone: +1-425-882-8080
OrgAbuseEmail: abuse@msn.com

OrgNOCHandle: ZM23-ARIN
OrgNOCName: Microsoft Corporation
OrgNOCPhone: +1-425-882-8080
OrgNOCEmail: noc@microsoft.com

OrgTechHandle: MSFTP-ARIN
OrgTechName: MSFT-POC
OrgTechPhone: +1-425-882-8080
OrgTechEmail: iprrms@microsoft.com
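
With more than a handful of suspicious IPs, checking each one against the CIDR block from the whois record by hand gets tedious. The following is a rough sketch of that check; ipv4ToInt and isInCidr are made-up helper names, and it only handles plain IPv4 addresses.

```javascript
// Hypothetical helpers: test whether an IPv4 address falls inside a CIDR block.
function ipv4ToInt(ip) {
  // Fold the four octets into one number, e.g. "1.2.3.4" -> 16909060.
  return ip.split('.').reduce(function (total, octet) {
    return total * 256 + parseInt(octet, 10);
  }, 0);
}

function isInCidr(ip, cidr) {
  const parts = cidr.split('/');
  const prefixLength = parseInt(parts[1], 10);
  // Number of addresses in the block, e.g. 2^18 for a /14.
  const blockSize = Math.pow(2, 32 - prefixLength);
  // Plain arithmetic instead of bit operators avoids 32-bit sign surprises.
  return Math.floor(ipv4ToInt(ip) / blockSize) ===
         Math.floor(ipv4ToInt(parts[0]) / blockSize);
}

console.log(isInCidr('65.55.165.26', '65.52.0.0/14')); // true
console.log(isInCidr('65.56.0.1', '65.52.0.0/14'));    // false
```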

This type of behavior is certainly not in the spirit of the internet, and it’s definitely quite annoying.

So why is Microsoft referrer spamming me? I started searching the forums and found that I’m not the only one being targeted here.

http://ekstreme.com/thingsofsorts/blogging/yell-if-microsofts-livecom-spammed-you-too

http://www.webmasterworld.com/msn_microsoft_search/3424476.htm

http://www.seo-scoop.com/2007/11/13/past-time-for-msn-to-pony-up-to-the-real-truth-about-referrer-spam/

http://smackdown.blogsblogsblogs.com/2007/11/13/microsoft-needs-to-quit-fucking-with-my-adsense-scripts/

So as it turns out, msndude from webmasterworld apologizes and basically says “don’t worry, be happy – and btw, if you block it, you might get banned”. Here’s the actual quote:

The traffic you are seeing is part of a quality check we run on selected pages. While we work on addressing your concerns, we would request that you do not actively block the IP addresses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index.

HUH? Excuse me? You have a bot that’s not exactly being very nice but I’m not allowed to block it? What kind of practice is that? I don’t run a spammy MFA (made-for-AdSense) site. I don’t do anything shady, so why should I have to sit here and have my stats polluted for absolutely nothing in return? If Live was sending me traffic I could perhaps turn a blind eye, but considering that they’ve been on a pretty steady downward trend recently, you’d think that they’d want to do a better job of appeasing webmasters. Here’s their traffic over the past few months according to quantcast:

live.com traffic by quantcast

As of tonight it’s being blocked. I really don’t want to play this game, but this type of behavior should not be tolerated by webmasters.

Google Analytics 404 Tracking And Downloaded Files

I was reading Bruce Clay’s September Newsletter and came across a couple of inaccuracies in Jim Sterne’s web analytics article, specifically this part:

Google won’t report on downloads of files like PDF’s, jpg’s or Flash. You want to know about server error messages? You have to look to the pay-to-play vendors.

Well Google Analytics may not give you those reports out of the box, but it’s not too difficult to put these two solutions together:

1) Tracking files downloaded from your site:
http://www.google.com/support/analytics/bin/answer.py?hl=en&answer=27242

Caveat – this method only reports on people clicking links on your website which are tagged with this code. If someone remotely links to a file on your site, none of the javascript web analytics packages will report on that traffic.
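
As a rough illustration of what tagging links can look like, here is a minimal sketch. It assumes the approach is to call urchinTracker() with a made-up page path whenever a file link is clicked; the /downloads/ prefix, the extension list and the downloadVirtualPath helper are all assumptions chosen for the example.

```javascript
// Assumed list of file types worth tracking; adjust to taste.
const TRACKED_EXTENSIONS = ['pdf', 'jpg', 'swf'];

// Hypothetical helper: map a link href to a virtual page path, or null
// when the link does not point at a tracked file type.
function downloadVirtualPath(href) {
  const path = href.split(/[?#]/)[0]; // drop query string and fragment
  const match = path.match(/\.([a-z0-9]+)$/i);
  if (!match || TRACKED_EXTENSIONS.indexOf(match[1].toLowerCase()) === -1) {
    return null;
  }
  return '/downloads/' + path.substring(path.lastIndexOf('/') + 1);
}

// In a browser, with urchin.js already loaded, tag every matching link.
if (typeof document !== 'undefined' && typeof urchinTracker === 'function') {
  Array.prototype.forEach.call(document.getElementsByTagName('a'), function (link) {
    const virtualPath = downloadVirtualPath(link.getAttribute('href') || '');
    if (virtualPath) {
      link.addEventListener('click', function () { urchinTracker(virtualPath); });
    }
  });
}

console.log(downloadVirtualPath('/files/brochure.pdf')); // /downloads/brochure.pdf
console.log(downloadVirtualPath('/about.html'));         // null
```

The downloads would then show up in the content reports under the /downloads/ paths, which keeps them easy to filter.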

2) Tracking 404 error pages:
http://analytics.blogspot.com/2006/09/tip-tracking-404-pages.html

This could also be used to create error pages and tracking for 5XX errors.
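
Here is a minimal sketch of the idea, assuming the error page reports itself to Analytics under a virtual path that carries the URL the visitor actually requested. The errorVirtualPath helper and the "/404.html?page=" naming are illustrative choices, not something taken from the linked tip.

```javascript
// Hypothetical helper: build the virtual page path to report for an error page.
// statusCode is e.g. 404 or 500; pathname and search describe the requested URL.
function errorVirtualPath(statusCode, pathname, search) {
  return '/' + statusCode + '.html?page=' + pathname + search;
}

// On the error page itself, in a browser with urchin.js already loaded.
// The 404 here is hard-coded because this sketch is meant for the 404 template.
if (typeof window !== 'undefined' && typeof urchinTracker === 'function') {
  urchinTracker(errorVirtualPath(404, window.location.pathname, window.location.search));
}

console.log(errorVirtualPath(404, '/old-page/', '?id=7')); // /404.html?page=/old-page/?id=7
console.log(errorVirtualPath(500, '/checkout.asp', ''));   // /500.html?page=/checkout.asp
```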

And on a side note, for those of you who use WordPress, here’s a handy way to create a custom 404 page:
http://codex.wordpress.org/Creating_an_Error_404_Page

If you have your Google Analytics code in a footer include file, you could create a second include and call it from the 404.php.

Website Optimizer Is Search Engine Friendly

Website Optimizer by Google AdWords is a multivariate testing tool, that is to say, it takes A/B testing one step further. Instead of testing 2 versions of a page, you can test multiple page elements and the various combinations.

When you create an experiment you can specify page elements that you want to test, for example, a page heading, intro copy or a lead image. It uses javascript on the landing page to swap out the test element with the other variations that you specify within Website Optimizer.

I participated in the beta test of this and my initial concern was that it may not be search engine friendly because of the changing page elements. However, after a short call, the Website Optimizer Product Manager confirmed that it would not have any impact on organic rankings.

If you are still concerned about it, you can always set up a specific landing page that is not linked to from your main navigation and use either a robots.txt rule or a meta noindex tag to keep search engines from crawling or indexing that page.

Once your AdWords account is fairly well optimized, I would highly recommend trying out this tool. You will learn new things about your website, its traffic and motivators. Just make sure you carefully plan the test elements and don’t test too many elements at once to ensure that you can run through enough iterations with conversions to gain meaningful data.

Google Analytics Developers, Please Update urchin.js

I was looking through an overall keyword conversion report in Analytics and noticed some strange search phrases appearing. They looked like long strings of random numbers and characters. It turns out that AOL must be testing some new URL structure in the search results and changed the variable that identifies the search query. I took a look at the urchin.js and noticed that the new query variable (userQuery) is not included.

The urchin.js file currently has these variables to define AOL:

_uOsr[3]="aol"; _uOkw[3]="query";
_uOsr[4]="aol"; _uOkw[4]="encquery";

To track the new AOL search queries just place these 2 lines before the urchinTracker() function in your Google Analytics tracking code:

_uOsr.push("aol");
_uOkw.push("userQuery");

Update: I replaced the manual insertion of elements into the _uOsr and _uOkw arrays with the push() function which is a much better solution.
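
To see why the two arrays have to stay in step, here is a simplified sketch of how a pair of parallel arrays like these can be used to pull a keyword out of a referrer. This is not the actual urchin.js code: engines and queryParams are stand-ins for _uOsr and _uOkw, findKeyword is a made-up function, and the AOL URL in the example is invented.

```javascript
// Stand-ins for _uOsr and _uOkw: entry i of one array pairs with entry i of the other.
const engines = ['aol', 'aol'];
const queryParams = ['query', 'encquery'];

// The equivalent of the two push() lines above.
engines.push('aol');
queryParams.push('userQuery');

// Hypothetical lookup: return the search phrase if the referrer matches a known engine.
function findKeyword(referrer) {
  const queryStart = referrer.indexOf('?');
  if (queryStart === -1) return null;
  const beforeQuery = referrer.substring(0, queryStart);
  const pairs = referrer.substring(queryStart + 1).split('&');
  for (let i = 0; i < engines.length; i++) {
    if (beforeQuery.indexOf(engines[i] + '.') === -1) continue; // not this engine
    for (let j = 0; j < pairs.length; j++) {
      const eq = pairs[j].indexOf('=');
      if (eq !== -1 && pairs[j].substring(0, eq) === queryParams[i]) {
        return decodeURIComponent(pairs[j].substring(eq + 1).replace(/\+/g, ' '));
      }
    }
  }
  return null;
}

console.log(findKeyword('http://search.aol.com/aol/search?userQuery=blue+widgets')); // blue widgets
```

Because push() appends to the end of both arrays at once, the new engine name and its query variable always land at the same index, whatever else has been added before them.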

Google Analytics Tracking Code – HTTPS and Full External Referrer Only

Part 1 – Detecting http and https Mode Using Javascript
A while back I came across a scenario where a website (typically an ecommerce site) serves part of its pages in both http and https mode. These sites typically use the same template or footer include file for both modes. On the https pages this causes a security alert popup in the browser, because the remote javascript file is still requested over http. While this isn’t a security threat, it could cause some less technically savvy users to be concerned about the site’s security and perhaps not want to complete the transaction.

Google does offer the webmaster the ability to request the urchin.js file using an https call, which works well. What we really need, though, is a way to detect which mode we’re in and then make the appropriate request for the javascript file.

With help from some members on SEORefugee we figured out how it can be done.
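
For illustration, here is a minimal sketch of that kind of detection. It assumes the two urchin.js locations are http://www.google-analytics.com/urchin.js and https://ssl.google-analytics.com/urchin.js; urchinScriptUrl is a made-up helper name.

```javascript
// Hypothetical helper: choose the urchin.js URL that matches the page's protocol.
// The two URLs are assumed to be the standard and secure locations of the file.
function urchinScriptUrl(protocol) {
  return protocol === 'https:'
    ? 'https://ssl.google-analytics.com/urchin.js'
    : 'http://www.google-analytics.com/urchin.js';
}

// In a browser, write out the matching script tag while the page is loading.
if (typeof document !== 'undefined') {
  document.write('<script src="' + urchinScriptUrl(document.location.protocol) +
                 '" type="text/javascript"><\/script>');
}

console.log(urchinScriptUrl('https:')); // https://ssl.google-analytics.com/urchin.js
console.log(urchinScriptUrl('http:'));  // http://www.google-analytics.com/urchin.js
```

The closing script tag inside the string is written as <\/script> so that the browser does not treat it as the end of the surrounding script block.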

Part 2 – Only Obtaining External Referrers
Sunday night I was looking through my Top Content report and realized that my hack to obtain the full referrer is fairly indiscriminate: it records all referrers, both internal and external. I already knew about this, but I guess that night I was tired and grumpy and it just bugged me enough to want to fix it.

The whole point of my hack was to obtain the external referrer, so I came up with some more javascript to detect whether the referrer is internal or external and write out the urchinTracker function accordingly, so it will only record the external referrers.
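
The internal versus external test could be sketched roughly like this. isExternalReferrer is a made-up helper name, and the sketch compares host names exactly, so "mywebsite.com" and "www.mywebsite.com" would count as different sites unless you normalize them first.

```javascript
// Hypothetical helper: true when the referrer comes from a different host.
// An empty referrer (direct visit) is treated as not external.
function isExternalReferrer(referrer, siteHost) {
  if (!referrer) return false;
  // Capture the host name between the protocol and the first /, ?, # or port.
  const match = referrer.match(/^https?:\/\/([^\/?#:]+)/i);
  if (!match) return false;
  return match[1].toLowerCase() !== siteHost.toLowerCase();
}

console.log(isExternalReferrer('http://search.live.com/results.aspx?q=people', 'www.mywebsite.com')); // true
console.log(isExternalReferrer('http://www.mywebsite.com/about/', 'www.mywebsite.com'));              // false
console.log(isExternalReferrer('', 'www.mywebsite.com'));                                             // false
```

In the page itself, document.referrer would be passed in as the first argument, and the result would decide which form of the urchinTracker() call gets written out.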

The Grand Finale
So putting all this together we get this:



Just replace the XXX’s with your Analytics account number and “www.mywebsite.com” with your website.

Testing Google Analytics Regular Expressions In Real Time

If you have a dynamic website and want to set up conversion goals within Google Analytics, you may need to use regular expressions to define the goal page.

These are extremely useful because they can be used to match specific sequences of characters in a URL. They’re a huge expansion on the common wildcard characters * and ? and tend to work out very well for ecommerce websites.

Here’s a typical example of how you can use them to great effect. Usually when you set up a conversion goal you have to specify a goal URL, that is, a page the user gets to which triggers that a conversion has taken place. In the case of an ecommerce website, it would be the receipt page, or “thank you” page once the user has purchased something. But in many cases the URL of that page is dynamic and may contain OrderID or CustomerID variables. Since this page changes every time, we need to use a regular expression so we can match the URL regardless of the value of those variables.

If you’re new to Google Analytics, you probably started off setting up a goal with your first attempt at a regular expression, going to your website to complete a test transaction, waiting approximately 3 hours then checking your stats. Not exactly the most efficient way of going about things, especially if you’re new to regular expressions.

If that’s the way you normally do it, here’s a massive time-saving tip. You can test your regular expression live against data you have already collected, and it will take just a matter of minutes to get the goal right.

1) Once logged in to Analytics, go to Content Optimization > Content Performance > Top Content
2) In the filter at the top of the screen enter your regular expression and hit enter.
3) Look at the content results and keep fine tuning your regular expression until you see the desired match appear

Here’s a real example used on an ecommerce site. The regular expression I used to filter the data was:
^/orderthankyou\.asp.*

Click on the image to see a screenshot of the results:
Google Analytics Regular Expression Example

Just be sure to scroll through all of the results to make sure it didn’t pick up anything else, otherwise your conversion stats will be inflated.
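
You can also sanity check a pattern on your own machine before pasting it into the filter. Below is a minimal sketch that runs the expression above against a few sample paths; the paths are invented for the example, and the regular expression engine in Analytics may not behave identically to the one in javascript, so treat this as a first pass only.

```javascript
// The goal pattern from above, written as a javascript regular expression.
const goalPattern = /^\/orderthankyou\.asp.*/;

// Made-up sample paths, including a few that should NOT count as conversions.
const samplePaths = [
  '/orderthankyou.asp?OrderID=1001&CustomerID=77',
  '/orderthankyou.asp',
  '/orderthankyou.aspx?OrderID=1002',
  '/products/orderthankyou.asp',
  '/orderthankyouXasp'
];

const matchedPaths = samplePaths.filter(function (path) {
  return goalPattern.test(path);
});

console.log(matchedPaths);
// The first three paths match. Note that the .aspx page is picked up as well,
// which is exactly the kind of extra match worth looking for in the results.
```

If a match like the .aspx page is unwanted, the pattern needs tightening, for example by requiring a "?" or the end of the URL straight after ".asp".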

More Help
Google Analytics Regular Expressions Syntax
Conversion University
Google Analytics Support Pages
Google Analytics Google Group

Edit: Corrected syntax, thanks Mark.

Google Analytics Full Referrer Tracking Update

Since I wrote about tracking the full referrer in Google Analytics, I’ve had feedback from some people saying that their content management system, forum software, etc. doesn’t allow them to modify the HEAD section or BODY tag. Some other people are also hesitant to place the code in the HEAD section in case the Google Analytics servers are slow, causing the page to pause while the tracking code is downloaded from Google. Another drawback to putting the tracking code high up on the page is that you may end up counting partial page downloads.

So to get around these issues here’s an alternative which you can place just before the closing BODY tag.


Just replace the XXX’s with your Analytics account code.

Update: I’ve augmented the tracking code so that it also detects whether the page is served in http or https mode and makes the appropriate call to the urchin.js file, and detects whether the referrer is internal or external so you don’t get your own site appearing as a full referrer in the Top Content Report. View my Ultimate Google Analytics Tracking Code.