Google and Facebook Can Benefit from Understanding Proxies

Posted by Vectro 28 September 2011

Google and Facebook seem to lack an understanding of what proxy scripts are and how they behave. This sometimes leads to technical mistakes and misunderstandings. These are honest mistakes, of course, but one would expect two of the most technologically advanced companies to have a better handle on this.

A suffix proxy is sometimes mistaken for a phishing attack or for duplicate content because of the URLs it generates, which rely on a DNS wildcard to establish the proxied connection. Google has, on more than one occasion, attempted to crawl the URLs generated by Glype’s browse.php. There are steps proxy webmasters can take to prevent these problems, but it would also benefit Google and Facebook to understand the technology better.

EXAMPLE 1: Let’s say there is a suffix proxy script on example.com. Visiting www.facebook.com.example.com would bring you to the actual Facebook site, proxied through example.com. The problem is that the Facebook security team may mistake this for a fraudulent copy of the Facebook homepage.
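
For illustration only, here is a rough sketch of how a suffix proxy might derive the target site from the requested hostname. The variable names and the hard-coded .example.com suffix are assumptions for the example, not code from any particular proxy script:

<?php
// Rough sketch of suffix proxy host handling.
// A wildcard DNS record (*.example.com) points every subdomain at this script.
$host   = $_SERVER['HTTP_HOST'];   // e.g. "www.facebook.com.example.com"
$suffix = '.example.com';

if (substr($host, -strlen($suffix)) === $suffix) {
    // Strip the proxy's own suffix to recover the real target host.
    $target = substr($host, 0, -strlen($suffix));   // "www.facebook.com"
    // The proxy would then fetch the target site over HTTP and return the
    // response, rewriting links so they also point back through *.example.com.
}
?>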

The solution for proxy webmasters is to block access to Facebook through a suffix proxy while still allowing it through Glype. Facebook does not seem to have an issue with people using Glype to access it.
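
A minimal sketch of how such a restriction could work in a suffix proxy, assuming a $target variable like the one in the previous example. The blocklist check itself is hypothetical, not part of any particular script:

<?php
// Hypothetical blocklist check for a suffix proxy.
$blocked = array('facebook.com');

foreach ($blocked as $domain) {
    // Refuse the domain itself and any of its subdomains.
    if ($target === $domain || substr($target, -strlen('.' . $domain)) === '.' . $domain) {
        header('HTTP/1.1 403 Forbidden');
        exit('Access to this site is not available through this proxy.');
    }
}
?>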

EXAMPLE 2: Google mistakes suffix proxy pages for duplicate content. In rare cases, this can cause the original site to be removed from the results while the suffix proxy version stays indexed in its place. This hurts the original content creator, whose site disappears from the search results and loses traffic. The solution might be for suffix proxy webmasters to use robots.txt in some way to keep proxied pages out of the index (one possible approach is sketched below). It would also help if Google became more aware of this technology as it grows in popularity; understanding it would let them make whatever updates might be necessary on their end.
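
One possible approach, assuming the proxy script can intercept requests for /robots.txt on its wildcard subdomains and serve its own file instead of proxying the target’s, would be to answer every proxied hostname with something like:

User-agent: *
Disallow: /

Crawlers request robots.txt separately for each hostname, so the proxy’s own homepage on example.com could still serve a normal robots.txt while the wildcard subdomains stay out of the index.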

EXAMPLE 3: Google’s crawler does not seem to understand Glype. Sometimes it indexes one proxied URL, finds more proxied links inside it, and attempts to crawl those as well. This leads to Google attempting to index hundreds or thousands of proxied pages from a single proxy, which increases strain on the server and causes spikes in resource usage. The high volume of requests hitting Glype in rapid succession can trigger web server errors, and Google sometimes retries proxied pages that return HTTP error codes, which only adds to the problem. In some cases, the site becomes unresponsive or crashes completely. If Google understood proxies better, perhaps its crawler could recognize a Glype installation when it encounters one and, ideally, stop trying to index proxied pages repeatedly. In all likelihood, proxied pages simply look like duplicate content to Googlebot, a side effect of a PHP script making a proxied connection to a website and returning the output under its own URLs.

Glype generates URLs that look somewhat like this:
http://www.example.com/browse.php?u=Oi8vd3d3Lm1zbmJjLm1zbi5jb20v&b=13
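The u parameter in that example appears to be a base64-encoded form of the target URL with the scheme name stripped, which is how Glype installations are commonly configured. A quick sketch of decoding it, assuming plain base64 encoding:

<?php
// Decode the example "u" parameter from the URL above.
echo base64_decode('Oi8vd3d3Lm1zbmJjLm1zbi5jb20v');   // prints "://www.msnbc.msn.com/"
?>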

All links within the proxied pages have similar URLs. Googlebot will sometimes get hold of one of these URLs and then try to index every link it finds inside, sometimes not stopping for a long time. This can amount to a huge number of requests to one site, tying up resources and maxing out connection limits until the proxy’s hosting server becomes unresponsive. The solution for Glype proxy webmasters is to disallow browse.php in your robots.txt file, as shown below. Of course, it wouldn’t hurt if Google came up with a solution on their end, if they’re not already working on one.
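
For example, a robots.txt at the root of the proxy site could contain:

User-agent: *
Disallow: /browse.php

This keeps well-behaved crawlers away from every browse.php URL, including all of the ?u=… variations, while leaving the proxy’s homepage crawlable.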
