Hola - Free VPN, Secure Browsing, Unrestricted Access

How to crawl a website without getting blocked or misled (cloaked)?

Why should I care?

When a target website detects crawlers from a proxy (datacenter) IP, it typically

  • Blocks the IP, or
  • Presents the IP with purposely misleading information, or
  • Throttle down the response rate

How does the target website identify my crawling activity?

Target websites log the IPs of whomever visits them and analyzes the activity of these IPs. Assuming you are using a traditional data center proxy, the target website can:

  • Identify that the activity from a single IP (the rate of requests) is much greater than what a human can accomplish in a given time frame
  • Identify that the IP address originated from a proxy server list, which these target websites have access to
  • Identify that the IPs have the same subnet block range

How to prevent being detected?

  • To prevent being detected by the amount of requests per IP, you can reduce the number of requests per seconds. However, this will reduce your crawling speed. learn about super fast crawling capacities here.
  • To prevent the target website from identifying your IP as coming from a proxy server, you must rotate your requests through residential IPs. You should be able to circulate through enough IPs that the target website can not detect your activity.
  • When using residential IPs there is no subnet block range.

By using a traditional proxy solution, it’s only a matter of time before the target website will identify your crawling activities, and can block or provide you with the wrong information.

Guidelines to rotating requests through millions of residential IPs:

  • Sign up to Luminati
  • Contact Luminati’s support team and ask for access to the residential IP network
  • Once activated, install the Luminati proxy manager from GitHub. We recommend you watch this guide for a smooth start
  • Go to the ‘proxies’ tab
  • Check the port of your ‘gen’ zone
  • Route your requests to 127.0.0.1:{portnum} where {portnum} is the port of the ‘gen’ zone