Challenges
A scraper should be able to handle the following common technical obstacles:
Captcha
Captcha is a show-stopper for scrapers, and the bad news is that most websites trigger a captcha for suspicious requests. Getting past a captcha is not easy.
HTTP fingerprinting
Apart from detecting the IP address, server-based solutions can also identify the client behind a request using HTTP fingerprinting. So, requests can be blocked if they come from a blacklisted client machine/device, even if that machine is using a new IP.
DOM manipulation
Almost all websites use DOM manipulation and dynamic content. A generic scraper script can’t access the data on these websites; moreover, the APIs serving them can’t be scraped directly unless the dynamically generated security token is passed to the server in the request header.
DDoS attack protection
Most websites have DDoS attack protection enabled, which can also block scrapers.
Changes in HTML structure
Websites change frequently. Such changes can break the scraper and require immediate updates to the script.
App-only platforms
As companies move to an “App-only” model, scrapers face a new challenge: what to scrape?
Our approach
The points above are very common challenges encountered when scraping websites/apps. Our strategy for overcoming them is detailed below:
IP blocking
We can avoid this by routing the requests through thousands of different IP addresses. We can also randomize where and when requests originate instead of sending them at a set frequency.
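A minimal sketch of this idea, assuming a small pool of proxy addresses. The addresses and the `schedule` and `fetch` names are hypothetical and only illustrate the approach: each request gets a randomly chosen proxy and a random pause, so requests do not arrive from one IP at a fixed frequency.

```python
import random
import urllib.request
from typing import Iterable, Iterator, Optional, Sequence, Tuple

# Hypothetical proxy endpoints; a real pool would come from a proxy provider.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def schedule(
    urls: Iterable[str],
    proxies: Sequence[str],
    min_delay: float = 1.0,
    max_delay: float = 5.0,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (url, proxy, delay_seconds) with a random proxy and a random pause."""
    rng = rng or random.Random()
    for url in urls:
        yield url, rng.choice(list(proxies)), rng.uniform(min_delay, max_delay)


def fetch(url: str, proxy: str, timeout: float = 10.0) -> bytes:
    """Fetch url through the given proxy using only the standard library."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

A caller would loop over `schedule(...)`, sleep for the returned delay with `time.sleep`, and then call `fetch(url, proxy)`.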
Captcha
Not all websites implement captcha security, as it severely interferes with user experience and user interaction. So, we can identify which websites have a captcha implemented and then find a solution using this approach:
- Analyze which website has implemented the captcha protection and then reverse engineer which action/event triggers the captcha. If a pattern/cause can be identified, adapt your request/code to circumvent it.
- If no pattern can be identified, or if the identified pattern can’t be circumvented, then the solution is to break the captcha. Many captcha solutions are known to contain bugs/exploits that allow them to be bypassed. If the captcha can’t be avoided and has no known exploits, apply automated OCR-based captcha recognition. If OCR-based recognition also fails, manual human interaction (human farms) is required.
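One way to look for a trigger pattern is to log, for each request, which action was performed and whether a captcha came back, and then compare captcha rates per action. The sketch below assumes such a log already exists; the action names and the `captcha_rate_by_action` helper are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def captcha_rate_by_action(
    observations: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Return the fraction of requests per action that were met with a captcha.

    Each observation is an (action, captcha_shown) pair taken from request logs.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for action, captcha_shown in observations:
        totals[action] += 1
        if captcha_shown:
            hits[action] += 1
    return {action: hits[action] / totals[action] for action in totals}
```

An action with a rate close to 1.0 is a likely trigger, while a rate that rises only at high request volume points to rate-based detection instead.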
App-Only Solution
- For such apps, we can scrape the data by identifying the APIs the app calls and scraping those APIs.
- These APIs are generally secured by a header token, which is generated on the client side by the app, so scraping the APIs directly may not work.
- To get around this, you can use the techniques that app developers use for automation testing.
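As a rough sketch, assume a token has already been obtained, for example by capturing it from an automated session of the app. The request to the API can then carry that token in a header. The header name `X-Auth-Token`, the URL, and the `build_api_request` helper are hypothetical; the real header name has to be read from the app's own traffic.

```python
import urllib.request


def build_api_request(
    url: str, token: str, header_name: str = "X-Auth-Token"
) -> urllib.request.Request:
    """Build a GET request that passes the app-generated token in a header."""
    headers = {
        header_name: token,            # assumed header name, differs per app
        "Accept": "application/json",  # mobile APIs usually return JSON
    }
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_api_request("https://api.example.com/v1/items", "abc123")
```

The request can then be sent with `urllib.request.urlopen(request)`. Tokens of this kind are often short-lived, so the capture step may need to be repeated.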
Detecting changes in web page DOM, Navigation or API structure
Websites change frequently, which can break the scraper and require immediate updates to the script. With a template-based approach, however, you can not only identify changes to the website/API immediately but also adapt your script quickly.
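A minimal sketch of how a template can reveal a structural change, assuming the template maps each field name to a dotted path into a JSON API response. The template format, the field names, and the `detect_changes` helper are assumptions made for illustration.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel: distinguishes "path not found" from a null value


def resolve(payload: Any, path: str) -> Any:
    """Follow a dotted path such as 'data.item.title' through nested dicts."""
    node = payload
    for key in path.split("."):
        if isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return _MISSING
    return node


def detect_changes(template: Dict[str, str], payload: Any) -> List[str]:
    """Return the template fields whose path no longer resolves in the payload."""
    return [
        name for name, path in template.items()
        if resolve(payload, path) is _MISSING
    ]


TEMPLATE = {"title": "data.item.title", "price": "data.item.price.amount"}
```

If `detect_changes` returns a non-empty list on a fresh response, the structure has probably changed, and only the paths in the template need updating.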
HTTP Fingerprinting
You can overcome this by using cloud computing technologies. You can launch an ephemeral cloud instance from a preconfigured system image and, as soon as the instance is blocked, destroy it and replace it with a new instance.
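The lifecycle described above can be sketched as follows. `InMemoryCloud` is a stand-in with a hypothetical `launch`/`destroy` interface, not a real provider SDK, and the caller-supplied `fetch` function is assumed to raise `BlockedError` when the target site blocks the instance.

```python
import itertools
from typing import Callable, Set


class BlockedError(Exception):
    """Raised by the fetch function when the current instance is blocked."""


class InMemoryCloud:
    """Stand-in for a cloud provider client (hypothetical interface)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.live: Set[str] = set()

    def launch(self, image: str) -> str:
        instance_id = f"{image}-{next(self._ids)}"
        self.live.add(instance_id)
        return instance_id

    def destroy(self, instance_id: str) -> None:
        self.live.discard(instance_id)


def fetch_with_replacement(
    cloud: InMemoryCloud,
    image: str,
    fetch: Callable[[str, str], str],
    url: str,
    max_instances: int = 3,
) -> str:
    """Try url on fresh instances, destroying each one after use or when blocked."""
    for _ in range(max_instances):
        instance_id = cloud.launch(image)
        try:
            return fetch(instance_id, url)
        except BlockedError:
            continue  # blocked: fall through to a brand-new instance
        finally:
            cloud.destroy(instance_id)  # ephemeral: never reuse an instance
    raise BlockedError(f"still blocked after {max_instances} instances")
```

In a real deployment the two `InMemoryCloud` methods would wrap the provider's own calls for creating and terminating instances.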
DOM Manipulation
We can scrape these websites by creating an app that simulates the behavior of a website user. The script can even log in as a user and then browse/search the records in a captive browser. As the script has full access to the content of the browser, the content can then be scraped easily.
DDoS Attack Protection
Combined, the solutions described above allow the scraper to get past this protection.
Scrapers
The scraper will have two components: templates and a worker.
Templates
- Create a rule-based parsing engine for the data scraping so that any changes in the web page can be handled swiftly without making changes to the rest of the program.
- All the rules should be stored in a file called a template.
- The template will define which data is to be scraped and how to locate the data on the target web page.
- If the structure of the target web page changes, then you only have to make changes to the relevant template and the rest of the program (Worker) will remain unchanged.
- Also, if any new data has to be extracted from the same page, this approach would ensure a significantly quicker turnaround time.
- This approach also allows us to add a new rule for a new website very easily, making the script highly extensible.
- This will be relevant for web pages as well as mobile APIs. Every web page will consequently have a separate template.
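To make the idea concrete, here is one possible shape for a template, shown as JSON text together with a small loader that checks it. The layout (`name`, `kind`, `fields`, `locator`, `required`) and the selector strings are assumptions for illustration; the rules could be stored in any format the worker can read.

```python
import json
from typing import Any, Dict

# Hypothetical template for one page; in practice it would live in its own file.
TEMPLATE_JSON = """
{
  "name": "product_page",
  "kind": "html",
  "fields": [
    {"name": "title", "locator": "h1.product-title", "required": true},
    {"name": "price", "locator": "span.price", "required": false}
  ]
}
"""


def load_template(text: str) -> Dict[str, Any]:
    """Parse a template and check that it has the keys the worker relies on."""
    template = json.loads(text)
    for key in ("name", "kind", "fields"):
        if key not in template:
            raise ValueError(f"template is missing '{key}'")
    for field in template["fields"]:
        if "name" not in field or "locator" not in field:
            raise ValueError("each field needs a 'name' and a 'locator'")
    return template


template = load_template(TEMPLATE_JSON)
```

Extracting a new piece of data from the same page then means adding one entry to `fields`, with no change to the worker code.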
Worker
The worker will consist of the actual scraping code. It has two parts:
- Reader: This is the most significant part of the whole solution. Every website has its own security and page/content loading mechanism, which needs to be analyzed to identify the key challenges before devising a strategy for the reader. On some websites the content is loaded along with the DOM, whereas others use DOM manipulation and load content dynamically.
- Extractor: This uses the rules and patterns defined in the templates to identify and extract the required data from the HTML/JSON fetched by the reader. Extracted data will be stored in a cloud-based NoSQL database.
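The split between the two parts can be sketched as below, assuming the template maps each field name to a regular expression with one capture group. The patterns, the sample HTML, and the `extract` and `run_worker` names are illustrative; the reader is passed in as a plain function so that each site can supply its own loading strategy. Storing the result in a database is left out.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical template: field name -> regular expression with one capture group.
TEMPLATE = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "price": r'<span class="price">(.*?)</span>',
}


def extract(template: Dict[str, str], html: str) -> Dict[str, Optional[str]]:
    """Extractor: apply every template rule to the content the reader returned."""
    record: Dict[str, Optional[str]] = {}
    for name, pattern in template.items():
        match = re.search(pattern, html, re.DOTALL)
        record[name] = match.group(1).strip() if match else None
    return record


def run_worker(
    reader: Callable[[str], str], template: Dict[str, str], url: str
) -> Dict[str, Optional[str]]:
    """Worker: the reader fetches raw content, the extractor turns it into data."""
    return extract(template, reader(url))


# A stub reader standing in for a site-specific one (plain HTTP, browser, etc.).
SAMPLE_HTML = '<html><h1 id="t"> Blue Kettle </h1><span class="price">19.99</span></html>'
record = run_worker(lambda url: SAMPLE_HTML, TEMPLATE, "https://example.com/p/1")
```

Because the extractor only sees the template and the raw content, a site redesign changes the patterns in the template while `extract` and `run_worker` stay the same.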