
Technology

Screen scraping is, according to Wikipedia, a technique in which a computer program extracts data from the display output of another program. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.

Wikipedia goes on to explain how difficult scraping is from a technological standpoint.

Screen scraping is generally considered an ad-hoc, inelegant technique, often used only as a last resort when no other mechanism is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

Before the WordCaptureX library, screen scraping solutions relied on running OCR over screenshots. OCR is traditionally slow, error prone (the best accuracy rates are 95% for typewritten documents), and extremely expensive. It is also poorly suited to application screens, because many UI elements interfere with the OCR algorithms and produce undesired results.

The original idea behind WordCaptureX was to intercept and analyze Windows GDI API calls in order to detect the text that an application displays in a given rectangle. The idea is clean, but the implementation is not trivial.
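To make the idea concrete, here is a minimal sketch of the last step only: once text-drawing calls have been recorded together with the rectangle each string was painted into, finding the text in a region reduces to a rectangle intersection test. It does not show how calls are intercepted, and it is not the WordCaptureX implementation. The names `Rect`, `TextDraw`, `text_in_rect`, and the sample log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int   # exclusive edge
    bottom: int  # exclusive edge

    def intersects(self, other: "Rect") -> bool:
        # Two rectangles overlap when they overlap on both axes.
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass(frozen=True)
class TextDraw:
    """One recorded text-output call: the string and where it was painted."""
    text: str
    bounds: Rect


def text_in_rect(draws: list[TextDraw], target: Rect) -> str:
    """Return the text of every recorded draw overlapping target, in reading order."""
    hits = [d for d in draws if d.bounds.intersects(target)]
    # Calls arrive in paint order, so sort top-to-bottom, then left-to-right.
    hits.sort(key=lambda d: (d.bounds.top, d.bounds.left))
    return " ".join(d.text for d in hits)


# Hypothetical log of recorded calls for a small dialog (assumed data).
log = [
    TextDraw("Cancel", Rect(120, 200, 170, 220)),
    TextDraw("Name:", Rect(10, 10, 50, 26)),
    TextDraw("Alice", Rect(60, 10, 100, 26)),
    TextDraw("OK", Rect(60, 200, 90, 220)),
]

print(text_in_rect(log, Rect(0, 0, 200, 40)))  # prints "Name: Alice"
```

A real implementation also has to deal with clipping, overlapping windows, text that is painted over later, and coordinate transforms, which is where most of the difficulty lies.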

It took us 5 years of continuous development to achieve the solid performance of today's versions. Our stress test regularly performs 1 million consecutive screen scrapings without any crash or degradation in system performance.

Besides the API interception method, which we named Native, we have added a second scraping method, named FullText, which works with the API that a graphical user interface exposes in order to extract data from windows or controls. This second method is more limited in scope, but it can extract the entire text from a scrolling window and it offers better performance when scraping particular applications. Generally speaking, you need to try both methods to see which one is better suited to your requirements.

As a last resort, we have leveraged Google's Tesseract OCR technology to scrape applications whose rendering engine does not expose anything to the outside world. Tesseract was designed to recognize printed fonts, so we had to tweak it to read screen fonts.
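One way a calling program might combine the three methods is to try them in order and keep the first one that returns text. The sketch below illustrates that pattern only. The function `scrape_with_fallback` and the three stand-in scrapers are hypothetical and do not reflect the WordCaptureX API; in a real program each stand-in would be replaced by a call into the scraping library.

```python
from typing import Callable, Optional

# A scraper takes a window handle and returns the text it found, or None.
Scraper = Callable[[int], Optional[str]]


def scrape_with_fallback(hwnd: int, methods: list[tuple[str, Scraper]]) -> tuple[str, str]:
    """Try each method in order; return (method_name, text) for the first non-empty result."""
    for name, scrape in methods:
        try:
            text = scrape(hwnd)
        except Exception:
            continue  # treat a failing method as "no result" and move on
        if text and text.strip():
            return name, text
    return "none", ""


# Stand-in scrapers with assumed behaviour, for illustration only.
def native(hwnd: int) -> Optional[str]:
    return None  # e.g. the application draws nothing through intercepted calls


def full_text(hwnd: int) -> Optional[str]:
    return ""  # e.g. the control exposes no text through the UI API


def ocr(hwnd: int) -> Optional[str]:
    return "Total: 42.00"  # last resort: text recognised from a screenshot


result = scrape_with_fallback(
    0x1234, [("Native", native), ("FullText", full_text), ("OCR", ocr)]
)
print(result)  # prints ('OCR', 'Total: 42.00')
```

Ordering the list from cheapest and most exact to slowest and least exact keeps OCR out of the way unless the other two methods return nothing.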


