Screen scraping is, according to Wikipedia, a technique in which a computer program extracts data from the display output of another program. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.

Wikipedia goes on to describe how difficult scraping is from a technological standpoint. Screen scraping is generally considered an ad-hoc, inelegant technique, often used only as a last resort when no other mechanism is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

Prior to the WordCaptureX library, screen scraping solutions
were based on applying OCR techniques to screenshots. OCR is
traditionally slow and error-prone (the best accuracy rates are 95% for
typewritten documents). It is also poorly suited to application screens,
because many UI elements interfere with the OCR algorithms and produce
undesired results, and it is extremely expensive.

The original idea behind WordCaptureX was to intercept and analyze Windows GDI API calls in
order to detect the particular text that an application is displaying
in a given rectangle. While the idea is clean and simple, the
implementation is not trivial. It took us 5 years of continuous
development to achieve the solid performance of today's versions. Our
stress test regularly performs 1 million consecutive screen scrapings
without any crash or degradation in system performance.

Besides the
API interception method, which we named the Native method, we have added a
new scraping method, named FullText, that works with the API
that a graphical user interface exposes in order to extract data
from windows or controls. This second method is more limited in scope,
but it can extract the entire text from a scrolling window,
and it offers improved performance when scraping from particular
applications. Generally speaking, you need to try both methods to see
which one is better suited to your particular requirements.

As a last resort, we have leveraged Google Tesseract OCR technology to
scrape those applications that use a rendering engine that does not expose anything to the outside world. Tesseract was designed to recognize printed fonts, so we had to tweak it to be able to read screen fonts.
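
The way the three methods complement each other can be illustrated with a minimal sketch. It assumes each method is wrapped in a plain callable that takes a screen rectangle and returns the text it found, or nothing. The names `Rect`, `Scraper`, `ScrapeResult`, `scrape_with_fallback`, and the stub functions are hypothetical and are not part of the WordCaptureX API. The ordering (Native, then FullText, then OCR) is also only an assumption for the example, since the better of the first two methods depends on the target application.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types for illustration only; these are NOT the WordCaptureX API.
Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels
Scraper = Callable[[Rect], Optional[str]]  # returns None or "" when nothing is found


@dataclass
class ScrapeResult:
    method: str  # name of the method that produced the text
    text: str


def scrape_with_fallback(
    rect: Rect, methods: Sequence[Tuple[str, Scraper]]
) -> Optional[ScrapeResult]:
    """Try each scraping method in order and return the first non-empty result."""
    for name, scraper in methods:
        try:
            text = scraper(rect)
        except Exception:
            # A method that fails on this application is skipped, not fatal.
            continue
        if text and text.strip():
            return ScrapeResult(method=name, text=text)
    return None


# Stand-in scrapers simulating an application whose rendering engine
# exposes nothing to API interception or to the GUI API.
def native_stub(rect: Rect) -> Optional[str]:
    return None


def full_text_stub(rect: Rect) -> Optional[str]:
    return ""


def ocr_stub(rect: Rect) -> Optional[str]:
    return "Total: 42.00"


result = scrape_with_fallback(
    (0, 0, 200, 40),
    [("Native", native_stub), ("FullText", full_text_stub), ("OCR", ocr_stub)],
)
print(result)
```

In this sketch the OCR stub is reached only because the two earlier stubs return nothing, which mirrors its role as a last resort. In practice you would record which method succeeds for each target application and put that method first.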