Re: [gentoo-user] extracting text, numbers from screencasts

From: "Urs Schütz" <u.schutz@bluewin.ch>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] extracting text, numbers from screencasts
Date: Fri, 8 Apr 2016 21:54:30 -0300
Message-ID: <570852C6.1070601@bluewin.ch>
In-Reply-To: <m/+pKjbYIbfzWo/QhKO4vz@Ml3T0rT5WgVXnfMZucxeg>

On 04/08/16 11:30, Helmut Jarausch wrote:
> On 04/08/2016 03:26:53 PM, hw wrote:
>>
>> Hi,
>>
>> what would be the best approach to extract data
>> from a screencast?
>>
>> The task is to acquire some data from the display of
>> a GUI program used interactively by a user.  There are
>> a couple 'fields' (as in "designated areas of the display")
>> in which the relevant data is being displayed while the
>> program is being used.  The acquired data needs to be
>> entered into a mysql database, preferably as soon as
>> possible.  (The program needs windoze, and the sources
>> are unavailable :( )
>>
>>
>> The idea is to make a screen recording and postprocess
>> the recording with some sort of OCR software.  This might
>> require using ffmpeg (or the like) to create a single
>> image from each frame of the recording; then treat each
>> image with an OCR software to get the interesting data
>> which can then be entered into the database.
>>
>> Data to extract is mostly numbers.  The relevant fields
>> can be expected to be either filled or empty.  The FPS rate
>> of the recording can be kept reasonably low, like 1 FPS,
>> or perhaps even less, depending on how frequent the relevant
>> fields change.
>>
>> Using tesseract comes to mind, but after reading that
>>
>> "Tesseract's output will be very poor quality if the input
>> images are not preprocessed to suit it: Images (especially
>> screenshots) must be scaled up such that the text x-height
>> is at least 20 pixels,[12] any rotation or skew must be
>> corrected or no text will be recognized, low-frequency
>> changes in brightness must be high-pass filtered, or
>> Tesseract's binarization stage will destroy much of the
>> page, and dark borders must be manually removed, or they
>> will be misinterpreted as characters."[1]
>>
>> I'm even more doubtful that this would produce usable
>> results with sufficient reliability.
>>
>> So what might be the best way to get text/numbers out of
>> what a program displays?
>>
>>
>> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>>
>
> I can't help with Gentoo.
> Try to find an old (free) version of FineReader which runs under wine.
> If you do it only occasionally, transfer the image to an Android phone
> where there are good and cheap OCR apps, even FineReader.
>
>
>

I recently had surprisingly good results with tesseract when digitizing
photographed pages of an old book. So I gave it a try today
with a cropped screenshot of thunderbird.

$ convert scrsht.png -type Grayscale -filter point -resize 300% \
    -normalize upscaled.png
$ tesseract -l eng upscaled.png out
$ less out.txt

convert is from media-gfx/imagemagick-6.9.0.3
tesseract is app-text/tesseract-3.04.00-r2

Here are my findings:
Any graphical element similar in size to a character appears as strange
letters.
Recognition of serif fonts was better than sans-serif fonts, even at 
smaller font size.
Text which can be spell-checked was nearly perfectly recognized.
Gentoo-specific words like "GLSA" and "NVMe" were not correctly recognized.
Selected text (white on blue background) was poorly recognized.
Dates were not recognized correctly.
Times were correctly read.
"convert" time for an initial screenshot size of 956 x 639 pixels was 0.4
seconds.
"tesseract" time was a little more than 6s on an Intel(R) Core(TM) 
i7-4710MQ CPU @ 2.50GHz, without opencl.
The image conversion and tesseract ocr could easily be scripted.
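
As a rough idea of what such a script could look like, here is a minimal
sketch that wraps the two commands above in a shell function. The function
names (ocr_basename, ocr_screenshot) and the "-upscaled.png" file naming are
made up for illustration; only the convert and tesseract invocations come
from the test described above.

```shell
#!/bin/sh
# Sketch only: upscale one screenshot and OCR it with the same
# convert/tesseract options as in the manual test.

# Strip directory and extension: /tmp/shots/scrsht.png -> scrsht
ocr_basename() {
    b=${1##*/}
    printf '%s\n' "${b%.*}"
}

# ocr_screenshot IMAGE
# Writes <name>-upscaled.png and <name>.txt into the current directory
# and prints the recognized text.
ocr_screenshot() {
    in=$1
    base=$(ocr_basename "$in")
    up="${base}-upscaled.png"
    convert "$in" -type Grayscale -filter point -resize 300% \
        -normalize "$up" &&
    tesseract -l eng "$up" "$base" &&
    cat "${base}.txt"
}

# Possible use, assuming the screenshots sit in ./shots:
#   for f in shots/*.png; do ocr_screenshot "$f"; done
```

With roughly 6 seconds of tesseract time per full screenshot, a loop like
this would only keep up with a low capture rate unless the images are
cropped first.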

In short I would say that the following steps would help with tesseract:
Avoid GUI with a lot of graphics.
Try to screenshot just the relevant areas.
Increase GUI font size.
Configure GUI to use a well known serif font, or train tesseract for the 
specific font used.
Configure GUI to use high contrasts, avoid colors which get converted to 
gray.
Tesseract time could be improved by enabling opencl.
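
For numeric fields, two of the points above (screenshot just the relevant
area, avoid stray graphics) might be combined roughly as follows. This is an
untested sketch: the crop geometry, the file name field.png and the function
names are placeholders, and I have not measured how well it works. It
assumes the tesseract 3.x option spelling "-psm" (newer releases use
"--psm") and the tessedit_char_whitelist variable to limit recognition to
digits and a few separators.

```shell
#!/bin/sh
# Sketch only: OCR one rectangular field of a screenshot as a number.

# Keep only digits, dots, commas and minus signs; drop empty lines.
clean_number() {
    tr -cd '0-9.,\n-' | sed '/^$/d'
}

# read_field IMAGE GEOMETRY
# GEOMETRY is an ImageMagick crop spec, e.g. 120x24+300+410
# (width x height + x offset + y offset of the field).
read_field() {
    convert "$1" -crop "$2" +repage -type Grayscale -filter point \
        -resize 300% -normalize field.png &&
    tesseract field.png stdout -l eng -psm 7 \
        -c tessedit_char_whitelist=0123456789.,- |
        clean_number
}

# Possible use:
#   value=$(read_field scrsht.png 120x24+300+410)
```

Page segmentation mode 7 tells tesseract to treat the image as a single
line of text, which suits a cropped field better than full page layout
analysis. The clean_number filter is only a safety net for stray
characters; it cannot repair misread digits.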

I would be interested to hear about your findings with numerical data, 
and which approach finally works for you.

Urs
