* [gentoo-user] extracting text, numbers from screencasts
From: hw @ 2016-04-08 13:26 UTC
To: gentoo-user

Hi,

what would be the best approach to extract data from a screencast?

The task is to acquire some data from the display of a GUI program used interactively by a user. There are a couple of 'fields' (as in "designated areas of the display") in which the relevant data is displayed while the program is being used. The acquired data needs to be entered into a MySQL database, preferably as soon as possible. (The program needs Windows, and the sources are unavailable :( )

The idea is to make a screen recording and postprocess the recording with some sort of OCR software. This might require using ffmpeg (or the like) to create a single image from each frame of the recording, then running OCR on each image to get the interesting data, which can then be entered into the database.

Data to extract is mostly numbers. The relevant fields can be expected to be either filled or empty. The frame rate of the recording can be kept reasonably low, like 1 FPS or perhaps even less, depending on how frequently the relevant fields change.

Using tesseract comes to mind, but after reading that

"Tesseract's output will be very poor quality if the input images are not preprocessed to suit it: Images (especially screenshots) must be scaled up such that the text x-height is at least 20 pixels,[12] any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters."[1]

I'm even more doubtful that this would produce usable results with sufficient reliability.

So what might be the best way to get text/numbers out of what a program displays?

[1]: https://en.wikipedia.org/wiki/Tesseract_(software)
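
The frame-extraction step described in the message above could look roughly like the following. This is only a sketch under assumptions: the file names, the crop rectangle and the helper names `frame_dump_command` and `dump_frames` are made up for illustration, and ffmpeg is assumed to be installed. It uses ffmpeg's `fps` and `crop` video filters to sample one frame per second and keep only the region holding a field.

```python
import subprocess
from pathlib import Path


def frame_dump_command(video, out_dir, fps=1, crop=None):
    """Build an ffmpeg command that writes one PNG per sampled frame.

    crop is an optional (width, height, x, y) rectangle; the values used
    below are hypothetical and depend on where the field sits on screen.
    """
    filters = [f"fps={fps}"]  # sample the recording at a low frame rate
    if crop is not None:
        w, h, x, y = crop
        filters.append(f"crop={w}:{h}:{x}:{y}")  # keep only the field area
    pattern = str(Path(out_dir) / "frame_%06d.png")
    return ["ffmpeg", "-i", str(video), "-vf", ",".join(filters), pattern]


def dump_frames(video, out_dir, fps=1, crop=None):
    """Create the output directory and run ffmpeg (must be on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(frame_dump_command(video, out_dir, fps, crop), check=True)


if __name__ == "__main__":
    # Only prints the command; call dump_frames() to actually run it.
    print(frame_dump_command("screencast.mkv", "frames", crop=(200, 40, 100, 300)))
```

Cropping at this stage keeps the images small, which should also reduce the work left for the OCR step.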

* Re: [gentoo-user] extracting text, numbers from screencasts
From: Helmut Jarausch @ 2016-04-08 14:30 UTC
To: gentoo-user

On 04/08/2016 03:26:53 PM, hw wrote:
> [...]
> So what might be the best way to get text/numbers out of
> what a program displays?

I can't help with Gentoo.
Try to find an old (free) version of FineReader which runs under Wine.
If you only do it occasionally, transfer the image to an Android phone, where there are good and cheap OCR apps, even FineReader.

* Re: [gentoo-user] extracting text, numbers from screencasts
From: Urs Schütz @ 2016-04-09 0:54 UTC
To: gentoo-user

On 04/08/16 11:30, Helmut Jarausch wrote:
> On 04/08/2016 03:26:53 PM, hw wrote:
>> [...]
>> So what might be the best way to get text/numbers out of
>> what a program displays?
>
> I can't help with Gentoo.
> Try to find an old (free) version of FineReader which runs under wine.
> [...]

I recently had surprisingly good results with tesseract when digitizing photographed pages of an old book. So I gave it a try today with a cropped screenshot of Thunderbird.

    $ convert scrsht.png -type Grayscale -filter point -resize 300% -normalize upscaled.png
    $ tesseract -l eng upscaled.png out
    $ less out.txt

convert is from media-gfx/imagemagick-6.9.0.3
tesseract is app-text/tesseract-3.04.00-r2

Here are my findings:
Any graphical element similar in size to a character appears as strange letters.
Recognition of serif fonts was better than of sans-serif fonts, even at smaller font sizes.
Text which can be spell-checked was recognized nearly perfectly.
Gentoo-specific words like "GLSA" and "NVMe" were not recognized correctly.
Selected text (white on blue background) was poorly recognized.
Dates were not recognized correctly.
Times were read correctly.
"convert" time for an initial screenshot size of 956 x 639 pixels was 0.4 seconds.
"tesseract" time was a little more than 6s on an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, without OpenCL.
The image conversion and tesseract OCR could easily be scripted.

In short, I would say that the following steps would help with tesseract:
Avoid GUIs with a lot of graphics.
Try to screenshot just the relevant areas.
Increase the GUI font size.
Configure the GUI to use a well-known serif font, or train tesseract for the specific font used.
Configure the GUI to use high contrast; avoid colors which get converted to gray.
Tesseract time could be improved by enabling OpenCL.

I would be interested to hear about your findings with numerical data, and which approach finally works for you.

Urs

* Re: [gentoo-user] extracting text, numbers from screencasts
From: R0b0t1 @ 2016-04-09 5:37 UTC
To: gentoo-user

Reading GUIs is a lot easier than most things tesseract was designed for. You may still need a little preprocessing; I suggest OpenCV. Basically: enlarge the image, filter it to remove noise (sharpening, blob detection, and perhaps another filter), then threshold it to get a black-and-white image. Most of this is unneeded for a GUI.

OpenCV helps greatly in the situations Schütz describes. Modifying the character data is better but more time-intensive.
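
To make the "enlarge, then threshold" steps concrete, here is a deliberately tiny sketch in plain Python that operates on a grayscale image stored as a list of rows. It does not use OpenCV, so that it runs without extra packages; in OpenCV the corresponding calls would be `cv2.resize` and `cv2.threshold`. The cutoff of 128 and the function names are arbitrary choices for illustration, and the noise-filtering step is left out.

```python
def enlarge(gray, factor):
    """Nearest-neighbour upscaling of a 2D list of 0-255 gray values."""
    out = []
    for row in gray:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def threshold(gray, cutoff=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in gray]


if __name__ == "__main__":
    tiny = [[30, 200],
            [220, 90]]
    for row in threshold(enlarge(tiny, 2)):
        print(row)
```

Nearest-neighbour scaling keeps the hard pixel edges of screen fonts, which is the same effect the `-filter point` option has in the ImageMagick command earlier in the thread.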

* Re: [gentoo-user] extracting text, numbers from screencasts
From: hw @ 2016-05-07 14:57 UTC
To: gentoo-user

Urs Schütz wrote:
> [...]
> Times were correctly read.
> "convert" time for an initial screenshot size of 956 x 639 pixels was 0.4 seconds.
> "tesseract" time was a little more than 6s on an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, without opencl.
> The image conversion and tesseract ocr could easily be scripted.

Considering the amount of video, 6s per frame would be too long. The application is time-critical: I have a window of about 10s to extract and process the data from at least 8 video streams. Recording at only 10 FPS and taking 8 seconds to extract and process each frame would require 8 × 10 × 8s = 640s of processing per second of video, i.e. 6400s per 10s window, and I don't have some 640 CPUs available to do the work. To make things worse, it's an ongoing process, i.e. dividing it into 10s windows is too artificial to keep things running as smoothly as they should.

> [...]
> I would be interested to hear about your findings with numerical data, and which approach finally works for you.

Thank you very much for giving me a better idea of what I'm looking at!

Considering it, I have resorted to using AutoHotkey, which can actually read data from GUI elements. It can also make requests to web servers. With that, things become a hell of a lot simpler than trying to process video streams, because I can simply read the data and send it to the web server, which puts it into the database where it needs to end up anyway.

Unfortunately, the application the data is being read from has a bad habit of renaming the GUI elements I need to read. This makes things difficult again. AutoHotkey is a really nice tool, though.
I wonder if there is an equivalent for X11.
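
The receiving side of the setup described above (a client reads the values and sends them to a web server, which stores them) could be sketched as follows. Everything specific here is an assumption for illustration: the URL shape `/submit?field=...&value=...`, the table layout, the port, and the use of `sqlite3` as a stand-in for MySQL so that the example needs only the Python standard library.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# sqlite3 stands in for MySQL here; an in-memory database keeps the sketch
# self-contained. The table layout is hypothetical.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute(
    "CREATE TABLE readings ("
    "field TEXT, value REAL, received TEXT DEFAULT CURRENT_TIMESTAMP)"
)


def store_reading(conn, query_string):
    """Insert one 'field=...&value=...' reading.

    Returns False if a parameter is missing, empty, or not a number.
    """
    params = parse_qs(query_string)  # blank values are dropped by default
    try:
        field = params["field"][0]
        value = float(params["value"][0])
    except (KeyError, ValueError):
        return False
    conn.execute("INSERT INTO readings (field, value) VALUES (?, ?)", (field, value))
    conn.commit()
    return True


class ReadingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical URL shape: /submit?field=weight&value=12.5
        ok = store_reading(db, urlparse(self.path).query)
        self.send_response(204 if ok else 400)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), ReadingHandler).serve_forever()
```

A plain GET with query parameters is used only because it is the simplest request for a scripting client to issue; a POST would be the more conventional choice for writes.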

* Re: [gentoo-user] extracting text, numbers from screencasts
From: hw @ 2016-05-07 14:31 UTC
To: gentoo-user

Helmut Jarausch wrote:
> [...]
> I can't help with Gentoo.
> Try to find an old (free) version of FineReader which runs under wine.
> If you do it only occasionally, transfer the image to an Android phone where there are good and cheap OCR apps, even FineReader.

It would be too much video to process. Besides, phones are OK for making phone calls and entirely incompatible with computers, which makes them useless for anything but making phone calls.

* Re: [gentoo-user] extracting text, numbers from screencasts
From: Alan McKinnon @ 2016-05-07 23:30 UTC
To: gentoo-user

On 07/05/2016 16:31, hw wrote:
> Helmut Jarausch wrote:
>> [...]
>> If you do it only occasionally, transfer the image to an Android phone
>> where there are good and cheap OCR apps, even FineReader.
>
> It would be too much video to process. Besides, phones are
> ok for making phone calls and entirely incompatible with
> computers, which makes them useless for anything else but
> making phone calls.

Huh? da fuck you talkin' 'bout?

My trusty collection of Android devices would be very surprised to hear they now don't have real CPUs, wifi chips, RAM and storage. Or can't run a web browser, do email, instant chat, play x264 video with less CPU load than my 8-core laptop, share with smb on the network, do bluetooth, video calls or any of the other bazillion things computers have always done with each other.

How odd. I really thought my Android phones could do all of that. I must have imagined it .... that means my delusions are worse than I thought and maybe I need different and more pills from the nice lady who's my GP.

--
Alan McKinnon
alan.mckinnon@gmail.com