---
title: "Using the Tesseract OCR engine in R"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    fig_caption: false
vignette: >
  %\VignetteIndexEntry{Using the Tesseract OCR engine in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The tesseract package provides R bindings to [Tesseract](https://github.com/tesseract-ocr/tesseract):
a powerful optical character recognition (OCR) engine that supports over 100
languages. The engine is highly configurable in order to tune the detection
algorithms and obtain the best possible results.

Keep in mind that OCR (pattern recognition in general) is a very difficult
problem for computers. Results will rarely be perfect and the accuracy rapidly
decreases with the quality of the input image. But if you can get your input
images to reasonable quality, Tesseract can often help to extract most of the
text from the image.

## Extract Text from Images

OCR is the process of finding and recognizing text inside images, for example
from a screenshot or a scanned page. The image below has some example text:

![Image with eight lines of English text](../man/figures/wilde.png)

The `ocr()` function extracts text from an image file. After indicating the
engine for the language, it will return the text found in the image:

```r
library(cpp11tesseract)
file <- system.file("examples", "wilde.png", package = "cpp11tesseract")
eng <- tesseract("eng")
text <- ocr(file, engine = eng)
cat(text)
```

```
Complete Works
oF
OSCAR WILDE
EDITED BY

ROBERT ROSS
MISCELLANIES
‘AUTHORIZED EDITION

THE WYMAN-FOGG COMPANY

BOSTON :: MASSACHUSETTS
```
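
Since `ocr()` returns the text as a single string with embedded newlines, it is often convenient to split it into lines before further processing. The sketch below uses a string copied from the first lines of the output above instead of calling the engine, so it runs without Tesseract installed.

```r
# First lines of the OCR output above, as one string with newlines
text <- "Complete Works\noF\nOSCAR WILDE\nEDITED BY\n\nROBERT ROSS\n"

# Split on newlines and drop empty lines
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]

lines
length(lines)
```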

The `ocr_data()` function returns all words in the image along with a bounding
box and confidence rate.

```r
results <- ocr_data(file, engine = eng)
results
```

```
# A tibble: 18 × 4
   word          confidence bbox            stringsAsFactors
   <chr>              <dbl> <chr>           <lgl>           
 1 Complete            96.5 159,48,304,79   FALSE           
 2 Works               96.5 320,49,422,74   FALSE           
 3 oF                  66.3 281,102,300,111 FALSE           
 4 OSCAR               94.9 92,131,274,161  FALSE           
 5 WILDE               96.6 308,132,490,167 FALSE           
 6 EDITED              95.2 248,187,303,197 FALSE           
 7 BY                  96.6 314,187,334,197 FALSE           
 8 ROBERT              96.1 207,212,302,227 FALSE           
 9 ROSS                96.1 318,212,373,227 FALSE           
10 MISCELLANIES        90.8 195,298,389,316 FALSE           
11 ‘AUTHORIZED         82.2 200,504,306,515 FALSE
12 EDITION             96.8 315,503,382,514 FALSE           
13 THE                 93.2 144,664,184,677 FALSE           
14 WYMAN-FOGG          90.3 195,663,331,676 FALSE           
15 COMPANY             95.8 342,662,438,675 FALSE           
16 BOSTON              66.6 144,693,218,706 FALSE           
17 ::                  81.4 246,697,255,705 FALSE           
18 MASSACHUSETTS       90.9 279,691,438,704 FALSE
```
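
The `bbox` column is a comma-separated string, so it usually needs to be split before it can be used numerically. The following is a minimal sketch on a few rows copied from the output above. The column names `x1`, `y1`, `x2`, `y2` assume that the four numbers are the left, top, right and bottom pixel coordinates, and the confidence threshold of 80 is an arbitrary value chosen for illustration.

```r
# A few rows copied from the ocr_data() output above
results <- data.frame(
  word = c("Complete", "Works", "oF", "OSCAR", "BOSTON"),
  confidence = c(96.5, 96.5, 66.3, 94.9, 66.6),
  bbox = c(
    "159,48,304,79", "320,49,422,74", "281,102,300,111",
    "92,131,274,161", "144,693,218,706"
  ),
  stringsAsFactors = FALSE
)

# Split each "a,b,c,d" string into four integer columns
coords <- do.call(rbind, lapply(strsplit(results$bbox, ","), as.integer))
colnames(coords) <- c("x1", "y1", "x2", "y2") # assumed coordinate order
results <- cbind(results[c("word", "confidence")], coords)

# Words below an arbitrary confidence threshold deserve a manual check
low_conf <- results[results$confidence < 80, ]
low_conf$word
```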

## Language Data

The tesseract OCR engine uses language-specific training data to recognize
words. The OCR algorithms bias towards words and sentences that frequently
appear together in a given language, just like the human brain does. Therefore
the most accurate results will be obtained when using training data in the
correct language.

Use `tesseract_info()` to list the languages that you currently have installed.

```r
tesseract_info()
```

```
$datapath
[1] "/usr/share/tesseract-ocr/5/tessdata/"

$available
[1] "chi_sim" "eng"     "osd"    

$version
[1] "5.4.1"

$configs
 [1] "alto"             "ambigs.train"     "api_config"       "bigram"          
 [5] "box.train"        "box.train.stderr" "digits"           "get.images"      
 [9] "hocr"             "inter"            "kannada"          "linebox"         
[13] "logfile"          "lstm.train"       "lstmbox"          "lstmdebug"       
[17] "makebox"          "page"             "pdf"              "quiet"           
[21] "rebox"            "strokewidth"      "tsv"              "txt"             
[25] "unlv"             "wordstrbox"
```
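
The `available` element can be used to decide which languages still need to be downloaded. The sketch below works on a plain list that mimics the output above, so it does not call the package. The `wanted` vector is a hypothetical set of languages for illustration.

```r
# A list shaped like the tesseract_info() output above
info <- list(
  datapath = "/usr/share/tesseract-ocr/5/tessdata/",
  available = c("chi_sim", "eng", "osd"),
  version = "5.4.1"
)

# Hypothetical languages needed for a project
wanted <- c("eng", "chi_sim", "grc")

# Languages that are wanted but not installed yet
missing_langs <- setdiff(wanted, info$available)
missing_langs
```

Each element of `missing_langs` could then be passed to `tesseract_download()`.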

By default the R package only includes English training data. Windows and Mac
users can install additional training data using `tesseract_download()`. Let's
OCR a screenshot from Wikipedia in Simplified Chinese.

![Image with thirteen lines of Chinese text](../man/figures/chinese.jpg)

```r
# Download once
dir <- tempdir()
tesseract_download("chi_sim", model = "fast", datapath = dir)
```

```
 Downloaded: 2.35 MB  (100%)
[1] "/tmp/RtmpfeKjPP/chi_sim.traineddata"
```

```r
# Load the dictionary
file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract")
text <- ocr(file, engine = tesseract("chi_sim", datapath = dir))
cat(text)
```

```
奥林匹克运动会 (希腊语;: OMXohrtakot AYWvsc; 法语: Jeux olympiques; 英语:
Olympic Games) ，简称奥运会、奥运，是世界最高等级的国际综合体育蹇事，由国际
奥林匹克委员会主办，每4年举行一次。冬季训技项目创立冬季奥林匹克运动会后，之前
的奥林匹克运动会则是又称为、夏季奥林匹克还动会」 以示区分。从1994年起，冬季奥
还会和夏季奥运会分并，相隔2年交蔡举行。奥林匹克运动会最早起源於古希腊，是当时
各城邦之间的公开较量，因为皋闪地在奥林匹亚而得名。信幸基晴教的办天皇帝狄奥多西
一世以奥林匹克运动会崇拜耶稣以外神只为由，禁止奥运项技，於是奥运在举办超过
1,000年后於4世纪末停办，奥运这次停办持续了1,503年，直到19世纪未才由和后人发现
遗距。之后，法辆的顾拜旦男超皮耶,德:古柏坦创立了有真正奥运精神的现代奥林匹克运
动会，自1896年开始每4年学办一次，更确立了会期不超过18日的传统。现代奥运会只
在两次世界大战期间合共中断过5次(分别是191 6年夏季奥运会、1940年夏季奥运会
D、1940年冬季奥运会(1]、1944年夏季奥运会和1944年冬季奥运会) [广 ]]，以及在
2020年因全球防疫延期过一次 (2020年夏季奥运会2][广习) 。
```

Compare this with the text copied and pasted from Wikipedia:

```r
text2 <- readLines(system.file("examples", "chinese.txt",
  package = "cpp11tesseract"))

cat(text2)
```

```
å¥§æž—åŒ¹å…‹é‹å‹•æœƒï¼ˆå¸Œè‡˜èªžï¼šÎŸÎ»Ï…Î¼Ï€Î¹Î±ÎºÎ¿Î¯ Î‘Î³ÏŽÎ½ÎµÏ‚ï¼›æ³•èªž Jeux olympiquesï¼›è‹±èªžï¼šOlympic Gamesç°¡
ç¨±å¥§é‹æœƒã€å¥§é‹ï¼Œæ˜¯ä¸–ç•Œæœ€é«˜ç‰ç´šçš„åœ‹éš›ç¶œåˆé«”è‚²è³½äº‹ï¼Œç”±åœ‹éš› å¥§æž—åŒ¹å…‹å§”å“¡æœƒä¸»è¾¦ï¼Œæ¯4å¹´èˆ‰è¡Œä¸€æ¬¡ã€‚å†¬
å£ç«¶æŠ€é …ç›®å‰µç«‹å†¬å£å¥§æž—åŒ¹å…‹é‹å‹•æœƒå¾Œï¼Œä¹‹å‰ çš„å¥§æž—åŒ¹å…‹é‹å‹•æœƒå‰‡æ˜¯åˆç¨±ç‚ºã€Œå¤å£å¥§æž—åŒ¹å…‹é‹å‹•æœƒã€ä»¥ç¤ºå€
åˆ†ã€‚å¾ž1994å¹´èµ·ï¼Œå†¬å£å¥§ é‹æœƒå’Œå¤å£å¥§é‹æœƒåˆ†é–‹ï¼Œç›¸éš”2å¹´äº¤æ›¿èˆ‰è¡Œã€‚å¥¥æž—åŒ¹å…‹é‹å‹•æœƒæœ€æ—©èµ·æºæ–¼å¤å¸Œè…Šï¼Œ
æ˜¯ç•¶æ™‚ å„åŸŽé‚¦ä¹‹é–“çš„å…¬é–‹è¼ƒé‡ï¼Œå› ç‚ºèˆ‰è¾¦åœ°åœ¨å¥§æž—åŒ¹äºšè€Œå¾—åã€‚ä¿¡å¥‰åŸºç£æ•™çš„ç¾…é¦¬çš‡å¸ç‹„å¥§å¤šè¥¿ ä¸€ä¸–ä»¥å¥§
æž—åŒ¹å…‹é‹å‹•æœƒå´‡æ‹œè€¶ç©Œä»¥å¤–ç¥žè¡¹ç‚ºç”±ï¼Œç¦æ¢å¥§é‹ç«¶æŠ€ï¼Œæ–¼æ˜¯å¥§é‹åœ¨èˆ‰è¾¦è¶…éŽ 1,000å¹´å¾Œæ–¼4ä¸–ç´€æœ«åœè¾¦ï¼Œå¥§
é‹é€™æ¬¡åœè¾¦æŒçºŒäº†1,503å¹´ï¼Œç›´åˆ°19ä¸–çºªæœ«æ‰ç”±å¾Œäººç™¼ç¾ éºè¹Ÿã€‚ä¹‹å¾Œï¼Œæ³•åœ‹çš„é¡¾æ‹œæ—¦ç”·çˆµçš®è€¶Â·å¾·Â·å¤æŸå¦
å‰µç«‹äº†æœ‰çœŸæ£å¥§é‹ç²¾ç¥žçš„ç¾ä»£å¥§æž—åŒ¹å…‹é‹ å‹•æœƒï¼Œè‡ª1896å¹´é–‹å§‹æ¯4å¹´èˆ‰è¾¦ä¸€æ¬¡ï¼Œæ›´ç¢ºç«‹äº†æœƒæœŸä¸è¶…éŽ18æ—¥
çš„å‚³çµ±ã€‚ç¾ä»£å¥§é‹æœƒåª åœ¨å…©æ¬¡ä¸–ç•Œå¤§æˆ°æœŸé–“åˆå…±ä¸æ–·éŽ5æ¬¡ï¼ˆåˆ†åˆ¥æ˜¯1916å¹´å¤å£å¥§é‹æœƒã€1940å¹´å¤å£å¥§é‹
æœƒ [1]ã€1940å¹´å†¬å£å¥§é‹æœƒ[1]ã€1944å¹´å¤å£å¥§é‹æœƒå’Œ1944å¹´å†¬å£å¥§é‹æœƒï¼‰[è¨» 1]ï¼Œä»¥åŠåœ¨ 2020å¹´å› 
å…¨çƒé˜²ç–«å»¶æœŸéŽä¸€æ¬¡ï¼ˆ2020å¹´å¤å£å¥§é‹æœƒ[2][è¨» 2]ï¼‰ã€‚
```
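
A side-by-side reading like the one above can be complemented with a number. One common choice, sketched here as an assumption and not as something the package provides, is the character error rate: the edit distance between the OCR output and a reference text, divided by the length of the reference. The example uses base R's `adist()` on a line from the English example. The reference string is an assumed transcription of the image, and `cer()` is a hypothetical helper.

```r
# Character error rate: edit distance divided by the reference length
cer <- function(ocr, truth) {
  as.numeric(adist(ocr, truth)) / nchar(truth)
}

ocr_line <- "Complete Works oF OSCAR WILDE"   # joined from the OCR output
truth_line <- "Complete Works OF OSCAR WILDE" # assumed reference text

# Number of single-character edits between the two strings
as.numeric(adist(ocr_line, truth_line))

# Share of reference characters that would need editing
cer(ocr_line, truth_line)
```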

## Read from PDF files

If your images are stored in PDF files, they first need to be converted to a
proper image format. We can do this in R using the `pdf_convert()` function from
the `cpp11poppler` package. Use a high DPI to preserve the quality of the image.

```r
library(cpp11poppler)
file <- system.file("examples", "bondargentina.pdf", package = "cpp11tesseract")
pngfile <- pdf_convert(file, dpi = 600)
text <- ocr(pngfile)
cat(text)
```

```
_ LISTING PARTICULARS CONSISTING OF
Pricing Supplement, Supplemental Information Memorandum and
Supplemental Information Memorandum Addendum dated May 26, 1998
AND
Information Memorandum Addendum dated October 17, 1997 |
ee .
THE REPUBLIC OF ARGENTINA
Euro 750,000,000 Interest Strip Notes due 2028
issued under its
U.S.$11,000,000,000
EURO MEDIUM-TERM NOTE PROGRAMME |
Issue Price: 100.835 per cent.
Series No.: 61
Tranche No.: 01 :
Dealer
ABN AMRO
The date of these Listing Particulars is May 26, 1998.
The Republic has warranted to the Dealer that, inter alia, these Listing
Particulars are true and accurate in all material respects, do not contain any
untrue statement of material fact nor omit to state any material fact known to
the Republic necessary to make statements herein not misleading and all
reasonable enquiries have been made to ascertain such facts and to verify the
accuracy of all such statements. The Republic accepts responsibility
accordingly.

No person has been authorised to give any information or to make any
representations, other than those contained in the Listing Particulars, in
connection with the offering or sale of the Notes and, if given or made, such
information or representations must not be relied upon as having been authorised
by the Republic or the Dealer. Neither the delivery of these Listing Particulars
nor any sale made hereunder shall, under any circumstances, constitute a
representation that there has been no change in the financial position or
prospects of the Republic since the date hereof or the information contained
herein is correct as of any time subsequent to the date hereof.
```
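
To get a feeling for what the `dpi` argument changes, the arithmetic below converts a page size in inches into pixel dimensions. It is only a sketch: the page is assumed to be US Letter (8.5 by 11 inches), which may not match the example PDF, and `page_pixels()` is a hypothetical helper.

```r
# Pixel dimensions of a rendered page for a given resolution
page_pixels <- function(width_in, height_in, dpi) {
  round(c(width = width_in, height = height_in) * dpi)
}

# Assumed US Letter page at a low and at a high resolution
page_pixels(8.5, 11, dpi = 72)
page_pixels(8.5, 11, dpi = 600)
```

At 600 DPI each character is drawn with roughly eight times as many pixels per side as at 72 DPI, which gives the OCR engine more detail to work with at the cost of larger images.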

## Tesseract Control Parameters

Tesseract supports hundreds of "control parameters" which alter the OCR engine.
Use `tesseract_params()` to list all parameters with their default value and a
brief description. It also has a handy `filter` argument to quickly find
parameters that match a particular string.

```r
# List all parameters with *colour* in name or description
tesseract_params("colour")
```

```
# A tibble: 2 × 3
  param                      default desc                    
* <chr>                      <chr>   <chr>                   
1 editor_image_word_bb_color 7       Word bounding box colour
2 editor_image_blob_bb_color 4       Blob bounding box colour
```
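
The same kind of search can be reproduced on any table of parameters with base R. The sketch below is not the package's implementation: it assumes a case-sensitive, literal match on the name or the description, and `find_params()` is a hypothetical helper. The first two rows are copied from the output above, while the default and description of the third row are illustrative placeholders.

```r
params <- data.frame(
  param = c(
    "editor_image_word_bb_color",
    "editor_image_blob_bb_color",
    "tessedit_char_whitelist"
  ),
  default = c("7", "4", ""),
  desc = c(
    "Word bounding box colour",
    "Blob bounding box colour",
    "Whitelist of chars to recognize" # illustrative placeholder
  ),
  stringsAsFactors = FALSE
)

# Keep rows where the pattern appears in the name or in the description
find_params <- function(params, pattern) {
  hit <- grepl(pattern, params$param, fixed = TRUE) |
    grepl(pattern, params$desc, fixed = TRUE)
  params[hit, , drop = FALSE]
}

find_params(params, "colour")$param
find_params(params, "whitelist")$param
```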

Note that some control parameters changed between Tesseract engine versions 3
and 4, so check which version you have installed:

```r
tesseract_info()[["version"]]
```

```
[1] "5.4.1"
```

### Whitelist / Blacklist characters

One powerful parameter is `tessedit_char_whitelist`, which restricts the output
to a limited set of characters. This can be useful when reading, for example,
numbers such as a bank account, zip code, or gas meter.

The whitelist parameter works for all versions of Tesseract engine 3 and also
engine versions 4.1 and higher, but unfortunately it did not work in Tesseract
4.0.
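
The version rule above can be written down as a small check. This is a sketch under the assumption that the version is a plain dotted string such as the one returned in the `version` element of `tesseract_info()`, and `whitelist_supported()` is a hypothetical helper.

```r
# TRUE for engine 3.x and for 4.1 or later, FALSE for the 4.0 series
whitelist_supported <- function(version) {
  v <- numeric_version(version)
  v < "4.0" || v >= "4.1"
}

whitelist_supported("3.05.02")
whitelist_supported("4.0.0")
whitelist_supported("5.4.1")
```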

![A receipt in English with food and toys for Mr. Duke](../man/figures/receipt.jpg)

```r
file <- system.file("examples", "receipt.jpg", package = "cpp11tesseract")
numbers <- tesseract(options = list(tessedit_char_whitelist = "-$.0123456789"))
cat(ocr(file, engine = numbers))
```

```
0

00068354712539

01.8$31.998
25 -$8.00

00084019961505

03966$44.99

00003558543582

8 $8.93

$

00000002000414

$0.50

$$60$10 -$10.00

$ $68.47

$8.84

$77.31
```

To test if this actually works, look at the output without the whitelist:

```r
cat(ocr(file, engine = eng))
```

```
DOG

000683547 12539

OPEN FARM DOG AG SALMON 1.8KG $31.99 HST
Item discount 25% -$8.00 HST

00084019961505

VE FO GOOG BF NIB 396G LRG KONG $44.99 HST

ACCESSORIES

00003558543582

KONG BRUSH $8.93 HST

STORE USE ITEMS

000000020004 14

GPF CLOTH BAG LARGE $0.50

FPS SPEND $60 SAVE $10 -$10.00

SUB TOTAL $68.47

HST $8.84

TOTAL $77.31
```
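
Once the text is extracted, a regular expression can pull the amounts out of each line. The sketch below runs on four lines copied from the output above, so it does not need the engine. The pattern assumes that amounts always have a dollar sign and two decimals, which holds for this receipt but is not guaranteed in general.

```r
# Lines copied from the OCR output above
lines <- c(
  "Item discount 25% -$8.00 HST",
  "SUB TOTAL $68.47",
  "HST $8.84",
  "TOTAL $77.31"
)

# First dollar amount on each line, with an optional leading minus sign
matches <- regmatches(lines, regexpr("-?\\$[0-9]+\\.[0-9]{2}", lines))
amounts <- as.numeric(gsub("$", "", matches, fixed = TRUE))
amounts

# Sanity check: subtotal plus tax should equal the total
isTRUE(all.equal(amounts[2] + amounts[3], amounts[4]))
```

A check like the last line is a cheap way to spot digits that were misread by the OCR engine.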

This is Mr. Duke:

![Mr. Duke, a dog of the Australian Shepherd kind](../man/figures/mrduke.jpg)

Here is the extracted text:

```r
file <- system.file("examples", "mrduke.jpg", package = "cpp11tesseract")
text <- ocr(file, engine = eng)
cat(text)
```

```
ee
oe e
Ze. <n BR ee
Mr. Duke, 4 years old (2024) 2
```

## Best versus Fast models

To improve the OCR results, Tesseract offers two variants of each model. The
`tesseract_download()` function can download the 'best' (but slower) model,
which increases the accuracy. The 'fast' (but less accurate) model is the
default.

Compare the result with the previous Chinese example:

```r
file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract")

# download the best model (vertical script download is to avoid a warning)
dir <- tempdir()
tesseract_download("chi_sim_vert", model = "best", datapath = dir)
tesseract_download("chi_sim", model = "best", datapath = dir)
text <- ocr(file, engine = tesseract("chi_sim", datapath = dir))

# print the result obtained with the best model
cat(text)
```

```
奥林匹克运动会 (希腊语;: OMXohrtakot AYWvsc; 法语: Jeux olympiques; 英语:
Olympic Games) ，简称奥运会、奥运，是世界最高等级的国际综合体育蹇事，由国际
奥林匹克委员会主办，每4年举行一次。冬季训技项目创立冬季奥林匹克运动会后，之前
的奥林匹克运动会则是又称为、夏季奥林匹克还动会」 以示区分。从1994年起，冬季奥
还会和夏季奥运会分并，相隔2年交蔡举行。奥林匹克运动会最早起源於古希腊，是当时
各城邦之间的公开较量，因为皋闪地在奥林匹亚而得名。信幸基晴教的办天皇帝狄奥多西
一世以奥林匹克运动会崇拜耶稣以外神只为由，禁止奥运项技，於是奥运在举办超过
1,000年后於4世纪末停办，奥运这次停办持续了1,503年，直到19世纪未才由和后人发现
遗距。之后，法辆的顾拜旦男超皮耶,德:古柏坦创立了有真正奥运精神的现代奥林匹克运
动会，自1896年开始每4年学办一次，更确立了会期不超过18日的传统。现代奥运会只
在两次世界大战期间合共中断过5次(分别是191 6年夏季奥运会、1940年夏季奥运会
D、1940年冬季奥运会(1]、1944年夏季奥运会和1944年冬季奥运会) [广 ]]，以及在
2020年因全球防疫延期过一次 (2020年夏季奥运会2][广习) 。
```

## Contributed models

The `tesseract_contributed_download()` function can download contributed models.
For example, the `grc_hist` model is useful for Polytonic Greek. Here is an
example from Sophocles' Ajax
(source: [Ajax Multi-Commentary](https://github.com/AjaxMultiCommentary)):

![polytonicgreek](../man/figures/polytonicgreek.png)

```r
file <- system.file("examples", "polytonicgreek.png", package = "cpp11tesseract")

# download the best models
dir <- tempdir()
tesseract_download("grc", datapath = dir, model = "best")
tesseract_contributed_download("grc_hist", datapath = dir, model = "best")

# compare the results: grc (text1) vs grc_hist (text2)
text1 <- ocr(file, engine = tesseract("grc", datapath = dir))
text2 <- ocr(file, engine = tesseract("grc_hist", datapath = dir))

cat(text1)
cat(text2)
```

```
232» Α 2 [ 2 Α ὁ ’, ᾿»
ἠὲ καὶ ἀμφαδίην, ἐπεὶ οὐ τινά δείδιμεν ἔμπης"
οὐ γάρ τίς μὲ βίῃ γε ἑκὼν ἀέχοντα δίηται,

2 ᾿ 2 ’ 2 " 2 2 2 Α ».. ΄ 2 Ὁ“
οὐδέ τε ἰδρείῃ, ἐπεὶ οὐδ΄ ἐμὲ νήϊδά γ᾽ οὕτως
ἔλπομαι ἐν Σαλαμῖνι γενέσϑαι τε τραφέμεν τε.
```

```
20 Ε 2 Σ. ΣΣ ν
ἠὲ καὶ ἀμφαδίην, ἐπεὶ οὔ τινα δείδιμεν ἔμπης·
οὐ γάρ τίς με βίῃ γε ἑκὼν ἀέκοντα δίηται,

2 . 2 ’ 2 ο 2 2 Σ Α ΣΣΣ . 2 ῳ.
οὐδέ τι ἰδρείῃ, ἐπεὶ οὐδ’ ἐμὲ νήϊδά γ’ οὕτως
ἔλπομαι ἐν Σαλαμῖνι γενέσθαι τε τραφέμεν τε.
```
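
To see where the two models disagree without reading every accent, the outputs can be compared word by word. The sketch below uses one verse line from each output, typed in as literals, and assumes that both lines split into the same number of words. With outputs of different lengths an alignment step would be needed first.

```r
# The same verse line as read by each model
grc <- "οὐ γάρ τίς μὲ βίῃ γε ἑκὼν ἀέχοντα δίηται,"
grc_hist <- "οὐ γάρ τίς με βίῃ γε ἑκὼν ἀέκοντα δίηται,"

words_grc <- strsplit(grc, " ", fixed = TRUE)[[1]]
words_hist <- strsplit(grc_hist, " ", fixed = TRUE)[[1]]

# Positions where the two models produced different words
differs <- which(words_grc != words_hist)
differs

# The disagreeing words side by side
data.frame(grc = words_grc[differs], grc_hist = words_hist[differs])
```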

## Comparison with Amazon Textract

*Note: Amazon and Textract are trademarks of Amazon.com, Inc.*

Textract [documentation](https://aws.amazon.com/blogs/opensource/using-r-with-amazon-web-services-for-document-analysis/) uses page three of the [January 1966 report](https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/greenbook-data/greensheets/greensheets-1966.zip?la=en&hash=8B8ABA92C3F0B2939328D47B6230F3A3) from Philadelphia Fed's [Tealbook](https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/philadelphia-data-set)
(formerly Greenbook).

![tealbook](../man/figures/tealbook.png)

Here is the first element of the list returned by Textract:

```r
# List of 13
# $ BlockType      : chr "TABLE"
# $ Confidence     : num 100
# $ Text           : chr(0)
# $ RowIndex       : int(0)
# $ ColumnIndex    : int(0)
# $ RowSpan        : int(0)
# $ ColumnSpan     : int(0)
# $ Geometry       :List of 2
# .. <not shown>
# $ Id             : chr "c6841638-d3e0-414b-af12-b94ed34aac8a"
# $ Relationships  :List of 1
# ..$ :List of 2
# .. ..$ Type: chr "CHILD"
# .. ..$ Ids : chr [1:256] "e1866e80-0ef0-4bdd-a6fd-9508bb833c03" ...
# $ EntityTypes    : list()
# $ SelectionStatus: chr(0)
# $ Page           : int 3
```

Here is Tesseract's output:

```r
file <- system.file("examples", "tealbook.png", package = "cpp11tesseract")
text <- ocr(file)

cat(text)
```

```
Nemes mm a a ee en e-em n an ae ee
Year SSC—~SSESSC~*«C
1965 IV I
Esti- Esti- Pro-
1964 __mated yi/ rr/ rrr! mated _ jected
Gross National Product 628.7 675.7 657.6 668.8 681.5 695.0 707.0
Personal consumption expenditures 398.9 428.6 416.9 424.5 432.5 440.5 447.1
Durable goods 58.7 65.0 64.6 63.5 65.4 66.4 66.6
Nondurable goods 177.5 188.8 182.8 187.9 190.5 194.0 197.6
Services 162.6 174.9 169.5 173,1 176.7 180.1 182.9
Gross private domestic investment 92.9 104.9 103.4 102.8 106.2 107.0 109.1
Residential construction 27.5 27.7 27.7 28.0 27.7 27.3 27.5
Business fixed investment 60.5 69.8 66.9 68.4 70.9 73.1 75.1
Change in business inventories 4.8 7.4 8.8 6.4 7.6 6.6 6.5
Nonfarm 5.4 7.1 9.2 6.6 7.0 5.4 5.5
Net exports 8.6 7.3 6.0 8.0 7.4 7.8 8.1
Gov. purchases of goods & services 128.4 135,0 131.3 133.5 135.4 139.7 142.7
Federal 65.3 66.6 64.9 65.7 66.5 69.4 70.7
Defense 49.9 49.9 48.8 49.2 49.8 51.8 52.7
Other 15.4 16.7 16.1 16.5 16.7 17.6 18.0
State and local 63.1 68,4 66.4 67.8 68.9 70.3 72.0
Gross National Product in Constant 577.6 609.3 597.7 603.5 613.0 622.4 630.1
(1958) Dollars
Personal income 495.0 530.5 516.2 524.7 536.0 544.9 552.0
Wages and salaries 333.5 357.3 348.9 353.6 359.0 367.5 374.1
Farm income 12.0 14.2 12.0 14.5 15.0 15.3 15.3
Personal contributions for
social insurance (deduction) 12.4 13.2 12.9 13.0 13.3 13.6 16.6
Disposable personal income 435.8 465.0 451.4 458.5 471.2 478.7 485.1
Personal saving 26.3 24.6 23.3 22.4 26.8 26.0 25.5
Saving rate (per cent) 6.0 5.3 5.2 4.9 5.7 5.4 5.3
Total labor force (millions) 77.0 78.3 77.7. 78.2 78.5 78.9 79.6
Armed forces " 2.7 2.7 2.7 2.7 2.7 2.8 2.9
Civilian labor force " 74.2 75.6 75.0 75.5 75.8 76,1 76,7
Employed " 70.4 72.1 71.3 71.9 72.4 72.9 73.6
Unemployed " 3.9 3.5 3.6 3.6 3.4 3.2 3.1
Unemployment rate (per cent) 5.2 4.6 4.8 4.7 4.4 4.2 4.0
```

One way to organize the output is to split the text before the first digit on
each line.

```r
# Split the OCR output into lines and drop the five header lines
text <- strsplit(text, "\n")[[1]]
text <- text[6:length(text)]

df <- NULL

for (i in seq_along(text)) {
  # The label sits before the first digit, the numbers after it
  firstdigit <- regexpr("[0-9]", text[i])[1]

  variable <- trimws(substr(text[i], 1, firstdigit - 1))

  values <- strsplit(substr(text[i], firstdigit, nchar(text[i])), " ")[[1]]
  # OCR sometimes reads a decimal point as a comma or adds a trailing dot
  values <- trimws(gsub(",", ".", values))
  values <- suppressWarnings(as.numeric(gsub("\\.$", "", values)))

  # Skip lines without any numeric value (e.g. wrapped labels)
  if (length(values[!is.na(values)]) < 1) {
    next
  }

  res <- c(variable, values)

  names(res) <- c(
    "variable", "y1964", "y1965est", "y1965q1",
    "y1965q2", "y1965q3", "y1965q4est", "y1966q1pro"
  )

  # rbind() ignores the initial NULL, so the first row needs no special case
  df <- rbind(df, as.data.frame(t(res)))
}

head(df)
```

```
                           variable y1964 y1965est y1965q1 y1965q2 y1965q3
1            Gross National Product 628.7    675.7   657.6   668.8   681.5
2 Personal consumption expenditures 398.9    428.6   416.9   424.5   432.5
3                     Durable goods  58.7       65    64.6    63.5    65.4
4                  Nondurable goods 177.5    188.8   182.8   187.9   190.5
5                          Services 162.6    174.9   169.5   173.1   176.7
6 Gross private domestic investment  92.9    104.9   103.4   102.8   106.2
  y1965q4est y1966q1pro
1        695        707
2      440.5      447.1
3       66.4       66.6
4        194      197.6
5      180.1      182.9
6        107      109.1
```

The result is not perfect (e.g. I still need to change "Gross National Product
in Constant" to add the "(1958) Dollars"), but neither is Textract's, and
Textract's output requires a more complex loop to organize the data. Certainly,
this can be simplified by using the Tidyverse.
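
Even without the Tidyverse, base R's vectorized string functions can shorten the loop. The following is a minimal sketch on two lines copied from the OCR output above. It assumes that every line contains a label followed by exactly seven numbers, so lines such as wrapped labels would have to be filtered out beforehand.

```r
# Two lines copied from the OCR output above
lines <- c(
  "Gross National Product 628.7 675.7 657.6 668.8 681.5 695.0 707.0",
  "Services 162.6 174.9 169.5 173,1 176.7 180.1 182.9"
)

# Position of the first digit on every line at once
pos <- regexpr("[0-9]", lines)

# Label before the first digit, numbers from the first digit onwards
variable <- trimws(substr(lines, 1, pos - 1))
numbers <- gsub(",", ".", substring(lines, pos), fixed = TRUE)
values <- do.call(rbind, lapply(strsplit(numbers, " +"), as.numeric))

df <- data.frame(variable, values, stringsAsFactors = FALSE)
names(df)[-1] <- c(
  "y1964", "y1965est", "y1965q1", "y1965q2",
  "y1965q3", "y1965q4est", "y1966q1pro"
)

df
```

Unlike the loop, this variant keeps the values numeric instead of converting them to character when they are combined with the label.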