--- title: "Using the Tesseract OCR engine in R" date: "`r Sys.Date()`" output: html_document: toc: true toc_depth: 2 toc_float: true fig_caption: false vignette: > %\VignetteIndexEntry{Using the Tesseract OCR engine in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The tesseract package provides R bindings [Tesseract](https://github.com/tesseract-ocr/tesseract): a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results. Keep in mind that OCR (pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect and the accuracy rapidly decreases with the quality of the input image. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image. ## Extract Text from Images OCR is the process of finding and recognizing text inside images, for example from a screenshot, scanned paper. The image below has some example text:  The `ocr()` function extracts text from an image file. After indicating the engine for the language, it will return the text found in the image: ```r library(cpp11tesseract) file <- system.file("examples", "wilde.png", package = "cpp11tesseract") eng <- tesseract("eng") text <- ocr(file, engine = eng) cat(text) ``` ``` Complete Works oF OSCAR WILDE EDITED BY ROBERT ROSS MISCELLANIES ‘AUTHORIZED EDITION THE WYMAN-FOGG COMPANY BOSTON :: MASSACHUSETTS ``` The `ocr_data()` function returns all words in the image along with a bounding box and confidence rate. ```r results <- ocr_data(file, engine = eng) results ``` ``` # A tibble: 18 × 4 word confidence bbox stringsAsFactors <chr> <dbl> <chr> <lgl> 1 Complete 96.5 159,48,304,79 FALSE 2 Works 96.5 320,49,422,74 FALSE 3 oF 66.3 281,102,300,111 FALSE 4 OSCAR 94.9 92,131,274,161 FALSE 5 WILDE 96.6 308,132,490,167 FALSE 6 EDITED 95.2 248,187,303,197 FALSE 7 BY 96.6 314,187,334,197 FALSE 8 ROBERT 96.1 207,212,302,227 FALSE 9 ROSS 96.1 318,212,373,227 FALSE 10 MISCELLANIES 90.8 195,298,389,316 FALSE 11 ‘AUTHORIZED 82.2 200,504,306,515 FALSE 12 EDITION 96.8 315,503,382,514 FALSE 13 THE 93.2 144,664,184,677 FALSE 14 WYMAN-FOGG 90.3 195,663,331,676 FALSE 15 COMPANY 95.8 342,662,438,675 FALSE 16 BOSTON 66.6 144,693,218,706 FALSE 17 :: 81.4 246,697,255,705 FALSE 18 MASSACHUSETTS 90.9 279,691,438,704 FALSE ``` ## Language Data The tesseract OCR engine uses language-specific training data in the recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does. Therefore the most accurate results will be obtained when using training data in the correct language. Use `tesseract_info()` to list the languages that you currently have installed. ```r tesseract_info() ``` ``` $datapath [1] "/usr/share/tesseract-ocr/5/tessdata/" $available [1] "chi_sim" "eng" "osd" $version [1] "5.4.1" $configs [1] "alto" "ambigs.train" "api_config" "bigram" [5] "box.train" "box.train.stderr" "digits" "get.images" [9] "hocr" "inter" "kannada" "linebox" [13] "logfile" "lstm.train" "lstmbox" "lstmdebug" [17] "makebox" "page" "pdf" "quiet" [21] "rebox" "strokewidth" "tsv" "txt" [25] "unlv" "wordstrbox" ``` By default the R package only includes English training data. Windows and Mac users can install additional training data using `tesseract_download()`. Let's OCR a screenshot from Wikipedia in Simplified Chinese.  ```r # Download once dir <- tempdir() tesseract_download("chi_sim", model = "fast", datapath = dir) ``` ``` Downloaded: 2.35 MB (100%) [1] "/tmp/RtmpfeKjPP/chi_sim.traineddata" ``` ```r # Load the dictionary file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract") text <- ocr(file, engine = tesseract("chi_sim", datapath = dir)) cat(text) ``` ``` 奥林匹克è¿åŠ¨ä¼š (希腊è¯;: OMXohrtakot AYWvsc; 法è¯: Jeux olympiques; 英è¯: Olympic Games) ,简称奥è¿ä¼šã€å¥¥è¿ï¼Œæ˜¯ä¸–界最高ç‰çº§çš„国际综åˆä½“育蹇事,由国际 奥林匹克委员会主办,æ¯4年举行一次。冬å£è®æŠ€é¡¹ç›®åˆ›ç«‹å†¬å£å¥¥æž—匹克è¿åŠ¨ä¼šåŽï¼Œä¹‹å‰ 的奥林匹克è¿åŠ¨ä¼šåˆ™æ˜¯åˆç§°ä¸ºã€å¤å£å¥¥æž—匹克还动会〠以示区分。从1994年起,冬å£å¥¥ 还会和å¤å£å¥¥è¿ä¼šåˆ†å¹¶ï¼Œç›¸éš”2年交蔡举行。奥林匹克è¿åŠ¨ä¼šæœ€æ—©èµ·æºæ–¼å¤å¸Œè…Šï¼Œæ˜¯å½“æ—¶ å„城邦之间的公开较é‡ï¼Œå› 为皋闪地在奥林匹亚而得å。信幸基晴教的办天皇å¸ç‹„奥多西 一世以奥林匹克è¿åŠ¨ä¼šå´‡æ‹œè€¶ç¨£ä»¥å¤–ç¥žåªä¸ºç”±ï¼Œç¦æ¢å¥¥è¿é¡¹æŠ€ï¼Œæ–¼æ˜¯å¥¥è¿åœ¨ä¸¾åŠžè¶…è¿‡ 1,000å¹´åŽæ–¼4世纪末åœåŠžï¼Œå¥¥è¿è¿™æ¬¡åœåŠžæŒç»äº†1,503年,直到19世纪未æ‰ç”±å’ŒåŽäººå‘现 é—è·ã€‚之åŽï¼Œæ³•辆的顾拜旦男超皮耶,å¾·:夿Ÿå¦åˆ›ç«‹äº†æœ‰çœŸæ£å¥¥è¿ç²¾ç¥žçš„çŽ°ä»£å¥¥æž—åŒ¹å…‹è¿ åŠ¨ä¼šï¼Œè‡ª1896年开始æ¯4å¹´å¦åŠžä¸€æ¬¡ï¼Œæ›´ç¡®ç«‹äº†ä¼šæœŸä¸è¶…过18æ—¥çš„ä¼ ç»Ÿã€‚çŽ°ä»£å¥¥è¿ä¼šåª 在两次世界大战期间åˆå…±ä¸æ–过5次(分别是191 6å¹´å¤å£å¥¥è¿ä¼šã€1940å¹´å¤å£å¥¥è¿ä¼š Dã€1940年冬å£å¥¥è¿ä¼š(1]ã€1944å¹´å¤å£å¥¥è¿ä¼šå’Œ1944年冬å£å¥¥è¿ä¼š) [广 ]],以åŠåœ¨ 2020å¹´å› å…¨çƒé˜²ç–«å»¶æœŸè¿‡ä¸€æ¬¡ (2020å¹´å¤å£å¥¥è¿ä¼š2][å¹¿ä¹ ) 。 ``` Compare with the copy and paste from the Wikipedia. ```r text2 <- readLines(system.file("examples", "chinese.txt", package = "cpp11tesseract")) cat(text2) ``` ``` 奧林匹克é‹å‹•會(希臘語:Ολυμπιακοί Αγώνες;法語 Jeux olympiques;英語:Olympic Gamesç°¡ ç¨±å¥§é‹æœƒã€å¥§é‹ï¼Œæ˜¯ä¸–界最高ç‰ç´šçš„國際綜åˆé«”育賽事,由國際 奧林匹克委員會主辦,æ¯4年舉行一次。冬 å£ç«¶æŠ€é …目創立冬å£å¥§æž—匹克é‹å‹•æœƒå¾Œï¼Œä¹‹å‰ çš„å¥§æž—åŒ¹å…‹é‹å‹•會則是åˆç¨±ç‚ºã€Œå¤å£å¥§æž—匹克é‹å‹•會ã€ä»¥ç¤ºå€ 分。從1994年起,冬å£å¥§ 鋿œƒå’Œå¤å£å¥§é‹æœƒåˆ†é–‹ï¼Œç›¸éš”2年交替舉行。奥林匹克é‹å‹•æœƒæœ€æ—©èµ·æºæ–¼å¤å¸Œè…Šï¼Œ 是當時 å„城邦之間的公開較é‡ï¼Œå› 為舉辦地在奧林匹亚而得åã€‚ä¿¡å¥‰åŸºç£æ•™çš„羅馬皇å¸ç‹„奧多西 一世以奧 林匹克é‹å‹•æœƒå´‡æ‹œè€¶ç©Œä»¥å¤–ç¥žè¡¹ç‚ºç”±ï¼Œç¦æ¢å¥§é‹ç«¶æŠ€ï¼Œæ–¼æ˜¯å¥§é‹åœ¨èˆ‰è¾¦è¶…éŽ 1,000年後於4世紀末åœè¾¦ï¼Œå¥§ é‹é€™æ¬¡åœè¾¦æŒçºŒäº†1,503年,直到19世纪末æ‰ç”±å¾Œäººç™¼ç¾ éºè¹Ÿã€‚ä¹‹å¾Œï¼Œæ³•åœ‹çš„é¡¾æ‹œæ—¦ç”·çˆµçš®è€¶Â·å¾·Â·å¤æŸå¦ 創立了有真æ£å¥§é‹ç²¾ç¥žçš„ç¾ä»£å¥§æž—åŒ¹å…‹é‹ å‹•æœƒï¼Œè‡ª1896å¹´é–‹å§‹æ¯4年舉辦一次,更確立了會期ä¸è¶…éŽ18æ—¥ 的傳統。ç¾ä»£å¥§é‹æœƒåª 在兩次世界大戰期間åˆå…±ä¸æ–·éŽ5次(分別是1916å¹´å¤å£å¥§é‹æœƒã€1940å¹´å¤å£å¥§é‹ 會 [1]ã€1940年冬å£å¥§é‹æœƒ[1]ã€1944å¹´å¤å£å¥§é‹æœƒå’Œ1944年冬å£å¥§é‹æœƒï¼‰[註 1],以åŠåœ¨ 2020å¹´å› å…¨çƒé˜²ç–«å»¶æœŸéŽä¸€æ¬¡ï¼ˆ2020å¹´å¤å£å¥§é‹æœƒ[2][註 2])。 ``` ## Read from PDF files If your images are stored in PDF files they first need to be converted to a proper image format. We can do this in R using the `pdf_convert` function from the `cpp11poppler` package. Use a high DPI to keep quality of the image. ```r library(cpp11poppler) file <- system.file("examples", "bondargentina.pdf", package = "cpp11tesseract") pngfile <- pdf_convert(file, dpi = 600) text <- ocr(pngfile) cat(text) ``` ```r _ LISTING PARTICULARS CONSISTING OF Pricing Supplement, Supplemental Information Memorandum and Supplemental Information Memorandum Addendum dated May 26, 1998 AND Information Memorandum Addendum dated October 17, 1997 | ee . THE REPUBLIC OF ARGENTINA Euro 750,000,000 Interest Strip Notes due 2028 issued under its U.S.$11,000,000,000 EURO MEDIUM-TERM NOTE PROGRAMME | Issue Price: 100.835 per cent. Series No.: 61 Tranche No.: 01 : Dealer ABN AMRO The date of these Listing Particulars is May 26, 1998. The Republic has warranted to the Dealer that, inter alia, these Listing Particulars are true and accurate in all material respects, do not contain any untrue statement of material fact nor omit to state any material fact known to the Republic necessary to make statements herein not misleading and all reasonable enquiries have been made to ascertain such facts and to verify the accuracy of all such statements. The Republic accepts responsibility accordingly. No person has been authorised to give any information or to make any representations, other than those contained in the Listing Particulars, in connection with the offering or sale of the Notes and, if given or made, such information or representations must not be relied upon as having been authorised by the Republic or the Dealer. Neither the delivery of these Listing Particulars nor any sale made hereunder shall, under any circumstances, constitute a representation that there has been no change in the financial position or prospects of the Republic since the date hereof or the information contained herein is correct as of any time subsequent to the date hereof. ``` ## Tesseract Control Parameters Tesseract supports hundreds of "control parameters" which alter the OCR engine Use `tesseract_params()` to list all parameters with their default value and a brief description. It also has a handy `filter` argument to quickly find parameters that match a particular string. ```r # List all parameters with *colour* in name or description tesseract_params("colour") ``` ``` # A tibble: 2 × 3 param default desc * <chr> <chr> <chr> 1 editor_image_word_bb_color 7 Word bounding box colour 2 editor_image_blob_bb_color 4 Blob bounding box colour ``` Do note that some of the control parameters have changed between Tesseract engine 3 and 4. ```r tesseract_info()["version"] ``` ``` [1] "5.4.1" ``` ### Whitelist / Blacklist characters One powerful parameter is `tessedit_char_whitelist` which restricts the output to a limited set of characters. This may be useful for reading for example numbers such as a bank account, zip code, or gas meter. The whitelist parameter works for all versions of Tesseract engine 3 and also engine versions 4.1 and higher, but unfortunately it did not work in Tesseract 4.0.  ```r file <- system.file("examples", "receipt.jpg", package = "cpp11tesseract") numbers <- tesseract(options = list(tessedit_char_whitelist = "-$.0123456789")) cat(ocr(file, engine = numbers)) ``` ``` 0 00068354712539 01.8$31.998 25 -$8.00 00084019961505 03966$44.99 00003558543582 8 $8.93 $ 00000002000414 $0.50 $$60$10 -$10.00 $ $68.47 $8.84 $77.31 ``` To test if this actually works, look at the output without the whitelist: ```r cat(ocr(file, engine = eng)) ``` ``` DOG 000683547 12539 OPEN FARM DOG AG SALMON 1.8KG $31.99 HST Item discount 25% -$8.00 HST 00084019961505 VE FO GOOG BF NIB 396G LRG KONG $44.99 HST ACCESSORIES 00003558543582 KONG BRUSH $8.93 HST STORE USE ITEMS 000000020004 14 GPF CLOTH BAG LARGE $0.50 FPS SPEND $60 SAVE $10 -$10.00 SUB TOTAL $68.47 HST $8.84 TOTAL $77.31 ``` This is Mr. Duke:  Here is the extracted text: ```r file <- system.file("examples", "mrduke.jpg", package = "cpp11tesseract") text <- ocr(file, engine = eng) cat(text) ``` ``` ee oe e Ze. <n BR ee Mr. Duke, 4 years old (2024) 2 ``` ## Best versus Fast models In order to improve the OCR results, Tesseract has two variants of models that can be used. The `tesseract_download()` can download the 'best' (but slower) model, which increases the accuracy. The 'fast' (but less accurate) model is the default. Compare the result with the previous example with Chinese text: ```r file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract") # download the best model (vertical script download is to avoid a warning) dir <- tempdir() tesseract_download("chi_sim_vert", model = "best", datapath = dir) tesseract_download("chi_sim", model = "best", datapath = dir) text <- ocr(file, engine = tesseract("chi_sim", datapath = dir)) # compare the results: fast (text1) vs best (text2) cat(text) ``` ``` 奥林匹克è¿åŠ¨ä¼š (希腊è¯;: OMXohrtakot AYWvsc; 法è¯: Jeux olympiques; 英è¯: Olympic Games) ,简称奥è¿ä¼šã€å¥¥è¿ï¼Œæ˜¯ä¸–界最高ç‰çº§çš„国际综åˆä½“育蹇事,由国际 奥林匹克委员会主办,æ¯4年举行一次。冬å£è®æŠ€é¡¹ç›®åˆ›ç«‹å†¬å£å¥¥æž—匹克è¿åŠ¨ä¼šåŽï¼Œä¹‹å‰ 的奥林匹克è¿åŠ¨ä¼šåˆ™æ˜¯åˆç§°ä¸ºã€å¤å£å¥¥æž—匹克还动会〠以示区分。从1994年起,冬å£å¥¥ 还会和å¤å£å¥¥è¿ä¼šåˆ†å¹¶ï¼Œç›¸éš”2年交蔡举行。奥林匹克è¿åŠ¨ä¼šæœ€æ—©èµ·æºæ–¼å¤å¸Œè…Šï¼Œæ˜¯å½“æ—¶ å„城邦之间的公开较é‡ï¼Œå› 为皋闪地在奥林匹亚而得å。信幸基晴教的办天皇å¸ç‹„奥多西 一世以奥林匹克è¿åŠ¨ä¼šå´‡æ‹œè€¶ç¨£ä»¥å¤–ç¥žåªä¸ºç”±ï¼Œç¦æ¢å¥¥è¿é¡¹æŠ€ï¼Œæ–¼æ˜¯å¥¥è¿åœ¨ä¸¾åŠžè¶…è¿‡ 1,000å¹´åŽæ–¼4世纪末åœåŠžï¼Œå¥¥è¿è¿™æ¬¡åœåŠžæŒç»äº†1,503年,直到19世纪未æ‰ç”±å’ŒåŽäººå‘现 é—è·ã€‚之åŽï¼Œæ³•辆的顾拜旦男超皮耶,å¾·:夿Ÿå¦åˆ›ç«‹äº†æœ‰çœŸæ£å¥¥è¿ç²¾ç¥žçš„çŽ°ä»£å¥¥æž—åŒ¹å…‹è¿ åŠ¨ä¼šï¼Œè‡ª1896年开始æ¯4å¹´å¦åŠžä¸€æ¬¡ï¼Œæ›´ç¡®ç«‹äº†ä¼šæœŸä¸è¶…过18æ—¥çš„ä¼ ç»Ÿã€‚çŽ°ä»£å¥¥è¿ä¼šåª 在两次世界大战期间åˆå…±ä¸æ–过5次(分别是191 6å¹´å¤å£å¥¥è¿ä¼šã€1940å¹´å¤å£å¥¥è¿ä¼š Dã€1940年冬å£å¥¥è¿ä¼š(1]ã€1944å¹´å¤å£å¥¥è¿ä¼šå’Œ1944年冬å£å¥¥è¿ä¼š) [广 ]],以åŠåœ¨ 2020å¹´å› å…¨çƒé˜²ç–«å»¶æœŸè¿‡ä¸€æ¬¡ (2020å¹´å¤å£å¥¥è¿ä¼š2][å¹¿ä¹ ) 。 ``` ## Contributed models The `tesseract_contributed_download()` function can download contributed models. For example, the `grc_hist` model is useful for Polytonic Greek. Here is an example from Sophocles' Ajax (source: [Ajax Multi-Commentary](https://github.com/AjaxMultiCommentary))  ``` file <- system.file("examples", "polytonicgreek.png", package = "cpp11tesseract") # download the best models dir <- tempdir() tesseract_download("grc", datapath = dir, model = "best") tesseract_contributed_download("grc_hist", datapath = dir, model = "best") # compare the results: grc (text1) vs grc_hist (text2) text1 <- ocr(file, engine = tesseract("grc", datapath = dir)) text2 <- ocr(file, engine = tesseract("grc_hist", datapath = dir)) cat(text1) cat(text2) ``` ``` 232» Α 2 [ 2 Α ὠ’, ᾿» á¼ á½² καὶ ἀμφαδίην, á¼Ï€Îµá½¶ οὠτινά δείδιμεν ἔμπης" Î¿á½ Î³Î¬Ï Ï„Î¯Ï‚ μὲ βίῃ γε ἑκὼν á¼€Îχοντα δίηται, 2 ᾿ 2 ’ 2 " 2 2 2 Α ».. ΄ 2 Ὁ“ οá½Î´Î τε ἰδÏείῃ, á¼Ï€Îµá½¶ οá½Î´Î„ á¼Î¼á½² νήϊδά γ᾽ οὕτως ἔλπομαι á¼Î½ Σαλαμῖνι γενÎσϑαι τε Ï„ÏαφÎμεν τε. ``` ``` 20 Ε 2 Σ. ΣΣ ν á¼ á½² καὶ ἀμφαδίην, á¼Ï€Îµá½¶ οὔ τινα δείδιμεν ἔμπης· Î¿á½ Î³Î¬Ï Ï„Î¯Ï‚ με βίῃ γε ἑκὼν á¼€Îκοντα δίηται, 2 . 2 ’ 2 ο 2 2 Σ Α ΣΣΣ . 2 ῳ. οá½Î´Î τι ἰδÏείῃ, á¼Ï€Îµá½¶ οá½Î´â€™ á¼Î¼á½² νήϊδά γ’ οὕτως ἔλπομαι á¼Î½ Σαλαμῖνι γενÎσθαι τε Ï„ÏαφÎμεν τε. ``` ## Comparison with Amazon Textract *Note: Amazon and Textract are trademarks of Amazon.com, Inc.* Textract [documentation](https://aws.amazon.com/blogs/opensource/using-r-with-amazon-web-services-for-document-analysis/) uses page three of the [January 1966 report](https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/greenbook-data/greensheets/greensheets-1966.zip?la=en&hash=8B8ABA92C3F0B2939328D47B6230F3A3) from Philadelphia Fed's [Tealbook](https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/philadelphia-data-set) (formerly Greenbook).  Here is the first element of the list returned by Textract: ```r # List of 13 # $ BlockType : chr "TABLE" # $ Confidence : num 100 # $ Text : chr(0) # $ RowIndex : int(0) # $ ColumnIndex : int(0) # $ RowSpan : int(0) # $ ColumnSpan : int(0) # $ Geometry :List of 2 # .. <not shown> # $ Id : chr "c6841638-d3e0-414b-af12-b94ed34aac8a" # $ Relationships :List of 1 # ..$ :List of 2 # .. ..$ Type: chr "CHILD" # .. ..$ Ids : chr [1:256] "e1866e80-0ef0-4bdd-a6fd-9508bb833c03" ... # $ EntityTypes : list() # $ SelectionStatus: chr(0) # $ Page : int 3 ``` Here is Tesseract's output: ```r file <- system.file("examples", "tealbook.png", package = "cpp11tesseract") text <- ocr(file) cat(text) ``` ``` Nemes mm a a ee en e-em n an ae ee Year SSC—~SSESSC~*«C 1965 IV I Esti- Esti- Pro- 1964 __mated yi/ rr/ rrr! mated _ jected Gross National Product 628.7 675.7 657.6 668.8 681.5 695.0 707.0 Personal consumption expenditures 398.9 428.6 416.9 424.5 432.5 440.5 447.1 Durable goods 58.7 65.0 64.6 63.5 65.4 66.4 66.6 Nondurable goods 177.5 188.8 182.8 187.9 190.5 194.0 197.6 Services 162.6 174.9 169.5 173,1 176.7 180.1 182.9 Gross private domestic investment 92.9 104.9 103.4 102.8 106.2 107.0 109.1 Residential construction 27.5 27.7 27.7 28.0 27.7 27.3 27.5 Business fixed investment 60.5 69.8 66.9 68.4 70.9 73.1 75.1 Change in business inventories 4.8 7.4 8.8 6.4 7.6 6.6 6.5 Nonfarm 5.4 7.1 9.2 6.6 7.0 5.4 5.5 Net exports 8.6 7.3 6.0 8.0 7.4 7.8 8.1 Gov. purchases of goods & services 128.4 135,0 131.3 133.5 135.4 139.7 142.7 Federal 65.3 66.6 64.9 65.7 66.5 69.4 70.7 Defense 49.9 49.9 48.8 49.2 49.8 51.8 52.7 Other 15.4 16.7 16.1 16.5 16.7 17.6 18.0 State and local 63.1 68,4 66.4 67.8 68.9 70.3 72.0 Gross National Product in Constant 577.6 609.3 597.7 603.5 613.0 622.4 630.1 (1958) Dollars Personal income 495.0 530.5 516.2 524.7 536.0 544.9 552.0 Wages and salaries 333.5 357.3 348.9 353.6 359.0 367.5 374.1 Farm income 12.0 14.2 12.0 14.5 15.0 15.3 15.3 Personal contributions for social insurance (deduction) 12.4 13.2 12.9 13.0 13.3 13.6 16.6 Disposable personal income 435.8 465.0 451.4 458.5 471.2 478.7 485.1 Personal saving 26.3 24.6 23.3 22.4 26.8 26.0 25.5 Saving rate (per cent) 6.0 5.3 5.2 4.9 5.7 5.4 5.3 Total labor force (millions) 77.0 78.3 77.7. 78.2 78.5 78.9 79.6 Armed forces " 2.7 2.7 2.7 2.7 2.7 2.8 2.9 Civilian labor force " 74.2 75.6 75.0 75.5 75.8 76,1 76,7 Employed " 70.4 72.1 71.3 71.9 72.4 72.9 73.6 Unemployed " 3.9 3.5 3.6 3.6 3.4 3.2 3.1 Unemployment rate (per cent) 5.2 4.6 4.8 4.7 4.4 4.2 4.0 ``` One way to organize the output is to split the text before the first digit on each line. ```r text <- strsplit(text, "\n")[[1]] text <- text[6:length(text)] for (i in seq_along(text)) { firstdigit <- regexpr("[0-9]", text[i])[1] variable <- trimws(substr(text[i], 1, firstdigit - 1)) values <- strsplit(substr(text[i], firstdigit, nchar(text[i])), " ")[[1]] values <- trimws(gsub(",", ".", values)) values <- suppressWarnings(as.numeric(gsub("\\.$", "", values))) if (length(values[!is.na(values)]) < 1) { next } res <- c(variable, values) names(res) <- c( "variable", "y1964", "y1965est", "y1965q1", "y1965q2", "y1965q3", "y1965q4est", "y1966q1pro" ) if (i == 1) { df <- as.data.frame(t(res)) } else { df <- rbind(df, as.data.frame(t(res))) } } head(df) ``` ``` variable y1964 y1965est y1965q1 y1965q2 y1965q3 1 Gross National Product 628.7 675.7 657.6 668.8 681.5 2 Personal consumption expenditures 398.9 428.6 416.9 424.5 432.5 3 Durable goods 58.7 65 64.6 63.5 65.4 4 Nondurable goods 177.5 188.8 182.8 187.9 190.5 5 Services 162.6 174.9 169.5 173.1 176.7 6 Gross private domestic investment 92.9 104.9 103.4 102.8 106.2 y1965q4est y1966q1pro 1 695 707 2 440.5 447.1 3 66.4 66.6 4 194 197.6 5 180.1 182.9 6 107 109.1 ``` The result is not perfect (e.g. I still need to change "Gross National Product in Constant" to add the "(1958) Dollars"), but neither is Textract's and it requires to write a more complex loop to organize the data. Certainly, this can be simplified by using the Tidyverse.