Title: | Extract Text From PDFs In An R Friendly Way |
---|---|
Description: | Extracts text from PDF into an R dataframe giving the content, size, position and font of any text elements. This information can then be manipulated in R. |
Authors: | Allan Cameron [aut, cre, cph], Eli Pousson [ctb] |
Maintainer: | Allan Cameron <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2024-11-21 04:20:18 UTC |
Source: | https://github.com/AllanCameron/PDFR |
Draws glyphs from a TrueType font as grid grobs
draw_glyph(fontfile, glyph)
fontfile | a raw vector representing a font file |
glyph | the character to be drawn. Can be text or an integer |
no return
## Not run:
if(interactive()){
  # ttf <- "raw vector with font file"
  draw_glyph(ttf, "a")
}
## End(Not run)
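The `fontfile` argument is described as a raw vector, so a font on disk has to be read into memory first. The following is a minimal sketch of one way to do that with base R. The helper `read_raw_file()` and the font path in the trailing comment are hypothetical and not part of the package, and the demonstration uses a throwaway four-byte file rather than a real font.

```r
# Hypothetical helper (not part of PDFR): read an entire file into a raw vector.
read_raw_file <- function(path) {
  readBin(path, what = "raw", n = file.info(path)$size)
}

# Demonstrate on a small temporary file so the sketch runs anywhere.
tmp <- tempfile(fileext = ".bin")
writeBin(as.raw(c(0x00, 0x01, 0x00, 0x00)), tmp)
ttf <- read_raw_file(tmp)

is.raw(ttf)   # the vector type draw_glyph() is documented to accept
length(ttf)   # one element per byte in the file

# With a real font file (the path below is an assumption) and the package
# installed, the call would then look like:
# ttf <- read_raw_file("path/to/font.ttf")
# draw_glyph(ttf, "a")
```

The same approach should apply to any function here that accepts a raw data vector in place of a file location, such as `get_xref()`.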
Returns a list consisting of a named vector representing key:value pairs in a specified object. It also contains any stream data associated with the object.
get_object(pdf, number)
pdf | a valid pdf file location |
number | the object number |
a named vector of the dictionary and stream of the pdf object
get_object(pdfr_paths$leeds, 1)
Get a pdf's xref table as an R dataframe
get_xref(pdf)
pdf | a valid pdf file location or raw data vector |
a data frame showing the bytewise positions of each object in the pdf
get_xref(pdfr_paths$leeds)
Used mainly for debugging, this function returns an R dataframe with one row for each byte that may be used as a glyph. It shows the Unicode code point of each interpreted glyph, as well as its width in text space.
getglyphmap(pdf, page = 1)
pdf | a valid pdf file location |
page | the page number from which to extract glyphs |
a dataframe of all entries of font encoding tables with width mapping
getglyphmap(pdfr_paths$leeds, 1)
Returns contents of a pdf page description program
getpagestring(pdf, page)
pdf | a valid pdf file location |
page | the page number to be extracted |
a single string containing the page description program
getpagestring(pdfr_paths$leeds, 1)
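Because the page description program comes back as a single string, it can be inspected with ordinary string tools. The sketch below pulls out the literal strings shown by the PDF `Tj` operator. The `page_string` value is a hand-written stand-in, not real output from `getpagestring()`, and the pattern is deliberately simple: it ignores escaped or nested parentheses, hex strings and `TJ` arrays, all of which real pages may use.

```r
# Hypothetical page description program, written by hand for illustration.
page_string <- "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET"

# A parenthesised literal string followed by the Tj (show text) operator.
pattern <- "\\(([^()]*)\\)\\s*Tj"

# Find every occurrence of the pattern in the page string.
matches <- regmatches(page_string,
                      gregexpr(pattern, page_string, perl = TRUE))[[1]]

# Keep only the captured text between the parentheses.
shown_text <- sub(pattern, "\\1", matches, perl = TRUE)
shown_text
```

This kind of inspection is mainly useful for debugging, since `pdfpage()` and `pdfdoc()` already return the interpreted text elements.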
Plots the bounding boxes of text elements from a page as a ggplot.
pdfboxes(pdf, pagenum)
pdf | a valid pdf file location |
pagenum | the page number to be plotted |
a ggplot
pdfboxes(pdfr_paths$leeds, 1)
Returns contents of all pdf pages
pdfdoc(pdf)
pdf | a valid pdf file location |
a data frame of all text elements in a document
pdfdoc(pdfr_paths$leeds)
Plots the graphical elements of a pdf page as a ggplot
pdfgraphics(file, pagenum, scale = 1)
file | a valid pdf file location |
pagenum | the page number to be plotted |
scale | Scale used for linewidth and text size. Passed to 'ggplot2::geom_text()' size parameter as scale * size/3 |
a ggplot
pdfgraphics(pdfr_paths$leeds, 1)
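The `scale` argument is documented as reaching `ggplot2::geom_text()` as `scale * size/3`. A small worked example of that arithmetic follows. The function `geom_text_size()` is a hypothetical stand-in written for illustration and is not exported by the package, and it covers only the text-size formula, since the linewidth calculation is not specified here.

```r
# Hypothetical illustration of the documented formula: scale * size / 3.
geom_text_size <- function(size, scale = 1) {
  scale * size / 3
}

default_size <- geom_text_size(12)             # 12-unit text at scale = 1 gives 4
doubled_size <- geom_text_size(12, scale = 2)  # the same text at scale = 2 gives 8

default_size
doubled_size
```

Doubling `scale` therefore doubles the size passed to `geom_text()` for every text element on the page.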
Plots the graphical elements of a pdf page as grobs
pdfgrobs(file_name, pagenum, scale = dev.size()[2]/10, enc = "UTF-8")
file_name | a valid pdf file location |
pagenum | the page number to be plotted |
scale | Document scale. Defaults to 'dev.size()[2]/10' |
enc | Document encoding. Defaults to "UTF-8" |
invisibly returns grobs as well as drawing them
pdfgrobs(pdfr_paths$leeds, 1)
Returns contents of a pdf page
pdfpage(pdf, page = 1, atomic = FALSE, table_only = TRUE)
pdf | a valid pdf file location |
page | the page number to be extracted |
atomic | a boolean - should each letter be treated individually? |
table_only | a boolean - return data frame alone, as opposed to list |
a list containing data frames
head(pdfpage(pdfr_paths$leeds, page = 1))
head(pdfpage(pdfr_paths$chestpain, page = c(1:2)))
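Once the text elements are in a data frame, ordinary subsetting is enough to pick out items by size or position. The sketch below runs on a hand-built data frame standing in for `pdfpage()` output. The column names (`text`, `left`, `right`, `bottom`, `top`, `font`, `size`) and all of the values are assumptions made for illustration, so check them against the real output before reusing the code.

```r
# Mock data frame standing in for pdfpage() output (column names assumed).
elements <- data.frame(
  text   = c("Annual Report", "Revenue", "1200", "Costs", "900"),
  left   = c(72, 72, 300, 72, 300),
  right  = c(250, 130, 330, 110, 325),
  bottom = c(740, 650, 650, 630, 630),
  top    = c(764, 662, 662, 642, 642),
  font   = c("Helvetica-Bold", "Helvetica", "Helvetica",
             "Helvetica", "Helvetica"),
  size   = c(24, 12, 12, 12, 12),
  stringsAsFactors = FALSE
)

# Treat the most common font size as body text; anything larger is a heading.
body_size <- as.numeric(names(which.max(table(elements$size))))
headings  <- elements[elements$size > body_size, ]

# Select elements whose left edge falls in the right-hand column,
# then convert their text to numbers.
right_column <- elements[elements$left >= 300, ]
values       <- as.numeric(right_column$text)

headings$text
values
```

Filtering on position like this is the sort of step the plotting functions are meant to support, since they show where each element sits on the page.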
Plots the text elements from a page as a ggplot. The aim is not a complete pdf rendering but to help identify elements of interest in the data frame of text elements to convert to data points.
pdfplot(pdf, page = 1, atomic = FALSE, boxes = FALSE, textsize = 1)
pdf | a valid pdf file location |
page | the page number to be plotted |
atomic | a boolean - should each letter be treated individually? |
boxes | Show the calculated text bounding boxes |
textsize | the scale of the text to be shown |
a ggplot
pdfplot(pdfr_paths$leeds, 1)
A list of paths to locally stored test pdfs
pdfr_paths
A list of 9 pdf files
a pdf constructed in Rstudio
a flow-chart for chest pain management
information about the pdf format
an official adobe document
a table-rich local government document
a document based on svg
a simple pdf test
a simple tex test
a CRAN package vignette
A registered native symbol used in testing
run_testthat_tests
A list of 4 fields
run_testthat_tests
a pointer to this symbol
the compiled file where the symbol is contained
no parameters