Parsing PDFs in R for analysis

I've been addicted to the quanteda package for a while now. But I've been using it the 'easy' way - with a .csv file, proper document variables, the whole thing. Now I'm facing the need to analyse Daily Logs from Brazilian executive governments. Daily Logs are a log of everything that the government did that day, including legislation, decrees, bureaucratic norms, job openings. More than a list, this is where governments officially declare their business, so it's a full account. The one in the image is the federal government's daily log for 23 June 2022, section 1. There were four additional sections that day. This one was 172 pages long.

It's not only the Union that does this, but also state and municipal governments. Along with my co-authors, I'm running an analysis on specific decrees issued by the governments of the state of Minas Gerais and the city of Belo Horizonte.

Source: Wikimedia, Raphael Lorenzeto de Abreu

This meant downloading all the PDFs within the timeframe we required, reading them into R, reformatting the multi-column layout (Belo Horizonte's format uses four columns), then extracting only what we wanted (excluding all text that wasn't one of those specific decrees). The Union's log is published in multiple versions, including open data, but our governments of interest do not offer that.
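As a rough illustration of the reformatting and extraction steps (a minimal base R sketch, not the code from the replication file), the idea is to cut each line of a page into columns, stack the columns in reading order, and then keep only the blocks that start with a decree heading. Everything named here is an assumption made for the example: the helpers `split_columns()` and `extract_acts()`, the fixed column width, the heading patterns, and the toy page. In practice the page text would come from a PDF reader such as `pdftools::pdf_text()`, which returns one string per page, and real logs rarely have perfectly fixed-width columns.

```r
# Hypothetical helper: cut each line of a layout-preserved page into
# fixed-width columns and stack the columns in reading order.
split_columns <- function(page_text, col_width) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  n_cols <- ceiling(max(nchar(lines)) / col_width)
  # Pad short lines so every column slice exists on every line
  lines <- paste0(lines, strrep(" ", n_cols * col_width - nchar(lines)))
  cols <- lapply(seq_len(n_cols), function(i) {
    trimws(substr(lines, (i - 1) * col_width + 1, i * col_width))
  })
  stacked <- unlist(cols)
  paste(stacked[nzchar(stacked)], collapse = "\n")
}

# Hypothetical helper: group lines into acts (a new act starts at every
# heading line) and keep only the acts whose heading matches `keep`.
# Both patterns are assumptions; real headings will need their own regex.
extract_acts <- function(text,
                         keep = "^DECRETO ",
                         heading = "^(DECRETO|PORTARIA|LEI) ") {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  act_id <- cumsum(grepl(heading, lines))
  in_act <- act_id > 0  # drop anything before the first heading
  acts <- vapply(split(lines[in_act], act_id[in_act]),
                 paste, character(1), collapse = "\n")
  unname(acts[grepl(keep, acts)])
}

# Toy two-column page standing in for one element of
# pdftools::pdf_text("some_log.pdf") (made-up content, not real log text)
page <- paste(
  sprintf("%-20s%s", "DECRETO N 1", "PORTARIA N 2"),
  sprintf("%-20s%s", "Art. 1 texto", "Art. 1 outro"),
  sep = "\n"
)

stacked <- split_columns(page, col_width = 20)
decrees <- extract_acts(stacked)
cat(decrees, sep = "\n\n")
```

For a four-column layout like Belo Horizonte's, the same call would simply see four slices per line, provided the column width is chosen to match the page.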

The code I've frankensteined to do what I needed came from Alex Luscombe and from Stack Overflow (specifically, Pierre L).

I managed to parse the 22 documents from the government of Minas Gerais and the 44 from Belo Horizonte. I found all the decrees I needed, saved them to .csv and will now analyse them in quanteda. The markdown is below for replication, but I haven't had a chance to write the one for the batch download of these PDFs, which are not stored in the most friendly manner.
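For readers who haven't done the hand-off from .csv to quanteda before, here is a minimal sketch of what it could look like. The column names (`doc_id`, `government`, `text`), the file path and the two placeholder rows are assumptions for illustration, not the actual data. The quanteda part only runs if the package is installed.

```r
# Placeholder rows standing in for the extracted decrees (made-up text)
decrees <- data.frame(
  doc_id = c("mg_001", "bh_001"),
  government = c("Minas Gerais", "Belo Horizonte"),
  text = c("DECRETO N 1 Art. 1 texto", "DECRETO N 2 Art. 1 outro texto"),
  stringsAsFactors = FALSE
)

# Save to .csv and read it back, keeping UTF-8 for Portuguese accents
csv_path <- tempfile(fileext = ".csv")
write.csv(decrees, csv_path, row.names = FALSE, fileEncoding = "UTF-8")
decrees_in <- read.csv(csv_path, stringsAsFactors = FALSE,
                       fileEncoding = "UTF-8")

if (requireNamespace("quanteda", quietly = TRUE)) {
  # Columns other than doc_id and text become document variables
  corp <- quanteda::corpus(decrees_in,
                           docid_field = "doc_id",
                           text_field = "text")
  toks <- quanteda::tokens(corp, remove_punct = TRUE)
  dfmat <- quanteda::dfm(toks)
  print(quanteda::docvars(corp))
}
```

Keeping the government as its own column means it travels with each decree as a document variable, which is what makes the 'easy' quanteda workflow possible later.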

Download PDF • 206KB

As always, I'm barely a coder, but I get through it, and I think everyone can if they want to. So my goal is to make it easier if I can. Happy coding!
