The problem

Every time mayoral or congressional elections in Peru approach, I wonder:

Wouldn't it be nice if there were a tool to do a quick background check on the candidates?

What usually happens is that once the candidates are elected, journalists find out that some had been prosecuted for a variety of crimes such as embezzlement, corruption, and even murder. It would be great if we could find out this information before the elections so we could choose better.

Some history

During the 2014 mayoral elections in Peru, I was part of the project "Verita", which scraped public information from governmental websites in order to cross-check the lists of candidates for mayor and governor. Of the 116,000 candidates, we found that around 1,400 had been prosecuted and sentenced on charges including corruption, murder and terrorism. The most common offense [text in Spanish] was failing to pay child support for one or more children.

In recent weeks, my buddy @matiskay and I have been recycling the software produced during the Verita project. The software consists of spiders that crawl governmental websites and scrape the data.

While doing this, we made sure that all the information is public and has been published by the Peruvian government on its websites or portals. We do no hacking, password cracking or exploitation of vulnerabilities. There is no need, as the Peruvian government makes an effort to keep open data publicly available.

All the software is open source and is hosted on the social network for nerds known as GitHub. We have spiders to scrape several governmental websites.

There is also software provided by anonymous programmers. The user @escribiendocodigo wrote a spider to scrape data from the Superintendencia de Transporte Terrestre (SUTRAN). Link here.

The user @wesitos wrote a spider to scrape data from Infogob. Link here.

The software

These spiders scrape data from websites: they visit a page, discover internal links and visit them in turn. On each visited page, the spiders parse the content of interest and save the data to local disk.
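The crawl loop described above, stripped to its essentials, is "fetch a page, pull out the links, repeat". As a toy illustration (the real spiders use Scrapy's machinery instead), here is a link extractor built with nothing but Python's standard library:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

page = '<html><body><a href="/candidates">list</a><a href="/about">about</a></body></html>'
print(extract_links(page))  # → ['/candidates', '/about']
```

A real crawler would download each discovered URL, extract its links, and keep a set of already-visited pages to avoid loops; Scrapy does all of this for you.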

Programming these spiders is relatively easy because they are built on the popular framework Scrapy. Scrapy helps a lot by handling cookies, retries for failed pages, redirections and asynchronous execution. Content can be extracted using XPath and CSS selectors. Scrapy is written in Python and is free to use (open source).


This is what @matiskay and I have done:

  • run the spiders from the project Verita,
  • scrape and download all the data,
  • upload the data to a server in the cloud,
  • index the data using elasticsearch for quick retrieval,
  • implement a search tool in Django as a plugin of our successful web application Manolo, buscador de lobistas (a lobbyist search engine).
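The indexing step is what makes lookups fast. We use elasticsearch for it; the core idea underneath is an inverted index that maps each token to the records containing it, so a query only touches a handful of entries instead of scanning everything. A toy sketch of that idea (the records are made up, not the real schema):

```python
from collections import defaultdict

# Hypothetical records; the real data comes from the scraped sources.
records = [
    {"id": 1, "name": "Juan Perez Garcia", "source": "REDAM"},
    {"id": 2, "name": "Maria Lopez Diaz", "source": "narcoindultos"},
]

# Build an inverted index: token -> set of record ids containing it.
index = defaultdict(set)
for rec in records:
    for token in rec["name"].lower().split():
        index[token].add(rec["id"])

def search(query):
    """Return the ids of records matching every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

print(search("perez juan"))  # → {1}
```

elasticsearch adds analyzers, relevance scoring and fuzzy matching on top of this basic structure, but the retrieval principle is the same.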

The name of this app is Manolo-cazador. When the political parties publish their lists of candidates for Congress, anyone can paste a list into Manolo-cazador and run a quick search. If a candidate happens to be registered in our database, you will get a search result.

Thus, there is no need to look each person up in the numerous databases of the Peruvian government: Manolo-cazador has all the databases available in one place. Searching for 130 names takes no more than a few seconds.
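The batch workflow can be sketched in a few lines: split the pasted text into one name per line, normalize each name, and look it up. Here is a toy exact-match version with a made-up flagged set (the real app queries elasticsearch, which also tolerates typos and word-order differences):

```python
# Hypothetical set of flagged names; the real data lives in elasticsearch.
flagged = {"juan perez garcia", "maria lopez diaz"}

def check_candidates(pasted_list):
    """Return the pasted names that appear in the flagged set."""
    hits = []
    for line in pasted_list.splitlines():
        # Normalize case and whitespace before comparing.
        name = " ".join(line.lower().split())
        if name and name in flagged:
            hits.append(line.strip())
    return hits

pasted = """Juan Perez Garcia
Ana Torres Ruiz
Maria Lopez Diaz"""
print(check_candidates(pasted))  # → ['Juan Perez Garcia', 'Maria Lopez Diaz']
```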

We hope that this tool proves useful to citizens and journalists.

Data in Manolo-cazador

Currently, the database of Manolo-cazador contains data from the following sources:

  • debtors of child support (REDAM),
  • debtors of civil reparations owed to the Peruvian State,
  • candidates in the 2014 mayoral elections who declared having been sentenced for various crimes,
  • people who received presidential pardons during 2006-2011 (a.k.a. narcoindultos).

If you have suggestions for databases to scrape and include in Manolo-cazador, please let us know.


Big Disclaimer:

I work as a software engineer at Scrapinghub,
the company that created and maintains the Scrapy framework.


Tags: Python scrapy data-journalism