A Message Chain Circulating among Investors Questions Methodology Utilized in Election Polling

Articles claim that electoral surveys have biased samples; Datafolha explains how its samples are selected

São Paulo

Two articles written in English and circulating in investor groups on WhatsApp are critical of Datafolha. There are two versions: a shorter one, associated with the American fund management firm NCH Capital, and a longer one, written in the first person and unsigned. Both conclude that investment decisions shouldn’t be based on electoral polls because the polls are skewed.

Both articles are the work of James Gulbrandsen, an investment manager at NCH Capital. He says that he shared the longer version with a friend to get his opinion on the subject and that it ended up being leaked. The shorter version, which carries the company’s signature, was a rough draft of an internal study and was also leaked.

Brazilian electronic ballot box - Pedro Ladeira/Folhapress

According to the manager, the articles don’t reflect the opinion of the company nor his own, since the data in question was still under analysis.

The articles claim that the electoral surveys include a biased weighting of the Northeastern population with income of up to two minimum salaries, include smaller relative percentages than the general population of people who define themselves as Catholics and Evangelicals and include a larger percentage of people who identify themselves with left-wing ideas. For Gulbrandsen, these sampling errors result in skewed surveys that favor candidates like Ciro Gomes (PDT Party) and Fernando Haddad (PT Party).

According to Datafolha statisticians, the concepts and techniques utilized in their electoral surveys are taken directly from accepted Sampling Theory. 

“The samples are representative of the populations being surveyed and are selected through statistical criteria, based on official sources such as IBGE (Brazilian Geographic & Economic Statistics Institute) and the TSE (Superior Electoral Tribunal)”, they explain.

After being contacted by a reporter from the Folha, Gulbrandsen released a new version of the article reaffirming that the surveys are based on skewed samples and reiterating the arguments used in the previous articles.

For this investment fund manager, who makes it clear that his opinion isn’t that of NCH Capital itself, it is imprudent for investors to base financial decisions on electoral surveys. “I don’t want to offend the statisticians, but their results may be completely irrelevant”, he says.

Gulbrandsen also questioned the political bias of Datafolha, the Folha and UOL, all companies from the Folha Group, saying that they have “leftist leanings”.

The Folha reiterates that it seeks to practice pluralistic, non-partisan, objective and critical journalism, according to what is prescribed in its Editorial Project, updated in 2017, and outlined in its Editing Manual.

You can see the complete material from the articles reproduced here with explanations provided by Datafolha statisticians point by point:

 

Lies, damn lies and statistics

Every week during the prime time of every presidential election Brazilian market participants await the results of the golden cows of polls: Datafolha and Ibope.

Their results--rumored and real--impact all Brazilian assets, from equities to fixed income and FX.

2018 brings major challenges to polling and as a quant/statistician/mathematician who happens to also be a portfolio manager, I have news for you: ignore Datafolha.

Datafolha is making the same error most of the polls in the US made during the 2016 elections: sampling error.

Sampling errors can occur when you poll a sample of the population that is not representative of the demographics and political distribution of a voter population.

“Datafolha utilizes concepts and techniques based on Sampling Theory. Samples are representative of the populations being surveyed – in this case the Brazilian electorate 16 years of age or older – and are selected through statistical criteria, based on official sources such as IBGE (Brazilian Geographic & Economic Statistics Institute) and the TSE (Superior Electoral Tribunal).

The sample design is obtained through a robust and probabilistic statistical method, in a multi-stage process as follows: in the first phase, stratification is performed by geographical region and municipal nature – capital, metropolitan region and countryside. In each stratum, a three-stage cluster sample is selected. First, a random selection of the municipalities that will be part of the sample, with probability proportional to size (PPS). Second, a random selection of interview locations in each municipality. Third, a random selection of the interviewees based on the distributions of sex and age range of the population being surveyed. The effectiveness of the methodology utilized is proven by the Institute's performance history.”
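The first stage of the design described above (stratify, then draw municipalities with probability proportional to size) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Datafolha's actual procedure: the municipality list, the electorate sizes, the function names and the choice to draw with replacement are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical municipalities: (name, region, type, electorate size).
MUNICIPALITIES = [
    ("A", "Southeast", "capital", 900_000),
    ("B", "Southeast", "countryside", 120_000),
    ("C", "Southeast", "countryside", 40_000),
    ("D", "Northeast", "capital", 600_000),
    ("E", "Northeast", "countryside", 80_000),
    ("F", "Northeast", "countryside", 20_000),
]


def stratify(municipalities):
    """Group municipalities into strata keyed by (region, type)."""
    strata = defaultdict(list)
    for name, region, kind, size in municipalities:
        strata[(region, kind)].append((name, size))
    return dict(strata)


def pps_draw(units, n_draws, rng):
    """Draw n_draws units with replacement, each with probability
    proportional to its size (an assumed simplification)."""
    names = [name for name, _ in units]
    sizes = [size for _, size in units]
    return rng.choices(names, weights=sizes, k=n_draws)


def select_municipalities(municipalities, draws_per_stratum, seed=0):
    """Stage one only: a PPS draw of municipalities inside each stratum."""
    rng = random.Random(seed)
    return {
        stratum: pps_draw(units, draws_per_stratum, rng)
        for stratum, units in stratify(municipalities).items()
    }


if __name__ == "__main__":
    for stratum, chosen in select_municipalities(MUNICIPALITIES, 3).items():
        print(stratum, chosen)
```

In this sketch a municipality with three times the electorate of another in the same stratum is three times as likely to be drawn. The later stages (interview locations, then interviewees by sex and age range) are left out.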

For example, in the 2016 US presidential polls, many of the polls showing Hillary Clinton ahead of Donald Trump by 6, 8 and sometimes 10-plus percentage points were using population samples in which up to 40% of respondents were registered as or identified as Democrats. At the time, the actual number of registered and identifying Democrats in the US was closer to 32%. There was the potential for up to 6-8% of bias embedded in most of the polls.

We didn't predict a Trump victory, but we did go on the record saying the polling was incorrect and that the results would likely be within the margin of error (they were). My wife, an ardent Democrat, planned an election party, certain that Hillary would win based on the polling. I warned her the party was possibly not going to end well. That, unfortunately for my country and the world, was the case as we now have to live with a madman as the most powerful man in the world.

While Datafolha only publishes the criteria and sampling data used in its penultimate polls on its website (which is ridiculous...where's the transparency?!), making it difficult to ascertain just how biased the current poll is, the amount of bias in past polls in Datafolha's sampling is significant enough to make them, in a word, worthless. In a phrase: not worth the paper they are printed on.

“All electoral surveys conducted by Datafolha for the Folha are made available on the site on the day after their results are published, in a completely transparent fashion. Datafolha has been a pioneer in publishing all of the information gathered in its opinion polls in its entirety, including a detailed profile of the respondents to each question.”

For example, the poll taken in late August had a sample of 53% women, and 47% men. There are 50.8% women in Brazil.

“According to data from the TSE the percentage of women in the Brazilian electorate likely to vote in 2018 is 52.5% (53% when the number is rounded). In 2014 this percentage was 52.2% and in 2010 it was 51.9%. Only going back as far as 2002 can you find a female electorate percentage of 50.88%.”

More importantly, the poll sampled 34% evangelicals whereas 22% of Brazilians identify as evangelicals. Education levels were lower than those identifying as Catholic (only 53% despite 64% of Brazilians identifying as Catholic), likely skewing this poll to the left.

Granted, the IBGE population data I cite is dated but even based on current projections, evangelicals are meaningfully over sampled and Catholics, under represented.

“The data utilized in the article are, apparently, from the 2010 Census (IBGE) and can’t be compared with those collected for electoral surveys conducted in 2018. The primary reason relates to the timeline, due to the fact that Brazilians are passing through a process of religious transition. The 2010 Census had already diagnosed a significant increase of evangelicals in comparison with the previous survey in 2000. The percentage of evangelicals rose from 15.4% to 22.2% during this period. The next Census will only be held in 2020 but there are numerous studies showing accelerated growth in this segment of the population, which would make any research based on older data erroneous. Datafolha’s historical data series on the religion of the Brazilian population is very solid and reveals a transition trend which will be confirmed in the next Census (2020) and which is used regularly in studies conducted by specialists in the area, which demonstrates its relevance.”

This leftward bias is further confirmed by virtue of the higher distribution of evangelicals in Pernambuco vs. Sao Paulo, for example. There almost certainly are more evangelicals in Pernambuco. But is 38% an appropriate sample? From 2000 to 2010, Pernambuco rose from 13% to 20% evangelical. Applying the same linear growth would imply Pernambuco has just over 27% evangelicals today. This is an absolutely horrible example of over sampling, and to the left.
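The linear extrapolation in the paragraph above can be checked with a short Python sketch. It takes the article's own figures (13% in 2000, 20% in 2010) at face value; the function name and the target years are assumptions added for illustration.

```python
def linear_projection(year, y0=2000, v0=13.0, y1=2010, v1=20.0):
    """Extend the straight line through (y0, v0) and (y1, v1) to `year`.

    Values are percentages of the state's population; the defaults are the
    figures quoted in the article for Pernambuco.
    """
    slope = (v1 - v0) / (y1 - y0)  # percentage points per year
    return v1 + slope * (year - y1)


if __name__ == "__main__":
    for year in (2018, 2020):
        print(year, round(linear_projection(year), 1))
```

Under these assumptions the straight line gives about 25.6% for 2018 and 27% for 2020, so the article's figure is closer to the 2020 projection.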

The anti-Catholic bias in the sample probably hurts Geraldo Alckmin the most. He may actually be in second place outside of the margin of error. Ciro Gomes likely benefits most, particularly in the northeast data cohort.

“Datafolha survey data that includes religious profile (from August 20 and 21 of this year) reveal that, contrary to what the author of the article claims, Ciro Gomes has a higher index of voter preference among Catholics than among Evangelicals. Alckmin shows little difference between the two groups and would have had the same index of preference if the survey had been conducted only among Catholics. The candidate who has the highest index of preference among Evangelicals is Jair Bolsonaro, who finishes in first place in all published polls and in some of them with an even higher voter preference than that measured by Datafolha.”

I could stop now. The Datafolha poll sample is meaningfully biased, and its conclusions, given the pulverization of candidates, are likely largely worthless. But there’s more.

We don't know the party affiliation breakdown in the most recent or penultimate polls but in one poll from Datafolha in June, approximately 24% of respondents stated they were affiliated with the PT and PSOL. This compares to only 12% of seats held in the Camara by the PT and PSOL. Is it possible those political demographics are being over sampled as well in the recent polls?

“The author confuses political party affiliation with voting preferences for a specific political party. They are completely different things and it makes no sense to correlate this number with the party composition of the Congress. The historical data series which has been collected by Datafolha since 1992 shows that the PT, one of the parties cited, had an index as high as 29% in voter party preferences at the beginning of 2013 which fell 10 points after the 2013 terms and reached a low of 9% in March of 2015, the height of the so-called Car Wash operation developments and revelations. In other words, there is an important correlation with significant moments in Brazilian politics. The PSOL party, by itself, has between 0 and 1 percent of voter preference. It is invalid to add results from the two parties together.”

And let's just be honest. We all know UOL, Folha de Sao Paulo and Datafolha are left-leaning. Polling is like discounted cash flow analysis: you can achieve whatever result you desire by tweaking the inputs. Is Datafolha intentionally biasing their polls? I don't know, only they do, and perhaps their assumptions have basis in some form of logic. But compared to actual Brazilian demographics, there's clear bias embedded in these polls.

Truth is, I don't envy any of the polling services this year. They face a nearly impossible situation. A high degree of uncertain voters, who may, in truth be Bolsonaro voters that feel ashamed to admit as being such. And there is a very open field of candidates.

But if you are making trades on the basis of Datafolha, beware. You're probably making decisions based on biased information.

 

Lies, Damn Lies and Statistics. Part II

Last week, we at NCH Capital highlighted the oversample of sex and religion in Datafolha's presidential polling.

This week, we look at the terrible oversampling of lower income respondents, primarily in the Northeast. And once again, the conclusion is the same: ignore Datafolha. 

Last night's Datafolha presidential poll used a sample of 46% of respondents with income of up to 2 monthly salaries. That's ridiculous. Per IBGE, 33% of Brazilians make up to 2 monthly salaries, thus there is a meaningful oversample of lower income respondents.

“The percentage cited in the article (33% up to 2 monthly salaries) cannot be found in either the 2010 Census or the PNAD (National Household Sampling Survey). Several different institutions that follow voting preferences come up with percentages comparable to those from Datafolha, although none of them are completely transparent regarding the profile obtained. In the Datafolha survey data cited by the article, the percentage of those interviewed who declare their family income as up to 2 minimum salaries was 46% while a survey contracted by XP Investments found this percentage to be 48% and a survey recently conducted by Ibope found 51%.”

But, yet again, it gets worse. If we look at the income of respondents from the very left-leaning Northeast, 64% earn up to 2 minimum monthly salaries. However, according to IMF data, if we look at data from Bahia, Maranhao, and Pernambuco as representative of the Northeast, an appropriate sample of lower income respondents would be approximately 45% of up to 2 minimum salaries.

“It makes no sense to compare any number referring only to Bahia, Maranhão and Pernambuco with all of the Northeast. The region is made up of nine states with different characteristics and realities from each other, although the income in the region is significantly lower than that measured in the country as a whole”.

That's an over sample of about 20% which given that 27% of respondents were from the Northeast, could result in 5% error in the polling. In a year when getting 10-15% of the first round votes could send a candidate to the second round and possibly victory, that is an absolutely stunning skew in the Datafolha results.
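The arithmetic behind the 5% figure can be reproduced with a short Python sketch. It reads the article's numbers literally (64% in the sample, a 45% benchmark, 27% of respondents from the Northeast); the function name is an assumption for illustration, and the calculation is a rough weighting, not a statistical error estimate.

```python
def implied_error_points(sample_share, benchmark_share, region_weight):
    """Gap between sample and benchmark (in percentage points), scaled by
    the region's share of all respondents."""
    return (sample_share - benchmark_share) * region_weight


if __name__ == "__main__":
    gap = 64 - 45  # percentage-point gap within the Northeast
    overall = implied_error_points(64, 45, 0.27)
    print(gap, round(overall, 2))
```

The 19-point gap multiplied by the Northeast's 27% share comes to roughly 5.1 points, which matches the article's "5%". The result depends entirely on the 45% benchmark, which Datafolha disputes in its reply above.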

So come Monday, once again, if you're trading on the basis of Datafolha, you're likely trading on biased data. Again.