Author: Samuel Tan

Au Revoir

I concluded my 4 weeks at Full Fact a few days back, and as clichéd as it sounds, I enjoyed every minute of it – including the daily bus commute.

We had built on my predecessor’s work, integrating the three different stages of the automated factchecking process that we had decided on. Though 4 weeks isn’t a long period of time – considering that Imperial’s academic terms are 11 weeks long (from week 0 – yes, week 0 – through week 10) – it was interesting to see how our project developed from start to finish.

The ‘strong and stable’ bridge which I mentioned in my previous post(s) was crucial in integrating the first and third stages of the automated factchecking process, and communication was key in this area. I needed to know what my comrade’s outputs (in terms of Five Year Plans) from the first stage were. This stage involved using Natural Language Processing (NLP) to parse the key terms in a sentence, and my comrade and I saw eye-to-eye – surprising given that he’s a head taller than me – on the key terms that the NLP programme should extract.

Up next were the outcomes from the third (and last) stage. After several daily discussions, we were on the same frequency – 60.231 Hertz to be exact – regarding the output for this stage. We decided on returning a dictionary – not the Merriam-Webster type, but rather of Pythonic form – of data relevant to the claim. For example, the relevant data for the claim ‘GDP rose in 2015’ would be the absolute GDP in 2014 and 2015, as well as the resulting percentage increase.
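To make that concrete, here’s a minimal sketch of the kind of dictionary we settled on for the ‘GDP rose in 2015’ claim. The key names and figures below are my own illustration, not our actual output.

```python
# Illustrative only: made-up key names and figures showing the shape of the
# dictionary returned for the claim 'GDP rose in 2015'.
claim_data = {
    "claim": "GDP rose in 2015",
    "gdp_2014": 1.86e12,   # absolute GDP in 2014 (hypothetical figure)
    "gdp_2015": 1.92e12,   # absolute GDP in 2015 (hypothetical figure)
}
claim_data["percentage_increase"] = (
    100 * (claim_data["gdp_2015"] - claim_data["gdp_2014"]) / claim_data["gdp_2014"]
)
print(claim_data["percentage_increase"])  # roughly 3.2 under these made-up figures
```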

We had fine-tuned the factchecking process for several of the more common claims involving ‘GDP’, although more work needs to be done (especially on the NLP end) for more complicated claims, such as ‘GDP growth during the Thatcher years wasn’t as good as it was during Cameron’s time.’ The process also has to be scaled to other topics, such as inflation and immigration, although many of the common claims in these different areas share the same sentence structure. For example, ‘GDP grew in 2015’ and ‘Inflation rose in 2015’ could be interpreted in a similar way, as sketched below.
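Here’s a rough, hypothetical sketch of that idea: the parsed structure is the same, and only the topic (and therefore the dataset we would consult) changes. The dataset names are made up for illustration.

```python
# Hypothetical sketch: two claims with the same structure, differing only in topic.
parsed_gdp = {"topic": "GDP", "verb": "grew", "time": [2015]}
parsed_inflation = {"topic": "Inflation", "verb": "rose", "time": [2015]}

# Made-up mapping from topic to the dataset we would consult.
DATASETS = {
    "GDP": "ons_gdp_annual",
    "Inflation": "ons_cpi_annual",
}

def dataset_for(parsed_claim):
    """Pick the dataset to query based on the claim's topic."""
    return DATASETS[parsed_claim["topic"]]

print(dataset_for(parsed_gdp))        # ons_gdp_annual
print(dataset_for(parsed_inflation))  # ons_cpi_annual
```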

What does this sentence mean?

If you understand what you have read so far in this post, then you would definitely know the difference between the sentences ‘GDP rose in 2015’ and ‘GDP rose consistently from 2010 to 2015.’ So would a Natural Language Processing (NLP) programme. Some NLP programmes might even one-up us mere mortals by giving the ‘dependency parsing’, ‘parts of speech tags’, ‘named entities’ and other sentence attributes that only learned and esteemed linguistic practitioners like Noam Chomsky, George Orwell and Donald Trump would understand.

Well, NLP programmes (or at least the one we’re currently using) might be able to parse a claim like ‘GDP growth averaged 7.3% under the previous Labour administration’ (warning: fake news; please don’t take this statistic for truth) and flood you with a deluge of sentence attributes. But they are currently unable to understand what this claim entails, and more importantly, the data that should be sought to verify it. NLP has yet to advance to the point where it can take in any sentence humans could conceivably produce and spit out all of its intricacies and subtleties. And so, for now, we have to make do with humans to bridge the gap.
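The post doesn’t say which NLP library we’re using, so purely as an illustration of that ‘deluge of sentence attributes’, here is what a general-purpose library such as spaCy produces for the claim above.

```python
import spacy

# Assumes the small English model has been installed
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("GDP growth averaged 7.3% under the previous Labour administration")

# Part-of-speech tag, dependency label and syntactic head for every token.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities the model picks out, e.g. the percentage and 'Labour'.
for ent in doc.ents:
    print(ent.text, ent.label_)
```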

We looked through a database of claims that were made (solely) regarding GDP and identified the ones whose sentence structures were most common. These sentences were then parsed by the NLP programme that we used, which would output the words in the sentence that corresponded to certain parts of speech / categories. For example, ‘GDP rose consistently from 2010 to 2015’ would give us ‘GDP’ as the ‘topic’, ‘rose’ as the ‘verb’ (or type of flower), ‘consistently’ as the ‘checking_modifier’ (a more glorified term for ‘adverb’) and the years 2010 and 2015 as ‘time’. We could then link certain outputs to specific data that we had to obtain to factcheck the claims. As with any other human endeavour, we are making progress in this area. Our current idea is not to factcheck all claims regarding GDP that are made by Jeremy Corbyn, The Sun or Lord Buckethead, but rather the claims that appear most often in the media.
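Roughly speaking, that linking step could look like the sketch below. The field names mirror the ones above (‘topic’, ‘checking_modifier’, ‘time’), but the rules themselves are simplified, made-up stand-ins rather than our actual code.

```python
# Simplified, hypothetical version of the linking step: given the parsed
# fields, decide which figures are needed to check the claim.
def data_needed(parsed):
    topic = parsed["topic"]          # e.g. "GDP"
    years = sorted(parsed["time"])   # e.g. [2010, 2015]

    if parsed.get("checking_modifier") == "consistently":
        # 'rose consistently from 2010 to 2015' -> every year in the range
        wanted_years = list(range(years[0], years[-1] + 1))
    else:
        # 'rose in 2015' -> that year and the one before it
        wanted_years = [years[-1] - 1, years[-1]]

    return {"topic": topic, "years": wanted_years}

print(data_needed({"topic": "GDP", "verb": "rose",
                   "checking_modifier": "consistently",
                   "time": [2010, 2015]}))
# {'topic': 'GDP', 'years': [2010, 2011, 2012, 2013, 2014, 2015]}
```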

Fingers crossed, we should get an initial working prototype by next week.

 

Is this sentence structure simple?

Sentence structure is central to human language. We understand the difference between the sentences: “Sam is happy because he won the Lottery.” and “Won the Lottery, Sam is happy.” The former follows the rules of the English language; the latter is more likely to be spoken by Yoda in Star Wars.

We are able to understand such simple sentences as well as more complicated ones. However, how do we ensure that a computer (or SkyNet) is able to do so?

Well, this is the job of Natural Language Processing, or NLP for short. My job at Full Fact involves improving their automated factchecking process, and this entails using NLP to process whatever claims politicians, journalists and the like might make.

If you haven’t read my previous post: our factchecking process can be narrowed down to 3 stages. The first involves using NLP to process the claim, while the second involves going to the relevant websites, such as the Office for National Statistics (ONS), to get the relevant data. At the last stage, we present the simplified data in a way that is easy for all of mankind to understand.
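In code terms, the three stages chain together roughly as in the skeleton below. The function names and the stubbed-in figures are my own illustration, not Full Fact’s actual code.

```python
# Skeleton of the three-stage process (illustrative names and stub bodies only).

def parse_claim(claim):
    """Stage 1: use NLP to pull the key terms out of the claim."""
    # Stub standing in for a real NLP parse.
    return {"topic": "GDP", "verb": "rose", "time": [2015]}

def fetch_data(parsed):
    """Stage 2: fetch the relevant figures, e.g. from the ONS."""
    # Stub standing in for a real query; the figures are made up.
    return {2014: 1.86e12, 2015: 1.92e12}

def present(parsed, data):
    """Stage 3: present the data in a simple, unambiguous way."""
    earlier, later = sorted(data)
    change = 100 * (data[later] - data[earlier]) / data[earlier]
    return f"{parsed['topic']} changed by {change:.1f}% between {earlier} and {later}."

parsed = parse_claim("GDP rose in 2015")
print(present(parsed, fetch_data(parsed)))
# GDP changed by 3.2% between 2014 and 2015.
```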

This week, much of our focus was on the bridge between the first and the second stage. While it might seem straightforward on paper, NLP has yet to reach the capabilities of Jarvis, Tony Stark’s ultra-capable artificial intelligence in Iron Man. We concentrated on obtaining the keywords from sentences such as ‘GDP rose in 2015’ and then linking these keywords to claims of a certain type. The latter gives us an idea of what data to obtain from the ONS website and then present.
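As a rough sketch of that bridge (with made-up names and deliberately simple rules), the extracted keywords get matched against a small set of claim types, and the claim type then tells us what to ask the ONS for.

```python
# Hypothetical sketch of the bridge between stages one and two: keywords from
# the NLP parse are matched to a claim type, which drives the data request.
CLAIM_TYPES = {
    "single_year_change": lambda kw: len(kw["time"]) == 1,
    "trend_over_range": lambda kw: len(kw["time"]) == 2,
}

def classify(keywords):
    """Return the first claim type whose rule matches the extracted keywords."""
    for name, matches in CLAIM_TYPES.items():
        if matches(keywords):
            return name
    return "unsupported"   # e.g. the trickier sentences we can't yet handle

print(classify({"topic": "GDP", "verb": "rose", "time": [2015]}))
# single_year_change
print(classify({"topic": "GDP", "verb": "rising", "time": [2010, 2015]}))
# trend_over_range
```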

We are still working on this ‘bridge’. While our NLP programme understands a simple sentence like ‘GDP rose in 2015’, it runs into trouble with more complicated sentences like ‘GDP has been rising consistently from 2010 to 2015’. Hopefully, this ‘bridge’ will in future be as strong and stable as London Bridge.

My First Week at Full Fact

Full Fact, where I’m currently working, is an independent factchecking charity that exists to “…provide free tools, information and advice so that anyone can check the claims we hear from politicians and the media.” They factcheck claims in a variety of areas, from the NHS to student debt, and cover claims made during Prime Minister’s Questions (PMQs), among others.

While Full Fact factchecks claims in many different areas, they have yet to touch claims/questions regarding the metaphysical realm, such as “What is life?” or “To be, or not to be?” or “I think, therefore I am.” Such questions are best left to the reader to take to Quora.com, put to a Philosophy professor, or ponder over lunch.

Full Fact currently has two factchecking tools: Live – which monitors TV subtitles and other sources and then factchecks (near-instantaneously) claims for which reliable data exist – and Trends – which seeks to determine the sources of inaccurate claims that have been repeated.

I was given the task of improving on the automated factchecking process that Full Fact employs. Automated factchecking can be broken down into 3 essential stages: understanding the claim, obtaining the relevant data and, finally, presenting the required data. The first stage involves Natural Language Processing (NLP), which incorporates, among other things, linguistics. The second stage entails getting facts (and not opinions) from official and impartial sources of information such as the Office for National Statistics (ONS). These facts would then be presented in the last stage, in a simple and unambiguous manner.

I spent the first week reading up a bit on NLP and delving into the second and third stages; my focus was on GDP data from the ONS’s website and the different claims that could be made regarding such data. I wrote several Python functions for the different ways in which we could interpret these claims, and ran the GDP data through these functions/scripts for evaluation. Given a sequence of real data, it was interesting to see how one could present different interpretations of it.
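Two of those interpretation functions might look roughly like the sketch below. This is my own simplified reconstruction rather than the actual code, and the GDP figures are made up purely for illustration.

```python
# Simplified examples of two ways of interpreting a GDP claim.

def rose_in_year(series, year):
    """'GDP rose in <year>': is the value for that year higher than the year before?"""
    return series[year] > series[year - 1]

def rose_consistently(series, start, end):
    """'GDP rose consistently from <start> to <end>': did it increase every year?"""
    return all(series[y] > series[y - 1] for y in range(start + 1, end + 1))

# Made-up GDP figures purely for illustration.
gdp = {2012: 1.74e12, 2013: 1.78e12, 2014: 1.86e12, 2015: 1.92e12}

print(rose_in_year(gdp, 2015))             # True under these made-up figures
print(rose_consistently(gdp, 2012, 2015))  # True under these made-up figures
```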

My biggest takeaway so far isn’t the heavy lunch that I had at the nearby Pret a few days back, but rather the fact that implementing an idea might not always be as easy as it seems. And that Google – not dogs – is man’s best friend.