The FT has been selling its highly sought-after content as a dataset since 2017, and to say it has been a learning curve is a serious understatement.
For over 125 years we have curated information for the most influential people across the globe, but since the advent of natural language processing (NLP), neural networks and quantitative investing, a new audience, “The Machine”, has forced us to think very differently.
The challenges of using NLP in systematic trading
For those who have been using NLP in systematic trading strategies for some time, news data has always posed two major challenges:
Many news providers don’t provide tickers or mapping tools to help connect a given news event to the tradeable asset.
Most news articles are published and republished numerous times as corrections are made, and many news providers archive only the latest version of any article. How, then, can a trader be confident that the data reflects exactly what the market first consumed at the Point Of Publish (POP)?
These problems are significant enough to stop organisations from incorporating even the best-quality journalism into their trading algorithms. More time spent mapping datasets and processing real-time opportunities means lost money, and if confidence in the data is too low, the risk simply cannot justify the return.
Easier integration, greater confidence
Fortunately the FT picked up on this early and started annotating its content with Financial Instrument Global Identifiers (FIGIs), enabling low-latency traders to adjust their prices much faster in response to impactful news events. Shortly after, in March 2018, we decided to store every version of every published article, including amended headlines and even the sentiment changes that result in major republishes.
With history dating back to March 2018, we now have over two years' worth of Point In Time (PIT) data, more than enough for the average backtest required to develop strategies around systematic equities. This data contains not only the full text of every published version of an article, but also a UTC timestamp for each version and the PIT ticker data needed to map organisations to the right assets. The result is much easier integration and a high level of confidence in time series analysis. Comparisons can even be made against snapshot data for the same period, to improve confidence further back in history if the strategy requires it.
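To make the PIT idea concrete, here is a minimal sketch of how a backtest might query versioned article data so that it only ever sees what was published as of a given moment. The record shape, field names and FIGI value below are assumptions for illustration, not the FT's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical record shape: each stored version of an article carries
# a UTC publish timestamp and the FIGI annotations as of that version.
@dataclass
class ArticleVersion:
    article_id: str
    version: int
    published_utc: datetime   # UTC timestamp of this version
    headline: str
    figis: tuple              # PIT FIGI annotations (placeholder values below)

def as_of(versions, article_id, t):
    """Return the latest version of the article visible at backtest time t,
    or None if the article had not yet been published at t."""
    candidates = [v for v in versions
                  if v.article_id == article_id and v.published_utc <= t]
    return max(candidates, key=lambda v: v.published_utc, default=None)

# Example: an article published at 09:00 UTC and corrected at 10:30 UTC.
v1 = ArticleVersion("a1", 1, datetime(2020, 3, 2, 9, 0, tzinfo=timezone.utc),
                    "Initial headline", ("BBG000FAKE01",))  # not a real FIGI
v2 = ArticleVersion("a1", 2, datetime(2020, 3, 2, 10, 30, tzinfo=timezone.utc),
                    "Corrected headline", ("BBG000FAKE01",))
history = [v1, v2]

# A backtest at 09:30 UTC sees only the original version, avoiding
# lookahead bias from the later correction.
seen = as_of(history, "a1", datetime(2020, 3, 2, 9, 30, tzinfo=timezone.utc))
```

The key design point is that the filter is on the version timestamp, not the article: a strategy evaluated at 09:30 sees the uncorrected headline exactly as the market did, which is what distinguishes PIT data from a latest-version-only archive.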
Emerging opportunities to explore
Another exciting evolution is that we are exploring ways to commercialise the anonymised usage data (Demand Data) for each article throughout its lifecycle. This opens up great opportunities for measuring the impact of huge global news brands on asset performance, and the spread of important narratives across different time horizons.
Please get in touch with Stephen Thomas to discuss a free trial of the dataset or to explore opportunities around demand data.