Sunday, 17 April 2016

Proof's in the data scientist pudding

15:25 Posted by The Thalesians (@thalesians)

I am defenceless in the face of those titans of confectionery, chocolate and cake: the sweetness of sugar, the butteriness of butter, the milkiness of milk (I was attempting to choose words to make you, the reader, as hungry as possible). The cause, I presume, is my sweet tooth. I realise that this is a somewhat circular argument, yet it nevertheless helps to absolve me of a certain modicum of responsibility.

Whilst I am more of an expert at consuming sweets, I also occasionally dabble in their creation, with varying levels of success. Usually, I stick to the easier-to-bake items, such as cookies or brownies. Admittedly, I have yet to master the more visual element of baking, a particularly polite way of saying that whatever I bake does not really look that nice. However, the end result of my baking efforts seems at least to be successful from a taste perspective.

Does that mean that I could try my hand at baking macarons, with an automatic guarantee of success? I know the answer is no. The complexity of baking macarons is far greater than that of the humble brownie (from personal experience). Whilst there are common skills in baking, there are often specific skills that need to be honed for specific bakes. In other words, these skills are very domain-specific.

Data science is a fashionable new term for a mixture of several disciplines, including statistics and programming, as well as the ability to display results in an innovative manner, using visualisation tools. Very often data scientists can end up working with unstructured datasets, which take time to clean up and process. Data science is precisely like baking (well, in some ways, just bear with me for a few sentences). A few days ago, I tweeted what I thought a data scientist was, namely someone who is both excellent at statistics and adept at coding. There can be a misconception that a data scientist can simply get by with a bit of stats and the ability to cobble together a bit of Python. I strongly disagree with that notion!

However, in response, one of my Twitter followers (@macroarb) noted that data scientists also need some domain-specific knowledge, a point that I had casually overlooked. Thinking about this a bit more, if anything, domain-specific knowledge is perhaps the most important part of a data scientist's toolkit. After all, it is domain-specific knowledge which enables you to ask the right questions of your dataset. In my case, my domain-specific knowledge is centred on systematic trading.

Hence, before even indulging in the number crunching of a specific dataset, I form a hypothesis of what I am trying to find in it. Of course, sometimes my hypothesis can be totally discounted by some statistical work, which can actually be an important result. It's far better to know that a trading strategy doesn't work than to mistakenly think it is profitable and end up losing money on it. On other occasions, I will be able to find results which confirm my initial hypothesis.

If you have no hypothesis, where do you even begin when analysing data? Of course, you can keep searching through the data, and perhaps you'll eventually find something. However, is that result going to be robust? I suspect not. If you have no domain-specific knowledge, it can be difficult to ask the right questions! Just because I can bake brownies doesn't imply I can make macarons successfully!
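As a rough illustration of why an unguided search is unlikely to be robust, here is a minimal Python sketch. It is a toy example, not taken from any real trading work: the "strategies" are pure random noise with zero expected return, and the function names, the number of strategies, the sample length and the seed are all made-up assumptions. It picks the best-looking strategy on the first half of the data, then checks how that same strategy fares on the second half.

```python
import random
import statistics


def sharpe(returns):
    """Unannualised Sharpe-style ratio: mean return divided by its standard deviation."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def best_of_random_strategies(n_strategies=200, n_days=250, seed=42):
    """Search many noise-only 'strategies' with no hypothesis at all.

    Returns the in-sample ratio of the best-looking strategy and the
    out-of-sample ratio of that same strategy. All parameters are
    hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    best_in, best_out = None, None
    for _ in range(n_strategies):
        # Pure noise: by construction there is nothing real to find here.
        rets = [rng.gauss(0.0, 0.01) for _ in range(2 * n_days)]
        in_sample, out_sample = rets[:n_days], rets[n_days:]
        s_in = sharpe(in_sample)
        if best_in is None or s_in > best_in:
            best_in, best_out = s_in, sharpe(out_sample)
    return best_in, best_out


if __name__ == "__main__":
    best_in, best_out = best_of_random_strategies()
    print(f"Best in-sample ratio:          {best_in:.3f}")
    print(f"Same strategy, out-of-sample:  {best_out:.3f}")
```

Because the search keeps only the luckiest of many noise series, the in-sample figure will tend to look attractive, whilst the out-of-sample figure for the same series has no reason to hold up. A hypothesis formed beforehand cuts down the number of things you test, and so the scope for this sort of luck.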

So next time you try your hand at baking, remember that in some ways you're exactly like a data scientist: the proof's in the data scientist pudding!

Like my writing? Have a look at my book, Trading Thalesians - What the ancient world can teach us about trading today, published by Palgrave Macmillan. You can order the book on Amazon. Drop me a message if you're interested in me writing something for you or creating a systematic trading strategy for you! Please also come to our regular finance talks in London, New York, Budapest, Prague, Frankfurt, Zurich & San Francisco - join our group for more details (Thalesians calendar below).

20 Apr - London - Jacob Bartram - Can option trading strategies enhance CTA/trend following
12 May - New York - Luis Seco - Are Negative Hedge Fund Fees on the Horizon?
13 May - Budapest - Saeed Amen/Paul Bilokon - Thalesians workshop on algo trading at Global Derivatives
20 May - London - Martin Bridson - Knots and what not
