Blog

The Guerrilla Analytics Principles

I designed the principles to help avoid the chaos introduced by the dynamics, complexity and constraints of data projects. You will find the principles helpful if you work in Data Science, Data Mining, Statistical Analysis, Machine Learning or any field that uses these techniques.

The Guerrilla Analytics Principles have been applied successfully to many high profile and high pressure projects in domains including Financial Services, Identity and Access Management, Audit, Fraud, Customer Analytics and Forensics.


There is now a page on guerrilla-analytics.net giving an overview of the 7 Guerrilla Analytics Principles.

You can read more about the Guerrilla Analytics Principles in my book Guerrilla Analytics: A Practical Approach to Working with Data. Here you will find almost 100 practice tips from across the Data Science life cycle showing you how to implement these principles in real-world situations.

Do you have your own data science experiences and principles? Let me know by getting in touch!

‘Similarity’ Approximate String Matching library is now on GitHub

Guerrilla Analytics Challenge

In a Guerrilla Analytics environment, available tooling is often limited. There is either not enough budget, time or IT flexibility to get all the tools you want.

On many jobs, I find myself using Microsoft SQL Server as the project RDBMS. Out of the box, SQL Server does not yet have a fuzzy match capability. You need to install additional tools such as SSIS to avail of fuzzy matching. Even then, SSIS is a GUI-driven application which contradicts a key Guerrilla Analytics Principle. In a Guerrilla Analytics environment, you would much rather have fuzzy match capabilities available in SQL code. This is where the following Similarity library comes in handy.

Introducing Similarity

Similarity is a wrapper around the SimMetrics string matching library created by Sheffield University and funded by an IRC sponsored by EPSRC, grant number GR/N15764/01.

SimMetrics includes approximate string comparison algorithms such as:

  • Levenshtein
  • Jaro
  • Jaro-Winkler
  • Needleman
  • and many more
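To illustrate what these algorithms compute, here is the Levenshtein edit distance with a simple normalisation to a [0, 1] similarity score. This is a standalone sketch in Python, not the SimMetrics implementation:

```python
def levenshtein(s, t):
    """Edit distance: the minimum number of insertions, deletions and
    substitutions needed to turn string s into string t."""
    # prev[j] holds the distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def similarity(s, t):
    """Normalise the edit distance to [0, 1], where 1.0 is an exact match."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))
```

For example, `similarity("smith", "smyth")` is 0.8: one substitution across five characters. The other algorithms (Jaro, Jaro-Winkler, Needleman-Wunsch) differ in how they weight transpositions, common prefixes and gaps, but all reduce a pair of strings to a comparable score like this.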

The Similarity wrapper makes these SimMetrics algorithms available in-line in SQL Server so you can call them from SQL code.

The approach for creating this wrapper was inspired by this blog post. I’ve added to the original code to produce a primitive Windows end-to-end build process that creates a SimMetrics C# DLL library and loads it into a Microsoft SQL Server database.

Go check it out

There is more information, installation instructions and the latest version at GitHub. Your contributions and comments are welcome!

Guerrilla Analytics: Tactics for Coping with Data Science Reality


Here are the slides from a talk I gave today to the Information Technology Department at the National University of Ireland, Galway. Thanks to Michael Madden for the opportunity to speak.

The talk was about how Guerrilla Analytics principles and practice tips help you do Data Science in circumstances that are very dynamic and constrained, yet require traceability of what you do.

There were plenty of questions afterwards which is always encouraging. I’ll try to address these questions in subsequent blog posts so please do follow me @enda_ridge for all the latest posts.

Here are some of the questions from today.

  • what are the key skills to focus on if you want to work in data analytics / data science?
  • is programming ability a pre-requisite for doing data science? This question came up before at Newcastle University.
  • do the guerrilla analytics principles map to research projects?
  • do the guerrilla analytics principles map to ‘big data’ projects?

Since NUI Galway is a bi-lingual university, you can find my broken Gaelic version below!

As Gaeilge

Seo h-iad na sleamhnáin ó léacht a bhí agam inniú sa Roinn Teicneolaíocht Fáisnéise in Ollscoil na h-Éireann, Gaillimh. Buíochas le Michael Madden as an deis labhairt.

Bhain an léacht le cén chaoi is féidir leis na prionsabail agus noda Guerrilla Analytics cabhrú leat agus tú ag déanamh Data Science i ndálaí atá dinimic, srianta ach fós tá sé riachtanach go bhfuil inrianaitheacht ann.

Bhí mórán ceisteanna tar éis an léacht agus is maith an rud é. Freagróidh mé iad i mblag eile agus bígí cinnte mé a leanacht ag @enda_ridge don scéal is déanaí.

Seo h-iad roinnt de na ceisteanna.

  • céard iad na scilleanna is tábhachtaí agus tú ag iarraidh obair mar data scientist?
  • an gá duit bheith in ann ríomhchlárú le h-aghaidh obair mar data scientist?
  • an bhfuil baint ann idir na prionsabail agus tionscadail taighde?
  • an bhfuil baint ann idir na prionsabail agus ‘Big Data’?

Data Science Workflows – A Reality Check

Data Science projects aren’t a nice clean cycle of well defined stages. More often, they are a slog towards delivery with repeated setbacks. Most steps are highly iterative between your Data Science team and IT or your Data Science team and the business. These setbacks are due to disruptions. Recognising this and identifying the cause of these disruptions is the first step in mitigating their impact on your delivery with Guerrilla Analytics.

The Situation

Doing Data Science work in consulting (both internal and external) is complicated. This is for a number of reasons that have nothing to do with machine learning algorithms, statistics and math, or model sophistication. The cause of this complexity is far more mundane.

  • Project requirements change often, especially as data understanding improves.
  • Data is poorly understood, contains flaws you have yet to discover, and IT may struggle to create the required data extracts for you.
  • Your team and the client’s team will have a variety of skills and experience.
  • The technology available, given licensing costs and the client’s IT landscape, may not be ideal.

The discussion of Data Science workflows does not sufficiently represent this reality. Most workflow representations are derived from the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1].

[Figure: the CRISP-DM process diagram]

Others report variations on CRISP-DM such as the blog post referenced below [2].

[Figure: a Data Science workflow overview, from [2]]

It’s all about disruptions

These workflow representations correctly capture the high level stages of Data Science, specifically:

  • defining the problem,
  • acquiring data,
  • preparing it,
  • doing some analysis and
  • reporting results

However, a more realistic representation must acknowledge that at pretty much every stage of Data Science, a variety of setbacks or new knowledge can return you to any of the previous stages. You can think of these setbacks and new knowledge as disruptions. They are disruptions because they necessitate modifying or redoing work instead of progressing directly to your goal of delivery. Here are some examples.

  • After doing some early analyses, a data profiling exercise reveals that some of your data extract has been truncated. It takes you significant time to check that you did not corrupt the file yourself when loading it. Now you have to go all the way back to source and get another data extract.
  • On creating a report, a business user highlights an unusual trend in your numbers. On investigation, you find a small bug in your code that when repaired, changes the contents of your report and requires re-issuing your report.
  • On presenting some updates to a client, you together agree there is no value in the current approach and a different one must be taken. No new data is required but you must now shape the data differently to apply a different kind of algorithm and analysis.

The list goes on. The point here is that Data Science on anything beyond a toy example is going to be a highly iterative process where at every stage, your techniques and approach need to be easily modified and re-run so that your analyses and code are robust to all of those disruptions.

The Guerrilla Analytics Workflow

Here is what I term the Guerrilla Analytics workflow. You can think of it like the game of Snakes and Ladders where any unlucky move sends you back down the board.

[Figure: the Guerrilla Analytics workflow]

The Guerrilla Analytics workflow considers Data Science as the following stages from source data through to delivery. I’ve also added some examples of typical disruptions at each of these stages.

Extract: taking data from a source system, the web or front-end system reports. Example disruptions:

  • incorrect data format extracted
  • truncated data
  • changing requirements mean different data is required

Receive: storing extracted data in the analytics environment and recording appropriate tracking information. Example disruptions:

  • lost data
  • a file system mess of old data, modified data and raw data
  • multiple copies of data files

Load: transferring data from its receipt location into an analytics environment. Example disruptions:

  • truncation of data
  • no clear link between data sources and loaded datasets

Analytics: the data preparation, reshaping, modelling and visualization needed to solve the business problem. Example disruptions:

  • changing requirements
  • incorrect choice of analysis or model
  • dropping or overwriting records and columns so numbers cannot be explained

Work Products and Reporting: the ad-hoc analyses and formal project deliverables. Example disruptions:

  • changing requirements
  • incorrect or damaged data
  • code bugs
  • incorrect or unsuccessful analysis

This is just a sample of the disruptions that I have experienced in my projects. I’m sure you have more to add too and it would be great to hear them.
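One lightweight guard against several of these disruptions, such as truncated loads and raw files with no clear link back to source, is to record simple tracking information at the Receive step. Here is a minimal sketch in Python; the function and field names are illustrative, not prescribed:

```python
import csv
import datetime
import hashlib

def receipt_record(path):
    """Record basic tracking information for a received data file so that
    truncation or silent modification can be detected later."""
    # Hash the raw bytes so any later change to the file is detectable.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    # Count rows (including the header) to compare against the source system.
    with open(path, newline="", encoding="utf-8") as f:
        rows = sum(1 for _ in csv.reader(f))
    return {
        "file": path,
        "sha256": digest.hexdigest(),
        "rows": rows,
        "received": datetime.date.today().isoformat(),
    }
```

Comparing the recorded row count against the count reported by the source system catches truncation at receipt rather than weeks later, and the hash ties every loaded dataset back to an exact raw file.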

Further Reading

You can learn about disruptions and the practice tips for making your Data Science robust to disruptions in my book Guerrilla Analytics: A Practical Approach to Working with Data.

References

[1] Wikipedia https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, Accessed 2015-02-14

[2] Communications of the ACM Blog, http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Building Guerrilla Analytics Teams

I recently had the opportunity to present a webinar on ‘Building Guerrilla Analytics Teams’ as part of the BrightTalk ‘Business Intelligence and Analytics’ series. You can access the full recorded webinar and slides here; the slides are also embedded below.

Some really interesting questions came up at the end of the session. I’ve listed them here and will pick them up in subsequent blog posts.

  • How do you build a business case to resource and set up a data science team?
  • What is the number one tip for someone putting together a completely new data science team?
  • What role is most important when setting up a data science team?
  • What are the typical challenges faced when setting up a Guerrilla Analytics team?

You can learn more about building a Guerrilla Analytics capability in my book Guerrilla Analytics: A Practical Approach to Working with Data which has chapters devoted to getting the right people in place, giving them the right technology and controlling everything with a minimal lightweight process.

Introduction to Guerrilla Analytics at Newcastle University

I was recently invited to give a talk introducing Guerrilla Analytics and the principles described in the book. The talk covers some examples of how these principles are applied. It concludes by identifying some key research and development areas for doing this type of analytics in real-world projects.

 
This was a great opportunity to engage with a cross-disciplinary audience including computer scientists, computational biologists and engineers, and to have a sounding board for some of the key research and development areas I think need to be addressed to enable practical data science work.
A key take-away for me was the gap between the advanced data science being studied in academia and the simple, practical methodologies that are still needed to put that research into practice.

3 Lessons I Learned From Writing a Data Science Book – ‘Guerrilla Analytics – a practical approach to working with data’

One of the biggest challenges with writing a significant piece like a book chapter or entire book is to estimate how long it will take and plan accordingly. My best reference was my PhD which was still significantly shorter than the book’s target 90,000 words. This blog post is about the book writing process as I experienced it. I hope it helps other authors setting out on such an endeavour.

Since ‘Guerrilla Analytics: A Practical Approach to Working with Data’ is about operational aspects of agile data science, I recorded some data on the book writing process itself. Specifically, every time I finished a writing session, I recorded the number of words I’d written on that date.

My 3 Lessons

  • Progress tapers off. You’ll get more work done in the first half of your project. Don’t expect this rate of progress to be sustained all the way to your deadline.
  • Be realistic about how much you can write in a session. I found it difficult to write more than 1,500 words. Anything more was the exception for me. Track your progress and re-plan accordingly.
  • Weekends are better than weekdays. Obvious maybe! Expect to set aside your free time on weekends to get your project over the line. It is difficult to get significant amounts of work done on weekdays.

Progress tapers off

Here is my progress towards my goal of 90,000 words over an 8 month period. The plot shows the words written per session and the total word count.

[Figure: writing log progress]

I began writing in late September and finished in June the following year. The line shows my total words written and the bars show the number of words written in individual writing sessions. Two things stand out:

  • Progress is faster in the first half of the project. This was because it is easier to get all your ideas ‘onto paper’ early in the writing. Once you have about three quarters of your manuscript complete, you need to be more careful about consistency of language and flow of content. This slows you down.
  • Time off work is really productive. There are two clear bursts of productivity, shown by the dense groups of grey bars where a large number of words was written in many successive sessions. The two periods are Halloween (when I took a week off work) and Christmas (when I worked for a week from my family home).

How much did I write in a typical session?

Here’s how much I wrote in each writing session.

[Figure: words per session]

I typically wrote about 1,000 words with the odd session where I wrote over 3,000 words. This is important when you plan your project. If you’re anything like me, writing more than 1,000 words will be an exception. If you only write on weekends then you’re looking at only 2,000 words per week. That’s well under 100,000 words in a year allowing for holidays and other disruptions.
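That arithmetic is worth making explicit when you plan your own project. A quick back-of-the-envelope check in Python, with illustrative numbers:

```python
# Rough planning check: writing at weekends only, how far do you get in a year?
words_per_session = 1_000   # a typical session's output
sessions_per_week = 2       # weekend-only writing
weeks_off = 6               # holidays and other disruptions

words_per_year = words_per_session * sessions_per_week * (52 - weeks_off)
weeks_needed = 90_000 // (words_per_session * sessions_per_week)

print(words_per_year)  # → 92000, little margin over a 90,000-word target
print(weeks_needed)    # → 45 productive weeks, i.e. most of a year
```

Plug in your own tracked session rate rather than a hoped-for one; the gap between the two is usually what wrecks a writing schedule.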

Are you thinking about writing something and have questions? Feel free to get in touch and best of luck!

Big Data Debate: The Controversial Questions at Google campus

I was recently invited to take part on the panel at the Big Data Debate (@bigdatadebate) at Google’s campus near Old Street in London [1].

It was a great opportunity to meet like-minded folks such as Christian Prokopp (@prokopp, Rangespan), Paul Bradshaw (@paulbradshaw), Duncan Ross (@duncan3ross, Teradata), Daniel Hulme (Satalia), Michael Cutler (@cotdp, TUMRA), Andy Piper (@andypiper, Pivotal) and Will Scott Moncrieff (DueDil). Overall it was an interesting debate, with strong contributions from the panel and the packed house.

We spent perhaps half of the panel hour and most of the audience questions on data privacy. I guess this is revealing in itself, if such concerns are at the forefront of the public’s mind as opposed to the opportunities presented by data analytics.

Christian did put one controversial question to me. Paraphrasing, it was about the dangers that arise when we can mine vast quantities of data looking for patterns. My answer, as it has been since my PhD days, is that this is simply poor methodology *whatever* the volume of data you are analysing. A data science methodology should allow us to answer questions (test hypotheses) about a problem (as described by data) while reducing bias as far as possible. Think about that. If you go trawling for an effect that you expect to exist in data, you will eventually find it. Instead, your approach should be:

  • understand the problem (talk to the business, formulate a scientific theory)
  • turn the problem into hypotheses (our campaign increased sales, a fraudulent user has a log pattern that is different from his peers, etc.)
  • decide what effect size is practically significant
  • apply an appropriate statistical test with the correct sample size and power, and check the test’s assumptions
  • when you don’t find what you were looking for, don’t keep changing your effect sizes and revisiting the data! That’s cherry picking, or confirmation bias.
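The steps above can be sketched end to end with a simple permutation test. This is a minimal Python illustration with made-up numbers, standing in for whichever statistical test is appropriate to your problem:

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=10_000, seed=1):
    """Two-sided permutation test for a difference in group means.
    Returns the proportion of random relabellings whose mean difference
    is at least as extreme as the one observed."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return hits / n_perm

# State the hypothesis and the practically significant effect *first*,
# e.g. "the campaign lifted average weekly sales by at least 5 units".
before = [100, 98, 103, 97, 101, 99, 102, 98]
after_campaign = [106, 104, 109, 103, 107, 105, 108, 104]
p_value = permutation_test(before, after_campaign)
```

The crucial part is not in the code: the hypothesis, the effect size and the sample are fixed before looking at the result, and they stay fixed whatever the p-value turns out to be.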

So I suppose the answer to Christian’s question is a ‘yes’ but it has nothing to do with ‘Big Data’. Big Data is dangerous because new tools and hype can lead to folks forgetting that garbage in results in garbage out. You have to understand the data and the rigorous analysis you are applying – just like any scientist.

[1] I am employed by KPMG, one of the event sponsors

Guerrilla Analytics – the book! Book contract signed for Autumn 2014

Great news! I will be publishing a book on Guerrilla Analytics with Morgan Kaufmann in Autumn 2014. After lots of proposal crafting and contract negotiations the contracts have finally been signed and I can begin work. It will be about 90,000 words on Guerrilla Analytics covering topics such as:

  • what is data analytics and where does guerrilla analytics fit within that?
  • the principles of guerrilla analytics
  • worked examples at each stage of the data analytics workflow, from data extraction and receipt through to delivery of work products. All of these examples will be supported by practice tips, case studies and war stories. This will be a real practitioner’s book that will help you survive real analytics projects in fast-paced, dynamic environments.

You’ll find this book useful if you are:

  • a Senior Manager who wants to know that you have the right team and technology in place to deliver reproducible, tested analytics that stand up to audit and scrutiny and can be handed over easily when resources roll off your project
  • an analytics Manager with several direct reports, who wants the team to be independent and agile without micro-managing their work, and who wants to keep things simple so that everybody on the team can maintain data provenance and understand one another’s work without repeated, inefficient hand-overs and explanations
  • a data analyst who wants to do high quality work and interact in a team without being burdened with unnecessary process and team rules.

I’m looking forward to getting started! Stay tuned for more updates and some snippets of the book as it evolves.

Guerrilla Analytics talk at Enterprise Data World, San Diego 2013

@edwardacurry and I did a talk at Enterprise Data World 2013 in sunny San Diego. The slides are below. In this longer talk we were able to take the audience through some worked examples to illustrate how guerrilla analytics is applied in practice. Feedback was positive. There was plenty of empathy from audience members with teams that struggling with the challenges that Guerrilla Analytics addresses.