Nowadays everyone seems to be heading to the hills to try and cash in on the new gold rush. Data! You have all heard that data is the new gold. Some call it the new oil, as that is their favorite fantasy, that’s all good. But there are drawbacks to this.
The cost of data and data waste
Like any resource you mine it does come with associated costs. A cost that has to be covered by the value you derive from it. That value has to exceed the money spend to gather, process and consume it. That can be an expensive business.
On top of that you have to deal with “the waste” it creates as a byproduct. Waste can be toxic. Mining data tends to produce nuclear data waste, the bad kind where “safe levels” are hard to determine. In the rush to grab the gold many forget a couple of important lessons from history. We should know by now we need to act proactively to avoid waste. That is the cheapest option in the long run and mitigates many of the risks. We should also know that not all data is gold, some of it just glitters, but it isn’t valuable. Fool’s data, like fool’s gold, is essentially worthless no matter how much money you have spent. Even worse, it’s still produces nuclear data waste asset than can get you into (legal) trouble.
Data storages and backups
How much data to you need to get to the gold and at what cost. Storage capabilities as well as storage capacity grows fast and cost seems under control for now. But will this last forever? And even if so, what’s the ratio of data gold versus raw data stored? Can we improve this ratio? Because even when things are cheap, why even do it if it is not needed?
Protecting the data and the waste
And then there is the cost with protecting that data as well as the governance around it. The sad reality with data is that once you have it the probability that it will get you into trouble is real. Data, sooner or latter will get lost, misplaced, sold, hacked, leaked, … it’s almost guaranteed. Ask any real InfoSec professional (not the standard issue, policy quoting security officers, those are just windows dressing) and they will open your eyes to the reality of the risks. It’s very sobering.
The gold rush
As with any hype or gold rush we can avoid costly mistakes buy looking at history. Think about who benefited and who lost out. Think about why this happened and how. Can you see any parallels?
- Many people are drawn to the data gold fields. Very few strike a gold vein.
- There is a lot of money to be made selling the tools, supplies and gear to mine the data, process it, store and protect it.
- Gathering raw data and processing it can be highly toxic
- Storing and protecting the gold is expensive and hard.
Let’s dive a bit deeper into these issues, what these mean and how they materialize. They all have one thing in common for sure and that is that the fear of missing out is one of the driving factors.
The reality is that many people that now become “data scientists” are not all highly skilled mathematicians and experts at statistics analysis. It’s a new hype, just like OLAP tools and data mining where before. We now have BI, big data and data science. That’s where the gold can be found so that’s where gold diggers flock to. Some have the skills, abilities and the luck to derive wealth from that. Most will just have the job digging.
There is a lot more data than there is science in the hype created around data scientists. Data scientists should be great at math and statistics. Those are not very fields of human endeavor that do not scale well. They are not even popular. Attaching “scientist” to something doesn’t make it a science. Be sure of the quality of your gold and make sure it is not fool’s gold. But the gold rush is on. There’s money to be made. In an era where science is viewed by many as “an opinion” the urge to derive some credibility from adding “science” to any endeavor is a paradox. It is on the rise. Clearly, this shows the value of real science even when some only seem to like it when it suits their agenda.
But the field is exploding as companies want people working on all the raw data they collect. As one statistician stated: “I used to be a boring, underpaid geek with glasses, now that I’m a data scientist I’m cool, in demand and paid very well even if the work is less scientific.” That’s the nature of the beast. Her employer got a real statistician, but as the mines require a lot more bodies, many will make due with less.
The “sure” money is in the supply chain. Storage, networking, compute … no matter where it is (cloud, fog or on-premises computing). There is money to be made with tools to process the data, protect it (backups) and secure it against unwanted prying eyes and theft. If you’re selling any of those business is a booming.
Everyone seems obsessed with collecting data. Luckily storage costs are down per GB and we can store ever more. We also need to protect more. But whoever deletes data? Who dares push the button? A lot of data is collected “just in case”. We might find gold in there later and if we don’t have it we cannot for sure. The fear of missing out in action. That is great if you’re selling stuff. Data lakes, data ponds, storage blobs or tables, Mongo DB or SQL PaaS, storage arrays, data processing technology and data protection. These can be products or services, it doesn’t matter, there is money to be made. And while you’re selling you’re not asking the buyers if the really need it. You don’t question them, you praise their insights and help them protect their investment. Everyone is doing it, so must you. The copy/paste strategy in action.
Nuclear data waste
While the vast growth in data is spectacular. A lot of it is crap. But there is very little effort put into being selective. It’s too cheap right now to collect and store it. No one want to say “we don’t need it” and be the one to blame if you don’t have it.
But in the age of data leaks, hackers, privacy concerns and ever more legislation around data protection it’s worth making sure you don’t store data just because you can. Storing data holds inherent risks. Risk of losing it, corrupting it, deriving faulty information from it, leaking it, have it stolen or abused. It
In the age of GDPR and many other rightful privacy and data protection concerns collecting data should be treated like nuclear power. The value it brings is undeniable. But you don’t need vast amounts of nuclear fuel to deliver that value. You do need very good processes, fail safes, regulation, capable people and technology.
We should start looking at data as nuclear fuel and as such, after use and processing, part of it is left as toxic nuclear data waste. It’s a long-term toxic by product of the process of deriving information form data. Minimize the collection, storage and of data to achieve your goals at minimum cost and risk. Luckily, we have a very good solution for toxic data waste. You can delete it and wipe it securely.
We have to stop thinking that more is better when much of it is junk. The overhead of caring for that junk is ridiculous. We might have to do so for nuclear waste out of need. But for data there are alternatives. Destroy it if you don’t need it. It’s the safest way to handle the legal and reputation risks related to it. That will take a conscious effort.
Efforts and costs
Critical thinking about collecting data is lacking. That is understandable. There is a lot of the money to be made in data mining is in providing the tools to collect, process, store and protect the data. Even with many people that warn us of the security issues and legal responsibilities around it is often about selling services and products. For many all this might turn out to be a lot like the other gold rushes. There were a lot more suppliers of the tools that got rich than actual finders of profitable gold mines. This means there is also a lot of pressure and incentive to feed the “data is the new gold” beast.
Where a SQL database or a data warehouse at least meant you had to put effort into collecting the data, the rise of unstructured data technologies means way too often we don’t care and we’ll figure it out later. Imagine doing that to nuclear fuel! For now, the technical advances in storage and data technologies has allowed us to act without too much deliberation on the sanity of our choices. That might change, it might be wise to avoid the cold shower when it does and benefit from minimizing toxic data risks today.
Now true data gold is very valuable, but make sure you can recognize it. Just going through the motions and buying the tools, copying “in the know” statements from the internet isn’t going to cut it. That is called pretending. Sure, it’s fun. It is also a very dangerous and costly mistake when things get real. At best you look like an idiot with money. Many sales people will separate you from your money very efficiently.
The smarter organizations already have a data strategy that includes waste avoidance, reduction and management. Many don’t unfortunately. Collecting data for those is the main goal, driven by the tyranny of action over strategies. You have to be seen acting and being in charge. The buzz words have to be present and you have to come across as a “can do sir, yes sir” person. Well that is what will kill you. The late Norman Schwarzkopf knew this all too well.
Take care of your weaknesses, figure them out before they hurt you and before they destroy your ability to exploit your strengths. That people is a strategy exercise. I can do that for you and it will cost you a lot of money. But remember, strategies are not products you can buy, they are not commodities and as such buying them is a paradox. A strategy is what will give you the edge over your competitors.If you have others determine your strategy, your competitors will pay them more to find out . So, roll up your sleeves and put in the effort yourself. In the end, it’s all about common sense and this is true for data-mining, AI and BI as well.