I am frequently asked by young graduates, my mentees and business people about the best way to get one’s hands dirty with Data Science.
My answer varies, but a few things stay consistent in my advice, and I will try to explain them in this article.
Most times, I also find that the discussion veers off into machine learning, statistics, data analysis and artificial intelligence.
If I had my way, I would give all these technologies one name, considering how intertwined they are.
It is important to have these discussions, but if not handled well, the information becomes overwhelming and confusing.
For this article, I will confine myself mostly to Data Science and Machine Learning.
I will also take a broad look at the concepts, which I will revisit and explain in more detail in subsequent write-ups.
What, then, is Data Science? Why is it important? In the simplest terms, Data Science is the study of data to glean insight from it, and the use of that insight for many things, among them detecting patterns, making decisions and identifying trends, business opportunities or threats.
For example, a business opportunity could be identifying products to cross-sell while a user still has items in an online shopping cart, and a business threat could be identifying would-be loan defaulters or customers about to switch service providers.
Data science uses a combination of technologies, such as statistics, machine learning, programming and visualization, to attain its purposes.
One way in which Data Science projects get formalised is to follow what is known as the Data Science lifecycle, sometimes called the Data Analytics Lifecycle.
This process has many stages, which can themselves be stand-alone Data Science disciplines.
As stated, I will endeavour in future articles to delve deeper into the various stages of this lifecycle.
Data Science is important because, with the proliferation of data-generating agents and the very diverse formats that this data takes, necessary and appropriate sciences must be developed to enable any such data to be consumed in a useful and efficient way. Many of these data formats would not have been analysable, say, a decade ago.
New techniques inherent in the Data Science ecosystem can aid in the ingestion and analysis of such data formats.
For example, most website data is not structured in the traditional way, organised in rows and columns, but comes as HTML with varying structures, in other words, with no predefined schema.
It can be scraped for insight, such as stock market sentiment, using tools available in the Python programming language.
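To make the idea concrete, here is a minimal sketch of pulling text out of HTML and giving it a crude sentiment score. It uses only Python's standard library. The page fragment, the choice of `<h2>` tags for headlines and the two word lists are all assumptions made up for illustration; a real project would fetch live pages and use a proper sentiment model.

```python
from html.parser import HTMLParser

# Hypothetical word lists; a real project would use a proper sentiment lexicon.
POSITIVE = {"rally", "gain", "surge", "record"}
NEGATIVE = {"slump", "loss", "fall", "fear"}


class HeadlineParser(HTMLParser):
    """Collects the text found inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False

    def handle_data(self, data):
        # Text nested in inline tags (such as <em>) is still part of the headline.
        if self._in_headline:
            self.headlines[-1] += data


def sentiment_score(text):
    """Positive minus negative word count: a crude stand-in for sentiment."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Made-up page fragment: no rows and columns, just nested tags.
SAMPLE_HTML = """
<div class="news">
  <h2>Tech shares <em>surge</em> to record high</h2>
  <p>Some body text that we ignore.</p>
  <h2>Retail stocks slump on loss warning</h2>
</div>
"""

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
scores = {headline: sentiment_score(headline) for headline in parser.headlines}
for headline, score in scores.items():
    print(score, headline)
```

A headline with more positive than negative words gets a score above zero, and the reverse gives a score below zero. Counting words is far too simple for real trading decisions, but it shows how unstructured HTML can be turned into numbers.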
Turning to Machine Learning, what is it and why is it important? In its simplest form, machine learning is a subset of Data Science. It is a discipline that trains machines on a set of data such that, when they are fed another set of data that they have not seen, they are able to make informed decisions or carry out tasks based on the training they received.
In other words, these are systems that learn from experience. Problems that can be solved by machine learning include, among other things, prediction, classification, recommendation and pattern recognition.
In future articles, I will discuss supervised, unsupervised and reinforcement machine learning, the individual problems that machine learning seeks to solve, and the many machine learning algorithms available.
The motivations for wanting to learn Data Science also vary, but over the years I have come to group them into four major categories, namely: career progression, smart start-up solutions, research, and business opportunities and threats.
It is clear from various job websites that a career in Data Science has great prospects of a higher salary than most other data-driven jobs, in many jurisdictions around the globe.
The demand for the skills of Data Scientists seems to outstrip the available supply.
The buzzword at many technology conferences is “Start-up”, and synonymous with this term are computer-based smart solutions, be they apps or solutions embedded in electronic gadgets.
Another common motivation is scholarly research. Most of the students who pursue such projects for their degrees end up working in that field after graduation.
Most businesses will seek intelligent solutions to identify opportunities and threats. They usually hire consultants to build models that can identify new opportunities in the marketplace, as well as identify and advise on possible solutions to thwart possible threats.
How does a would-be Data Scientist get up to speed in attaining their ambitions, or shorten the learning curve? I am going to identify the low-hanging fruit in this learning curve, which can serve as a prerequisite for the more “exotic” or advanced aspects of the field.
Below is a list of things that I believe are low-hanging fruit in the Data Science learning curve.
Comprehension of statistics is important for several roles that can be identified in the whole Data Science lifecycle, or ecosystem.
For example, one critical early preparatory stage in the ecosystem is to visualise the data and draw some basic statistics, such as measures of central tendency, dispersion and correlation.
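As a small illustration of these basic statistics, the sketch below computes a mean and median (central tendency), a standard deviation and range (dispersion) and a Pearson correlation, using only Python's standard library. The two lists of numbers are made up for the example, and the `pearson` helper is my own illustrative function rather than part of any library.

```python
import math
import statistics

# Made-up sample (an assumption for illustration): monthly advertising
# spend and the sales recorded in the same month, in arbitrary units.
ad_spend = [2, 4, 6, 8, 10]
sales = [50, 60, 65, 80, 95]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


summary = {
    "mean": statistics.mean(sales),          # central tendency
    "median": statistics.median(sales),      # central tendency
    "stdev": statistics.stdev(sales),        # dispersion (sample std. deviation)
    "range": max(sales) - min(sales),        # dispersion
    "correlation": pearson(ad_spend, sales), # relationship between the two lists
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A correlation close to 1 suggests the two quantities rise together, although on its own it says nothing about which one causes the other.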
The results of most Data Science or machine learning experiments will need statistical interpretation of some kind. For example, what does a confusion matrix say about the number of false positives or false negatives, are the results statistically significant, and is the null hypothesis rejected?
The accuracy of Data Science models is also measured with values that need statistical interpretation to make sense.
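To show what such an interpretation looks like, here is a minimal sketch that builds a confusion matrix by hand and derives accuracy, precision and recall from it. The labels are invented for the example, with 1 standing for a hypothetical loan defaulter and 0 for a customer who repaid.

```python
# Made-up labels (an assumption for illustration): 1 = defaulted, 0 = repaid.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]


def confusion_matrix(y_true, y_pred):
    """Count the four possible outcomes of a binary prediction."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            counts["tp"] += 1   # defaulter correctly flagged
        elif truth == 0 and guess == 0:
            counts["tn"] += 1   # good customer correctly cleared
        elif truth == 0 and guess == 1:
            counts["fp"] += 1   # false positive: good customer flagged
        else:
            counts["fn"] += 1   # false negative: defaulter missed
    return counts


cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy alone can mislead when one class is rare, which is why the false positives and false negatives are worth reading separately.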
Expertise in a subject matter is not, in itself, a prerequisite, but it helps to navigate the requirements of the Data Science experiment at hand.
As an example, if one is designing a Data Science project that scrapes the internet for stock market sentiment, it would help if that Data Scientist has an understanding of things such as opening and closing prices, the lows and highs of stock prices, and other aspects of stock market technical or fundamental analysis.
I have been able to create a poultry data set, which I have since published on Kaggle.com, based partly on my knowledge of poultry.
As in any other science, the need for appropriate apparatus cannot be overemphasised. The cost of the tools of the trade is often cited as an entry barrier.
This can be mitigated by opting for open-source tools, such as the Python or R programming languages, which are capable of delivering end-to-end Data Science projects.
These are open-source languages with a lot of frameworks that aid in, among other things, machine learning, web development, game development and data visualisation.
A word of caution on open source: these tools are released under licences that allow would-be users to use them only under certain conditions.
Microsoft has come to the party with Azure Machine Learning Studio, with no payment required for the basic configuration of its trial versions, for as long as the terms and conditions are observed.
Azure Machine Learning Studio is a browser-based, user-friendly, intuitive tool for interactively building and deploying machine learning solutions.
With proper research, an appropriate ensemble of open-source tools can be assembled so that start-ups, or even established businesses, can do most work that requires software technology, such as end-to-end product development, at almost no cost at all.
Knowledge of machine learning is a must-have for one to qualify as a Data Scientist.
A Data Scientist should be able to match a problem to the appropriate technology or machine learning algorithm.
For example, if the task at hand is to predict a binary outcome, then a natural first choice of algorithm would be logistic regression.
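To give a feel for what logistic regression does, here is a minimal sketch written from scratch in plain Python, trained with batch gradient descent. The data, the single hypothetical feature (a debt-to-income score) and the learning rate and epoch count are assumptions chosen for illustration. In practice one would normally reach for an established library rather than hand-rolled code like this.

```python
import math

# Made-up training data (an assumption for illustration only):
# x is a hypothetical debt-to-income score, y is 1 if the customer defaulted.
X = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
Y = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and true label per sample.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def predict_probability(x, w, b):
    """Estimated probability that the outcome is 1 for feature value x."""
    return sigmoid(w * x + b)


def classify(x, w, b, threshold=0.5):
    """Turn the probability into a binary decision."""
    return 1 if predict_probability(x, w, b) >= threshold else 0


w, b = train(X, Y)
for score in (1.0, 4.0):
    print(score, round(predict_probability(score, w, b), 3), classify(score, w, b))
```

The model outputs a probability between 0 and 1, and the threshold turns that probability into one of the two choices. Moving the threshold trades false positives against false negatives.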
Data visualisation is important in the Data Science lifecycle because the results of most Data Science experiments must be presented in ways that convey a meaningful message to the intended audience.
Some common commercial visualisation tools are QlikView, Power BI, MicroStrategy, Pentaho and Tableau. Apart from visualisation, most of these tools also offer other capabilities, such as Extract, Transform and Load (ETL).
Since cost has been identified as an entry barrier, it is worth noting that most of these vendors provide trial versions of their software for a limited period, after which they expect the user to purchase a licence.
Other vendors provide an open-ended trial version for registered students, with limited functionality and for as long as the terms and conditions are observed.
For example, a solution arising from a trial version cannot be sold or shared. Some open-source tools also have visualisation capabilities. For example, Python has several frameworks that aid in visualisation.
These five steps are not in any way prescriptive, but mastering them will surely “short-circuit” the long journey into Data Science and its greater ecosystem.
Another question that gets asked is how these skills can be obtained in a way that is easy and cost-effective, and in a reasonable time.
The list below gives some of the resources that I have found helpful in my own Data Science journey.
I find the internet to be the best “low-fee” university of our times. The quality of the material varies.
There is a lot of material online, which can be overwhelming for someone searching for the right material, but with time one gets to filter out the noise.
Several reputable universities provide free courses through Massive Open Online Courses (MOOCs).
These courses are mainly offered online by the very tutors who teach the full-time, fee-paying students.
I have found some offered by Harvard and MIT particularly useful.
Udemy.com provides a myriad of very good courses at reasonable prices.
Most providers of these courses are also leading faculty members at leading universities, or people who have created their own successful businesses.
Kaggle.com provides very rich material, particularly for those who prefer Python as the tool of the trade.
You can also find good material on YouTube if you refine your search well enough. I learned Python web scraping on YouTube, from a video presented by a YouTuber from India.
The video I used was presented in a mixture of Hindi and English. Despite not understanding a single word of Hindi, I was able to follow the lecture.
Here in South Africa, I find the “Enterprise Workplace Skills Plan (WSP)” offering at the University of Pretoria to have particularly good part-time courses relating to Data Science.
One such course that I once attended is a six-day course titled “Applied Machine Learning”.
Having argued that cost is a barrier, I would also like to state that a certain investment must be made if the desire to become a Data Scientist is to be realised.
Considering that this is an investment in one’s future earnings, some expense must be incurred. There are many resources that can be purchased from many sources.
I have bought considerably useful Kindle books from Amazon, for as little as $2, that have served me well.
As Jim Rohn once said on investment in education, “Formal education will make you a living; self-education will make you a fortune.”
In conclusion, I hope that I have shed some light on the ways one can become a Data Scientist.
I am of the considered view that these steps are the low-hanging fruit that would be easy to pluck on one’s journey to becoming a Data Scientist.
In future write-ups on this subject, I will go into greater detail on selected aspects of the field.
About The Author
Phuzo Soko is a senior Business Intelligence Manager / Data Scientist at an insurance company in Johannesburg, South Africa. He is interested in Business Intelligence, Machine Learning, Data Science and Artificial Intelligence. He holds a BTech in Software Development from the Tshwane University of Technology, a Certificate in Cyber Security from the University of Johannesburg and a Certificate in Finance and Investment from the University of the Witwatersrand.
Featured Image: imarticus.org