Data Science Jedi: mapreduce


Friday, 17 June 2016

How to Become a Data Scientist (Part 2/3)

Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.

In this chapter, we will explore the options at your disposal – but first – we will begin proceedings by discussing an important notion that concerns data science and learning.

Continual Learning

Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) is evolving so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).

Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Conferences and networking with the data science community are effective ways of keeping on top of the latest developments. And regularly reading books and papers is very important (on this: if you do not have a research background, it is worth learning how to read academic papers properly).

Play. Build. Experiment.

Going back to the message we touched on in Chapter One, there is only one way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the points I wanted to make, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.

This is what you should take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:

“Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn”

The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.

Kaggle / Open-Source / Freelancing

If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:

“If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD”

There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.

Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.

Finally, if you can get freelancing work, this is a great way to build a track record and demonstrate that you can operate in a commercial environment. And rather conveniently, you could even utilize the Experfy platform for that purpose.

To PhD or not to PhD

Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted:

“The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again”

And apart from anything, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.

But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:

“If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?”

Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.

There is also the employability aspect to consider: are you more employable as a PhD graduate than after spending the same time on self-study? I do not have sufficient evidence to comment, but either way, what matters more is whether you have truly spent the time building up expert capability (and how you can evidence this). PhDs are certainly valuable, but there are great data scientists with PhDs and great ones without.

Other University Degrees

So a PhD is not for you – perhaps it is the cost, or perhaps you have not yet developed the expertise necessary for research of this nature. Whatever the reason – there is no need to panic – because many universities are now offering Bachelors, Masters and Diplomas specifically designed for data science, where both computer science and mathematics/statistics are on the curriculum (the attentive reader will remember this from Chapter Two).

Courses like these will certainly take you in the right direction, but take note: they won't be enough to convert you into a ready-made data scientist, because as we know – that takes experience.

Online Courses

In a similar sense – even if you come from another quantitative field – a few online courses will not make you an expert, and remember: this is an expert field. But even if an online course were enough to master a chosen subject, you would still face competition from people who, in all likelihood, have far more practical and commercial experience in these areas. This is really important to be conscious of, and so we will return to this in Chapter Four.

All this being said, online courses are incredibly useful tools to help kick-start your journey, or begin learning a new area (like deep learning, for example). The most popular courses are found via Coursera, Udacity and edX, with Dylan Hogg describing Andrew Ng’s Machine Learning on Coursera as “an absolute pre-requisite for anyone who does not have a research background”.

The following is by no means a complete list, or mandatory for that matter, but these also stood out to Dylan and some of the other data scientists we have met so far:

Machine Learning: Intro to Machine Learning (Udacity)
Deep Learning: Deep Learning (Udacity), Neural Networks for Machine Learning (Coursera)
Spark: Big Data Analysis with Spark (edX), Distributed Machine Learning with Spark (edX)

During my interactions with the Experfy team, I've found out that they were also launching a training platform. You can see a preview here.

Books

Needless to say: good books are an invaluable resource and our favourite data scientists advocated the following:

Pattern Recognition and Machine Learning by Christopher Bishop
Machine Learning: a Probabilistic Perspective by Kevin P. Murphy
Why: A Guide to Finding and Using Causes by Samantha Kleinberg (if you want to know why this is important, take a look at Yanir Seroussi’s blog post on: Why You Should Stop Worrying About Deep Learning and Deepen Your Understanding of Causality Instead)
An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani, which, according to Dylan: “is a great introduction to statistical learning and is an accessible version of the more advanced classic”: Elements of Statistical Learning
And for a different suggestion, Will Hanninger recommended The Pyramid Principle by Barbara Minto. It does not cover data science specifically, but is valuable for problem solving and presenting

Presenting / Communicating

If you do not have a natural disposition for communicating – especially with non-technical people – this is something you will need to work on (see Chapter One for why). Gaining practice and obtaining feedback is the best way to improve your soft skills, although Yanir also recommended the classic book by Dale Carnegie: How to Win Friends and Influence People.

Wednesday, 15 June 2016

The Professionalization of Data Science

There has been much discussion and debate about the definition of data science and the new rare breed of sexy bird called the data scientist. The Data Science Association defines "Data Science" as the scientific study of the creation, validation and transformation of data to create meaning; and the "Data Scientist" as a professional who uses scientific methods to liberate and create meaning from raw data.

While these definitions may appear overbroad, think about the definitions of a lawyer or physician. A lawyer is a legal professional who can help prevent or solve legal issues and a physician is a health professional who can help prevent or cure health issues. Like the professionalization of law and medicine in the past hundred years, data science is at the very beginning of becoming a profession - with competency standards and a Data Science Code of Professional Conduct.

This means that data science will evolve into a profession where data scientists specialize in different areas - like lawyers and physicians. When you need to hire a lawyer you usually consider the special area of law that a lawyer practices. If you have a tax problem you hire a tax lawyer, not a divorce lawyer. If you have a heart problem you do not hire a gynecologist.

The simple truth is that data science is a vast and complicated field and - like law and medicine - much too big and complex for a person to master in one lifetime. My colleague Gary Mazzaferro has been exploring the concepts and ideas surrounding data science and definitions as formalizations aligning with knowledge economies and the knowledge / science / technology maturity models. Gary has (to date) defined the following data science specializations and types of data scientists:

Data Science: A field of systematic interdisciplinary study to elucidate relationships across and within phenomena of the Formal, Social, Natural and Special Sciences through the application of scientific methods. Interdisciplinary areas include analytical processes, mathematics, probability and statistics, logic, modeling, machine learning, algorithms, communications, traditional sciences, business, public policy and philosophy.

Blue Sky Data Science: A purely curiosity-driven, exploratory branch of Data Science oriented towards developing and establishing an understanding of relationships across and within phenomena, with no focus on specific goals or immediate application.

Basic Data Science: A branch of Data Science research focused on clearly defined goals and oriented towards developing and establishing an understanding of relationships across and within phenomena.

Applied Data Science: A branch of Data Science oriented toward the development of practical applications, technologies and other interventions, including engineering practices. Applied Data Science bridges the gap between Basic Data Science and the engineering domains to provide predictable, usable tools to industries, including standard methods and practices.

Data Science Practice: The regular performance of Applied Data Science activities and methods for private and public organizations. May practice externally or internally. Practice may necessitate additional disciplines based on the needs of the organization including domain expertise and communications supporting presentation and reporting activities.

Data Scientist: A person that studies or has expert knowledge of the interdisciplinary field of Data Science.

Blue Sky Data Scientist: A person that studies or researches in the branch of Blue Sky Data Science.

Basic Data Scientist: A person that studies, researches or has expert knowledge in the branch of Basic Data Science.

Applied Data Scientist: A person that studies or researches in the branch of Applied Data Science.

Note that this is a preliminary list and is not complete. The profession of data science will evolve to create many specializations. After all, it took law and medicine over one hundred years to evolve as professions with different specialties.

Monday, 6 June 2016

10 emerging technologies for Big Data

Thoran Rodrigues interviewed Dr. Satwant Kaur about the 10 emerging technologies that will drive Big Data forward.

I've recently had the opportunity to have a conversation with Dr. Satwant Kaur on the topic of Big Data (see my previous interview with Dr. Kaur, "The 10 traits of the smart cloud"). Dr. Kaur has an extensive history in IT, being the author of Intel's Transitioning Embedded Systems to Intelligent Environments. Her professional background, which includes four patents while at Intel & CA, 20 distinguished awards, ten keynote conference speeches at IEEE, and over 50 papers and publications, has earned her the nickname, "The First Lady of Emerging Technologies." Dr. Kaur will be delivering the keynote at the CES show: 2013 IEEE International Conference on Consumer Electronics (ICCE).

While the topic of Big Data is broad and encompasses many trends and new technology developments, she managed to give me a very good overview of what she considers to be the top ten emerging technologies that are helping users cope with and handle Big Data in a cost-effective manner.

Dr. Kaur:
Column-oriented databases
Traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns, instead of rows, allowing for huge data compression and very fast query times. The downside to these databases is that they will generally only allow batch updates, having a much slower update time than traditional models.
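
To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. The sample table, the helper names (to_columns, run_length_encode) and the choice of run-length encoding are all assumptions made for illustration; real column stores use more sophisticated layouts and compression schemes.

```python
from itertools import groupby

# Hypothetical sample table: each row is (order_id, country, amount).
rows = [
    (1, "NO", 120),
    (2, "NO", 80),
    (3, "NO", 200),
    (4, "SE", 50),
    (5, "SE", 75),
]


def to_columns(rows, names):
    """Pivot row tuples into a dict mapping column name -> list of values."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


columns = to_columns(rows, ["order_id", "country", "amount"])

# An aggregate query reads one column instead of scanning every full row.
total_amount = sum(columns["amount"])

# Similar values stored next to each other compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount)
print(encoded_country)
```

Note that adding a single new order means appending to every column list, which hints at why updates tend to be costlier in this layout than in a row store.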

Schema-less databases, or NoSQL databases
There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.
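
As a rough illustration of the key-value idea, the sketch below spreads schema-less documents across several in-memory shards by hashing the key. The class name ShardedKeyValueStore, the shard count and the sample records are hypothetical, and the sketch only shows partitioning: it does not model replication or the consistency trade-offs mentioned above.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store; each dict stands in for one server (shard)."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = document

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKeyValueStore(shard_count=3)

# No schema is enforced: the two documents carry different fields.
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
store.put("user:2", {"name": "Linus", "last_login": "2016-06-06"})

print(store.get("user:1"))
print(store.get("user:404"))
```

Because a lookup only ever touches the shard that owns the key, capacity can in principle grow by adding shards, which is the scalability side of the trade described above.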

MapReduce
This is a programming paradigm that allows for massive job execution scalability against thousands of servers or clusters of servers. Any MapReduce implementation consists of two tasks:
The "Map" task, where an input dataset is converted into a different set of key/value pairs, or tuples;
The "Reduce" task, where several of the outputs of the "Map" task are combined to form a reduced set of tuples (hence the name).
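
The two tasks can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the function names and sample documents are made up, and the intermediate shuffle step (grouping values by key) is added here as an assumption about what a framework does between the two tasks.

```python
from collections import defaultdict


def map_task(document):
    """Map: convert one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values by key so each key is reduced exactly once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return (key, sum(values))


def map_reduce(documents):
    # In a real cluster the map calls would run in parallel on many servers.
    mapped = [pair for doc in documents for pair in map_task(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


documents = ["big data big clusters", "data in the cloud"]
word_counts = map_reduce(documents)
print(word_counts)
```

Because each map call depends only on its own record and each reduce call only on one key's values, the work can be split across machines without the tasks needing to coordinate.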

Hadoop
Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

Hive
Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been made open source for some time now, and it's a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.

PIG
PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-like" scripting language, Pig Latin, that allows for query execution over data stored on a Hadoop cluster, instead of a "SQL-like" language. PIG was developed by Yahoo!, and, just like Hive, has also been made fully open source.

WibiData
WibiData is a combination of web analytics with Hadoop, being built on top of HBase, which is itself a database layer on top of Hadoop. It allows web sites to better explore and work with their user data, enabling real-time responses to user behavior, such as serving personalized content, recommendations and decisions.

PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with conventional databases. PLATFORA is a platform that turns users' queries into Hadoop jobs automatically, thus creating an abstraction layer that anyone can exploit to simplify and organize datasets stored in Hadoop.

Storage Technologies
As the data volumes grow, so does the need for efficient and effective storage techniques. The main evolutions in this space are related to data compression and storage virtualization.

SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused specifically on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods unfeasible or too expensive.

Big Data in the cloud
As we can see from Dr. Kaur's roundup above, most, if not all, of these technologies are closely associated with the cloud. Most cloud vendors are already offering hosted Hadoop clusters that can be scaled on demand according to their users' needs. Also, many of the products and platforms mentioned are either entirely cloud-based or have cloud versions themselves.
Big Data and cloud computing go hand-in-hand. Cloud computing enables companies of all sizes to get more value from their data than ever before, by enabling blazing-fast analytics at a fraction of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.
