
Monday, 6 June 2016

Life Cycle of a Data Science Project

When working with big data, data scientists benefit from following a well-defined workflow. Whether the goal is to tell a story through data visualization or to build a data model, the workflow process matters. A standard workflow for data science projects keeps the various teams within an organization in sync, so that avoidable delays do not occur.


The end goal of any data science project is to produce an effective data product. The usable results produced at the end of a data science project are referred to as a data product. A data product can be anything that facilitates business decision-making to solve a business problem, such as a dashboard or a recommendation engine. It should help answer a business question. To reach that end goal, data scientists have to follow a formalized, step-by-step workflow. The lifecycle of a data science project should not focus merely on the process; it should place more emphasis on the data product. This post outlines the standard workflow that data scientists follow in data science projects.









Data science projects do not have a clean lifecycle with well-defined steps like the software development lifecycle (SDLC). They often run into delivery delays and repeated hold-ups, because some steps in the lifecycle are non-linear, highly iterative and cycle between the data science team and various other teams in the organization. At the outset it is very difficult for data scientists to determine the best way to proceed. Although the workflow might not be clean, data scientists should still follow a standard workflow to reach the desired output.







People often confuse the lifecycle of a data science project with that of a software engineering project. That should not be the case, as data science is more science and less engineering. There is no one-size-fits-all workflow for all data science projects, and data scientists have to determine which workflow best fits the business requirements. There is, however, a standard workflow based on one of the oldest and most popular methodologies, CRISP-DM. It was developed for data mining projects but has since been adopted by most data scientists, with modifications to suit the requirements of the project at hand.


According to a recent KDnuggets poll that asked “What main methodology are you using for your analytics, data mining, or data science projects?”, CRISP-DM remained the top methodology for data mining and data science projects, with 43% of the projects using it.


Every step in the lifecycle of a data science project depends on a range of data scientist skills and data science tools. The typical lifecycle involves jumping back and forth among interdependent tasks using a variety of programming tools. The process begins with asking an interesting business question, which guides the overall workflow of the project.


Standard Lifecycle of Data Science Projects


The data science project lifecycle is similar to the CRISP-DM lifecycle, which defines the following six standard steps for data mining projects:
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment


The lifecycle of a data science project is an enhancement of the CRISP-DM workflow with some alterations:
Data Acquisition
Data Preparation
Hypothesis and Modelling
Evaluation and Interpretation
Deployment
Operations
Optimization


1) Data Acquisition

Doing data science requires data. The first step in the lifecycle is to identify the person who knows what data to acquire, and when, based on the question to be answered. That person need not be a data scientist; anyone who understands the real differences between the available data sets and can make hard decisions about the organization's data investment strategy is the right person for the job.
A data science project begins with identifying the data sources. These could be logs from web servers, social media data, data from online repositories such as the US Census datasets, data streamed from online sources via APIs, data obtained by web scraping, data held in an Excel spreadsheet, or data from any other source. Data acquisition involves acquiring data from all the identified internal and external sources that can help answer the business question.
A major challenge in the data acquisition step is tracking where each data slice comes from and whether it is up to date. It is important to track this information throughout the lifecycle, because data might have to be re-acquired to test other hypotheses or to run updated experiments.
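One way to keep track of where each data slice came from is to store a small provenance record next to it. The sketch below is only an illustration: the `DataSliceRecord` schema, the `record_slice` and `is_stale` helpers, and the example file path are all hypothetical, and it assumes a content hash is an acceptable way to tell whether a slice has changed since it was acquired.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class DataSliceRecord:
    """Provenance metadata for one acquired data slice (hypothetical schema)."""
    name: str
    source: str       # where the slice came from, e.g. a URL or file path
    acquired_at: str  # ISO 8601 timestamp in UTC
    sha256: str       # fingerprint of the raw bytes, used to detect changes


def record_slice(name: str, source: str, raw_bytes: bytes) -> DataSliceRecord:
    """Build a provenance record for a freshly acquired slice."""
    return DataSliceRecord(
        name=name,
        source=source,
        acquired_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(raw_bytes).hexdigest(),
    )


def is_stale(record: DataSliceRecord, current_bytes: bytes) -> bool:
    """True when the source content no longer matches what was acquired."""
    return hashlib.sha256(current_bytes).hexdigest() != record.sha256


# Made-up example: one line of a web-server log.
raw = b"2016-06-06 GET /index.html 200\n"
rec = record_slice("weblogs-2016-06-06", "file:///var/log/example/access.log", raw)

print(json.dumps(asdict(rec), indent=2))
unchanged_is_stale = is_stale(rec, raw)
changed_is_stale = is_stale(rec, raw + b"2016-06-06 GET /about.html 200\n")
print(unchanged_is_stale, changed_is_stale)
```

Storing records like this alongside the data makes it possible to decide later whether a slice has to be re-acquired before a hypothesis is re-tested.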


2) Data Preparation

This step is often referred to as the data cleaning or data wrangling phase. Data scientists often complain that it is the most boring and time-consuming task, as it involves identifying various data quality issues. Data acquired in the first step is usually not in a format suitable for running the required analysis and might contain missing entries, inconsistencies and semantic errors.
Having acquired the data, data scientists have to clean and reformat it, either by editing it manually in a spreadsheet or by writing code. This step does not in itself produce meaningful insights. However, through regular data cleaning, data scientists can readily identify what foibles exist in the data acquisition process, what assumptions they should make and what models they can apply to produce analysis results. After reformatting, the data can be converted to JSON, CSV or any other format that is easy to load into a data science tool.
Exploratory data analysis is an integral part of this stage, as summarizing the clean data can help identify outliers, anomalies and patterns that can be used in subsequent steps. This is the step that helps data scientists answer the question of what they actually want to do with the data.
As the American mathematician John Tukey put it: “‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.”
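A minimal sketch of cleaning followed by a first exploratory summary, using only the Python standard library. The records are made up, the `clean` and `iqr_outliers` helpers are hypothetical names, and dropping rows with missing values (rather than imputing them) is an assumption chosen to keep the example short.

```python
import statistics

# Made-up raw records: a missing entry, stray whitespace and one extreme value.
raw_rows = [
    {"user": "a", "visits": "12"},
    {"user": "b", "visits": ""},       # missing entry
    {"user": "c", "visits": " 15 "},   # inconsistent formatting
    {"user": "d", "visits": "14"},
    {"user": "e", "visits": "13"},
    {"user": "f", "visits": "240"},    # suspiciously large
    {"user": "g", "visits": "11"},
    {"user": "h", "visits": "16"},
    {"user": "i", "visits": "14"},
]


def clean(rows):
    """Drop rows with missing counts and convert the rest to integers."""
    cleaned = []
    for row in rows:
        value = row["visits"].strip()
        if not value:
            continue  # assumption: missing entries are dropped, not imputed
        cleaned.append({"user": row["user"], "visits": int(value)})
    return cleaned


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


rows = clean(raw_rows)
visits = [row["visits"] for row in rows]

print("rows kept:", len(rows))
print("median visits:", statistics.median(visits))
print("outliers:", iqr_outliers(visits))
```

Whether a flagged value is an error or a genuine observation is a judgment call that depends on the business question, so the sketch reports outliers rather than deleting them.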


3) Hypothesis and Modelling

This is the core activity of a data science project. It requires writing, running and refining programs that analyse the data and derive meaningful business insights from it. These programs are often written in languages like Python, R, MATLAB or Perl. Diverse machine learning techniques are applied to the data to identify the model that best fits the business needs. All the contending machine learning models are trained on the training data set.
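To make "training contending models on the training set" concrete, here is a small sketch in plain Python. The data is made up, and the two candidates (a mean baseline and an ordinary least squares line) are illustrative choices, not models prescribed by this post; in practice a library such as scikit-learn would usually be used.

```python
import statistics

# Made-up training data: y is roughly 2 * x + 1 with a little noise.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]


def fit_mean_baseline(xs, ys):
    """Candidate 1: ignore x and always predict the mean of y."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate 2: ordinary least squares line y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Every contending model is fitted on the same training set.
candidates = {
    "mean_baseline": fit_mean_baseline(train_x, train_y),
    "linear": fit_linear(train_x, train_y),
}

for name, model in candidates.items():
    print(name, "predicts", round(model(7.0), 2), "for x = 7")
```

Keeping a trivial baseline among the candidates gives a reference point: a more complex model is only worth its cost if it clearly beats the baseline.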


4) Evaluation and Interpretation

Different kinds of models call for different evaluation metrics. For instance, if the machine learning model aims to predict daily stock prices, then RMSE (root mean squared error) should be considered for evaluation. If the model aims to classify spam emails, then performance metrics like average accuracy, AUC and log loss should be considered. A common question when evaluating a machine learning model is which dataset to use to measure its performance. Looking at the performance metrics on the training dataset is helpful but can be misleading: the numbers might be overly optimistic, because the model has already adapted to the training data. Model performance should be measured and compared on validation and test sets, so that the best model is identified on the basis of both accuracy and over-fitting.
Steps 1 to 4 are iterated as data is acquired continuously and the business understanding becomes clearer.
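The metrics named in this step can be written out in a few lines. The sketch below implements RMSE, accuracy and log loss in plain Python (AUC is left out for brevity); all numbers are made up, and the 0.5 decision threshold is an assumption. It also scores the same hypothetical model on training and validation data to show how training-set numbers can look overly optimistic.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, for numeric predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(labels, probs, threshold=0.5):
    """Share of examples whose thresholded probability matches the 0/1 label."""
    hits = sum((p >= threshold) == bool(t) for t, p in zip(labels, probs))
    return hits / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Average negative log-likelihood of the true 0/1 labels."""
    total = 0.0
    for t, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(labels)


# Regression: the same model scored on its training data and on held-out data.
train_true, train_pred = [10.0, 12.0, 11.0], [10.1, 11.9, 11.0]
valid_true, valid_pred = [13.0, 9.0, 12.0], [12.0, 10.0, 12.5]
print("train RMSE:", round(rmse(train_true, train_pred), 3))
print("valid RMSE:", round(rmse(valid_true, valid_pred), 3))

# Classification (1 = spam) with made-up predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]
print("accuracy:", accuracy(labels, probs))
print("log loss:", round(log_loss(labels, probs), 3))
```

In this made-up example the validation RMSE is roughly ten times the training RMSE, which is the kind of gap that points to over-fitting.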


5) Deployment

Machine learning models might have to be recoded before deployment, for example because the data scientists favour Python while the production environment supports Java. The models are then deployed in a pre-production or test environment before they are deployed into production.


6) Operations/Maintenance

This step involves developing a plan for monitoring and maintaining the data science project in the long run. Model performance is monitored, and any degradation in performance is tracked closely in this phase. Data scientists can also archive what they learned from a specific data science project, both for shared learning and to speed up similar projects in the near future.
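Monitoring for performance degradation could look roughly like the sketch below. The `RmseMonitor` class, its rolling window and its tolerance factor are all hypothetical choices made for illustration; it assumes the true outcome of each prediction eventually becomes known so that the error can be computed.

```python
import math
from collections import deque


class RmseMonitor:
    """Track RMSE over the most recent predictions and flag degradation."""

    def __init__(self, baseline_rmse, window=50, tolerance=1.5):
        self.baseline_rmse = baseline_rmse  # RMSE measured before deployment
        self.tolerance = tolerance          # allowed ratio over the baseline
        self.squared_errors = deque(maxlen=window)  # keeps only recent errors

    def observe(self, actual, predicted):
        """Record one prediction once its true outcome is known."""
        self.squared_errors.append((actual - predicted) ** 2)

    def current_rmse(self):
        """RMSE over the current window, or 0.0 if nothing was observed yet."""
        if not self.squared_errors:
            return 0.0
        return math.sqrt(sum(self.squared_errors) / len(self.squared_errors))

    def degraded(self):
        """True when recent RMSE exceeds tolerance * baseline."""
        return self.current_rmse() > self.tolerance * self.baseline_rmse


monitor = RmseMonitor(baseline_rmse=1.0, window=3)

# Early traffic: predictions stay close to the actual values.
for actual, predicted in [(10, 10.5), (12, 11.0), (11, 11.2)]:
    monitor.observe(actual, predicted)
healthy_flag = monitor.degraded()

# Later traffic: the data has drifted and the errors grow.
for actual, predicted in [(10, 14), (12, 7), (11, 15)]:
    monitor.observe(actual, predicted)
drifted_flag = monitor.degraded()

print(healthy_flag, drifted_flag)
```

A flag like this is only a signal; deciding whether to retrain or to investigate the data pipeline remains a human decision.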


7) Optimization

This is the final phase of a data science project. It involves retraining the machine learning model in production whenever new data sources come in, or taking whatever other steps are needed to keep the model's performance up.
A well-defined workflow makes any data science project less frustrating to work on. The lifecycle described above is not definitive and can be altered to improve the efficiency of a specific data science project as the business requirements dictate.















