21 questions we ask our clients: Starting a successful ML project
Emilius Richter
Automating processes with machine learning (ML) algorithms can increase the efficiency of a system beyond human capacity and is therefore becoming more and more popular in many industries. But between an idea and a well-defined project, there are several points that need to be considered in order to properly assess the economic potential and technical complexity of the project.
Especially for companies like dida that offer custom workflow automation software, a well-prepared project helps to quickly assess the feasibility and the overall technical complexity of the project goals, which in turn makes it possible to deliver software that fulfills the client's requirements. In this article, we discuss which topics should be considered in advance and why the questions we ask are important for starting a successful ML software project.
Importance of project preparation
A detailed analysis of a project idea has many advantages for you and your company, because it gives you a better understanding of the challenges, technical requirements and potential results, and thus a more concrete outline of the project than a loose, barely tangible idea.
If you have a good description and outline of your project idea, you have probably already dealt with essential questions, e.g. what are the economic implications for my business and what kind of data needs to be processed. This allows you to have fruitful discussions with data scientists, developers and ML consultants from the ML software contractor, who in turn can better prepare themselves based on a more detailed project description.
So, let’s go through the questions that we ask our clients. They will help you prepare your project and bring you one step closer to automating your workflow with ML software.
Business and workflow
1. In a few sentences, can you write a high-level description of the general idea?
This is probably a point that you have already dealt with. On the one hand, the question will help you to elaborate the main points of the project and to abstract your project idea. On the other hand, the answer will give us a rough and general outline of the project and a basis for further discussions.
2. Which of the following objectives are most important to accomplish within the project?
Saving of personnel costs or reduction of manual effort
Engagement of employees with activities of higher added value
Creation of a standardized product for commercialization
Competitive advantage through automation of a core process
Creation of a machine learning pilot project within the company
Of course, multiple points can apply to your company here as well as others that are not listed, but these points are usually the most important aspects of ML projects. We ask this question to encourage you to analyze in more detail what the core business objectives of your project are and to reflect with the business owners on the objectives. It can also give us a first notion of your expectations of and commitment to the project.
3. Can you estimate the economic value or the importance of a successful project implementation for your company?
Giving a concrete monetary value that captures the increase in efficiency and the economic impact for your company, even if it is sometimes difficult to estimate in the beginning, allows you to estimate the return on investment (ROI), which is a good measure of the profitability of the project and the efficiency of the investment. This gives us an idea of the maximum amount your company might be willing to spend on the project and might help in internal discussions with the business owners of the project. This evaluation can also lead you to investigate the processes and stakeholders involved and how they benefit and influence each other.
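As a rough illustration of such an estimate, the following sketch computes ROI as (total benefit minus cost) divided by cost. The function name and all figures are hypothetical, and a real business case would usually also account for maintenance costs and discounting.

```python
def estimate_roi(annual_benefit: float, project_cost: float, years: int = 1) -> float:
    """Return ROI as a fraction: (total benefit - cost) / cost."""
    if project_cost <= 0:
        raise ValueError("project_cost must be positive")
    total_benefit = annual_benefit * years
    return (total_benefit - project_cost) / project_cost


# Hypothetical figures: savings of 120,000 per year, one-off project cost of 200,000.
roi = estimate_roi(annual_benefit=120_000, project_cost=200_000, years=3)
print(f"ROI over 3 years: {roi:.0%}")
```

Even a coarse number like this makes it easier to discuss how much budget the project can reasonably justify.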
4. Who are the internal and external stakeholders in the process?
This information is helpful to get an overview of the people, groups, organizations, etc. involved in or affected by the realization of the project. This allows discussions to be held with all stakeholders and to include their interests and expectations in order to start a project that is satisfactory to all stakeholders. It also helps to identify the departments and people who will benefit most from the project and therefore are most likely to support it.
5. Have you already started initiatives to realize the project?
Initiatives that you launched before deciding to reach out to external contractors can be a very good starting point for us to understand what results the project is supposed to deliver and where the technical challenges lie. This can help to better evaluate possible expectations and technical feasibility. It also shows the importance of the project for your company and that you already have a detailed understanding of the problem and its challenges.
6. Does the workflow receive input from the global process?
One of the most important issues for an ML solution is the reliability and quality of the data. In most cases, the input data for the workflow to be automated depends heavily on a higher level process. Understanding this process can be critical. It might give clues to the reliability of the training and test data and to possible variations within the production data.
This can influence decisions about which ML models are best suited to solve the problem and highlight potential challenges regarding future support. Embedding the workflow in a global process might also result in alterations in technical specifications and restrictions that need to be considered when processing the input data.
7. Does the output of the workflow influence the global process and if so, how?
The influence on the global process can provide valuable insights into which requirements the workflow automation must meet, e.g. in terms of response time, accuracy, error tolerance or format. Especially if subsequent processes are security-relevant or dependent on high-quality data, this information influences the choice of the most suitable ML models and makes it possible to identify challenges, technical specifications and restrictions.
8. Which steps of the workflow require complex information extraction, integration and processing by a human expert?
Some processes that are to be automated currently require the control or interaction of human experts to ensure a low error rate and possibly better traceability. Automating such steps of the workflow can be quite complex. Being aware of this, one can research more advanced approaches and evaluate whether human control and interaction can be integrated into the automated process.
9. How grave are the consequences of errors or failure of some of the steps?
Some processes and tasks are sensitive to errors and failures, either because they require high-quality input or are security-relevant. Thus, it is very important to identify them in order to pay special attention to them and to find solution approaches that ensure that these processes work with good performance and are backed with sufficient control mechanisms. This, in turn, may affect technical feasibility, since some performance levels that are mandatory may not be achievable with current ML methods.
Technical aspects
We have reached the halfway point of our questionnaire. So far, the questions have mainly served to analyze the business goals in more detail and to better understand the workflow to be automated. The next points refer to the technical part of the use case and will help to evaluate the technical feasibility.
10. What is the data type?
This is obviously an important question. Knowing what kind of data we have to deal with, e.g. PDF, TXT, JPG, TIFF, CSV or XML files, allows us to determine whether we have to use computer vision, natural language processing or other data science techniques.
Together with the description of the project idea, we may already have a solution approach or a similar case study from the past we can present.
11. What is the amount of available data?
The performance of an ML algorithm strongly depends on the amount of available data. Thus, this information can provide a first rough estimate of the performance that the model is able to achieve. As a rule of thumb, the minimum amount of data should be around 500-2000 examples, depending on the number and distribution of classes. If there is not enough data available, we can look for approaches to overcome the lack of data, and based on the findings we must evaluate whether the project is feasible at all.
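A quick way to check the number and distribution of classes is to count the examples per label. The following is a minimal sketch; the function name, the example labels and the per-class threshold of 100 are assumptions chosen for illustration, not fixed requirements.

```python
from collections import Counter


def summarize_labels(labels, min_per_class=100):
    """Count examples per class and flag classes below a hypothetical minimum."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for label, n in counts.items():
        report[label] = {
            "count": n,
            "share": n / total,
            "enough": n >= min_per_class,
        }
    return report


# Hypothetical dataset: 900 "ok" examples and 60 "defect" examples.
labels = ["ok"] * 900 + ["defect"] * 60
report = summarize_labels(labels, min_per_class=100)
for label, stats in report.items():
    print(label, stats)
```

A strongly imbalanced result like the one in this example suggests that the total number of examples alone says little about feasibility.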
12. Is there a well-defined output for a single input data point?
This question refers to the task that the model should accomplish, i.e. what kind of output the model should generate. Common ML tasks include classification, segmentation and clustering of the input. Usually, the answer can already be roughly inferred from the project description, but specifying it, ideally together with some sample output, is very useful for preparing solution approaches and discussions.
13. What is the amount of available input-output data pairs?
By input-output data pairs, we mean labeled data. Since data labeling can be enormously time-consuming depending on the amount of data, the input data type and the desired output, this question is essential to estimate the project’s time frame, costs and required personnel. If too few input-output pairs are available, labeled data needs to be acquired for training in most cases, either by the client or by the contractor. As this process is often time-consuming, we recommend storing as much input-output data as possible before reaching out to ML contractors.
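To get a feeling for the labeling effort, a back-of-the-envelope calculation is often enough. The sketch below assumes a hypothetical target number of labeled pairs and an average labeling time per example; both values depend heavily on the data type and the task.

```python
def labeling_effort_hours(target_pairs: int, available_pairs: int,
                          seconds_per_label: float) -> float:
    """Estimate the hours needed to label the missing input-output pairs."""
    missing = max(0, target_pairs - available_pairs)
    return missing * seconds_per_label / 3600


# Hypothetical numbers: 2,000 pairs wanted, 350 already labeled, 45 s per label.
hours = labeling_effort_hours(target_pairs=2000, available_pairs=350,
                              seconds_per_label=45)
print(f"Estimated labeling effort: {hours:.1f} hours")
```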
14. Do you have a measure for the performance/quality of the workflow?
If the workflow is embedded in a global process, the model must not only perform well during the training and validation phases, i.e. when generating the output, but also with regard to subsequent processes. There may be a measure that is already in use for the workflow and can be used to evaluate the model globally. This is important to consider already during the implementation to be able to optimize the model subsequently and communicate project progress to internal business owners. This metric ideally combines technical and business aspects. Examples of metrics might be “% of defects detected” in a manufacturing process or “number of minutes to process an incoming order” for a logistics department.
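A metric such as "% of defects detected" corresponds to the recall of the defect class. A minimal sketch, using made-up inspection results and a hypothetical function name:

```python
def defect_detection_rate(actual, predicted):
    """Share of true defects that were flagged (recall for the defect class)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    defects = sum(1 for a in actual if a)
    if defects == 0:
        raise ValueError("no true defects in the data")
    detected = sum(1 for a, p in zip(actual, predicted) if a and p)
    return detected / defects


# Hypothetical inspection results: True marks a defective part.
actual = [True, True, False, True, False, True]
predicted = [True, False, False, True, True, True]
rate = defect_detection_rate(actual, predicted)
print(f"Defects detected: {rate:.0%}")
```

Note that this metric ignores false alarms, so in practice it is usually tracked together with a second measure such as precision.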
15. Do you have a perception of the accepted error tolerance?
Many processes embedded in a larger workflow affect the performance of downstream processes and thus need to perform within an error tolerance. Of course, this error tolerance should also apply to the automated process. If this is the case for your project, knowing the tolerance helps us to evaluate the technical feasibility and to identify potential restrictions and limitations. The result of this evaluation may be unsatisfactory, e.g. because expectations or required performance values cannot be met, but it prevents such problems from surfacing only after the project has already started. Furthermore, expectation levels and technical feasibility can thereby be aligned.
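When comparing a measured error rate with an accepted tolerance, the size of the test set matters. One possible way to account for this, shown here as a sketch rather than a prescribed method, is to compare the upper end of a Wilson score interval with the tolerance. The function names and the numbers are hypothetical.

```python
import math


def wilson_upper_bound(errors: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for an error rate (z=1.96 ~ 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center + half_width


def within_tolerance(errors: int, n: int, tolerance: float) -> bool:
    """True if even the pessimistic error estimate stays within the tolerance."""
    return wilson_upper_bound(errors, n) <= tolerance


# Hypothetical evaluation: 12 errors in 1,000 test cases, tolerance of 2%.
bound = wilson_upper_bound(errors=12, n=1000)
print(f"Observed: {12 / 1000:.1%}, upper bound: {bound:.2%}")
print("Within tolerance:", within_tolerance(12, 1000, tolerance=0.02))
```

In this example the observed error rate is below the tolerance, but the upper bound is not, which indicates that a larger test set would be needed before the tolerance can be considered met.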
16. Can human quality control be integrated into the automated process?
Even though machine learning models are able to automate many processes with high accuracy, the results of the model can sometimes be difficult to retrace. Especially for highly sensitive tasks, e.g. in the healthcare sector, human quality control is still required. For some processes, combining ML and human control, rather than relying entirely on manual or machine processing, can be either more secure or more efficient.
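One common pattern for combining ML and human control is to process confident predictions automatically and to send uncertain ones to a human reviewer. The sketch below assumes that the model returns a confidence score between 0 and 1; the threshold of 0.9, the labels and the function name are hypothetical.

```python
def route_prediction(label: str, confidence: float, threshold: float = 0.9) -> dict:
    """Send low-confidence predictions to human review, accept the rest."""
    needs_review = confidence < threshold
    return {
        "label": label,
        "confidence": confidence,
        "route": "human_review" if needs_review else "automatic",
    }


# Hypothetical model outputs as (predicted label, confidence) pairs.
predictions = [("invoice", 0.98), ("contract", 0.62), ("invoice", 0.91)]
routed = [route_prediction(label, conf) for label, conf in predictions]
review_share = sum(r["route"] == "human_review" for r in routed) / len(routed)
print(f"Share sent to human review: {review_share:.0%}")
```

The threshold controls the trade-off between manual effort and the risk of unchecked errors, so it should be chosen with the accepted error tolerance in mind.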
17. Is there an existing data pipeline?
The data pipeline is the basis for integrating the ML model and its outputs into the workflow, as it is responsible for the data flow and enables interaction with and access to the data. An existing pipeline would raise further discussions about the necessary data access points and the deployment of the ML model in the system. If there is no existing pipeline, the implementation of it will have an impact on the project planning and cost estimation that will need to be evaluated in more detail.
18. What is the amount of data which is going to be processed for a given time period?
This question helps to assess the system requirements regarding the scalability of the ML model and its deployment and to estimate possible hardware requirements and model response times. Your DevOps team's input would be required in this step, allowing you to foresee DevOps preparation needs and required resources early in the process.
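A first capacity estimate can be derived from the expected data volume and the processing time per item. The following sketch assumes sequential processing per worker and ignores peaks in load; all numbers and the function name are hypothetical.

```python
import math


def required_workers(items_per_day: int, seconds_per_item: float,
                     processing_hours_per_day: float = 24.0) -> int:
    """Estimate how many parallel workers are needed to keep up with the load."""
    if processing_hours_per_day <= 0:
        raise ValueError("processing_hours_per_day must be positive")
    available_seconds = processing_hours_per_day * 3600
    total_seconds = items_per_day * seconds_per_item
    # At least one worker is always assumed, even for very small loads.
    return max(1, math.ceil(total_seconds / available_seconds))


# Hypothetical load: 50,000 documents per day, 2.5 s each, 8-hour processing window.
workers = required_workers(items_per_day=50_000, seconds_per_item=2.5,
                           processing_hours_per_day=8)
print("Workers needed:", workers)
```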
19. Are there any hardware restrictions for the production environment?
Having this information at the beginning of the project, or even before, allows the developers to plan for these restrictions during the implementation of the algorithm. In addition, possible limitations arising from the restrictions can be estimated and addressed at an early stage.
20. Is there an internal IT team responsible for the deployment into the ecosystem?
Sure, your external contractor can take care of the first deployment and integration into the IT architecture, but it might make sense for the long-term maintenance to be taken over by the company’s internal IT team. Even for the first deployment, having an internal IT team experienced in and responsible for the IT architecture makes the deployment easier. However, this should be planned in advance and integrated into the IT department’s roadmap.
21. Is there an internal Data Science team that can take over the support of the algorithms?
Some machine learning algorithms must be continuously developed and retrained in order to incorporate new data into the model predictions while maintaining similar accuracy and computational efficiency. If this is necessary and there is no internal Data Science team that can be responsible for this, it must be agreed that the external contractor takes over the long-term maintenance and support of the algorithms.
Summary
Machine learning is capable of solving many automation problems, and the demand for such solutions is increasing. But the road between the idea and the actual implementation is long, so there are some points to consider before starting an ML project. In this article, we have covered the questions dida asks its potential clients. Answering these questions will help you, your company and the potential ML software contractor to
better recognize your business objectives and intentions,
understand the workflow to be automated and its distinct steps,
estimate model performance and error tolerance expectations and
learn about important properties of the input and output data.
With this information, your ML software contractor will be able to
learn about the data-flow and model deployment requirements,
narrow down the number of suitable algorithms for further research,
come up with first potential solution approaches and
identify which of its past projects are similar from a technical perspective.
Some of these questions are also asked in the online questionnaires that dida has prepared for its potential clients, which cover the business and technical aspects of their use case.
I hope this article is a helpful guideline for preparing your ML project and for entering into discussions with potential ML software contractors.