dida's tech stack
Published on March 29th, 2021 in Natural Language Processing
This article provides an overview of our tech stack at dida. Of course we adapt the tools we use to what is required by a given project, but the ones we have listed here are our go-to tools whenever we are free to choose.
I will first describe the tools that shape our software development process, and then our favourite Python libraries and software tools for machine and deep learning.
Software development and deployment infrastructure
By default we write our software in Python. In recent years Python has established itself as the number one programming language for machine and deep learning applications. The reasons for this are the following:
- It's relatively simple and easy to learn and thus allows developers to concentrate on machine learning instead of implementation issues.
- Python offers a vast range of helpful libraries facilitating development, training and deployment of machine learning models. Some of the most popular deep learning frameworks like PyTorch are written specifically for Python.
- By now there is a large and very active Python community sharing its knowledge at conferences, in blog posts and on internet platforms like Stack Overflow, which can be of tremendous help for any developer.
Although Python is not a performance-oriented programming language and is therefore rather slow by design, this does not present an obstacle for computationally expensive deep learning applications: most frameworks run the costly computations in a faster language like C or C++ and only provide a Python wrapper as a "user interface".
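The effect of delegating work to compiled code can be seen even without a deep learning framework. The following is a minimal sketch using only the standard library: it compares a hand-written Python loop with the built-in sum(), whose loop runs in C in the CPython interpreter. The function name python_loop_sum and the list size are chosen for illustration only, and the measured times will vary from machine to machine.

```python
import timeit

values = list(range(200_000))


def python_loop_sum(numbers):
    """Add the numbers one by one in interpreted Python bytecode."""
    total = 0
    for n in numbers:
        total += n
    return total


loop_result = python_loop_sum(values)
builtin_result = sum(values)  # same computation, but the loop runs in C

# Time both variants; the built-in is typically several times faster.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"Python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

Deep learning frameworks apply the same principle on a much larger scale: the Python code only describes what to compute, while the heavy lifting happens in compiled code.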
At dida we make extensive use of Jupyter Notebooks (and the more comprehensive JupyterLab). Relying on IPython, Jupyter Notebooks are a kind of document in which you can write and run Python (but also Julia and R) code and display the results. Jupyter Notebooks can not only display command line outputs, but also more complex objects like images, pandas.DataFrames or LaTeX formulas.
The code is organized into cells, which can be run individually. IPython's built-in "magic commands" extend the functionality of cells beyond just running Python code.
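To give an impression of what a cell might contain, here is a small sketch. The class ColorSwatch is a hypothetical helper invented for this example. It relies on the _repr_html_ method, which Jupyter looks for when it displays an object as rich output. Magic commands appear only as comments, because they are valid in IPython and Jupyter but not in a plain Python script.

```python
# In a notebook cell, magic commands start with a percent sign, for example:
#   %timeit sum(range(1000))    (times a single statement)
#   %matplotlib inline          (shows matplotlib plots inside the notebook)


class ColorSwatch:
    """Hypothetical helper: a color that Jupyter can render as a colored box."""

    def __init__(self, hex_code):
        self.hex_code = hex_code

    def __repr__(self):
        # Plain-text fallback, used in a normal Python console.
        return f"ColorSwatch({self.hex_code!r})"

    def _repr_html_(self):
        # Jupyter calls this method, if present, to get an HTML rendering.
        return (
            f'<div style="width:4em;height:2em;background:{self.hex_code}"></div>'
        )


swatch = ColorSwatch("#1f77b4")
swatch  # as the last expression of a cell, this is rendered as a colored box
```

Objects like pandas.DataFrames are displayed as formatted tables through the same mechanism.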
The above features make Jupyter Notebook and JupyterLab great tools for prototyping and exploratory work in general, as well as for showcasing results. They do not provide the full range of functionality of a full-fledged IDE like VS Code or PyCharm, though. By the way, one can use notebooks both in VS Code and in the Professional edition of PyCharm.
Google Cloud Platform
Google Cloud offers a range of services far too vast to describe here. We use it mainly for two purposes: its storage buckets let us save data remotely and make it accessible to everyone working on a given project (see the discussion of DVC below), and its Compute Engine instances serve as GPU servers.
The GPU servers can easily be configured to meet the performance requirements of a project and are thus a useful alternative to local servers.
Amazon Web Services (AWS) and Microsoft Azure offer a similar range of services. We have found that generally they seem to be on a par with Google Cloud, so which of them to choose depends on the specifics of one's needs. To be fair, I must add that we haven't evaluated them to the same extent as Google Cloud, since the latter is our default setting.
Just like its probably more famous competitor GitHub, GitLab provides version control using Git in remote code repositories. GitLab enhances the basic features of Git by adding GUIs (for example for resolving merge conflicts) and a range of useful tools for software development (e.g. definition of continuous integration pipelines) and project management (e.g. issue boards).
For most of our projects, the lean project management tools offered by GitLab are exactly what we need, although in some cases we also use dedicated software like Jira.
Git repositories, such as those hosted on GitLab, keep track of the code developed in a project. In theory they could also be used as remote repositories for other kinds of data, but that's not what they are optimized for.
DVC (short for Data Version Control) is designed to handle large files, like image data and machine learning data sets in general. It works similarly to Git: local changes are pushed to a remote repository, and all changes get tracked.
DVC corresponds to the software Git, not to the service GitLab. This means that DVC requires external remote storage which gets configured as a DVC remote - in our case usually Google Cloud Storage buckets.
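The basic idea behind this kind of data versioning can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not DVC's actual implementation. The large file is stored in a cache under a name derived from its content hash, and only a small pointer file, which is cheap to commit to Git, stays next to the code. The function track_file, the pointer file format and the file names are invented for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track_file(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cached copy is named after its hash, so every version is kept apart.
    shutil.copyfile(data_file, cache_dir / digest)
    # The pointer file is tiny and could be committed to Git instead of the data.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "md5": digest}))
    return pointer


workdir = Path(tempfile.mkdtemp())
data = workdir / "images.bin"
data.write_bytes(b"pretend this is a large data set")

pointer = track_file(data, workdir / "cache")
pointer_info = json.loads(pointer.read_text())
print(pointer_info)
```

In this picture, pushing to a remote amounts to uploading the contents of the cache directory to the storage bucket.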
Imagine we have developed software that successfully solves a customer's problem. It runs flawlessly on our machines, where we have full control over the hardware, operating system and Python installation. How can we make sure it runs just as well on different systems?
We could ask the customer to run the software on a full-blown virtual machine built according to our specifications, but that generates a lot of overhead, which is not really necessary here, because Docker offers a better solution: a so-called Docker container behaves much like a lightweight virtual machine intended to run a single application, but it shares the host's operating system kernel instead of emulating a complete machine. For more information on Docker see our blog post.
Tools for machine / deep learning
SciPy is a Python library for scientific computing. It provides optimized implementations of various useful functions from linear algebra, interpolation, FFT, signal and image processing and many others.
We use SciPy whenever we want to use domain knowledge to enhance the algorithms we develop with deterministic computations.
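As a small illustration of such a deterministic computation, the following sketch uses SciPy's FFT module to find the dominant frequency of a toy signal. The signal, its sample rate and the two frequencies are assumptions made up for this example.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

sample_rate = 100  # samples per second (assumed for this toy signal)
t = np.arange(sample_rate) / sample_rate  # one second of sample times

# A strong 5 Hz component plus a weaker 12 Hz component.
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 12 * t)

# Magnitude spectrum of the real-valued signal and the matching frequency axis.
spectrum = np.abs(rfft(signal))
freqs = rfftfreq(len(signal), d=1 / sample_rate)

dominant_freq = freqs[np.argmax(spectrum)]
print(f"Dominant frequency: {dominant_freq} Hz")
```

A feature like the dominant frequency could then be passed to a learned model alongside the raw data.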
Scikit-learn is a Python library that covers classical machine learning algorithms for regression, classification and clustering, like random forests, support vector machines, linear or logistic regression and k-means. It is also possible to define and train simple neural networks.
Scikit-learn comes with a lot of handy tools for preprocessing data, defining complex model pipelines and evaluating the performance of trained models.
At dida we often use scikit-learn models to establish robust baselines which can be compared to the performance of more sophisticated models developed with dedicated deep learning libraries.
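A baseline of this kind might look like the following sketch. It is not taken from one of our projects: the data is synthetic, and the choice of a scaler followed by logistic regression is just one plausible example of a preprocessing and model pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for a real data set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing and model are combined into a single pipeline object.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

# Accuracy on held-out data is the number a deep learning model has to beat.
baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```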
PyTorch, TensorFlow and Keras
PyTorch and TensorFlow are deep learning libraries for Python which facilitate the definition, configuration, training and application of neural networks. PyTorch is developed and maintained by Facebook, TensorFlow by Google. Both are open source and free to use, and in most cases it is a matter of taste which of them to choose (especially since the release of TensorFlow 2.0). Although there are other machine and deep learning libraries for Python, those two have increasingly divided the market among themselves in the last couple of years.
Keras is another deep learning library. It used to be a wrapper for various deep learning backends (including TensorFlow) and provides a high-level interface that makes the development of models clearer and simpler. In 2020 Keras stopped supporting backends other than TensorFlow and has since been integrated into TensorFlow.
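To give an idea of what defining and training a neural network looks like in practice, here is a minimal PyTorch sketch. The toy regression data, the network size and the hyperparameters are arbitrary choices for illustration. An equivalent model could be written in TensorFlow or Keras.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data: y = 3x + 1 plus a little noise.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1 + 0.05 * torch.randn_like(x)

# Definition: a small fully connected network.
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

# Configuration: optimizer and loss function.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(model(x), y).item()

# Training: repeat forward pass, backward pass and parameter update.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

final_loss = loss_fn(model(x), y).item()
print(f"Loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```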
Deep learning libraries like those mentioned above make it relatively easy to experiment with different models and various hyperparameter configurations. Anyone who has ever done this knows how quickly you can lose track of what you have already tried out and how it has performed.
MLflow is an open source platform that we use to record our experiments and keep an overview of them. Apart from experiment tracking it offers functionality for model deployment and model storage.
Of course there are some alternatives to MLflow: TensorBoard (it comes with TensorFlow, but also works with PyTorch) aims mainly at inspecting and visualizing experiments. Sacred is another tool that focuses on experiment tracking.