March 7th, 2019
Infrastructure Efficiency Gains for Modern Machine Learning Projects
By Rob Newman

Introduction

Machine learning projects are complex, interrelated ecosystems of data, models, results, and code that change in size and shape over a project’s lifetime. Some of the engineering pain of preserving the state and history of each component has been addressed over the past year. This blog post covers some of the more innovative open-source projects that gained attention and traction in 2018 and 2019.

Existing Problems

Managing data like code

Developers use Git and GitHub/GitLab/Bitbucket to manage and distribute multiple versions of their code. What about version control for the data used in a machine learning project? Anyone who has accidentally committed a large image or text file to a repository knows how badly large files hurt repository performance. Git was not designed to version-control large datasets.

A key goal of machine learning projects is reproducibility across an engineering team and technology stack. Therefore, you need a way of saving and iterating (1) your data ingestion and cleansing processes, (2) your data transformations including feature engineering, and (3) your models, their tuned hyperparameters and performance metrics. These processes have been coined “DataOps” (related to/borrowed from “DevOps” concepts and CI/CD processes), which is an emerging field as more companies embrace machine learning as a differentiator.
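One common way to track data alongside code without committing the data itself is to commit a small manifest of content hashes. The sketch below is a generic illustration of that idea, not how any particular product works; the function names (file_sha256, build_manifest, write_manifest) and the manifest layout are assumptions made up for this example.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file's path (relative to data_dir) to its content hash."""
    return {
        p.relative_to(data_dir).as_posix(): file_sha256(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def write_manifest(data_dir: Path, manifest_path: Path) -> dict:
    """Write the manifest as JSON; this small file is what gets committed."""
    manifest = build_manifest(data_dir)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Keeping the manifest file outside the data directory avoids hashing the manifest itself. Because the hash changes whenever a file’s contents change, a diff of two committed manifests shows exactly which data files differ between two versions of a project.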

Several VC-backed companies are actively addressing this problem. Quilt Data, a Y Combinator graduate, developed the concept of data packages to allow simple version control of data and data engineering processes (in the form of Jupyter notebooks) using Python’s native import syntax. You create a Quilt data package for your dataset(s) using their command line tools, version the data, then import it into your notebook:

from quilt.data.username import packagename as pkg

You can then access your data as a native Python package using a variety of methods.

Quilt recently released T4, a team data hub that adds functionality to Amazon’s Simple Storage Service (S3) and turns a standard S3 bucket into a shareable, version-controlled data repository. You can add, version, and share data in S3 using their Python API.

Data engineering performance improvements

Pandas is an open-source Python project that is the de facto library for data engineers and data scientists building ML solutions. However, Pandas can be slow (and can run out of memory) when applying transformations to larger datasets (> 1 million rows). This is due in part to what Pandas was originally built for (data analysis, not big data) and in part to how users often misapply Pandas methods.

Several projects are addressing Pandas performance issues:

Modin is an open-source project that speeds up your Pandas notebooks by parallelizing data processing across all your physical cores. Pandas, by default, uses only one core at a time during computation. Modin utilizes Ray under the hood and works seamlessly with existing Pandas projects. To use it, you simply change the initial import statement (after installation) from:

import pandas as pd

to:

import modin.pandas as pd

You now have a faster, parallelized data processing engine!
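Because the two libraries share an interface, the import swap can also be written defensively. The snippet below is a minimal sketch, assuming Modin may or may not be installed in a given environment; the DataFrame contents are made up for illustration.

```python
try:
    # Parallel execution, if Modin and an engine such as Ray are installed.
    import modin.pandas as pd
except ImportError:
    # Otherwise fall back to standard, single-core Pandas.
    import pandas as pd

# The rest of the code is identical regardless of which import succeeded.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
totals = df.groupby("key")["value"].sum()
```

On a tiny DataFrame like this one there is nothing to gain from parallelism; any speedup would only show up on much larger datasets.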

Dask is a Python library for parallel computing composed of two parts: (1) dynamic task scheduling (similar to Airflow, Luigi, Celery or Make) and (2) “Big Data” collections that extend common interfaces (including Pandas) to larger-than-memory or distributed environments. To use it when reading a CSV file, for example, you change (after installation):

import pandas as pd
df = pd.read_csv('data.csv')

to:

import dask.dataframe as dd
df = dd.read_csv('data.csv')

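This allows you to scale Pandas-style data processing to much larger (and distributed) datasets. Note that Dask DataFrames are evaluated lazily: operations build up a task graph, and nothing is computed until you call .compute() on the result.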

Finally, as with any large project, there are good (efficient) and bad (inefficient) ways of applying Pandas methods to your data. Inefficient use of a Pandas data transformation method (such as calling DataFrame.apply() row by row) can impair performance or cause your notebook environment to crash. However, there is an excellent series available free online, called Modern Pandas, which clearly illustrates the most efficient methods of data manipulation, visualization, and scaling.
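As a small illustration of the difference, the sketch below computes the same column two ways. The column names (price, quantity) and the data are invented for this example, and it is not taken from the Modern Pandas series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.arange(1, 10_001, dtype="float64"),
    "quantity": np.full(10_000, 3, dtype="int64"),
})

# Row-wise: calls a Python function once per row, which is slow on large frames.
row_wise = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized: a single operation over whole columns, executed in compiled code.
vectorized = df["price"] * df["quantity"]
```

Both approaches produce the same values, but the vectorized form avoids per-row Python overhead, and the gap typically widens as the number of rows grows.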

Conclusion

Many of the problems commonly encountered in machine learning projects can be readily solved using the open-source projects listed above.

As always, we are happy to help! If you have questions or need help getting started with your data, ML, or AI projects, schedule a consultation or email us at info@1Strategy.com.