Implementing Version Control for Big Data Projects

Ugur Selim Ozen · Published in Geek Culture · Feb 23, 2022

0. Introduction

A version is a particular form of something that differs in certain respects from an earlier form or other forms of the same kind. In the software world, it refers to a particular state of a resource such as source code or data. If a resource’s structure, content, or state changes, it may be worth creating a new version.

Manipulating a dataset, correcting errors and bugs, or inserting additional data may all require creating a new version. Versioning is also a practical way of tracking interrelated changes in dynamic data.

1. What Are Version Control and Data Versioning?

Version control is a mechanism that allows you to follow and monitor every stage of a project. It enables teams to review modifications to the source code and ensures that no change is overlooked.

Data versioning is the process of storing the versions of a dataset that are created or modified at different points in time.

There are many valid reasons for making changes to a dataset. Data specialists test machine learning (ML) models to increase the success rate of a project, and doing so requires significant manipulation of the data. Datasets may also change over time due to a continuous inflow of data from different sources. Keeping older versions of the data helps organizations replicate a previous environment.

2. Why Is Data Versioning Important?

In the software development lifecycle, the development process is usually spread over a long period, and tracking changes in the project can be difficult. Versioning makes this process easier and saves teams from having to manually name every version of a script.

Additionally, with the help of data versioning, historical datasets are saved and kept in databases. This brings several advantages, described below.

3. Benefits of Data Versioning

Keeping the Best Model while Training

The main objective of data science projects is to contribute to the business demands of the company. Data scientists therefore need to develop many ML models in response to customer or product requests, which means feeding new datasets into the ML pipeline for every modeling attempt. At the same time, they must be careful not to lose the dataset that gives them the best modeling score. A data versioning system makes this possible.

Providing a New Business Metric for Growth

In today’s digital world, companies that make decisions and develop strategies based on data are the ones most likely to survive. It is therefore important not to lose historical data.

Consider an e-commerce company that sells daily necessities. Every transaction in its application changes the sales data, and people’s needs and demands shift over time. Keeping all of the sales data is therefore valuable for spotting new trends in customer demand and for determining the right strategies and campaigns.

In the end, this historical data gives companies a new business metric with which to measure their success and performance.

Protection Regarding Data Privacy Issues

As digital transformation accelerates, the amount of data produced is increasing rapidly. This has brought with it concerns about protecting personal data, and the resulting data protection regulations set by governments oblige companies to retain certain data.

Data versioning can help in such situations by ensuring that snapshots of the data at specific points in time are preserved, which in turn helps organizations meet the requirements of these regulations.

4. A Quick Demo with an Open Source Data Versioning Tool

There are many data versioning tools on the market. They offer similar data storage features, but some have important advantages over others.

lakeFS is an open-source platform that enables data teams to manage data lakes the way they manage source code. It supports running parallel ML pipelines for testing and CI/CD operations across the whole data lifecycle, providing flexibility and ease of control over the object storage behind the data lake.

With lakeFS, every process, from complex ETL jobs to data analytics and machine learning steps, can be automated and tracked easily. Some prominent features of lakeFS are:

  • Supports cloud storage solutions such as AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage
  • Works easily with most modern big data frameworks and technologies such as Hadoop, Spark, and Kafka (see the sketch after this list)
  • Provides Git-like operations such as branch, commit, and merge, which scale to petabytes of data on top of cloud storage
  • Can be deployed in the cloud or on-premises and exposes an S3-compatible API
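
As an illustration of the big data framework integration mentioned above, a Spark job can read data directly from a lakeFS branch through its S3-compatible API. The following is only a minimal sketch: it assumes a lakeFS instance running locally on port 8000 (as in the demo below) and uses placeholder credentials; the configuration keys are standard Hadoop S3A properties.

$ spark-shell \
    --conf spark.hadoop.fs.s3a.endpoint=http://localhost:8000 \
    --conf spark.hadoop.fs.s3a.access.key=<lakefs-key-id> \
    --conf spark.hadoop.fs.s3a.secret.key=<lakefs-secret-key> \
    --conf spark.hadoop.fs.s3a.path.style.access=true
# Inside the Spark shell, data on a branch is addressed as s3a://<repository>/<branch>/<path>,
# for example s3a://demo-repo/main/tweets.txt from the demo below.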

Installing and Running The LakeFS Environment

To run a lakeFS session on your local computer, make sure Docker and Docker Compose (version 1.25.04 or higher) are installed. To start lakeFS with Docker, type the following command:

$ curl https://compose.lakefs.io | docker-compose -f - up

After that, check that the installation is running by opening http://127.0.0.1:8000/setup in your browser.
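
If you prefer not to keep the terminal attached to the compose logs, the same stack can also be started in detached mode and inspected from the command line. This is only a convenience sketch; the exact container names depend on the compose file served by compose.lakefs.io.

$ curl https://compose.lakefs.io | docker-compose -f - up -d
# list running containers (you should see lakeFS and its PostgreSQL database)
$ docker ps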

User Registration and Creating Repository

  • To create a new repository, register as an admin user at http://127.0.0.1:8000/setup.
  • During registration, choose a username and save the generated credentials, the Key ID and the Secret Key. Log in to your admin user profile with this information.
  • Click the Create Repository button in the admin panel and enter the Repository ID, Storage Namespace, and Default Branch values. After that, press Create Repository. Your initial repository has been created.

Adding Data to A New Repository

lakeFS can be used with the AWS CLI because it exposes an S3-compatible API, so make sure the AWS CLI is installed on your local computer.

  • To configure a new connection to lakeFS with the AWS CLI, type the following command in the terminal, then enter your Key ID and Secret Key values when prompted.
$ aws configure --profile local
  • To see whether the connection works and to list all the repositories in the workspace, type the following command in the terminal:
$ aws --endpoint-url=http://localhost:8000 --profile local s3 ls
# output:
# 2022-01-30 22:57:02 demo-repo
  • Finally, to add new data to the repository by writing it to the main branch, type the following command in the terminal:
$ aws --endpoint-url=http://localhost:8000 --profile local s3 cp ./tweets.txt s3://demo-repo/main/
# output:
# upload: ./tweets.txt to s3://demo-repo/main/tweets.txt
  • Now, the tweets.txt file has been written to the main branch of the demo-repo repository. You can verify this in the lakeFS UI, or list the branch contents from the CLI as shown below.
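
If you prefer the terminal over the UI, the same S3-compatible endpoint can be used to list the contents of the branch. The sketch below reuses the demo-repo repository and main branch from the steps above; the timestamp and file size in the output are illustrative.

$ aws --endpoint-url=http://localhost:8000 --profile local s3 ls s3://demo-repo/main/
# output (illustrative):
# 2022-01-30 23:05:11       1024 tweets.txt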

Committing Changes on Added Data

Changes made to the data can be committed using lakectl, lakeFS’s default CLI client. Please make sure the latest version of the CLI binary is installed on your local computer.

  • To configure the CLI binary settings, type the following command in the terminal:
$ lakectl config
  • To verify the configuration of lakectl, you can list all the branches in the repository with the following command:
$ lakectl branch list lakefs://demo-repo
  • To commit the added data to the repository, type the following command in the terminal:
$ lakectl commit lakefs://demo-repo/main -m 'added my first tweets data to repo!'
  • Finally, to check the committed message, type the following command in the terminal:
$ lakectl log lakefs://demo-repo/main
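
The steps above only commit to the main branch. Because lakeFS also provides the branch and merge operations listed in the feature section, a natural next step is to experiment on an isolated branch and merge it back once the changes look good. The commands below are a sketch of that typical lakectl workflow rather than part of the original demo, and the branch name experiment is just an example.

# create an experiment branch from main
$ lakectl branch create lakefs://demo-repo/experiment --source lakefs://demo-repo/main
# ...modify or add data on the experiment branch, then commit it...
$ lakectl commit lakefs://demo-repo/experiment -m 'created a new version of the tweets data'
# compare the two branches, then merge the experiment back into main
$ lakectl diff lakefs://demo-repo/main lakefs://demo-repo/experiment
$ lakectl merge lakefs://demo-repo/experiment lakefs://demo-repo/main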

5. Conclusion

Today’s world generates more and more data, so data-driven companies must store it correctly and safely. Enterprises need to make sure they use the right data versioning platforms and tools to support their business growth.

Ugur Selim Ozen is a Data Engineer working on Big Data Processing, Data Analytics, Machine Learning, and Cloud Technologies. https://ugurselimozen.github.io/