Top 5 reasons why IPFS is a powerful tool for Machine Learning
With a focus on distributing large datasets and models across services as a pain point.
IPFS (InterPlanetary File System) is a promising new protocol that could replace good old HTTP in many parts of the Internet. It is a peer-to-peer alternative to the client-server architecture that is popular today. I'm not saying that HTTP is going to go extinct. As the history of the Web shows, HTTP is here to stay for small file exchanges, just as SMTP is still a popular messaging protocol. IPFS shines in use cases such as large file transfers and content streaming. That's why I think it is pitched appropriately as a "hypermedia transfer protocol", in contrast to the Hypertext Transfer Protocol. To learn more about IPFS, I recommend reading this article on HackerNoon.
Here, we will look at five compelling reasons why Machine Learning engineers should consider IPFS in their technology stack. We will focus on distributing large datasets and models across services as a pain point in this discussion.
1. there is no single point of failure
Your DevOps team is always figuring out the best ways to distribute services to ensure high availability. There are several ways to accomplish this, and dynamically duplicating nodes across loads and geographies is the most common. But transporting huge datasets and machine learning models over HTTP becomes a pain, because HTTP-based pipelines require three things: resolving URLs in the network, ensuring reliable data transfer, and keeping the active URLs configured across services.
IPFS addresses this with two mechanisms: content addressing and Bitswap. To summarise, you don't have to resolve the active internal URL every time the network changes. A CID (content identifier) on IPFS is universal, regardless of where a file is located. To ensure reliable transport, IPFS downloads a file in chunks from various sources. Because the CID is universal, your services do not need to be reconfigured with the latest URL. Imagine that services like HuggingFace and TensorflowHub started serving pre-trained models through IPFS. You would be able to download models from anyone who uses the same models as you, not only from the original service. Imagine how cheap these services could become as a result.
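To make the two ideas a little more concrete, here is a minimal sketch in Python. It is a toy, not how IPFS is implemented: real CIDs are multihash-encoded and Bitswap is a full exchange protocol. The names `content_id`, `split_into_chunks`, `fetch`, and the in-memory "peers" are all assumptions made up for illustration. What it shows is that an identifier derived from the bytes themselves stays the same wherever the bytes live, and that chunks fetched from different peers can each be verified against their identifier.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def content_id(data: bytes) -> str:
    """Toy content identifier: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fetch(chunk_ids: list[str], peers: list[dict[str, bytes]]) -> bytes:
    """Fetch every chunk from whichever peer holds a valid copy of it."""
    parts = []
    for cid in chunk_ids:
        for store in peers:
            chunk = store.get(cid)
            # Accept a chunk only if its hash matches the requested identifier.
            if chunk is not None and content_id(chunk) == cid:
                parts.append(chunk)
                break
        else:
            raise LookupError(f"no peer holds a valid copy of chunk {cid}")
    return b"".join(parts)


model = b"pretend these bytes are model weights"
chunks = split_into_chunks(model)
chunk_ids = [content_id(c) for c in chunks]

# Two hypothetical peers, each holding only part of the file.
peer_a = {content_id(c): c for c in chunks[::2]}
peer_b = {content_id(c): c for c in chunks[1::2]}
# A misbehaving peer that returns the wrong bytes for the first chunk.
peer_bad = {chunk_ids[0]: b"junk"}

restored = fetch(chunk_ids, [peer_bad, peer_a, peer_b])
print(restored == model)  # True
```

Notice that nothing in `fetch` cares which peer answered. The consumer only needs the list of identifiers, which is why services would not need to be reconfigured when the data moves.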
2. upgradable with backward compatibility
Version control and backward compatibility are at the heart of software engineering. Neglecting them leads to the accumulation of technical debt. This is one of the concerns that prevent developers from incorporating IPFS into their tech stack. IPFS, being a well-designed protocol, addresses these concerns on its own. Take a look at "IPFS multiformats" to learn more about it. With the protocol side taken care of, we only have to consider the technical debt we might accumulate ourselves. And AI is notorious for that. IPFS can help you solve a handful of such challenges that your project may run into.
For the sake of explanation, I'm presuming you're familiar with Git. Git is ideal for small files and repositories, but it is not yet a scalable solution for large files. As a side note, I'd like to draw attention to another project called Git-LFS, which works around this issue. In summary, Git isn't the best option for distributing huge datasets and model checkpoints. Furthermore, with Git we still don't have a batteries-included library or solution for versioned distribution in a cloud-friendly (microservice) manner. With native libraries, API endpoints, and access gateways, IPFS has it all covered.
Consider the examples from the preceding sections to put things in context. HuggingFace, one of the most popular and resource-intensive NLP services, has switched to Git-LFS for model and dataset distribution (they previously used URL-friendly S3 buckets). I'm hoping that, after reading this post, they will adopt IPFS in future editions because of its superiority. For file downloads, TensorflowHub still uses HTTP. As a side note, here are some other ideas to consider in this context:
3. network optimizations and load-balancing
In the first section, we discussed how content on IPFS is discovered and transported between nodes. In this section, we look at why controlling the data flow matters: any system designer should be able to predict and efficiently control how data moves through a system.
IPFS, like any other data management system, allows you to restrict the direction and accessibility of data transfers within a private network. There are two options for doing so. The first is to connect only with peers who share a secret key. This solution requires an external orchestration service such as Kubernetes. The second, which is IPFS-specific, is to create an IPFS cluster.
Content lookup time is an important metric to consider. In the worst case, a DHT lookup requires O(log n) steps, where n is the number of nodes in the DHT (not to be confused with the number of nodes in the network). However, this is not much of a concern in a well-managed network. Pinning services are one solution that helps with this. Adding public or popular gateways to the "peers list" is another.
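To get a feel for the O(log n) figure, here is a small sketch of the XOR distance metric used by Kademlia, the design that the IPFS DHT builds on. Everything here is an assumption for illustration: the helper names, the way identifiers are derived from peer names, and the use of log2(n) as a rough order-of-magnitude estimate rather than an exact hop count.

```python
import hashlib
import math


def node_id(name: str) -> int:
    """Toy 256-bit identifier derived from a name (an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: the bitwise XOR of two identifiers."""
    return a ^ b


def closest_peers(key: int, peers: list[int], k: int = 3) -> list[int]:
    """Return the k peers whose identifiers are nearest to the key."""
    return sorted(peers, key=lambda p: xor_distance(p, key))[:k]


def rough_lookup_steps(n: int) -> int:
    """Order-of-magnitude estimate of lookup steps for n DHT nodes."""
    return math.ceil(math.log2(n))


peers = [node_id(f"peer-{i}") for i in range(100)]
key = node_id("some-dataset")

nearest = closest_peers(key, peers)
print(len(nearest))                   # 3
print(rough_lookup_steps(1_000_000))  # 20
```

Each lookup step moves to peers that are closer to the key under this metric, which is where the logarithmic behaviour comes from: even a million DHT nodes need only on the order of twenty steps.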
The "Pinning" feature of IPFS can ensure that you have access to a file even if no one else is actively using it.
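As a hedged sketch of what pinning can look like from a service, the snippet below builds a request to the `pin/add` endpoint of a node's HTTP RPC API, using only the Python standard library. The API address `http://127.0.0.1:5001` is an assumption about a locally running node, and the helper names `build_pin_request` and `pin` are made up for this example. The same operation is available on the command line as `ipfs pin add <cid>`.

```python
import json
import urllib.parse
import urllib.request

API = "http://127.0.0.1:5001"  # assumed address of a local node's RPC API


def build_pin_request(cid: str, api: str = API) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks the node to pin a CID."""
    query = urllib.parse.urlencode({"arg": cid})
    return urllib.request.Request(f"{api}/api/v0/pin/add?{query}", method="POST")


def pin(cid: str, api: str = API) -> dict:
    """Send the pin request and return the parsed JSON response.

    This needs a running node that is reachable at `api`.
    """
    with urllib.request.urlopen(build_pin_request(cid, api), timeout=30) as resp:
        return json.load(resp)


# "QmExampleCid" is a placeholder, not a real CID.
request = build_pin_request("QmExampleCid")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the example easy to inspect without a node running. In a real pipeline you would call `pin(cid)` right after publishing a dataset or checkpoint.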
4. untrusted storage
This point is probably obvious by now: there is no need for a trusted third party to store data on IPFS. This is a powerful idea when it comes to storing enormous amounts of data. Even though the average cost of storage is decreasing, the cost of the bandwidth needed to access it is not falling fast enough to keep up with demand. When safety is a concern, the effect is amplified.
Instead of dedicated services, IPFS gives users the option to choose potentially unsafe storage on the strength of the reliable bandwidth they could get from it. Basic encryption techniques on top of this can offer better privacy. Here are a few pointers: use a lightweight, encrypted transport such as WireGuard (part of the Linux kernel since version 5.6), encrypt data before storing it, and store your data in fragmented, scrambled form. IPFS also comes with several other privacy features.
5. native to neural networks?
Instead of discussing practical use cases in this section, I'd like to offer a small thought experiment, with a special mention of the IPLD project that is part of IPFS. In deep learning, we are dealing with graph structures, whether it's a vanilla or graph neural network design, a knowledge graph database, or basic text data in NLP. The point is that every change we make to these data structures, as part of learning or a knowledge update, is local: it changes only a part of a large data structure and leaves the rest unaltered. Yet our present tools don't allow elegant propagation of these changes. Instead, with each update we make, we distribute the full data structure.
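The thought experiment can be sketched in a few lines of Python. This is only a toy hash-linked structure in the spirit of IPLD, not its actual data model or API. The layer names, the `publish` and `links_of` helpers, and the in-memory `store` are all assumptions for illustration. Each layer of a model becomes its own content-addressed block, and a small root block links to them.

```python
import hashlib
import json


def block_id(data: bytes) -> str:
    """Toy block identifier: the SHA-256 hex digest of the block's bytes."""
    return hashlib.sha256(data).hexdigest()


def publish(layers: dict[str, bytes], store: dict[str, bytes]) -> str:
    """Store each layer as a block, plus a root block linking to them.

    Returns the identifier of the root block.
    """
    links = {}
    for name, weights in layers.items():
        bid = block_id(weights)
        store[bid] = weights  # identical content maps to the same key
        links[name] = bid
    root = json.dumps(links, sort_keys=True).encode()
    root_id = block_id(root)
    store[root_id] = root
    return root_id


def links_of(root_id: str, store: dict[str, bytes]) -> dict[str, str]:
    """Read the name-to-block links back out of a root block."""
    return json.loads(store[root_id])


store: dict[str, bytes] = {}

# Version 1 of a hypothetical three-layer model.
v1 = {"embed": b"E" * 8, "encoder": b"N" * 8, "head": b"H" * 8}
root_v1 = publish(v1, store)
blocks_after_v1 = set(store)

# Version 2: only the head is fine-tuned; the other layers are unchanged.
v2 = dict(v1, head=b"h" * 8)
root_v2 = publish(v2, store)

new_blocks = set(store) - blocks_after_v1
print(len(new_blocks))  # 2: the new head block and the new root block
```

A consumer that already holds version 1 would only need those two new blocks to reach version 2. That is the local, incremental propagation of changes the paragraph above is asking for.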
5+ There's more
We have discussed the most compelling reasons for adopting IPFS in your ML projects. However, the possibilities are only limited by your creativity. Take a look at what the IPFS community is encouraging ML engineers to do. Big thanks to Zachary Whitley.
At Aquila Network, we're building a federated search engine to ensure unbiased content discovery for independent content creators. We're engineering the next generation of distributed search engines and honestly, we think the tech stack is cool. We're also evaluating and leveraging IPFS in different parts of it. If you are interested in integrating IPFS into your tech stack and looking for a helping hand please let us know.
We're building in public. Take a look at our GitHub repositories. Also, if you would like to help us with a star on GitHub, do it on AquilaDB.