PyTorch Basics: Data Prep and Modeling

11m • ai-ml • interview • beginner

Key Points

PyTorch is an open‑source machine‑learning and deep‑learning framework hosted by the PyTorch Foundation (part of the Linux Foundation) that offers a community‑driven, openly governed ecosystem.
It streamlines the typical training workflow—data preparation, model building, training, and testing—by providing built‑in utilities for each stage.
For data handling, PyTorch supplies Dataset and DataLoader classes that make data accessible (including downloading it), then batch, shuffle, and iterate over it; this matters most for large datasets, which the speaker says can reach terabytes or petabytes.
Model construction is simplified with a rich library of pre‑defined layers (e.g., linear, convolutional) and activation functions, allowing users to define complex deep‑learning architectures with minimal friction.
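
The data-handling and model-building points above can be sketched in a few lines. This is a minimal illustration rather than code from the video: `ToyDataset`, its sizes, the random seed, and the layer widths are all hypothetical, and the data is generated in memory instead of being downloaded.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: 100 samples, 4 features, 3 classes."""

    def __init__(self, n_samples=100, n_features=4, n_classes=3):
        generator = torch.Generator().manual_seed(0)
        self.features = torch.randn(n_samples, n_features, generator=generator)
        self.labels = torch.randint(0, n_classes, (n_samples,), generator=generator)

    def __len__(self):
        # DataLoader uses this to know how many samples exist.
        return len(self.features)

    def __getitem__(self, index):
        # Return one (input, target) pair; DataLoader stacks these into batches.
        return self.features[index], self.labels[index]


dataset = ToyDataset()

# Batches of 16 samples, reshuffled on every pass over the data.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Two predefined linear layers with a ReLU activation between them.
# The activation is what adds nonlinearity to the model.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 3),
)

# Pull one batch from the iterator and run a forward pass.
batch_features, batch_labels = next(iter(loader))
logits = model(batch_features)
print(batch_features.shape, batch_labels.shape, logits.shape)
```

Subclassing `Dataset` only requires `__len__` and `__getitem__`; batching and shuffling are left to `DataLoader`, so the same loader code works whether the samples come from memory or from disk.
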

**Source:** [https://www.youtube.com/watch?v=fJ40w_2h8kk](https://www.youtube.com/watch?v=fJ40w_2h8kk)

**Duration:** 00:11:56

## Sections

- [00:00:00](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=0s) **Introducing PyTorch and Its Core Workflow** - In this segment, an expert outlines PyTorch as an open-source machine-learning framework backed by the Linux Foundation and walks through its primary features, including data preparation, model building, training, and testing.
- [00:03:38](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=218s) **Nonlinearity and Loss Functions in Training** - The speaker explains that adding nonlinearity prevents a model from reducing to a straight line, then outlines the training loop: randomly initializing parameters, performing a forward pass, computing the loss against the target, and using PyTorch's loss functions to guide the model toward the desired output.
- [00:07:10](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=430s) **PyTorch: Easy, Flexible, Multi-Platform** - The speaker highlights PyTorch's beginner-friendly, Pythonic design, comprehensive documentation, and its flexibility to run on CPUs, GPUs, distributed clusters, and mobile devices, while also addressing how users can become contributors.
- [00:10:46](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=646s) **PyTorch Community Highlights & Benefits** - The speaker describes IBM's contributions to PyTorch (storage, compiler optimizations, benchmarking, documentation, and multi-GPU distributed support) and encourages viewers to join the active community.

## Full Transcript

0:00 PyTorch has emerged as the de facto standard for machine learning and deep learning. And I know a little bit about PyTorch, but I've brought in an expert, Sahdev Zala, to teach us all more about PyTorch. So, Sahdev, what is PyTorch?

0:16 Hi Brad! So it's a framework for machine learning and deep learning. And what I mean by that is you can use PyTorch to build your models because it provides you all of the building blocks. It provides you all the functionalities to run faster training on that model. And it's an open source project under the PyTorch Foundation, which is part of the Linux Foundation. So there is a dynamic community behind the project.

0:46 Oh, great! So it's got an ecosystem and it's in the Foundation. So that means you're going to have open governance and a level playing field. That's wonderful. Well, Sahdev, can you tell me about the key features of PyTorch?

0:58 Yeah, sure. That's a great question. So let me just mention the common steps of model training. So first, you need to prep your data, your dataset for training. And ideally you also want to do it for testing. And then the other step is you're going to build your model. And you're going to train it. And as I mentioned, you're going to test.

1:34 Okay. So those look like some pretty straightforward features. Why don't you tell me about the first one? What do you mean by prepping the data?

1:41 Right. So the datasets you are going to use for your model may be small as you're learning, but for larger models these datasets can be huge: 10 terabytes, petabytes wide. So how do you use this data to train your model and then test it? So PyTorch provides you two things here: the Dataset and DataLoader classes, which help you to easily feed this data for your training and testing.

2:14 Okay. How does this help me? Does it speed things up?

2:17 Well, that's a good question.
So, it helps you to download the data to make it accessible for your training and testing. And this data loader, it provides you an iterator over this data so that you can use it to train in batches. Because you're not going to just feed one data item at a time. You're going to train using the batch sizes that you want. It also provides you other things like shuffling the data. You don't want to just feed the data in order, so that your model is only memorizing the data versus learning. So this will shuffle for you as well. It has other features as well.

2:53 Very nice. Well, it also helps you to build models?

2:56 Absolutely. So, once you think you're ready with your preparation using PyTorch, because it takes care of all the complexity, the next step would be building the model, defining your model, and for that what you need is layers, because it's deep learning, it's made of multiple layers. So you need different layers, like a linear layer or a convolutional layer. And there are many others that are provided by PyTorch to you. And there are also things, besides layers, like activation functions that you'll be using to add nonlinearity to your model. That's also provided to you by PyTorch. So you don't have to do anything but just call those functions.

3:38 What do you mean by nonlinearity?

3:40 So, in general, when you train the model, and it's a mathematical term, right, linear as well, but if you don't add nonlinearity, you basically just get one straight line. And in real life, not everything works so that a change in X gives the same change in your Y output. So it adds that nonlinearity for you. And the next step would be training, and there, basically, I can talk more about the training side, Brad.

4:12 Well, so tell me about features: what does it do to help you train?

4:15 So training will require you to use the loss function.
And the loss function is basically to find out the loss that you're going to have. As when you run this model, like a forward pass from the input, you get some output. Well, I'm not going to have the correct output every time magically, there's no magic there. So you can have lots of parameters in between, you are just going to randomize them in the beginning, you get some output, but then you're going to have a loss function to calculate the loss from the desired output.

4:47 So you want your model to reach a certain expectation. And typically during the training process that model's falling short and you're seeing how much it's falling short from where you want it to be.

4:58 Exactly. So those loss functions, there are multiple loss functions and PyTorch provides them to you again. Again, you call them according to your need for the model. Once you have the loss function used, the next big thing is finding the gradient of this loss with regard to your parameters. So PyTorch provides the backward propagation for you, or the autograd feature. That is by far one of the most popular features of PyTorch, that it will calculate the gradient for you.

5:37 So if we all think back to our calculus days, gradients are this piece that helps you to tweak and get the model the way you want it, and it's got it built in for you.

5:49 Exactly. So once you've got the gradient, you basically run the optimizer function just to step over, which is again provided by PyTorch to you. And like you exactly said, you're going to tweak the parameters, you're going to optimize it to reach a level in a number of iterations. So you basically define those iterations. But in that number of iterations you're going to reach a level where we're like, you know what, that should be enough training. I do like 3x, 5x iterations, and at that point you are ready to test it.
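
The training steps described in this exchange (forward pass, loss function, backward propagation through autograd, optimizer step, repeated for a chosen number of iterations) can be sketched as follows. This is an illustrative example, not code shown in the video: the synthetic regression data, the model shape, the choice of `nn.MSELoss` and `torch.optim.SGD`, the learning rate, and the iteration count are all assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical regression data: y = 3x + 1 plus a little noise.
inputs = torch.linspace(-1, 1, 64).unsqueeze(1)
targets = 3 * inputs + 1 + 0.05 * torch.randn_like(inputs)

# Parameters start out randomly initialized.
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()  # one of several built-in loss functions
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

losses = []
for step in range(200):  # the number of iterations is chosen by the user
    optimizer.zero_grad()                 # clear gradients from the last step
    predictions = model(inputs)           # forward pass
    loss = loss_fn(predictions, targets)  # how far the output is from the target
    loss.backward()                       # autograd computes the gradients
    optimizer.step()                      # tweak the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Calling `optimizer.zero_grad()` on every iteration matters because PyTorch accumulates gradients across `backward()` calls by default.
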

6:22 And is that a big deal for these models to have to do testing, or do I just test once I'm done? Or is it more than that?

6:28 Yeah. So the next step would be, you know, from here to the test side. You need to test it, ideally; it's optional. But as part of testing, PyTorch provides a function, eval, for evaluation. So you can evaluate your model. And at that point, you're not going to calculate the gradient, you're not going to find the loss function. You basically just do the forward pass. You see what you're getting. And if you're happy with it, then you're pretty much ready to use the model. If you're not, then you're going to do further training. And again, the datasets which I mentioned earlier, the one used for training and the one used for test, are two different datasets.

7:10 So as part of the testing, I'm getting to decide, hey, is my model good enough? I think I'm ready to go with it.

7:15 Pretty much, yeah.

7:17 Well, it all seems a little complicated to me. Is PyTorch really easy to use?

7:22 Well, yes, that's one of the best things I love about PyTorch. It's easy to get started. It's easy to install. It's easy to use because it's Pythonic; the "Py" in PyTorch is for Python. So you know how much data scientists just love Python. Absolutely. PyTorch is in Python. And it's easily used by data scientists. And if someone doesn't know Python, they can learn it quickly as well. PyTorch.org provides a lot of good documentation and tutorials that will help you to get started very quickly, and it's also flexible. So I mentioned the training, right? You can run training, you can run your PyTorch on CPU, just using the tensor, the data structure that PyTorch uses (multi-dimensional arrays). They can be run on CPUs, they can be run on GPUs, you can do the training on multiple CPUs and GPUs on a single machine, you can do that in a distributed environment on multiple machines, multiple GPUs.
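
The evaluation step and the CPU/GPU flexibility mentioned above can be sketched together. This is a hedged illustration, not code from the video: the model is untrained, the test batch is random and hypothetical, and `model.eval()` with `torch.no_grad()` is assumed to be the evaluation mechanism the speaker is referring to.

```python
import torch
from torch import nn

# Use a GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Untrained stand-in model, moved to the chosen device.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3)).to(device)

# Hypothetical held-out test batch, kept separate from any training data.
generator = torch.Generator().manual_seed(0)
test_features = torch.randn(32, 4, generator=generator).to(device)
test_labels = torch.randint(0, 3, (32,), generator=generator).to(device)

model.eval()            # switch layers such as dropout to inference behavior
with torch.no_grad():   # forward pass only; no gradients are tracked
    logits = model(test_features)
    predicted = logits.argmax(dim=1)
    accuracy = (predicted == test_labels).float().mean().item()

print(f"device: {device}, accuracy: {accuracy:.2f}")
```

Because the model here is untrained, the accuracy value is not meaningful; the point is only that tensors and model live on the same device and that evaluation runs the forward pass without computing gradients.
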
And, like I said, as part of that you can just run PyTorch on your laptop and play with it. There is also mobile development going on to help run PyTorch on your mobile devices.

8:44 So yeah, it's a lot of options. Supports a lot of platforms. GPUs. CPUs. Well, what if I want to be a contributor?

8:53 Well, that's a great question. Something I love as a contributor myself, so it's actually very easy. PyTorch is part of the PyTorch Foundation, as I mentioned. There's a dynamic community behind it, very friendly. Lots of people are going to help you to get started, to contribute. As long as you sign the CLA and follow the code of conduct, these are the things to do, you are ready to contribute. The community also provides weekly office hours.

9:22 Office hours, that's huge. I can come in as a new person and say, hey, could you help me out, or can you give me an easy first item to work on? I could do that in an office hours.

9:30 Yes, exactly. And there are things like you can easily find the good first issues. You can find the documentation issues to get started with and you can ask questions. And the office hours, through their Slack channel, is another one.

9:42 And one of the classic tips is when you join a new project, ask for a mentor and ask them to put you to work on something. Because when they put you to work on something, they're going to be very interested in what you're doing and they're going to give you timely reviews and answer all your questions. So tell me more about how IBM is contributing to PyTorch.

10:01 Yeah, sure. Well, IBM is contributing to PyTorch in a big way, like IBM always does. By using PyTorch, so we are going to contribute to help the community, grow the community.
And as part of that, we are working on many different things, something called FSDP, Fully Sharded Data Parallel, well, an advanced topic, but it helps you to shard the model parameters across multiple GPUs across multiple machines for fast training, and for your large models that may not fit in a single GPU or CPU. And so we are contributing there. There are really good blog posts out there. Just search for "IBM FSDP PyTorch" and you will find it quickly. Highly recommend reading it. We also provide improvements on the storage side for training, compiler optimizations. And besides that, benchmarking, test-side improvements and documentation. And we have multiple developers working in the community.

11:04 So it sounds like there's lots of nice features to help it support those large foundation models, supporting multiple GPUs and running in a distributed fashion. And a lot of work being done for benchmarking, seeing how fast things are running, and obviously a lot of work in the documentation to help others get started. It's fabulous.

11:23 It is. It's amazing. I'm so glad to be part of the community.

11:27 Well, thank you, Sahdev. I've learned a lot today. This is fabulous. We hope that you've learned a lot about PyTorch and we encourage you to come join the community. We really enjoy working on PyTorch and pushing forward with your deep learning/machine learning initiatives. Thanks for watching our video. And don't forget, if you liked it, remember to hit like and subscribe.