PCA Simplifies Multi-Dimensional Loan Risk

Key Points

  • Principal Component Analysis (PCA) compresses high‑dimensional data into a few “principal components” that preserve most of the original information.
  • In risk management, loans have dozens or hundreds of attributes (e.g., amount, credit score, age, debt‑to‑income), making it hard to compare them directly.
  • Reducing dimensions with PCA speeds up machine‑learning training and inference and simplifies visual analysis, turning complex data into 2‑ or 3‑dimensional plots.
  • Simple visualizations (one‑dimensional line, two‑dimensional scatter, three‑dimensional axes) show clear loan clusters, but adding more dimensions quickly becomes unwieldy without dimensionality reduction.
  • PCA therefore provides a systematic way to identify and retain the most important features while discarding less‑informative ones, enabling effective clustering, modeling, and visualization of loan risk.

Full Transcript

# PCA Simplifies Multi-Dimensional Loan Risk

**Source:** [https://www.youtube.com/watch?v=ZgyY3JuGQY8](https://www.youtube.com/watch?v=ZgyY3JuGQY8)
**Duration:** 00:08:45

## Sections

- [00:00:00](https://www.youtube.com/watch?v=ZgyY3JuGQY8&t=0s) **PCA for Loan Risk Analysis** - The speaker explains how PCA compresses many loan attributes into a few principal components to identify similarities and assess risk.
- [00:03:07](https://www.youtube.com/watch?v=ZgyY3JuGQY8&t=187s) **Visualizing PCA Dimensionality Reduction** - The passage explains how PCA compresses multi-dimensional data into two principal components for scatter-plot visualization, outlines its historical roots, and highlights its role in combating the curse of dimensionality for machine learning.
- [00:06:19](https://www.youtube.com/watch?v=ZgyY3JuGQY8&t=379s) **PCA Concepts and Real-World Applications** - The speaker explains how PC1 captures the greatest variance and must be uncorrelated with PC2, then outlines practical PCA uses, including image compression, data visualization, noise filtering, and medical diagnosis exemplified by a breast-cancer dataset.

## Full Transcript
0:00 Principal component analysis, or PCA, reduces the number of dimensions in large data sets 0:05 to principal components that retain most of the original information. 0:09 And let me give you an example of why that matters. 0:12 So consider a risk management scenario. 0:15 We want to understand which loans have similarities to each other for the purposes of understanding 0:20 which types of loans are typically paid back, and which types of loans are going to be more risky. 0:25 Now take a look at this table here, which shows data for six loans. 0:30 Now these loans contain multiple dimensions, 0:33 like how much the loan is for, the credit score of the person applying for the loan, and stuff like that. 0:39 And while we're showing four dimensions here, 0:42 a loan consists of many, many more dimensions than this. 0:46 So for example, the age of the borrower would be another one. 0:49 Debt-to-income ratio is another one as well. 0:52 And that's just for starters. 0:53 There could potentially be hundreds or even thousands of dimensions. 0:59 And PCA is a process of figuring out the most important dimensions, or the principal components. 1:06 Now, intuitively, I think we know that some dimensions are more important than others when considering risk. 1:11 So, for example, I'd imagine, and I'm not a financial analyst, 1:15 but still, I'd imagine that credit score is probably more important than the years a borrower has spent in their current job. 1:23 Probably. 1:24 And if we get rid of these non-important or less important dimensions, we'll see two big benefits. 1:29 One is faster training and inference in machine learning, as there'll be less data to process with fewer dimensions. 1:36 And then secondly, data visualization becomes easier if there are only two dimensions. 1:42 And let me show you what I mean by that.
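The loan table described above can be sketched as a small feature matrix. The numbers below are invented for illustration (the video doesn't give the actual values); standardizing each dimension first is a typical preprocessing step before PCA, so that a large-valued feature like loan amount doesn't dominate the smaller-valued ones.

```python
import numpy as np

# Hypothetical values for six loans; columns are loan amount,
# credit score, annual income, and years at current job.
loans = np.array([
    [ 10_000, 620,  35_000,  2],
    [ 12_000, 640,  38_000,  3],
    [ 15_000, 650,  40_000,  1],
    [250_000, 780, 120_000,  8],
    [300_000, 790, 150_000, 10],
    [280_000, 800, 140_000,  7],
], dtype=float)

# Standardize each dimension to zero mean and unit variance so every
# attribute contributes on the same scale.
standardized = (loans - loans.mean(axis=0)) / loans.std(axis=0)
```

Each row of `standardized` is now one loan expressed in comparable units across all four dimensions, ready for plotting or for PCA.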
1:45 So if we only measure one dimension, let's take loan amount, 1:48 we can plot that on a number line that shows us that loans one, two, and three have relatively low values, 1:55 and then loans four, five, and six have relatively high values. 2:08 So this tells us that loan one is more similar to loan two than it is to, say, loan six when we consider just the dimension of loan amount. 2:09 Okay. 2:09 Now let's bring in a second dimension of credit score. 2:13 So now loan amount spans the x axis and credit score is on the y axis. 2:18 And we can see two clusters: loans one, two, and three cluster on the lower left, and loans four, five, and six 2:24 cluster on the top right. 2:27 Cool. 2:28 What about adding a third dimension to our scatter plot of annual income? 2:32 Well, that gives us a z axis. 2:34 And now we're looking at data in 3D. 2:37 We'll still see some clustering here, with loans four, five, and six closer to the front of the z axis, indicating relatively high income amounts. 2:45 Now, if I want to keep going, adding a fourth dimension, well, things are going to get complicated. 2:51 Perhaps we could use color coding or different shapes, but it's becoming unwieldy. 2:56 And what if we want to add another one or two or 100 dimensions to our visualization on top of that? 3:02 Well, thankfully, this is where principal component analysis comes in. 3:07 PCA can take four or more dimensions of data and plot them. 3:12 This results in a scatter plot with the first principal component, which we call PC1, on the x axis, and the second principal component, which we call PC2, on the y axis. 3:21 The scatter plot shows the relationships between observations, the data points, and the new variables, the principal components. 3:27 The position of each point shows the values of PC1 and PC2 for that observation. 3:32 Effectively, we've kind of squished down potentially hundreds of dimensions into just two, and now we can see correlations and clusters. 3:41 But how does this all work?
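The "squishing" into PC1 and PC2 described above can be sketched in plain NumPy using the singular value decomposition, which is one standard way to compute PCA. The data here is a random stand-in for a standardized loan table, not the video's actual numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a loan table: 6 observations, 4 dimensions.
X = rng.standard_normal((6, 4))
X = X - X.mean(axis=0)              # PCA operates on mean-centered data

# The rows of Vt are the principal directions, ordered by variance.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Project every observation onto the first two principal components.
# Each row of `scores` is the (PC1, PC2) position plotted in the scatter.
scores = X @ Vt[:2].T
print(scores.shape)                 # (6, 2)
```

However many columns `X` has, `scores` always has two, which is exactly what makes the 2D scatter plot possible.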
3:44 Well, let's take a closer look at principal component analysis. 3:48 Now, PCA is not exactly new. 3:52 The development of PCA is actually credited to Karl Pearson back in 1901. 4:01 But it has gained popularity with the increased availability of computers that could perform statistical computations at scale. 4:08 Now, today, PCA is commonly used for data pre-processing for use with machine learning algorithms and applications. 4:21 So we've come from 1901 down to machine learning. 4:24 It can extract the most informative features while still preserving the most relevant information from the initial data set. 4:33 Because after all, the more dimensions in the data, the higher the negative impact on model performance. 4:40 And that impact actually has a pretty cool name. 4:43 It's called the curse of dimensionality, and PCA can help us make sure that we can limit that very curse. 4:56 Now, by projecting a high-dimensional data set into a smaller feature space, PCA also helps with something else, and that is called overfitting. 5:08 So with PCA we can minimize the effects of overfitting. 5:12 And what is overfitting? 5:13 Well, this is where models will generalize poorly to new data that was not part of their training. 5:19 Now, there's a good deal of linear algebra and matrix operations behind how PCA works, and I'll spare you from that in this video. 5:27 But at a high level, what PCA is doing is summarizing the information content of large data sets into a smaller set of uncorrelated variables, known as principal components. 5:39 These principal components are linear combinations of the original variables that have the maximum variance compared to other linear combinations. 5:46 Essentially, these components capture as much information from the original data set as possible. 5:51 Now, the two major components calculated in PCA are, first of all, the first principal component, which we abbreviate to PC1, and then the second principal component, PC2.
6:06 Now, the first principal component, PC1, is the direction in space along which the data points have the highest, or the most, variance. 6:16 It's the line that best represents the shape of the projected points. 6:19 The larger the variability captured in the first component, the larger the information retained from the original data set, and no other principal component can have a higher variability than PC1. 6:30 Now, PC2 accounts for the next highest variance in the data set, and it must be uncorrelated with PC1. 6:38 So the correlation between PC1 and PC2 equals zero. 6:42 All right. So where is PCA useful? 6:45 Let's talk about a couple of use cases. 6:51 Now, I think one use case where we've seen a lot of PCA use is in an area related to image compression. 7:00 So PCA reduces image dimensionality while retaining essential information. 7:05 It effectively helps create compact representations of images, making them easier to store and transmit. 7:12 Now, we've already seen how this can also be used for data visualization. 7:18 PCA helps to visualize high-dimensional data by projecting it into a lower-dimensional space, like a 2D or 3D plot. 7:26 And it's also very useful in noise filtering. 7:30 And by noise 7:31 here, I'm talking about noise in data. 7:34 This is a common use case where PCA can remove noise or redundant information from data by focusing on the principal components that capture the underlying pattern. 7:43 PCA also has applicability within the healthcare area as well. 7:51 Now, for example, it's assisted in diagnosing diseases earlier and more accurately. 7:56 Now, one study used PCA to reduce the dimensions of six different data attributes in a breast cancer data set, 8:02 things like the smoothness and perimeter of the lump. 8:06 Then a supervised learning classification algorithm, logistic regression, was applied to predict whether breast cancer is actually present.
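The two properties just described, that PC1 captures the most variance and that PC2 is uncorrelated with PC1, can be checked numerically. This is a sketch on synthetic correlated data, not the breast-cancer study mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: 200 observations, 5 dimensions.
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)                  # center before PCA

# Principal directions from the SVD; project data onto all of them.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                       # column i holds the PC(i+1) scores

variances = scores.var(axis=0)
# PC1 captures the most variance; no later component exceeds it.
print(np.argmax(variances) == 0)                              # True
# PC1 and PC2 are uncorrelated: their covariance is numerically zero.
print(abs(np.cov(scores[:, 0], scores[:, 1])[0, 1]) < 1e-8)   # True
```

The same checks hold for every pair of components, which is what "a smaller set of uncorrelated variables" means in practice.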
8:14 Look, essentially, if you have a large data set with many dimensions and you need to identify the most important variables in the data, take a good look at PCA, because it might be just what you need in your modern machine learning applications, which is not at all bad for a technique first developed in 1901. 8:35 If you like this video or want to see more like it, please like and subscribe. 8:40 If you have any questions or want to share your thoughts about this topic, please leave a comment below.