I've joined a fintech startup tasked with improving customer loan offers. Using the Bondora P2P Loans dataset, I'll build insights into which factors help determine a person's interest rate. I will work my way toward a predictive model that estimates loan interest rates, which will guide my company in personalizing loan terms efficiently.
The features used in this project are:
- VerificationType: Method used for loan application data verification
- Age: Age of the borrower (years)
- AppliedAmount: Amount applied
- Amount: Amount the borrower received
- Interest: Interest rate
- LoanDuration: The loan term
- Education: Education of the borrower
- EmploymentDurationCurrentEmployer: Employment time with the current employer
- HomeOwnershipType: Home ownership type
- IncomeTotal: Total income
- ExistingLiabilities: Borrower's number of existing liabilities
- RefinanceLiabilities: The total amount of liabilities after refinancing
- Rating: Bondora Rating issued by the Rating model
- NoOfPreviousLoansBeforeLoan: Number of previous loans
- AmountOfPreviousLoansBeforeLoan: Value of previous loans
Before I start working on the dataset, it is good practice to import all libraries at the beginning of my code.
Now that I have imported the right libraries, I can use Pandas to load the data from the CSV.
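A minimal sketch of the import-and-load step. The real file name is not given here, so an in-memory CSV with a few made-up rows stands in for it; in the notebook the same `pd.read_csv` call would point at the actual Bondora file instead.

```python
import io

import pandas as pd

# Stand-in for the real CSV (its path is not stated in this write-up).
# The rows below are invented purely to make the example runnable.
sample_csv = io.StringIO(
    "Age,AppliedAmount,Amount,Interest,Rating\n"
    "34,2000,2000,24.5,C\n"
    "51,5000,4500,31.2,E\n"
    "27,1500,1500,18.9,B\n"
)

loans = pd.read_csv(sample_csv)
print(loans.head())
```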
Before starting my analysis, I take some time to familiarize myself with the dataset and understand the available information. This is also a good opportunity to clean the data by removing missing values and adjusting the index.
I explore the dataset to find the number of rows and columns and the data type of each feature, and I identify any missing values.
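The exploration and cleaning steps could look roughly like this. The tiny DataFrame is made up, and reading "adjusting the index" as `reset_index(drop=True)` after dropping rows is an assumption about what the cleaning involved.

```python
import numpy as np
import pandas as pd

# Made-up rows with a couple of gaps, standing in for the real dataset.
loans = pd.DataFrame({
    "Age": [34, 51, np.nan, 45],
    "Amount": [2000.0, 4500.0, 1500.0, 3000.0],
    "Rating": ["C", "E", "B", None],
})

print(loans.shape)   # (number of rows, number of columns)
print(loans.dtypes)  # data type of each feature

# Count missing values per column.
missing_per_column = loans.isna().sum()
print(missing_per_column)

# Drop rows with any missing value, then rebuild a clean 0..n-1 index.
clean = loans.dropna().reset_index(drop=True)
```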
To recommend loan offers, I first take a moment to understand the loan amounts and ratings. I also want a rough idea of the interest rates being paid.
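One way to get that overview is with summary statistics and simple group-bys. This is only a sketch on invented rows; the column names come from the feature list, the values do not.

```python
import pandas as pd

# Invented rows for illustration only.
loans = pd.DataFrame({
    "Amount": [2000, 4500, 1500, 3000, 1000],
    "Interest": [24.5, 31.2, 18.9, 27.0, 35.4],
    "Rating": ["C", "E", "B", "C", "HR"],
})

# Count, mean, spread, and quartiles for loan amounts and interest rates.
numeric_summary = loans[["Amount", "Interest"]].describe()

# How many loans fall into each rating, and the mean rate per rating.
rating_counts = loans["Rating"].value_counts()
mean_interest_by_rating = loans.groupby("Rating")["Interest"].mean()

print(numeric_summary)
print(rating_counts)
print(mean_interest_by_rating)
```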
Customers with a high debt-to-income ratio and less job stability may have more difficulty repaying loans, making them riskier.
I want to identify these customers and add a flag to their loans. I consider borrowers to have less job stability if they have been in their current job for less than 1 year (including those in the trial period). In this scenario, a debt-to-income ratio above 0.35 is considered risky.
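A sketch of how such a flag might be built. Two things here are assumptions rather than facts from the dataset: the ratio is taken as `Amount / IncomeTotal`, and the short-tenure category labels (`"UpTo1Year"`, `"TrialPeriod"`) are hypothetical and would need to match the actual values in `EmploymentDurationCurrentEmployer`.

```python
import pandas as pd

# Invented rows for illustration only.
loans = pd.DataFrame({
    "Amount": [2000, 4500, 1500, 3000],
    "IncomeTotal": [4000, 9000, 6000, 5000],
    "EmploymentDurationCurrentEmployer": [
        "UpTo1Year", "TrialPeriod", "UpTo1Year", "MoreThan5Years",
    ],
    "Interest": [30.0, 32.0, 22.0, 25.0],
})

SHORT_TENURE_LABELS = {"UpTo1Year", "TrialPeriod"}  # assumed category labels
RATIO_THRESHOLD = 0.35

# Assumed definition of the ratio: amount received relative to total income.
ratio = loans["Amount"] / loans["IncomeTotal"]
short_tenure = loans["EmploymentDurationCurrentEmployer"].isin(SHORT_TENURE_LABELS)

# A loan is flagged only when both conditions hold.
loans["Risky"] = (ratio > RATIO_THRESHOLD) & short_tenure

risky_mean = loans.loc[loans["Risky"], "Interest"].mean()
non_risky_mean = loans.loc[~loans["Risky"], "Interest"].mean()
risky_share = loans["Risky"].mean()
print(risky_mean, non_risky_mean, risky_share)
```

Keeping the threshold and the labels as named constants makes it easy to revisit the definition of "risky" later.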
- Risky loans mean interest rate: 28.86%
- Non-risky loans mean interest rate: 26.99%
- Risk premium: the mean rate for risky loans is about 1.87 percentage points higher than for non-risky loans (28.86% − 26.99% = 1.87 points). Does this reflect a riskier nature?
- Yes. The higher rate for risky loans is consistent with risk-based pricing: lenders charge more to compensate for the higher likelihood of default. Since risky loans make up only about 15.88% of the data, their effect on the overall average rate is present but modest: the overall rate sits between the two subgroup means, closer to the non-risky rate because non-risky loans are the larger share.
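The weighted-average reasoning can be checked with a few lines of arithmetic. The inputs are the rounded figures reported above, so the overall mean printed here is only the value they imply, not a number taken from the dataset.

```python
# Rounded figures from the analysis above.
risky_share = 0.1588      # fraction of loans flagged as risky
risky_mean = 28.86        # mean interest rate of risky loans (%)
non_risky_mean = 26.99    # mean interest rate of non-risky loans (%)

# Difference between the two subgroup means, in percentage points.
risk_premium = risky_mean - non_risky_mean

# The overall mean is the share-weighted average of the subgroup means.
overall_mean = risky_share * risky_mean + (1 - risky_share) * non_risky_mean

print(f"Risk premium: {risk_premium:.2f} percentage points")  # 1.87
print(f"Implied overall mean rate: {overall_mean:.2f}%")      # 27.29
```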
As a fintech analyst, understanding customer profiles allows me to identify patterns. I want to understand how different borrower characteristics influence loan applications.
To help me identify the different profiles, I decide to use my visualization skills to uncover actionable patterns for tailoring loan offers.
I also want to investigate the relationship between certain numerical features and the interest rate. For that, I decide to use scatter plots, along with the correlation between features.
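A sketch of the correlation step using pandas' `corrwith`, which computes the Pearson correlation of each selected column with the target. The rows and the choice of features are invented for illustration; the scatter plots themselves are left as a comment because they need matplotlib.

```python
import pandas as pd

# Invented rows for illustration only.
loans = pd.DataFrame({
    "Interest": [35.0, 30.0, 28.0, 22.0, 18.0],
    "AmountOfPreviousLoansBeforeLoan": [0, 1000, 3000, 6000, 9000],
    "Age": [25, 40, 31, 52, 38],
})

features = ["AmountOfPreviousLoansBeforeLoan", "Age"]

# Pearson correlation of each feature with the interest rate, sorted.
correlations = loans[features].corrwith(loans["Interest"]).sort_values()
print(correlations)

# With matplotlib installed, a scatter plot per feature could be drawn with:
#   loans.plot.scatter(x="AmountOfPreviousLoansBeforeLoan", y="Interest")
```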
- AmountOfPreviousLoansBeforeLoan (r = -0.175): This is the only one of the four features with a modest negative linear relationship to the interest rate. In practical terms, borrowers with a larger value of prior loans tend to receive slightly lower interest rates on new loans. A longer borrowing history could signal reliability or familiarity with debt management, which modestly mitigates default risk and hence earns a cheaper rate.
- The negative sign aligns with risk-based pricing: more established credit behavior tends to be rewarded with lower rates.
- The effect is small, but it is the strongest signal among the four features I examined.
I noticed that there are two similar columns, "AppliedAmount" and "Amount", in the dataset. This implies that borrowers sometimes receive a different amount than they asked for.
If more than 5% of loans are approved for less than requested, the team may need to revise how loan amounts are communicated to applicants. I estimate this proportion using a confidence interval to support my recommendation.
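A minimal sketch of a confidence interval for a proportion, using the normal (Wald) approximation. This is one common choice and not necessarily the method used in the notebook; the counts in the usage example are made up.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to a 95% confidence level.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# Made-up counts: 40 of 1000 loans approved for less than requested.
# With the real data, the count could come from
#   (loans["Amount"] < loans["AppliedAmount"]).sum()
lower, upper = proportion_confidence_interval(successes=40, n=1000)
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")

# The 5% threshold is compared against the whole interval, not just p_hat.
threshold_inside_interval = lower <= 0.05 <= upper
```

If the entire interval lies below 0.05, the data support the claim that fewer than 5% of loans are reduced; if 0.05 falls inside the interval, the question stays open at this confidence level.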
The proportion of loans where the requested amount differs from the amount received is small, so it should be safe to analyze only one of those columns.
To make personalized loan offers, I decide to go one step further in my analysis and build a model to predict interest rates from different customer features. This will help me both predict interest rates for new customers and see which features are statistically significant in determining the interest rate.
To get my first model going, I begin with a simple linear regression. Based on the correlation analysis I did before, a good candidate for the independent variable is "AmountOfPreviousLoansBeforeLoan", which showed the strongest correlation with the target variable "Interest".
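A sketch of a one-predictor least-squares fit with NumPy on synthetic data (the slope and noise level below are arbitrary choices for illustration, not estimates from the dataset). A statistics library such as statsmodels would additionally report p-values; this version only recovers the line and its R-squared.

```python
import numpy as np

# Synthetic data: a weak negative relationship plus noise.
rng = np.random.default_rng(0)
prev_amount = rng.uniform(0, 10_000, size=200)
interest = 30.0 - 0.001 * prev_amount + rng.normal(0, 3, size=200)

# Least-squares line: interest ≈ intercept + slope * prev_amount.
slope, intercept = np.polyfit(prev_amount, interest, deg=1)

# R-squared: share of the variance in interest explained by the line.
predicted = intercept + slope * prev_amount
ss_residual = np.sum((interest - predicted) ** 2)
ss_total = np.sum((interest - interest.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

print(f"slope={slope:.5f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

For a single predictor, R-squared equals the squared Pearson correlation, which is why a correlation of about -0.175 can only ever yield an R-squared of about 0.03.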
I use the next cell for further analysis of my model. The line-of-best-fit plot is already provided, but I can add as many visualizations as I wish.
The predictor AmountOfPreviousLoansBeforeLoan has a statistically significant negative association with Interest (coefficient ≈ -0.0006, p < 0.001), but the model's R-squared is only 0.031, so this single predictor explains only about 3% of the variability in interest.
Since predicting the interest rate using a single variable didn't yield strong results, I decide to take a more comprehensive approach. This time, I'll build a more complex model that includes multiple variables, possibly even some categorical ones.
This time I use VerificationType, NoOfPreviousLoansBeforeLoan, AmountOfPreviousLoansBeforeLoan, and Rating as my predictors.
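A sketch of a multiple regression with categorical predictors, using one-hot encoding via `pd.get_dummies` and ordinary least squares via `np.linalg.lstsq`. The data are synthetic, and the category labels (three ratings, `"verified"`/`"unverified"`) are hypothetical stand-ins for the real values. Dropping the first level of each categorical makes it the reference level absorbed by the intercept.

```python
import numpy as np
import pandas as pd

# Synthetic data; labels and effect sizes are invented for illustration.
rng = np.random.default_rng(1)
n = 300
loans = pd.DataFrame({
    "Rating": rng.choice(["A", "B", "C"], size=n),
    "VerificationType": rng.choice(["verified", "unverified"], size=n),
    "NoOfPreviousLoansBeforeLoan": rng.integers(0, 6, size=n),
    "AmountOfPreviousLoansBeforeLoan": rng.uniform(0, 10_000, size=n),
})
rating_effect = loans["Rating"].map({"A": 0.0, "B": 8.0, "C": 16.0})
loans["Interest"] = 15.0 + rating_effect + rng.normal(0, 2, size=n)

predictors = [
    "VerificationType",
    "NoOfPreviousLoansBeforeLoan",
    "AmountOfPreviousLoansBeforeLoan",
    "Rating",
]

# One-hot encode the categoricals, dropping one level each as the reference.
X = pd.get_dummies(
    loans[predictors],
    columns=["VerificationType", "Rating"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "Intercept", 1.0)
y = loans["Interest"].to_numpy()

# Ordinary least squares fit.
design = X.to_numpy(dtype=float)
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
coefficients = pd.Series(coefs, index=X.columns)

fitted = design @ coefs
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print(coefficients)
print(f"R^2 = {r_squared:.3f}")
```

Each rating coefficient is read relative to the dropped reference rating, so the intercept is the expected rate for a reference-level borrower with all numeric predictors at zero.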
The model explains about 60% of the variation in Interest (R-squared ≈ 0.605). The strongest drivers are the rating categories (Rating_AA through Rating_HR) and the verification indicators, with large, highly significant coefficients for the mid-to-high-risk ratings (D through HR). The intercept is about 15.1, representing the baseline Interest at the reference levels of the categorical predictors. This setup achieves a good balance between explanatory power and interpretability: the ratings and verification status provide clear, actionable signals about pricing.