Leonardo Stabile

Data Scientist | ML & Data Engineer

Building and operationalizing machine learning systems end-to-end, from robust data pipelines and engineering foundations to scalable models and MLOps workflows that deliver reliable, production-ready insights.

Languages

Python

Rust

Fortran

Libraries

PySpark

Pandas

NumPy

Scikit-Learn

TensorFlow

Matplotlib

FastAPI

Databases

PostgreSQL

SQL Server

MySQL

MongoDB

Tools & Cloud

Airflow

Docker

Git

Linux

Power BI

My Projects

Data Engineering

Delta Lakehouse: NYC Property Sales

Designed and implemented a scalable Delta Lakehouse architecture, consolidating fragmented NYC real estate datasets into a curated, analytics-ready single source to support market trend analysis and investment decision-making.
Developed PySpark pipelines in Databricks for data ingestion, cleansing, and transformation, improving data quality, enforcing schema consistency, and enabling reliable downstream BI consumption.
Modeled a high-performance snowflake schema in the Gold layer (AWS S3 + Unity Catalog), optimizing fact/dimension structures to reduce query latency and support self-service analytics.
Created curated SQL views on top of the Gold layer, giving BI tools (e.g., Power BI) a stable interface to query directly, which enables self-service reporting and reduces dependency on engineering teams.

Python PySpark Databricks AWS S3 SQL

See Project Repository (GitHub)

Data Science

Laptop Price Estimation

Built a production-style machine learning pipeline to predict laptop prices from hardware specifications.
Trained and optimized Random Forest and XGBoost models using RandomizedSearchCV, achieving an RMSE of $195 on unseen test data.
Designed a modular, reproducible architecture with custom transformers, pipelines, and a serialized model for standalone predictions.

Python Pandas Scikit-Learn XGBoost

See Project Repository (GitHub)
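
For context on the RMSE figure quoted in the laptop price project above, here is a minimal sketch of how root mean squared error is computed. The prices are hypothetical illustration values, not data or results from the project, and the `rmse` helper is written here for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical laptop prices in dollars (not taken from the project).
actual = [900.0, 1200.0, 1500.0, 2000.0]
predicted = [1000.0, 1100.0, 1700.0, 1800.0]

error = rmse(actual, predicted)
print(f"RMSE: ${error:.2f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, so the metric is expressed in the same unit as the target (dollars here) but penalizes outlier predictions more heavily.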

Data Science

Credit Risk Analysis and Prediction

Built an end-to-end credit risk modeling pipeline on a dataset with strong class imbalance.
Performed exploratory data analysis (EDA) and outlier treatment to identify predictive variables for default risk.
Developed and compared machine learning models for binary classification using precision, recall, and F1-score.

Python Pandas Scikit-Learn

See Project Repository (GitHub)
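
To illustrate why precision, recall, and F1-score matter for the imbalanced credit risk problem above, the sketch below compares two hypothetical classifiers on a made-up sample with 5% defaults. The labels, predictions, and the `classification_metrics` helper are assumptions for illustration, not the project's data or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical sample: 95 non-defaults (0) followed by 5 defaults (1).
y_true = [0] * 95 + [1] * 5

# Classifier A never predicts a default.
always_negative = [0] * 100

# Classifier B flags 6 good customers by mistake and catches 4 of 5 defaults.
flags_some = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

baseline = classification_metrics(y_true, always_negative)
candidate = classification_metrics(y_true, flags_some)
print("always negative:", baseline)
print("flags some:     ", candidate)
```

In this toy setup the classifier that never predicts a default has the higher accuracy yet a recall of zero, which is why accuracy alone is a poor guide when the positive class is rare.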

Data Analysis

Churn Risk Analysis

Conducted a diagnostic analysis of 7,000+ customers using Pandas, quantifying R$ 121k/month in at-risk revenue while monitoring operational metrics and key financial KPIs.
Developed a Power BI dashboard (DAX) for dynamic monitoring. Applied statistical metrics to evaluate risk and financial impact, supporting strategic decisions on portfolio performance and customer retention.
Detected critical patterns, such as monthly contracts with a churn rate exceeding 40%, which supported the proposed corrective actions to increase Customer Lifetime Value (LTV).

Python Pandas Power BI DAX

See Project Repository (GitHub)
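
The kind of segment-level churn calculation described in the churn project above can be sketched as follows. The customer records are hypothetical, and summing the monthly charges of churned customers is only one possible proxy for revenue impact; the project's actual definition of at-risk revenue is not stated here.

```python
from collections import defaultdict

# Hypothetical records: (contract type, monthly charge in R$, churned?)
customers = [
    ("monthly", 80.0, True),
    ("monthly", 70.0, False),
    ("monthly", 90.0, True),
    ("monthly", 60.0, False),
    ("monthly", 100.0, False),
    ("annual", 50.0, False),
    ("annual", 65.0, False),
    ("annual", 55.0, True),
    ("annual", 75.0, False),
    ("annual", 45.0, False),
]


def churn_by_contract(records):
    """Return churn rate and churned monthly revenue per contract type."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    churned_revenue = defaultdict(float)
    for contract, charge, has_churned in records:
        totals[contract] += 1
        if has_churned:
            churned[contract] += 1
            churned_revenue[contract] += charge
    return {
        contract: {
            "churn_rate": churned[contract] / totals[contract],
            "churned_revenue": churned_revenue[contract],
        }
        for contract in totals
    }


summary = churn_by_contract(customers)
for contract, stats in summary.items():
    print(contract, stats)
```

Grouping by contract type before computing the rate is what exposes segments whose churn sits well above the overall average.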

Data Analysis

Strategic Commercial Performance Analysis

Analyzed an operation with R$ 60 million in revenue, identifying key factors associated with negative operating margins.
Developed an end-to-end ETL pipeline using Python and MySQL to extract, transform, and load relational data.
Built an executive dashboard in Power BI for monitoring financial KPIs and tracking commercial performance.

Python Pandas MySQL Power BI

See Project Repository (GitHub, in Portuguese)