Leonardo Stabile

Data Scientist | ML & Data Engineer

Building and operationalizing machine learning systems end-to-end, from robust data pipelines and engineering foundations to scalable models and MLOps workflows that deliver reliable, production-ready insights.

Languages

Python Rust Fortran

Libraries

PySpark Pandas NumPy Scikit-Learn TensorFlow Matplotlib FastAPI

Databases

PostgreSQL SQL Server MySQL MongoDB

Tools & Cloud

AWS Databricks Airflow Docker Git Linux Power BI

My Projects

Data Engineering

Delta Lakehouse: NYC Property Sales

  • Designed and implemented a scalable Delta Lakehouse architecture, consolidating fragmented NYC real estate datasets into a curated, analytics-ready single source to support market trend analysis and investment decision-making.
  • Developed PySpark pipelines in Databricks for data ingestion, cleansing, and transformation, improving data quality, enforcing schema consistency, and enabling reliable downstream BI consumption.
  • Modeled a snowflake schema in the Gold layer (AWS S3 + Unity Catalog), optimizing fact/dimension structures to reduce query latency and support self-service analytics.
  • Created curated SQL views on top of the Gold layer, providing a stable and analytics-ready layer that can be directly consumed by BI tools (e.g., Power BI), enabling self-service reporting and reducing dependency on engineering teams.
Python PySpark Databricks AWS S3 SQL
See Project Repository (GitHub)
Data Science

Laptop Price Estimation

  • Built a production-style machine learning pipeline to predict laptop prices from hardware specifications.
  • Trained and optimized Random Forest and XGBoost models using RandomizedSearchCV, achieving an RMSE of $195 on unseen test data.
  • Designed a modular and reproducible architecture with custom transformers, pipelines, and a serialized model for independent predictions.
Python Pandas Scikit-Learn XGBoost
See Project Repository (GitHub)
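For readers unfamiliar with the metric quoted above, RMSE is the square root of the mean squared difference between actual and predicted prices, expressed in the same unit as the target (dollars here). The following is a minimal standard-library sketch, not the project's code; the function name `rmse` and the sample prices are hypothetical and chosen only for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical laptop prices in dollars (not taken from the project dataset).
actual_prices = [1000.0, 1500.0, 800.0, 1200.0]
predicted_prices = [1100.0, 1400.0, 900.0, 1100.0]

# Every prediction is off by exactly $100, so the RMSE is also $100.
print(rmse(actual_prices, predicted_prices))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is worth keeping in mind when reading a single summary figure.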
Data Science

Credit Risk Analysis and Prediction

  • Built a complete credit risk modeling pipeline on a dataset with strong class imbalance.
  • Performed exploratory data analysis (EDA) and outlier treatment to identify predictive variables for default risk.
  • Developed and compared machine learning models for binary classification using Precision, Recall and F1-Score.
Python Pandas Scikit-Learn
See Project Repository (GitHub)
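The choice of Precision, Recall and F1-Score over plain accuracy matters under class imbalance. The sketch below, written with the standard library only, shows one way to see this: a model that never predicts default can match the accuracy of a more useful model while scoring zero on recall. The function `classification_metrics` and the ten sample labels are hypothetical and are not drawn from the project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical sample: 10 loans, 2 of which default (label 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_safe = [0] * 10                        # never predicts default
model_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # catches one of two defaults

baseline = classification_metrics(y_true, always_safe)
candidate = classification_metrics(y_true, model_pred)
print(baseline)
print(candidate)
```

Both predictors reach the same accuracy on this sample, yet only the second identifies any defaulting customer, which is the kind of difference the F1-Score surfaces.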
Data Analysis

Churn Risk Analysis

  • Conducted a diagnostic analysis on a dataset of 7,000+ customers using Pandas, quantifying an at-risk revenue of R$ 121k/month while monitoring operational metrics and key financial KPIs.
  • Developed a Power BI dashboard (DAX) for dynamic monitoring, applying statistical metrics to evaluate risk and financial impact and supporting strategic decisions on portfolio performance and customer retention.
  • Detected critical patterns, such as monthly contracts with a churn rate exceeding 40%, substantiating proposed corrective actions to increase Customer Lifetime Value (LTV).
Python Pandas Power BI DAX
See Project Repository (GitHub)
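A churn rate per contract type, like the one cited above, is the share of customers in each segment who left. The project computed this with Pandas; the sketch below uses only the standard library to show the underlying arithmetic. The field names `contract` and `churned`, the function `churn_by_segment`, and the eight sample records are all hypothetical.

```python
from collections import defaultdict


def churn_by_segment(customers, segment_key="contract"):
    """Return {segment: churned customers / total customers} per segment."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for customer in customers:
        segment = customer[segment_key]
        totals[segment] += 1
        if customer["churned"]:
            churned[segment] += 1
    return {segment: churned[segment] / totals[segment] for segment in totals}


# Hypothetical records, not taken from the project dataset.
customers = [
    {"contract": "monthly", "churned": True},
    {"contract": "monthly", "churned": True},
    {"contract": "monthly", "churned": False},
    {"contract": "monthly", "churned": False},
    {"contract": "annual", "churned": False},
    {"contract": "annual", "churned": True},
    {"contract": "annual", "churned": False},
    {"contract": "annual", "churned": False},
]

rates = churn_by_segment(customers)
print(rates)
```

Comparing segment rates side by side is what makes a pattern such as high churn on monthly contracts visible before any modeling is done.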
Data Analysis

Strategic Commercial Performance Analysis

  • Analyzed an operation with R$ 60 million in revenue, identifying key factors associated with negative operational margins.
  • Developed an end-to-end ETL pipeline using Python and MySQL to extract, transform, and load relational data.
  • Built an executive dashboard in Power BI for monitoring financial KPIs and tracking commercial performance.
Python Pandas MySQL Power BI
See Project Repository (GitHub, in Portuguese)