Introduction

AlphaPy Pro is an advanced machine learning framework for data scientists and quantitative traders. Built on top of scikit-learn, pandas, and modern ML libraries, it provides a comprehensive toolkit for feature engineering, model development, and systematic trading. Here are just some of the things you can do with AlphaPy Pro:

Build and optimize ML models using scikit-learn, XGBoost, LightGBM, and CatBoost.
Analyze financial markets with MarketFlow using multiple data providers.
Implement advanced trading strategies with meta-labeling and the Triple Barrier Method.
Develop and backtest trading systems with portfolio analysis.
Implement custom domain-specific pipelines for specialized applications.

The alphapy package is the core platform providing the ML pipeline. The domain pipeline MarketFlow (mflow) runs on top of alphapy for financial market analysis. As shown in the diagram below, we separate the domain pipeline from the model pipeline. The domain pipeline transforms raw application data into canonical form—training and testing sets—while the model pipeline handles feature engineering, model training, and evaluation. This architecture has been refined through numerous Kaggle competitions and real-world trading applications.

Let’s review all of the components in the diagram:

Domain Pipeline:: This is the Python code that creates the standard training and testing data. For example, you may be combining different data frames or collecting time series data from an external feed. These data are transformed for input into the model pipeline.
Domain YAML:: AlphaPy uses configuration files written in YAML to give the data scientist maximum flexibility. Typically, you will have a standard YAML template for each domain or application.
Training Data:: The training data is an external file that is read into a pandas DataFrame. For classification, one of the columns will represent the target or dependent variable.
Testing Data:: The testing data is an external file that is read into a pandas DataFrame. For classification, the labels may or may not be included.
Model Pipeline:: This generic Python code runs any classification or regression model. The pipeline begins with data and ends with a model object for new predictions.
Model YAML:: The configuration file has specific sections for running the model pipeline. Every aspect of creating a model is controlled through this file.
Model Object:: All models are saved to disk. You can load and run your trained model on new data in scoring mode.
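
The flow through these components can be sketched in a few lines of Python. This is not AlphaPy Pro's own code: the functions domain_pipeline, model_pipeline, and score are hypothetical names, the price series is synthetic, and logistic regression with pickle stands in for whatever model and storage the Model YAML would actually select.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def domain_pipeline(raw):
    """Hypothetical domain step: turn raw prices into training and testing frames."""
    close = raw["close"]
    frame = pd.DataFrame({
        "ret_1": close / close.shift(1) - 1.0,   # one-bar return
        "ret_5": close / close.shift(5) - 1.0,   # five-bar return
    })
    # Target: 1 if the next close is higher than the current close, else 0.
    frame["target"] = (close.shift(-1) > close).astype(int)
    # Drop the first five rows (undefined features) and the last row (no next close).
    frame = frame.iloc[5:-1]
    split = int(len(frame) * 0.8)
    return frame.iloc[:split], frame.iloc[split:]


def model_pipeline(train, model_path):
    """Hypothetical model step: fit on the training frame and save the model object."""
    features = train.drop(columns="target")
    model = LogisticRegression().fit(features, train["target"])
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    return model_path


def score(test, model_path):
    """Scoring mode: load the saved model and predict on new data."""
    with open(model_path, "rb") as handle:
        model = pickle.load(handle)
    # Testing data may or may not carry labels, so drop the target only if present.
    features = test.drop(columns="target", errors="ignore")
    return model.predict_proba(features)[:, 1]


# Synthetic random-walk prices stand in for an external data feed.
rng = np.random.default_rng(0)
raw = pd.DataFrame({"close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300)))})

train, test = domain_pipeline(raw)
with tempfile.TemporaryDirectory() as tmp:
    path = model_pipeline(train, Path(tmp) / "model.pkl")
    probabilities = score(test, path)

print(len(train), len(test), probabilities.shape)
```

The point of the split is that only domain_pipeline knows anything about prices. Everything after it sees a plain table with a target column, which is what lets the same model pipeline serve other domains.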

Core Functionality

AlphaPy has been developed primarily for supervised learning tasks. You can generate models for any classification or regression problem.

Binary Classification: classify elements into one of two groups
Multiclass Classification: classify elements into three or more categories
Regression: predict continuous real values

Classification Algorithms:

CatBoost (CATB)
LightGBM (LGB)
XGBoost (XGB): binary and multiclass
Random Forests (RF)
Extra Trees (EXT)
Gradient Boosting (GB)
Logistic Regression (LOGR)
K-Nearest Neighbors (KNN)
Support Vector Machine (SVM)
Naive Bayes (NB)
AdaBoost (ADA)

Regression Algorithms:

CatBoost Regressor
LightGBM Regressor
XGBoost Regressor
Random Forest Regressor
Extra Trees Regressor
Gradient Boosting Regressor
Linear Regression
Ridge Regression
Lasso Regression
K-Nearest Neighbors Regressor

Key Features

AlphaPy Pro includes several advanced features for modern ML workflows:

Meta-Labeling: Triple Barrier Method for advanced financial ML
Feature Engineering: Automated feature generation with clustering, interactions, and transformations
Feature Selection: LOFO (Leave One Feature Out) importance and univariate selection
Model Calibration: Probability calibration with sigmoid and isotonic methods
Advanced Visualization: Learning curves, ROC curves, confusion matrices, and feature importance plots
Multiple Data Sources: EODHD, Yahoo Finance, Polygon, IEX Cloud, and more
Grid Search: Randomized and systematic hyperparameter optimization
Ensemble Methods: Model blending and stacking
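
To make the Triple Barrier Method concrete, here is a minimal, dependency-free sketch of the labeling step. It is an illustration under simplifying assumptions, not AlphaPy Pro's implementation: the function name triple_barrier_label and its parameters are hypothetical, the horizontal barriers are fixed fractional returns (in practice they are often scaled by volatility), and a timeout is labeled 0.

```python
def triple_barrier_label(prices, start, profit_take, stop_loss, max_holding):
    """Label one entry point by whichever barrier is touched first.

    Returns +1 if the upper (profit-taking) barrier is hit first,
    -1 if the lower (stop-loss) barrier is hit first,
    and 0 if the vertical (time) barrier expires with neither touched.

    profit_take and stop_loss are fractional returns, e.g. 0.02 for 2%.
    max_holding is the number of bars after entry before the time barrier.
    """
    entry = prices[start]
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1.0
        if ret >= profit_take:
            return 1
        if ret <= -stop_loss:
            return -1
    return 0


# Toy price path, assumed for illustration only.
prices = [100.0, 101.0, 103.0, 99.0, 96.0, 97.0, 98.0, 98.5]

# Label every bar that has at least one later bar to look at.
labels = [
    triple_barrier_label(prices, i, profit_take=0.02, stop_loss=0.02, max_holding=3)
    for i in range(len(prices) - 1)
]
print(labels)
```

In a meta-labeling setup, labels like these would typically feed a secondary classifier that decides whether to act on a primary model's signal. That second stage is not shown here.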

External Packages

AlphaPy Pro builds on the following ML and data science packages:

Gradient Boosting: XGBoost, LightGBM, CatBoost
Feature Engineering: category_encoders, lofo-importance
Imbalanced Learning: imbalanced-learn (SMOTE, ADASYN, etc.)
Calibration: venn-abers for probability calibration
Market Data: yfinance, polygon-api-client, pandas-datareader
Portfolio Analysis: pyfolio (legacy), custom portfolio analytics
Visualization: matplotlib, seaborn, plotly
Time Series: statsmodels, arch