Data Science Toolbox
ELI5
tags: #explainability
#interpretability
#python
ELI5 is a Python library that lets you visualize and debug various machine learning models through a unified API. It has built-in support for several ML frameworks and provides a way to explain black-box models.
SHAP
tags: #explainability
#interpretability
#python
SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model. It includes great (interactive) dashboards. I have only used it on random forests so far.
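To make the game-theoretic idea concrete, here is a minimal sketch that computes exact Shapley values by enumerating every feature coalition. It does not use the SHAP library, which relies on much faster algorithms; the toy `model`, the `instance` and the all-zero `baseline` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by brute force (cost grows exponentially)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def model(x):
    # Hypothetical toy model with one interaction term
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 3.0]   # the prediction we want to explain
baseline = [0.0, 0.0, 0.0]   # assumed reference input


def value_fn(coalition):
    # Features in the coalition take the instance value, the rest the baseline
    x = [instance[j] if j in coalition else baseline[j]
         for j in range(len(instance))]
    return model(x)


phi = shapley_values(value_fn, len(instance))
print(phi)
```

The attributions come out as roughly 3.5, 6.0 and 1.5: the interaction term is split evenly between features 0 and 2, and the values sum to the difference between the prediction for `instance` and for `baseline`.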
Lime
tags: #explainability
#interpretability
#python
lime (Local Interpretable Model-agnostic Explanations) explains what machine learning classifiers are doing.
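The core idea behind a local, model-agnostic explanation can be sketched without the lime package: perturb the input around one point, weight the perturbed samples by proximity, and fit a simple linear surrogate. The sketch below uses a single feature and a fixed grid instead of random sampling; `black_box`, `local_linear_explanation` and all parameter values are hypothetical and are not part of the library's API.

```python
import math


def black_box(x):
    # Hypothetical opaque classifier: probability of the positive class
    return 1.0 / (1.0 + math.exp(-(x * x - 4.0)))


def local_linear_explanation(predict, x0, width=0.5, n_samples=201,
                             kernel_width=0.25):
    """Fit a proximity-weighted line to predict() around x0."""
    # Perturb the input on a symmetric grid around x0
    xs = [x0 - width + 2.0 * width * k / (n_samples - 1)
          for k in range(n_samples)]
    ys = [predict(x) for x in xs]
    # Samples closer to x0 get a larger weight
    ws = [math.exp(-((x - x0) ** 2) / (kernel_width ** 2)) for x in xs]
    total = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / total
    y_mean = sum(w * y for w, y in zip(ws, ys)) / total
    # Closed-form weighted least squares for one feature
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = local_linear_explanation(black_box, x0=2.0)
print(f"local effect of the feature near x=2: {slope:.3f}")
```

The slope of the surrogate is the explanation: it says how strongly, and in which direction, the feature moves the prediction in the neighbourhood of the chosen point.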
Interpretable Machine Learning - A Guide for Making Black Box Models Explainable.
tags: #explainability
#interpretability
#book
online version
UMAP
tags: #dimensionreduction
Code, Paper Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.
t-SNE
Explanation of Random Forest Feature Importance
tags: #randomforests
#featureimportance
Article
dtreeviz
tags: #interpretability
#randomforests
Code Article A python library for decision tree visualization and model interpretation.
HDBSCAN
tags: #visualisation
#unsupervised
Documentation The hdbscan library is a suite of tools to use unsupervised learning to find clusters, or dense regions, of a dataset.
Kalman Filters
Huber Loss
If MSE is too sensitive to outliers in your data and MAE is not sensitive enough, try Huber loss.
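A minimal sketch of the standard Huber loss shows why it sits between the two: residuals up to `delta` are penalised quadratically (like MSE), larger ones linearly (like MAE). The function name, the default `delta=1.0` and the sample values are illustrative assumptions.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss over paired targets and predictions."""
    total = 0.0
    for target, prediction in zip(y_true, y_pred):
        residual = abs(target - prediction)
        if residual <= delta:
            # Small error: quadratic penalty, as in MSE
            total += 0.5 * residual * residual
        else:
            # Large error: linear penalty, as in MAE
            total += delta * (residual - 0.5 * delta)
    return total / len(y_true)


y_true = [0.0, 0.0]
y_pred = [0.5, 3.0]   # the second prediction plays the role of an outlier
loss = huber_loss(y_true, y_pred)
print(loss)
```

For these values the loss is 1.3125, whereas the squared-error terms alone would average 4.625, so the outlier dominates far less.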
Active Learning
tags: #data
#labels
If you have lots of unlabelled data but labelling is expensive, active learning can help find the best data to label.
- modAL python framework for active learning
- Active Learning
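One common way to pick "the best data to label" is uncertainty sampling: ask for labels on the samples the current model is least sure about. The sketch below is plain Python and does not use modAL; the function names and the predicted probabilities are made up for illustration.

```python
def least_confident(probabilities):
    # Uncertainty = 1 - probability of the most likely class
    return 1.0 - max(probabilities)


def select_queries(pool_probabilities, n_queries=2):
    """Return indices of the unlabelled samples with the highest uncertainty."""
    ranked = sorted(
        range(len(pool_probabilities)),
        key=lambda i: least_confident(pool_probabilities[i]),
        reverse=True,
    )
    return ranked[:n_queries]


# Hypothetical class probabilities predicted for four unlabelled samples
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]

queries = select_queries(pool_probabilities, n_queries=2)
print(queries)
```

Here samples 3 and 1 are selected, because their predictions are closest to a coin flip. In a full loop you would label them, retrain, and repeat.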
Pandas Profiling
tags: #exploration
#python
#visualization
pandas’ DataFrame.describe() on steroids.
SweetViz
tags: #exploration
#python
#visualization
In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code, alternative to Pandas Profiling
Great Expectations
tags: #data
#etl
#pipeline
Helps eliminate data pipeline debt, through data testing, documentation, and profiling. Assertions for data
PopMon
tags: #data
#pipeline
#drift
Monitor the stability of a pandas or spark dataframe
Kats “One stop shop for time series analysis in Python”
tags: #timeseries
#forecasting
#detection
Includes 10+ forecasting models, backtesting, hyperparameter tuning, pattern detection and time series feature extraction.
Darts
Forecasting: Principles and Practice
tags: #book
#timeseries
#rlang
#forecasting
Book Mostly methods like ARIMA, including a chapter on #hierarchical time series.
Matrix Profile
tags: #timeseries
#anomalydetection
#motif
Website Presentation Part 1 and Part 2 Python package #python
“The matrix profile is a data structure and associated algorithms that helps solve the dual problem of anomaly detection and motif discovery. It is robust, scalable and largely parameter-free.”
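As a rough illustration of what the data structure holds, the brute-force sketch below computes, for every window of length `m`, the z-normalised Euclidean distance to its nearest non-overlapping match. Real implementations use much faster algorithms; the function names, the exclusion-zone size and the toy series are assumptions made for this example.

```python
import math


def znorm(window):
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def matrix_profile(series, m):
    """Distance from each length-m window to its nearest non-trivial match."""
    n = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n)]
    exclusion = max(1, m // 2)  # ignore matches that overlap too much
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < exclusion:
                continue
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(windows[i], windows[j])))
            best = min(best, dist)
        profile.append(best)
    return profile


# Repeating pattern with one injected spike at position 13
series = [0.0, 1.0, 2.0, 1.0] * 6
series[13] = 5.0

profile = matrix_profile(series, m=4)
anomaly_start = max(range(len(profile)), key=profile.__getitem__)
print(anomaly_start, profile[anomaly_start])
```

Low values in the profile mark motifs (windows that repeat elsewhere), while the highest value marks the window least like any other, which here is one of the windows covering the spike.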
Time Series Classification Repository
tags: #timeseries
#classification
http://timeseriesclassification.com/
tsfresh
tags: #python
#timeseries
#features
tsfresh calculates a large number of time series characteristics (features).
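To give a feel for what such characteristics look like, here is a small hand-written sketch that derives a handful of summary features from one series. It does not call tsfresh, which computes far more features and handles many series at once; the function name and the sample values are hypothetical.

```python
import math


def extract_features(series):
    """Compute a few simple summary features of a single time series."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series) / n
    # Absolute differences between consecutive observations
    changes = [abs(b - a) for a, b in zip(series, series[1:])]
    return {
        "mean": mean,
        "standard_deviation": math.sqrt(variance),
        "abs_energy": sum(v * v for v in series),
        "mean_abs_change": sum(changes) / len(changes),
        "maximum": max(series),
        "minimum": min(series),
    }


features = extract_features([1.0, 3.0, 2.0, 5.0, 4.0])
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Each series becomes one row of numbers like this, which can then be fed to any ordinary tabular model.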
Featuretools
tags: #python
#features
Featuretools automatically creates features from temporal and relational datasets.
Reptile
Article tags: #fewshot
#metalearning
Yellowbrick
tags: #visualization
Website Yellowbrick extends the Scikit-Learn API to make model selection and hyperparameter tuning easier. Under the hood, it’s using Matplotlib.
SMOGN
tags: #imbalanced
#machinelearning
SMOGN: Synthetic Minority Over-sampling for regression with Gaussian Noise
HanTa Hanover Tagger
tags: #nlp
#tagger
A simple approach to lemmatization and POS-tagging based on heuristics and hidden Markov models of German morphology. GitHub