This is a personal collection of repetitive commands and snippets for ML projects.
Specify key manually in boto3
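A sketch, assuming placeholder credentials (boto3 normally reads them from ~/.aws/credentials or environment variables):

```python
import boto3

# Pass credentials explicitly instead of relying on the default chain.
# Both key values below are placeholders.
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
)
```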
S3 operations on boto3
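A sketch of common calls; bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client('s3')
# Upload a local file, download it back, and list objects in the bucket.
s3.upload_file('model.pkl', 'somebucket', 'models/model.pkl')
s3.download_file('somebucket', 'models/model.pkl', 'model.pkl')
response = s3.list_objects_v2(Bucket='somebucket')
```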
SNS operations on boto3
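A sketch of publishing to a topic; the region and topic ARN are placeholders:

```python
import boto3

sns = boto3.client('sns', region_name='us-east-1')
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:mytopic',
    Subject='Training finished',
    Message='Model training completed successfully.',
)
```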
AWS ML services
Enable static website hosting on S3
Enable hosting
aws s3 website s3://somebucket --index-document index.html
Go to Permissions > Public Access Settings > Edit and set these three options to false: Block new public bucket policies; Block public and cross-account access if bucket has public policies; Block new public ACLs and uploading public objects.
Navigate to Permissions > Bucket Policy and paste this policy.
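The policy itself isn't reproduced here; a standard public-read policy for static hosting looks like this (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::somebucket/*"
    }
  ]
}
```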
Install OpenCV in conda
conda install -c conda-forge opencv
Update conda
conda update -n base -c defaults conda
Make binaries work on Mac
sudo xcode-select --install
conda install clang_osx-64 clangxx_osx-64 gfortran_osx-64
Create/Update conda environment from file
conda env create -f environment.yml
conda env update -f environment.yml
Install CUDA toolkit in conda
conda install cudatoolkit=9.2 -c pytorch
conda install cudatoolkit=10.0 -c pytorch
Disable auto-activation of conda environment
conda config --set auto_activate_base false
Disable multithreading in numpy
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OPENMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
Faster alternatives to Conda
| Docker Image | Remarks |
|---|---|
| micromamba-docker | Small binary version of mamba |
| condaforge/mambaforge | Docker image with conda-forge and mamba |
| condaforge/miniforge | Docker image with conda-forge as default channel |
Run celery workers
File tasks.py contains the Celery app object. With -P solo, concurrency is 1 and no extra threads or processes are used.
celery -A tasks.celery worker --loglevel=info -P solo
Start flower server to monitor celery
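The command isn't shown here; assuming the same tasks.py module as above, a typical invocation is:

```
pip install flower
celery -A tasks.celery flower --port=5555
```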
Use flower from docker-compose
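A minimal docker-compose sketch, assuming a broker service named redis and the community mher/flower image:

```yaml
services:
  flower:
    image: mher/flower
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
    ports:
      - "5555:5555"
```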
Start docker-compose as daemon
docker-compose up --build -d
Use journald as logging driver
Edit /etc/docker/daemon.json
, add this json and restart.
{
"log-driver": "journald"
}
Send logs to CloudWatch
sudo nano /etc/docker/daemon.json
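The JSON to add isn't shown; a sketch using the awslogs driver, with the region and log group as placeholders:

```json
{
  "log-driver": "awslogs",
  "log-opts": {
    "awslogs-region": "us-east-1",
    "awslogs-group": "docker-logs",
    "awslogs-create-group": "true"
  }
}
```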
sudo systemctl daemon-reload
sudo service docker restart
Set environment variable globally in daemon
mkdir -p /etc/systemd/system/docker.service.d/
sudo nano /etc/systemd/system/docker.service.d/aws-credentials.conf
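The file content isn't shown; a sketch of the systemd drop-in with placeholder values:

```ini
[Service]
Environment="AWS_ACCESS_KEY_ID=your_access_key"
Environment="AWS_SECRET_ACCESS_KEY=your_secret_key"
```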
sudo systemctl daemon-reload
sudo service docker restart
Disable pip cache and version check
ENV PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
Dockerfile for FastAPI
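The Dockerfile isn't reproduced here; a minimal sketch, assuming the app object lives in main.py:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Serve the FastAPI app object defined in main.py
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
```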
Return exit code in docker-compose
docker-compose up --abort-on-container-exit --exit-code-from worker
Change entrypoint of Dockerfile in compose
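A sketch; the service name and command are placeholders:

```yaml
services:
  worker:
    build: .
    # Overrides the ENTRYPOINT baked into the image
    entrypoint: ["python", "worker.py"]
```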
Use debugging mode
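The snippet isn't shown; one common pattern (used in the FastAPI debugging docs) is running uvicorn from __main__ so the app can be launched under a debugger:

```python
import uvicorn
from fastapi import FastAPI

app = FastAPI()

if __name__ == "__main__":
    # Run this file directly (e.g. under a debugger) instead of via the CLI.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```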
Enable CORS
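A sketch using FastAPI's CORSMiddleware; tighten allow_origins in production:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
```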
Raise HTTP Exception
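A sketch following the FastAPI docs; the items dict is placeholder data:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
items = {"foo": "The Foo Wrestlers"}

@app.get("/items/{item_id}")
def read_item(item_id: str):
    if item_id not in items:
        raise HTTPException(status_code=404, detail="Item not found")
    return {"item": items[item_id]}
```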
Run FastAPI in Jupyter Notebook
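A sketch using nest_asyncio so uvicorn can run inside the notebook's already-running event loop:

```python
import nest_asyncio
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def home():
    return {"message": "hello"}

nest_asyncio.apply()
uvicorn.run(app, port=8000)
```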
Mock ML model in test case
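The original test isn't shown; a sketch with unittest.mock, stubbing out an expensive predict call:

```python
from unittest.mock import MagicMock

def test_prediction():
    # Stand-in for a heavyweight ML model; predict is stubbed.
    model = MagicMock()
    model.predict.return_value = [1]
    assert model.predict(['some text']) == [1]
```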
Test API in flask
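A sketch using Flask's built-in test client; app is a hypothetical module exposing the Flask application:

```python
from app import app

def test_home():
    client = app.test_client()
    response = client.get('/')
    assert response.status_code == 200
```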
Load model only once before first request
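A sketch with Flask's before_first_request hook (deprecated in Flask 2.3+); joblib and model.pkl are assumptions:

```python
import joblib
from flask import Flask

app = Flask(__name__)
model = None

@app.before_first_request
def load_model():
    # Runs once, before the first request is handled.
    global model
    model = joblib.load('model.pkl')
```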
Load binary format in Word2Vec
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('model.bin',
binary=True)
model.most_similar('apple')
Prevent git from asking for password
git config credential.helper 'cache --timeout=1800'
Whitelist in .gitignore
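A sketch of the whitelist pattern: ignore everything, then re-include what should stay tracked:

```
# Ignore everything...
*
# ...but descend into directories
!*/
# ...and keep these files
!.gitignore
!*.py
```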
Clone private repo using personal token
Create token from settings and run:
git clone https://<token>@github.com/amitness/example.git
Create alias to run command
# git test
git config --global alias.test "!python -m doctest"
Install Git LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
Triggers for GitHub Action
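The examples aren't shown; a sketch of common on: triggers in a workflow file:

```yaml
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 * * *'  # daily at midnight UTC
  workflow_dispatch:      # manual trigger from the Actions tab
```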
Useful GitHub Actions
| Action | Remarks |
|---|---|
| scrape.yml | Scrape a webpage and save it to the repo |
Increase timeout
gunicorn --bind 0.0.0.0:5000 main:app --timeout 6000
Check error logs
tail -f /var/log/gunicorn/error_
Run two workers
gunicorn main:app --preload -w 2 -b 0.0.0.0:5000
Use pseudo-threads
If CPU cores = 1, then suggested concurrency = 2*1+1 = 3.
gunicorn main:app --worker-class=gevent --worker-connections=1000 --workers=3
Use multiple threads
If CPU cores = 4, then suggested concurrency = 2*4+1 = 9.
gunicorn main:app --workers=3 --threads=3
Use in-memory file system for heartbeat file
gunicorn --worker-tmp-dir /dev/shm
Add background task to add 2 numbers
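The snippet is missing here; a minimal sketch using Celery (an assumption, matching the Celery section above), with a Redis broker as placeholder:

```python
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y

# Enqueue from application code: add.delay(2, 3)
```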
Auto-import common libraries
Create a file start.py inside the startup folder in ~/.ipython/profile_default.
# start.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Make auto-reload of modules by default
Add a file to the startup folder in ~/.ipython/profile_default with the content below.
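A sketch; the filename is an assumption, but the .ipy extension is needed for magics:

```python
# ~/.ipython/profile_default/startup/00-autoreload.ipy
%load_ext autoreload
%autoreload 2
```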
Auto print all expressions
Edit ~/.ipython/profile_default/ipython_config.py and add the line below.
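```python
c.InteractiveShell.ast_node_interactivity = 'all'
```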
Add conda kernel to jupyter
Activate conda environment and run below command.
pip install --user ipykernel
python -m ipykernel install --user --name=condaenv
Add R kernel to jupyter
conda install -c r r-irkernel
# Link to fix issue with readline
cd /lib/x86_64-linux-gnu/
sudo ln -s libreadline.so.7.0 libreadline.so.6
Start notebook on remote server
jupyter notebook --ip=0.0.0.0 --no-browser
Serve as voila app
voila --port=$PORT --no-browser app.ipynb
Enable widgets in jupyter lab
pip install jupyterlab
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
jupyter labextension install @jupyter-widgets/jupyterlab-manager
Switch to language server in jupyter lab
pip install --pre jupyter-lsp
jupyter labextension install @krassowski/jupyterlab-lsp
pip install python-language-server[all]
Add kaggle credentials
pip install --upgrade kaggle kaggle-cli
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json
Zip a folder
zip -r folder.zip folder
Use remote server as VPN
ssh -D 8888 -f -C -q -N [email protected]
SSH Tunneling for multiple ports (5555, 5556)
ssh -N -f -L localhost:5555:127.0.0.1:5555 -L localhost:5556:127.0.0.1:5556 [email protected]
Reverse SSH tunneling
Enable GatewayPorts=yes in /etc/ssh/sshd_config on the server.
ssh -NT -R example.com:5000:localhost:5000 [email protected] -i ~/.ssh/xyz.pem -o GatewayPorts=yes
Copy remote files to local
scp [email protected]:/mnt/file.zip .
Set correct permission for PEM file
chmod 400 credentials.pem
Clear DNS cache
sudo service network-manager restart
sudo service dns-clean
sudo systemctl restart dnsmasq
sudo iptables -F
Unzip .xz file
sudo apt-get install xz-utils
unxz ne.txt.xz
Disable password-based login on server
Edit /etc/ssh/sshd_config and set PasswordAuthentication to no.
sudo nano /etc/ssh/sshd_config
Auto-generate help for make files
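The recipe isn't shown; the widely used pattern parses ## comments next to each target (recipe lines must start with a tab):

```makefile
.PHONY: help
help:  ## Show this help message
	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*?## "}; {printf "%-20s %s\n", $$1, $$2}'
```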
Rebind prefix for tmux
Edit ~/.tmux.conf with the content below and reload it by running tmux source-file ~/.tmux.conf.
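A sketch of the standard rebind from C-b to C-a:

```
# Use Ctrl-a instead of the default Ctrl-b prefix
unbind C-b
set-option -g prefix C-a
bind-key C-a send-prefix
```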
Clear DNS cache
sudo systemd-resolve --flush-caches
Reset GPU
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
Add comparison of code blocks side by side
Solution
Assign path to port
location /demo/ {
proxy_pass http://localhost:5000/;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
Increase timeout for nginx
The default timeout is 60s. Edit the proxy params and raise the timeouts as sketched below.
sudo nano /etc/nginx/proxy_params
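The directives aren't shown; a sketch raising the common proxy timeouts to 600s:

```nginx
proxy_connect_timeout 600;
proxy_send_timeout 600;
proxy_read_timeout 600;
send_timeout 600;
```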
Setup nginx for prodigy
Get list of all POS tags
import nltk
nltk.download('tagsets')
nltk.help.upenn_tagset()
Upgrade to latest node version
npm cache clean -f
npm install -g n
n stable
Save with quoted strings
import csv

df.to_csv('data.csv',
          index=False,
          quotechar='"',
          quoting=csv.QUOTE_NONNUMERIC)
Import database dump
If the database name is test and the user is postgres:
pg_restore -U postgres -d test < example.dump
Add keyboard shortcut for custom command
Link
Enable pytest as default test runner
Allow camel case field name from frontend
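The snippet isn't shown; a sketch with a pydantic v1-style alias generator so camelCase JSON maps onto snake_case fields:

```python
from pydantic import BaseModel

def to_camel(string: str) -> str:
    first, *rest = string.split('_')
    return first + ''.join(word.capitalize() for word in rest)

class User(BaseModel):
    first_name: str

    class Config:
        alias_generator = to_camel

# Accepts {"firstName": "Amit"} from the frontend
user = User(**{"firstName": "Amit"})
```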
Validate fields
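A sketch using a pydantic v1 validator:

```python
from pydantic import BaseModel, validator

class User(BaseModel):
    age: int

    @validator('age')
    def age_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('age must be positive')
        return v
```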
Install build utilities
sudo apt update
sudo apt install build-essential python3-dev
sudo apt install python-pip virtualenv
Install mysqlclient
sudo apt install libmysqlclient-dev mysql-server
pip install mysqlclient
Get memory usage of python script
import os
import psutil
process = psutil.Process(os.getpid())
print(process.memory_info().rss)  # resident memory in bytes
Convert python package to command line tool
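A sketch of the entry point in setup.py; the package, module, and command names are placeholders:

```python
from setuptools import setup, find_packages

setup(
    name='example-package',
    packages=find_packages(),
    entry_points={
        # Installs an `example` command that calls example_package.cli:main
        'console_scripts': ['example=example_package.cli:main'],
    },
)
```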
Install package from TestPyPi
pip install --index-url https://test.pypi.org/simple \
    --extra-index-url https://pypi.org/simple \
    example-package
Test multiple python versions using tox
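A minimal tox.ini sketch, assuming pytest-based tests:

```ini
[tox]
envlist = py37,py38,py39

[testenv]
deps = pytest
commands = pytest
```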
Flake8: Exclude certain checks
Place setup.cfg alongside setup.py with the content below.
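A sketch that skips a couple of common checks (the codes are examples):

```ini
[flake8]
# E501: line too long, W503: line break before binary operator
ignore = E501, W503
exclude = .git,__pycache__,build
```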
Send email with SMTP
Enable less secure app access in Gmail settings, then send as sketched below.
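The snippet isn't reproduced; a standard-library sketch with placeholder addresses and an app password:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Training finished'
msg['From'] = '[email protected]'
msg['To'] = '[email protected]'
msg.set_content('The model finished training.')

# Gmail SMTP over SSL; credentials are placeholders.
with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
    server.login('[email protected]', 'app-password')
    server.send_message(msg)
```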
Run selenium on chromium
sudo apt update
sudo apt install chromium-chromedriver
cp /usr/lib/chromium-browser/chromedriver /usr/bin
pip install selenium
Generate fake user agent in selenium
Run pip install fake_useragent, then set a random user agent as sketched below.
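```python
from fake_useragent import UserAgent
from selenium import webdriver

ua = UserAgent()
options = webdriver.ChromeOptions()
# Pick a random real-world user agent string.
options.add_argument(f'user-agent={ua.random}')
driver = webdriver.Chrome(options=options)
```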
Install CPU-only version of PyTorch
conda install pytorch torchvision cpuonly -c pytorch
Auto-select proper pytorch version based on GPU
pip install light-the-torch
ltt install torch torchvision
Set random seed
import random

import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
Create custom transformation
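The example isn't shown; a sketch of a callable transform that plugs into torchvision.transforms.Compose (the noise transform itself is an illustration):

```python
import torch

class AddGaussianNoise:
    """Add Gaussian noise to a tensor image."""

    def __init__(self, mean=0.0, std=0.1):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        return tensor + torch.randn_like(tensor) * self.std + self.mean
```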
Disable warnings in pytest
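The config isn't shown; a pytest.ini sketch:

```ini
[pytest]
filterwarnings = ignore
# or disable the warnings plugin entirely:
# addopts = -p no:warnings
```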
Use model checkpoint callback
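The snippet isn't shown; assuming PyTorch Lightning, a sketch:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the single best checkpoint by validation loss.
checkpoint_cb = ModelCheckpoint(monitor='val_loss', mode='min', save_top_k=1)
trainer = pl.Trainer(callbacks=[checkpoint_cb])
```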
Connect to redis from commandline
redis-cli -h 1.1.1.1 -p 6380 -a password
Connect to local redis
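With no arguments, redis-cli connects to 127.0.0.1:6379:

```
redis-cli
```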
Use URL for redis
If password is present:
redis://:{password}@{hostname}:{port}/{db_number}
else
redis://{hostname}:{port}/{db_number}
Add password to redis server
Edit the /etc/redis/redis.conf file.
sudo nano /etc/redis/redis.conf
Uncomment this line and set password.
# requirepass yourpassword
Restart redis server.
sudo service redis-server restart
Enable cluster mode locally
Edit the /etc/redis/redis.conf file.
sudo nano /etc/redis/redis.conf
Uncomment this line and save the file.
# cluster-enabled yes
Restart the redis server.
sudo service redis-server restart
Post JSON data to endpoint
import requests
headers = {'Content-Type': 'application/json'}
data = {}
response = requests.post('http://example.com',
json=data,
headers=headers)
Use random user agent in requests
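A sketch combining requests with fake_useragent:

```python
import requests
from fake_useragent import UserAgent

headers = {'User-Agent': UserAgent().random}
response = requests.get('http://example.com', headers=headers)
```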
Use rate limit and backoff in API
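The snippet isn't shown; a sketch using urllib3's Retry with exponential backoff mounted on a requests session:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 5 times on rate limits / server errors, backing off exponentially.
retry = Retry(total=5, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))
response = session.get('https://example.com')
```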
Add server alias to SSH config
Add to ~/.ssh/config
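The entry isn't shown; a sketch with placeholder host details:

```
Host myserver
    HostName 1.2.3.4
    User ubuntu
    IdentityFile ~/.ssh/xyz.pem
```

Then connect with ssh myserver.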
Disable CORS
Create ~/.streamlit/config.toml
[server]
enableCORS = false
File Uploader
file = st.file_uploader("Upload file",
type=['csv', 'xlsx'],
encoding='latin-1')
df = pd.read_csv(file)
Create download link for CSV file
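The snippet isn't reproduced; the common base64 pattern for older Streamlit versions (newer ones ship st.download_button):

```python
import base64

import pandas as pd
import streamlit as st

def csv_download_link(df: pd.DataFrame) -> str:
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()
    return f'<a href="data:file/csv;base64,{b64}" download="data.csv">Download CSV</a>'

df = pd.DataFrame({'a': [1, 2]})
st.markdown(csv_download_link(df), unsafe_allow_html=True)
```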
Run on docker
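A minimal Dockerfile sketch, assuming the app is app.py:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port", "8501"]
```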
Docker compose for streamlit
Add Dockerfile to the app folder.
Add project.conf to the nginx folder.
Add Dockerfile to the nginx folder.
Add docker-compose.yml at the root.
Run on heroku
Add requirements.txt, then create Procfile and setup.sh as sketched below.
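The file contents aren't shown; a typical Procfile sketch (setup.sh usually writes ~/.streamlit/config.toml pointing the server at $PORT):

```
web: sh setup.sh && streamlit run app.py
```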
Deploy streamlit on google cloud
Create Dockerfile, app.yaml and run:
gcloud config set project your_projectname
gcloud app deploy
Render SVG
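The helper isn't shown; the usual trick renders the SVG through a base64 img tag:

```python
import base64

import streamlit as st

def render_svg(svg: str) -> None:
    b64 = base64.b64encode(svg.encode()).decode()
    st.markdown(f'<img src="data:image/svg+xml;base64,{b64}"/>', unsafe_allow_html=True)

render_svg('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100"><circle cx="50" cy="50" r="40"/></svg>')
```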
Install CPU-only version of Tensorflow
conda install tensorflow-mkl
or
pip install tensorflow-cpu==2.1.0
Install custom builds for CPU
Find link from https://github.com/lakshayg/tensorflow-build
pip install --ignore-installed --upgrade "url"
Install with GPU support
conda create --name tensorflow-22 \
tensorflow-gpu=2.2 \
cudatoolkit=10.1 \
cudnn=7.6 \
python=3.8 \
pip=20.0
Use only single GPU
export CUDA_VISIBLE_DEVICES=0
Allocate memory as needed
export TF_FORCE_GPU_ALLOW_GROWTH='true'
Enable XLA
import tensorflow as tf
tf.config.optimizer.set_jit(True)
Load saved model with custom layer
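A sketch; CustomLayer stands in for whatever layer the model was trained with:

```python
import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):
    """Placeholder for the custom layer used at training time."""
    def call(self, inputs):
        return inputs

model = tf.keras.models.load_model(
    'model.h5',
    custom_objects={'CustomLayer': CustomLayer},
)
```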
Ensure Conda doesn’t cause tensorflow issue
Upload tensorboard data to cloud
tensorboard dev upload --logdir ./logs \
--name "XYZ" \
--description "some model"
Use TPU in Keras
TPU survival guide on Google Colaboratory
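A TF 2.3+ sketch for Colab TPUs; create_model is a placeholder for your model-building function:

```python
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = create_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```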
Use universal sentence encoder
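A sketch with tensorflow_hub:

```python
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(["The quick brown fox.", "Hello world."])
print(embeddings.shape)  # (2, 512)
```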
Backtranslate a text
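The snippet isn't shown; one way (an assumption about the approach) is round-tripping through MarianMT models from Hugging Face:

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# English -> French -> English
french = translate(["How are you?"], "Helsinki-NLP/opus-mt-en-fr")
back = translate(french, "Helsinki-NLP/opus-mt-fr-en")
```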