Initial steps
Define – Question the client & the business team
- Identify the problem
- Understand the ideal outcome and the “good enough” outcome
- Define the metrics to be evaluated and their importance
- Learn the business logic
- Understand the functional & non-functional requirements
- Find weaknesses & special constraints
- List assumptions
Data
- Identify the sources of data
- Learn the data
- Look at examples and their labels – try to learn which factors drive them
- Make sure the data is labeled consistently
- Spot weaknesses and problems in the data: class imbalances, missing values (N/A)
- Look for useful additional features
  - derived from existing features (e.g., phone number → area)
  - easily obtained external features (e.g., holidays)
- Create a processing strategy (feature → what is required)
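The data checks and derived features above can be sketched in plain Python. This is only an illustration: the record layout, the `HOLIDAYS` set, and the helper names are hypothetical, and treating the first block of digits in a phone number as an area is an assumption made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records; the field names are assumptions for illustration.
records = [
    {"phone": "212-555-0101", "day": date(2024, 12, 25), "label": 1},
    {"phone": "415-555-0102", "day": date(2024, 3, 4), "label": 0},
    {"phone": None, "day": date(2024, 7, 4), "label": 0},
    {"phone": "212-555-0103", "day": date(2024, 5, 6), "label": 0},
]

# Assumed holiday calendar; a real project would load one for its region.
HOLIDAYS = {date(2024, 7, 4), date(2024, 12, 25)}


def label_balance(rows):
    """Return the share of each label so imbalances are visible."""
    counts = Counter(r["label"] for r in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def missing_counts(rows):
    """Count missing (None) values per field."""
    missing = Counter()
    for r in rows:
        for field_name, value in r.items():
            if value is None:
                missing[field_name] += 1
    return dict(missing)


def add_features(row):
    """Derive an area prefix from the phone and a holiday flag from the date."""
    phone = row["phone"]
    return {
        **row,
        "area": phone.split("-")[0] if phone else None,
        "is_holiday": row["day"] in HOLIDAYS,
    }


enriched = [add_features(r) for r in records]
print(label_balance(records))
print(missing_counts(records))
```

Keeping each derived feature in one small function also gives a natural place to record what that feature requires, which is the start of a processing strategy.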
Formulate Your Problem as an ML Problem
- Articulate your problem as an optimization problem
- Think about potential bias
- Frame your problem – classification, regression, anomaly detection, etc.
- Choose 1-3 initial features
- Test the ability to learn: correlations, Predictive Power Score (PPS)
  - also check for noisy labels, ability to generalize, and enough examples
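A quick way to test whether the first few features carry any signal is to rank them by correlation with the target. The sketch below uses a hand-written Pearson correlation on made-up columns; the feature names and values are hypothetical. Pearson only measures linear association, so a low score here does not prove a feature is useless.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical candidate features and a binary target.
features = {
    "visits": [1, 2, 3, 4, 5, 6],
    "constant": [7, 7, 7, 7, 7, 7],
}
target = [0, 0, 0, 1, 1, 1]

scores = {name: pearson(col, target) for name, col in features.items()}
# Rank by absolute correlation: strong negative relations matter too.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
print(ranked, scores)
```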
More points to look at:
- Start with a problem and not the solution
- ML is not always the solution
- From Simple → complex
EDA
Get to know the data
This part is tricky: it is easy to spend precious time here without producing any output.
The most important principle in EDA is to:
And a few more important points:
- Use visualizations to learn about distributions, outliers, and more
- If possible, use interactive visualization with Bokeh or Streamlit
- Write your conclusions after this step in your notebook or .md file
Data Processing
Build the data processing into a reusable pipeline
- This avoids leakage and keeps the transformations consistent
- If the transformation changes between runs, the models cannot be compared
- Save and version transformed data
Finished a transformation?
- Save the processed data and version it
- Compare models only on the same version of the data
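A minimal sketch of a reusable processing step, assuming a single numeric feature: the statistics are learned from the training split only (which is what prevents leakage), the same fitted parameters are applied to every other split, and a short hash of those parameters serves as a version tag. The class name `StandardizePipeline` and the hashing scheme are illustrative choices, not a standard API.

```python
import hashlib
import json
import statistics


class StandardizePipeline:
    """Learns mean/std on training data only, then applies them anywhere."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # guard against zero spread
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def version(self):
        """Short hash of the fitted parameters, usable as a data version tag."""
        payload = json.dumps({"mean": self.mean, "std": self.std}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 20.0]

pipeline = StandardizePipeline().fit(train)  # statistics come from train only
train_t = pipeline.transform(train)
test_t = pipeline.transform(test)            # same parameters, no refit
print(pipeline.version())
```

Two models trained on data carrying the same version tag were fed identically transformed inputs, so their metrics are comparable.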
Modeling
Create a baseline model
- Use autoML tools, or a simple heuristic
- If not – create a simple model such as linear or logistic regression
- Start from a simple feature space
- Look for correlated features
- Remove redundant features with L1 regularization
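As one example of a "simple heuristic" baseline, the sketch below always predicts the most frequent training label. The labels are made up for illustration. Any later model has to beat this number to justify its complexity.

```python
from collections import Counter


def fit_majority_baseline(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for illustration.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0]

majority = fit_majority_baseline(train_labels)
baseline_preds = [majority] * len(test_labels)
baseline_acc = accuracy(test_labels, baseline_preds)
print(majority, baseline_acc)
```

On imbalanced data this baseline can look deceptively strong on accuracy, which is one reason to agree on the evaluation metrics up front.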
Design experiments
- Small-stepped, simple, well-defined experiments
- Document the “why”
- Look at metrics, assumptions, limitations, and state
- Set a research goal
- Make a hypothesis
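One lightweight way to keep experiments small and documented is a structured record written before training starts. The sketch below is an assumption about how such a record could look; the `Experiment` class, its fields, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Experiment:
    """One small, well-defined experiment and the reasoning behind it."""
    name: str
    goal: str
    hypothesis: str
    why: str
    metric: str
    assumptions: List[str] = field(default_factory=list)
    limitations: List[str] = field(default_factory=list)
    result: Optional[float] = None  # filled in only after the run


# Hypothetical example entry.
exp = Experiment(
    name="exp-001-add-holiday-flag",
    goal="Improve recall on the minority class",
    hypothesis="A holiday flag explains part of the missed positives",
    why="Error analysis showed many false negatives around holidays",
    metric="recall",
    assumptions=["Labels are consistent across sources"],
)

# One JSON line per experiment is easy to append to a log file and diff later.
log_line = json.dumps(asdict(exp), sort_keys=True)
print(log_line)
```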
Train a model
- Evaluate the results and analyze
- Check for failures
- Visualize results
- Think about solutions
- Look at false positives and false negatives, and ask why the model failed
- Visualize the mistakes where possible
- Look for noisy features
- Think of a better model for the task
- Refine the hypothesis and repeat
- Better results
  - Hitting the “good enough” outcome – move on
  - If not – add more features, more data, more complexity
- Worse results
  - Debug the model
  - Test another direction
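The failure analysis above starts with separating the two kinds of mistakes so each group can be inspected on its own. A minimal sketch for binary labels, with made-up examples and predictions:

```python
def split_errors(examples, y_true, y_pred, positive=1):
    """Return (false_positives, false_negatives) as lists of examples."""
    false_positives, false_negatives = [], []
    for example, truth, pred in zip(examples, y_true, y_pred):
        if pred == positive and truth != positive:
            false_positives.append(example)
        elif pred != positive and truth == positive:
            false_negatives.append(example)
    return false_positives, false_negatives


# Hypothetical examples, labels, and model predictions.
examples = ["a", "b", "c", "d", "e"]
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0]

fps, fns = split_errors(examples, y_true, y_pred)
print("false positives:", fps)
print("false negatives:", fns)
```

Reading through each group by hand often points to noisy labels or a missing feature faster than another aggregate metric would.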
After that
- Optimize hyperparameters
- Write smoke tests
- Write unit tests (optional)
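A smoke test only checks that the model runs end to end and returns sane output; it says nothing about quality. In the sketch below, `predict` is a hypothetical stand-in for a trained model, and the score range of [0, 1] is an assumption.

```python
def predict(rows):
    """Hypothetical stand-in for a trained model: one score in [0, 1] per row."""
    return [min(1.0, max(0.0, 0.1 * sum(row))) for row in rows]


def smoke_test(predict_fn):
    """Fail fast if the model breaks on basic, including extreme, inputs."""
    rows = [[0.0, 0.0], [1.0, 2.0], [50.0, 50.0]]
    scores = predict_fn(rows)
    assert len(scores) == len(rows), "one prediction per input row"
    assert all(isinstance(s, float) for s in scores), "scores must be floats"
    assert all(0.0 <= s <= 1.0 for s in scores), "scores must stay in [0, 1]"
    return True


print(smoke_test(predict))
```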
Production
- When?
  - The model reaches “good enough”
- Next phases
  - Clean and structure the code
  - Simplify the input and processing
  - Optimize your model for serving
    - optimize performance (if required)
    - optimize for hardware
  - Wrap the model
    - as a REST API or other serving option
    - Dockerize if required
  - Design monitoring
    - Model performance (evaluation metrics) – e.g., whether the user clicked or not
    - Resource tracking – CPU/GPU
    - Connect to an automated alerts/reports system
  - Automated pipeline
    - Define a retraining policy
    - Execute
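The monitoring and retraining items above can be tied together with a rolling metric and a threshold. The sketch below tracks the click rate over a sliding window and flags when it drops below a floor. The class name, window size, and threshold are hypothetical, and a real system would send the flag to its alerting or retraining pipeline rather than just return it.

```python
from collections import deque


class ClickRateMonitor:
    """Rolling click rate over the last `window` served predictions."""

    def __init__(self, window=100, min_rate=0.05):
        self.events = deque(maxlen=window)  # old events drop off automatically
        self.min_rate = min_rate

    def record(self, clicked):
        self.events.append(1 if clicked else 0)

    def rate(self):
        """Current click rate, or None before any event is recorded."""
        return sum(self.events) / len(self.events) if self.events else None

    def should_alert(self):
        """Alert only once the window is full, to avoid noisy early alarms."""
        full = len(self.events) == self.events.maxlen
        return full and self.rate() < self.min_rate


monitor = ClickRateMonitor(window=10, min_rate=0.2)
for clicked in [True, False, False, True, False, False, True, False, False, False]:
    monitor.record(clicked)
healthy_rate = monitor.rate()
healthy_alert = monitor.should_alert()

for _ in range(10):  # simulate a stretch with no clicks at all
    monitor.record(False)
degraded_alert = monitor.should_alert()
print(healthy_rate, healthy_alert, degraded_alert)
```

The same flag can double as a simple retraining trigger, for example retraining once the alert has fired on several consecutive checks.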