
5 steps to start becoming a Machine Learning Engineer

16 mins read

Step 1: Adjusting Your Mindset

Whenever I lead my workshops I always get a lot of questions afterward from developers who want to get started in machine learning but feel stuck.

Usually, the only thing holding them back is a self-limiting belief. I’m going to go over the most common of these beliefs and ways to get past them. Once you get past these mental blocks, there should be nothing stopping you from moving forward in your goals.

Self-limiting beliefs

Waiting to get started

The limiting belief that held me back the most was waiting to get started. I was always pushing back my first project because I felt like I always needed to finish one more thing. Whether that thing was reading a research paper or completing some online course there was always something else for me to work on before I could move forward.

Some common thoughts from students I teach are:

  • I need to get a degree first
  • I need to complete a course
  • I need to know statistics and probability
  • I need to master Python or R
  • I need to learn linear algebra

All of this is bullshit. You can get started right now and run your first regression or classifier. Sure it might not work very well or have some issues, but getting the ball rolling is more important than being perfect. Plus when you complete projects you always get a better understanding of your weaknesses and will know what to focus on the next project!

There is no reason not to get started now.
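To show how little you need to get going, here is a minimal sketch of a first regression. It assumes NumPy is installed, and the data is synthetic and made up purely for illustration; swap in any small dataset you like.

```python
import numpy as np

# Synthetic data (made up for illustration): y is roughly 3*x + 2 plus noise.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

# Fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict the value at a new point.
x_new = 4.0
y_pred = slope * x_new + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=4: {y_pred:.2f}")
```

It is not perfect and it is not meant to be. The point is that a working model is about ten lines away.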

My first experience with machine learning was Andrew Ng’s Coursera class. Check it out.

Lack of self-confidence

One of the most common limiting beliefs is a lack of self-confidence in one’s ability to learn and apply machine learning techniques to real problems.

In my experience the most common of these beliefs are:

  • All data scientists and machine learning engineers have a Ph.D. Since I don’t have one, I can’t do it.
  • If I start machine learning, I will end up failing and making a fool of myself
  • I’m just not smart enough to learn this

“A river cuts through rock, not because of its power, but because of its persistence”

James N. Watkins

Everyone starts out sucking at something, but it is through hard work and persistence that people build skills and get better. You might not feel very smart when you first start, but you can’t let that stop you from trying.

Waiting for the perfect time

Another form of procrastination I see has to do less with lack of knowledge and more to do with lacking time or perfect conditions.

This usually takes the form of the following thoughts:

  • My computer is not good enough to build machine learning applications on
  • I’m just a student now
  • I’m not a very good programmer
  • I’m too busy with work
  • I don’t have enough time
  • I don’t have enough experience

It takes a ton of time and effort to become very good at machine learning; however, getting started can be as easy as spending 5 hours a week putting together a small project. The best time to plant a tree was 10 years ago. The second-best time is now!

Here is a list of 8 projects you can get started on today.

Have tried and failed in the past

The fourth self-limiting belief that I see in my students is the feeling that they will fail now because they have failed in the past. This belief couldn’t be further from the truth.

People with this limiting belief often say to me:

  • I can’t understand X
  • I don’t know what to do next
  • I feel overwhelmed
  • My program is not working
  • I won’t be as good as Y

Nowadays there are many new tools and courses that help new machine learning engineers. Machine learning is hard but no harder than other technical skills like programming. You have to put in the long hours and hard work to build the skills and experience needed to get good.

My advice is not to take on large projects that can overwhelm you. Start small and build from there.

What is holding you back?

Have you ever had these self-limiting beliefs? How did you get past them? Understanding your feelings is the first step to changing your actions to match your goal of becoming a machine learning master.

Step 2: Pick a Process

After a few applied machine learning problems, you usually develop a pattern or process for quickly getting started and achieving good results. Once you have this process, it is trivial to use it again and again on project after project. The more developed your process, the faster you can get results!

Define the problem

This step is all about learning more about the problem at hand. Familiarize yourself with the domain and understand why you are building this solution. To help with this, always ask yourself the questions below.

What is the problem? Describe the problem formally and informally. Make sure you list the assumptions you are making and any similar problems.

Why does the problem need to be solved? List any motivations for solving the problem. What are the benefits a solution brings and how would you use it?

How would I solve the problem? Describe how the problem would be solved manually to build up domain knowledge.

Prepare Data

Do you understand the data you have been given? Lots of people skip over this step because it is often tedious, but it is super important. This work forces you to think about the data in the context of the problem before it gets lost in the craziness of algorithms.

Data Selection: Consider what data is available to you. Is there any data missing? Can you remove any data?

Data Preprocessing: Organize your selected data. Format it, clean it, and take a sample from it.

Data Transformation: Process your ready data for machine learning by engineering its features using scaling, attribute decomposition, and attribute aggregation.
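The three transformations named above can be sketched in a few lines. This is only an illustration: the records, column meanings, and variable names are hypothetical, and it assumes NumPy is installed.

```python
from datetime import date

import numpy as np

# Hypothetical raw records: (signup date, purchases in Jan, Feb, Mar, age).
records = [
    (date(2020, 1, 15), 2, 0, 1, 23),
    (date(2020, 6, 3), 5, 4, 6, 41),
    (date(2021, 2, 27), 0, 1, 0, 35),
]

# Attribute decomposition: split one complex attribute (a date) into simpler parts.
months = np.array([r[0].month for r in records], dtype=float)
weekdays = np.array([r[0].weekday() for r in records], dtype=float)  # Monday = 0

# Attribute aggregation: combine several attributes into a single one.
total_purchases = np.array([r[1] + r[2] + r[3] for r in records], dtype=float)

ages = np.array([r[4] for r in records], dtype=float)

# Scaling: standardize every column to zero mean and unit standard deviation.
features = np.column_stack([months, weekdays, total_purchases, ages])
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.round(2))
```

Standardization is just one scaling choice; min-max scaling to the range 0 to 1 is another common one.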

Explore different Algorithms

Now that you have your data it’s time to try out a bunch of different standard machine learning algorithms. Typically, you would run 10–20 standard algorithms on the transformed and scaled versions of the dataset you prepared in the last step.

The main goal of trying all of these different algorithms and dataset combinations is spreading your net far and wide. See what works and what doesn’t then go from there. More detailed explorations will follow with well-performing algorithms.
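Here is a rough sketch of what this kind of spot-checking can look like. It assumes scikit-learn is installed (the library choice is mine, not a requirement of the process), uses the copy of the Wine dataset that ships with scikit-learn, and tries only four algorithms to keep things short.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# A handful of standard algorithms, all with default or near-default settings.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm_rbf": SVC(),
}

results = {}
for name, model in candidates.items():
    # Scale inside the pipeline so each fold is scaled using its own training data.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)
    results[name] = scores.mean()

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking matters more than the exact numbers: the top one or two candidates are the ones worth a more detailed look.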

Improve Results

After you have finished exploring the different algorithms and picked one that works well for your dataset it is time to squeeze out the best results from it. You can do this in a few ways, but it’s important to make sure that your results are significant at this point because hyper-parameter tuning isn’t going to turn a crap result into a good result. It will just help you squeeze out a bit more performance.

Here are some standard ways to improve an already working algorithm.

Hyper-parameter Tuning: All algorithms have hyper-parameters and making sure these are optimal is key to getting the best performance.

Ensemble Methods: Predictions are made by combining multiple models.

Extreme Feature Engineering: The attribute decomposition and aggregation seen in data preparation are pushed to the limits.
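As an example of the first item, hyper-parameter tuning, here is a minimal grid search sketch. It assumes scikit-learn is installed; the k-nearest-neighbors model, the bundled Wine dataset, and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hyper-parameter values to try; the "knn__" prefix targets the pipeline step.
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

# Evaluate every combination with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Expect a modest gain over the defaults, not a miracle, which is exactly the point made above.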

Present Results

The results of a complex machine learning problem are often meaningless in a vacuum. It’s important to put them in context. This typically means a presentation to stakeholders. This applies to big meetings with CEOs and online competitions. It’s good practice and gives everyone involved a good understanding of the problem and how you solved it.

Here is a quick template for you to present your results:

Why: Define the environment that the problem exists in and set up a motivation for the solution.

Question: Describe the problem as a question that you went out and answered.

Solution: Concisely describe the solution as an answer to the question you just posed.

Findings: List out all of the discoveries you made while solving the problem.

Limitations: Clearly go over the limitations of the model. What is it not good at, and what can be done better?

Conclusions: Go back to the why, question, and solutions and tie them together in a way that makes them easy to remember.

Remember that this is not the be-all and end-all of processes, but it is a good step towards becoming a machine learning engineer.

Step 3: Pick Your Tool

As engineers, we spend our careers learning to use new and better tools so that we can build products that bring real value. The tools we use can change often, but they all serve a function and have use cases. Even though machine learning in its modern form is very new, as industries go, there is a wide variety of tools at an engineer’s disposal. At the end of this article, you should have a greater understanding of some available tools and whether any of them is right for you.

Main Tools

WEKA | Waikato Environment for Knowledge Analysis

WEKA is a modern machine learning workbench with many great features every ML engineer will need to explore data and apply algorithms. All of this functionality comes without having to write a single line of code. Whether you are a programmer or not, I recommend trying out different problems with WEKA.

I also use it a lot when familiarizing myself with new datasets, which should be part of everyone’s problem-solving process (Becoming a Machine Learning Engineer | Step 2: Pick a Process). The built-in algorithms, graphing tools, easy data import, and easy-to-use GUI let me explore data much faster than writing a quick script in Python.

WEKA Download | Great WEKA tutorial

Python + Libraries

Python is an accessible programming language and the fastest growing right now in terms of users, documentation, and libraries. With amazing libraries such as NumPy, SciPy, TensorFlow, Pandas, Flask, and many more, you can do just about anything you want with Python.

The only downside to Python is that it can be slower than other languages when you don’t leverage libraries efficiently. Take matrices and NumPy, for example: you should very rarely have to use loops to modify values, and if you must, keep them to an absolute minimum.
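To make the point about loops concrete, here is a small sketch that performs the same update twice, once with explicit Python loops and once with a single vectorized NumPy call. The matrix and the "double everything above 5" rule are made up for illustration.

```python
import numpy as np

matrix = np.arange(12, dtype=float).reshape(3, 4)

# Loop version: visits every element from Python, which is slow on large arrays.
looped = matrix.copy()
for i in range(looped.shape[0]):
    for j in range(looped.shape[1]):
        if looped[i, j] > 5:
            looped[i, j] = looped[i, j] * 2

# Vectorized version: one call, with the element-wise work done inside NumPy.
vectorized = np.where(matrix > 5, matrix * 2, matrix)

print(np.array_equal(looped, vectorized))
```

On a 3 by 4 matrix the difference is invisible, but on arrays with millions of elements the vectorized form is typically much faster.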

University of Michigan Python Coursera Course

R

R is a workhorse for statistical analysis and, by extension, machine learning. It is a platform to understand and explore your data using statistical methods and graphs. On top of that, it has a large number of machine learning algorithms, including advanced implementations written by the developers of the algorithms themselves.

It is a great tool for all developers, but if you’re new to machine learning, I would suggest going deep with Python. Not only can you develop one-offs, but you can also bring your project to production, which is much harder using just R.

Johns Hopkins R Coursera Course

Other great tools

Matlab/Octave

MATLAB and Octave are excellent for representing and working with matrices. These tools are very popular in universities, and they are easy for non-programmers to get into, but you can ignore them if you plan to practice machine learning in the real world.

Matlab Download | Octave Download

C

In production environments, it is common to see algorithms prototyped in R or python and then implemented in C for execution speed and system reliability. If you plan to get serious about implementing machine learning algorithms it might make sense for you to focus on C.

C tutorial

What tools should you use?

Python and WEKA 🙂

Step 4: Practice, Practice, and More Practice

The best way to pick up essential machine learning skills fast is to practice with small, easy-to-understand datasets. This technique helps you build your process using interesting real-world data that is small enough for you to look at in Excel or WEKA. In this article, you will learn about a high-quality repository with plenty of datasets and some tips to help you focus your time on what matters to you!

Why practice with datasets?

Following online tutorials will keep you trapped in a dependent mindset that will limit your growth, because you’re not learning HOW to solve any problem. You’re learning how to apply a specific solution to a particular type of problem. It’s the equivalent of overfitting, which we all know leads to poor real-world performance. If you’re interested in becoming a machine learning engineer, you need to make sure you can generalize to real data. Challenge yourself every day and attack problems using a defined process. Practicing your skills using datasets is the best way to do this.

Where do I get datasets?

Luckily for everyone, there is a fantastic repository of machine learning problems that you can access for free.

UCI Machine Learning Repository

The Center for Machine Learning and Intelligent Systems at the University of California, Irvine built the UCI Machine Learning Repository. For 30 years, it has been the place to go for machine learning researchers and students who need datasets to practice on. You can download all of the available datasets from their webpage. They also list the details of each dataset, including any publications that have used it, which is really useful when you want to learn how researchers attacked the problem. The datasets can also be downloaded in a few different formats (CSV/TXT).

There are only two downsides to the UCI datasets.

  1. The most significant downside is that these datasets are cleaned and pre-processed. Cleaning and pre-processing are essential parts of the machine learning process that you will face in your career. Not spending time practicing this skill will hurt you later down the road.
  2. The other downside is that they are small so you won’t get much experience in large-scale projects, but that shouldn’t matter because you guys are new at this! Start small!

Practicing in a Targeted way

How do you go about practicing in a targeted way when there are so many datasets?

An aspiring machine learning engineer should do their best to figure out what their goals are and pick a dataset that will best get them to those goals. I’ve developed some questions you can ask yourself to help narrow down the number of datasets.

  • What kind of problem are you looking to solve?
  • Regression, Classification, Clustering?
  • What sized dataset is it? Tens of data points or millions?
  • How many features does the dataset have?
  • What type of features?
  • What domain is this dataset from?

Figure out what type of datasets you want to focus on to match up with your broader goals. Once you have this, you should be able to filter through the huge number of datasets that are available on the platform.

Example Problems

Don’t worry if you’re not sure exactly what you’re trying to learn. It’s much better not to get stuck trying to find the perfect study plan. I’ve made a list of some datasets that you might find interesting. There are a few types of problems here so give them all a shot.

Regression: http://archive.ics.uci.edu/ml/datasets/Wine+Quality

Clustering: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Classification: http://archive.ics.uci.edu/ml/datasets/Wine

Health Classification: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
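If you want to look at one of these without downloading anything, scikit-learn happens to bundle a copy of the UCI Wine classification dataset. The sketch below assumes scikit-learn and NumPy are installed and simply answers a few of the targeted-practice questions from earlier (how big, how many features, what kind of problem).

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# How big is the dataset, and how many features does it have?
n_samples, n_features = X.shape

# What kind of problem is it? Count the examples in each class.
class_counts = {
    str(name): int(count)
    for name, count in zip(wine.target_names, np.bincount(y))
}

print(f"samples: {n_samples}, features: {n_features}")
print("feature names:", list(wine.feature_names))
print("examples per class:", class_counts)
```

For the other datasets in the list, you would download the files from the repository pages linked above and load them yourself.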

If you’re serious about self-study, consider making a modest list of datasets you want to investigate further. Follow the targeted practice plan to build a valuable foundation for diving into more complex and exciting machine learning problems.

Step 5: Build a Portfolio

This article is the last in the “Becoming a Machine Learning Engineer” series. I started writing with the goal of improving my writing and learning through creating. My biggest fear was that everything I wrote would go unnoticed. Lucky for me, the exact opposite happened. People engaged, commented, clapped, shared, and followed me and my writing. I’m humbled by this success and have decided to keep writing in hopes that I can help a new generation of machine learning engineers on their journey of learning. Thank you.

Becoming a machine learning engineer is not a trivial task. It takes lots of hard work and patience to go from nothing to building systems that learn from data. If you have gotten this far in the series then congratulations, you are where I was a few months ago, but if you are like me, then you are not quite sure how to showcase the skills you have built up over many months of dedicated practice.

The single most productive thing you can do to showcase your skills is to build a portfolio of your work. A high-quality portfolio can showcase:

  • Ability to communicate
  • Technical competence
  • Ability to reason through problems
  • Motivations and ability to take initiative

These are all things that employers want to see when they are deciding who to hire. Unfortunately, not many people put together a portfolio in a way that will showcase this. When putting together your first machine learning portfolio, there are five things to keep in mind. These five things are guidelines that will ensure your portfolio gives you the biggest bang for your buck.

Keep Projects Small

Size does matter in your portfolio. While a big project is flashy, it comes with risks and costs. A small project likely won’t take longer than 20–40 hours to complete, and if it doesn’t work out, your loss is much smaller than the multi-month time sink that some large projects can become.

Complete Projects

The only thing worse than having no portfolio is having a portfolio filled with half-done projects. It screams to the world that you are not capable of finishing what you started and should be avoided at all costs.

Independent Projects

Machine learning can be applied to nearly every field, and you should showcase this in your portfolio by completing projects that stand on their own rather than extending previous work. Ignore this if you know exactly what field you want to get good at and are looking to showcase your expertise.

Novel Projects

Many students in my classes make the same mistake. They have followed many tutorials and feel like they should be able to do anything. Then they try to complete my take-home projects and fail. Following online tutorials is not a way to learn well; completing novel projects from start to finish is. If I see a portfolio with many tutorial-type projects in it, I throw out the whole resume.

Easily Accessible

Make your portfolio available online for all to see. The more people find, read, and comment on your work, the better. Not only will it be a way for future employers to find you, but you might also get great feedback on your projects. The easiest way to do this is to host everything in a git repository with a substantial README.

If you’re like me and learn best from example then take a look at this data science portfolio.

Take away

Having a portfolio of your work is quickly becoming an essential part of hiring new Machine Learning engineers. It has been a part of the software engineers’ hiring process for a while now. Get started now, dig up old projects or build new ones, put them together, and write a few reports on what you learned from the projects.

Source:

https://medium.com/ai%C2%B3-theory-practice-business/get-started-in-machine-learning-in-5-steps-f4791a3efd5d

Amir Masoud Sefidian
Machine Learning Engineer
