About once a month, a student of some type emails me asking how to get into data science. I've answered often enough that I decided to write it out here so I can link people to it. So if you’re one of those students, welcome!
I'll split this into basic advice, which you can find quite easily by googling 'how to get into data science', and less common advice that I've found very useful over the years. I'll start with the latter and then move on to the basics. Obviously, take this with a grain of salt, as all advice comes with a bit of survivorship bias.
Less Basic Advice:
1. Find a solid community
If you’re at a university, half the point of being there is to find smart, ambitious, and motivated people like yourself to learn and grow with. At my alma mater, that community was the Data Science and Informatics club. Communities and networks help you get started, keep you motivated, and are key to landing internships and full-time offers in the long term.
2. Apply Data Science to Things you Enjoy
Getting good at anything is difficult (duh), and applying data science to a field or area you care about helps you stay motivated and stand out. A couple of my own examples: using student government elections at UF (my alma mater) to learn about machine learning approaches, and tracking my friends' Elo scores by recording our games of ping pong. These projects taught me essential skills without feeling like work.
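To give a sense of how small a project like the ping pong tracker can start, here is a minimal sketch of the standard Elo update in Python. The K-factor of 32, the starting rating of 1000, and the player names are assumptions for illustration, not details from the original project.

```python
from collections import defaultdict

K_FACTOR = 32          # assumed step size; larger values react faster to results
START_RATING = 1000.0  # assumed rating for a player with no recorded games


def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(ratings, winner, loser, k=K_FACTOR):
    """Shift rating points from loser to winner after one game."""
    r_w, r_l = ratings[winner], ratings[loser]
    # The winner gains more when the win was less expected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta
    return ratings


def rate_games(games, start=START_RATING, k=K_FACTOR):
    """Replay a list of (winner, loser) pairs in order and return final ratings."""
    ratings = defaultdict(lambda: start)
    for winner, loser in games:
        update_ratings(ratings, winner, loser, k)
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical game log: each tuple is (winner, loser).
    log = [("ana", "ben"), ("ben", "cy"), ("ana", "cy")]
    for player, rating in sorted(rate_games(log).items(), key=lambda kv: -kv[1]):
        print(f"{player}: {rating:.1f}")
```

Because every game moves the same number of points from the loser to the winner, the total rating pool stays constant, which makes for an easy sanity check once you start recording real games.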
Getting useful practice that is representative of the job you want to perform in the future is crucial because out of this practice you can only get one of two things:
a. The realization that you don't actually like this type of data science, in which case you should stop reading immediately.
b. Valuable experience that you can easily write about (blog) or talk about (to people who want to pay you money).
This brings me to my next point.
3. Minimize the ‘Clicks to Proof of Competence’
Recruiters will spend 15 seconds on your resume, and potential teams will spend 1-5 minutes (at most) on your resume plus website/GitHub (on average, visitors to my portfolio site spend 2 minutes and 16 seconds before moving on). Both groups often use proxies for competence like GPA, school quality, or data experience at a tech firm (I call these proof of status). As a result, you should think very carefully about how long it takes to signal to the reader that you can do the job they’re hiring for. A rough metric for this is Clicks to Proof of Competence.
If the recruiter has to click on the right repository in your GitHub and then click through files until they find a Jupyter notebook with unreadable code (without comments, no less), you’ve already lost. If the recruiter sees Machine Learning on your resume, but it takes 5 clicks to see any ML product or code that you've made, you've already lost. Anyone can lie on a resume; make a point of directing the reader’s attention quickly to proof, and you’ll be in a significantly better spot.
The way I've thought about optimizing for this metric is pretty clear on my website. It takes roughly 10 seconds to skim the text (I would bet that most people don't read it all the way through), and then people can immediately choose a data science project to view; the projects are ordered by how well they show the work I can do. If you're starting off in DS, I would highly recommend making a website (even a Bootstrap template website is fine) and hosting it on GitHub Pages or Heroku with your own domain.
4. Learn Through Research or Entry Level Jobs
After you do those three things, see if you can convince someone to pay you to learn data science. There is a great election data science group at UF that I loved (Dr McDonald and Dr Smith run it currently), but if you interview with any research group, they might pay you for your work. Eventually, with experience like that, you can apply for internships and get paid super well. The key here is not to start out by chasing the incredibly fancy DS internships, but to look locally at companies or research groups that have data science tasks but not enough money to hire a full-time data scientist. Data science learning compounds quickly, so start now! Given all of that, let’s move on to the more basic advice.
Extremely Basic Advice:
Data Science is mostly programming + statistics applied to whatever field you're in, so a background in those two areas is crucial.
1. Statistics
Get a good background in stats as quickly as possible (take classes, learn on your own online). Textbooks will take you far; curiosity will take you farther.
Books/resources:
Naked Statistics (basic, paid)
ISLR (An Introduction to Statistical Learning with Applications in R) (textbook, free)
Statistics and Probability: Khan Academy (basic, free)
2. Programming
Learn either Python or R and get really good at it. Do something new every day, and start spending at least 5-10 hours per week on it as soon as possible. Learn SQL after this. You cannot skip this step.
Books/resources:
R for Data Science (free)
Data Science From Scratch (paid)
Intro to Comp Sci and Programming in Python (MIT Course, free)
3. Business Experience
At P&G, my data science work was applied to retail. At Facebook, to integrity problems. At Protect Democracy, to, uh, democracy. Learning to apply data science in a business context is hard and takes practice, and it often requires a solid understanding of metrics, product analytics, and incentive structures. This fits in very well with #2 from the less basic advice.
Fin
Learning data science is hard, but I’ve found it to be incredibly rewarding. My final offer to you, in exchange for reading to the bottom of this long-ish piece: once you finish applying data science to a problem you’re passionate about and post it somewhere online, DM it to me on Twitter and I promise to read it and retweet it. Good luck!