In an earlier post on "How to Create an Outstanding Data Science Portfolio," we discussed different ideas on approaching your portfolio and maximizing the chances of finding your first job as a data scientist. One of the best ways to highlight your technical skills is to have a strong presence on GitHub. What would this presence entail?
For this post, we will focus on the four areas where GitHub can help you strengthen your portfolio:
- Project repositories
- Profile README
- Open-source collaboration
- GitHub Pages
Everything below assumes that you have basic familiarity with git and GitHub terminology such as repository, branch, commit, and pull request. If not, check out this guide first and then come back to this post.
As a data scientist, you can make your project repository focused on one or more of the following:
- Dataset collection - it could be a set of useful shell and/or Python scripts that collect interesting data from a website or a web API. Data scientists building their portfolios often overlook this type of project. It's a shame, because there's a real shortage of suitable datasets to work with. For this reason, sharing an interesting dataset (or at least a way to collect it) could be your shortcut to building a good reputation among your peers in the data science community.
- Data storytelling and visualization - a project where you thoroughly analyze a particular dataset, uncover unique insights, and build elegant visualizations. These projects are amazing for practicing one of the most valuable skills for a data scientist — communicating your findings to other people with little to no programming skills or statistical knowledge.
- Machine learning - this type of project would contain code that performs data cleaning, feature engineering, and ML modeling. Here you can either 1) train a model on a unique and interesting dataset that no one has approached before you, or 2) try to use novel techniques and ML algorithms to achieve state-of-the-art performance on some benchmark datasets. Unless your ultimate goal is to have a research-oriented position, we recommend the former approach.
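For the dataset collection type of project, the core script can be quite small. Below is a minimal sketch, assuming a paginated JSON API that returns a list of flat records per page and an empty list when there is nothing left. The endpoint `API_URL` and the function names are hypothetical and only illustrate the shape of such a script.

```python
import csv
import json
from typing import Callable, Iterator
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only; replace it with a real API.
API_URL = "https://api.example.com/items?page={page}"


def fetch_page(page: int) -> list[dict]:
    """Download one page of records from the (hypothetical) JSON API."""
    with urlopen(API_URL.format(page=page), timeout=10) as response:
        return json.load(response)


def collect(
    fetch: Callable[[int], list[dict]], max_pages: int = 100
) -> Iterator[dict]:
    """Yield records page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        records = fetch(page)
        if not records:
            break
        yield from records


def save_csv(records: list[dict], path: str) -> None:
    """Write records to CSV; assumes every record has the same keys."""
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

Passing the fetch function as an argument, as in `save_csv(list(collect(fetch_page)), "items.csv")`, keeps the pagination logic testable without network access.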
Regardless of what type of project you decide to work on, you need to make sure it's well-structured. While a project structure may differ depending on what you are trying to achieve as well as your personal preference, there are some conventions:
- Keeping the source code separate from your Jupyter notebooks and unit tests
- Having a separate directory for all static files and media assets (images, audio files, etc.)
- Having all project configs (such as virtual environment files) at the root of the project
As a good starting point, check out the (somewhat opinionated) "Cookiecutter Data Science" project template, which can help you set up a logical project structure in minutes.
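To make the conventions above concrete, here is a minimal sketch of a script that creates an empty project skeleton. The directory and file names are illustrative assumptions, not a standard, and this is not how any particular template tool works internally.

```python
from pathlib import Path

# Illustrative layout: source code, notebooks, and tests are kept apart,
# media assets get their own directory, and configs sit at the root.
DIRECTORIES = ["src", "notebooks", "tests", "assets/images"]
ROOT_FILES = ["README.md", "requirements.txt", ".gitignore"]


def scaffold(root: str) -> Path:
    """Create an empty project skeleton under `root` and return its path."""
    root_path = Path(root)
    for directory in DIRECTORIES:
        (root_path / directory).mkdir(parents=True, exist_ok=True)
    for name in ROOT_FILES:
        (root_path / name).touch(exist_ok=True)
    return root_path
```

Calling `scaffold("my-project")` is safe to repeat: existing directories and files are left untouched.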
Another important consideration for your project is code quality. Make sure that your code follows best practices for modularity, variable naming, formatting, comments, and documentation. For example, if you are using Python, you can follow either the PEP 8 Style Guide or the Google Python Style Guide. Which one you choose is a matter of taste, but make sure that your code editor/IDE (e.g. Visual Studio Code or PyCharm) automates as many code quality checks as possible, via one of the many available linters such as Pylint, Pyflakes, or pydocstyle.
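As a small illustration of what these practices look like in Python, the sketch below shows one function with descriptive names, type hints, and a docstring in the Args/Returns/Raises layout used by the Google Python Style Guide. The function itself is a made-up example.

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values linearly to the range [0, 1].

    Args:
        values: Non-empty list of numbers.

    Returns:
        A new list where the minimum maps to 0.0 and the maximum to 1.0.
        If all values are equal, every element maps to 0.0.

    Raises:
        ValueError: If `values` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]
```

A linter would typically accept this code as is, while flagging issues such as unused imports, missing docstrings, or overly long lines elsewhere in a project.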
And finally, you must have a clear and well-structured README file at the root of your repository. At the bare minimum, you should specify the project name, description, and setup/usage instructions. Additionally, depending on the project type, you might need to add contribution instructions, credits, table of contents, etc.
To properly format your README file, you'll need to familiarize yourself with some Markdown syntax. Thankfully, there are many Markdown cheat sheets available online, like this one. Many IDEs support WYSIWYG-style Markdown editing. Alternatively, you can install a dedicated Markdown editor like Typora or Mark Text.
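If you would rather start from a skeleton than from a blank file, the bare-minimum sections can be generated with a few lines of Python. This is only a sketch: the function name, the section titles, and the sample project are assumptions made for illustration.

```python
FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


def build_readme(
    name: str, description: str, setup: list[str], usage: str
) -> str:
    """Return Markdown for a minimal README: title, description, setup, usage."""
    setup_commands = "\n".join(setup)
    sections = [
        f"# {name}",
        description,
        "## Setup",
        f"{FENCE}bash\n{setup_commands}\n{FENCE}",
        "## Usage",
        f"{FENCE}bash\n{usage}\n{FENCE}",
    ]
    return "\n\n".join(sections) + "\n"


# Hypothetical project details, used only to show the output format.
readme = build_readme(
    name="City Bike Trips",
    description="Exploratory analysis of a (hypothetical) bike-sharing dataset.",
    setup=["python -m venv .venv", "pip install -r requirements.txt"],
    usage="python src/analyze.py",
)
print(readme)
```

Saving the printed text as `README.md` at the root of the repository gives you a starting point to extend with contribution instructions, credits, or a table of contents.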
There's one very special Profile README file that you can create to give the visitors of your GitHub profile a quick overview of yourself, your skills and various projects you are working on.
If you create a public repository with the same name as your username and add a README file to its root, GitHub will display this README prominently on your profile page, above your pinned repositories.
After that, you can edit this file just like you would any other project README. The only difference is this README is all about you, so fill it with your interests and achievements, and style it according to your personality.
You can pick a header image to be displayed at the top or use emojis, icons, animated GIFs, and videos. GitHub Stats Card can generate a card with your GitHub statistics (like number of stars, commits, PRs, etc.) that you can embed into your README.
If you want to get started quickly without manually editing your Profile README file, you can use one of several Profile README generators such as:
Just enter the info you want, download the generated README and add it to your profile - that's it!
Open-source contributions
Another great way to hone your skills is to contribute to other GitHub projects. It could be a small project that your friend is trying to get off the ground, or a big open-source project, like scikit-learn or pandas, that has been around for many years and has an active community of collaborators.
There are many different ways to do that. Of course, the most common form of contribution is code that either fixes a bug or adds a feature. However, you can contribute in many other ways: add or clarify documentation, report an issue, create a tutorial, etc.
Beginners may even be better off starting with non-code contributions to get to know the project and its community a little better. Over time, you may find a few mentors who will guide you and provide feedback on your work. If open-source contributions are something you are interested in, check out this excellent "How to Contribute to Open Source" guide.
Personal blog on GitHub Pages
Have you been thinking about creating your own data science blog?
Then GitHub Pages is the service for you! GitHub Pages lets you host static websites directly from a GitHub repository.
Of course, as a data scientist, you might not be an expert in web development. That's where static site generators come to the rescue. These generators are essentially opinionated frameworks that come with "batteries included" (if by batteries we mean design themes and layouts). So all you need to do is focus on the data science content.
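To give a rough idea of what a static site generator does under the hood, here is a toy sketch that pours a post's content into a fixed HTML layout. The layout and the `render_post` function are invented for this illustration and do not reflect how any real generator is implemented.

```python
from html import escape
from string import Template

# A stand-in for a theme: one fixed layout with slots for title and body.
LAYOUT = Template(
    """<!DOCTYPE html>
<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
$body
</body>
</html>
"""
)


def render_post(title: str, paragraphs: list[str]) -> str:
    """Return a complete HTML page for one post, escaping special characters."""
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return LAYOUT.substitute(title=escape(title), body=body)
```

Real generators add much more on top of this idea, such as Markdown parsing, themes, and navigation, which is why it usually makes sense to use one instead of writing your own.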
Some of the most popular static site generators are:
This post covers all four major services and products available on GitHub that will help you boost your portfolio.
If you are new to GitHub and maybe only recently created an account there, we would suggest starting with your Profile README, as it will allow you to learn Markdown syntax and get familiar with GitHub's user interface.
After that, you might want to clean up the data science projects that have been living on your personal computer and upload them to GitHub. Don't forget about code quality, documentation, and a README.
Next, set up your blog, try out a few different static site generators and design themes, then start posting!
Or maybe fixing that pesky bug in a Python library that you've been using is more up your alley. So go ahead, fix it, and create a pull request.
Hopefully, we've convinced you that establishing your presence on GitHub is one of the best ways to boost your portfolio.
Learn more about the data science bootcamp and other courses by visiting TripleTen and signing up for your free introductory class.