A Practical Introduction to Data Science
Mon Apr 01, 2013 · 1607 words
Originally cross-posted on the old Zipfian Academy blog.

There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how to position yourself as a competitive applicant. There are far fewer resources about the steps to take to acquire the skills needed to practice this elusive discipline. Here I provide a collection of freely accessible materials to jump-start your understanding of the theory and tools of data science.

Environment

While the emerging field of data science is not tied to any specific tools, certain languages and frameworks have become the bread and butter of those working in it. I recommend Python as the programming language of choice for aspiring data scientists because of its general-purpose applicability, its gentle learning curve, and (perhaps the most compelling reason) the rich ecosystem of resources and libraries actively used by the scientific community.

Development

When learning a new language in a new domain, it helps immensely to have an interactive environment in which to explore and get immediate feedback. IPython provides an interactive REPL that also lets you integrate a wide variety of frameworks (including R) into your Python programs.

Statistics

It was once said that a data scientist is someone who is better at software engineering than any statistician and better at statistics than any software engineer. Statistical inference underpins much of the theory behind data analysis, so a solid foundation in statistical methods and probability serves as a stepping stone into the world of data science.

Courses

While R is the de facto standard for statistical analysis, it has a steep learning curve and is not well suited to some other areas of data science. To avoid learning a new language for a single problem domain, I recommend working through the exercises of these courses with Python and its numerous statistical libraries. You will find that much of R's functionality can be replicated with NumPy, SciPy, matplotlib, and pandas.
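To make that last point concrete, here is a minimal sketch of a few everyday R idioms expressed with pandas and NumPy. The dataset is made up for illustration, and the R calls named in the comments are only rough equivalents, not exact matches in output format.

```python
import numpy as np
import pandas as pd

# Small made-up dataset (illustrative values only).
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Roughly summary(df) in R: count, mean, std, min, quartiles, max per column.
summary = df[["x", "y"]].describe()

# Roughly aggregate(y ~ group, data = df, FUN = mean) in R.
group_means = df.groupby("group")["y"].mean()

# Roughly coef(lm(y ~ x)) in R: a degree-1 least-squares fit.
# np.polyfit returns coefficients from the highest power down.
slope, intercept = np.polyfit(df["x"].to_numpy(), df["y"].to_numpy(), deg=1)

print(summary)
print(group_means)
print("slope:", slope, "intercept:", intercept)
```

For hypothesis tests and distributions, `scipy.stats` covers much of the same ground as R's built-in `stats` package.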

Books

Well written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:

Machine Learning/Algorithms

A solid grounding in computer science and algorithms is essential for an aspiring data scientist. Luckily there is a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills a data scientist can have.

Courses

Books

Data ingestion and cleaning

One of the most underappreciated aspects of data science is the cleaning and munging of data, which is often the most significant time sink during analysis. While there is no silver bullet for this problem, knowing the right tools, techniques, and approaches can help minimize the time spent wrangling data.
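As a small illustration of what this wrangling tends to look like, the sketch below cleans a made-up table with pandas. The column names and values are hypothetical; the point is the typical sequence of normalizing text, coercing types, and dropping unusable or duplicate rows.

```python
import pandas as pd

# Hypothetical raw records with the usual problems: stray whitespace,
# inconsistent case, placeholder strings, an impossible date, a missing name.
raw = pd.DataFrame({
    "name": ["  Alice ", "BOB", "alice", None],
    "age": ["34", "n/a", "34", "27"],
    "signup": ["2013-01-05", "2013-02-30", "2013-01-05", "2013-03-01"],
})

clean = raw.copy()

# Normalize text so that "  Alice " and "alice" compare equal.
clean["name"] = clean["name"].str.strip().str.lower()

# Coerce to proper types; values that cannot be parsed become NaN / NaT
# instead of raising an error.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
clean["signup"] = pd.to_datetime(clean["signup"], format="%Y-%m-%d", errors="coerce")

# Drop rows without a name, then drop rows that are now exact duplicates.
clean = clean.dropna(subset=["name"]).drop_duplicates()

print(clean)
```

Coercing bad values to missing rather than failing is a design choice: it keeps the pipeline running, but you should count the resulting gaps before trusting any analysis built on top.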

Courses

Tools

Visualization

Even the most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and although it is one of the more qualitative aspects of data science, its methods and tools are well documented.
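For a sense of how little code a readable chart needs, here is a minimal matplotlib sketch. The data is synthetic (a noisy straight line generated on the spot), and the labels and styling are my own choices rather than any recommended standard.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic data: a straight line plus Gaussian noise (seeded for repeatability).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit a trend line so the chart shows a summary as well as the raw points.
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, s=15, alpha=0.7, label="observations")
ax.plot(x, slope * x + intercept, color="black", label="least-squares fit")

# Labels, a title, and a legend do most of the communicating.
ax.set_xlabel("x (arbitrary units)")
ax.set_ylabel("y (arbitrary units)")
ax.set_title("Synthetic data with a fitted trend line")
ax.legend()
fig.tight_layout()
```

Calling `fig.savefig("trend.png")` afterwards writes the figure to disk; in an interactive session you can drop the `matplotlib.use("Agg")` line and call `plt.show()` instead.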

Courses

Books

Tutorials

Tools

Computing at Scale

When you start operating on data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To cope with the ever-increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large-scale batch processing since the release of Apache Hadoop, the open-source MapReduce framework, in 2007.
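The idea behind MapReduce fits in a few lines: a map step turns each input record into key-value pairs, the pairs are grouped by key, and a reduce step combines the values for each key. The sketch below runs the classic word count on a single machine in plain Python. It is not Hadoop's API, and the function names are my own; a real framework would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document can be mapped independently, which is what makes
    # the map step easy to parallelize.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts)
```

Because the map calls share no state and each key is reduced on its own, both phases can be split across workers without changing the result.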

Courses

Books

Putting it all together

Data science is an inherently multidisciplinary field, and practicing it proficiently requires a wide range of skills. The necessary curriculum does not fit neatly into traditional course offerings, but as awareness of the need for people with these abilities grows, universities and private companies are creating custom classes.

Courses

Tutorials

Blogs

Conclusion

This just scratches the surface of the deep field of data science, and I encourage you to go out and try it yourself!

Like what you read? Interested in receiving updates? Have a comment?
Follow @jonathandinu 🤗
