Zipfian Academy

The Data Science Mindset

Names like ‘R’, ‘SQL’, and ‘D3’ make data science seem more like alphabet soup than a deliberate practice of working with data. It’s so easy to get lost in the sea of acronyms, packages, and frameworks that we often find our students prematurely optimizing for the right toolset to use, unable to move forward until they have researched every available option. In reality, data science isn’t just about the tools. It’s a mindset: a way of looking at the world. It’s about taking advantage of our modern computers and all of the information that they’re already collecting to study how things work and push the limits of human knowledge just a little bit further. We have a favorite saying around here — data is everything and everything is data. If we begin with this mindset, a lot of data science approaches naturally follow.

Store Everything

Storage is cheap. Collect everything and ask questions later. Store it in the rawest form that is convenient, and don’t worry about how or even when you’re going to analyze it. That part comes later.

Use Existing Data

We’re already storing data — let’s use it. When faced with questions, data scientists regularly adapt the query so that it can be approximately answered with an existing and convenient dataset. The best part of data science is discovering surprising applications of existing stores of data. For example, there is a plethora of satellite imagery of Earth. We can use this data to learn about fertilizer use in Uganda, or use pictures of the Earth at night to estimate rural electrification in developing countries.

Connect Datasets

We’re storing everything, all over the world, inexpensively, for the first time in history. There are many lessons to be learned by utilizing more of this treasure trove. Don’t worry about making the best use out of a single source of data. Focus on connecting disparate datasets rather than tuning your models. Conventional statistics teaches a lot about how to choose analysis methods that are appropriate for your data collection approach and how to tune the models for a specific dataset.

Effective data science is about using a range of datasets, connecting the dots between one set of data and another, such as predicting restaurant health scores based on Yelp reviews. In machine learning speak: it’s often better to collect more features rather than spend days optimizing hyperparameters.
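As a rough illustration of connecting two datasets, the sketch below joins review text to inspection scores on a shared business identifier and derives one new feature from the combination. The records, the field names (`business_id`, `score`, `text`), and the `join_on_key` helper are all invented for this example; real sources rarely share such a clean key.

```python
from collections import defaultdict

# Hypothetical toy records; neither the fields nor the values come from real data.
inspections = [
    {"business_id": "a1", "score": 96},
    {"business_id": "b2", "score": 71},
]
reviews = [
    {"business_id": "a1", "text": "Spotless kitchen, friendly staff"},
    {"business_id": "b2", "text": "Sticky tables and a dirty floor"},
    {"business_id": "b2", "text": "Food was fine but the place felt dirty"},
    {"business_id": "c3", "text": "No inspection on record for this one"},
]


def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key (one row per match)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


rows = join_on_key(inspections, reviews, "business_id")

# Derive one simple feature per joined row: does the review mention "dirty"?
for row in rows:
    row["mentions_dirty"] = "dirty" in row["text"].lower()

for row in rows:
    print(row["business_id"], row["score"], row["mentions_dirty"])
```

The join itself is the "plumbing"; once the rows line up, adding another feature is usually a one-line change.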

Anything Can Be Quantified

Our culture loves to quantify. If you can turn it into a number, that number can be put into a table. Importantly, that table can now be processed by a computer.

A spreadsheet about sewer overflows is clearly data to most people, but what about a calendar? At first, a calendar might not seem like the sort of data that you analyze with statistics. However, you can also represent a calendar as a spreadsheet and as a graph.
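To make the calendar idea concrete, here is a minimal sketch that turns a handful of calendar entries into a table (one row per event and attendee) and into a graph (people connected by the meetings they share). The events, names, and dates are made up for illustration.

```python
from itertools import combinations

# Hypothetical calendar entries; names and dates are invented for illustration.
events = [
    {"date": "2013-07-01", "title": "Standup", "attendees": ["Ana", "Ben"]},
    {"date": "2013-07-01", "title": "Design review", "attendees": ["Ana", "Cy"]},
    {"date": "2013-07-02", "title": "Standup", "attendees": ["Ana", "Ben", "Cy"]},
]

# Spreadsheet view: one row per (event, attendee) pair.
table = [
    (event["date"], event["title"], person)
    for event in events
    for person in event["attendees"]
]

# Graph view: an edge between two people for every meeting they share,
# weighted by how many meetings they have in common.
edges = {}
for event in events:
    for pair in combinations(sorted(event["attendees"]), 2):
        edges[pair] = edges.get(pair, 0) + 1

for row in table:
    print(row)
print(edges)
```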

Data science becomes a creative endeavor when you peel away the obvious variables presented to you. Maybe you have a bunch of PDF documents. You could easily extract the text in the PDFs and search through the content. Depending on the problem you are solving, these files hold more interesting information than just the text. You can get the page count, the file size, the shapes of the pages, and the program that created each file. There is information hidden in many datasets that goes beyond what’s immediately obvious.

There is a lot of talk about the difference between different kinds of data. There’s “qualitative” vs. “quantitative” and “unstructured” vs. “structured.” To me, there isn’t much difference between “qualitative” and “quantitative” data, nor is there between “unstructured” and “structured” data because I know that I can convert between the different types.

At first, the registration papers of a company might not seem like interesting data. They begin as paper, most of the fields are text, and the formats aren’t particularly standardized. But when you put them in a database in a machine-readable format, qualitative data becomes quantitative data that can be used to supplement other data sources.
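A small sketch of that conversion, assuming the paper forms have already been typed into records: text categories become counts, dates written as text become numbers, and even a name yields a numeric feature. The companies and field names below are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical registration records typed in from paper forms.
registrations = [
    {"name": "Acme Bakery LLC", "type": "LLC", "registered": "2009-03-14"},
    {"name": "Bay Robotics Inc", "type": "Corporation", "registered": "2012-11-02"},
    {"name": "Corner Cafe LLC", "type": "LLC", "registered": "2012-01-20"},
]

# Free-text categories become counts.
type_counts = Counter(r["type"] for r in registrations)

# Dates written as text become numbers we can do arithmetic on.
years = [date.fromisoformat(r["registered"]).year for r in registrations]
per_year = Counter(years)

# Even the name field yields a numeric feature.
name_lengths = [len(r["name"]) for r in registrations]

print(type_counts)
print(per_year)
print(name_lengths)
```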

Send Boring Work to Robots

We no longer live in an era where “computer” refers to someone who carries out calculations. Find yourself doing something over and over? Give it to the bots. As far as data analysis goes, modern computers can be far more effective at rote tasks, such as drawing new graphs with every update of a dataset.

Data collection is a prime example of a task that should be automated. A common scene in university research labs is swaths of grad students handing out paper questionnaires to participants of studies. The data scientist says: collect the data automatically and unobtrusively, using existing systems whenever possible. The supercomputers we carry in our pockets are a great place to start.

This mindset can be applied not only to the data, but also to the process itself. Rather than learning and remembering your entire analysis process, you can write a program that does the whole thing for you, from the original acquisition of the data, to the modeling, to the presentation of results to another person. By making everything a program, you make it easier to find mistakes, to update your analyses, and reproduce your results.
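One way to picture "everything is a program" is a pipeline of small functions that runs from acquisition to presentation in a single call. The sketch below is only an outline under made-up assumptions: the inline CSV stands in for a real download, the "model" is just summary statistics, and every function name is hypothetical.

```python
import csv
import io
import statistics

# Stand-in for a downloaded file; in practice acquire() might read from disk
# or fetch a URL. The column names and values are invented.
RAW_CSV = """name,score
Cafe A,98
Diner B,85
Grill C,100
"""


def acquire():
    """Load raw rows from the (hypothetical) source."""
    return list(csv.DictReader(io.StringIO(RAW_CSV)))


def clean(rows):
    """Convert text fields into typed values."""
    return [{"name": r["name"], "score": int(r["score"])} for r in rows]


def model(rows):
    """A deliberately trivial 'model': summary statistics of the scores."""
    scores = [r["score"] for r in rows]
    return {"n": len(scores), "mean": statistics.mean(scores), "min": min(scores)}


def present(summary):
    """Render the result for a human reader."""
    return (
        f"{summary['n']} restaurants, "
        f"mean score {summary['mean']:.1f}, lowest {summary['min']}"
    )


def run_pipeline():
    """Run every step in order so the whole analysis can be repeated."""
    return present(model(clean(acquire())))


report = run_pipeline()
print(report)
```

Because each stage is a function, a mistake in cleaning or a new data source means changing one step and rerunning the whole thing.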


Once inside the data science mindset, solving interesting problems becomes a function of data acquisition and processing. Computers can fit models and make predictions about datasets that are too big to wrap your head around, and they can convert paper documents into electronic tables. They probably know more about you and your habits than you know yourself! Use the tools available to you, but don’t get caught up on the tools themselves.

Properly discussing these relevant tools is another post (maybe a book), but here’s one thought. While it always helps to have more education, you don’t need a PhD in math or computer science in order to create useful things. Loads of wonderful algorithms have already been implemented for you, and simple algorithms often work quite well. If you’re just getting started, focus on the “plumbing” that connects different datasets and systems together.

Data Science Mindset at Zipfian Academy

Our course teaches many data science tools, but we also teach the data science mindset, because you need both to be a great data scientist. To this end, we organize our 12-week course by projects (such as a recommendation engine or spam filter) rather than software packages or algorithms. We teach the various tools in the context of applied projects so students learn how to choose the appropriate tool and how to build the plumbing that connects them.

In the end, it’s not about the newest, trendiest framework or fastest data analysis platform. It’s about finding interesting insights in your data and sharing them with the world. Start small, get your hands dirty, and have fun!

How to Data (Science): Mapping SF Restaurant Inspection Scores

Are you a company or data scientist that would like to get involved? Give us a shout!

If this post excites you, I encourage you to apply to our 12 week immersive bootcamp (applications close August 5th) where you will learn data science through hands-on exercises and real world projects!

In our previous post, we outlined the best data science resources we have found online. In this post, we’ll walk through our data science process by analyzing the inspections of San Francisco restaurants using publicly available data from the Department of Public Health. We will explore this data to map the cleanliness of the city and gain perspective on the relative meaning of these scores through statistics. Along the way, we use a spectrum of powerful data science tools (from the UNIX shell to pandas and matplotlib) and share some tips and data tricks.

While the health inspection scores are based on a fixed scale (i.e. a threshold for health quality) where each restaurant can be considered an independent random variable, we think there is value in looking at how they are distributed. This does not actually assess the chance of foodborne illness or the quality of food; it simply looks at the scores from an exploratory perspective.


  • Understand the Data Science Process

  • Learn about essential tools (UNIX, Python and associated libraries)

  • Be inspired by Open Data and our role as data citizens

All of the code is contained in an IPython notebook and can be viewed or downloaded from Github.


When we analyzed the data, we found that the most common score was a perfect 100. Interestingly, the distribution is heavily skewed towards high scores (mean of 92, 75th percentile of 98, 25th percentile of 88), with a long tail of restaurants with very low scores.
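For readers who want to compute the same kind of summary, here is a minimal sketch using Python’s standard `statistics` module. The scores are made up so the example is self-contained; they are not the San Francisco data, and the notebook itself may compute these figures differently.

```python
import statistics

# Made-up inspection scores for illustration; these are not the real SF data.
scores = [100, 100, 100, 98, 96, 94, 92, 90, 88, 85, 72, 55]

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points

print(f"mean={mean:.1f} median={median} mode={mode}")
print(f"25th percentile={q1}, 75th percentile={q3}")

# A mean below the median is one quick hint of a long tail of low scores.
long_low_tail = mean < median
print("long tail of low scores:", long_low_tail)
```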

Plotting the data geographically, we find a large concentration of restaurants with scores below 70 in Chinatown and Civic Center, putting them in the bottom tenth of all scores. The contrast between the Financial District and Chinatown is quite interesting: the highest-scoring cluster of restaurants (FiDi) neighbors the lower-scoring cluster (Chinatown). Also of particular interest is the gradient along 24th St., moving from Noe Valley (high scores) towards the Mission (lower scores). We plan on adding more data to correlate common health violations with scores in those areas. Have any ideas for a health data mashup? Send them our way!

The interactive map below allows you to visualize the data by scores and density. Check it out and see for yourself:

Each restaurant is geographically binned using the D3.js hexbin plugin. The color gradient of each hexagon reflects the median inspection score of the bin, and the radius of the hexagon is proportional to the number of restaurants that fall in the bin. Binning is first computed with a uniform hexagon radius over the map, and then the radius of each individual hexagon is adjusted for how many restaurants ended up in its bin.
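The two-pass idea (bin uniformly first, then size each bin by its count) can be sketched in a few lines. To keep the example short, this version uses a square grid as a stand-in for hexagons, so it is not what the D3.js hexbin plugin does internally; the points, `CELL_SIZE`, and `MAX_RADIUS` are assumptions made up for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical (x, y, score) points in arbitrary map units.
restaurants = [
    (0.2, 0.3, 100), (0.7, 0.1, 96), (0.4, 0.8, 90),  # fall in cell (0, 0)
    (1.5, 0.5, 68),                                    # cell (1, 0)
    (1.1, 1.9, 85), (1.8, 1.2, 75),                    # cell (1, 1)
]

CELL_SIZE = 1.0    # uniform bin size used for the first pass
MAX_RADIUS = 10.0  # drawn radius of the fullest bin, in pixels (assumed)

# Pass 1: assign every restaurant to a uniform bin.
bins = defaultdict(list)
for x, y, score in restaurants:
    cell = (int(x // CELL_SIZE), int(y // CELL_SIZE))
    bins[cell].append(score)

# Pass 2: summarize each bin and scale its radius by how full it is.
largest = max(len(scores) for scores in bins.values())
summary = {
    cell: {
        "median_score": statistics.median(scores),
        "radius": MAX_RADIUS * len(scores) / largest,
    }
    for cell, scores in bins.items()
}

for cell, stats in sorted(summary.items()):
    print(cell, stats)
```

The median drives the color and the count drives the size, so a single outlier in a crowded bin changes neither very much.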

Large blue hexagons represent many high-scoring restaurants in an area, and small red hexagons mean a few very poorly scoring restaurants. The controls on the map allow users to adjust the radius (Bin:) of the hexagon for computing the binning as well as the range (Score:) of scores to show/use on the map. The color of the Bin: slider represents the average color of the two Score: range sliders and its size represents the radius of the hexagons used to compute the binning. The colors of each of the Score: sliders represent the threshold color for that score, i.e. if the range is 40 - 100 the left slider’s color corresponds to a score of 40 and the right slider to a score of 100. The colors for every score in between are computed using a power scale gradient (with exponent 5).
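To get a feel for what an exponent of 5 does to the gradient, here is a rough Python sketch. It assumes the power is applied to the raw scores before interpolating linearly, which is how we understand D3’s pow scale to behave; the pure red and blue endpoints, the clamping, and the function names are our own choices for illustration rather than the map’s actual code.

```python
RED = (255, 0, 0)   # assumed color for the low end of the range
BLUE = (0, 0, 255)  # assumed color for the high end of the range


def pow_scale(score, lo, hi, exponent=5):
    """Map a score in [lo, hi] to t in [0, 1] after raising inputs to a power."""
    score = min(max(score, lo), hi)  # clamp to the slider range (our choice)
    return (score ** exponent - lo ** exponent) / (hi ** exponent - lo ** exponent)


def interpolate_rgb(start, end, t):
    """Linear interpolation between two RGB triples."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def score_color(score, lo=40, hi=100):
    """Color for a score given the current Score: range."""
    return interpolate_rgb(RED, BLUE, pow_scale(score, lo, hi))


for s in (40, 70, 90, 100):
    print(s, round(pow_scale(s, 40, 100), 3), score_color(s))
```

With this exponent the midpoint of the range is still mostly red, which spreads the color variation across the high scores where most restaurants sit.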


Somewhat recently, Yelp announced that it is partnering with Code for America and the City of San Francisco to develop LIVES, an open data standard that allows municipalities to publish restaurant inspection data in a standardized format. This is a step towards a much more transparent government, leading ultimately to a more engaged citizenry.

To understand what those opaque numbers in restaurant windows mean, I set out to use statistics and data science to better grasp the implications of the ratings.


The entire process has been documented in an IPython notebook here, and I hope anyone who is curious will run the code and review the analyses before they take the results at face value (because no one should trust a data scientist).

Some interesting results and insights I have found can be summed up by the plots below.

In order to learn more about the relative rating of each restaurant and find out just how good a 90 is, I simply plotted all the data in a histogram. It turns out (quite surprisingly) that the majority of restaurants score better than 94 and that 100 is the mode of the dataset. It is quite comforting to know that so many restaurants score so well, but it might make you think twice about eating at your favorite restaurant that happened to score a 90.

The right plot is a binning of the scores into the categories the city defined to give a more qualitative interpretation of the scores (‘Poor’, ‘Needs Improvement’, ‘Adequate’, and ‘Good’). The interesting thing to note about these quantizations of the scores is that the scale is very nonlinear: 0 -> 70, 71 -> 85, 86 -> 90, 91 -> 100.
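The category boundaries above translate directly into a small lookup. This is a sketch using the ranges and labels as listed in this post; the `categorize` helper is just an illustrative name.

```python
import bisect

# Inclusive upper bounds of the first three categories, as listed above.
UPPER_BOUNDS = [70, 85, 90]
LABELS = ["Poor", "Needs Improvement", "Adequate", "Good"]


def categorize(score):
    """Return the city's qualitative label for a 0-100 inspection score."""
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, score)]


# Show how uneven the buckets are: count the integer scores in each one.
lower = 0
for upper, label in zip(UPPER_BOUNDS + [100], LABELS):
    print(f"{label}: {lower}-{upper} ({upper - lower + 1} possible scores)")
    lower = upper + 1

print(categorize(90), "/", categorize(91))
```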

With such a skewed distribution and nonlinear scales, our old way of thinking often does not directly translate. To get a better grasp on the relative scores of restaurants compared to each other (and potentially other cities), I computed the quantiles for the distribution. This gives us a somewhat standardized ranking for comparing different scales and distributions in a normalized fashion. It is for this reason that summary statistics can be quite powerful tools for inference and are a standard tool in any statistician’s (or data scientist’s) tool belt.
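A percentile rank is one simple way to put scores from different scales on common footing. The sketch below uses two invented sets of scores (a 0-100 scale and a hypothetical 1-5 scale) and one of several possible definitions of percentile rank, the share of the population scoring strictly below a value.

```python
def percentile_rank(value, population):
    """Percent of the population scoring strictly below the given value."""
    below = sum(1 for v in population if v < value)
    return 100.0 * below / len(population)


# Made-up scores on two different scales, for illustration only.
city_a = [100, 100, 98, 96, 94, 92, 90, 88, 80, 60]  # 0-100 scale
city_b = [5, 5, 4, 4, 4, 3, 3, 2, 2, 1]              # hypothetical 1-5 scale

rank_a = percentile_rank(90, city_a)
rank_b = percentile_rank(4, city_b)

print(f"A 90 in city A beats {rank_a:.0f}% of restaurants")
print(f"A 4 in city B beats {rank_b:.0f}% of restaurants")
```

In this toy data a 90 out of 100 ranks lower within its city than a 4 out of 5 does in the other, which is the kind of comparison the raw numbers hide.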

Due to these very basic and easy to implement analyses, I am now a much more informed citizen and realize that scales in general can distort your perception. In school we come to internalize 70 as a passing score, anything better than 90 quite good, and 98-100 to be unheard of… for Berkeley Physics at least ;)


I hope this post showed you that you do not necessarily need to do very complex analyses to get interesting insights and that it inspires folks to get out there and start working with open data. The first step to breaking into data science is to start making, and to pick a project that you are passionate about (or always wanted to know the answer to). If you have any questions about restaurant health inspection data, the data science process, or our program and classes, please do not hesitate to reach out (or to just say hello!). Happy Data-ing!



A Practical Intro to Data Science


There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how you should position yourself to be a competitive applicant. There are far fewer resources out there about the steps to take in order to obtain the skills necessary to practice this elusive discipline. Here we will provide a collection of freely accessible materials and content to jumpstart your understanding of the theory and tools of Data Science.

At Zipfian Academy, we believe that everyone learns at different paces and in different ways. If you prefer a more structured and intentional learning environment, we run a 12 week immersive bootcamp training people to become data scientists through hands-on projects and real-world applications. We also host a six-part series of shorter form (1.5 hour) courses which provide a hands-on survey of each topic covered in this post.

We would love to hear your opinions on what qualities make great data scientists, what a data science curriculum should cover, and what skills are most valuable for data scientists to know. Share your thoughts over at Hacker News!

While the information contained in these resources is a great guide and reference, the best way to become a data scientist is to make, create, and share!


While the emerging field of data science is not tied to any specific tools, there are certain languages and frameworks that have become the bread and butter for those working in the field. We recommend Python as the programming language of choice for aspiring data scientists due to its general purpose applicability, a gentle (or firm) learning curve, and — perhaps the most compelling reason — the rich ecosystem of resources and libraries actively used by the scientific community.


When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs.


It is often said that a data scientist is someone who is better at software engineering than any statistician and better at statistics than any software engineer. As such, statistical inference underpins much of the theory behind data analysis, and a solid foundation of statistical methods and probability serves as a stepping stone into the world of data science.


While R is the de facto standard for performing statistical analysis, it has a steep learning curve and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, SciPy, matplotlib, and pandas.


Well written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:

Machine Learning/Algorithms

A solid base of Computer Science and algorithms is essential for an aspiring data scientist. Luckily there is a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.



Data ingestion and cleaning

One of the most under-appreciated aspects of data science is the cleaning and munging of data, which often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.



  • Predictive Analytics: Data Preparation: An introduction to the concepts and techniques of sampling data, accounting for erroneous values, and manipulating the data to transform it into acceptable formats.


  • OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids.

  • DataWrangler: Stanford research project that provides an interactive tool for data cleaning and transformation.

  • sed: “The ultimate stream editor” — used to process files with regular expressions often used for substitution.

  • awk: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information.
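Since we recommend Python throughout, here is a rough Python counterpart to the two jobs sed and awk are usually given: a regular-expression substitution (like a sed `s/pattern/replacement/` command) and pulling out columns (like awk’s `$1` and `$3`). The input text and the phone-number pattern are made up for illustration.

```python
import re

# Hypothetical messy input: inconsistent phone formats in a tab-separated file.
raw = (
    "name\tphone\tscore\n"
    "Cafe A\t(415) 555-0100\t98\n"
    "Diner B\t415.555.0199\t85\n"
)

# sed-like step: normalize phone numbers with a regex substitution.
phone = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")
cleaned = phone.sub(r"\1-\2-\3", raw)

# awk-like step: split each data row into columns and keep the first and third.
rows = [line.split("\t") for line in cleaned.splitlines()[1:]]
name_and_score = [(cols[0], int(cols[2])) for cols in rows]

print(cleaned)
print(name_and_score)
```

For one-off cleanups on the command line, sed and awk are often quicker; a script like this earns its keep when the cleaning has to be repeated or tested.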


The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and while it is one of the most qualitative aspects of data science, its methods and tools are well documented.





  • D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).

  • Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher level abstraction than D3 for creating canvas- or SVG-based graphics.

  • Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.

  • modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).

  • Chart.js: Very simple (only six charts) HTML5 canvas-based plotting library with beautiful styling and animation.

Computing at Scale

When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To combat the ever increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large scale batch processing since the release in 2006 of Apache Hadoop, the open-source MapReduce framework.
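The shape of a MapReduce job is easier to see in miniature. The sketch below runs the classic word count in a single Python process with explicit map, shuffle, and reduce steps; it is only an illustration of the programming model, not Hadoop’s API, and the function names and sample documents are our own.

```python
from collections import defaultdict
from functools import reduce

documents = ["data is everything", "everything is data", "store everything"]


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees one document and each reduce call sees one key, a framework can spread both phases across many machines without changing the logic.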



Putting it all together

Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for such individuals grows, universities and private companies are creating custom classes.





This just scratches the surface of the infinitely deep field of Data Science, and we encourage everyone to go out and try their hand at some science! We would love for you to join the conversation over at @zipfianacademy and let us know if you want to learn more about any of these topics.


  • Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.

  • Dataists: Hilary Mason and Vince Buffalo’s old blog that has a wealth of information and resources about the field and practice of data science.

  • Five Thirty Eight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.

  • grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.

  • Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources)

  • no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.


If you made it this far, you should check out our 12-week intensive program. Apply here!

Welcome to Zipfian

We enthusiastically welcome into the Zipfian community anyone and everyone who has a childlike curiosity about not just how the world works, but why it works!

Won’t you join us on our intellectual journey?

We are embarking on a mission to push the bounds of what is possible with data — exploring the core of what it really means to be a data scientist.

Zipfian Academy is an attempt to make the world a bit more transparent — by arming dedicated individuals with the skills necessary to make sense of the mountain of data before us. We dream of a world in which no claim gets accepted without evidence to support it, no information that could improve the state of our world is sequestered, and the tools and knowledge to ask the right questions are open and freely available.

It is in this spirit that we invite you to join the conversation. Tell us a story, give us your opinion — this is a place to be heard. We look forward to hearing from all of you!

— Jonathan & Ryan

P.S. Stay tuned to our feeds to hear about classes, meetups, and other upcoming events.