This post contains prep instructions for this Friday’s methods workshop:
“How I Learned to Let Go and Love Python: An R Story” Friday, Sep 20, 2:30–4:00 PM. IAB 707 (LRR).
It’s not always easy to let go, but in this case the reward far outweighs the risk! The purpose of this session is to introduce the ways in which Python supplants or supplements R in political science research. We will configure Python and Jupyter Notebook on your system, review package management, and collaborate to rewrite a commonly used R script in Python.
brew install python python3.
choco install python python3 (run as administrator).
Use pip to install Jupyter Notebook:
pip3 install jupyter.
pip install jupyter (run as administrator).
To boot the notebook, simply run jupyter notebook (on PC it is sometimes jupyter notepad) from the command line.
The point that hacking is great aside, there are a few languages the budding social scientist hacker might consider learning. Here are the most popular and powerful tools I would suggest. We will be learning how to use these tools on American Oil Can (check back weekly for the next part of the series).
R is a political science staple. “[R] has been kicking around since 1997 as a free alternative to pricey statistical software, such as Matlab or SAS,” as a FastCompany writer explains. “[Over] the past few years, it’s become the golden child of data science—now a household name not only among nerdy statisticians, but also Wall Street traders, biologists, and Silicon Valley developers. Companies as diverse as Google, Facebook, Bank of America, and the New York Times all use R, as its commercial utility continues to spread.” (FastCompany)
You will usually find that a political scientist prefers either R or STATA. I prefer R because it’s free and has a support community of around 2 million friendly nerds who will help you fix your code when you are in a bind. New R packages are coming out all the time, which means you will never be at a loss for access to the latest methods (Oh, you’re a Bayesian? That’s great!), and it plays very well with different operating systems.
STATA, on the other hand, comes with a yearly price tag of $125 ($235 if you want the full version): a steep investment for those already timid about learning to hack. It, too, is a great program, but for our purposes of teaching good coding practices and learning tidy logical syntax, STATA is not really an exemplary citizen.
Full disclosure: I recommend that ultimately folks learn both R and STATA so they can collaborate with all statistical program users. But for now, let’s stick with R!
In the past few years, Python has made leaps and bounds in the data analysis communities. Resources that did not previously exist now enable Python to deal with data sets and analyses. Packages like PANDAS, which creates data frames like those in R, and NumPy, which allows you to perform linear algebra, are very useful to the data scientist. As a FastCompany author puts it: “if R is a neurotic, loveable geek, Python is its easygoing, flexible cousin. Python is rapidly gaining mainstream appeal as a hybrid of R’s fast, sophisticated data mining capability, and a more practical language to build products. Python is intuitive and easier to learn than R, and its ecosystem has grown dramatically in recent years, making it more capable of the statistical analysis previously reserved for R.” (FastCompany).
I highly recommend Python as a starter language because it is simple and oh so functional. It’s also fun: Sarah, my other half, picked up Python in a few days and wrote me a program. It asks “Who’s the coolest?” and calls me a dum-dum if I say anyone other than “Sarah.” :)
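Sarah’s actual program isn’t shown here, but a minimal sketch of something like it might look as follows. The function name coolest_check and the exact replies are my own placeholders, and the answer is passed in directly rather than typed at a prompt.

```python
def coolest_check(answer):
    """Return the program's reply to the question "Who's the coolest?"."""
    # Ignore surrounding spaces and capitalization so " sarah " still counts.
    if answer.strip().lower() == "sarah":
        return "Correct!"
    return "Wrong, dum-dum!"


# To make it interactive, swap the literal below for input("Who's the coolest? ").
print(coolest_check("Sarah"))
```

That is the whole program: one comparison and two possible replies.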
One language that I don’t see mentioned too often is Bash, also known as the UNIX shell; I’ve noticed that a lot of tutorials will assume that you, the reader, understand what Bash is, why it’s important, and how it’s useful. Though we will get into what UNIX and UNIX shell mean later on, Bash is, in short, the non-GUI (Graphical User Interface—your desktop, mouse, icons, windows…anything point-and-click) window into what’s happening behind the scenes on your computer. When you right click to create a new folder and name it, you are essentially executing a Bash command (the
If using Bash is equivalent to using the mouse, then why use it? A few reasons. For one, it’s much slower to use a mouse than it is to type a direct command. Think about how long it takes to open Finder, navigate to your Documents folder, create a new folder, and then name that folder “Cool Cat Pics”. Just typing that was painful. In Bash, it’s one command: mkdir ~/"Documents/Cool Cat Pics". (On a Mac, that little squiggle, the tilde ~, represents your home folder. It has to sit outside the quotation marks, because Bash does not expand a quoted tilde.) So much more direct. So much faster. So much less #headdesk.
It’s also great for automation. Do you tire of repetitive tasks, such as versioning all of your old manuscript files by renaming to my_earth_shattering_paper_file_or_figure_v2? Do you find yourself downloading web pages for content analysis by hand, selecting which format you’d like, where to put it, what to name it, and oh wait, there’s some issue? Bash can do all of this for you with a trivial number of keystrokes. We will learn how to perform commands for common tasks like these later on in this series.
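The Bash versions of these tasks come later in the series. In the meantime, here is a rough sketch of the versioning chore in Python instead (not Bash). It assumes file names end in a _v suffix followed by a number, as in the example above. The helper name next_version_name and the sample file names are hypothetical.

```python
import re
from pathlib import Path


def next_version_name(filename):
    """Return filename with its _vN suffix bumped, or _v2 added if it has none."""
    path = Path(filename)
    # Look for a trailing "_v<number>" in the name, ignoring the extension.
    match = re.fullmatch(r"(.*)_v(\d+)", path.stem)
    if match:
        stem = f"{match.group(1)}_v{int(match.group(2)) + 1}"
    else:
        # Assumption: an unversioned file counts as version 1.
        stem = f"{path.stem}_v2"
    return stem + path.suffix


print(next_version_name("my_earth_shattering_paper.tex"))
print(next_version_name("figure_v2.pdf"))
```

This only computes the new name. To actually rename a file, you could pass the result to pathlib’s Path.rename.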
Yes, Bash is great for giving you back a few minutes of your time. But that’s not the main benefit. Bash is important to know not because of its usefulness as a language in itself (and it is useful), but because of the resources it allows you to access. Bash grants you access to language interpreters for Python and Ruby, website compilers like Yeoman and Jekyll, and useful tools to convert batches of images between formats (image requirements for manuscripts, anyone?) or scrape text from PDFs with OCR or extraction (folks who want to do content analysis, look here!). Many of these tools are installed through package managers like Homebrew. When combined with Homebrew, Bash is like having a personal assistant on your computer, and it’s an important part of becoming a social science hacker.
Homebrew is the hacker’s spice rack. When you set up Homebrew, you gain access to thousands of packages, each of which is as easy to install as it is to pull the chili powder out of the kitchen cabinet. Among these packages are tools like textract, which extracts text from PDFs, MP3s, Microsoft Word Documents, and other file types; interpreters for Python or Ruby, which are your door into a whole new world of computing possibility; and versioning utilities like git, which we will introduce in the next section. Homebrew makes it easy to install, update, and keep track of the packages we will use in the Essential Hacking series. We will chat more about how to install it later.
Homebrew – a general purpose package manager – has many virtual “brothers and sisters,” if you will. Other package managers, such as Pip, provide the same functionality as Homebrew but for specific programs or needs. Pip is a “Pythonic” (meaning “specific to Python”) package manager that will allow us to install data analysis utilities like PANDAS or NumPy. There are other package managers out there, but in this series, we will focus on Homebrew and Pip.
git is a version control utility run from the command line, and it is iconic. Every hacker who purports to write even semi-decent code uses git (or a similar version control system) to keep track of changes to their code. Why is this useful? Well, the best way to illustrate what a lifesaver git can be is to explain its name.
If someone British wants to call you an idiot, they might just call you a “git.” That’s right: “git” is British for idiot. Why oh why, you might wonder, is one of the most important pieces of software in the modern hacker’s toolbox named “idiot”? Because every person, no matter how smart or how good at hacking they are, can be an idiot and accidentally destroy their code. That’s exactly why PERSON, a British hacker, first wrote git: he realized how easy it was to derail an entire program by making a stupid mistake.
That stupid mistake is called a bug. You may have heard the term before. When a new iSomething, such as the Apple Watch (not that any of us can afford it…) or the new MacBook, is released by Apple, some people choose to wait until Apple has “worked out all of the bugs” before purchasing it. In other words, these people know that when a new piece of technology comes out there are sure to be some mistakes in the code that cause it to misbehave or malfunction. When enough of these bugs add up, or when these bugs are placed in crucial pieces of code, the device or program ends up breaking down completely. It sure is frustrating to the customer when his or her iDevice freaks out due to a bug in the code. But when you start to hack, you will learn how easily bugs slip into your code. And hopefully, you will realize what a marvel it is that Apple’s Development Team – hundreds of people working on the same script – is able to produce new products so fast with so few bugs!
And now, let us pause for a brief moment so President Obama can shower us with enough dollars to buy an Apple Watch. Thanks, Obama.
git helps us to avoid bugs by backing up our functional code in a “repo” (short for repository) and compartmentalizing new code in small offshoots called “branches.” Usually, when working in a team setting, one team member will be assigned one chunk of the project, and another member will be assigned another chunk of the project. Each member checks out their own branch. Then, when it’s time to bring the project together, git’s merging algorithms let the team weave those pieces together without a fuss. We will go into more detail about repos, branches, and git’s other facets later. Specifically, we will be learning how to use GitHub. Suffice it to say, however, git is useful when it comes to version control, especially when it comes to writing papers and analyzing data.
There’s one more piece of the Essential Hacking for Political and Social Research canon we have not yet mentioned: LaTeX.
LaTeX – pronounced LAH-tekh, where the X is like the Greek letter chi – bills itself as “a high-quality typesetting system,” and that it is! To those who aren’t in the know, LaTeX (referred to colloquially as just tex) seems like a redundant holdover from the 80’s that is used by hipsters because they can’t deign to use Microsoft Word; using tex is for the uninitiated just another hoop they have to jump through to submit a manuscript or cue to the reader that he or she is writing Science. Before I became tex-proficient, I felt the same way. Oh, how things have changed.
Tex is my word processor of choice for every professional document I produce: my resume, CV, papers, memos, and letters are all typeset in tex. That is because I have experienced several drawbacks to using WYSIWYG (what you see is what you get) word processors like Microsoft Word, LibreOffice, or Mac Pages, and LaTeX dodges these shortcomings. When I use Microsoft Word,
These pitfalls are common when doing research write-ups, as you likely know. LaTeX, on the other hand, manages all of that for you and more:
Tex saves me all of the time my perfectionist self would have spent fiddling with those things and demands that I sit down and type a coherent argument. In other words, tex gives me the freedom to focus on the content of what I’m trying to say instead of how it looks. As a result, I have become a much more productive and content-oriented writer. Oh, and a lot less of a frustrated one!
There are several other advantages to tex that we will dive into later: the version control capabilities; open-source formatting, so that everyone can access your work for free; file persistence, so that you will always be able to access your old files in exactly the way you made them, without any conversion errors; small file size (think less than 10 kilobytes), so your Dropbox doesn’t fill up so fast; the ability to generate vector graphics for perfect publication quality; a myriad of beautiful, open-source fonts with ligatures; the ability to directly integrate R code into the paper; the ability to generate both a PDF and a webpage from the same source. The list continues.
In addition to learning how to use all of these items, we will learn how to use LaTeX – the scientific standard for typesetting – here on American Oil Can. No judgement, no pretense. Just simple-to-follow posts that will help you understand what skills you need to learn and how you should use them!
When I tell folks I study Political Science, they’re usually surprised to learn that it is not only a research-oriented profession but a necessarily mathematical one. No, we aren’t all talking heads in training, and yes, the graphs you see on CNN comparing Hillary Clinton’s ideological position to Ted Cruz’s rely on more than guesswork under the aegis of knowledge… as crazy as that may sound. Although the study of topics in social science cannot always be designed for causal inference or analyzed quantitatively (and that’s a beautiful thing), I’d argue that Political Science places an emphasis on numerical analysis that is not well understood even by some who propose to begin the study of it. Accordingly, my family is under the impression I’m taking a few gap years to study the Presidents.
It’s not hard to imagine that many people who enter into Political Science research – or social science research, for that matter – are similarly surprised, only discovering the “mathophilic” nature of their field after it’s too late to turn back. Gary King summed up the sentiment in his praise for Jeff Gill’s book, Essential Mathematics for Political and Social Research:
Did you choose the social sciences because you thought they had relatively little mathematical content? Surprise! You’re now in a bizarre situation, in which many of us once found ourselves […]
A bizarre situation indeed, one which, as of the book’s 2006 writing, was captured well by Dr. King. Today, in 2015, his words still ring true; a strong math background is a necessary prerequisite for the study of the social sciences. In the past decade, however, the number of prerequisites for their study has grown. Fledgling Political Scientists must know how to do more than math. They must also know how to hack.
Hacking used to be the art of gaining access to things that were off-limits through coding prowess, often against the wishes of an organization whose systems a hacker would break into. Breaking the rules meant that hackers had to be very good at writing code. Other people were good at coding, too, but being a “regular” programmer doesn’t really come with the enviable edgy coolness associated with doing things one is not supposed to be doing; being a hacker, rather than a coder or developer, comes with a certain tech cachet. It’s not hard to imagine that because of the prestige entailed in the term, regular programmers began using it to talk about their work. Hence, the term’s current bifurcated meaning. At least that’s one story.
In today’s tech circles, hacking simply means “writing code well,” as evidenced by the emergence of hackathons (where teams will compete to solve a problem for a sponsor organization) and programs like Code for America’s National Day of Civic Hacking (where coders “design processes to improve our communities and the governments that serve them”).
Is hacking malicious? The newer definition of hacking aside, it’s generally unclear whether hacking proper is a species of curiosity, ego, profit, or malice (among other things). The intent of hacking can only be determined on a case-by-case basis, and whether the act of hacking into something is good or bad often depends on one’s point of view. See, e.g., Stuxnet. For our intents and purposes, however, hacking is only malicious insofar as it can sometimes cause your brain to explode :). For the social scientist, hacking is a wonderful and very useful thing!
Hacking (computer programming; coding) has become a central component of social science research for several reasons. One is that hacking has allowed researchers to establish a greater standard of replicability for their projects. Take a quick look at the publication requirements for any of the top political science journals and you will see that publication in the journal is conditional on the submission of your replication file – usually composed in R, STATA, or Python – and its subsequent verification by an independent statistical expert. In fact, the American Journal of Political Science (AJPS) recently established publication standards that prevent the publication of a paper if an independent statistical expert is unable to replicate the results you present using your data and code. (As an aside, there is a relevant debate raging in the wake of a fudged political science study. The study was able to make it past the peer review process at Science, even though it was fraudulent.) If you want to get published in a top journal, you need to know how to hack.
Another reason hacking is essential to the practice of political science, or any science for that matter, is that it has scaled incredibly the ability of researchers to collect data and draw conclusions. Techniques like web scraping and services like Amazon’s Mechanical Turk allow people to enlist an army of either robots or single-tasked workers for very little cost (allowing those of us with more…ahem… modest budgets to get some decent research done). W-NOMINATE, a process which allows us to plot congresspeople in a policy space, is a staple of the political science research corpus. Its widespread use and utility would not have been possible without hacking. Even if you don’t view hacking as necessary, it is abundantly clear that it makes your “personal research lab” a whole lot more productive.
Here’s one more cool benefit you’ll get from learning to hack: once you write a script, you can re-use it again and again. In other words, once you put in the initial investment of time and energy (which, by the way, is fun and rewarding), that investment never loses its value. And that experience carries over—learning one programming language is very much like learning any other, and once you learn one, you’ve made learning the next much, much easier. Learning to hack isn’t like learning a new method that you use on one paper and then never again. Learning to hack buffs the learning curve for everything else in your favor.
The list of programming languages available to learn is endless, ranging from the ancient to the absurd. For example, see figure 1 for a slice of C, which first appeared in 1972, and see figure 2 for a slice of LOLCODE, which is a hilariously useless (but functional) language. Unless you’re trying to trick a graphics processor into doing sophisticated calculations for you (in the case of C) or just flat out trying to troll a journal (in the case of LOLCODE), you will not need to learn these languages. (Actually, I kind of want to troll a journal with a working LOLCODE replication script now.)
Most languages relevant to the burgeoning newbie (that’s us!) – the ones that help you do data collection, analysis, and visualization – are actually pretty intelligible. See, for example, the Python code in figure 3.
It’s simple to understand what’s going on in this script. We have a list of your friends, and the program prints each of your friends’ names along with the numerical position of that friend in the list. When processing data, we would write a similar script. In the simplest form of that script, we would have a list of data, and for each datum in that list, we would perform a function or a calculation. Not too bad, right?
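Figure 3 isn’t reproduced here, so as a stand-in, here is a minimal sketch of the kind of script described. The friend names and the helper name describe_friends are placeholders, and positions are counted from 0, which is Python’s default.

```python
# Placeholder list of friends; swap in your own names.
friends = ["Ada", "Grace", "Linus"]


def describe_friends(names):
    """Build one line of text per friend, noting that friend's position in the list."""
    lines = []
    # enumerate pairs each item with its position, starting at 0.
    for position, name in enumerate(names):
        lines.append(f"Friend number {position} is {name}")
    return lines


for line in describe_friends(friends):
    print(line)
```

Replace the list of names with a list of data points and the f-string with a calculation, and you have the skeleton of a simple data-processing script.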
Certainly, there are several benefits of hacking that I have not discussed here. Moreover, there are some costs to learning how to hack: time investment, among other inconsequentials like syntax frustration (when something isn’t working because you are missing a comma) or the development of code envy (when you look at someone’s code and get jealous because it is so good). The way hacking is usually taught, there is a steep learning curve, which turns people off from learning how to hack because they feel like they can’t make the investment. Or, it causes tears because people don’t feel like they can ever be good enough to “be one of those hacker guys.” Another fear might not be that you are scared of not being the best, but that you are scared of being bad at it: you’ve seen so many of your peers/colleagues spit out code like it’s a second language. It’s scary to think that you just won’t be able to do it.
These are reservations that I have felt, too. Here on American Oil Can, I’m here to help walk you through the first few steps you’ll take as a fledgling coder—no judgement, no pretense. Just simple-to-follow posts that will help you understand what skills you need to learn and how you should use them. Here, we will teach you the ins and outs of the Mac Stack for data processing and analysis: UNIX, the Terminal, Bash, Homebrew, Python, R, Quandl, web scraping, and much more. (And let’s be honest: get a Mac or install Linux.) Let’s get lubridating!
This is the traditional “Hello World” post, to test the functionality of Jekyll and my plugins. :)