Which languages should a social scientist learn to be a certified hacker?
The point that hacking is great aside, there are a few languages the budding social scientist hacker might consider learning. Here are the most popular and powerful tools I would suggest. We will be learning how to use these tools on American Oil Can (check back weekly for the next part of the series).
Dealing with Data
R is a political science staple. “[R] has been kicking around since 1997 as a free alternative to pricey statistical software, such as Matlab or SAS,” as a FastCompany writer explains. “[Over] the past few years, it’s become the golden child of data science—now a household name not only among nerdy statisticians, but also Wall Street traders, biologists, and Silicon Valley developers. Companies as diverse as Google, Facebook, Bank of America, and the New York Times all use R, as its commercial utility continues to spread.” (FastCompany)
You will usually find that a political scientist prefers either R or STATA. I prefer R because it’s free and has a support community of around 2 million friendly nerds who will help you fix your code when you are in a bind. New R packages are coming out all the time, which means you will never be at a loss for access to the latest methods (Oh, you’re a Bayesian? That’s great!), and it plays very well with different operating systems.
STATA, on the other hand, comes with a yearly price tag of $125 ($235 if you want the full version): a steep investment for those already timid about learning to hack. It, too, is a great program, but for our purposes of teaching good coding practices and learning tidy logical syntax, STATA is not really an exemplary citizen.
Full disclosure: I recommend that ultimately folks learn both R and STATA so they can collaborate with all statistical program users. But for now, let’s stick with R!
In the past few years, Python has advanced by leaps and bounds in the data analysis communities. Resources that did not previously exist now enable Python to deal with data sets and analyses. Packages like pandas, which creates data frames like those in R, and NumPy, which allows you to perform linear algebra, are very useful to the data scientist. As a FastCompany author puts it: “if R is a neurotic, loveable geek, Python is its easygoing, flexible cousin. Python is rapidly gaining mainstream appeal as a hybrid of R’s fast, sophisticated data mining capability, and a more practical language to build products. Python is intuitive and easier to learn than R, and it’s ecosystem has grown dramatically in recent years, making it more capable of the statistical analysis previously reserved for R.” (FastCompany).
I highly recommend Python as a starter language because it is simple and oh so functional. It’s also fun: Sarah, my other half, picked up Python in a few days and wrote me a program. It asks “Who’s the coolest?” and calls me a dum-dum if I say anyone other than “Sarah.” :)
Dealing with Workflow
One language that I don’t see mentioned too often is Bash, one of the most widely used UNIX shells; I’ve noticed that a lot of tutorials will assume that you, the reader, understand what Bash is, why it’s important, and how it’s useful. Though we will get into what UNIX and UNIX shell mean later on, Bash is, in short, the non-GUI (Graphical User Interface: your desktop, mouse, icons, windows…anything point-and-click) window into what’s happening behind the scenes on your computer. When you right click to create a new folder and name it, you are essentially executing a Bash command (the
If using Bash is equivalent to using the mouse, then why use it? A few reasons. For one, it’s much slower to use a mouse than it is to type a direct command. Think about how long it takes to open Finder, navigate to your Documents folder, create a new folder, and then name that folder “Cool Cat Pics”. Just typing that was painful. In Bash, it’s one command:
mkdir ~/Documents/"Cool Cat Pics"
(On a Mac, that little squiggle, the tilde ~, represents your home folder. It has to sit outside the quotation marks, because Bash does not expand a quoted tilde.) So much more direct. So much faster. So much less #headdesk.
It’s also great for automation. Do you tire of repetitive tasks, such as versioning all of your old manuscript files by renaming to my_earth_shattering_paper_file_or_figure_v2? Do you find yourself downloading web pages for content analysis by hand, selecting which format you’d like, where to put it, what to name it, and oh wait, there’s some issue? Bash can do all of this for you with a trivial number of keystrokes. We will learn how to perform commands for common tasks like these later on in this series.
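As a taste of what that automation can look like, here is a minimal sketch of the renaming chore. Everything in it is made up for illustration: the scratch folder, the two file names, and the version number are assumptions, not files from this series.

```shell
#!/usr/bin/env bash
# Sketch: tack a version suffix onto every file in a folder.
# The folder and file names are hypothetical placeholders.

workdir=$(mktemp -d)                      # scratch folder, so nothing real is touched
touch "$workdir/paper.tex" "$workdir/figure.pdf"

version=2
for file in "$workdir"/*; do
    name=$(basename "$file")              # e.g. paper.tex
    stem="${name%.*}"                     # everything before the last dot
    ext="${name##*.}"                     # everything after the last dot
    mv "$file" "$workdir/${stem}_v${version}.${ext}"
done

ls "$workdir"                             # list the renamed files
```

The loop body is the whole trick: point `workdir` at a real folder and the same few lines rename two files or two hundred. Swapping `mv` for `cp` would keep the originals and make versioned copies instead.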
Yes, Bash is great for giving you back a few minutes of your time. But that’s not the main benefit. Bash is important to know not because of its usefulness as a language in itself (and it is useful), but because of the resources it allows you to access. Bash grants you access to language interpreters for Python and Ruby, website generators like Yeoman and Jekyll, and useful tools to convert batches of images between formats (image requirements for manuscripts, anyone?) or scrape text from PDFs with OCR or extraction (folks who want to do content analysis, look here!). Many of these tools are installed through package managers like Homebrew. When combined with Homebrew, Bash is like having a personal assistant on your computer, and it’s an important part of becoming a social science hacker.
Homebrew (and other package managers)
Homebrew is the hacker’s spice rack. When you set up Homebrew, you gain access to thousands of packages, each of which is as easy to install as it is to pull the chili powder out of the kitchen cabinet. Among these packages are tools like textract, which extracts text from PDFs, MP3s, Microsoft Word Documents, and other file types; interpreters for Python or Ruby, which are your door into a whole new world of computing possibility; and versioning utilities like git, which we will introduce in the next section. Homebrew makes it easy to install, update, and keep track of the packages we will use in the Essential Hacking series. We will chat more about how to install it later.
Homebrew – a general purpose package manager – has many virtual “brothers and sisters,” if you will. Other package managers, such as Pip, provide the same functionality as Homebrew but for specific programs or needs. Pip is a “Pythonic” (meaning “specific to Python”) package manager that will allow us to install data analysis utilities like pandas or NumPy. There are other package managers out there, but in this series, we will focus on Homebrew and Pip.
git is a version control utility run from the command line—and it is iconic. Every hacker who purports to write even semi-decent code uses git (or a similar version control system) to keep track of changes to their code. Why is this useful? Well, the best way to illustrate what a lifesaver git is is by explaining its name.
If someone British wants to call you an idiot, they might just call you a “git.” That’s right: “git” is British slang for idiot. Why oh why, you might wonder, is one of the most important pieces of software in the modern hacker’s toolbox named “idiot”? Because every person, no matter how smart or how good at hacking they are, can be an idiot and accidentally destroy their code. That’s exactly why Linus Torvalds, the Finnish-born creator of Linux, first wrote git: he realized how easy it was to derail an entire program by making a stupid mistake.
That stupid mistake is called a bug. You may have heard the term before. When a new iSomething, such as the Apple Watch (not that any of us can afford it…) or the new MacBook, is released by Apple, some people choose to wait until Apple has “worked out all of the bugs” before purchasing it. In other words, these people know that when a new piece of technology comes out there are sure to be some mistakes in the code that cause it to misbehave or malfunction. When enough of these bugs add up, or when these bugs are placed in crucial pieces of code, the device or program ends up breaking down completely. It sure is frustrating to the customer when his or her iDevice freaks out due to a bug in the code. But when you start to hack, you will learn how easily bugs slip into your code. And hopefully, you will realize what a marvel it is that Apple’s Development Team – hundreds of people working on the same code – is able to produce new products so fast with so few bugs!
And now, let us pause for a brief moment so President Obama can shower us with enough dollars to buy an Apple Watch. Thanks, Obama.
git helps us to avoid bugs by backing up our functional code in a “repo” (short for repository) and compartmentalizing new code in small offshoots called “branches.” Usually, when working in a team setting, one team member will be assigned one chunk of the project, and another member will be assigned another chunk of the project. Each member would check out their own branch. Then, when it’s time to bring the project together, git’s merging algorithms let the team weave those pieces together without a fuss. We will go into more detail about repos, branches and git’s other facets later. Specifically, we will be learning how to use GitHub. Suffice to say, however, git is useful when it comes to version control, especially when it comes to writing papers and analyzing data.
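To make the repo-and-branch idea a little more concrete, here is a rough sketch of that workflow using standard git commands. The file name, branch name, commit messages, and demo identity are all hypothetical, and it runs in a throwaway folder so it cannot touch any real project.

```shell
#!/usr/bin/env bash
set -e
# Identity used only for these demo commits (hypothetical name and email).
export GIT_AUTHOR_NAME="Demo Hacker" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)                          # throwaway folder for the demo repo
cd "$repo"
git init -q
trunk=$(git symbolic-ref --short HEAD)     # whatever the default branch is called

# Back up the working draft in the repo.
echo "Introduction" > paper.txt
git add paper.txt
git commit -q -m "Add working draft"

# Do the risky new work on its own branch.
git checkout -q -b methods-section
echo "Methods" >> paper.txt
git commit -q -am "Draft methods section"

# Once it looks good, weave the branch back into the main line.
git checkout -q "$trunk"
git merge -q methods-section

cat paper.txt
```

Until the final merge, the main line still holds the untouched draft, so a mistake on `methods-section` can be thrown away without losing anything.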
Dealing with Publication
There’s one more piece of the Essential Hacking for Political and Social Research canon we have not yet mentioned: LaTeX.
LaTeX – pronounced LAH-tekh, where the X is like the Greek letter chi – bills itself as “a high-quality typesetting system,” and that it is! To those who aren’t in the know, LaTeX (referred to colloquially as just tex) seems like a redundant holdover from the ’80s that is used by hipsters because they can’t deign to use Microsoft Word. To the uninitiated, using tex is just another hoop to jump through to submit a manuscript, or a cue to the reader that the author is writing Science. Before I became tex-proficient, I felt the same way. Oh, how things have changed.
Tex is my word processor of choice for every professional document I produce: my resume, cv, papers, memos, and letters are all typeset in tex. That is because I have experienced several drawbacks to using WYSIWYG (what you see is what you get) word processors like Microsoft Word, LibreOffice, or Apple Pages, and LaTeX dodges these shortcomings. When I use Microsoft Word,
- I find that I spend more time than necessary fiddling with fonts and spacing. How many extra spaces should be after my title? What font size should I set my headings and subheadings in?
- Figure and table labeling becomes a pain in the butt: Wait, which figure number are we on? We need to put another figure in here? Oh, no problem, let me just redo the entire list of figures.
- Citation management is unnecessarily difficult, and citations must be pulled into the standardized Word reference database, which does not work across PCs and Macs, does not work on the Office 365 cloud, which Columbia uses (it’s actually pretty nice), and is horrendously outdated.
- And my personal favorite: Crap, I typed another line and now all of my figures and paragraphs and new pages are in the wrong place and all of my page numbers are garbage.
These pitfalls are common when doing research write-ups, as you likely know. LaTeX, on the other hand, manages all of that for you and more:
- Titles, authorship, and headers are all sized and spaced professionally and automatically. All I have to do is indicate that I want a heading or subheading, etc., and tex takes care of the rest.
- Figures and tables are dynamically labeled.
- Pagination is taken care of for you. Need little roman numerals for front matter and regular numbers for the rest? No problem—no need to fuss with “section breaks” and “page breaks”. Table of contents, list of figures, and list of tables are each beautiful and dynamically generated.
- Tex manages citations, and hundreds of citation styles are available. Need APSA formatting? No problem with tex!
Tex saves me all of the time my perfectionist self would have spent fiddling with those things and demands that I sit down and type a coherent argument. In other words, tex gives me the freedom to focus on the content of what I’m trying to say instead of how it looks. As a result, I have become a much more productive and content-oriented writer. Oh, and a lot less of a frustrated one!
There are several other advantages to tex that we will dive into later: the version control capabilities; open-source formatting, so that everyone can access your work for free; file persistence, so that you will always be able to access your old files in exactly the way you made them, without any conversion errors; small file size (think less than 10 kilobytes), so your Dropbox doesn’t fill up so fast; the ability to generate vector graphics for perfect publication quality; a myriad of beautiful, open-source fonts with ligatures; the ability to directly integrate R code into the paper; the ability to generate both a PDF and a webpage from the same source. The list continues.
In addition to learning how to use all of these items, we will learn how to use LaTeX – the scientific standard for typesetting – here on American Oil Can. No judgement, no pretense. Just simple-to-follow posts that will help you understand what skills you need to learn and how you should use them!