That magical time of year has come again. Yes, the very fine folks over at Retrosheet have updated their game files once again, and the new batch includes (among other updates) the play-by-play files for every baseball game in 2014. There are a lot of things you can do with the Retrosheet files, but one of the most powerful options is to create a Retrosheet MySQL database. These tools don’t include any data, nor do they fetch it from the Internet. Before using them, you’ll need to get some baseball data. See “Get Historical Play-by-Play Data” Hack #14 for more information. A company called DiamondWare developed a set of three tools for Microsoft Windows for processing Retrosheet–style event files.
Okay, this post is going to be a long one, but by the end of it, you will havecreated yourself a Retrosheet database for MySQL or MariaDB. There is a postsimilar to this on Beyond the Boxscorecalled Saberizing a Mac #9: Retrosheet (part1);however, it is severely lacking details on how to get the database actuallyset up your Mac.
In order to parse the .EVN and .EVA event files from Retrosheet, we need toinstall the Chadwick tools.Before we can do that we need to install Homebrew which is a package managerfor OSX.
Download XCode for Mac
You can install XCode directly from the Mac App Store. We need thisto install the command-line tools.
Get the Command-Line Tools
Open up your terminal which is located in
/Applications/Utilities/
, and inthe terminal run the following command.
Install Homebrew
Now that we have the command-line tools, we can finally download and installHomebrew. Keep your terminal app open and run:
Once, it is installed run
brew doctor
and if you get the messageYour system is ready to brew
you are golden. If you do not, thispostis an excellent troubleshooting resource.
If you already have MySQL or MariaDB installed on your computer, you can skipthis section and move onto installing Chadwick.Once we have homebrew installed we can build MariaDB which is a fork of MySQLthat has better performance. Reasons for choosing MariaDB over MySQL can befound here.To install we run:
Once that is completed run:
You can make it so every time you log in to your computer the MariaDB/MySQL serverstarts up, but I prefer to start it manually any time I need it. To start itall you have to do is run in your terminal.
Once you start up the server, you logging in is very easy.
A couple weeks ago I was finally able to get Chadwick added to Homebrew. This makesinstalling the it so much easier on a Mac.
Chadwick is an open-source program that allows one to parse the event filesfrom Retrosheet. Suppose I have the event files from the 2014 season, which canbe downloaded directly from Retrosheet.I then unzip the files and navigate into the directory in which it was unzipped.
I am interested in the Braves home opener from 2014, which was against theNew York Mets. Using the chadwick tool
cwbox
, I am able to print out the boxscorefrom that game to my terminal. The steps below illustrate how this is accomplished.
Okay, we are finally ready to begin building the retrosheet database. First download this zip file which contains the sql fileswe will need and empty folders named unparsed and parsed. Unzip this fileinto your home directory. The unparsed folderis what we will download the Retrosheet event files into, and the parsed folderwill contain the csv’s after we parse the event files. Here is what the retrosheetfolder tree will look like.
How to install package files sims 3. Once again I recommend moving the unzipped folder after you download it into your home directory because the loader sql files tell the database server to look forthe data in ~/Retrosheet/parsed/. If you want this folder somewhere else you’llhave to edit the sql loader files.
Downloading the Data
We can actually download the data directly from Retrosheet using the terminal andwget. This really makes it easier than downloading them indiviudally. Follow the steps that are shown below to download the data zip files by decade, beginning withthe 1950’s. Once we download them, we will unzip the files as well.
If you open up the unparsed folder in your finder you will find a folder full of .EVA, .EVN, and .ROS files. These are the Retrosheet files that Chadwick will parse.
Parsing the Data
Since we now have all the data we need, we can finally start parsing the data.Using three for loops in your terminal will allow us to parse all the data from 1950to 2014. One loop parses the events, the games, and the subs into csv files thatcan be loaded into the database.
Creating and Populating the Retrosheet Database
Using the sql files that are included in the zip file, we can create and populatethe Retrosheet database. The first script, 01_schema.sql defines the Retrosheetdatabase schema. It creates an events_bck, events, games_bck, games, and subs tables.The events_bck and games_bck tables are what we will load the data into using 02_load_events.sql and 03_load_games.sql. We load the subs into the subs table using 04_load_subs.sql. To insert the data from events_bck and games_bck into the eventsand games tables we use the 05_partition.sql. In this file we create the “YEAR_ID”columns which the tables are patitioned by and copy the rows from the two loadertables into our final tables. Lastly, our lookup tables are stored in 06_lookup_codes.sql. Thisalso contains the players table which holds all the player ID’s from each player in the Retrosheetdata.
The structure for these tables is suggested by Colin Wyers from the Hardball Times,and all I did was make modifications that would make the files compatiable on a Mac.His orginial post is entitled Building a Retrosheet Database, the Short Form.
In the code below, you can see that our first step is to start the mysql server.The next step creates the Retrosheet schema, and the others populate the database.Notice the –local-infile=1; we have to add this because this featurehas been disabled by default because of security issues.Using that flag though allows it temporarily, and rest assured there is nothingin those sql scripts that should cause you any worry.
We do not have to specify a database because each of the loader sql scriptsinclude the following in the first line.
As a side note, if you created a password for your server (which is not covered herebecause this database is meant to remain local), you will have to use the -pflag.
At this point you should have a Retrosheet database on your Mac. If these steps did not work,you found any errors, want more elaboration, or have any other general questions please emailme at [email protected]. In future postsI will demonstrate some of the SQL queries you can perform using the Retrosheet data.
Posted with : MLB , Retrosheet , SQL
Since we are in the beginning of a new season, I thought it might be helpful to review some of the different data sources that are currently available for baseball. I will focus on the sources that I have found most useful in my own work.
Lahman Database
A great source of season-by-season baseball data is the Lahman database maintained by Sean Lahman. For my work, I download the files in csv format, although other data formats are available. Besides the popular Batting, Pitching, and Master datafiles, there are files on playoff games, Hall of Famers, Teams, and salaries. There is a Lahman R package that contains this data, although I am not sure it is updated for the 2016 season.
Retrosheet Data
Download Retrosheet Play By Play File In Mac Os
Retrosheet is a grassroots movement to “founded in 1989 for the purpose of computerizing play-by-play accounts of as many pre-1984 major league games as possible”. If you have been reading my blog, you’ll know that I regard this as one of the best sources of data broken down by the game or play level. Here is a post where I describe the process of downloading the Retrosheet play-by-play data into R.
PitchFX Data
Lego star wars mac download full version free. PitchFX is a tracking system that collects data about each pitch in baseball that has been available since 2006. The R package pitchFX by Carson Sievert allows one to scrape PitchFX data for particular days of interest. This is rich data allowing one to compare pitchers with respect to pitch speed, pitch type, breaks, location, and outcome. I have demonstrated the use of this data for a number of posts. I’ve tried this scraping recently and it seems to work fine.
Baseball-Reference
Sean Forman’s Baseball-Reference site is a “complete source for current and historical baseball players, teams, scores and leaders.” It can be a viewed as an easily viewable version of much of the Retrosheet data, but it has much more, such as win probability graphs for every game in baseball history. The format of this site has been recently updated. One nice feature from a data perspective is that one can “share” data in a number of different formats, such as excel or csv, which makes for easy import into R.
Download Retrosheet Play By Play File In Mac Free
FanGraphs
FanGraphs is a large site containing articles and a vast array of statistics for past and present baseball players and teams. If you search for 2016 Batting Leaders, for example, you’ll see a wide range of statistics divided into “Standard”, “Advanced”, “Batted Ball”, “Win Probability”, “Pitch Type”, “Pitch Value”, “Plate Discipline”, “Value”, “PitchFX”. There are many interesting measures described on the FanGraphs Glossary section. It appears that much of the data can be downloaded that facilitates easy import into R. I am working on a new book on “baseball graphs” and I plan on devoting a chapter on the all of the new batting and pitching measures illustrated in FanGraphs.
Baseball Savant
Quoting from their site, BaseballSavant “is a site dedicated to providing player matchups, Statcast metrics, and advanced statistics in a simple and easy-to-view way.” This site provides a window into the new Statcast data that collects information about the location and movement of the ball and every player on the field. It appears to be a work in progress so I would expect that different and new types of Statcast data would be available in the future.
R Baseball Packages
I have already mentioned several R packages, specifically Lahman that contains the Sean Lahman database, and pitchRX that allows one to easily scrape pitchFX data. https://performancebrown.weebly.com/blog/wow-legion-73-5-client-download-mac. I should also mention the openWAR package by Ben Baumer and Greg Matthews. This allows one to easily scrape play-by-play data for current games from the MLBAM GameDay files. The purpose of this package was to provide an “open” source for performing run value and WAR calculations for baseball plays. Also the newer baseballr package by Bill Petti was written to facilitate the downloading of baseball data from FanGraphs and BaseballSavant.
Any Other Good Sites?
The purpose of this post was to give a quick snapshot on the baseball data sources that I have found most useful in my work and have illustrated in the posts on this site. As we know, things are always changing fast. Let me know if you there are other good sources of data that you think I should mention in future posts.