Tiffany Timbers bio photo

Tiffany Timbers

#datascientist #teacher #genomicist #neuroscientist #jackofalltrades #masterofafew

Email Twitter LinkedIn Github

Motivation

We finally finished collecting data from our high-dimensional behavioural analysis of 40 C. elegans wild-isolates from across the globe and this week I was ready to dig into the data - I am super excited to observe the diversity of behavioural profiles of 40 genetically distinct strains for whom we have whole genome sequences!

This is a much bigger dataset than I usually work with, and for this analysis I am not only interested in one particular behavioural aspect, I want to look at everything. Thus I found my usual strategy of importing all of my data directly into R did not work with this 20 GB dataset…

Do you need all your data in R at once? In my case, the answer is no, but I do want it all connected together somewhere so I can grab various pieces of for each part of my analysis. Thus I was motivated to create a database that I can query from R.

Software and installation

I decided to use SQLite from the command line because it is simple to install and use and I had been exposed to it previously at a Software Carpentry workshop. Here are the installation instructions for SQLite for each operating system, courtesy of Software Carpentry:

SQLite

Windows

The Software Carpentry Windows Installer installs SQLite for Windows. If you used the installer to configure nano, you don't need to run it again.

Mac OS X

SQLite comes pre-installed on Mac OS X.

Linux

SQLite comes pre-installed on Linux.

Creating a database in SQLite

To create a database in SQLite using the command line, open a Bash instance and navigate to the directory where you would like to have your database, then type:

sqlite3 mydatabase

You should then see the command line change to the following:

SQLite version 3.8.5 2014-05-29 12:36:14
Enter ".help" for usage hints.
sqlite>

This indicates that Sqlite is now running and you have created an empty database named mydatabase (hopefully you name yours something more useful than this example).

Putting a .csv file in your SQLite database

To add your .csv file as a table in the database you just created, your .csv must meet the requirement of being a tabular flat file. There are likely ways to input .csv files that don’t meet this requirement, but that is beyond the scope of this simple example.

You also need to know what delimiter is used in your .csv file. SQLite assumes it is a comma (,) but this is not always the case, and thus you can change that if necessary.

Before you add your .csv file to the database, it is best to create an empty table for your .csv file to go into. This is done using the CREATE TABLE function as demonstrated below:

sqlite> CREATE TABLE behaviour(   
   ...> strain CHAR(8) NOT NULL,
   ...> plate CHAR(15) NOT NULL,
   ...> speed REAL
   ...> );

Using CREATE TABLE you should specify the name of each column in your .csv file, as well as the type expected and whether or not NULL values are allowed for that column. The code above creates an empty table called behaviour, with columns strain, plate and speed. To see the tables that exist in the database we created, we can type:

sqlite> .table
behaviour

To enter our data from our .csv file into this empty table, we first must change the mode to csv:

sqlite> .mode csv

And then specify the delimiter (my file had \t spacing instead of ,’s):

sqlite> .separator '\t'

Finally we can add our data to the database using the import command and passing it the name of the file to import and the name of the table.

sqlite> .import behaviourdata.csv behaviour

And now the data is in a queryable SQL database. My next post will be how to access slices of such a database from R.