Input file format¶
This section discusses how raw data should be organized so that tunacell is able to read it.
Two types of input are possible:
- plain-text format (full compatibility): data output from any segmentation software can be translated to plain text; this format is explained thoroughly below;
- SuperSegger output format (experimental): data is read directly from the output of the software (stored in a number of Matlab .mat files under a specific folder structure).
A given experiment is stored in a main folder¶
The name of the main folder is taken as the label of the experiment, i.e. as a unique name that identifies the experiment.
The scaffold to be used in the main folder is:
<experiment_label>/
containers/
descriptor.csv
metadata.yml
If you executed the tunasimu script (see the 10 minute tutorial), you can look in the newly created tmptunacell directory in your home directory: there should be a folder simutest storing data from the numerical simulations:
$ cd simutest
$ ls
and check that the structure matches the scaffold above.
There is a subfolder called containers where raw data files are stored, and two text files: descriptor.csv describes the column organization of the raw data text files (see Raw data description), while metadata.yml stores metadata about the experiment (see Metadata description).
Both files are needed for tunacell to run properly.
Data is stored in container files in the containers subfolder¶
Time-lapse data is stored in the containers folder. If you ran the 10 minute tutorial, you can check what this folder contains:
$ cd containers
$ ls
You should see a bunch of .txt files (exactly 100 such files if you stuck to default values for the simulation).
Each file in this containers folder collects the raw data of cells observed in one field of view of your experiment, as reported by your image analysis process.
Your experiment may consist of multiple fields of view (or even subsets thereof), and we call each of these files a container file. Within a given container file, cell identifiers are unambiguous: there cannot be two different cells with the same identifier.
A container file is a tab-separated values text file, and each column corresponds to a cell quantifier exported by the image analysis process. Each row represents one acquisition frame for a given cell. Rows are grouped by cell: if cell ‘1’ was imaged on 5 successive frames, there should be 5 successive rows in the container file reporting raw data about cell ‘1’.
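For illustration, a hypothetical container file with one custom quantifier, length, could look like this (tab-separated columns, one row per acquisition frame, rows grouped by cell; column names and values are purely illustrative):

```
time	length	cellID	parentID
0.0	2.13	1	0
3.0	2.47	1	0
3.0	1.10	2	1
6.0	1.36	2	1
```

Here cell 2 reports cell 1 as its parent, while cell 1 has no recorded parent (parentID 0 in this sketch).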
Raw data description¶
The column name and the type of data for each column are reported in the descriptor.csv file, a comma-separated values file in which each line is an entry of the form <column-name>,<column-type>.
The column name is arbitrary except for 3 mandatory quantifiers (see Mandatory raw data columns below). The column type must be given as a numpy datatype; the most commonly used datatypes are:
- f8: floating-point number coded on 8 bytes (this should be your default datatype for most quantifiers, except cell identifiers);
- i4: integer coded on 4 bytes;
- u2: usually refers to the Irish band. For our purpose it also means unsigned integer coded on 2 bytes (this is the default for cell identifiers; it counts cells up to 65535, which can be upgraded to u4, pushing the limit to 4294967295 cells; after that, let me know if you still haven't found what you're looking for).
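As a sketch of how these declarations translate into practice, the following snippet builds a numpy structured dtype from descriptor-style entries and uses it to read a tab-separated container file. The file contents are inlined here to keep the example self-contained; this is an illustration, not tunacell's internal loading code.

```python
import io
import numpy as np

# A minimal descriptor.csv, inlined for the demo (one <name>,<type> pair per line)
descriptor_csv = """time,f8
cellID,u2
parentID,u2
"""

# Parse the entries into a numpy structured dtype
dtype = np.dtype([tuple(line.split(","))
                  for line in descriptor_csv.strip().splitlines()])

# A tiny tab-separated container file: two frames of cell 1
container_txt = "0.0\t1\t0\n3.0\t1\t0\n"
data = np.genfromtxt(io.StringIO(container_txt), dtype=dtype, delimiter="\t")

print(data["cellID"])  # both rows belong to cell 1
```

Columns can then be accessed by name (data["time"], data["cellID"], ...), which is essentially what the descriptor file makes possible.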
Mandatory raw data columns¶
- cellID: the identifier of a given cell. In our example, cells are labeled numerically by integers, hence the type is u2 (numpy shorthand for an unsigned integer coded on 2 bytes);
- parentID: the identifier of the parent of a given cell. This is mandatory for tunacell to reconstruct lineages and colonies;
- time: the time at which the acquisition was made. Its type should be f8, i.e. a floating-point number coded on 8 bytes. The unit is left to the user's appreciation (minutes, hours, or even the frame acquisition number, though the latter is discouraged since physical processes are independent of the acquisition period).
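To see why parentID suffices to reconstruct lineages, here is a minimal sketch that rebuilds parent-to-children links from (cellID, parentID) pairs. This mimics the idea, not tunacell's actual API; the convention that parentID 0 marks a cell without a recorded parent is an assumption for this example.

```python
from collections import defaultdict

# (cellID, parentID) pairs; parentID 0 marks a cell with no recorded parent
cells = [(1, 0), (2, 1), (3, 1), (4, 2)]

# Invert the parent links: each parent maps to the list of its children
children = defaultdict(list)
for cell_id, parent_id in cells:
    if parent_id != 0:
        children[parent_id].append(cell_id)

print(dict(children))  # {1: [2, 3], 2: [4]}
```

Starting from any cell without a parent, walking this mapping recursively yields the whole colony tree.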
All other fields are left to the user’s discretion.
Example¶
In our simutest experiment, one could inspect descriptor.csv:
time,f8
ou,f8
ou_int,f8
exp_ou_int,f8
cellID,u2
parentID,u2
In addition to the mandatory fields listed above, one can find the following cryptic names: ou, ou_int, exp_ou_int. These are explained in Numerical simulations in tunacell.
Metadata description¶
YAML format¶
Experiment metadata is stored in the metadata.yml file, which is parsed using YAML syntax. The file may be split into documents (separated by a --- line). Each document is organized as a list of parameters (parsed as a dictionary). There must be at least one document where the entry level is set to experiment (or, synonymously, top). It holds the top-level experimental metadata (date of experiment, strain used, medium, etc.). A minimal example would be:
level: experiment
period: 3
which indicates that the acquisition time period is 3 minutes. A more complete metadata file could be:
level: experiment
period: 3
strain: E. coli
medium: M9 Glucose
temperature: 37
author: John
date : 2018-01-20
When the experiment has been designed such that metadata is heterogeneous, i.e. some fields of view get a different set of parameters that one later needs to distinguish, insert as many new documents as there are different types of fields of view. For example, assume our experiment is designed to compare the growth of two strains, and that fields of view 01 and 02 get one strain while field of view 03 gets the other. One way to encode this is:
level: experiment
period: 3
---
level:
- container_01
- container_02
strain: E. coli MG1655
---
level: container_03
strain: E. coli BW25113
A parameter given in a lower-level document overrides the same experiment-level parameter, which means that such metadata can be shortened:
level: experiment
period: 3
strain: E. coli MG1655
---
level: container_03
strain: E. coli BW25113
such that the strain is assumed to be E. coli MG1655 for all container files unless indicated otherwise, as is the case here for container_03, which gets the BW25113 strain.
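The override rule above can be sketched with plain dictionaries: start from the experiment-level entries and let the container-level document override matching keys. This illustrates the precedence described in the text, not tunacell's internal parsing.

```python
# Experiment-level and container-level metadata, mirroring the YAML example
experiment = {"period": 3, "strain": "E. coli MG1655"}
container_03 = {"strain": "E. coli BW25113"}

# Effective metadata for container_03: experiment defaults, then overrides
effective = {**experiment, **container_03}
print(effective)  # {'period': 3, 'strain': 'E. coli BW25113'}
```

Containers without their own document simply keep the experiment-level values.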
Tabular format (.csv)¶
Another option is to store metadata in a tabular file, such as comma-separated values. The header must contain at least level and period. The first row after the header is usually reserved for the experiment-level metadata, and the following rows may be populated for the different fields of view. For example, the csv file corresponding to the latter example reads:
level,period,strain
experiment,3,E. coli MG1655
container_03,,E. coli BW25113
Although more compact, it can be harder to read or fill in from a text editor.
Note
When a container is not listed, its metadata is read from the experiment metadata. Missing values in a container row are filled with experiment-level values.
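The fallback behaviour described in the note can be sketched with the standard csv module. The metadata table is inlined to keep the example self-contained, and the helper function effective is hypothetical, not part of tunacell's API.

```python
import csv
import io

# The tabular metadata from the example above, inlined for the demo
metadata_csv = """level,period,strain
experiment,3,E. coli MG1655
container_03,,E. coli BW25113
"""

rows = list(csv.DictReader(io.StringIO(metadata_csv)))
experiment = next(r for r in rows if r["level"] == "experiment")

def effective(level):
    """Metadata for a container, with blanks filled from the experiment row."""
    row = next((r for r in rows if r["level"] == level), {})
    return {k: (row.get(k) or experiment[k]) for k in experiment if k != "level"}

print(effective("container_03"))  # period falls back to the experiment value
print(effective("container_01"))  # unlisted container: all experiment values
```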
SuperSegger output¶
The SuperSegger output is stored in numerous subfolders under a main folder. The metadata file (see Metadata description) needs to be added under this main folder as well.
What to do next?¶
If you’d like to start analysing your dataset, your first task is to organize the data in the structure presented above. Once that is done, you can try to adapt the commands from the 10 minute tutorial to your dataset. When you want more control over your analysis, have a look at Setting up your analysis, which presents how to set up the analysis, in particular how to define the statistical ensemble and how to create subgroups for statistical analysis. Then you can refer to Plotting samples to customize your qualitative exploration of data, and dive into Statistics of the dynamics to start the quantitative analysis.