Proteomics Data Science: Online Data

Most journals that publish proteomics data require raw data deposition. In addition to allowing reviewers and community members to validate findings in publications, knowledge captured in proteomics data resources can enable evidence synthesis across studies, speed assay development, and improve downstream bioinformatics tools. In this session we will present a variety of proteomics data resources and how to use them. We will show examples of proteomics data and open source code reuse for both new bioinformatics data interpretations and to guide future proteomics experiments. We will also present methods to build online lab notebooks that illustrate exploratory data analysis for proteomics experiments and take researchers from raw data and data tables to publishable figures.

An Overview of the Flow Cytometry Research Group (FCRG) Studies – Kathleen Brundage:

The FCRG’s main focus is to provide the flow cytometry community with practical information on fluorescence-activated cell sorting that members can implement to better serve their users. Our initial studies focused on the effects of different cell sorters and sorting pressures on cell function and gene expression. Over the last few years, our group has focused on evaluating the sterility and cleanliness of cell sorters in shared resource laboratories (SRLs), as well as the effects fixation has on RNA isolation from sorted cells. Our current focus is developing best practices for single-cell sorting.

Multi-well, In-Incubator Imaging Platform for Biological Imaging – Victoria Ly:

Typical approaches to biological imaging consist of a tissue-growing environment (e.g., an incubator) where multiple samples are cultured simultaneously, and a shared central microscopy unit where images are generated one sample at a time. In most cases, the samples are manually relocated from the incubator to the central microscopy unit. This technique has one major advantage: the ability to use a single high-cost microscope. However, there are several disadvantages. First, moving the biological samples outside of the incubator can contaminate them. Second, the specific growing conditions (e.g., temperature, humidity, CO2 concentration) cannot be maintained during the imaging period without additional equipment. Third, this approach cannot monitor transient phenomena (e.g., via continuous monitoring of development).

We propose a multi-well, in-incubator imaging platform that eliminates these shortcomings and can be used for bright-field and fluorescence microscopy. The system is mostly 3-D printed, with a cost of under $100 per imaging unit (e.g., tissue-growing well). The system is Wi-Fi enabled, allowing remote control (e.g., of imaging frequency, focus adjustment, lighting, and fluorescence imaging) without removing it from the incubator. Images are stored in the cloud, allowing off-site analysis. The imaging system is designed for hardware scalability and accounts for the larger volumes of data output, storage, and processing that scaling brings.
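The scheduled, in-incubator capture workflow described above can be sketched as a simple timed loop. The function names, file-naming scheme, and parameters below are hypothetical illustrations, not the platform's actual API; a real unit would replace the placeholder body with hardware-specific capture and cloud-upload calls:

```python
import time
from datetime import datetime, timezone

def capture_image(well_id):
    """Trigger one capture for a well (hypothetical stand-in for hardware/cloud calls)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    filename = f"well{well_id}_{timestamp}.png"
    # Hardware-specific capture and cloud upload would happen here.
    return filename

def imaging_loop(well_ids, interval_s, n_rounds):
    """Image each well once per round, waiting interval_s seconds between rounds."""
    captured = []
    for _ in range(n_rounds):
        for well in well_ids:
            captured.append(capture_image(well))
        time.sleep(interval_s)
    return captured
```

Because imaging happens inside the incubator, the interval can be set short enough to follow transient phenomena without disturbing culture conditions.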

The Panorama Data Repository for Skyline Users – Vagisha Sharma:

Panorama is an open-source, web-based data management system that was designed and developed for Skyline, a software tool for targeted mass spectrometry-based experiments. Panorama facilitates viewing, sharing, and disseminating the targeted, quantitative results contained in Skyline documents. Panorama can be installed locally, or laboratories and organizations can sign up for fully featured workspaces on the PanoramaWeb server (https://panoramaweb.org) hosted at the University of Washington. Workspaces on PanoramaWeb can be organized as needed by the owners and configured with fine-grained access controls to enable collaborative projects. To allow unlimited file storage, Panorama projects can be set up to use cloud-backed storage such as Amazon Simple Storage Service (S3).
In addition to storing and sharing Skyline results, Panorama together with Skyline is used for fully automated, longitudinal monitoring of LC-MS/MS system suitability. This is done with the Panorama AutoQC pipeline, which automatically imports system suitability runs into a Skyline document as they are acquired. The document is uploaded to a Panorama server, and several identification-free metrics, such as peak area and retention time, can be viewed as Levey-Jennings plots in a web browser to track normal variation and quickly detect anomalies.
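The Levey-Jennings logic behind such monitoring can be illustrated with a short sketch: control limits are the baseline mean plus or minus k standard deviations of a metric (e.g., peak area) across accepted system-suitability runs, and later runs falling outside those limits are flagged. The function names and values below are illustrative, not Panorama's implementation:

```python
import statistics

def levey_jennings_limits(baseline, k=3.0):
    """Control limits (lower, upper) = mean +/- k*SD over baseline QC runs."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return mean - k * sd, mean + k * sd

def flag_runs(new_values, baseline, k=3.0):
    """Indices of new runs whose metric falls outside the control limits."""
    lo, hi = levey_jennings_limits(baseline, k)
    return [i for i, v in enumerate(new_values) if not (lo <= v <= hi)]

# Hypothetical peak areas (arbitrary units) for one monitored peptide
baseline = [1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97]
new_runs = [1.01, 0.55, 1.02]   # the middle run shows a sudden signal drop
print(flag_runs(new_runs, baseline))  # -> [1]
```

Plotting each run against the fixed limits, rather than testing runs in isolation, is what lets slow drift and abrupt failures both stand out.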
Skyline documents and raw data on PanoramaWeb that are associated with research manuscripts can be submitted to the Panorama Public repository (https://panoramaweb.org/public.url), which is hosted on PanoramaWeb and is a member of the ProteomeXchange Consortium (http://www.proteomexchange.org/). Data on Panorama Public can be explored with a variety of graphs and annotated chromatographic peak views, making it easy to evaluate the quantitative results contained in the associated manuscripts. Access to data in the repository is managed as required, e.g., private access for reviewers during the manuscript review process and public access upon publication.

GitHub: A Powerful Resource for Scientific Communication – Phillip Wilmarth:

GitHub is an online version control service built on top of the Git version control software. Originally developed for software development teams, GitHub has evolved into a resource with broad communication and resource-sharing potential. File hosting with built-in version control comes standard. Many users may not realize the ease of web hosting GitHub provides via Markdown, or its ability to render and display Jupyter notebooks.

Proteomics data analysis involves many steps that can be difficult to communicate in traditional forms of scientific communication: sample preparation, chromatography separations, instrument settings, and lengthy informatics pipelines. Quantitative experiments also need additional statistical analyses. Increasingly, scientific projects involve more than just proteomics experiments, and methods sections of publications are getting harder to create and understand. Describing all of the pieces is daunting, let alone communicating how they were used together to address the scientific questions.

I will show how GitHub’s combination of features can present proteomics data analyses in effective and transparent ways. The https://github.com/pwilmart/Sea_lion_urine_SpC repository will be used as an example, where publicly available data from PRIDE (PXD009019) for sea lion urine samples were re-analyzed. The repository hosts results files, fully describes the experimental and analysis details in its README.md file, and contains Jupyter notebooks for quality control and statistical analyses with R. Notebooks offer unique ways to describe and share R (or Python) scripting, rich data visualizations, and full analysis narratives.
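Spectral-count (SpC) quantification like that in the example repository typically includes a normalization step before proteins are compared across samples. As one illustration (not necessarily the normalization used in the repository), the normalized spectral abundance factor (NSAF) divides each protein's counts by its length, then rescales so the values sum to one within a sample:

```python
def nsaf(spectral_counts, lengths):
    """Normalized Spectral Abundance Factor: (SpC/L) / sum over all proteins of (SpC/L).

    spectral_counts and lengths are dicts keyed by protein accession;
    lengths are in amino acid residues.
    """
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

# Toy example: three proteins with counts proportional to their lengths
counts = {"P1": 40, "P2": 10, "P3": 50}
lengths = {"P1": 400, "P2": 100, "P3": 500}
print(nsaf(counts, lengths))
```

Length correction matters because longer proteins yield more observable peptides, and hence more spectral counts, at equal abundance; in the toy data above, all three proteins end up with equal NSAF values.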

A useful part of this example will be how to find protein databases for non-model organisms, such as the sea lion, and how to make sense of the unannotated identified proteins. I will discuss some ortholog mapping and annotation tools hosted on GitHub to facilitate data interpretation and follow-up work.