Basic Tutorial
==============

Here we provide an example of how to use WatCon for new users. In this case, we will use WatCon to analyze conserved water molecules within the PTP1B active site, compare water networks across different PTP1B structures, and quantify the conservation of these networks. For simplicity, this tutorial uses only static crystal structures. Further details on working with dynamic structures are available in the :doc:`User Guide <../user_guide>`.


Preparing Structures
--------------------

We first obtain a series of structures directly from the Protein Data Bank (PDB). In this case, we will obtain crystal structures of PTP1B only in the closed WPD-loop position. This amounts to 69 structures.


Note that the higher the resolution of a crystal structure, the more water molecules it is likely to contain. Ideally, therefore, we want crystal structures with the best available resolution for our water network analysis.
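
As a minimal sketch of how structures might be screened before analysis, the snippet below reads the resolution from the ``REMARK   2 RESOLUTION.`` record of a PDB file and counts crystallographic waters. The helper names ``parse_resolution`` and ``count_waters`` are hypothetical and are not part of WatCon. The sketch assumes standard fixed-column PDB records and that waters are named ``HOH``.

.. code-block:: python

    def parse_resolution(pdb_text):
        """Return the resolution in Angstroms from REMARK 2, or None if absent."""
        for line in pdb_text.splitlines():
            if line.startswith("REMARK   2 RESOLUTION."):
                fields = line.split()
                # Expected form: REMARK 2 RESOLUTION. 1.90 ANGSTROMS.
                try:
                    return float(fields[3])
                except (IndexError, ValueError):
                    return None  # e.g. "NOT APPLICABLE" for non-crystal structures
        return None


    def count_waters(pdb_text):
        """Count distinct water residues (HOH) among the coordinate records."""
        waters = set()
        for line in pdb_text.splitlines():
            if line.startswith(("ATOM", "HETATM")) and line[17:20] == "HOH":
                # Chain ID (column 22) plus residue number and insertion code (columns 23-27)
                waters.add((line[21], line[22:27]))
        return len(waters)


    # Hypothetical records for illustration; in practice, read the text of a PDB file
    example = "\n".join([
        "REMARK   2 RESOLUTION.    1.90 ANGSTROMS.",
        "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N",
        "HETATM  900  O   HOH A 401      10.000  10.000  10.000  1.00 30.00           O",
        "HETATM  901  O   HOH A 402      12.000  11.000   9.000  1.00 30.00           O",
    ])
    print(parse_resolution(example), count_waters(example))

Applying these helpers to every file in the **pdbs** directory gives a quick overview of which structures are likely to contribute the most water positions.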


1. Clean Raw PDBs
~~~~~~~~~~~~~~~~~

We first clean our PDBs using the AmberTools ``pdb4amber`` tool. Although not strictly necessary, this tool conveniently rewrites our PDBs in a Modeller-readable format for structural alignment. We can use a simple bash script to process these files. This script assumes that all PDB files are kept in a directory titled **pdbs**, and it moves all of the cleaned files to the **clean_pdbs** directory.

.. code-block:: bash

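    #!/bin/bash

    # -p avoids an error if the directory already exists
    mkdir -p clean_pdbs

    # Loop over the raw PDB files with a glob rather than parsing ls output
    for file in pdbs/*.pdb; do
        name=$(basename "$file" .pdb)
        pdb4amber -i "$file" -o "${name}.amber.pdb"
    done

    # Remove unused auxiliary files written by pdb4amber
    rm -f *sslink* *nonprot* *renum*

    mv *amber* clean_pdbs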


2. Create Fasta Files
~~~~~~~~~~~~~~~~~~~~~

To perform structural and sequence alignments, fasta files need to be obtained for all proteins. This can be done in any manner, but we recommend using the built-in functions in the :mod:`WatCon.sequence_processing` module to create fasta files directly from the **clean_pdbs** directory.


.. code-block:: python

    import os
    from WatCon.sequence_processing import pdb_to_fastas

    fasta_out = 'fasta'
    for file in os.listdir('clean_pdbs'):
        name = file.split('.')[0]
        pdb_to_fastas(os.path.join('clean_pdbs', file), fasta_out, name)

    # Concatenate all fastas into all_fastas.fa in the current directory,
    # without changing the working directory for later steps
    os.system(f'for f in {fasta_out}/*.fa; do (cat "$f"; echo) >> all_fastas.fa; done')
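
Before moving on, it can be useful to confirm that ``all_fastas.fa`` contains one record per structure. The following is a minimal sketch of such a check using only the standard library. The function ``read_fasta_records`` and the example headers are hypothetical, and the sketch assumes only that each record begins with a ``>`` header line.

.. code-block:: python

    def read_fasta_records(fasta_text):
        """Parse FASTA text into a list of (header, sequence) tuples."""
        records = []
        header, chunks = None, []
        for line in fasta_text.splitlines():
            line = line.strip()
            if not line:
                continue  # blank separator lines between records are ignored
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
        if header is not None:
            records.append((header, "".join(chunks)))
        return records


    # Hypothetical content; in practice use open('all_fastas.fa').read()
    example = ">1ABC\nMEMEKEF\nEQIDKS\n\n>2DEF\nMEMEKEFEQ\n"
    records = read_fasta_records(example)
    for header, sequence in records:
        print(header, len(sequence))

Comparing ``len(records)`` with the number of files in **clean_pdbs** is a quick way to catch structures that failed to convert.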


3. Align PDBs
~~~~~~~~~~~~~

Since WatCon partially relies on the Cartesian positions of water molecules, all structures of interest should be aligned before creating networks. There are cases in which this is not required; see the :doc:`User Guide <../user_guide>` for more details. To align the structures, we will use Modeller and save the results in the **aligned_pdbs** directory. This process saves PDB structures without waters (not useful!), so we will then apply the translation and rotation matrices calculated from our alignments to the entire system, including waters, and save those files to the **aligned_with_waters** directory.


.. code-block:: python

   from WatCon.sequence_processing import perform_structure_alignment, align_with_waters

   rotation_info = perform_structure_alignment('clean_pdbs')
   align_with_waters('clean_pdbs', rotation_info['Rot'], rotation_info['Trans'], out_dir='aligned_with_waters')


4. Create Multiple Sequence Alignment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A multiple sequence alignment can be generated either by inputting the ``all_fastas.fa`` file into the `CLUSTAL webserver <https://www.ebi.ac.uk/jdispatcher/msa/muscle?stype=protein>`_ (or another alignment webserver/method) and converting the output to PIR format, or simply by using the built-in alignment from WatCon.

.. code-block:: python

   from WatCon.sequence_processing import msa_with_modeller

   msa_with_modeller('alignment.txt', 'all_fastas.fa')


Run WatCon
----------

Now that we have prepared our structures, we can run WatCon. The easiest way to do this is with input files. An example input file is provided in the :doc:`Getting Started <../getting_started>` section, and further details are provided in the :doc:`User Guide <../user_guide>`.


WatCon can then be called on the command line:

.. code-block:: console

   $ python -m WatCon.WatCon --input input_file.txt --name PTP1B_closed


Depending on which analyses you choose to conduct, WatCon will create a series of directories, including **watcon_output**, **cluster_pdbs**, **msa_classifications**, and **pymol_projections**. The types of files contained in each directory are described below.

* **watcon_output**: If WatCon is run with input files, a **watcon_output** directory will be created containing .pkl files. Where requested, these files store WaterNetwork objects and calculated metrics, which can be loaded into a follow-up Python script (examples are provided in the other tutorials). To read the data associated with each .pkl file, proceed as follows:

.. code-block:: python
   
   import pickle

   with open('/path/to/file.pkl', 'rb') as FILE:
       data = pickle.load(FILE)

   network_metrics, networks, cluster_centers, pdb_names = data
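
When several .pkl files are present, they can be loaded in one pass. The snippet below is a minimal sketch of this. The helper ``load_pickles`` is hypothetical and not part of WatCon. Note that unpickling stored objects generally requires the classes that define them to be importable, so WatCon should be installed in the environment used to load the files.

.. code-block:: python

    import glob
    import os
    import pickle


    def load_pickles(directory):
        """Load every .pkl file in a directory, keyed by file name."""
        loaded = {}
        for path in sorted(glob.glob(os.path.join(directory, "*.pkl"))):
            with open(path, "rb") as handle:
                loaded[os.path.basename(path)] = pickle.load(handle)
        return loaded


    # Only runs if a watcon_output directory exists in the working directory
    if os.path.isdir("watcon_output"):
        for name, data in load_pickles("watcon_output").items():
            print(name, type(data).__name__)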

We provide several built-in post-analysis features which can be implemented without the user directly accessing these files. More details are provided in the next section.

* **cluster_pdbs**: WatCon can be used to cluster water positions across multiple structures. If this is done, the cluster centers will be saved in PDB format and can be visualized using your favorite molecular visualization software. We recommend visualizing the cluster centers together with a protein structure to more easily see their relative locations. Since the cluster centers are calculated with respect to the input aligned PDBs, they can be loaded alongside any topology file from this collection without fear of misalignment.

.. note::
   Cluster positions from independent WatCon analyses can be viewed together, but care must be taken in aligning the independent structures. Different ways to project cluster centers onto non-aligned structures are described in the :doc:`User Guide <../faq/combining_different_data>`.

* **msa_classifications**: If using the two-angle water position classification (explained further in the :doc:`User Guide <../faq/calculations>`), corresponding .csv files will be saved in the **msa_classifications** directory. These files contain the following header:

.. code-block:: text

   Frame Index/PDB ID,Resid,MSA_Resid,Index_1,Index_2,Protein_Atom,Classification,Protein_Coords,Water_Coords,Angle_1,Angle_2

The columns are defined as follows:

* Frame Index/PDB ID: Identifier for a particular structure or frame
* Resid: Residue number (for a given structure file)
* MSA_Resid: Common residue indexing based on multiple sequence alignment (MSA)
* Index_1: Atom index (0-based indexing) of interacting protein atom
* Index_2: Atom index (0-based indexing) of interacting water atom
* Protein_Atom: Name of interacting protein atom
* Classification: 'backbone' or 'side-chain'
* Protein_Coords: Coordinates of interacting protein atom
* Water_Coords: Coordinates of interacting water atom
* Angle_1: Calculated angle from protein atom -- water atom -- reference 1
* Angle_2: Calculated angle from protein atom -- water atom -- reference 2

.. note::
   Atom indexes use 0-based indexing to ensure consistency with MDAnalysis. However, most structure files use 1-based indexing for atom numbers.
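
Because ``MSA_Resid`` provides a common residue numbering, these files can be summarized across structures with the standard library alone. The following is a minimal sketch that counts water interactions per ``MSA_Resid`` and ``Classification``. The function ``count_classifications`` and the example rows are hypothetical; in particular, the way coordinates are written in real output files is an assumption here (quoted fields containing commas).

.. code-block:: python

    import csv
    import io
    from collections import Counter


    def count_classifications(csv_text):
        """Count rows per (MSA_Resid, Classification) pair."""
        counts = Counter()
        for row in csv.DictReader(io.StringIO(csv_text)):
            counts[(row["MSA_Resid"], row["Classification"])] += 1
        return counts


    # Hypothetical rows; coordinate fields are quoted because they contain commas
    example = '''Frame Index/PDB ID,Resid,MSA_Resid,Index_1,Index_2,Protein_Atom,Classification,Protein_Coords,Water_Coords,Angle_1,Angle_2
    1ABC,215,220,3301,4800,N,backbone,"[1.0, 2.0, 3.0]","[2.0, 3.0, 4.0]",110.5,95.2
    2DEF,214,220,3290,4750,N,backbone,"[1.1, 2.1, 3.1]","[2.1, 3.0, 4.2]",108.9,97.0
    2DEF,180,185,2801,4760,OD1,side-chain,"[5.0, 6.0, 7.0]","[6.0, 7.0, 8.0]",120.0,80.3
    '''
    counts = count_classifications(example)
    for (msa_resid, classification), n in sorted(counts.items()):
        print(msa_resid, classification, n)

Grouping by ``MSA_Resid`` rather than ``Resid`` keeps equivalent residues together even when the numbering differs between structure files.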

* **pymol_projections**: This directory will contain .pml (PyMOL) files with connection information to project onto protein structures. To read these files properly, first load the corresponding structure and trajectory frames of interest into PyMOL. Then, load the connection information by entering the following in the PyMOL console:

.. code-block:: console

   $ @path/to/pml/file

.. note::

   .. role:: python (code)
      :language: python 
 
   If using trajectories, be sure to load the trajectory frames into the structure **before** loading the connection information. To speed up loading, we recommend using the ``start`` and ``stop`` arguments of PyMOL's :python:`load_traj` command so that only the relevant frames are loaded into the structure. For example:

.. code-block:: console

   $ load /path/to/structure
   $ load_traj /path/to/trajectory, start=10, stop=20
   $ @/path/to/pml_files/15.pml


Run WatCon Post-Analysis
------------------------

Once WatCon has been run, a separate input file can be used for post-analysis. An example analysis input file is provided in the :doc:`Getting Started <../getting_started>` section. Depending on the specifications, post-analysis will produce a series of plots along with PDB and .pml files containing conservation information. Tips on calculating and visualizing conservation scores are given in the :doc:`User Guide <../faq/calculations>` section. Post-analysis can then be called on the command line:

.. code-block:: console

   $ python -m WatCon.WatCon --analysis analysis_input.txt

We hope that this tutorial serves as a sufficient introduction to the basics of a WatCon analysis. For more specific examples and directed guides, we recommend that users consult the remaining tutorials. Specific advice for effective WatCon usage is also outlined in the :doc:`User Guide <../user_guide>` section.