====== JYP's recommended steps for learning python ======

<note tip>If you don't know which python distribution to use and how to start the python interpreter, you should first read the [[starting|Working with Python]] page</note>

As can be expected, there is **a lot** of online python documentation available, and it's easy to get lost. You can always use Google to find an answer to your problem, and you will probably end up looking at lots of answers on [[http://stackoverflow.com/questions/tagged/python|Stack Overflow]] or a similar site. But it's always better to know where to find good documentation... and to spend some time reading it.

This page tries to list some //python for the scientist// related resources, in a suggested reading order. **Do not print anything** (or at least not everything), but it's a good idea to download all the //pdf// files in the same place, so that you can easily open and search the documents

===== JYP's introduction to python =====

==== Part 1 ====

You can start using python by reading the {{:other:python:python_intro_ipsl_oct2013_v2.pdf|Bien démarrer avec python}} tutorial that was used during a 2013 IPSL python class:
  * this tutorial is in French (my apologies for the lack of translation, but it should be easy to understand)
    * If you have too much trouble understanding this French Tutorial, you can read the first 6 chapters of the **Tutorial** in [[#the_official_python_documentation|the official Python documentation]] and chapters 1.2.1 to 1.2.5 in the [[#scientific_python_lectures|Scientific Python Lectures]]. Once you have read these, you can try to read the French tutorial again
  * it's an introduction to python (and programming) for the climate scientist: after reading this tutorial, you should be able to do most of the things you usually do in a shell script
    * python types, tests, loops, reading a text file
    * the tutorial is very detailed about string handling, because strings offer an easy way to practice working with indices (indexing and slicing), before indexing numpy arrays. And our usual pre/post-processing scripts often need to do a lot of string handling in order to generate the file/variable/experiment names
  * after reading this tutorial, you should practice with the following:
    * [[https://sharebox.lsce.ipsl.fr/index.php/s/S3EO8cLrhVDeQWA|Basic python training test (ipython notebook version)]]
    * {{:other:python:tp_intro_python_oct2013_no_solutions.pdf|Basic python training test (pdf version)}}
    * {{:other:python:tp_intro_python_oct2013_full.pdf|Basic python training test (pdf version, with answers)}}
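
To give an idea of the string handling mentioned above, here is a minimal sketch (not taken from the tutorial) that uses indexing, slicing and ''split''/''join'' to take a file name apart and build a new one. The file name and its fields are made up for the example.

```python
# Hypothetical file name, loosely following a CMIP-like naming pattern
file_name = 'tas_Amon_MYMODEL_historical_r1i1p1_185001-200512.nc'

# Indices start at 0, and negative indices count from the end
first_char = file_name[0]
extension = file_name[-3:]

# Split the name (without its extension) into its '_'-separated fields
fields = file_name[:-3].split('_')
variable, table, model, experiment, member, period = fields

# Slicing the period field gives the start and end years
start_year = int(period[:4])
end_year = int(period[7:11])

# Build a new file name for a (hypothetical) post-processed file
new_name = '_'.join([variable, model, experiment, 'clim']) + '.nc'
print(new_name)
```

The same indexing and slicing rules will apply later to numpy arrays, which is why it is worth practicing them on strings first.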

==== Part 2 ====

Once you have done your first steps, you should read {{:other:python:pythoncdat_jyp_1sur2_070227.pdf|Plus loin avec Python}} (start at page 39, the previous pages are an old version of what was covered in //Part 1// above)
  * this tutorial is in French (sorry again)
  * after reading this tutorial, you will be able to do more than you can do in a shell script, in an easier way
    * advanced string formatting
    * creating functions and using modules
    * working with file paths and handling files without calling external Linux programs\\ (e.g. using ''os.remove(file_name)'' instead of ''rm $file_name'')
    * using command-line options for scripts, or using configuration files
    * calling external programs
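
As an illustration of two of the topics listed above, the sketch below handles a file without calling external programs, and parses command-line options with the standard ''argparse'' module. It is only a sketch: the file name and the option names are assumptions made for the example, and the options are parsed from an explicit list so that the code can be run as is.

```python
import argparse
import os
import tempfile

# Work in a temporary directory so that the example does not touch real data
work_dir = tempfile.mkdtemp()

# Build a path in a portable way instead of concatenating strings by hand
file_path = os.path.join(work_dir, 'scratch_file.txt')  # hypothetical name

with open(file_path, 'w') as out_file:
    out_file.write('temporary content\n')

existed_before = os.path.exists(file_path)

# Equivalent of 'rm $file_name', without calling an external program
os.remove(file_path)
existed_after = os.path.exists(file_path)
os.rmdir(work_dir)

# Minimal command-line option handling (the option names are made up)
parser = argparse.ArgumentParser(description='Example script options')
parser.add_argument('--variable', default='tas', help='variable name')
parser.add_argument('--verbose', action='store_true', help='print more details')

# An explicit list is parsed here; a real script would call parser.parse_args()
options = parser.parse_args(['--variable', 'pr', '--verbose'])
print(options.variable, options.verbose)
```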

You can also look at the [[other:python:misc_by_jyp|Useful python stuff]] page

===== The official python documentation =====

You do not need to read all the python documentation at this stage, but it is really well made and you should at least have a look at it. The **Tutorial** is very good, and you should have a look at the table of contents of the **Python Standard Library**. There is a lot in the default library that can make your life easier

==== Python 2.7 ====

[[https://docs.python.org/2.7/|html]] - [[https://docs.python.org/2.7/download.html|pdf (in a zip file)]]

==== Python 3 ====

[[https://docs.python.org/3/|html]] - [[https://docs.python.org/3/download.html|pdf (in a zip file)]]


===== Scientific Python Lectures =====

Summary: //One document to learn numerics, science, and data with Python//

Note: this used to be called //Scipy Lecture Notes//

Where: [[https://lectures.scientific-python.org/_downloads/ScientificPythonLectures-simple.pdf|pdf]] - [[https://lectures.scientific-python.org/|html]]

This is **a really nice and useful document** that is regularly updated and used for the [[https://www.euroscipy.org/|EuroScipy]] tutorials.

This document will teach you lots of things about python, numpy and matplotlib, debugging and optimizing scripts, and about using python for statistics, image processing, machine learning, washing dishes (this is just to check if you have read this page), etc...
  * Example: the [[https://lectures.scientific-python.org/packages/statistics/index.html|Statistics in Python]] tutorial that combines [[other:python:jyp_steps#pandas|Pandas]], [[http://statsmodels.sourceforge.net/|Statsmodels]] and [[http://seaborn.pydata.org/|Seaborn]]


===== Numpy and Scipy =====

Summary: Python provides //ordered// objects (e.g. lists, strings, basic arrays, ...) and some math operators, but you can't do real heavy computation with these. **Numpy** makes it possible to work with multi-dimensional data arrays, and using array syntax and masks (instead of explicit nested loops and tests) and the appropriate numpy functions will allow you to get performance similar to what you would get with a compiled program! **Scipy** adds more scientific functions
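
The difference between explicit loops and array syntax can be sketched as follows. This is a small made-up example on a tiny synthetic array: both approaches give the same result, but the array version is the one that scales to big arrays.

```python
import numpy as np

# Small synthetic 2D array (a real field would be much larger)
data = np.arange(12, dtype=float).reshape((3, 4))

# Explicit nested loops and tests: works, but slow on big arrays
total_loop = 0.0
for j in range(data.shape[0]):
    for i in range(data.shape[1]):
        if data[j, i] > 5:
            total_loop += data[j, i]

# Array syntax: build a boolean mask and let numpy do the loops
mask = data > 5
total_array = data[mask].sum()

# Replace the values that do not match the test, without any loop
clipped = np.where(mask, data, 0.0)

print(total_loop, total_array)
```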

Where: [[http://docs.scipy.org/doc/|html and pdf documentation]]

==== Getting started ====

  - always remember that indices start at ''0'' and that the last element of an array is at index ''-1''!\\ First learn about //indexing// and //slicing// by manipulating strings, as shown in [[#part_1|Part 1]] above (try '''This document by JY is awesome!'[::-1]'' and '''This document by JY is awesome!'[slice(None, None, -1)]'') 8-)
  - if you are a **Matlab user** (but the references are interesting for others as well), you can read the following:
    - [[https://www.enthought.com/wp-content/uploads/2019/08/Enthought-MATLAB-to-Python-White-Paper-1.pdf|Migrating from MATLAB to Python]] on the [[https://www.enthought.com/software-development/|Enthought Software Development page]]
    - [[https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html|Numpy for Matlab users]]
    - [[http://mathesaurus.sourceforge.net/matlab-numpy.html|NumPy for MATLAB users]] (nice, but does not seem to be maintained any more)
  - read the really nice [[https://docs.scipy.org/doc/numpy/user/quickstart.html|numpy Quickstart tutorial]]
  - have a quick look at the full documentation to know where things are
    - Numpy User Guide
    - Numpy Reference Guide
    - Scipy Reference Guide
  - read [[https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises.ipynb|100 numpy exercises]]
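
The indexing and slicing rules from the first item of the list above can be tried on a string before moving on to numpy arrays. A minimal sketch, using only built-in python objects:

```python
text = 'This document by JY is awesome!'

first = text[0]            # indices start at 0
last = text[-1]            # -1 is the index of the last element
word = text[5:13]          # the end index is excluded

# The two expressions below are equivalent ways of reversing a sequence
reversed_1 = text[::-1]
reversed_2 = text[slice(None, None, -1)]

# A slice object can be stored and reused on several sequences
every_other = slice(None, None, 2)
print(text[every_other])
print(list(range(10))[every_other])
```

Storing a ''slice'' object in a variable is handy when the same selection has to be applied to several arrays.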

==== Beware of the array view side effects ====

<note warning>When you take a slice of an array, you get a **//View//** : an array that has a new shape but that still shares its data with the first array.

That is not a problem when you only read the values, but **if you change the values of the //View//, you change the values of the first array** (and vice-versa)! If that is not what you want, do not forget to **make a copy** of the data before working on it!

//Views// are a good thing most of the time, so only make a copy of your data when needed, because otherwise copying a big array will just be a waste of CPU and computer memory. Anyway, it is always better to understand what you are doing... :-P

Check the example below and the [[https://docs.scipy.org/doc/numpy-dev/user/quickstart.html#copies-and-views|copies and views]] part of the quickstart tutorial.

<code python>
>>> import numpy as np
>>> a = np.arange(30).reshape((3,10))
>>> a
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])

>>> b = a[1, :]
>>> b
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

>>> b[3:7] = 0
>>> b
array([10, 11, 12,  0,  0,  0,  0, 17, 18, 19])

>>> a
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12,  0,  0,  0,  0, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])
       
>>> a[:, 2:4] = -1
>>> a
array([[ 0,  1, -1, -1,  4,  5,  6,  7,  8,  9],
       [10, 11, -1, -1,  0,  0,  0, 17, 18, 19],
       [20, 21, -1, -1, 24, 25, 26, 27, 28, 29]])
       
>>> b
array([10, 11, -1, -1,  0,  0,  0, 17, 18, 19])

>>> c = a[1, :].copy()
>>> c
array([10, 11, -1, -1,  0,  0,  0, 17, 18, 19])

>>> c[:] = 9
>>> c
array([9, 9, 9, 9, 9, 9, 9, 9, 9, 9])

>>> b
array([10, 11, -1, -1,  0,  0,  0, 17, 18, 19])

>>> a
array([[ 0,  1, -1, -1,  4,  5,  6,  7,  8,  9],
       [10, 11, -1, -1,  0,  0,  0, 17, 18, 19],
       [20, 21, -1, -1, 24, 25, 26, 27, 28, 29]])
</code></note>
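
When in doubt, you can ask numpy whether two arrays share their data. The sketch below uses ''np.shares_memory'' on the same kind of arrays as in the warning above; the extra array ''d'' is an assumption added for the example, to show that indexing with a list of indices returns a copy rather than a view.

```python
import numpy as np

a = np.arange(30).reshape((3, 10))

b = a[1, :]          # basic slicing: a view on the data of a
c = a[1, :].copy()   # explicit copy: independent data
d = a[:, [0, 2]]     # indexing with a list of indices: numpy returns a copy

print(np.shares_memory(a, b))
print(np.shares_memory(a, c))
print(np.shares_memory(a, d))
```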

==== Extra numpy information ====

<WRAP center round tip 60%>
You can also check the [[other:python:misc_by_jyp#numpy_related_stuff|numpy section]] of the //Useful python stuff// page
</WRAP>


  * More information about **array indexing**:\\ <wrap em>Always check what you are doing on a simple test case, when you use advanced/fancy indexing!</wrap>
    * Examples:
      * {{ :other:python:indirect_indexing_2.py.txt |}}: Take a vertical slice in a 3D zyx array, along a varying y 'path'
    * [[https://numpy.org/doc/stable/user/basics.indexing.html|Array indexing basics (user guide)]] (//index arrays//, //boolean index arrays//, //np.newaxis//, //Ellipsis//, //variable numbers of indices//, ...)
      * [[https://numpy.org/doc/stable/user/basics.indexing.html#dealing-with-variable-numbers-of-indices-within-programs|Dealing with variable numbers of indices within programs]]
      * [[https://numpy.org/doc/stable/user/basics.indexing.html#field-access|Field access]]
        * [[https://numpy.org/doc/stable/user/basics.rec.html#indexing-and-assignment-to-structured-arrays|Indexing and assignment to structured arrays]]
    * [[https://numpy.org/doc/stable/reference/arrays.indexing.html|Indexing routines (reference manual)]]
    * [[https://numpy.org/doc/stable/user/quickstart.html#advanced-indexing-and-index-tricks|Advanced indexing and index tricks]] and [[https://numpy.org/doc/stable/user/quickstart.html#the-ix-function|the ix_() function]]
  * More information about arrays:
    * [[https://numpy.org/doc/stable/reference/routines.array-creation.html|Array creation routines]]
    * [[https://numpy.org/doc/stable/reference/routines.array-manipulation.html|Array manipulation routines]]
    * [[https://numpy.org/doc/stable/reference/routines.sort.html|Sorting, searching, and counting routines]]
    * [[https://numpy.org/doc/stable/reference/maskedarray.html|Masked arrays]]
      * [[https://numpy.org/doc/stable/reference/routines.ma.html|Masked array operations]]
  * [[https://numpy.org/doc/stable/user/misc.html#ieee-754-floating-point-special-values|Dealing with special numerical values]] (//Nan//, //inf//)
    * If you know that your data has missing values, it is cleaner and safer to handle them with [[https://numpy.org/doc/stable/reference/maskedarray.html|masked arrays]]!
    * If you know that some of your data //may// have masked values, play safe by explicitly using ''np.ma.some_function()'' rather than just ''np.some_function()''
      * More details in the [[https://github.com/numpy/numpy/issues/18675|Why/when does np.something remove the mask of a np.ma array ?]] discussion
    * [[https://numpy.org/doc/stable/user/misc.html#how-numpy-handles-numerical-exceptions|Handling numerical exceptions]]
    * [[https://numpy.org/doc/stable/reference/routines.err.html|Floating point error handling]]
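
The advice above about missing values can be sketched with a tiny made-up array. The ''-999.'' missing value marker is an assumption chosen for the example; the point is that the masked version ignores the missing value, while a naive computation on the raw data does not.

```python
import numpy as np

# Synthetic data where -999. is a (hypothetical) missing value marker
raw = np.array([1.0, 2.0, -999.0, 4.0])

# Mask the missing values instead of leaving them in the computation
masked = np.ma.masked_values(raw, -999.0)

# Explicitly use the np.ma version of the function
mean_valid = np.ma.mean(masked)
n_valid = masked.count()

# Fill the masked values only when a plain array is really needed
filled = masked.filled(np.nan)

print(mean_valid, n_valid)
print(raw.mean())  # naive mean, polluted by the missing value marker
```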

===== Using NetCDF files with Python =====


==== What is NetCDF? ====

  * If you are working with climate model output data, there is a good chance that your input array data will be stored in a NetCDF file!

  * Read the [[other:newppl:starting#netcdf_and_related_conventions|NetCDF and related Conventions]] for more information

  * There may be different ways of dealing with NetCDF files, depending on which [[other:python:starting#some_python_distributions|python distribution]] you have access to


==== CliMAF and C-ESM-EP ====

People using **//CMIPn// and model data on the IPSL servers** can easily search and process NetCDF files using:

  * the [[https://climaf.readthedocs.io/|Climate Model Assessment Framework (CliMAF)]] environment

  * and the [[https://github.com/jservonnat/C-ESM-EP/wiki|CliMAF Earth System Evaluation Platform (C-ESM-EP)]]


==== xarray ====

[[https://docs.xarray.dev/|xarray]] makes working with labelled multi-dimensional arrays in Python simple, efficient, and fun! [...] It is particularly tailored to working with netCDF files

=== Some xarray related resources ===

Note: more packages (than listed below) may be listed in the [[other:uvcdat:cdat_conda:cdat_8_2_1#extra_packages_list|Extra packages list]] page

  * [[https://docs.xarray.dev/en/stable/generated/xarray.tutorial.load_dataset.html|xarray test datasets]]

  * **[[https://xcdat.readthedocs.io/|xCDAT]]: ''xarray'' extended with Climate Data Analysis Tools**

  * [[https://xoa.readthedocs.io/en/latest/|xoa]]: xarray-based ocean analysis library

  * [[https://uxarray.readthedocs.io/|uxarray]]: provides xarray-styled functionality for unstructured grid datasets following [[https://ugrid-conventions.github.io/ugrid-conventions/|UGRID Conventions]]


==== netCDF4 ====

[[http://unidata.github.io/netcdf4-python/|netCDF4]] is a Python interface to the netCDF C library


==== cdms2 ====

<note important>
  * ''cdms2'' is unfortunately not maintained anymore and has been slowly **phased out in favor of a combination of [[#xarray|xarray]] and [[https://xcdat.readthedocs.io/|xCDAT]]**

  * ''cdms2'' is [[https://github.com/CDAT/cdms/issues/449|not compatible with numpy after numpy 1.23.5]] :-(
</note>

[[https://cdms.readthedocs.io/en/docstanya/|cdms2]] can read/write netCDF files (and read //grads// dat+ctl files) and provides a higher level interface than netCDF4. ''cdms2'' is available in the [[other:python:starting#cdat|CDAT distribution]], and can theoretically be installed independently of CDAT (e.g. it will be installed when you install [[https://cmor.llnl.gov/mydoc_cmor3_conda/|CMOR in conda]]). When you can use cdms2, you also have access to //cdtime//, which is very useful for handling time axis data.

How to get started:
  - read {{:other:python:pythoncdat_jyp_2sur2_070306.pdf|JYP's cdms tutorial}}, starting at page 54
    - the tutorial is in French (soooorry!)
    - you have to replace //cdms// with **cdms2**, and //MV// with **MV2** (sooorry about that, the tutorial was written when CDAT was based on //Numeric// instead of //numpy// to handle array data)
  - read the [[http://cdms.readthedocs.io/en/docstanya/index.html|official cdms documentation]] (link may change)

===== Matplotlib =====

<note important>
The full content of this //matplotlib// section has been moved to\\ [[other:python:matplotlib_by_jyp|Working with matplotlib (JYP version)]]\\ after becoming too big to manage here

\\ Note: [[other:python:maps_by_jyp|Plotting maps with matplotlib+cartopy]] (examples provided by JYP)
</note>

Summary: there are lots of python libraries that you can use for plotting, but Matplotlib has become a //de facto// standard

Where: [[http://matplotlib.org|Matplotlib web site]]

Help on //stack overflow//: [[https://stackoverflow.com/questions/tagged/matplotlib|matplotlib help]]

===== Graphics related resources =====

  * [[http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833|Ten Simple Rules for Better Figures]]
  * [[https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/|Top 50 matplotlib Visualizations]]
  * [[http://seaborn.pydata.org/|Seaborn]] is a library for making attractive and informative statistical graphics in Python, built on top of matplotlib
    * See also: [[https://www.datacamp.com/community/tutorials/seaborn-python-tutorial|Python Seaborn Tutorial For Beginners]]
  * Communicating/displaying/plotting your data (possibly for people not of your field):
    * [[https://uxknowledgebase.com/introduction-to-designing-data-visualizations-part-1-31c056556133|Introduction to Designing Data Visualizations — Part 1]]
    * [[https://uxknowledgebase.com/tables-other-charts-data-visualization-part-2-cfc582e4712c|Tables & Other Charts — Data Visualization Part 2]]
    * [[https://uxknowledgebase.com/tables-other-charts-data-visualization-part-3-5bfab15ce525|Tables & Other Charts — Data Visualization Part 3]]
  * **IPCC**-related //stuff//...
    * [[https://www.ipcc.ch/site/assets/uploads/2019/04/IPCC-visual-style-guide.pdf|IPCC Visual Style Guide for Authors]]
    * [[https://wg1.ipcc.ch/sites/default/files/documents/ipcc_visual-identity_guidelines.pdf|A new assessment cycle, A new visual identity]]
    * [[https://link.springer.com/article/10.1007/s10584-019-02537-z|Communication of IPCC visuals: IPCC authors’ views and assessments of visual complexity]]
    * [[https://www.carbonbrief.org/guest-post-the-perils-of-counter-intuitive-design-in-ipcc-graphics|The perils of counter-intuitive design in IPCC graphics]]
  * Working with **colors**
    * Choosing specific colors: use [[https://www.w3schools.com/colors/colors_names.asp|HTML color names]], the [[https://www.w3schools.com/colors/colors_picker.asp|HTML color picker]], etc...
    * **Do not use the outdated //rainbow// and //jet// colormaps!**
      * [[https://pjbartlein.github.io/datagraphics/index.html|The End of the Rainbow?  Color Schemes for Improved Data Graphics]] (Light and Bartlein, EOS 2004, including replies and comments)
      * [[http://colorspace.r-forge.r-project.org/articles/endrainbow.html|Somewhere over the Rainbow]]
      * [[https://www.nature.com/articles/s41467-020-19160-7|The misuse of colour in science communication]]
    * [[https://matplotlib.org/users/colormaps.html|Choosing colormaps]]
    * [[https://matplotlib.org/cmocean/|cmocean: Beautiful colormaps for oceanography]]
    * [[https://jiffyclub.github.io/palettable/|Palettable: Color palettes for Python]]
    * [[http://colorbrewer2.org|ColorBrewer 2.0]] is a tool that can help you understand, and experiment with //sequential//, //diverging// and //qualitative// colormaps
    * The [[http://hclwizard.org/|hclwizard]] provides tools for manipulating and assessing colors and palettes based on the underlying ''colorspace'' software
    * NCL (NCAR Command Language) [[https://www.ncl.ucar.edu/Document/Graphics/color_table_gallery.shtml|Color table Gallery]]
    * JYP's favorite title: [[https://www.researchgate.net/publication/220943662_The_Which_Blair_Project_A_Quick_Visual_Method_for_Evaluating_Perceptual_Color_Maps|The "Which Blair Project": A Quick Visual Method for Evaluating Perceptual Color Maps]]


===== Basemap =====

<note warning>Basemap is going to be slowly phased out, in favor of [[#cartopy_iris|cartopy]]\\ More information in this:
  * [[https://github.com/SciTools/cartopy/issues/920|cartopy github issue]]
  * [[https://github.com/matplotlib/basemap/issues/267|basemap github issue]]
</note>

Summary: //Basemap is an extension of Matplotlib that you can use for plotting maps, using different projections//

Where: [[http://matplotlib.org/basemap/|Basemap web site]]

Help on //stack overflow//: [[https://stackoverflow.com/questions/tagged/matplotlib-basemap|basemap help]]

How to use basemap?
  - look at the [[http://matplotlib.org/basemap/users/examples.html|examples]]
  - check the [[http://matplotlib.org/basemap/users/mapsetup.html|different projections]]
  - read some documentation!
    - the **really nice** [[http://basemaptutorial.readthedocs.io/en/latest/index.html|basemap tutorial]] seems much better than the official documentation below
    - look at the [[http://matplotlib.org/basemap/api/basemap_api.html#module-mpl_toolkits.basemap|detailed official documentation]]

===== Cartopy + Iris =====

Summary:
  * **Cartopy** is //a matplotlib-based Python package designed for geospatial data processing in order to produce maps and other geospatial data analyses//
  * **Iris** is //a powerful, format-agnostic, community-driven Python package for analysing and visualising Earth science data.//

Where: [[http://scitools.org.uk/cartopy/docs/latest/|Cartopy]] and [[https://scitools-iris.readthedocs.io/en/stable/|Iris]] web sites

Examples:
  * [[other:python:maps_by_jyp|Examples provided by JYP]]
  * Official gallery pages: [[https://scitools.org.uk/cartopy/docs/latest/gallery/index.html|Cartopy]] - [[https://scitools-iris.readthedocs.io/en/stable/generated/gallery/|Iris]]

Help on //stack overflow//: [[https://stackoverflow.com/questions/tagged/cartopy|Cartopy help]] - [[https://stackoverflow.com/questions/tagged/python-iris|Iris help]]

===== Maps and projections resources =====

==== About projections ====

  * [[https://egsc.usgs.gov/isb//pubs/MapProjections/projections.html|Map projections from USGS poster]]
  * [[https://pubs.usgs.gov/pp/1395/report.pdf|Map projections - A working manual (USGS)]]

==== Libraries ====

  * Projections in vcs
  * [[http://matplotlib.org/basemap/users/mapsetup.html|Projections in basemap]]
  * [[https://scitools.org.uk/cartopy/docs/latest/crs/projections.html|Projections in cartopy]]


===== 3D plots resources =====

  * [[https://ipyvolume.readthedocs.io/en/latest/|Ipyvolume]]
  * [[https://zulko.wordpress.com/2012/09/29/animate-your-3d-plots-with-pythons-matplotlib/|Animate your 3D plots with Python’s Matplotlib]]
  * [[https://stackoverflow.com/questions/26796997/how-to-get-vertical-z-axis-in-3d-surface-plot-of-matplotlib|How to get vertical Z axis in 3D surface plot of Matplotlib?]]

===== Data analysis =====

==== EDA (Exploratory Data Analysis) ? ====

<note tip>
The //EDA concept// seems to apply to **time series** (and tabular data), which is not exactly the case of full climate model output data</note>

  * [[https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/|What is Exploratory Data Analysis ?]]
    * //The method of studying and exploring record sets to apprehend their predominant traits, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking extra formal statistical analyses or modeling.//

  * [[https://medium.com/codex/automate-the-exploratory-data-analysis-eda-to-understand-the-data-faster-not-better-2ed6ff230eed|Automate the exploratory data analysis (EDA) to understand the data faster and easier]]: a nice comparison of some Python libraries listed below ([[#ydata_profiling|YData Profiling]], [[#d-tale|D-Tale]], [[#sweetviz|sweetviz]], [[#autoviz|AutoViz]])

  * [[https://www.geeksforgeeks.org/exploratory-data-analysis-in-python/|EDA in Python]]


==== Easy to use datasets ====

If you need standard datasets for testing, examples, demos, ...

  * [[https://docs.xarray.dev/en/stable/generated/xarray.tutorial.load_dataset.html|Tutorial datasets]] from [[#xarray|xarray]] (requires internet)
    * Example: [[https://docs.xarray.dev/en/stable/examples/visualization_gallery.html|Using the 'air temperature' dataset]]

  * [[https://scikit-learn.org/stable/datasets.html|Toy, real-world and generated datasets]] from [[#scikit-learn]]
    * Example: [[https://lectures.scientific-python.org/packages/scikit-learn/index.html#a-simple-example-the-iris-dataset|using the 'iris' dataset]]

  * [[https://scikit-image.org/docs/stable/api/skimage.data.html|Test images and datasets]] from [[#scikit-image]]
    * Example: [[https://lectures.scientific-python.org/packages/scikit-image/index.html#data-types|Using the 'camera' dataset]]

  * [[https://esgf-node.ipsl.upmc.fr/search/cmip6-ipsl/|CMIP6 data]] on ESGF
    * Example : ''orog_fx_IPSL-CM6A-LR_piControl_r1i1p1f1_gr.nc'':
      * [[http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/CMIP/IPSL/IPSL-CM6A-LR/piControl/r1i1p1f1/fx/orog/gr/v20200326/orog_fx_IPSL-CM6A-LR_piControl_r1i1p1f1_gr.nc|HTTP]] download link
      * [[http://vesg.ipsl.upmc.fr/thredds/dodsC/cmip6/CMIP/IPSL/IPSL-CM6A-LR/piControl/r1i1p1f1/fx/orog/gr/v20200326/orog_fx_IPSL-CM6A-LR_piControl_r1i1p1f1_gr.nc.dods|OpenDAP]] download link

  * [[https://github.com/xCDAT/xcdat/issues/277|xCDAT test data GH discussion]]


==== Pandas ====

Summary: //pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool//

Where: [[http://pandas.pydata.org|Pandas web site]]

JYP's comment: pandas is supposed to be quite good for loading, processing and plotting time series, without writing custom code. It is **very convenient for processing tables in xlsx files** (or csv, etc...). You should at least have a quick look at:

  * Some //Cheat Sheets//:
    - Basics: [[https://github.com/fralfaro/DS-Cheat-Sheets/blob/main/docs/files/pandas_cs.pdf|Pandas Basics Cheat Sheet]] (associated with the [[https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python#python-for-data-science-cheat-sheet:-pandas-basics-useth|Pandas basics]] //datacamp// introduction page)
    - Intermediate: [[https://github.com/pandas-dev/pandas/blob/main/doc/cheatsheet/Pandas_Cheat_Sheet.pdf|Data Wrangling with pandas Cheat Sheet]]
  * Some tutorials:
    * [[http://pandas.pydata.org/docs/user_guide/10min.html|10 minutes to pandas]]
    * The [[https://lectures.scientific-python.org/packages/statistics/index.html|Statistics in Python]] tutorial that combines Pandas, [[#statsmodels|statsmodels]] and [[http://seaborn.pydata.org/|Seaborn]]
    * More [[http://pandas.pydata.org/docs/getting_started/tutorials.html|Community tutorials]]...


==== statsmodels ====

[[https://www.statsmodels.org/|statsmodels]] is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Note: check the example in the [[https://lectures.scientific-python.org/packages/statistics/index.html|Statistics in Python]] tutorial


==== scikit-learn ====

[[http://scikit-learn.org/|scikit-learn]] is a Python library for machine learning, and is one of the most widely used tools for supervised and unsupervised machine learning. Scikit-learn provides an easy-to-use, consistent interface to a large collection of machine learning models, as well as tools for model evaluation and data preparation

Note: check the example in [[https://lectures.scientific-python.org/packages/scikit-learn/index.html|scikit-learn: machine learning in Python]]


==== scikit-image ====

[[https://scikit-image.org/|scikit-image]] is a collection of algorithms for image processing in Python

Note: check the example in [[https://lectures.scientific-python.org/packages/scikit-image/index.html|scikit-image: image processing]]


==== YData Profiling ====

[[https://docs.profiling.ydata.ai/|YData Profiling]]: a leading package for data profiling, that automates and standardizes the generation of detailed reports, complete with statistics and visualizations.


==== D-Tale ====

[[https://github.com/man-group/dtale|D-Tale]] brings you an easy way to view & analyze Pandas data structures. It integrates seamlessly with ipython notebooks & python/ipython terminals.


==== Sweetviz ====

[[https://github.com/fbdesignpro/sweetviz|Sweetviz]] is a pandas-based Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code.


==== AutoViz ====

[[https://github.com/AutoViML/AutoViz|AutoViz]]: the One-Line Automatic Data Visualization Library. Automatically Visualize any dataset, any size with a single line of code


===== Data file formats =====

  * We list below some resources about **non-NetCDF data formats** that can be useful

  * Check the [[#using_netcdf_files_with_python|Using NetCDF files with Python]] section otherwise

==== The shelve package ====

The [[https://docs.python.org/3/library/shelve.html|built-in shelve package]] can easily be used for storing data (python objects like lists, dictionaries, numpy arrays that are not too big, ...) on disk and retrieving them later

Use case:
  - Use a script to do the heavy data pre-processing and store the (intermediate) results in a file using ''shelve'', or update the results
  - Use another script for plotting the results stored with ''shelve''. This way you don't have to wait for the pre-processing step to finish each time you want to improve your plot(s)

Warning:
  * read the [[https://docs.python.org/3/library/shelve.html|documentation]] and the example carefully (it's quite small)
    * if you get the impression that the data is not saved correctly, re-read the parts about updating correctly the content of the shelve file
    * you should be able to store most python objects in a shelve file, but it is safer to make tests
  * do not forget to close the output file
  * if you are dealing with big arrays and want to avoid performance issues, you should use netCDF files for storing the intermediate results
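
The use case and the warnings above can be sketched as follows. This is only a sketch: the shelve file location, the keys and the values are made up for the example, and the three ''with'' blocks stand for what would normally be separate scripts.

```python
import os
import shelve
import tempfile

# Hypothetical location of the shelve file (a temporary directory here)
shelf_path = os.path.join(tempfile.mkdtemp(), 'intermediate_results')

# "Pre-processing script": store some results
# (the with statement closes the shelve file automatically)
with shelve.open(shelf_path) as shelf:
    shelf['experiment'] = 'my_experiment'          # made-up name
    shelf['yearly_means'] = [14.1, 14.3, 14.2]

# Updating a stored mutable object: read it, change it, store it again
# (changing shelf['yearly_means'] in place would not be saved by default)
with shelve.open(shelf_path) as shelf:
    means = shelf['yearly_means']
    means.append(14.6)
    shelf['yearly_means'] = means

# "Plotting script": read the results back
with shelve.open(shelf_path) as shelf:
    experiment = shelf['experiment']
    yearly_means = shelf['yearly_means']

print(experiment, yearly_means)
```
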
==== json files ====

More and more applications use //json files// as configuration files or as a means to use text files to exchange data (through serialization/deserialization).

//json// files look basically like a **list of (nested) python dictionaries** that would have been dumped to a text file

  * [[https://docs.python.org/3/library/json.html|json module]] documentation
  * [[https://realpython.com/python-json/|Working With JSON Data in Python]] tutorial
  * example script: ''/home/users/jypeter/CDAT/Progs/Devel/beaugendre/nc2json.py''
  * A compact (not easy to read...) //json// file can be pretty-printed with\\ ''cat file.json | python -m json.tool | less''
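
A minimal sketch of the serialization/deserialization round trip with the standard ''json'' module. The content of the configuration dictionary is made up for the example.

```python
import json

# Hypothetical configuration, stored as nested python dictionaries and lists
config = {
    'experiment': 'my_experiment',
    'variables': ['tas', 'pr'],
    'plot': {'colormap': 'viridis', 'levels': [0, 10, 20]},
}

# Serialization: python objects -> json text
compact_text = json.dumps(config)
pretty_text = json.dumps(config, indent=2, sort_keys=True)

# Deserialization: json text -> python objects
config_again = json.loads(compact_text)

print(pretty_text)
```

Use ''json.dump'' and ''json.load'' (without the final //s//) to write to and read from an open file instead of a string.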

==== LiPD files ====

Resources for //Linked PaleoData//:
  * [[http://linked.earth/projects/lipd/|LiPD]]
  * [[https://doi.org/10.5194/cp-12-1093-2016|Technical note: The Linked Paleo Data framework – a common tongue for paleoclimatology]] @ Climate of the Past
  * [[https://github.com/nickmckay/LiPD-utilities|LiPD-utilities]] @ github

==== BagIt files ====

//BagIt// is a set of hierarchical file layout conventions for storage and transfer of arbitrary digital content.

  * [[https://tools.ietf.org/html/draft-kunze-bagit-16|The BagIt File Packaging Format]]
  * [[https://github.com/LibraryOfCongress/bagger|Bagger]] (BagIt GUI)
  * [[https://github.com/LibraryOfCongress/bagit-python|bagit-python]]

==== Protocol Buffers ====

//Protocol Buffers are (Google's) language-neutral, platform-neutral extensible mechanisms for serializing structured data//

  * https://protobuf.dev/
  * [[https://protobuf.dev/getting-started/pythontutorial/|Protocol Buffer Basics: Python]]
    * ''mamba install protobuf''

===== Quick Reference and cheat sheets =====

  * The nice and convenient Python 2.7 Quick Reference: [[http://rgruet.free.fr/PQR27/PQR2.7_printing_a4.pdf|pdf]] - [[http://rgruet.free.fr/PQR27/PQR2.7.html|html]]
    * A possibly more [[http://iysik.com/PQR2.7/PQR2.7.html|up-to-date version]]

  * Python 3 [[https://perso.limsi.fr/pointal/python:abrege|Quick reference]] and [[https://perso.limsi.fr/pointal/python:memento|Cheat sheet]]

  * [[https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/pdf_bw/|Jupyter Notebook Keyboard Shortcuts]]

===== Miscellaneous Python stuff and tutorials =====

Check the page about [[other:python:misc_by_jyp|useful python stuff that has not been sorted yet]]

===== Some good coding tips =====

  * The official [[https://www.python.org/dev/peps/pep-0008/|Style Guide for Python Code]] (aka PEP 0008)

  * [[http://blog.codinghorror.com/a-pragmatic-quick-reference/|A Pragmatic Quick Reference]]

===== Debugging your code =====

There is only so much you can do by staring at your code in your favorite text editor, and by adding ''print'' lines to your code (or using [[https://docs.python.org/2/howto/logging.html#logging-basic-tutorial|logging]] instead of ''print''). The next step is to **use the python debugger**!
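
Before moving to the debugger, here is a minimal sketch of what replacing ''print'' with ''logging'' can look like. The ''mean'' function and the ''my_script'' logger name are hypothetical, and only serve the example:

<code python>
import logging

# Show all the messages with level DEBUG and above, with the level name
logging.basicConfig(level=logging.DEBUG, format='%(levelname)s: %(message)s')
log = logging.getLogger('my_script')  # hypothetical logger name

def mean(values):
    log.debug('mean() called with %d values', len(values))
    if not values:
        log.warning('Empty list, returning None')
        return None
    return sum(values) / float(len(values))

result = mean([1, 2, 3, 4])
log.info('result = %s', result)
</code>

Changing the level to ''logging.WARNING'' silences the debug and info messages without having to remove any line from the script.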

==== Debugging in text mode ====

  - Start the script with: ''python -m pdb my_script.py''
  - The debugger stops before the first line of the script and displays the ''(Pdb)'' prompt
  - Type ''continue'' (or **c**) to execute the script to the end, or until the first breakpoint or error is reached
  - Use ''where'' (or **w**) to check the call stack that led to the current stop. Use ''up'' and ''down'' to navigate through the call stack and examine the values of the functions' parameters
  - Type ''break NNN'' (or ''b NNN'') to stop at line NNN
  - Use ''p type(var)'' and ''p var'' to check the type and values of variables. You can also change the variables' values on the fly, e.g. with ''!var = new_value''
  - Type ''run'' (or ''restart'') to restart the script. Note that **r** is the shortcut for ''return'' (continue until the current function returns), not for ''run''
  - Use ''next'' and ''step'' to execute some parts of the script line by line. If a code line calls a function:
    * ''next'' (or **n**) will execute a function and stop on the next line
    * ''step'' (or **s**) will stop at the first line **inside the function**  
  - Check the [[https://docs.python.org/2/library/pdb.html#debugger-commands|debugger commands]] for details, or type ''help'' in the debugger for using the built-in help
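
A breakpoint can also be set from inside the script: calling ''pdb.set_trace()'' opens the debugger prompt at that line. The sketch below uses a hypothetical ''scale'' function. The ''set_trace()'' call is commented out, so that the script also runs non-interactively:

<code python>
import pdb  # needed only when the set_trace() line below is activated

def scale(values, factor):
    scaled = []
    for v in values:
        # Uncomment the next line to stop here and get the (Pdb) prompt,
        # then try 'p v', 'p scaled', 'where', 'next' and 'continue'
        # pdb.set_trace()
        scaled.append(v * factor)
    return scaled

scaled_values = scale([1, 2, 3], 10)
print(scaled_values)
</code>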

==== Using pydebug ====

Depending on the distribution, the editor and the programming environment you use, you may have access to a graphical version of the debugger. UV-CDAT users can use ''pydebug my_script.py''

===== jupyter and notebook stuff =====

FIXME Misc notes, resources and links to organize later

  * [[https://beta.jupyterbook.org/|jupyter {book}]]: Jupyter Book is an open source project for building beautiful, publication-quality books and documents from computational material.

===== Using a Python IDE =====

**IDE** = //Integrated Development Environment//

There are lots of ways to use Python and develop scripts, from using a lightweight approach (your favorite text editor with built-in python syntax highlighting, e.g. **emacs** and ''python -i myscript.py'') to a full-fledged IDE. You'll find below some IDE-related links

  * [[https://www.datacamp.com/community/tutorials/data-science-python-ide|Top 5 Python IDEs For Data Science]]
  * [[http://noeticforce.com/best-python-ide-for-programmers-windows-and-mac|Python IDE: The 10 Best IDEs for Python Programmers]]
  * [[https://www.techbeamers.com/best-python-ide-python-programming/|Get the Best Python IDE]]
  * [[https://wiki.python.org/moin/IntegratedDevelopmentEnvironments]]

==== Spyder ====

  * [[https://github.com/spyder-ide/spyder|Home page]]
  * [[http://pythonhosted.org/spyder/|Documentation]]


===== Improving the performance of your code =====

You can already get a very efficient script by checking the following:

  * **make sure that your script is not using too much memory** (the amount depends on the computer you are using)! Your script should be scalable (i.e. it should keep working even when your data gets bigger), so it's a good idea to load only the data you need in memory (e.g. not all the time steps), and to learn how to load chunks of data

  * **make sure that you are using array/vector syntax and masks**, instead of using explicit loops and tests. The [[#numpy_and_scipy|numpy documentation]] is big, because there are lots of optimized functions to help you! If you are stuck, ask JY or somebody else who is used to numpy.
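
As an illustration of array syntax and masks, here is a small sketch computing the mean of the valid values of an array in three ways. The temperatures and the ''-999.'' missing value are made-up example data:

<code python>
import numpy as np

# Made-up temperatures (degC), with -999. used as a missing value
temp = np.array([12.5, -999., 15.0, 18.2, -999., 9.8])

# Explicit loop and test: slow for big arrays
total, count = 0., 0
for t in temp:
    if t != -999.:
        total += t
        count += 1
mean_loop = total / count

# Array syntax: a boolean mask selects all the valid values at once
valid = temp != -999.
mean_mask = temp[valid].mean()

# Same thing with a masked array, which remembers the missing values
temp_ma = np.ma.masked_values(temp, -999.)
mean_ma = temp_ma.mean()

print(mean_loop, mean_mask, mean_ma)
</code>

The three results are the same, but the last two versions do not loop over the elements in python, which matters when the array holds millions of values.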

If your script is still not fast enough, there is a lot you can do to improve it, without resorting to parallelization (which may introduce extra bugs rather than extra performance). See the sections below

Hint: before optimizing your script, you should spend some time //profiling// it, in order to only spend time improving the slow parts of your script
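
A minimal profiling sketch, using the standard ''cProfile'' and ''pstats'' modules (Python 3 syntax). The ''loop_sum'', ''builtin_sum'' and ''main'' functions are hypothetical, and stand for the real work done by your script:

<code python>
import cProfile
import io
import pstats

def loop_sum(n):
    # Sum of squares with an explicit python loop
    total = 0
    for i in range(n):
        total += i * i
    return total

def builtin_sum(n):
    # Same result with a generator expression and the builtin sum()
    return sum(i * i for i in range(n))

def main():
    return loop_sum(200000), builtin_sum(200000)

# Record the time spent in each function while main() is running
profiler = cProfile.Profile()
profiler.enable()
results = main()
profiler.disable()

# Print the 5 entries with the biggest cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
</code>

A whole script can also be profiled without changing it, with ''python -m cProfile -s cumulative my_script.py''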

==== Useful packages ====

  * [[https://github.com/pydata/numexpr|Numexpr]]: //Numexpr is a **fast numerical expression evaluator for NumPy**. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python.//
  * [[http://www.pytables.org/|PyTables]]: //PyTables is a package for managing hierarchical datasets and designed to efficiently and **easily cope with extremely large amounts of data**//

==== Tutorials by Ian Ozsvald ====

  * [[http://ianozsvald.com/2011/07/25/|Tutorials from EuroScipy 2011]]
  * [[http://ianozsvald.com/2012/03/18/|Tutorials from PyCon 2012]]

===== Python 2.7 vs Python 3 =====

Python 2.7 reached its end of life on January 1st, 2020 and no longer receives bug or security fixes: **you should use Python 3**, unless some key modules you need are still not compatible with Python 3

You should start writing code that will, when possible, work both in Python 2 and Python 3

Some interesting reading:

  * [[https://docs.python.org/3/whatsnew/3.0.html|What’s New In Python 3.0]].\\ Examples:
    * ''print'' is now a function. Use ''print('Hello')''
    * You can no longer test for inequality with ''<>''! Use ''!=''

  * The official [[https://docs.python.org/2.7/howto/pyporting.html|Porting Python 2 Code to Python 3]] page gives the required information to make the transition from python 2 to python 3. 
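
A short sketch of code that gives the same results in Python 2.7 and in Python 3. The variable names are only there for the example:

<code python>
import sys

# print with parentheses and a single argument works in both versions
print('Running python %d.%d' % sys.version_info[:2])

# '//' is the integer (floor) division in both versions, and converting
# to float first gives the true division in both versions
floor_half = 1 // 2
half = 1 / float(2)

# '!=' works in both versions, '<>' is a syntax error in Python 3
different = (half != floor_half)

print('%s %s %s' % (half, floor_half, different))
</code>

Another option is to put ''from %%__future__%% import print_function, division'' on the first line of a Python 2.7 script, in order to get the Python 3 behavior of ''print'' and ''/''.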

===== What now? =====

You can do a lot more with python! But if you have read at least a part of this page, you should be able to find and use the modules you need. Make sure you do not reinvent the wheel! Use existing packages when possible, and make sure to report bugs or errors in the documentation when you find some


===== Out-of-date stuff =====


==== CDAT-related resources ====

Some links, in case they can't be found easily on the [[https://cdat.llnl.gov|CDAT]] web site...

  * [[https://cdat.llnl.gov/tutorials.html|Tutorials in ipython notebooks]]
  * [[http://cdat-vcs.readthedocs.io/en/latest/|VCS: Visualization Control System]]
    * [[https://github.com/CDAT/vcs/issues/238|Colormaps in vcs examples]]
  * [[https://github.com/CDAT/cdat-site/blob/master/eztemplate.md|EzTemplate Documentation]]


/* standard page footer */

\\ \\ \\ 
----
[ [[pmip3:|PMIP3 Wiki Home]] ] -
[ [[pmip3:wiki_help|Help!]] ] -
[ [[wiki:syntax|Wiki syntax]] ]