Useful python stuff

This page collects useful, but unsorted, Python tips and tricks that do not fit in any section of the main page, JYP's recommended steps for learning python

Extra tutorials

Read these only once you have gone through the content of this page several times and are looking for new ideas

Calculating distance between two geo-locations in Python:
- haversine, haversine_distances @ scikit-learn and Haversine formula
Looking at table data with pandas
Stats stuff
- Python Outlier Detection: IQR Method and Z-score Implementation
Clean Code in Python: Good vs. Bad Practices Examples
PEP 8 – Style Guide for Python Code
- How to Write Beautiful Python Code With PEP 8
- PEP-8 Tutorial: Code Standards in Python
- Some checkers/linters: ruff, flake8
Ultimate Python Cheat Sheet: Practical Python For Everyday Tasks
16 Hacks That Will Take Your Python Skills to the Next Level
Modular Coding in Python: Finally Solve your Import Errors (understanding and fixing ModuleNotFoundError and ImportError)
Understanding Multithreading and Multiprocessing in Python

Reading/setting environment variables

>>> import os
>>> os.environ['TMPDIR']
'/data/jypmce/climafcache'
>>> os.environ.get('SCRATCHDIR', '/data/jypmce/some_scratch_stuff')
'/data/jypmce/some_scratch_stuff'
>>> os.environ['temporary_env_var_for_THIS_script'] = 'some value'
>>> os.environ['temporary_env_var_for_THIS_script']
'some value'

Generating (aka raising) an error

This will stop the script, unless the error is raised inside a function and the code calling the function explicitly catches and deals with the error

raise RuntimeError('\n\nOMG! An error! :-(\nAborting script...')

Errors and Exceptions tutorial
Built-in Exceptions reference
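
The sketch below shows how calling code can catch an error instead of letting it stop the script. The `read_positive` function and its messages are hypothetical, and only serve the illustration.

```python
def read_positive(value):
    """Return value as a float, or raise ValueError if it is not positive (hypothetical helper)."""
    number = float(value)  # float() itself raises ValueError on bad input
    if number <= 0:
        raise ValueError(f'Expected a positive number, got {number}')
    return number

results = []
for raw in ['3.5', '-1', 'abc']:
    try:
        results.append(read_positive(raw))
    except ValueError as err:
        # The error is caught here, so the script keeps running
        print('Skipping', repr(raw), '->', err)
        results.append(None)

print(results)
```

Catching a specific exception class (here `ValueError`) rather than a bare `except:` lets unexpected errors still stop the script.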

Using log files (aka logging)

It is always possible to display information messages with the print() function, but logging tools are more convenient when a script has to report a lot of information about its progress (message levels, timestamps, output to a file)

Loguru is a library which aims to bring enjoyable logging in Python
- See also A Complete Guide to Logging in Python with Loguru
More on logging with python
The default (but not easy to use) Python logging module
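
A minimal sketch using only the standard logging module. The logger name `my_script` and the message texts are assumptions made for the example.

```python
import logging

# Hypothetical logger name; a common choice is one logger per script or module
log = logging.getLogger('my_script')
log.setLevel(logging.DEBUG)

# StreamHandler writes to stderr; logging.FileHandler('run.log') would write to a file
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)-8s %(message)s'))
handler.setLevel(logging.INFO)  # DEBUG messages are filtered out by this handler
log.addHandler(handler)

log.debug('Not displayed: below the handler level')
log.info('Processing started')
log.warning('Something looks odd, but we carry on')
```

Changing the level of the handler is enough to display more or fewer messages, without touching the rest of the script.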

Stopping a script

A user can use CTRL-C or kill to stop a script, or CTRL-Z to suspend it temporarily (use fg to resume a suspended script). The code below can be used by the script itself to interrupt its execution, instead of raising an error

import sys
sys.exit('Some optional message about why we are stopping')
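
sys.exit works by raising the SystemExit exception. The sketch below catches it only to make this visible; a real script would normally let it propagate. The `stop_if_missing` function is hypothetical.

```python
import sys

def stop_if_missing(path_exists):
    """Hypothetical check: stop the script when a required input is missing."""
    if not path_exists:
        # A string argument is printed to stderr, and the exit status is 1
        sys.exit('Input file not found, stopping')
    return 'continuing'

# Catching SystemExit here only serves the demonstration
try:
    stop_if_missing(False)
except SystemExit as stop:
    caught = stop.code
    print('Script would have stopped with:', caught)
```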

Checking if a file/directory is writable by the current user

>>> os.access('/', os.W_OK)
False
>>> os.access('/home/jypmce/.bashrc', os.W_OK)
True

Playing with strings

String formatting

Knowing how to display/print a string correctly is always useful for information and debugging purposes
There are lots of different ways to display strings

String formatting examples

You will find below some examples of quick printing, as well as using old style formatting, formatted string literals (f-strings) and the String format() Method. More details in the next section

>>> # Basic (but quick and efficient) printing
 
>>> year = 1984
>>> print(year)
1984
>>> print('[', year, 'is a famous book ]')
[ 1984 is a famous book ]
 
>>> # Old style formatting
 
>>> print('[ %i is a famous book ]' % (year,))
[ 1984 is a famous book ]
>>> print('[ %10i is a famous book ]' % (year,))
[       1984 is a famous book ]
>>> print('[ %-10i is a famous book ]' % (year,))
[ 1984       is a famous book ]
>>> print('[ %010i is a famous book ]' % (year,))
[ 0000001984 is a famous book ]
 
>>> # Formatted string literals (f-strings)
 
>>> print(f'[ {year} is a famous book ]') 
[ 1984 is a famous book ]
>>> print(f'[ {year=} is a famous book ]')
[ year=1984 is a famous book ]
>>> print(f'[ {year:10} is a famous book ]')
[       1984 is a famous book ]
>>> print(f'[ {year:<10} is a famous book ]')
[ 1984       is a famous book ]
>>> print(f'[ {year:010} is a famous book ]')
[ 0000001984 is a famous book ]
>>> print(f'[ {year:10.2f} is a famous book (yes, {year}!) ]')
[    1984.00 is a famous book (yes, 1984!) ]
 
>>> # The String format() Method
 
>>> print('[ {} is a famous book ]'.format(year))
[ 1984 is a famous book ]
>>> print('[ {:10} is a famous book ]'.format(year))
[       1984 is a famous book ]
>>> print('[ {:<10} is a famous book ]'.format(year))
[ 1984       is a famous book ]
>>> print('[ {:010} is a famous book ]'.format(year))
[ 0000001984 is a famous book ]
>>> print('[ {:10.2f} is a famous book  (yes, {}!) ]'.format(year, year))
[    1984.00 is a famous book  (yes, 1984!) ]
>>> print('[ {title:10.2f} is a famous book  (yes, {title}!) ]'.format(title=year))
[    1984.00 is a famous book  (yes, 1984!) ]
>>> print('[ {title:10.2e} is a famous book ]'.format(title=year))
[   1.98e+03 is a famous book ]

String formatting references

Formatted String Literals (f-strings)
- Available in Python >= 3.6
- More documentation
- Format Specification Mini-Language
  - See also the PyFormat site

The String format() Method
- Format Specification Mini-Language
  - See also the PyFormat site

PyFormat site: string formatting using the old style and the String format() method
- Hint: this can also be used as an easy documentation for f-strings format!

Old string formatting

Splitting (complex) strings

It's easy to split a string with multiple blank delimiters, or a specific delimiter, but it can be harder to deal with sub-strings

>>> str_with_blanks = 'one    two\t3\t\tFOUR'
>>> str_with_blanks.split()
['one', 'two', '3', 'FOUR']

>>> str_with_simple_delimiters = '1,2,3.14,  4'
>>> str_with_simple_delimiters.split(',')
['1', '2', '3.14', '  4']

>>> complex_string='-o 1 --long "A string with accented chars: é è à ç"'
>>> complex_string.split()
['-o', '1', '--long', '"A', 'string', 'with', 'accented', 'chars:', 'é', 'è', 'à', 'ç"']

>>> import shlex
>>> shlex.split(complex_string)
['-o', '1', '--long', 'A string with accented chars: é è à ç']

Working with paths and filenames

If you are in a hurry, you can just use string functions to work with paths and file names.

You will need some specific objects and functions to check if a file exists, and for similar operations. Check the libraries listed below, which automatically deal with Unix-type paths on Linux and MacOS computers, and Windows-type paths on Windows computers

os.path: common pathname manipulations
- Available since… a long time! Use this if you want to avoid backward compatibility problems
- Some functions are directly in os Miscellaneous operating system interfaces
  e.g. os.remove and os.rmdir
pathlib: a more recent object-oriented way to deal with filesystem paths
- Available since Python version 3.4
- Matching pathlib, and os or os.path functions
shutil: High-level file operations, e.g. copy/move a file or directory tree
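
A short sketch comparing a few os.path functions with their pathlib equivalents. The file name is hypothetical and does not need to exist.

```python
import os
from pathlib import Path

# Hypothetical file name, used only to compare the two APIs
name = os.path.join('some_dir', 'sub_dir', 'tas_test.nc')
path = Path('some_dir') / 'sub_dir' / 'tas_test.nc'

print(os.path.basename(name), path.name)       # file name
print(os.path.dirname(name), path.parent)      # parent directory
print(os.path.splitext(name)[1], path.suffix)  # extension
print(os.path.exists(name), path.exists())     # existence test
```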

Example: getting the full path of the Python executable used

Note: the actual python may be different from the default python!

$ which python
/usr/bin/python

$ /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/bin/python
>>> import sys, shutil
>>> shutil.which('python')
'/usr/bin/python'
>>> sys.executable
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/bin/python'

Example: getting the full path of a script

>>> import os
>>> os.getcwd()
'/home/jypmce/PMIP4'
>>> os.path.exists('./argv_test.py')
True
>>> os.path.abspath('./argv_test.py')
'/home/jypmce/PMIP4/argv_test.py'
>>> os.path.exists('/home/jypmce/PMIP4/argv_test.py')
True

Example: system independent paths with pathlib

Note: the following example was generated on a Linux server and uses a / character as a path separator

>>> from pathlib import Path
>>> my_home = Path.home()
>>> my_home
PosixPath('/home/users/my_login')
>>> my_conf = my_home / '.config' / 'evince'
>>> my_conf
PosixPath('/home/users/my_login/.config/evince')
>>> my_conf.is_dir()
True
>>> my_conf.is_file()
False
>>> list(my_conf.glob('*'))
[PosixPath('/home/users/my_login/.config/evince/evince_toolbar.xml'), PosixPath('/home/users/my_login/.config/evince/accels')]
>>> [ ff.name for ff in my_conf.glob('*') ]
['evince_toolbar.xml', 'accels']

Example: getting the size(s) of all the files in a directory

$ cd /data/jypmce/TestDir
$ ls -l
total 72
-rw-r--r-- 1 jypmce ipsl 18147 Jun 25  2012 get_TS_cmip5.py
-rw-r--r-- 1 jypmce ipsl 16152 Jun 21  2012 get_TS_cmip5.py~
-rw-r--r-- 1 jypmce ipsl 13954 Jul  3  2012 get_TS_cmip5_regular.py
-rw-r--r-- 1 jypmce ipsl 16539 Jun 22  2012 get_TS_cmip5_regular.py~

>>> os.chdir('/data/jypmce/TestDir')
>>> print(os.getcwd())
/data/jypmce/TestDir
>>> files_list = os.listdir()
>>> files_list
['get_TS_cmip5.py~', 'get_TS_cmip5_regular.py', 'get_TS_cmip5_regular.py~', 'get_TS_cmip5.py']
>>> files_sizes = list(map(os.path.getsize, files_list))
>>> files_sizes
[16152, 13954, 16539, 18147]
>>> sum(files_sizes)
64792
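
The same kind of operation can be sketched with pathlib. The example works in a temporary directory with two hypothetical files, so that it can be run anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    work_dir = Path(tmp_dir)
    # Create two small files with known sizes (hypothetical names)
    (work_dir / 'a.txt').write_bytes(b'x' * 10)
    (work_dir / 'b.txt').write_bytes(b'y' * 32)

    # Map the name of each regular file to its size in bytes
    sizes = {ff.name: ff.stat().st_size
             for ff in sorted(work_dir.iterdir()) if ff.is_file()}

print(sizes)
print(sum(sizes.values()))
```

Using a dictionary keeps each size attached to its file name, which avoids relying on the (arbitrary) order of the directory listing.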

Generating file names

Name depending on the current date/time

>>> import time
>>> plot_version = time.strftime('%Y%m%d_%H%M')
>>> f_name = 'test_%s.nc' % (plot_version,)
>>> f_name
'test_20210827_1334.nc'

Temporary file

>>> import tempfile, os
>>> f_tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.nc', delete=False)
>>> f_tmp
<tempfile._TemporaryFileWrapper object at 0x2b5614743820>
>>> f_tmp.name
'/tmp/tmpi6uk9hre.nc'
>>> f_tmp.close()
>>> os.remove(f_tmp.name)

Using command-line arguments

The extremely easy but non-flexible way: sys.argv

The name of a script and its arguments (as strings) can be accessed through the sys.argv list of strings. The length of this list is the number of arguments, including the name of the script

Simple argv_test.py test script:

#!/usr/bin/env python
import sys
nb_args = len(sys.argv)
print('Number of script arguments (including script name) =', nb_args)
for idx, val in enumerate(sys.argv):
    print(idx, val)

$ python argv_test.py
Number of script arguments (including script name) = 1
0 argv_test.py

$ python argv_test.py tas tas_tes.nc
Number of script arguments (including script name) = 3
0 argv_test.py
1 tas
2 tas_tes.nc

The C-style way: getopt

Use getopt (C-style parser for command line options)
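
A minimal getopt sketch. The option names and the argument list are hypothetical; a real script would pass sys.argv[1:] instead of a hard-coded list.

```python
import getopt

# Hypothetical command line, as it would appear in sys.argv[1:]
args = ['-v', '-o', 'out.nc', '--var=tas', 'input.nc']

# 'vo:' => -v takes no value, -o requires one; 'var=' => --var requires a value
options, remaining = getopt.getopt(args, 'vo:', ['var='])

print(options)    # list of (option, value) pairs
print(remaining)  # arguments that are not options
```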

The deprecated Python way: optparse

optparse (parser for command line options) is deprecated since Python version 3.2! You should now use argparse (check Upgrading optparse code for converting from optparse to argparse)

The current Python way: argparse

argparse (parser for command-line options, arguments and sub-commands) is available since Python version 3.2
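
A minimal argparse sketch, with hypothetical argument names. A list is passed explicitly to parse_args for the demonstration; called without arguments, parse_args reads sys.argv[1:].

```python
import argparse

parser = argparse.ArgumentParser(description='Hypothetical script working on one variable')
parser.add_argument('variable', help='name of the variable, e.g. tas')
parser.add_argument('input_file', help='path of the input file')
parser.add_argument('-n', '--nb-repeat', type=int, default=1, help='number of repetitions')
parser.add_argument('-v', '--verbose', action='store_true', help='display more messages')

# Explicit list of arguments, for the demonstration only
args = parser.parse_args(['tas', 'tas_test.nc', '-n', '3'])
print(args.variable, args.input_file, args.nb_repeat, args.verbose)
```

argparse converts the values to the requested type, and generates the -h/--help message automatically.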

Using ordered dictionaries

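An OrderedDict remembers the order in which its keys were inserted. Note that the usual Python dictionary also preserves insertion order: this was an implementation detail of CPython 3.6, and has been guaranteed by the language since version 3.7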

Check the OrderedDict class (from collections import OrderedDict) and the OrderedDict vs dict in Python: The Right Tool for the Job tutorial
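
The sketch below shows two things that still distinguish an OrderedDict from a regular dictionary: reordering methods, and order-sensitive comparison. The keys and values are arbitrary.

```python
from collections import OrderedDict

od = OrderedDict([('a', 10), ('b', 5), ('c', -1)])
od.move_to_end('a')  # reordering method that a regular dict does not have
print(list(od))      # ['b', 'c', 'a']

# Equality between two OrderedDicts is order-sensitive; between regular dicts it is not
same_items = OrderedDict([('a', 10), ('b', 5), ('c', -1)])
print(od == same_items)              # False
print(dict(od) == dict(same_items))  # True
```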

Using sets

Python sets are groups of unique elements. They can be used to easily find all the unique elements of something and you can easily determine the intersection, union (and other similar operations) of sets.
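
A short sketch of the usual set operations. The two lists of names are hypothetical.

```python
models_a = ['IPSL', 'CESM2', 'AWI', 'IPSL']
models_b = ['CESM2', 'MIROC', 'IPSL']

set_a, set_b = set(models_a), set(models_b)

# Sets are not ordered: sorted() is used to get a reproducible display
print(sorted(set_a))          # unique elements of models_a
print(sorted(set_a & set_b))  # intersection
print(sorted(set_a | set_b))  # union
print(sorted(set_a - set_b))  # elements of set_a that are not in set_b
```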

Printing a readable version of long lists or dictionaries

The pprint module can be used for pretty printing objects (lists, dictionaries, …). It will wrap long lines in a meaningful way

>>> import pprint

>>> test_dic = {'AWI-ESM-1-1-LR_AWI':{'r1i1p1f1': {'grid': 'gn'}}, 'CESM2_NCAR':{'r1i1p1f1': {'grid': 'gn'}}, 'IPSL-CM6A-LR_IPSL':{'r1i1p1f1': {'grid': 'gr'}, 'r1i1p1f2': {'grid': 'gr'}, 'r1i1p1f3': {'grid': 'gr'}, 'r1i1p1f4': {'grid': 'gr'}}}

>>> print(test_dic)
{'AWI-ESM-1-1-LR_AWI': {'r1i1p1f1': {'grid': 'gn'}}, 'CESM2_NCAR': {'r1i1p1f1': {'grid': 'gn'}}, 'IPSL-CM6A-LR_IPSL': {'r1i1p1f1': {'grid': 'gr'}, 'r1i1p1f2': {'grid': 'gr'}, 'r1i1p1f3': {'grid': 'gr'}, 'r1i1p1f4': {'grid': 'gr'}}}

>>> pprint.pprint(test_dic)
{'AWI-ESM-1-1-LR_AWI': {'r1i1p1f1': {'grid': 'gn'}},
 'CESM2_NCAR': {'r1i1p1f1': {'grid': 'gn'}},
 'IPSL-CM6A-LR_IPSL': {'r1i1p1f1': {'grid': 'gr'},
                       'r1i1p1f2': {'grid': 'gr'},
                       'r1i1p1f3': {'grid': 'gr'},
                       'r1i1p1f4': {'grid': 'gr'}}}
                       
>>> dir(test_dic)
['__class__', '__contains__', '__delattr__', [... lots of unreadable stuff removed...] 'setdefault', 'update', 'values']

>>> pprint.pprint(dir(test_dic))
['__class__',
 '__contains__',

[... lots of lines removed in this example ]

 'setdefault',
 'update',
 'values']

Storing objects and data in a file (shelve and friends)

The built-in shelve module can be easily used for storing temporary/intermediate data
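
A minimal shelve sketch: a shelf behaves like a dictionary stored on disk, with string keys and picklable values. The shelf name and its content are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_name = os.path.join(tmp_dir, 'intermediate_results')  # hypothetical name

    # Store picklable objects under string keys
    with shelve.open(shelf_name) as shelf:
        shelf['years'] = [1984, 2001]
        shelf['settings'] = {'grid': 'gr', 'nb_repeat': 10}

    # Reopen the shelf (possibly from another script) and read the data back
    with shelve.open(shelf_name) as shelf:
        years = shelf['years']
        settings = shelf['settings']

print(years, settings)
```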

More options:

Some non-NetCDF file formats
Working with NetCDF files

Using a configuration file

The built-in configparser module can be easily used for reading (and writing!) text configuration files.

Note: a configuration file is also a way to easily store and exchange text data !
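
A minimal configparser sketch. The section and option names are hypothetical, and the configuration is read from a string to keep the example self-contained; config.read('some_file.ini') would read it from a file.

```python
import configparser

# Hypothetical configuration content
config_text = """
[paths]
input_dir = /data/some_login/input

[plot]
title = Test plot
nb_levels = 12
use_colors = yes
"""

config = configparser.ConfigParser()
config.read_string(config_text)

print(config.sections())
print(config['paths']['input_dir'])             # values are always stored as strings...
print(config['plot'].getint('nb_levels'))       # ...but can be converted on the fly
print(config['plot'].getboolean('use_colors'))
```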

Working with global variables

There is a good chance you don't actually want/need a global variable. Be sure to use the global statement correctly if you want to avoid side-effects…

Using (and changing) a global variable inside a script or module

Simple module example

_myvar = 10

def set_myvar(new_val):
    # Note: need to explicitly define a global variable (of a module)
    # as 'global' BEFORE changing its value in a function!
    # Otherwise, the value will not be REdefined outside the function
    global _myvar
    _myvar = new_val

def get_myvar():
    return _myvar

def myfunc(nb_repeat = 10):
    print(nb_repeat * _myvar)
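
The sketch below illustrates what happens when the global statement is forgotten. The variable and function names are hypothetical.

```python
_counter = 0

def broken_increment():
    # No 'global' statement: this only creates a LOCAL variable named _counter
    _counter = 100
    return _counter

def working_increment():
    global _counter  # the module-level variable will be modified
    _counter += 1
    return _counter

broken_increment()
after_broken = _counter   # the module-level value is unchanged
working_increment()
after_working = _counter  # the module-level value has been incremented
print(after_broken, after_working)
```

Reading a module-level variable inside a function never requires global; only assigning to it does.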

Sharing global variables across modules

Sorting

When dealing with numerical values, you should use the numpy sorting, searching, and counting routines!
Sorting HOW TO
Example: sorting the keys and the values of a dictionary, and then using the key parameter to sort the keys of a dictionary according to the value associated with the key
- If we provide a key function, the sort function will sort the elements by the values returned by the function, instead of sorting by the initial values. The function used for generating the key below is very simple and we can use a lambda (i.e. an anonymous function defined in place)
- ```
>>> demo_dic = {'a':10, 'b':5, 'c':-1, 'd':0}

>>> sorted(demo_dic.keys())
['a', 'b', 'c', 'd']

>>> sorted(demo_dic.values())
[-1, 0, 5, 10]

>>> sorted(demo_dic.keys(), key=lambda key_name:demo_dic[key_name])
['c', 'd', 'b', 'a']
```
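
A possible variant, sketched under the assumption that the (key, value) pairs themselves are needed: sorting the items of the dictionary with operator.itemgetter instead of a lambda.

```python
from operator import itemgetter

demo_dic = {'a': 10, 'b': 5, 'c': -1, 'd': 0}

# Sort the (key, value) pairs by value; itemgetter(1) picks the value of each pair
by_value = sorted(demo_dic.items(), key=itemgetter(1))
print(by_value)

# Same thing in decreasing order, keeping only the keys
keys_decreasing = [key for key, _ in sorted(demo_dic.items(), key=itemgetter(1), reverse=True)]
print(keys_decreasing)
```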

Efficient looping with numpy, map, itertools and list comprehension

Big, nested, explicit for loops should be avoided at all costs, in order to reduce the execution time of a script!

numpy arrays should be used when dealing with numerical data
- Masked arrays can be used to deal with special cases and remove tests from loops
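
A small sketch of how a masked array removes a test from a loop. It assumes numpy is installed, and the -999.0 missing-value code is hypothetical.

```python
import numpy as np

# -999.0 is a hypothetical code for a missing value
temperatures = np.array([12.5, -999.0, 14.0, 13.5])

# Mask the missing values once, instead of testing each element in a loop
masked = np.ma.masked_equal(temperatures, -999.0)

print(masked)
print(masked.mean())   # the masked element is ignored
print(masked.count())  # number of valid (unmasked) elements
```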

The built-in map function (and similar functions like zip, filter, …) can be used to efficiently apply a function (possibly a simple lambda function) to all the elements of a list
- ```
>>> my_ints = [1, 2, 3]

>>> list(map(str, my_ints)) # map returns an iterator in Python 3
['1', '2', '3']

>>> list(map(lambda ii: str(10*ii + 5), my_ints))
['15', '25', '35']
```
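
A short sketch of zip and filter, with hypothetical names and values. These functions, like map, return lazy iterators in Python 3, so list() is used to display their content.

```python
names = ['tas', 'pr', 'psl']
values = [288.1, 2.9, 1013.2]  # hypothetical values

# zip pairs the elements of several iterables
pairs = list(zip(names, values))

# filter keeps the elements for which the function returns True
large = list(filter(lambda pair: pair[1] > 100, pairs))

print(pairs)
print(large)
```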

The itertools module defines many more fancy iterators that can be used for efficient looping

Example: replacing nested loops with product

>>> import itertools as it
>>> it.product('AB', '01')
<itertools.product object at 0x2b35a7b5f100>

>>> list(it.product('AB', '01'))
[('A', '0'), ('A', '1'), ('B', '0'), ('B', '1')]

>>> for c1, c2 in it.product('AB', '01'):
...   print(c1 + c2)
...
A0
A1
B0
B1

>>> for c1, c2 in it.product(['A', 'B'], ['0', '1']):
...   print(c1 + c2)
...
A0
A1
B0
B1

>>> for c1, c2, c3 in it.product('AB', '01', '$!'):
...   print(c1 + c2 + c3, end=', ')
...
A0$, A0!, A1$, A1!, B0$, B0!, B1$, B1!,

List comprehensions (aka implicit loops) can also be used to generate lists from lists
- Example: converting a list of integers to a list of strings
  Note: in that case, you should rather use the map function detailed above
  - ```
  >>> my_ints = [1, 2, 3]
  
  >>> [ str(ii) for ii in my_ints ]
  ['1', '2', '3']
```
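
A few more comprehension patterns, sketched with arbitrary values: filtering with a condition, replacing two nested loops, and building a dictionary.

```python
my_ints = [1, 2, 3, 4, 5, 6]

# Keep only the even values, and convert them to strings
even_strings = [str(ii) for ii in my_ints if ii % 2 == 0]

# Two 'for' clauses replace two nested loops (same order as itertools.product)
combos = [c1 + c2 for c1 in 'AB' for c2 in '01']

# Dictionary comprehension: value -> square
squares = {ii: ii**2 for ii in my_ints[:3]}

print(even_strings, combos, squares)
```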

numpy related stuff

Using a numpy array to store arbitrary objects

The numpy arrays are usually used to store scalars of the same type (see also the Data type objects (dtype)), very often numerical values.

It is also possible to store arbitrary Python objects in an array, rather than using nested lists or dictionaries!

>>> import numpy as np
>>> some_array = np.empty((2, 3), dtype=object)
>>> some_array
array([[None, None, None],
       [None, None, None]], dtype=object)
>>> some_array.shape
(2, 3)
>>> print(some_array[-1, -1])
None
>>> some_array[-1, 0] = filled_contour # e.g. save an existing cartopy filled contour object
>>> some_array
array([[None, None, None],
       [<cartopy.mpl.contour.GeoContourSet object at 0x2ab679e8bf10>,
        None, None]], dtype=object)

Dealing with a variable number of indices

Official reference

>>> i10 = np.identity(10)
>>> i10
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
...
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
>>> i10.shape
(10, 10)

>>> i10[3:7, 4:6]
array([[0., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.]])
       
>>> s0 = slice(3, 7)
>>> s1 = slice(4, 6)
>>> i10[s0, s1]
array([[0., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.]])
       
>>> my_slices = (s0, s1)
>>> i10[my_slices]
array([[0., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.]])
       
>>> my_fancy_slices = (s0, Ellipsis)
>>> i10[my_fancy_slices]
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])
>>> i10[my_fancy_slices].shape
(4, 10)

>>> # WARNING! DANGERRRR! NEVER forget that a VIEW is NOT A COPY
>>> # and that you can change the content of the original array by mistake
>>> my_view = i10[my_slices]
>>> my_view[:, :] = -1
>>> my_view
array([[-1., -1.],
       [-1., -1.],
       [-1., -1.],
       [-1., -1.]])
>>> i10
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1., -1., -1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0., -1., -1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0., -1., -1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0., -1., -1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

Finding and counting unique values

Use np.unique, do not try to use histogram related functions!

>>> vals = np.random.randint(2, 5, (10,)) * 0.5 # Get 10 discrete float values
>>> vals
array([1. , 2. , 1. , 2. , 2. , 1.5, 1. , 1.5, 2. , 1.5])

>>> np.unique(vals)
array([1. , 1.5, 2. ])
>>> unique_vals, nb_unique = np.unique(vals, return_counts=True)
>>> unique_vals
array([1. , 1.5, 2. ])
>>> nb_unique
array([3, 3, 4])

>>> sorted_vals = np.sort(vals) # Sorted copy, in order to check the result
>>> sorted_vals
array([1. , 1. , 1. , 1.5, 1.5, 1.5, 2. , 2. , 2. , 2. ])

Applying a ufunc over all the elements of an array

There are all sorts of ufuncs (Universal Functions). We will only use add (one of the math operations) below, applied to the arrays defined in Finding and counting unique values

# Get the sum of all the elements of 'vals'
>>> np.add.reduce(vals)
15.5
>>> np.add.reduce(sorted_vals)
15.5
>>> vals.sum() # The usual and easy way to do it
15.5

# Compute the sum of the elements of 'nb_unique'
# AND keep (accumulate) the intermediate results
>>> nb_unique
array([3, 3, 4])
>>> np.add.accumulate(nb_unique)
array([ 3,  6, 10])

# The accumulated values can be used as indices to separate the different groups of sorted values!
>>> sorted_vals
array([1. , 1. , 1. , 1.5, 1.5, 1.5, 2. , 2. , 2. , 2. ])
>>> sorted_vals[0:3]
array([1., 1., 1.])
>>> sorted_vals[3:6]
array([1.5, 1.5, 1.5])
>>> sorted_vals[6:10]
array([2., 2., 2., 2.])

# Compute the sum of each equal-value group
>>> sorted_vals[0:3].sum(), sorted_vals[3:6].sum(), sorted_vals[6:10].sum()
(3.0, 4.5, 8.0)

Applying a ufunc over specified sections of an array

The reduceat function can be used to avoid explicit python loops, and improve the speed (but not the readability…) of a script. The example below improves what has been shown above

# Define a list with the boundaries of the intervals we want to apply the 'add' function to
# We need to add the beginning index (0), AND remove the last index
# (reduceat will automatically go to the end of the input array)
>>> nb_unique
array([3, 3, 4])
>>> slices_indices = [0] + list(np.add.accumulate(nb_unique))
>>> slices_indices.pop() # Remove last element
10
>>> slices_indices
[0, 3, 6]

# Compute the sums over the selected intervals with just one call
>>> np.add.reduceat(np.sort(vals), slices_indices)
array([3. , 4.5, 8. ])

Exercise your brain with numpy

Have a look at 100 numpy exercises

matplotlib related stuff

Working with time axes (and ticks)

If you have problems setting the limits of a time axis, choosing the ticks' locations, or specifying the style of the labels, you should check the:

Data representation

A few notes for a future section or page about data representation (bits and bytes) on disk and in memory, vs data format

Add parts (pages 28 to 37) of this old tutorial to this section

Base notions

Never forget that all the bits and pieces of information we use are coded in base 2 (0s and 1s …), grouped in bytes!
- Some things can be stored exactly (integers, characters, …)
- In other cases (real numbers that we work with all the time, compressed images/videos/music) we only store a good enough approximation

1 byte ⇔ 8 bits
- REAL*4 ⇔ 4 bytes ⇔ 32 bits
- For easier written/displayed representation, 1 byte is usually split into 2 groups of 4 bits, and displayed using base 16 and hexadecimal representation (characters 0, 1, …, A, B, …, F)
  - 0000 ⇔ 0,
    0001 ⇔ 1, …,
    1111 ⇔ F
  - 1101 ⇔ D in hexadecimal ⇔ 13 in decimal (1 * 8 + 1 * 4 + 0 * 2 + 1 * 1)
  - 11111101 in base 2 ⇔ 1111 1101 ⇔ FD in hexadecimal ⇔ 253 (15 * 16 + 13) in decimal

Base conversion with Python

>>> hex(13) # Decimal to Hexadecimal conversion
'0xd'
>>> hex(253)
'0xfd'
>>> hex(256)
'0x100'
>>> int('0x100', 16) # Hexadecimal to Decimal conversion
256
>>> int('1111', 2) # Binary to Decimal conversion
15
>>> int('11111101', 2) # '11111101' <=> '1111 1101' <=> 'FD' <=> 15 * 16 + 13 = 253
253
>>> 0o13 # DANGER! An integer literal starting with 0o is in OCTAL base (Python 3 rejects the old 013 notation with a SyntaxError)
11
>>> int('13', 8) # 1*8 + 3
11
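
The format specification mini-language can also be used for base conversions, e.g. to get a fixed number of binary digits. A short sketch, using the same values as above:

```python
value = 253

binary_with_prefix = bin(value)  # '0b' prefix, like the '0x' prefix of hex()
binary_8_bits = f'{value:08b}'   # no prefix, zero-padded to 8 binary digits
thirteen_8_bits = f'{13:08b}'
hexa = f'{value:x} {value:X} {value:#x}'  # lower case, upper case, with prefix

print(binary_with_prefix, binary_8_bits, thirteen_8_bits)
print(hexa)
print(oct(11), int('0o13', 8))   # decimal to octal, and back
```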

More technical topics
- Bit numbering: the art of ordering bits, everything about MSB (Most Significant Bit) and LSB (Least Significant Bit)
- Endianness: the art of ordering bytes
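
Byte ordering can be explored with plain Python integers. The sketch below stores the value 1 on 4 bytes with both byte orders, then reads the bytes back with the wrong order.

```python
import sys

value = 1

little = value.to_bytes(4, byteorder='little')  # least significant byte first
big = value.to_bytes(4, byteorder='big')        # most significant byte first
print(little.hex(' '), '|', big.hex(' '))

# Reading the same 4 bytes with the wrong byte order gives a different number
misread = int.from_bytes(little, byteorder='big')
print(misread)

print(sys.byteorder)  # native byte order of the machine running the script
```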

Numerical values

Binary data representation of some numbers (only some common types are listed here):

Languages and packages references used below:
- Python: NumPy Sized aliases
- NetCDF: Data Types, Fortran related Data Types, CDL Data Types
- Fortran: Intel Fortran Compiler Intrinsic Data Types
Integers
- Range:
  - 4-byte signed integers: −2,147,483,648 to 2,147,483,647
    - Python: numpy.int32
    - NetCDF: int, NC_INT or NC_LONG, NF90_INT
    - Fortran: INTEGER*4
  - 8-byte signed integers: −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
    - Python: numpy.int64
    - NetCDF: int64, NC_INT64
    - Fortran: INTEGER*8
- Tech note: signed integers use two's complement for coding negative integers
Floating point numbers (IEEE 754 standard aka IEEE Standard for Binary Floating-Point Arithmetic)
- Range:
  - 4-byte float: ~7 significant digits, exponent range about 10^±38
    - Python: numpy.float32
    - NetCDF: float, NC_FLOAT, NF90_FLOAT
    - Fortran: REAL*4
    - See also Single-precision floating-point format
  - 8-byte float: ~15 significant digits, exponent range about 10^±308
    - Python: numpy.float64
    - NetCDF: double, NC_DOUBLE, NF90_DOUBLE
    - Fortran: REAL*8
- Special values:
  - NaN: Not a Number
    - Python: numpy.nan
  - Infinity
    - Python: -numpy.inf and numpy.inf
  - Note: it is cleaner to use masks (and Numpy masked arrays) rather than NaNs, when you have to deal with missing values !
- The RISKS of working with (the wrong) floats:
  - Round-off error
  - Catastrophic cancellation
    - What Every Computer Scientist Should Know About Floating-Point Arithmetic
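
The risks listed above can be reproduced with standard Python floats (8-byte floats). A short sketch, with values chosen only for the demonstration:

```python
import math

# Round-off: 0.1, 0.2 and 0.3 have no exact binary representation
exact_equal = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3)  # compare with a tolerance instead
print(exact_equal, close_enough)

# Cancellation: 1.0 is lost when added to 1.0e16, and the subtraction reveals it
big = 1.0e16
difference = (big + 1.0) - big
print(difference)

# A naive sum accumulates round-off errors; math.fsum compensates for them
values = [0.1] * 10
print(sum(values) == 1.0, math.fsum(values) == 1.0)
```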

A rather technical example: we play with a numpy 4-byte integer scalar

>>> one_int32 = np.int32(1)
>>> one_int32
1
>>> type(one_int32)
<class 'numpy.int32'>
>>> one_int32.dtype
dtype('int32')
>>> one_int32.shape # A numpy SCALAR, is an ARRAY WITH NO SHAPE !
()
>>> one_int32[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: invalid index to scalar variable.
>>> one_int32[()] # Note how to access the single element, when there is NO SHAPE
1
>>> one_int32.ndim # NO SHAPE means no dimensions, but there is ONE element
0
>>> one_int32.size
1
>>> one_int32.nbytes # The element requires 4 bytes of storage
4
>>> hex(one_int32) # We can print the hexadecimal representation for INTEGERS scalars and arrays
'0x1'
>>> hex(one_int32 * 15)
'0xf'
>>> hex(one_int32 * 16)
'0x10'

# 'Serialize' the data (i.e. change the data to a series of bytes)
# Note: the serialized data seems to be printed in the reverse order of 'hex(one_int32)'
>>> one_int32_serialized = one_int32.tobytes()
>>> type(one_int32_serialized)
<class 'bytes'>
>>> len(one_int32_serialized)
4
>>> one_int32_serialized 
b'\x01\x00\x00\x00'
>>> one_int32_serialized.hex(' ') # Another way to print the hexadecimal values
'01 00 00 00'

# Use the following in the unlikely case where you need to change the endianness (bytes ordering)
>>> one_int32_reversed_endian = one_int32.byteswap()
>>> one_int32_reversed_endian # Same bytes in a different order represent a different number (of course)
16777216
>>> hex(one_int32_reversed_endian) # Compare to the output of hex(one_int32) above
'0x1000000'
>>> one_int32_reversed_endian.tobytes()
b'\x00\x00\x00\x01'

Another technical example: we use an array of 2 integers
When using byteswap(), notice how bytes are swapped by groups of 4 bytes, because an int32 uses 4 bytes

>>> array_example = np.asarray((3, 17), dtype=np.int32)
>>> array_example
array([ 3, 17], dtype=int32)
>>> array_example.shape, array_example.ndim, array_example.size, array_example.nbytes
((2,), 1, 2, 8)
>>> array_example.tobytes().hex(' ', 4)
'03000000 11000000'
>>> array_example.byteswap().tobytes().hex(' ', 4)
'00000003 00000011'

Manipulating binary data with bytes, bytearray, memoryview

Array addressing
- Calculation of address of element of 1-D, 2-D, and 3-D using row-major and column-major order
  - In other words: using indices to go from 1-D to n-dimensional data
- The array structure
- python/C vs Fortran…
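
A sketch of the address calculation for both orderings, in pure Python. The `flat_index` helper is hypothetical; it returns an offset counted in elements, to be multiplied by the element size to get an offset in bytes.

```python
def flat_index(indices, shape, order='C'):
    """Offset (in elements) of a multi-dimensional index in a contiguous array.

    order='C' is row-major (last index varies fastest, Python/C),
    order='F' is column-major (first index varies fastest, Fortran).
    """
    pairs = list(zip(indices, shape))
    if order == 'C':
        pairs.reverse()  # start from the fastest-varying (last) dimension
    offset, stride = 0, 1
    for idx, dim in pairs:
        offset += idx * stride
        stride *= dim
    return offset

shape = (2, 3)  # 2 rows, 3 columns
print(flat_index((1, 2), shape, 'C'))  # 1*3 + 2
print(flat_index((1, 2), shape, 'F'))  # 1 + 2*2
print(flat_index((0, 1), shape, 'C'), flat_index((0, 1), shape, 'F'))
```

The last element has the same offset with both orderings, but the element at (0, 1) does not: this is why an array written by a Fortran program appears transposed when read naively from Python or C.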

disk and ram usage: how to check the usage (available ram and disk), best practice on multi-user systems (how much allowed?)
- du, df, cat /proc/meminfo, top

understanding and reverse-engineering binary format
- od, strings

binary vs text format: ascii, utf, raw
- text related functions in python: str, int, float, ord, …
  - lists conversion with map and join
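
A short sketch of these text related functions, combined with map and join. The strings and numbers are arbitrary.

```python
text = 'A B 0 1'

codes = list(map(ord, text))          # characters -> code points
rebuilt = ''.join(map(chr, codes))    # code points -> string
print(codes, repr(rebuilt))

numbers = [1, 2, 3.14]
line = ','.join(map(str, numbers))              # numbers -> one text line
parsed = list(map(float, line.split(',')))      # and back to numbers (all floats)
print(repr(line), parsed)
```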

Misc : md5sum
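
The checksum displayed by md5sum can also be computed with the built-in hashlib module. The sketch below uses a hypothetical `file_md5` helper, and a small temporary file created only for the demonstration.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read chunk by chunk (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, 'rb') as f_in:
        # Reading by chunks avoids loading a big file in memory at once
        for chunk in iter(lambda: f_in.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Small temporary file, created only for the demonstration
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as f_tmp:
    f_tmp.write(b'hello')

checksum = file_md5(f_tmp.name)
os.remove(f_tmp.name)
print(checksum)
```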

Strings

Encoding, ASCII, unicode, UTF-8, …

Getting the binary representation of a string

>>> test_string = 'A B 0 1 à µ'
>>> type(test_string)
<class 'str'>
>>> len(test_string)
11
>>> test_string_bin = test_string.encode('utf-8')
>>> test_string_bin
b'A B 0 1 \xc3\xa0 \xc2\xb5'
>>> type(test_string_bin)
<class 'bytes'>
>>> len(test_string_bin)
13
>>> test_string_bin.hex('-')
'41-20-42-20-30-20-31-20-c3-a0-20-c2-b5'
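
Going back from bytes to a string requires knowing the encoding that was used. A short sketch with the same test string:

```python
test_string = 'A B 0 1 à µ'
encoded = test_string.encode('utf-8')

round_trip = encoded.decode('utf-8')
print(round_trip == test_string)

# latin-1 uses one byte per character for this string, UTF-8 needs two for 'à' and 'µ'
print(len(test_string.encode('latin-1')), len(encoded))

# Decoding with the wrong encoding gives garbled text, not an error
garbled = encoded.decode('latin-1')
print(len(garbled))

# ASCII cannot represent 'à' or 'µ'
try:
    test_string.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```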

Debugging...

Some resources that you can use in the unlikely case that ~~you~~ some AI has introduced ~~features~~ bugs in your code

Built-in tools

Read the documentation of:

pdb, The Python Debugger
and the other built-in Debugging and Profiling tools

Using a decorator to log function calls

Check the example in My Lazy Secret to Cleaner Code: Python Decorators
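
A minimal sketch of such a decorator. It is not the code of the linked article: the `log_calls` and `scale` names are hypothetical.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')

def log_calls(func):
    """Decorator logging the arguments and the result of each call (hypothetical name)."""
    @functools.wraps(func)  # keep the name and docstring of the decorated function
    def wrapper(*args, **kwargs):
        logger = logging.getLogger(func.__module__)
        logger.info('Calling %s args=%r kwargs=%r', func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.info('%s returned %r', func.__name__, result)
        return result
    return wrapper

@log_calls
def scale(value, factor=2):
    return value * factor

print(scale(21))
print(scale.__name__)  # still 'scale', thanks to functools.wraps
```

Removing the `@log_calls` line is enough to switch the logging off for a function, without touching its body.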

Note: more about using log files
