AWS_julia

Steven Buczkowski

1 DONE Figure out julia package issues on taki (for local practice and comparison)

Seems there were two issues:

  • something screwed up in package registry
  • julia was picking up taki local libraries instead of using julia's copy

    • SOLVED by setting LD_LIBRARY_PATH to include the julia lib directory for the currently loaded julia module (should probably be done as part of module init; send note to OIT)

    • export LD_LIBRARY_PATH="/usr/ebuild/software/Julia/1.6.2-linux-x86_64/lib/julia:$LD_LIBRARY_PATH"

Can now run julia with NetCDF and NCDatasets packages on taki/strowinteract
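
A quick way to check which copy of a shared library julia actually picked up is to look at the loaded library paths from inside the session. This is only a sketch: loaded_libs is a hypothetical helper name, and "libjulia" is just an example search string (after loading NetCDF one might search for "libnetcdf" or "libcurl" instead).

```julia
using Libdl

# Hypothetical helper: paths of loaded shared libraries whose path contains `name`
loaded_libs(name::AbstractString) = filter(p -> occursin(name, p), Libdl.dllist())

# What the session sees for LD_LIBRARY_PATH (empty string if unset)
ld_path = get(ENV, "LD_LIBRARY_PATH", "")
println("LD_LIBRARY_PATH = ", ld_path)

# Print the matching libraries; paths under the julia install tree mean
# julia's own copy is in use rather than a system copy
for lib in loaded_libs("libjulia")
    println(lib)
end
```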

2 DONE Figure some basic trivialities with Julia

Before figuring out how to read S3 stores in julia, need to figure out how to do some basics on filesystems I understand: reading directories, filtering filenames, reading filepaths from a text file.

2.1 reading from a text file

for line in eachline("path-to-file-of-paths")
     ## do some stuff with variable "line"
end
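
A slightly fuller sketch of the same loop, collecting the paths into an array while skipping blank lines and "#" comment lines. The helper name read_path_list and the example paths are made up for illustration; the temp file only makes the example self-contained.

```julia
# Hypothetical helper: collect non-empty, non-comment lines from a list file
function read_path_list(listfile::AbstractString)
    paths = String[]
    for line in eachline(listfile)
        entry = strip(line)                  # drop surrounding whitespace
        isempty(entry) && continue           # skip blank lines
        startswith(entry, "#") && continue   # skip comment lines
        push!(paths, String(entry))
    end
    return paths
end

# Self-contained demo: write a small list file, then read it back
listfile, io = mktemp()
write(io, "/data/a.nc\n\n# a comment\n  /data/b.nc  \n")
close(io)

paths = read_path_list(listfile)
rm(listfile)
```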

2.2 reading directory directly with readdir()

readdir()
# lists the contents of the current working directory ($PWD)

readdir("path")
# lists the contents at "path"

2.2.1 absolute paths

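readdir() returns bare names; to get absolute paths, map the results through the abspath() function

  • map(abspath, readdir())
  • abspath() resolves names against $PWD, so this is only correct when listing the current directory; for another directory use readdir(abspath("path"); join=true) (the join keyword needs julia 1.4+)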
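
A minimal sketch comparing bare names with full paths, using a throwaway temp directory (the file names are made up for illustration). It assumes julia 1.4 or later for the join keyword of readdir().

```julia
# Build a small temp directory so the example is self-contained
dir = mktempdir()
touch(joinpath(dir, "a.nc"))
touch(joinpath(dir, "b.txt"))

# Bare file names only (sorted by default)
names = readdir(dir)

# Full paths: each name is joined onto `dir`
full = readdir(dir; join=true)

# Equivalent by hand, and safe for any directory, not just $PWD
full_by_hand = map(name -> joinpath(dir, name), readdir(dir))
```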

2.2.2 filtering return file list on contents

readdir() does not filter returned values; this has to be done by wrapping readdir() in an external filtering function

filter(x -> occursin("text", x), readdir())
# returns only files containing "text" in name

filter(x -> occursin("text", x), map(abspath, readdir()))
# returns files where "text" occurs anywhere in full path

filter(x -> occursin(r"regex", x), readdir())
# returns files matching regular expression "regex"

Other predicates, named or anonymous (the "x -> occursin…" part), can be passed to filter(), so this is far from the only way to list and filter directories.
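
A few alternative ways to do the same filtering, as a sketch. The file names (chirp_001.nc etc.) are invented for the example and are not the real CHIRP naming scheme.

```julia
# Temp directory with a few made-up file names
dir = mktempdir()
for name in ("chirp_001.nc", "chirp_002.nc", "notes.txt")
    touch(joinpath(dir, name))
end

# Filter on file extension with endswith()
nc_files = filter(x -> endswith(x, ".nc"), readdir(dir))

# Same idea as an array comprehension, keeping full paths but
# matching the regex against the file name only
nc_paths = [p for p in readdir(dir; join=true)
            if occursin(r"^chirp_\d+\.nc$", basename(p))]

# Negated predicate: everything that is NOT a .nc file
others = filter(x -> !endswith(x, ".nc"), readdir(dir))
```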

3 DONE Basic reads of netcdf files with NCDatasets package from normal filesystem

using Pkg
Pkg.add("NCDatasets")
using NCDatasets

ds = NCDataset("path to the netcdf file")

# quick list of variables available in netcdf file
for (varname, var) in ds
  @show (varname, size(var))
end

# lazy load: gets a handle to variable "varname" and its attributes without loading the data
var = ds["varname"] # using "varname" generically, not as variable from previous block

# actually loads the data for var (here a 2-D variable). The second set of brackets can also select index ranges and strides, e.g. [1:10, 1:2:20]
var = ds["varname"][:,:]
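
A self-contained round-trip sketch: write a tiny netcdf file, then read it back in full and as a strided subset. The file name, dimension names, and the variable name "temp" are made up for illustration. The do-block form closes the file automatically, which the open-ended ds = NCDataset(...) form above does not.

```julia
using NCDatasets

fname = joinpath(mktempdir(), "demo.nc")
data = reshape(collect(1.0:12.0), 4, 3)   # 4x3 test array

# Create the file ("c" mode) with one 2-D variable
NCDataset(fname, "c") do ds
    defDim(ds, "x", 4)
    defDim(ds, "y", 3)
    v = defVar(ds, "temp", Float64, ("x", "y"))
    v[:, :] = data
end

# Read it back: everything, and rows 1 and 3 of column 2
full, sub = NCDataset(fname) do ds
    v = ds["temp"]               # lazy handle, no data read yet
    (v[:, :], v[1:2:4, 2])       # indexing triggers the actual reads
end
```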

4 TODO Basic timed read of 100-400 CHIRP files from S3

  • Make list of files/buckets and loop over them to read, accumulating the time taken(?)
  • read in list of CHIRP files via readdir() (?) and loop over this
    • this is what we need in the longer term for actual work. Will require understanding authentication/authorization key aging and how to re-authorize programmatically (see the S3 item (6) below)
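
A rough sketch of the timing loop, with local temp files standing in for the S3 objects (S3 access itself is still the open item here, so nothing below touches AWS). The helper name timed_reads and the file names are hypothetical.

```julia
# Hypothetical helper: read every file in `paths`, accumulating time and bytes
function timed_reads(paths)
    total_seconds = 0.0
    total_bytes = 0
    for path in paths
        t0 = time()
        bytes = read(path)              # raw read of the whole file
        total_seconds += time() - t0
        total_bytes += length(bytes)
    end
    return (seconds = total_seconds, bytes = total_bytes, nfiles = length(paths))
end

# Demo on three small local files
dir = mktempdir()
for i in 1:3
    write(joinpath(dir, "file_$i.bin"), rand(UInt8, 1000))
end

result = timed_reads(readdir(dir; join=true))
println("read ", result.nfiles, " files, ", result.bytes, " bytes in ",
        result.seconds, " s")
```

The first call includes julia's compilation time, so for real measurements it is probably worth running the loop once on a single file before timing the full set.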

5 TODO Timed read of rtp related variables from 100-400 CHIRP files

  • offshoot of code from previous item (4): just add reading some variables into arrays while timing
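
A sketch of what that could look like with NCDatasets on local files. The variable name "var_a", the dimension name "n", and the helper read_variables are placeholders, not actual CHIRP/rtp names; the demo writes its own tiny files so it runs without any real data.

```julia
using NCDatasets

# Hypothetical helper: read the named 1-D variables from each file,
# returning the collected arrays and the total elapsed time in seconds
function read_variables(paths, varnames)
    data = Dict(name => Any[] for name in varnames)
    t0 = time()
    for path in paths
        NCDataset(path) do ds
            for name in varnames
                push!(data[name], ds[name][:])   # loads the variable's data
            end
        end
    end
    return data, time() - t0
end

# Demo: two tiny files, each with one 1-D variable
dir = mktempdir()
for i in 1:2
    NCDataset(joinpath(dir, "f$i.nc"), "c") do ds
        defDim(ds, "n", 3)
        v = defVar(ds, "var_a", Float64, ("n",))
        v[:] = [1.0, 2.0, 3.0] .* i
    end
end

data, seconds = read_variables(readdir(dir; join=true), ["var_a"])
```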

6 TODO Figure out S3 access and ongoing authentication with AWS.jl and AWSS3.jl

7 TODO (Ongoing) start setting up both AWS and julia on both AWS and UMBC to be more turnkey

  • customize julia environment and put on github
  • set up AWS environment template(s) for spinning up nodes