Software Development Choices for Portfolio Optimization

The first phase of developing the HALO (Heuristic Algorithm Optimizer) Portfolio Optimizer was testing mathematical and heuristic concepts.  The second phase was teaming up with beta partners in the financial industry to exchange optimization work for feedback on the optimizer features and results.

For the first phase, my primary tool for software development was the Ruby language.  Because Ruby is a “high-level” extensible language I was able to quickly prototype and test many diverse and complex concepts.  This software development process is sometimes referred to as software prototyping.

For the second, beta phase of software development I kept most of the software in Ruby, but began re-implementing selected portions of the code in C/C++. The goal was to keep the high-change-rate code in Ruby, while coding the more stable portions in C/C++ for run-time improvement.  While a good idea in theory, it turned out that my ability to foresee beta-partner changes was mixed at best.  While many changes hit the the Ruby code, and were easily implemented, a significant fraction hit deep into the C/C++ code, requiring significant development and debugging effort.  In some cases, the C/C++ effort was so high, I switched back portions of the code to Ruby for rapid development and ease of debugging.

Now that the limited-beta period is nearly complete, software development has entered a third phase: run-time-performance optimization.  This process involves converting the vast majority of Ruby code to C.  Notice, I specifically say C, not C/C++.   In phase 2, I was surprised at the vast increase in executable code size with C++ (and STL and Boost).  As an experiment I pruned test sections of code down to pure C and saw the binary (and in-memory) machine code size decrease by 10X and more.

By carefully coding in pure C, smaller binaries were produced, allowing more of the key code to reside in the L1 and L2 caches.  Moreover, because C allows very precise control over memory allocation, reallocation, and de-allocation, I was able to more-or-less ensure than key data resided primarily in the L1 and/or L2 caches as well.  When both data and instructions live close to the CPU in cache memory, performance skyrockets.

HALO code is very modular, meaning that it is carefully partitioned into independent functional pieces.  It is very difficult, and not worth the effort, to convert part of a module from Ruby to C — it is more of an all-or-nothing process.  So when I finished converting another entire module to C today, I was eager to see the result.  I was blown away.  The speed-up was 188X.  That’s right, almost 200 times faster.

A purely C implementation has its advantages.  C is extremely close to the hardware without being tied directly to any particular hardware implementation.   This enables C code (with the help of a good compiler) to benefit from specific hardware advantages on any particular platform.  Pure C code, if written carefully, is also very portable — meaning it can be ported to a variety of different OS and hardware platforms with relative ease.

A pure C implementation has disadvantages.  Some include susceptibility to pointer errors, buffer-overflow errors, and memory leaks as a few examples.  Many of these drawbacks can be mitigated by software regression testing, particularly to a “golden” reference spec coded in a different software language.  In the case of HALO Portfolio-Optimization Software, the golden reference spec is the Ruby implementation.  Furthermore unit testing can be combined with regression testing to provide even better software test coverage and “bug” isolation.  The latest 188X speedup was tested against a Ruby unit test regression suite and proven to be identical (within five or more significant digits of precision) to the Ruby implementation.  Since the Ruby and C implementations were coded months apart, in different software languages, it is very unlikely that the same software “bug” was independently implemented in each.  Thus the C helps validate the “golden” Ruby spec, and vice versa.

I have written before about how faster software is greener software.  At the time HALO was primarily a Ruby implementation, and I expected about a 10X speed up for converting from Ruby to C/C++.  Now I am increasingly confident that an overall 100X speedup for an all C implementation is quite achievable.  For the SaaS (software as a service) implementation, I plan to continue to use Ruby (and possibly some PHP and/or Python) for the web-interface code.  However, I am hopeful I can create a pure C implementation of the entire number-crunch software stack.  The current plan is to use the right tool for the right job:  C for pure speed, Ruby for prototyping and as a golden regression reference, and Ruby/PHP/Python/etc for their web-integration capabilities.

 

Financial Software Tech

In order to create software that is appealing to the enterprise market today, Sigma1 must create software for five years from now.   In this post I will answer the questions of why and how Sigma1 software intends to achieve this goal.

The goal of Sigma1 HAL0 software is to solve financial asset allocation problems quickly and efficiently.  HALo is portfolio-optimization software that makes use of a variety of proprietary algorithms.  HALo’s algorithms solve difficult portfolio problems quickly on a single-core computer, and much more rapidly with multi-core systems.

Savvy enterprise software buyers want to buy software that runs well on today’s hardware, but will also run on future generations of compute hardware.   I cannot predict all the trends for future hardware advanced, but I can predict one:  more cores.  Cores per “socket” are increasing on a variety of architectures:  Intel x86, AMD x86, ARM, Itanium, and IBM Power7 to name a few.  Even if this trend slows, as some predict, the “many cores” concept is here to stay and progress.

Simply put — Big Iron applications like portfolio-optimization and portfolio-risk management and modelling are archaic and virtually DOA if they cannot benefit from multi-core compute solutions.   This is why HAL0 is designed from day 1 to utilize multi-core (as well as multi-socket) computing hardware.  Multiprocessing is not a bolt-on retrofit, but an intrinsic part of HAL0 portfolio-optimization software.

That’s the why, now the how.  Google likes to use the phrase “map reduce”  while others like the phase embarrassingly parallel.   I like both terms because it can be embarrassing when a programmer discovers that the problems his software was slogging through in series were being solved in parallel by another programmer who mapped them to parallel sub-problems.

The “how” for HAL0’s core algorithm is multi-layered.   Some of these layers are trade secrets, I can disclose one.  Portfolio optimization involves creating an “efficient frontier” comprised of various portfolios along the frontier.  Each of these portfolios can be farmed out in parallel to evaluate its risk and reward values.   Depending on the parameters of a particular portfolio-optimization problem this first-order parallelism can provide roughly a 2-10x speedup — parallel, but not massively parallel.

HALo was developed under a paradigm I call CAP (congruent and parallel).  Congruent in this context means that given the same starting configuration, HAL0 will always produce the same result.  This is generally easy for single-threaded programs to accomplish, but often more difficult for programs running multiple threads on multiple cores.    Maintaining congruence is extremely helpful in debugging parallel software, and is thus very important to Sigma1 software.  [Coherent or Deterministic could be used in lieu of Congruent.]

As HAL0 development continued, I expanded the CAP acronym to CHIRP (Congruent, Heterogeneous, Intrinsically Recursively Parallel).   Not only does CHIRP have a more open, happier connotation that CAP, it adds two additional tenets:  heterogeneity and recursion.

Heterogeneity, in the context of CHIRP, means being able to run, in parallel, on a variety of machines will different computing capabilities.  On on end of the spectrum, rather than requiring all machines in the cloud or compute queue having the exact same specs (CPU frequency, amount of RAM, etc), the machines can be different.  On the other end of the spectrum, heterogeneity means running in parallel on multiple machines with different architectures (say x86 and ARM, or x86 and GPGPUs).  This is not to say that HAL0 has complete heterogeneous support; it does not.  HALo is, however, architected with modest support for heterogeneous solutions and extensibility for future enhancements.

The recursive part of CHIRP is very important.  Recursively parallel means that the same code can be run (forked) to solve sub-problems in parallel, and those sub-problems can be divided into sub-sub problems, etc.   This means that the same tuned, tight, and tested code can leveraged in a massively parallel fashion.

By far the most performance-enhancing piece of HAL0 portfolio-optimization CHIRP is RP.  The RP optimizations are projected to produce speedups of 50 to 100X over single-threaded performance (in a compute environment with, for example, 20 servers with 10 cores each).  Moreover, the RP parts of HAL0 only require moderate bandwidth and are tolerant of relatively high latency (say, 100 ms).

Bottom line:  HAL0 portfolio-optimization software is designed to be scalable and massively parallel.

 

 

 

Getting Technical

Software Scalability

Without IPC (Inter-process communication) scalability is virtually impossible.  With IPC comes numerous choices and tradeoffs:  pipes, named pipes, messages, semaphore-managed file-sharing, memory-sharing, sockets, socket pairs… to name some.   The tradeoffs involve platform support, job granularity, and performance.

After much thought, I have temporarily committed to Linux/UNIX named pipes as the primary IPC mechanism.  The pros of this choice include robustness, reasonable speed, and a wide range of support for multi-thread, multi-core, and multi-server parallelism.  The primary downside: lack of Windows compatibility.

The HAL0 portfolio-optimization suite can still run on Windows, but for now Windows parallelism is limited to thread-level scalability on a single machine.

Interoperability

Linux and UNIX programs often communicate very well together with streaming data.  [For this post, I’ll used the term Linux to generally refer to Linux and/or UNIX].  Streaming data has its limitations, however.  Perhaps the biggest limitation is “large-grain granularity”.  By this I mean that an interaction involves loading the program (high-latency), processing the stream, closing the program.  Streaming is expensive for fine-grain ad hoc requests because of the open/close overhead.

Named pipes (especially when properly paired) are an elegant way around the open/close overhead issue.

It is often a simple task to modify a Linux program to support named pipes.  It does not matter if said program is written in C++, Python, Java, Ruby, Perl, shell, etc.  If a program can be modified to accept input from a named pipe and write results to another named pipe, it can be integrated with other program that supports a common name, pipe-based API.  HAL0 financial software does just that.

Scalability Squared

HAL0 software forms the core engine or kernel of the Sigma1 portfolio software.  In parallel mode, HAL0 spawns worker jobs that compute portfolio metrics.  HAL0 gathers, organizes, and evaluates the results.  Then, based on the past and current wave of results, HAL0 identifies the most fruitful next wave of results to explore and “farms out” the next wave of computing.  This is the primary means of achieving scalability:  the worker jobs are distributed to other cores and other other servers.  These “other” compute resources perform the massive heavy-lifting in parallel, speeding up computation immensely.

When there are enough worker jobs, there comes a point when the worker jobs cease to be the primary compute-speed limiter.  This is where Amdahl’s law really kicks in.  At some point maximum speedup is achieved, limited by the “core” processes ability to send, wait for, receive and process worker-job data.

If the “core” (or master or “boss”) process itself can be split into parallel processes, an additional level of scalability kicks in.  This core HAL0 algorithm is designed to do just that.

Based on preliminary estimates, it can scale efficiently to 4 cores, delivering up to a 3.5X core speedup.   Additionally the current HAL0 periphery for each HAL0 instance scales efficiently to up to 10-ways, providing about a 6X additional speedup per instance (depending on IPC latency). Ideally, Sigma1 portfolio-optimization software can provide up to a 21X speedup (3.5 times 6) operating in parallel mode versus single-CPU mode.

There are many caveats in the preceding paragraph.  Right now I am focused on implementing and testing scalability, less so than on optimizing it.  For example I am currently implementing single-kernel instance scalability in a manner than is deterministic, repeatable, and producing results identical to single-CPU operation.  This limits the speedup, but makes regression testing practical.  Regression testing in turn helps keep the code robust and reliable.

Portfolio-Optimization Software for the Enterprise

So far, ensuring that Sigma1 portfolio software is capable of massive scalability has roughly tripled the software development effort. This is obviously slowing time-to-market, but I continue to believe it is worth the effort and schedule impact. First, scalability is a key product differentiator for enterprise-level customers. Second, supporting scalability from day 1 is much easier and more reliable that trying to retrofit scalability into an intrenched software architecture.

 

Benchmarking Financial Algorithms

In my last post I showed that there are far more that a googol permutations of portfolio of 100 assets with (positive, non-zero) weights in increments of 10 basis points, or 0.1%.    That number can be expressed as C(999,99), or C(999,900) or 999!/(99!*900!), or ~6.385*10138.  Out of sheer audacity, I will call this number Balhiser’s first constant (Kβ1).  [Wouldn’t it be ironic and embarrassing if my math was incorrect?]

In the spirit of Alan Turing’s 100th birthday today and David Hilbert’s 23 unsolved problems of 1900, I propose the creation of an initial set of financial problems to rate the general effectiveness of various portfolio-optimization algorithms.  These problems would be of a similar form:  each having a search space of Kβ1. There would be 23 initial problems P1…P23.  Each would have a series of 37 monthly absolute returns.  Each security will have an expected annualized 3-year return (some based on the historic 37-month returns, others independent).  The challenge for any algorithm A to score the best average score on these problems.

I propose the following scoring measures:  1) S”(A) (S double prime) which simply computes the least average semi-variance portfolio independent of expected return.  2) S'(A) which computes the best average semi-variance and expected return efficient frontier versus a baseline frontier.  3) S(A) which computes the best average semi-variance, variance, and expected return efficient frontier surface versus a baseline surface.  Any algorithm would be disqualified if any single test took longer than 10 minutes.  Similarly any algorithm would be disqualified if it failed to produce a “sufficient solution density and breadth” for S’ and S” on any test.  Obviously, a standard benchmark computer would be required.  Any OS, supporting software, etc could be used for purposes of benchmarking.

The benchmark computer would likely be a well-equipped multi-core system such as a 32 GB Intel  i7-3770 system.  There could be separate benchmarks for parallel computing, where the algorithm + hardware was tested as holistic system.

I propose these initial portfolio benchmarks for a variety of reasons.  1)  Similar standardized benchmarks have been very helpful in evaluating and improving algorithms in other fields such as electrical engineering.  2)  Providing a standard that helps separate statistically significant from anecdotal inference. 3)  Illustrate both the challenge and the opportunity for financial algorithms to solve important investing problems. 4)  Lowering barriers to entry for financial algorithm developers (and thus lowering the cost of high-quality algorithms to financial businesses).  5)  I believe HAL0 can provide superior results.

Green Software for Finance

When I Google for “green software” top searches are about software that helps other activities become greener, such as utility power optimization or HVAC optimization.  This makes sense because “smart power” has a bigger footprint than compute power consumption.  Other search results focus on green IT (information technology).

I would like to contribute to the dialog about green software itself.  My definition of green technologies (green software, green hardware) is technology that consumes less power while producing comparable or better results.  My introduction to green technology began with power-efficient hardware design 7 years ago.  I learned that power savings equals performance improvement due to thermodynamic and other considerations.  This mantra (less power equals more performance) is gradually transforming semiconductor design.  Technology companies that understand this best (Intel, ARM, Samsung, Google) stand to benefit in the years ahead.

In general, for every watt of power consumed by compute hardware, a watt or more of building cooling power is required.  Software’s true power consumption is about 2 times that of the compute power it consumes.  Compute power includes system power (CPU, RAM, driver, peripherals, and power supply losses) plus power for networking (routers, switches, etc) plus other power consumers like network-attached storage.

Some of the electrical engineering compute jobs I run take a week running 24×7 to complete, and often run on multiple CPUs and on multiple computers.  Each compute job easily consumes an average of 1 Kilowatt of compute power, hence an average of 2 KW of data center power. This works out to about 335 KW*h per run.  This is about $33 worth of power and is enough to power the average home’s electric needs (not counting heating and cooling) for about 8 days.

Right now the “greenness” of software is a relative.  Today the software development world doesn’t have the right models to compare whether, say, particular database software is more or less green than particular financial software.  Software and IT professionals can, however, assess whether one specific portfolio optimization solution is more or less green than another.

Creating green software begins with asking the right questions.  The fundamental questions are “How much power does our software consume, and how can we reduce it?”  I started to ask myself these questions early in the development stage of HAL0 financial software.  I realized that the software running fairly large computations on the same data over and over again.  Computations like the 3-year volatility of an asset.  I created a simple software cache mechanism that first checked if the exact complex computation had been performed before.  If it had the cache simply returned the previous result.  If not the computation was performed and the result was saved in the cache.  The result was a 3X speed up and an approximately 3x improvement in performance per watt for the HAL0 portfolio-optimization software.  The mantra that “power saved is performance gained” is even more true in the world of software.

In other words, green software design practices and focusing on faster software often lead to the same types of software improvement.  The thought process to arrive at software improvement can be different enough to give software developers new perspectives on their algorithms and code.  I found that some solutions that eluded me while looking for performance-enhancement (speed ups) were easily discover by thinking about power and resource inefficiencies.  Similarly some software improvement that came quickly from profiling performance data would have been unlikely to occur to me when thinking about green software methods; it is only in retrospect that I saw their performance per watt benefit.  The concept of “green software” is complimentary with other software concepts such lightweight software, rapid prototyping, and time-complexity analysis and optimization.

HAL0 portfolio-optimization software is designed to be green.  It is designed to get more done with less power consumption.  Like all other green software, greener means faster.  Some of HAL0 speedups have come from thinking green, and other haves come from “thinking fast.”  The speed and efficiency of HAL0’s core engine is high, but I already envision further improvements of 5 to 10X.  It is simply a question of time to implement them.