A086eabaea19eb39622213eef57379ce

by markseger

I've been a software developer for over 40 years, spending the last 20 years doing performance monitoring in the areas of High Performance Computing and Clouds, with a focus on file systems. I'm also the author of several opensource monitoring tools, such as collectl which has been used to monitor some of the largest HPC systems in the world. More recently my focus has been on OpenStack Swift performance and am also the author of getput, a highly detailed Swift benchmarking tool.

Collectl was developed over a dozen years ago as a very lightweight yet highly detailed system monitoring tool, capable of collecting hundreds system performance metrics as frequently as every second. Its companion tool colplot, provides an easy to use web-based plotting package capable of displaying detailed statistics for multiple systems at the same time. Think of colmux, a third tool, as top-anything across a number of machines. It is capable of displaying anything collectl can collect in top-format, sorted by any column of your choice. For example, say you have a 100 node cluster, with colmux you can look at the disk wait times across thousands of disks sorted sorted from slowest to fasted, allowing you to easily identify bad or hot drives, OR look at memory consumption for leaks. How about a bad NIC consuming a CPU by interrupts? Or how about which process is doing the more disk reads, or write, or page faults? And remember, this is across a cluster.

The focus of collectl has always been highly efficient metrics collection and display via a CLI, no pretty pictures and no databases to slow it down. However what it does have is an API to allow it to pass those metrics onto whatever high level tools one may wish to communicate with. For example it has native support for ganglia or graphite over a socket. If you have some other favorite tool it can usually be adapted to communicate with it as well. Unfortunately most centralized tools are easily overwhelmed with fine-grained metrics and can only deal with them at granularities in the 1-min range. Not to worry, collectl has the ability to record and save metrics to local disk at one rate and send simultaneously send them to a central tool at a different rate, making it possible to get a coarser-grained centralized view and if there is a problem, still have access to finer-grained data.

Collectl has been used for monitoring some of the largest computing clusters in the world and in the last several years has been enhanced for monitoring Open Stack Clouds. It is currently packaged as part of OpenSUSE.

Date:
2017 May 28 14:30
Duration:
1 h
Room:
Saal (Main Hall)
Conference:
openSUSE Conference 2017
Language:
Track:
Open Source
Difficulty:
Medium