The OCM-G (OMIS-Compliant Monitoring system for the Grid) is a system for monitoring of parallel applications running on the Grid. The OCM-G provides services for collecting and preprocessing information about applications at run-time. The OCM-G runs as an autonomous infrastructure exposing a standard interface.
The OCM-G is designed as a basis for application-development-support tools, such as performance analyzers, debuggers, or visualizers. Using the services of the OCM-G, tools can, among other things, obtain performance measurements of the monitored application, such as communication delay and volume or CPU usage. Information collected from the OCM-G is typically visualized as graphical charts that show application progress, monitor the activities of individual processes, reveal communication patterns, and help detect bottlenecks.
Compared to existing systems with a similar purpose, the OCM-G provides several unique capabilities, including the following:
- Support for Grid applications running across multiple sites.
- High performance: data-rate reduction techniques keep overhead extremely low and responsiveness high, sufficient even for monitoring interactive applications.
- Flexibility: rather than a fixed set of metrics, the OCM-G provides an extensive set of low-level services, from which a variety of performance metrics with the desired semantics can be constructed.
- Extendibility: the OCM-G can be extended with additional services, loaded dynamically at run-time.
- Compact and secure design: the OCM-G runs as a set of user processes, which use a lightweight and fast socket-based communication mechanism. At the same time, state-of-the-art techniques are applied to ensure secure communication. The OCM-G requires no special privileges, such as special access rights or additional open ports on firewalls, that could become security holes.
- Design as an autonomous infrastructure exposing the standard OMIS (On-line Monitoring Interface Specification) interface. The services of the OCM-G are available via this interface, which minimizes the effort of porting OMIS-based tools across platforms (essentially, only the OCM-G needs to be ported).
- Interoperability: thanks to the design as an autonomous service with a well-defined protocol, the OCM-G supports the interoperability of multiple tools monitoring a single application.
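The flexibility point above, building metrics from low-level services rather than choosing from a fixed set, can be illustrated with a minimal sketch. It does not use the OMIS interface or any OCM-G API: the `CommEvent` record, its fields, and the two metric functions are hypothetical names introduced only for illustration, assuming a tool has already received per-call timing and size data from the monitoring layer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CommEvent:
    """Hypothetical low-level record describing one completed send call."""
    process: str   # identifier of the sending process
    start: float   # time the call was entered, in seconds
    end: float     # time the call returned, in seconds
    nbytes: int    # payload size in bytes


def comm_volume(events):
    """Metric 1: total number of bytes sent by each process."""
    totals = defaultdict(int)
    for e in events:
        totals[e.process] += e.nbytes
    return dict(totals)


def comm_delay(events):
    """Metric 2: total time each process spent inside send calls."""
    totals = defaultdict(float)
    for e in events:
        totals[e.process] += e.end - e.start
    return dict(totals)


# Made-up sample data standing in for what a monitor might report.
events = [
    CommEvent("p0", 0.0, 0.5, 1024),
    CommEvent("p1", 0.25, 0.5, 2048),
    CommEvent("p0", 1.0, 1.25, 512),
]

volume = comm_volume(events)
delay = comm_delay(events)
```

Both metrics are derived from the same raw records, so a tool could define further ones (for example, average bandwidth per process) without any change to the monitoring layer.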
The target users of the OCM-G are:
- Application developers, and
- Tool developers.
Application developers need the OCM-G only when they use an OCM-G-compliant tool to monitor an application, and in that case using the OCM-G is straightforward. The main target user community is tool developers, who can use the OCM-G as a basis for various types of application-development-support tools.
The main benefits of using the OCM-G for tool developers are as follows:
- The OCM-G provides an abstraction layer for accessing low-level information about (and performing manipulations on) the target system and applications. Thus, there is no need to develop a tool-specific monitoring layer. Furthermore, portability is greatly increased, since platform-specific issues are hidden in the OCM-G. Consequently, tools automatically support platforms to which the OCM-G is ported.
- The OCM-G, as a common monitoring infrastructure, enables interoperability of multiple tools monitoring a single application. Otherwise (i.e., if each tool uses its own monitoring layer) this is usually impossible, since different monitors are likely to exclude each other (for example, because the usage of system debugging mechanisms such as ptrace is exclusive).
Together, these benefits allow for substantial savings of resources (time and funds) in the process of tool development.
Currently, the OCM-G is a fully operational, grid-enabled system possessing the features described above. The target applications are currently MPI-based, though the core design and implementation of the OCM-G do not depend in any way on the particular type of application. The OCM-G is a flexible, extendible, and powerful system that currently serves as the basis of the G-PM performance analysis tool. In the future it can be used to build various types of tools: not only performance analyzers, but also visualizers, debuggers, load balancers, and others. The OCM-G can also easily be integrated into a larger infrastructure, for example as part of a generic Grid monitoring and information service. In such a system, the OCM-G could work as one of many systems collecting information about different Grid entities (infrastructure, applications, middleware, etc.).
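The interoperability benefit described above can be pictured with a small sketch in which one shared monitoring layer observes the application and forwards each event to every attached tool, so the tools never compete for an exclusive mechanism. This is only an illustration of the idea, not the OCM-G design or the OMIS protocol: `SharedMonitor`, `subscribe`, `publish`, and the event payload are all hypothetical.

```python
class SharedMonitor:
    """Hypothetical stand-in for a single monitoring layer shared by several tools."""

    def __init__(self):
        # Maps an event kind (e.g. "send") to the callbacks registered for it.
        self._subscribers = {}

    def subscribe(self, kind, callback):
        """Register a tool's callback for one kind of event."""
        self._subscribers.setdefault(kind, []).append(callback)

    def publish(self, kind, payload):
        """Deliver one observed event to every subscribed tool.

        Returns the number of tools that received the event.
        """
        callbacks = self._subscribers.get(kind, [])
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


monitor = SharedMonitor()

# Two independent "tools" attach to the same monitor.
analyzer_bytes = []   # a performance analyzer accumulating message sizes
visualizer_log = []   # a visualizer collecting lines to display

monitor.subscribe("send", lambda p: analyzer_bytes.append(p["nbytes"]))
monitor.subscribe(
    "send",
    lambda p: visualizer_log.append(f"{p['process']} sent {p['nbytes']} B"),
)

# The application is observed once; both tools see the same event.
delivered = monitor.publish("send", {"process": "p0", "nbytes": 1024})
```

The application is instrumented once, and each additional tool costs only another subscription rather than another monitoring layer.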