Data ingest governance is the practice of getting optimal value from the telemetry data an organization collects. This is especially important for a complex organization that has numerous business units and working groups. This is the first part of a four-part guide to optimizing your New Relic data ingest.
To ensure a level of context and accountability in your ingest planning, you should designate a few explicit roles and practices. At a minimum, select a governance team and schedule check-ins throughout the year to plan and adapt as needed.
Regardless of how your organization's teams are structured, you need to identify the individuals who will participate in the data ingest governance process. Selection of the team can be ad hoc, but it should represent a broad enough cross-section of teams that you have the right mix of knowledge and authority when priorities and decisions must be made. The team should have one individual who can be considered the overall observability manager. This may be the person who manages the New Relic account or an overarching team leader responsible for the systems and infrastructure monitored by New Relic.
This is the go-to person for resolving conflicts and communicating with senior management. This individual is instrumental when the organization has gray areas of ownership, which can lead to questions like "Who owns this Kubernetes cluster?" and "Why is it sending so much data this week?" The observability manager must be able to work with both technical individual contributors and senior management, and to foster consensus and cooperation when tough decisions are needed.
The observability manager functions as the lead for this team. The members of the governance team bring practical technical knowledge of the systems and services that are monitored in New Relic. They may be peers or direct reports of the observability manager. They share a common goal of high-quality observability for the entire organization, a goal that transcends any single team or business unit.
If you have a pre-existing structure such as an observability center of excellence (OCoE), your governance team can be drawn primarily from the core OCoE team.
The primary responsibilities of an OCoE team generally are:
- Maintain the relationship with New Relic.
- Govern accounts and users.
- Onboard new teams and individuals.
- Maintain an observability knowledge base.
- Promote collaboration and sharing among teams.
Incorporating data ingest governance adds the following responsibilities:
- Work with the observability manager to stay within monthly ingest targets.
- Monitor data ingest baselines and respond to anomalies.
- Draft and approve plans for data optimization/reduction as needed.
- Participate in scheduled check-ins where baseline data is analyzed and compared to ingest targets.
- Make modifications to ingest targets as needed.
Schedule data ingest governance meetings throughout the year to keep everyone up to date on data ingest volumes. Doing so makes data ingest governance predictable and easy to manage.
Use these meetings to maintain an organization-wide telemetry ingest target. You can break this target down into as many facets as is useful for your organization. For example, you might adopt the following ingest targets:
- Organization-wide (monthly): 1000TB
- Team A (monthly): 500TB
- Team B (monthly): 300TB
- Team C (monthly): 100TB
This rough set of targets leaves 100TB as a buffer for uncertainty. You may also set telemetry-specific limits for data types whose volume is highly variable. For example, you may set organization- or team-based limits on the ingest of logs or metrics.
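The buffer arithmetic above can be sketched in a few lines of Python. This is only an illustration: the target figures come from the example list, while the function name `unallocated_buffer` and the variable names are hypothetical and not part of any New Relic tooling.

```python
# Monthly ingest targets in TB, copied from the example list above.
ORG_TARGET_TB = 1000
TEAM_TARGETS_TB = {"Team A": 500, "Team B": 300, "Team C": 100}


def unallocated_buffer(org_target_tb, team_targets_tb):
    """Return the part of the organization target not assigned to any team."""
    allocated = sum(team_targets_tb.values())
    if allocated > org_target_tb:
        # Team targets that exceed the organization target leave no buffer
        # and probably indicate a planning mistake.
        raise ValueError(
            f"Team targets ({allocated} TB) exceed the organization "
            f"target ({org_target_tb} TB)"
        )
    return org_target_tb - allocated


buffer_tb = unallocated_buffer(ORG_TARGET_TB, TEAM_TARGETS_TB)
print(f"Buffer: {buffer_tb} TB")  # Buffer: 100 TB
```

Rerunning a check like this whenever a team target changes makes it obvious when the buffer has been spent.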
During these sessions, you'll track ingest against your plan and produce the action items needed to stay on target. Using the example targets above, you'll want to know whether teams A, B, and C are meeting their agreed ingest targets. If something is out of alignment, the governance team will suggest an optimization plan.
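As a rough sketch of what tracking ingest against the plan could look like, the Python snippet below compares per-team usage to the example targets. The usage figures are invented for illustration, and `teams_over_target` is a hypothetical helper. In practice, the actual numbers would come from your own usage reporting.

```python
def teams_over_target(actual_tb, target_tb):
    """Return {team: overage in TB} for every team above its monthly target."""
    return {
        team: actual_tb.get(team, 0) - target
        for team, target in target_tb.items()
        if actual_tb.get(team, 0) > target
    }


# Targets from the example above; the actuals are made-up sample data.
targets = {"Team A": 500, "Team B": 300, "Team C": 100}
actuals = {"Team A": 480, "Team B": 345, "Team C": 100}

overages = teams_over_target(actuals, targets)
print(overages)  # {'Team B': 45}
```

A team that lands exactly on its target, like Team C here, is not reported. Whether to warn earlier, for example at 90% of target, is a policy choice for the governance team.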
Outside of the scheduled check-ins, hold ad hoc sessions to respond to anomalies. Generally, these sessions are reserved for events that, if left unattended, would substantially impact the organization's budget. These anomalies have numerous causes. Some scenarios to watch for:
- A new software deployment increases log volume by 3x.
- A team enables a handful of new cloud integrations that unexpectedly increase metrics ingest by 200%.
- An acquisition of a new company leads to an increase in overall telemetry volume.
- Peak business season activity combined with some pre-peak refactors results in a much higher than expected custom events volume.
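The scenarios above share a pattern: current ingest for some data type jumps well past its baseline. A minimal Python sketch of that comparison follows. The baseline and current figures are invented, `flag_anomalies` is a hypothetical helper, and the 1.5x threshold is an assumption that you would tune to your own tolerance.

```python
def flag_anomalies(baseline_gb, current_gb, max_ratio=1.5):
    """Return {source: ratio} where current ingest exceeds baseline * max_ratio."""
    flagged = {}
    for source, baseline in baseline_gb.items():
        if baseline <= 0:
            continue  # no meaningful baseline to compare against
        ratio = current_gb.get(source, 0) / baseline
        if ratio > max_ratio:
            flagged[source] = round(ratio, 2)
    return flagged


# Made-up daily ingest in GB per telemetry type, for illustration only.
baseline = {"logs": 200, "metrics": 150, "custom events": 80}
current = {"logs": 600, "metrics": 160, "custom events": 85}

print(flag_anomalies(baseline, current))  # {'logs': 3.0}
```

Here log volume at 3x its baseline is flagged, while the small increases in metrics and custom events stay under the threshold. Note that a 200% increase also corresponds to a ratio of 3.0.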
The optimization part of this series provides a structured approach for assessing these anomalies and taking possible action.