Problem Management

Identify root causes and prevent incident recurrence

In brief

Problem management identifies the root causes of recurring incidents. KaliaOps provides a dedicated workflow with investigation phases, workaround documentation, Known Error Database (KEDB) management, and permanent solution implementation tracking.

Overview

A problem is the root cause of one or more incidents.

Incident vs problem

  • Incident: Symptom, visible impact, quick fix
  • Problem: Root cause, underlying issue, permanent fix

Problem management goal

According to ITIL:

  • Identify the root cause of incidents
  • Document workarounds to speed up future resolution
  • Implement permanent solutions to prevent recurrence

Example

  • Incident: "Application X crashed at 10am"
  • Problem: "Memory leak in module Y causing crashes under load"
  • Workaround: Restart service daily
  • Solution: Deploy fix in next release

Creating a problem

1. Access the Problems module

Menu ITSM → Problems.

2. Click "New problem"

Open the creation form.

3. Describe the problem

Fill in:

  • Title: Clear summary of the underlying issue
  • Description: Context, symptoms, observations
  • Priority: Based on impact and frequency

4. Link related incidents

Associate incidents that revealed this problem.

5. Submit

The problem is created with "New" status.

Tip: Create a problem when you see recurring incidents on the same item, or an incident with a workaround that doesn't address the root cause.
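The recurrence check in the tip above can be sketched in a few lines of Python. This is only an illustration: the `Incident` record, the `recurring_assets` helper, and the threshold of three incidents are assumptions made for the example, not part of KaliaOps.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical minimal incident record; not the KaliaOps data model.
    incident_id: str
    asset: str


def recurring_assets(incidents, threshold=3):
    """Return assets with at least `threshold` incidents, sorted by name."""
    counts = Counter(incident.asset for incident in incidents)
    return sorted(asset for asset, count in counts.items() if count >= threshold)


# Illustrative data only.
incidents = [
    Incident("INC-1", "SRV-APP-01"),
    Incident("INC-2", "SRV-APP-01"),
    Incident("INC-3", "SRV-DB-02"),
    Incident("INC-4", "SRV-APP-01"),
]

print(recurring_assets(incidents))  # ['SRV-APP-01']
```

An asset that appears in this list is a candidate for a new problem record; the right threshold depends on your incident volume.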

Workflow and statuses

Available statuses

Status                  Description
New                     Problem identified, not yet investigated
Assigned                Assigned for investigation
Under Investigation     Root cause analysis in progress
Root Cause Identified   Cause found, planning solution
Known Error             Documented in KEDB, workaround available
Resolved                Permanent solution implemented
Closed                  Problem completed and validated

Standard workflow

NEW → ASSIGNED → UNDER_INVESTIGATION → ROOT_CAUSE_IDENTIFIED
                                              ↓
                                        KNOWN_ERROR (with workaround)
                                              ↓
                                          RESOLVED → CLOSED
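
The standard workflow above can be modelled as a small state machine. The sketch below assumes that only the forward transitions drawn in the diagram are allowed; KaliaOps may permit other moves (for example reopening a problem), and the `Status` and `advance` names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    UNDER_INVESTIGATION = "Under Investigation"
    ROOT_CAUSE_IDENTIFIED = "Root Cause Identified"
    KNOWN_ERROR = "Known Error"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumption: only the forward transitions shown in the diagram are valid.
ALLOWED = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.UNDER_INVESTIGATION},
    Status.UNDER_INVESTIGATION: {Status.ROOT_CAUSE_IDENTIFIED},
    Status.ROOT_CAUSE_IDENTIFIED: {Status.KNOWN_ERROR},
    Status.KNOWN_ERROR: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current, target):
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target


status = advance(Status.NEW, Status.ASSIGNED)
print(status.value)  # Assigned
```

Keeping the transitions in a single table makes it easy to see which moves are rejected, such as jumping from "New" straight to "Resolved".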

Root cause analysis (RCA)

Root Cause Analysis identifies why incidents occurred.

RCA methods

  • 5 Whys: Ask "why?" repeatedly until you reach the root cause
  • Fishbone diagram: Categorize potential causes
  • Timeline analysis: Trace events leading to the incident
  • Log analysis: Review system logs for evidence

Documenting RCA

In KaliaOps, document:

  • Root cause description: Clear explanation of the underlying issue
  • Evidence: Logs, screenshots, test results
  • Contributing factors: Conditions that enabled the failure

Example

Root cause: Database connection pool exhaustion

Evidence:
- Connection count reached max (100) at 09:58
- First errors logged at 09:59
- Pool configured 5 years ago for lower load

Contributing factors:
- Traffic increased 300% in last year
- No monitoring on connection pool

Tip: A good root cause is specific, evidence-based, and actionable. "Human error" is rarely a good root cause - dig deeper.

Documenting workarounds

A workaround is a temporary solution that restores service without fixing the root cause.

Why document workarounds?

  • Faster resolution: Technicians can apply a known fix immediately
  • Consistency: Everyone uses the same approach
  • Service continuity: Users get service restored quickly

Good workaround documentation

Include:

  • Steps: Clear, numbered instructions
  • Prerequisites: Required access, tools
  • Side effects: Any limitations or impacts
  • Duration: How long does the fix last?

Example

Workaround: Restart the application service

Steps:
1. Connect to server SRV-APP-01
2. Run: systemctl restart app-service
3. Verify service is running: systemctl status app-service
4. Monitor for 5 minutes

Side effects:
- 30 seconds of downtime during restart
- Active sessions are terminated

Duration: Fixes for ~24 hours until memory leak recurs
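
One way to keep workaround documentation consistent is to check each entry against the four items listed above. The following is a minimal sketch: the dictionary layout and the `missing_fields` helper are assumptions made for illustration, not a KaliaOps feature.

```python
# Field names mirror the checklist: steps, prerequisites, side effects, duration.
REQUIRED_FIELDS = ("steps", "prerequisites", "side_effects", "duration")


def missing_fields(workaround):
    """Return the required fields that are absent or empty, in checklist order."""
    return [name for name in REQUIRED_FIELDS if not workaround.get(name)]


# The example workaround above, expressed as a hypothetical dictionary.
workaround = {
    "title": "Restart the application service",
    "steps": [
        "Connect to server SRV-APP-01",
        "Run: systemctl restart app-service",
        "Verify service is running: systemctl status app-service",
        "Monitor for 5 minutes",
    ],
    "side_effects": [
        "30 seconds of downtime during restart",
        "Active sessions are terminated",
    ],
    "duration": "Fixes for ~24 hours until memory leak recurs",
}

print(missing_fields(workaround))  # ['prerequisites']
```

Here the check reports that prerequisites (required access, tools) have not been recorded yet, which is the kind of gap that slows down a technician applying the workaround for the first time.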

Known Errors (KEDB)

The Known Error Database (KEDB) records problems with identified root causes.

What is a Known Error?

A Known Error is a problem that has:

  • An identified root cause
  • A documented workaround
  • A permanent solution still pending (or no fix planned)

Benefits of KEDB

  • Faster incident resolution: Technicians search KEDB first
  • Knowledge sharing: Expertise is documented
  • Onboarding: New team members learn common issues

Marking as Known Error

  1. Complete root cause analysis
  2. Document workaround
  3. Change status to "Known Error"
  4. The problem is now searchable in KEDB

Using KEDB

When handling an incident:

  1. Search KEDB for matching symptoms
  2. If found, apply documented workaround
  3. Link incident to the known error
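
The "search KEDB for matching symptoms" step can be illustrated with a simple keyword-overlap search. This is a sketch under stated assumptions: the `KnownError` record, the `search_kedb` function, the sample entries, and the overlap threshold are all hypothetical, and the actual KaliaOps search may work differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    # Hypothetical KEDB entry; not the KaliaOps data model.
    error_id: str
    symptoms: str
    workaround: str


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_kedb(kedb, incident_description, min_overlap=2):
    """Return entries sharing at least `min_overlap` words, best match first."""
    wanted = tokens(incident_description)
    scored = []
    for entry in kedb:
        overlap = len(wanted & tokens(entry.symptoms))
        if overlap >= min_overlap:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


# Illustrative entries only.
kedb = [
    KnownError("KE-1", "application crashes under load memory leak",
               "Restart the application service"),
    KnownError("KE-2", "database connection pool exhausted",
               "Illustrative placeholder workaround"),
]

matches = search_kedb(kedb, "Application X crashed under heavy load")
print([m.error_id for m in matches])  # ['KE-1']
```

Plain word overlap misses variants such as "crashed" versus "crashes"; a real search would likely add stemming or full-text indexing.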

Implementing permanent solutions

The permanent solution eliminates the root cause.

Solution documentation

Record:

  • Solution description: What was done
  • Implementation date: When it was applied
  • Change reference: Link to associated change ticket
  • Validation: How the fix was verified

Typical solutions

  • Code fix deployed
  • Configuration change
  • Infrastructure upgrade
  • Process improvement
  • Training provided

Workflow

  1. Develop/plan the solution
  2. Create a change ticket for implementation
  3. Implement the change
  4. Validate the fix
  5. Update problem status to "Resolved"
  6. Document the solution

Tip: Always create a change ticket for permanent solutions. This ensures proper testing, approval, and rollback planning.

Linking associated incidents

Link related incidents to the problem.

Why link incidents?

  • Scope assessment: How many users were affected?
  • Pattern detection: When do incidents occur?
  • Communication: Update all affected users at once
  • Metrics: Cost/impact of the problem

Creating links

From the problem:

  1. Go to "Related Incidents" section
  2. Click "Link incident"
  3. Search and select incidents

From an incident:

  1. Open the incident
  2. In "Related Problem", select the problem

Automatic detection

KaliaOps can suggest links based on:

  • Same affected assets
  • Similar symptoms (keywords)
  • Time proximity
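
The three criteria above can be combined into a simple score. The sketch below is not KaliaOps's actual algorithm: the `Record` type, the `link_score` function, the equal weighting, and the 24-hour window are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    # Hypothetical shape shared by a problem and an incident in this sketch.
    asset: str
    keywords: frozenset
    occurred_at: datetime


def link_score(problem, incident, window=timedelta(hours=24)):
    """Score 0 to 3: one point each for same asset, shared keyword, time proximity."""
    score = 0
    if problem.asset == incident.asset:
        score += 1
    if problem.keywords & incident.keywords:
        score += 1
    if abs(problem.occurred_at - incident.occurred_at) <= window:
        score += 1
    return score


# Illustrative data only.
problem = Record("SRV-APP-01", frozenset({"crash", "memory"}),
                 datetime(2024, 1, 10, 10, 0))
related = Record("SRV-APP-01", frozenset({"crash"}),
                 datetime(2024, 1, 10, 15, 0))
unrelated = Record("SRV-DB-02", frozenset({"login"}),
                   datetime(2024, 1, 20, 9, 0))

print(link_score(problem, related))    # 3
print(link_score(problem, unrelated))  # 0
```

A score like this would only rank suggestions; the technician still confirms each link.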

Impact on resolution

When the problem is resolved:

  • All linked incidents can be updated
  • Users receive notification
  • Statistics reflect the resolution

Key points

  • Clear distinction: incident (symptom) vs problem (root cause)
  • Dedicated workflow for investigation and RCA
  • Reusable Known Error Database (KEDB)
  • Workaround + permanent solution documentation
  • Suggested links to related incidents