Problem Management
Identify root causes and prevent incident recurrence
Problem management identifies the root causes of recurring incidents. KaliaOps provides a dedicated workflow with investigation phases, workaround documentation, Known Error Database (KEDB) management, and permanent solution implementation tracking.
Overview
A problem is the root cause of one or more incidents.
Incident vs problem
- Incident: Symptom, visible impact, quick fix
- Problem: Root cause, underlying issue, permanent fix
Problem management goal
According to ITIL, problem management aims to:
- Identify the root cause of incidents
- Document workarounds to speed up future resolution
- Implement permanent solutions to prevent recurrence
Example
- Incident: "Application X crashed at 10am"
- Problem: "Memory leak in module Y causing crashes under load"
- Workaround: Restart service daily
- Solution: Deploy fix in next release
Creating a problem
Access the Problems module
In the menu, go to ITSM → Problems.
Click "New problem"
Open the creation form.
Describe the problem
Fill in:
- Title: Clear summary of the underlying issue
- Description: Context, symptoms, observations
- Priority: Based on impact and frequency
Link related incidents
Associate incidents that revealed this problem.
Submit
The problem is created with "New" status.
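The fields in the creation form can be pictured as a small record. The following is a minimal sketch, not the KaliaOps data model: the `Problem` class, the `Priority` levels, the `create_problem` helper, and the `INC-0001` identifier format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Priority(Enum):
    # Hypothetical levels; the real priority scale may differ.
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass
class Problem:
    title: str
    description: str
    priority: Priority
    incident_ids: list[str] = field(default_factory=list)
    status: str = "New"  # a submitted problem starts with "New" status


def create_problem(title, description, priority, incident_ids=()):
    """Build a problem record, rejecting an empty title."""
    if not title.strip():
        raise ValueError("title must not be empty")
    return Problem(title.strip(), description, priority, list(incident_ids))


problem = create_problem(
    "Memory leak in module Y causing crashes under load",
    "Application X crashes under load; a restart restores service.",
    Priority.HIGH,
    incident_ids=["INC-0001"],  # hypothetical incident identifier
)
print(problem.status)  # New
```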
Workflow and statuses
Available statuses
| Status | Description |
|---|---|
| New | Problem identified, not yet investigated |
| Assigned | Assigned for investigation |
| Under Investigation | Root cause analysis in progress |
| Root Cause Identified | Cause found, planning solution |
| Known Error | Documented in KEDB, workaround available |
| Resolved | Permanent solution implemented |
| Closed | Problem completed and validated |
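One way to reason about these statuses is as a small state machine that only permits moves along the standard workflow. The sketch below is illustrative and assumes that each status has exactly one successor; whether KaliaOps allows other transitions (for example, skipping "Known Error") is not stated here, and the `advance` helper is hypothetical.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    UNDER_INVESTIGATION = "Under Investigation"
    ROOT_CAUSE_IDENTIFIED = "Root Cause Identified"
    KNOWN_ERROR = "Known Error"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumption: only the standard path is allowed, one successor per status.
TRANSITIONS = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.UNDER_INVESTIGATION},
    Status.UNDER_INVESTIGATION: {Status.ROOT_CAUSE_IDENTIFIED},
    Status.ROOT_CAUSE_IDENTIFIED: {Status.KNOWN_ERROR},
    Status.KNOWN_ERROR: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current, target):
    """Return the target status if the move is allowed, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = advance(Status.NEW, Status.ASSIGNED)
print(status.value)  # Assigned
```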
Standard workflow
NEW → ASSIGNED → UNDER_INVESTIGATION → ROOT_CAUSE_IDENTIFIED
↓
KNOWN_ERROR (with workaround)
↓
RESOLVED → CLOSED
Root cause analysis (RCA)
Root Cause Analysis identifies why incidents occurred.
RCA methods
- 5 Whys: Ask "why?" repeatedly until you reach the root cause
- Fishbone diagram: Categorize potential causes
- Timeline analysis: Trace events leading to the incident
- Log analysis: Review system logs for evidence
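As an illustration of timeline analysis, the sketch below orders events from several sources and lists what happened shortly before an incident. The events, dates, and source names are made-up sample data, and `build_timeline` and `events_before` are hypothetical helpers rather than KaliaOps features.

```python
from datetime import datetime, timedelta

# Made-up sample events (timestamp, source, message), deliberately unordered.
RAW_EVENTS = [
    ("2024-01-15 09:59", "app", "first connection errors logged"),
    ("2024-01-15 09:40", "lb", "traffic starts rising"),
    ("2024-01-15 10:00", "app", "application crash reported"),
    ("2024-01-15 09:58", "db", "connection count reaches configured maximum"),
]


def build_timeline(raw_events):
    """Parse timestamps and return the events in chronological order."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M"), source, message)
        for ts, source, message in raw_events
    ]
    return sorted(parsed, key=lambda event: event[0])


def events_before(timeline, incident_time, window_minutes):
    """Return events in the window immediately preceding the incident."""
    start = incident_time - timedelta(minutes=window_minutes)
    return [event for event in timeline if start <= event[0] < incident_time]


timeline = build_timeline(RAW_EVENTS)
incident_time = datetime(2024, 1, 15, 10, 0)
for when, source, message in events_before(timeline, incident_time, 5):
    print(when.strftime("%H:%M"), source, message)
```

Reading the events in order, rather than in the order they were reported, makes it easier to separate the earliest anomaly from its later symptoms.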
Documenting RCA
In KaliaOps, document:
- Root cause description: Clear explanation of the underlying issue
- Evidence: Logs, screenshots, test results
- Contributing factors: Conditions that enabled the failure
Example
Root cause: Database connection pool exhaustion
Evidence:
- Connection count reached max (100) at 09:58
- First errors logged at 09:59
- Pool configured 5 years ago for lower load
Contributing factors:
- Traffic increased 300% in the last year
- No monitoring on connection pool
Documenting workarounds
A workaround is a temporary solution that restores service without fixing the root cause.
Why document workarounds?
- Faster resolution: Technicians can apply a known fix immediately
- Consistency: Everyone uses the same approach
- Service continuity: Users get service restored quickly
Good workaround documentation
Include:
- Steps: Clear, numbered instructions
- Prerequisites: Required access, tools
- Side effects: Any limitations or impacts
- Duration: How long the workaround remains effective
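A simple completeness check can catch workarounds that are missing one of these four elements before they are published. This is a sketch under assumptions: the dictionary layout, the field names, and the `missing_fields` helper are hypothetical and do not describe how KaliaOps stores workarounds.

```python
# Assumed field names, one per element of good workaround documentation.
REQUIRED_FIELDS = ("steps", "prerequisites", "side_effects", "duration")


def missing_fields(workaround):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not workaround.get(name)]


workaround = {
    "steps": [
        "Connect to server SRV-APP-01",
        "Run: systemctl restart app-service",
        "Verify service is running: systemctl status app-service",
        "Monitor for 5 minutes",
    ],
    "prerequisites": [],  # nothing recorded yet, so the check flags it
    "side_effects": [
        "30 seconds of downtime during restart",
        "Active sessions are terminated",
    ],
    "duration": "~24 hours",
}

print(missing_fields(workaround))  # ['prerequisites']
```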
Example
Workaround: Restart the application service
Steps:
1. Connect to server SRV-APP-01
2. Run: systemctl restart app-service
3. Verify service is running: systemctl status app-service
4. Monitor for 5 minutes
Side effects:
- 30 seconds of downtime during restart
- Active sessions are terminated
Duration: Lasts about 24 hours, until the memory leak recurs
Known Errors (KEDB)
The Known Error Database (KEDB) records problems whose root cause has been identified and for which a workaround is documented.
What is a Known Error?
A Known Error is a problem that:
- Has an identified root cause
- Has a documented workaround
- Is awaiting a permanent solution (or has no fix planned)
Benefits of KEDB
- Faster incident resolution: Technicians search KEDB first
- Knowledge sharing: Expertise is documented
- Onboarding: New team members learn common issues
Marking as Known Error
- Complete root cause analysis
- Document workaround
- Change status to "Known Error"
- The problem is now searchable in KEDB
Using KEDB
When handling an incident:
- Search KEDB for matching symptoms
- If found, apply documented workaround
- Link incident to the known error
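The lookup step can be approximated with a plain keyword overlap between the incident's symptoms and each known error. The sketch below is a rough illustration, not the KaliaOps search: the entry layout, the `KE-…` identifiers, the `min_overlap` threshold, and the second entry's placeholder workaround are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_kedb(kedb, symptoms, min_overlap=2):
    """Return known errors sharing at least min_overlap words, best first."""
    query = tokenize(symptoms)
    scored = []
    for entry in kedb:
        overlap = len(query & tokenize(entry["symptoms"]))
        if overlap >= min_overlap:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


# Hypothetical KEDB content for illustration only.
kedb = [
    {
        "id": "KE-001",
        "symptoms": "application crashes under load memory leak",
        "workaround": "Restart the application service",
    },
    {
        "id": "KE-002",
        "symptoms": "database connection pool exhausted errors",
        "workaround": "(placeholder for this sketch)",
    },
]

matches = search_kedb(kedb, "Application X crashed under heavy load")
print([entry["id"] for entry in matches])  # ['KE-001']
```

Exact word matching misses variants such as "crashed" versus "crashes"; a real search would likely need stemming or full-text indexing.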
Implementing permanent solutions
The permanent solution eliminates the root cause.
Solution documentation
Record:
- Solution description: What was done
- Implementation date: When it was applied
- Change reference: Link to associated change ticket
- Validation: How the fix was verified
Typical solutions
- Code fix deployed
- Configuration change
- Infrastructure upgrade
- Process improvement
- Training provided
Workflow
- Develop/plan the solution
- Create a change ticket for implementation
- Implement the change
- Validate the fix
- Update problem status to "Resolved"
- Document the solution
Linking associated incidents
Link related incidents to the problem.
Why link incidents?
- Scope assessment: How many users were affected?
- Pattern detection: When do incidents occur?
- Communication: Update all affected users at once
- Metrics: Cost/impact of the problem
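Once incidents are linked, scope and pattern questions become simple aggregations. The sketch below assumes each linked incident exposes an opening time and a count of affected users; these field names, the sample values, and the `summarize` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Made-up linked incidents for illustration.
linked_incidents = [
    {"id": "INC-0001", "opened_at": "2024-01-15 10:00", "affected_users": 40},
    {"id": "INC-0002", "opened_at": "2024-01-16 10:05", "affected_users": 55},
    {"id": "INC-0003", "opened_at": "2024-01-18 14:30", "affected_users": 20},
]


def summarize(linked):
    """Aggregate scope and timing information across linked incidents."""
    hours = Counter(
        datetime.strptime(item["opened_at"], "%Y-%m-%d %H:%M").hour
        for item in linked
    )
    return {
        "incident_count": len(linked),
        # Sum of per-incident counts; the same user may be counted twice.
        "affected_user_reports": sum(item["affected_users"] for item in linked),
        "busiest_hour": hours.most_common(1)[0][0] if linked else None,
    }


summary = summarize(linked_incidents)
print(summary)
```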
Creating links
From the problem:
- Go to "Related Incidents" section
- Click "Link incident"
- Search and select incidents
From an incident:
- Open the incident
- In "Related Problem", select the problem
Automatic detection
KaliaOps can suggest links based on:
- Same affected assets
- Similar symptoms (keywords)
- Time proximity
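To make the three signals concrete, the sketch below counts how many of them match for a candidate incident. It does not describe how KaliaOps computes its suggestions: the equal weighting, the 24-hour window, the threshold of two signals, the field names, and the sample identifiers are all assumptions.

```python
from datetime import datetime, timedelta


def suggestion_score(problem, incident, window=timedelta(hours=24)):
    """Count how many of the three signals match (0 to 3)."""
    same_assets = bool(set(problem["assets"]) & set(incident["assets"]))
    similar_symptoms = bool(set(problem["keywords"]) & set(incident["keywords"]))
    close_in_time = abs(incident["opened_at"] - problem["last_incident_at"]) <= window
    return sum([same_assets, similar_symptoms, close_in_time])


# Hypothetical problem and candidate incidents.
problem = {
    "assets": ["SRV-APP-01"],
    "keywords": ["crash", "memory"],
    "last_incident_at": datetime(2024, 1, 15, 10, 0),
}
candidates = [
    {
        "id": "INC-0004",
        "assets": ["SRV-APP-01"],
        "keywords": ["crash"],
        "opened_at": datetime(2024, 1, 16, 9, 30),
    },
    {
        "id": "INC-0005",
        "assets": ["SRV-DB-02"],
        "keywords": ["timeout"],
        "opened_at": datetime(2024, 2, 20, 8, 0),
    },
]

# Suggest a link when at least two signals match (assumed threshold).
suggestions = [c["id"] for c in candidates if suggestion_score(problem, c) >= 2]
print(suggestions)  # ['INC-0004']
```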
Impact on resolution
When the problem is resolved:
- All linked incidents can be updated
- Users receive notification
- Statistics reflect the resolution
Key points
- Clear distinction: incident (symptom) vs problem (root cause)
- Dedicated workflow for investigation and RCA
- Reusable Known Error Database (KEDB)
- Workaround + permanent solution documentation
- Automatic link to recurring incidents