
Incident response plan for Weblate
==================================

Scope and objectives
--------------------

This IRP covers incidents impacting the confidentiality, integrity, or
availability of Weblate-operated deployments.

.. note::

    This plan is specifically designed for deployments operated by Weblate
    s.r.o. Other deployments need to adapt provider-specific and organizational
    steps to their own environment.

Roles and responsibilities
--------------------------

- **Incident Response Lead (IRL):** Coordinates all phases of the response process.
- **System Administrator:** Executes containment and recovery measures.
- **Security Officer:** Evaluates security impact and regulatory consequences.
- **Data Protection Officer (DPO):** Determines whether personal data (PII) was compromised and manages mandatory GDPR notifications.
- **Communications Lead:** Manages notifications to internal stakeholders and external parties if required.

Communication logistics
-----------------------

- **Internal Communication:**

  - Primary channel is **Signal** for human-to-human coordination.
  - Technical alerts remain outside of Signal to avoid noise.

- **External Communication:**

  - **E-mail** is used to reach customers.
  - Customer contact lists are maintained in several locations to ensure
    access during service outages.

- **Public Disclosure:**

  - If a security vulnerability is discovered, follow :doc:`/security/issues`.

Incident categories and severity
--------------------------------

Incident activation
^^^^^^^^^^^^^^^^^^^

- Declare an incident when an event is confirmed or strongly suspected to
  affect the confidentiality, integrity, or availability of the service beyond
  routine operational noise.
- The **Security Officer** declares the incident, assigns the initial severity,
  and appoints the **Incident Response Lead (IRL)**.
- If the Security Officer is unavailable, any available senior operator may
  declare the incident and hand over ownership as soon as practical.
- Reclassify the incident if the scope or impact changes during investigation.

Incident categories
^^^^^^^^^^^^^^^^^^^

- Category 1 – Unauthorized Access
- Category 2 – Data Integrity Violation
- Category 3 – Service Outage or Degradation
- Category 4 – Misconfiguration or Deployment Error

Severity levels and SLAs
^^^^^^^^^^^^^^^^^^^^^^^^

+----------+------------------------------------------------------+---------------------+-----------------------+
| Severity | Definition                                           | Target Acknowledge  | Target Initial Action |
+==========+======================================================+=====================+=======================+
| Critical | Total outage; Admin compromise; Active data breach;  | < 30 Minutes        | < 4 Hours             |
|          | requires immediate containment.                      |                     |                       |
+----------+------------------------------------------------------+---------------------+-----------------------+
| High     | Core feature failure; PII leak of a single user.     | < 2 Hours           | < 12 Hours            |
+----------+------------------------------------------------------+---------------------+-----------------------+
| Medium   | Performance degradation; Minor security issue.       | 1 Business Day      | 3 Business Days       |
+----------+------------------------------------------------------+---------------------+-----------------------+
| Low      | UI bugs; Staging issues; Non-security errors.        | Best Effort         | Best Effort           |
+----------+------------------------------------------------------+---------------------+-----------------------+
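
The hour-based targets in the table can be turned into concrete deadlines once
an incident is declared. The following is a minimal sketch, not part of
Weblate: the names ``SLA_TARGETS`` and ``sla_deadlines`` are hypothetical, the
example timestamp is made up, and the business-day and best-effort targets are
deliberately left unmodeled.

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    # Upper bounds for acknowledgment and initial action, taken from the
    # severity table. Assumption: Medium and Low are returned as None because
    # business-day and best-effort targets are not modeled in this sketch.
    SLA_TARGETS = {
        "critical": (timedelta(minutes=30), timedelta(hours=4)),
        "high": (timedelta(hours=2), timedelta(hours=12)),
        "medium": (None, None),
        "low": (None, None),
    }


    def sla_deadlines(severity, declared_at):
        """Return (acknowledge_by, initial_action_by) for an incident."""
        ack, action = SLA_TARGETS[severity.lower()]
        return (
            declared_at + ack if ack is not None else None,
            declared_at + action if action is not None else None,
        )


    # Hypothetical declaration time, used only for illustration.
    declared = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
    ack_by, action_by = sla_deadlines("Critical", declared)
    print(ack_by.isoformat(), action_by.isoformat())

Using timezone-aware timestamps keeps the deadlines unambiguous when
responders work from different locations.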

Incident response lifecycle
---------------------------

Preparation
^^^^^^^^^^^

- Ensure daily backups of the PostgreSQL database and the data directory using Weblate's built-in backup with rotation, see :ref:`backup`.
- Ensure Weblate uses a properly configured reverse proxy (e.g., NGINX) with HTTPS (TLS 1.2+).
- Enable 2FA for all admin-level accounts.
- Keep the Weblate instance and its dependencies (Python, Django, Celery, database, etc.) up to date.
- Integrate with SIEM systems using the GELF protocol for audit and application log forwarding.

Identification
^^^^^^^^^^^^^^

- Monitor system and application logs (``journalctl``, reverse proxy logs, Weblate application and audit logs).
- Analyze login events, webhook executions, and push/pull failures.
- Configure alerting (via Prometheus, Zabbix, or SIEM) for multiple login failures, unexpected restarts, or irregular VCS actions.
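
As an illustration of the "multiple login failures" alert condition, the
following sketch flags users with a burst of failed logins inside a sliding
time window. It is an assumption-laden example rather than a Weblate or
monitoring-system API: the function name ``find_login_bursts``, the
``(timestamp, username)`` event shape, and the default threshold and window
are all hypothetical and would need to match the actual log source.

.. code-block:: python

    from collections import defaultdict, deque
    from datetime import datetime, timedelta


    def find_login_bursts(events, threshold=5, window=timedelta(minutes=10)):
        """Return users with at least ``threshold`` failures inside ``window``.

        ``events`` is an iterable of (timestamp, username) tuples describing
        failed logins, sorted by timestamp.
        """
        recent = defaultdict(deque)
        flagged = set()
        for ts, user in events:
            failures = recent[user]
            failures.append(ts)
            # Drop failures that have fallen out of the sliding window.
            while ts - failures[0] > window:
                failures.popleft()
            if len(failures) >= threshold:
                flagged.add(user)
        return flagged


    # Hypothetical sample data: five failures in five minutes for "alice",
    # five failures spread over several hours for "bob".
    start = datetime(2024, 1, 15, 9, 0)
    events = sorted(
        [(start + timedelta(minutes=i), "alice") for i in range(5)]
        + [(start + timedelta(hours=i), "bob") for i in range(5)]
    )
    print(find_login_bursts(events))

In practice the same rule is usually expressed in the alerting system itself;
the sketch only shows the logic such a rule would encode.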

Containment
^^^^^^^^^^^

- Create an incident record with a case ID and record timeline updates as
  actions are taken.
- Coordinate human response in **Signal** and keep technical alerting in the
  existing monitoring systems.
- For Category 1 or 2 incidents, create a manual **Hetzner Cloud Snapshot**
  before taking disruptive action when it is safe to do so.

  - Name format: ``IRP-[CaseID]-[YYYYMMDD]-Evidence``.
  - These are separate from standard rotating backups and must be preserved
    for analysis.

- Isolate the affected host or service as needed (for example by firewall rules
  or service isolation).
- Disable external integrations (Git/webhooks) if they are part of the attack
  vector.
- Suspend affected user accounts immediately.
- Revoke or rotate affected administrative, API, VCS, and webhook credentials
  as applicable.
- Preserve relevant evidence, including system logs, reverse proxy logs,
  Weblate application and audit logs, affected configuration state, and the
  list of impacted credentials or integrations.
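
The snapshot name format above can be generated consistently with a small
helper. This is a sketch only: ``evidence_snapshot_name`` is a hypothetical
function, and the restriction of case IDs to letters and digits is an
assumption made here so that the hyphen-separated fields stay unambiguous.

.. code-block:: python

    import re
    from datetime import date


    def evidence_snapshot_name(case_id, day):
        """Build a name in the IRP-[CaseID]-[YYYYMMDD]-Evidence format."""
        # Assumption: case IDs contain only letters and digits.
        if not re.fullmatch(r"[A-Za-z0-9]+", case_id):
            raise ValueError(f"unsupported case ID: {case_id!r}")
        return f"IRP-{case_id}-{day:%Y%m%d}-Evidence"


    # Hypothetical case ID and date, for illustration only.
    print(evidence_snapshot_name("42", date(2024, 1, 15)))
    # IRP-42-20240115-Evidence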

Eradication
^^^^^^^^^^^

- Remove any unauthorized code or data.
- Patch known vulnerabilities by upgrading Weblate or server components.
- Validate the integrity of binaries using SHA-256 checksums and the integrity of repositories by reviewing Git logs.
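
A SHA-256 checksum comparison can be done with the Python standard library.
The sketch below assumes a trusted reference digest is available from a
separate source; the function names ``sha256_of`` and ``verify_sha256`` are
hypothetical.

.. code-block:: python

    import hashlib
    import hmac


    def sha256_of(path, chunk_size=1024 * 1024):
        """Return the hex SHA-256 digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()


    def verify_sha256(path, expected_hex):
        """Return True if the file matches the expected hex digest."""
        # The reference digest is normalized before a constant-time comparison.
        expected = expected_hex.strip().lower()
        return hmac.compare_digest(sha256_of(path), expected)

A mismatch only shows that the file differs from the reference; it does not
by itself explain what changed or why.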

Recovery
^^^^^^^^

- Restore affected services or data from the latest known-good Weblate backups.
- **PII Assessment:** The DPO determines whether the breach requires a GDPR notification within 72 hours.
- Reintroduce services in a phased approach.
- Confirm the root cause has been removed or a compensating control is in
  place before restoring normal traffic.
- Rotate affected credentials and verify integrity of the restored system,
  repositories, and configuration.
- The Security Officer and IRL approve returning to normal operations.
- Monitor logs and system behavior continuously for at least 72 hours post-recovery.

Post-incident review
^^^^^^^^^^^^^^^^^^^^

- **Timeline:** Hold a review meeting within **5 business days** of incident closure.
- Compile a full incident timeline and actions taken.
- Perform Root Cause Analysis (RCA) and document it within **10 business days**.
- Update security policies and IRP documentation based on findings.
- Review the effectiveness of detection and containment mechanisms.
- Verify whether escalation, alerting, and external communication followed
  :doc:`/security/issues` as expected.
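
The review and RCA deadlines above are counted in business days. The
following sketch shows one way to compute them, assuming business days are
Monday to Friday and ignoring public holidays; ``add_business_days`` is a
hypothetical helper and the closure date is made up.

.. code-block:: python

    from datetime import date, timedelta


    def add_business_days(start, days):
        """Return the date ``days`` business days after ``start``."""
        current = start
        remaining = days
        while remaining > 0:
            current += timedelta(days=1)
            # weekday() returns 0-4 for Monday to Friday.
            if current.weekday() < 5:
                remaining -= 1
        return current


    # Hypothetical incident closure date (a Monday).
    closed = date(2024, 1, 15)
    print(add_business_days(closed, 5))   # 2024-01-22, review meeting due
    print(add_business_days(closed, 10))  # 2024-01-29, RCA due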