Mastering sensitive data handling and GDPR-compliant secure data removal with event sourcing

In one of our recent blog posts, we showed the benefits that event sourcing can bring to your project and business. One key takeaway was that we no longer lose any data. However, this can also lead to problems in specific situations, such as when data sets need to be removed. What do we do if we have data that must be deleted? Is this even possible with event sourcing? How can it be done with an immutable event store? In this post, I will explain how you can overcome this problem and how our PHP event sourcing library handles it.

Expected data removal

First, let us begin with the simpler case: we already know that the data will need to be removed in the future. A good example is our users' personal data. Under the EU's GDPR, we must be able to delete all personal data of a user. Let's take a simple example: we have an aggregate that stores the name of a user, and this name can also be changed, which is handled by a dedicated event called NameChanged.

use Patchlevel\EventSourcing\Aggregate\Uuid;
use Patchlevel\EventSourcing\Attribute\Event;

#[Event('name_changed')]
final class NameChanged
{
    public function __construct(
        public Uuid $id,
        public string $name,
    ) {
    }
}

Now, if we save this event, it will be part of our immutable event store, so we cannot update it afterward. However, since we already know that this could cause problems due to GDPR, we can take action to prevent issues. The solution is to avoid saving sensitive data directly in our event store. We will discuss two options here: Crypto Shredding and Tokenization.

Crypto Shredding

With Crypto Shredding, we save the data encrypted in our event store. This way, we don’t have the name in plaintext in our database but instead as an encrypted string. The key used to encrypt the data is saved separately. This could be in the same database in a different table, a separate database, or even on the filesystem. Why are we doing this? With this setup, we can “delete” the data at any time. As soon as the user requests removal from our system, we delete the encryption key for that user. When this happens, the encrypted data in our event store becomes unreadable: et voilà, problem solved.
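To make the idea concrete, here is a minimal, self-contained sketch of the mechanism, assuming PHP's OpenSSL extension and aes-256-gcm. The helper functions and the plain-array key store are illustrative only, not part of our library:

```php
<?php

// Illustration only: one key per data subject, stored outside the event store.
function encryptValue(string $plaintext, string $key): string
{
    $iv = random_bytes(12); // 96-bit nonce, as recommended for GCM
    $cipher = openssl_encrypt($plaintext, 'aes-256-gcm', $key, OPENSSL_RAW_DATA, $iv, $tag);

    return base64_encode($iv . $tag . $cipher); // this string goes into the event store
}

function decryptValue(string $payload, ?string $key, string $fallback): string
{
    if ($key === null) {
        return $fallback; // key was shredded: only the fallback remains
    }

    $raw = base64_decode($payload);
    $plain = openssl_decrypt(
        substr($raw, 28),    // ciphertext
        'aes-256-gcm',
        $key,
        OPENSSL_RAW_DATA,
        substr($raw, 0, 12), // iv
        substr($raw, 12, 16) // auth tag
    );

    return $plain === false ? $fallback : $plain;
}

$keys = ['user-1' => random_bytes(32)]; // the separate key store

$stored = encryptValue('Alice', $keys['user-1']);       // persisted ciphertext
$name = decryptValue($stored, $keys['user-1'], 'anon'); // 'Alice'

unset($keys['user-1']); // removal request: shred the key
$gone = decryptValue($stored, $keys['user-1'] ?? null, 'anon'); // 'anon'
```

Note that even an attacker with a full database dump only sees ciphertext; once the key is gone, the fallback is all that can ever be recovered.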

Our library supports Crypto Shredding and is easy to use. For the implementation, we provide two attributes: PersonalData to mark sensitive properties and DataSubjectId to identify the subject whose encryption key should be used. For encrypted data, there is also the option to provide a fallback value if desired.

use Patchlevel\EventSourcing\Aggregate\Uuid;
use Patchlevel\EventSourcing\Attribute\DataSubjectId;
use Patchlevel\EventSourcing\Attribute\Event;
use Patchlevel\EventSourcing\Attribute\PersonalData;

#[Event('name_changed')]
final class NameChanged
{
    public function __construct(
        #[DataSubjectId]
        public Uuid $id,
        #[PersonalData(fallback: 'anon')]
        public string $name,
    ) {
    }
}

Here, we would get anon for our name property if decryption fails. With this fallback, our application can still function as expected instead of crashing due to missing data. Next, we need to configure the type of encryption to use. If you are using the Symfony bundle, the configuration is a breeze:

patchlevel_event_sourcing:
  cryptography:
    enabled: true
    algorithm: 'aes-256-gcm'

Now, our DoctrineCipherKeyStore will be used to store the encryption keys. Since it's based on doctrine/dbal, a wide range of databases is supported out of the box. The algorithm is used by our OpenSSL-based implementation to encrypt and decrypt the data.

Tokenization

Tokenization is another technique that can be used to prevent saving sensitive data in the event store. With this approach, we don't encrypt the data and save it to the store. Instead, we send the data to a vault and receive a token in return. This token is then passed around in our application and stored in the event store. Whenever we need the real data, we query it from the vault, which returns the data we need.
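The vault round-trip described above can be sketched with a minimal in-memory implementation. The class and method names here are illustrative, not part of our library, and a real vault would of course persist its mappings and control access:

```php
<?php

// Illustration only: our library does not ship a vault.
final class InMemoryVault
{
    /** @var array<string, string> */
    private array $storage = [];

    // Exchange sensitive data for an opaque token; only the token enters the event store.
    public function tokenize(string $sensitiveValue): string
    {
        $token = 'tok_' . bin2hex(random_bytes(16));
        $this->storage[$token] = $sensitiveValue;

        return $token;
    }

    // Resolve a token back to the real data, e.g. for rendering or a data export.
    public function detokenize(string $token): ?string
    {
        return $this->storage[$token] ?? null;
    }

    // GDPR removal: drop the mapping, and every stored token becomes meaningless.
    public function forget(string $token): void
    {
        unset($this->storage[$token]);
    }
}

$vault = new InMemoryVault();
$token = $vault->tokenize('Alice'); // store "tok_..." in the NameChanged event
$vault->detokenize($token);         // returns 'Alice' while the mapping exists
$vault->forget($token);             // user requested removal
$vault->detokenize($token);         // now returns null
```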

One advantage of this solution is that we are now only handling tokens in our domain instead of sensitive data. This reduces the potential issues you could encounter. What do I mean by that? Well, unintentionally leaking sensitive data becomes highly unlikely, since you need to explicitly access the vault for that. Retrieving this data often requires a valid reason, which also increases the auditability of the data.

Unfortunately, we don't provide a solution for tokenization in our library, as we believe that tokenization should be done before the data touches the persistence layer. Therefore, tokenization is out of scope for the library. However, if you disagree, don't hesitate to open an issue or even submit a PR on GitHub!

Unexpected data removal

Now, let's talk about the more challenging part: deleting data we did not anticipate needing to remove. The event store is immutable, and this case is no exception, so manipulating the event store directly is still a no-go. This may seem like an impossible task, but don't worry, there is a solution.

Rewrite History

We cannot update the events in our store, but we can recreate our store. What do I mean by that? I mean reading all of our events and writing them into a new store. Between these two operations, we can perform whatever changes we need. This could involve dropping a complete stream, editing values for placeholders, or applying one of the previously described solutions. The result will be a cleaned-up new event store without the data we needed to remove. We could also get rid of some upcasters in this process if we change the events in the same way our upcasters did.

We are working on a new feature that will simplify this task considerably. With it, you can read the current store and execute a list of translators on the messages. These can rename, update, or filter events, or even create new ones. After that, the new message stream is written into our new store. Once the new stream has been verified, we can switch our application over to the new event store. Be aware that recreating the event store may take some time if the old store has grown large.

$oldStore; // the currently used store containing the events to be removed
$newStore; // the new, still empty store that will replace it

$pipeline = new Pipe(
    $oldStore->load(), // load all events of the old store
    new ChainTranslator([
        new AnonymizeUserInformationTranslator(), // you can update sensitive values of events
        new MapProfileAdressToProfileLocationTranslator(), // or map events to different ones without sensitive data
        new ExcludeEventTranslator([ProfileNameUpdated::class]), // or even drop whole events
        new RecalculatePlayheadTranslator(), // we need to recalculate the playhead if we are dropping or adding new events
    ])
);

$newStore->save(...$pipeline);

The example above shows how you could create a one-time command to test the process and migrate the old store to the new one. I included multiple translators to demonstrate the different ways of handling these situations: anonymizing the data using our crypto-shredding feature to remove plaintext from the store, mapping an event to a different event that excludes the sensitive data, or dropping entire events. In each case, you should thoroughly test the application afterward to prevent failures.

For a more sustainable solution, we recommend using the subscription engine to execute the migration, for several reasons. First, it allows us to easily batch saves to the new store if we use the BatchableSubscriber. Second, we can run the migration in parallel within our application and restart it easily if anything goes wrong. Lastly, schema creation is handled automatically.

#[Subscriber('migrate', RunMode::Once)]
final class MigrateStoreSubscriber implements BatchableSubscriber
{
    private readonly SchemaDirector $schemaDirector;

    /** @var list<Message> */
    private array $messages = [];

    /** @var list<Translator> */
    private readonly array $translators;

    public function __construct(
        private readonly Store $targetStore,
    ) {
        $this->schemaDirector = new DoctrineSchemaDirector(
            $targetStore->connection(),
            new ChainDoctrineSchemaConfigurator([$targetStore]),
        );

        // same translators as above
        $this->translators = [
            new AnonymizeUserInformationTranslator(),
            new MapProfileAdressToProfileLocationTranslator(),
            new ExcludeEventTranslator([ProfileNameUpdated::class]),
            new RecalculatePlayheadTranslator(),
        ];
    }

    #[Subscribe('*')]
    public function handle(Message $message): void
    {
        $this->messages[] = $message;
    }

    public function beginBatch(): void
    {
        $this->messages = [];
    }

    public function commitBatch(): void
    {
        $pipeline = new Pipe($this->messages, $this->translators);
        $this->messages = [];

        $this->targetStore->save(...$pipeline);
    }

    public function rollbackBatch(): void
    {
        $this->messages = [];
    }

    public function forceCommit(): bool
    {
        return count($this->messages) >= 10_000;
    }

    #[Setup]
    public function setup(): void
    {
        $this->schemaDirector->create();
    }

    #[Teardown]
    public function teardown(): void
    {
        $this->schemaDirector->drop();
    }
}

Conclusion

In this post, we explored how to address the challenges of data removal in event-sourced applications, focusing on cases where compliance and privacy laws require specific data to be deletable. With techniques like Crypto Shredding and Tokenization, we can handle personal data securely, either by encrypting sensitive information with removable keys or by storing tokens instead of the actual data. These approaches keep sensitive data manageable and GDPR-compliant by making it effectively deletable even in an immutable store.

When unexpected data deletion is needed, we can Rewrite History to re-create the event store without sensitive data. By reading, modifying, and then writing back events into a new store, developers can meet legal or business requirements without altering the integrity of the event-based architecture. Together, these solutions allow applications based on event sourcing to handle data removal securely and flexibly, ensuring both regulatory compliance and system resilience.
