Anyone who has managed multiple code bases, such as websites or microservices, has had to deal with configuration files. One thing that is often irritating is shared configuration files that must be synchronized across projects. For me, this is a particularly difficult challenge, because in addition I work across four or five machines.
Some of you may be wondering:
Wait, you use the same config for multiple projects?
Hell yeah I do! I keep a massive list of spammers sync’d across all of my websites (just one example). Other examples include NSFW filters, spam lists, blacklists, whitelists, etc.
Obviously, there has to be an easy solution to this, and to some extent there is:
Keep all the shared configuration files and lists centralized.
Although that sounds easy, there are three issues:
- Each project has its own collaborators, requiring access control management for any centralized documents
- Many projects are designed to be self-contained, as some are deployed without access to the internet
- Local edits for testing aren’t sync’d with the main repository
Which leads me to my solution: writing my own file-synchronization script.
At first, this may sound like a simple problem: take the most updated file and use that!
But what happens if multiple edits occurred across multiple files between syncs?
In that case, simply taking the most recent list won’t work.
Synchronizing lists requires a bit more finesse to handle both adding and removing items based on edits. Further, how do we want to handle multiple edits between synchronizations? For me, the choice was:
Utilize the latest edit, regardless of addition or removal of an item.
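This policy can be illustrated with a small hypothetical sketch. The `edits` log, item names, and timestamps below are invented for illustration; the actual script works from file modification times rather than a per-edit log:

```python
# Hypothetical edit log: (item, action, timestamp).  Under the
# "latest edit wins" policy, whichever edit happened most recently
# decides an item's fate, regardless of addition or removal.
edits = [
    ("XXX", "add", 50),       # XXX added early on
    ("XXX", "remove", 100),   # ...then removed later
    ("YYY", "add", 120),      # YYY added last
]

latest = {}  # item -> (action, timestamp) of its most recent edit
for item, action, ts in edits:
    if item not in latest or ts > latest[item][1]:
        latest[item] = (action, ts)

# Keep only the items whose most recent edit was an addition.
kept = sorted(item for item, (action, _) in latest.items() if action == "add")
# kept == ["YYY"]: XXX's latest edit was a removal, so it is dropped
```

This is the ideal behavior; as discussed later in the post, the script can only approximate it, because it has per-file timestamps rather than a true per-edit log.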
Once we’ve decided we’re going to just use the latest edit regardless of the consequences, this leaves the following workflow for the synchronization:
- Get list containing all files to synchronize
- Open the files, capture the last time each file was edited
- Load all items from the lists in a map
- Map each item to the last time it was edited
- Mark each item as included in the last edit or not
- Rewrite the lists, with the updated entries
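The workflow above can be sketched in a few lines of Python. This is not the actual script from the repository; it is a minimal sketch assuming one item per line, with the contents of the most recently modified file treated as the “last edit”:

```python
from pathlib import Path

def sync_lists(paths):
    """Merge one-item-per-line list files and rewrite each with the result."""
    paths = [Path(p) for p in paths]
    # Capture the last time each file was edited.
    mtimes = {p: p.stat().st_mtime for p in paths}
    newest = max(paths, key=lambda p: mtimes[p])
    # Map each item to the newest mtime among the files containing it.
    last_seen = {}
    for p in paths:
        for item in p.read_text().splitlines():
            if item:
                last_seen[item] = max(last_seen.get(item, 0), mtimes[p])
    # An item survives only if it was included in the last edit,
    # i.e. it is present in the most recently modified file.
    newest_items = set(newest.read_text().splitlines())
    merged = [item for item in last_seen if item in newest_items]
    # Rewrite every list with the merged entries.
    for p in paths:
        p.write_text("\n".join(merged) + "\n")
```

Each item is touched a constant number of times, matching the O(n) claim. Note that treating the most recently modified file as the arbiter is exactly the assumption that leads to the removal problem discussed below.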
This generally works well, and it is O(n) in both run time and space.
However, it does leave a gaping hole through which some data may be lost during synchronization.
Take the following example:
File 1: Removed term XXX
File 2: Added term YYY
Apply Sync Function
File 1 & File 2 – YYY Added (removal of XXX is omitted)
Because File 2 was edited last, and each item is assigned the timestamp of the most recent file containing it, XXX still appears to be part of the latest edit (File 2 never removed it), so the removal from File 1 is silently lost. To mitigate this problem, my system runs the script every time I edit a file that needs to be synchronized. This is accomplished via my personal Emacs mode. A periodic cron job would also work in some cases.
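The lost removal can be reproduced with a small standalone example. The in-memory tuples below stand in for the two files, and the merge rule (keep whatever is present in the newest file) is my assumed reading of the behavior, not code from the repository:

```python
# Two in-memory "files": (mtime, set of items).
file1 = (100, {"AAA"})                 # XXX was just removed here
file2 = (200, {"AAA", "XXX", "YYY"})   # newest edit: YYY added, XXX untouched

# The merge keeps every item present in the most recently edited file.
_, newest_items = max([file1, file2], key=lambda f: f[0])
merged = sorted(newest_items)
# merged == ["AAA", "XXX", "YYY"]: XXX survives, because the merge cannot
# distinguish "removed from file 1" from "never added to file 1".
```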
Unfortunately, this is the best solution I currently have for the “removal problem”. The main issue is that there is no record of what was edited in each file. One possible future mitigation is using a version control system such as git to track changes per line, then syncing across them. However, that was out of scope for what I was willing to do. The important part was identifying this potential risk, and for my use case the issue is mostly mitigated already.
Even though the synchronization solution and code may not be pretty, and it is just a script, it gets the job done. The script synchronizes the lists across my projects, and even across computers via NFS.
The full code is in the GitHub repository.
However, the code below should provide insight into how the script works: