Decoding an encrypted text (using RSA) using the Language distribution.

The goal

The goal of this experiment is to reproduce the test executed in the part 2 of the previous homework, so, decrypt an encrypted text using language distribution, but this time using a more advance encryption scheme instead of Ceasar Cipher, which is a very simple algorithm. This time, we'll use the RSA algorithm, which will be applied on a per-letter basis.

While RSA is one of the most widely used public-key cryptosystems, its strenght relies on working with large integers. Here, we deliberately reduce it to a simplified version using small prime numbers and a limited alphabet (A-Z), since our goal is to visualize its mechanism and limitations, and to see how we can leverage these crafted weaknesses using language distribution.

RSA Fundamentals

RSA is based on the mathematical properties of modular arithmetic and prime factorization. First, we'll talk about how the key generation is made.

Key Generation

First, we have to choose two distinct prime numbers p and q.

Then, we have to compute:

The next step is to select an integer e such that:

and

Now, it's time to compute the modular multiplicative inverse of e module φ(n):

So, the public key is (n,e) and the private key is (n,d).

Encryption and Decryption

Each letter m from the plaintext is converted into a number such that 0≤m<26 and is encrypted as:

Then, the decryption is performed using this formula:

In this experiment, as said above, each letter is treated independently. Therefore, the ciphertext is basically a sequnce of integers representing each encrypted letter. For example, using the small numbers used in the actual test (p = 31, q = 37, e = 5), we'd have:

If m = 2 (c), then:

and

Decoding without the Private Key

Even if the RSA id theoretically secure for large values of n, in this semplified version the plaintext alphabet has only 26 possible symbols. Since the encryption of each letter is deterministic (so, the same input leads to the same output), an attacker who knows the public key (n,e) can easily build a dictionary attack:

This produces a mapping c↦m, allowing a full decryption wuthout factoring n and without the private key.

This vulnerability exists because the massage space is too small (just 26 options), RSA's semantic security relies on random padding and large keys, which are both absent here, and also, the encryption is deterministic, so identical letters produce identical ciphertext, as it was with Ceasar Cipher.

Frequency Analysis (Language Distribution)

If, hypothetically, the public key (n,e) were unknown, the cyphertext could still be approached as a monoalphabetic substitution cipher, similar to Ceasar's. Therefore, we can apply letter- frequency analysis, comparing the observed symbol distribution Oi with the expected distribution Ei of a reference language, using this formula:

The candidate mapping minimizing χ2 corresponds to the most probable decoding. This statistical approach illustrates how language characteristics (such as distribution, for example) can leak informations, even from encrypted text, if the cipher structure is too weak or deterministic.

Algorithm Implementation (the test)

Using the same plaintext of the test with the Ceasar Cipher: "Valentine Day is a holiday that in the United States takes place on February and technically signifies the accomplishments of St Valentine which is a third century Roman saint".

The JavaScript implementation used in the test performs the following steps:

Key generation & encryption

First, it computes n,φ(n),d from chosen p,q,e (as mentioned, the chosen values are p = 31, q = 37, e = 5). Then, it maps each letter to m and performs the encryption using the provided encryption formula above. In the Javascript code, the function performing the encryption is this one: