# Fast Implementation of LSH With SIMD 

DONGYEONG KIM ${ }^{1}$, YOUNGHOON JUNG ${ }^{2}$, YOUNGJIN JU ${ }^{1}$, AND JUNGHWAN SONG ${ }^{1}$<br>${ }^{1}$ Department of Mathematics, Research Institute for Natural Sciences, Hanyang University, Seoul 04763, South Korea<br>${ }^{2}$ The Affiliated Institute of ETRI, Daejeon 341293, South Korea<br>Corresponding author: Junghwan Song (camp123@hanyang.ac.kr)

This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) funded by the Korean Government (MSIT) under Grant 2017-0-00267.


#### Abstract

In this paper, we propose a method of efficient software implementation for the cryptographic hash function LSH with single instruction multiple data (SIMD). The method is based on word-wise permutations of LSH. Using the modified functions Step ${ }_{j}^{\prime}=P \circ$ Step $_{j} \circ P^{-1}$ and MsgExp ${ }^{\prime}$ instead of the original step function $S t e p_{j}$ and message expansion function $M s g E x p$, where $P$ is a permutation and $P^{-1}$ is the inverse permutation of $P$, we show that the number of the SIMD instructions for implementing LSH is reduced. For efficient implementation of LSH in other environments (e.g., MIMD), various types of word permutations are listed.


INDEX TERMS Software implementation, word-wise permutation, SIMD, hash function, LSH, ARX.

## I. INTRODUCTION

Cryptographic hash functions are necessary for constructing a system of information security. Generally, they are used for authentication, providing both data integrity and entity integrity [1]-[4]. A cryptographic hash function is a function that maps an input to a fixed output satisfying the following cryptographic resistance properties [5].

1. Preimage resistance: for essentially all pre-specified outputs, it is computationally infeasible to find an input that hashes to that output.
2. 2nd preimage resistance: it is computationally infeasible to find another input that hashes the same output as a specified input.
3. Collision resistance: it is computationally infeasible to find any two distinct inputs that hash to the same output.

From the output of a cryptographic hash function, it should be computationally difficult to find the corresponding input. Additionally, for a given input, it should be difficult to find another input that hashes to the same output. Because of these aspects, a cryptographic hash function is used in various fields, such as a message authentication code (MAC), key derivation function (KDF), and a pseudo-random number generator. LSH [6] is a cryptographic hash function that was designed by NSRI [7] (National Security Research Institute). SIMD [8] is a class of parallel computing. SIMD is an instruction set that performs the same operations on multiple data simultaneously. A core element of SIMD is a register. SIMD has registers in with various lengths such as

[^0]128,256 , and 512 bits. For a 128 -bit resister, each resister has four 32 -bit or two 64 -bit sections of data. Thus, one operation for an 128-bit resister is equivalent to four 32-bit operations or two 64-bit operations. When implementing cryptographic algorithms, SIMD is used for various purposes such as resistance to side-channel attacks [9], [10] and efficient implementations [11]-[16]. There has not been any research on overcoming weakness using SIMD with a cipher For these reasons, BLAKE2 [17] and LSH are cryptographic algorithms having advantage of implementation with SIMD. BLAKE2, SIMON/SPECK [18], and LEA [19] were implemented using SIMD in [11]-[13]. However, to the best out knowledge, our research is the first to have an efficient implementation via SIMD by changing the representation of a cryptographic algorithm. In this paper, we show how to implement a cryptographic hash function LSH efficiently with SIMD by representing the LSH using $P$ and $P^{-1}$, where $P$ is a permutation and $P^{-1}$ is the inverse of $P$. Note that complexity is considered as the number of SIMD instructions and their latency, not the number of XOR and modular additions. This metric is necessary for finding conditions that reduce the number of SIMD instructions and use SIMD instructions with low latency when an algorithm is implemented.

For example, a case in which a permutation in a register composed of four 32-bit words is implemented. If a word-wise permutation is operated in a register, then only a single SIMD instruction is needed, "_mm_shuffle_epi32". However, if the word-wise permutation is the identity, then there is no need for an SIMD instruction. In another example, assuming that two 64-bit words compose a register, if a word-wise permutation is operated


FIGURE 1. Wide-pipe merkle-damgard construction.
in two mixed registers, then three SIMD instructions are needed as follows: "_mm_unpacklo_epi64", "_mm_unpackhi_epi64", and "_mm_shuffle_epi32". The instruction "_mm_uppacklo_epi64" extracts the left 64-bit word in the two registers, and "_mm_unpackhi_epi64" extracts the right 64-bit word in the two registers. Additionally, the instruction "_mm_shuffle_epi32" is used to permute in 32-bit units.

If each of the two 64-bit words still remain in their same registers, then "_mm_unpacklo_epi64" and "_mm_unpackhi_epi64" are not needed for a word-wise permutation. In this paper, conditions to reduce the number of SIMD instructions and implement SIMD instructions with lower latency are found.
This paper is organized as follows. Section II shows the specifications of LSH. Section III provides an efficient implementation method via SIMD for LSH. We demonstrate the method and permutation conditions needed for efficient implementation. All permutations are categorized considering those conditions. In Section IV, we show an optimal permutation for the best performance with LSH. Concluding remarks on the implementation performance with an optimal permutation are given in Section V.

## II. SPECIFICATION OF LSH

In 2014, a hash function LSH was published by D. Kim et al. at International Conference on Information Security and Cryptology, designed specifically to enhance software efficiency [6]. LSH was designed using wide-pipe Merkle Damgard construction (wide-pipe MD construction) [20]. The design of the compression function for LSH is based on ARX (Addition ( $\boxplus$ ), Rotation ( $\ll$ ), and XOR $(\oplus)$ ) [21]. The following describes the wide-pipe MD construction and compression function of LSH.

As shown in Fig. 1, the length of an internal state in a widepipe MD construction is $2 n$ bits, which is twice the length of an output with $n$ bits. Let $w$ be the number of bits in a word. LSH-8w-n can represents any of the following LSHs: LSH-256-224, LSH-256-256, LSH-512-224, LSH-512-256, LSH-512-384, and LSH-512-512. Each has a different initializing value $I V$. The generating method of $I V$ is given in [6].

The structure of the compression function $f$ in an LSH is ARX-based. The number of bits for the input of $f$ is $48 w$, and that of the output is $16 w$. The compression function $f$ transforms 16 -word and 32 -word messages into 16 -word messages. Each $f$ contains the MsgExp function, Step $_{j}$ (for $j=1, \cdots, N_{s}$ ), as well as the MsgAdd function. The number of steps $N_{S}$ is selected as follows:
$N_{s}=26$ if the number of bits in $w$ is 32 ,
$N_{s}=28$ if the number of bits in $w$ is 64 .

## A. NOTATIONS

$W^{t}$ : Set of $t$-word arrays
$X \oplus Y$ : Bit-wise exclusive-or of $X$ and $Y$.
$X \boxplus Y: X+Y \bmod 2^{w}$.
$X^{\lll r}: r$ bits left rotation of word $X$.
$M^{(i)}:=\left(M^{(i)}[0], \cdots, M^{(i)}[31]\right)$ : The $i$-th 32 -word array message block.
$M_{j}^{(i)}:=\left(M_{j}^{(i)}[0], \cdots, M_{j}^{(i)}[15]\right)$ : The $j$-th 16 -word array submessage generated from the $i$-th message $M^{(i)}$.
$S C_{j}:=\left(S C_{j}[0], \cdots, S C_{j}[7]\right):$ The $j$-th 8 -word array step constant.
$T:=(T[0], \cdots, T[15])$ : The 16-word array temporary variable used in a step function.
$P$ : A word-wise permutation on 16 words

$$
P(T):=P(T[0], \cdots, P[15]):=(T[P(0)], \cdots, T[P(15)])
$$

$P(T[i]):=T[P(i)]$
$P_{i}$ : A word-wise permutation on 4 words.
Notice that we define a permutation $P$ that has the same format for the input and output. If the input is an index $i$, then $P(i)$ is also an index. Similarly, if the input is a word $T[i]$, then the output is a word $P(T[i])=T[P(i)]$.

## B. MsgExp FUNCTION

The first two sub-messages $M_{0}^{(i)}$ and $M_{1}^{(i)}$ are defined as the first 16 words and the next 16 words of $M^{(i)}$, respectively. Then the next sub-messages $\left\{M_{j}^{(i)}\right\}_{j=2}^{N_{s}}$ are calculated by the following:

$$
\begin{align*}
\text { For } j & =2,3, \cdots, N_{s}, \\
M_{j}^{(i)}[l] & \leftarrow M_{j-1}^{(i)}[l] \boxplus M_{j-2}^{(i)}[\tau(l)], \quad 0 \leq l<16 \tag{1}
\end{align*}
$$

Here, the permutation $\tau$ is defined by Table 1 .

TABLE 1. The permutation $\boldsymbol{\tau}$ in MsgExp.

| $l$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\tau(l)$ | 3 | 2 | 0 | 1 | 7 | 4 | 5 | 6 | 11 | 10 | 8 | 9 | 15 | 12 | 13 | 14 |

## C. Step; FUNCTION

Step $_{j}$ is used $N_{s}$ times repeatedly in the compression function $f$. Step $j_{j}$ is composed of three functions $M s g A d d$, Mix $j_{j}$, and $\sigma$ as

$$
\begin{equation*}
\text { Step }_{j}=\sigma \circ \text { Mix }_{j} \circ \text { MsgAdd } \tag{2}
\end{equation*}
$$

MsgAdd:

$$
\begin{align*}
M \operatorname{sgAdd}: W^{16} \times W^{16} \rightarrow & W^{16} \\
\operatorname{MsgAdd}\left(T, M_{j}^{(i)}\right)= & \left(T[0] \boxplus M_{j}^{(i)}[0]\right. \\
& \left.\cdots, T[15] \boxplus M_{j}^{(i)}[15]\right) \tag{3}
\end{align*}
$$

Mix $_{j}$ :

$$
\begin{align*}
\operatorname{Mix}_{j}: W^{16} & \rightarrow W^{16} \\
\operatorname{Mix}_{j}(T) & =\left(T^{\prime}[0], \cdots, T^{\prime}[15]\right), \\
\text { where }\left(T^{\prime}[l], T^{\prime}[l+8]\right) & \leftarrow \operatorname{Mix}_{j, l}(T[l], T[l+8]), \\
l & =0, \cdots, 7 \tag{4}
\end{align*}
$$

TABLE 2. The number of bits in rotation: $\alpha_{j}, \beta_{j}, \gamma_{i}$.

| $w$ | $j$ | $\alpha_{j}$ | $\beta_{j}$ | $\gamma_{0}$ | $\gamma_{1}$ | $\gamma_{2}$ | $\gamma_{3}$ | $\gamma_{4}$ | $\gamma_{5}$ | $\gamma_{6}$ | $\gamma_{7}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32 | even <br> odd | 29 <br> 5 | 1 <br> 17 | 0 | 8 | 16 | 24 | 24 | 16 | 8 | 0 |
|  | even <br> odd | 23 <br> 7 | 59 <br> 3 | 0 | 16 | 32 | 48 | 8 | 24 | 40 | 56 |

TABLE 3. Permutation $\sigma$.

| $l$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\sigma(l)$ | 6 | 4 | 5 | 7 | 12 | 15 | 14 | 13 | 2 | 0 | 1 | 3 | 8 | 11 | 10 | 9 |

$M i x_{j, l}: W^{2} \rightarrow W^{2}$ for inputs $X$ and $Y$ is defined by (5). Here, the bit rotational amounts $\alpha_{j}, \beta_{j}, \gamma_{j}$ used in Mix $x_{j, l}$ are shown in Table 2.

$$
\begin{align*}
& X \leftarrow X \boxplus Y, \\
& X \leftarrow X^{\lll \alpha_{j}}, \\
& X \leftarrow X \oplus S C_{j}[l], \\
& Y \leftarrow X \boxplus Y, \\
& Y \leftarrow Y^{\lll \beta_{j}}, \\
& X \leftarrow X \boxplus Y, \\
& Y \leftarrow Y^{\ll \gamma_{l}} \tag{5}
\end{align*}
$$

$\sigma$ : The word permutation function $\sigma$ permutes 16 words. The permutation $\sigma$ is defined in Table 3, and the permutation of internal states are as follows:

$$
\begin{equation*}
(T[0], \cdots, T[15]) \leftarrow(T[\sigma(0)], \cdots, T[\sigma(15)]) . \tag{6}
\end{equation*}
$$

## III. A FAST IMPLEMENTATION METHOD

In this section, we demonstrate the method in [22] of how to implement LSH efficiently using a word permutation and its inverse with repeated step functions.

## A. OVERVIEW OF WORD-WISE PERMUTATION

In this subsection, we investigate the relation between a permutation $P$ and the compression function $L S H$ of LSH. We also show how to construct $L S H^{\prime}$, which is a modified representation of $L S H$ using the permutation $P$. To represent $L S H^{\prime}$, we investigate the relation between the permutation $P$ and operations in the step function of $L S H$.

For a permutation $P$ and its inverse $P^{-1}$, we consider the function $L S H^{\prime}$ as in Fig. 2. Note that $L S H^{\prime}$ takes a message permuted by $P$ as an input and outputs the permuted hash value of $L S H$ by $P$. Thus, the output of $L S H$ is equal to the permuted output of $L S H^{\prime}$ by $P^{-1}$.

We define $L S H^{\prime}$ as the following:

$$
\begin{equation*}
L S H=P^{-1} \circ L S H^{\prime} \circ P \tag{7}
\end{equation*}
$$

Our goal is to implement $L S H$ efficiently using fewer SIMD instructions with low latency. The followings are basic criteria for $P$ for $L S H$.

Criterion 1: $P$ is a word-wise permutation.
If $P$ is not a word-wise permutation, then it is necessary to consider implementing modular addition, which increases


FIGURE 2. LSH and LSH' $^{\prime}$.
the number of SIMD instructions. Thus, we do not need to consider permutations other than word-wise permutations.

Criterion 2: $|P(i+8)-P(i)|=8 \quad$ for $i=0, \cdots, 7$. Let the $i$-word array $M$ be a message represented by $M=M[0]\|M[1]\| \cdots \| M[i-1]$. Then $P(M)=$ $M[P(0)]\|M[P(1)]\| \cdots \| M[P(i-1)]$. In (5), two words $T[i]$ and $T[i+8]$ are input to the $M i x_{j, l}$ function as $X$ and $Y$. For a register with SIMD, addition and XOR between two registers are operated in parallel between the corresponding words in order. Thus, if Criterion 2 is not satisfied, then we cannot apply addition or XOR between the two registers before reordering the words in the registers. Thus, we limit the range of permutation $P$ by Criterion 2 since it needs many SIMD instructions for loading and storing. With the above two criteria, we have the following Theorem 1.

Theorem 1: If Criterion 1 and 2 hold, then there are relations between the internal state consisting of the 16 words $T[0], \cdots, T[15]$ in LSH and a word-wise permutation $P$ as follows:

1) Modular addition $(\boxplus), \operatorname{XOR}(\oplus)$, and the permutation $P$ are commutative with respect to each other.

$$
\begin{align*}
& T[P(i)] \boxplus T[P(i+8)]=P(T[i] \boxplus T[i+8]), \\
& T[P(i)] \oplus T[P(i+8)]=P(T[i] \oplus T[i+8]) \tag{8}
\end{align*}
$$

2) Left rotation by $\alpha_{j}, \beta_{j}$ and a permutation $P$ are commutative with respect to each other.

$$
\begin{align*}
& T[P(i)]^{\lll \alpha_{j}}=P\left(T[i]^{\lll \alpha_{j}}\right), \\
& T[P(i)]^{\lll \beta_{j}}=P\left(T[i]^{\lll \beta_{j}}\right) \tag{9}
\end{align*}
$$

3) The following relation holds for the left rotation by $\gamma_{j}$ and permutation $P$.

$$
\begin{equation*}
T[P(i)]^{\gamma_{l}}=P\left(T[i]^{\lll \gamma_{P}-1(l)}\right) \tag{10}
\end{equation*}
$$

Proof 1):
In $L S H$, there are steps: $T[i] \leftarrow T[i] \boxplus T[i+8]$ and $T[i+8] \leftarrow$ $T[i] \boxplus T[i+8]$. By Criterion 2, since $|P(i+8)-P(i)|=8$, we replace the above two steps with $T[P(i)] \leftarrow T[P(i)] \boxplus$ $T[P(i+8)]$ and $T[P(i+8)] \leftarrow T[P(i)] \boxplus T[P(i+8)]$.

Then, for $T[P(i)] \boxplus T[P(i+8)]$, the following equality holds:

$$
\begin{align*}
T[P(i)] & \boxplus[P(i+8)] \\
& =P\left(T\left[P\left(P^{-1}(i)\right)\right]\right) \boxplus P\left(T\left[P\left(P^{-1}(i+8)\right)\right]\right) \\
& =P(T[i]) \boxplus P(T[i+8]) \\
& =P(T[i] \boxplus T[i+8]) \tag{11}
\end{align*}
$$

For $\oplus$, the equation can be proved similarly.
Proof 2):

$$
\begin{align*}
T[P(i)]^{\lll \alpha_{j}} & =P\left(T\left[P\left(P^{-1}(i)\right)\right]^{\lll \alpha_{j}}\right) \\
& =P\left(T[i]^{\lll \alpha_{j}}\right) \tag{12}
\end{align*}
$$

Proof 3):
For $\gamma_{l}, l=i-8$, thus when $i$ is permuted to $P(i), l$ is also permuted to $P(l)$. Then the following holds:

$$
\begin{align*}
T[P(i)]^{\lll \gamma_{l}} & =P\left(T\left[P\left(P^{-1}(i)\right)\right]^{\lll \gamma_{P-1}(l)}\right) \\
& =P\left(T[i]^{\lll \gamma_{P-1}(l)}\right) \tag{13}
\end{align*}
$$

By Theorem 1, a word-wise permutation $P$ commutes with the operations of $L S H$. That is, considering the rotation by $\gamma$ and letting $\gamma^{\prime}:=\left(\gamma_{P^{-1}(0)}, \ldots, \gamma_{P^{-1}(7)}\right)$, permutation by $P$ after a left rotation by $\gamma^{\prime}$ is equal to left rotation by $\gamma$ after a permutation by $P$. Let the compression function $f^{\prime}$ used in $L S H^{\prime}$ be defined as follows:

$$
\begin{equation*}
f^{\prime}=P \circ f \circ P^{-1} \tag{14}
\end{equation*}
$$

Then by substituting (14) into (7), we have

$$
\begin{equation*}
L S H^{\prime}=f^{\prime} \circ \cdots \circ f^{\prime} \tag{15}
\end{equation*}
$$

The compression function $f$ consists of the message expansion MsgExp and step function Step $_{j}$. Similarly to (14), MsgExp ${ }^{\prime}$ and Step ${ }^{\prime}$ are represented as follows:

$$
\begin{align*}
M \operatorname{sgExp} & =P \circ M \operatorname{sgExp} \circ P^{-1}  \tag{16}\\
\text { Step }_{j}^{\prime} & =P \circ \text { Step }_{j} \circ P^{-1} \\
& =P \circ \sigma \circ M i x_{j} \circ M \operatorname{sgAdd} \circ P^{-1} \tag{17}
\end{align*}
$$

Similarly to (15),

$$
\begin{align*}
& \text { MsgExp } \circ \cdots \circ \text { MsgExp } \\
& \quad=\left(P^{-1} \circ M \operatorname{sgExp^{\prime }} \circ P\right) \circ \cdots \circ\left(P^{-1} \circ M \operatorname{sg} E^{\prime} p^{\prime} \circ P\right) \\
& \quad=P^{-1} \circ M \operatorname{sg} E^{\prime} \circ p^{\prime} \circ \cdots \circ M \operatorname{sg} E^{\prime} p^{\prime} \circ P \tag{18}
\end{align*}
$$

Step $_{0} \circ \cdots \circ$ Step $_{N_{s}-1}$
$=\left(P^{-1} \circ\right.$ Step $\left._{0}^{\prime} \circ P\right) \circ \cdots \circ\left(P^{-1} \circ\right.$ Step $\left._{N_{s}-1}^{\prime} \circ P\right)$
$=P^{-1} \circ$ Step $_{0}^{\prime} \circ \cdots \circ$ Step $_{N_{s}-1}^{\prime} \circ P$.
As in (15), (18), and (19), all permutations $P$ and $P^{-1}$ are canceled out to the identity permutation except for the first $P^{-1}$ and the last $P$. The following provides the details of constructing MsgExp ${ }^{\prime}$ and Step $_{j}^{\prime}$. Using (1), the function
$M s g E x p p^{\prime}$ in the $i$-th compression function with inputs $M_{0}^{(i)}$ and $M_{1}^{(i)}$ in (16) is represented as follows:

$$
\begin{align*}
& M \operatorname{sg} \operatorname{Exp}^{\prime}\left(M_{j-1}^{(i)}, M_{j-2}^{(i)}\right) \\
&= P \circ \operatorname{Msg}^{\operatorname{Exp}} \circ P^{-1}\left(M_{j-1}^{(i)}, M_{j-2}^{(i)}\right) \\
&= P\left(M _ { j - 1 } ^ { ( i ) } \left[P^{-1}(0) \boxplus M_{j-2}^{(i)}\left[\tau\left(P^{-1}(0)\right)\right],\right.\right. \\
&\left.\left.\cdots, M_{j-1}^{(i)}\left[P^{-1}(15)\right] \boxplus M_{j-2}^{(i)}\left[\tau\left(P^{-1}(15)\right)\right]\right]\right) \\
&=\left(M_{j-1}^{(i)}\left[P \circ P^{-1}(0)\right] \boxplus M_{j-2}^{(i)}\left[P \circ \tau\left(P^{-1}(0)\right)\right],\right. \\
&\left.\cdots, M_{j-1}^{(i)}\left[P \circ P^{-1}(15)\right] \boxplus M_{j-2}^{(i)}\left[P \circ \tau\left(P^{-1}(15)\right)\right]\right) \\
&=\left(M_{j-1}^{(i)}[0] \boxplus M_{j-2}^{(i)}\left[P \circ \tau\left(P^{-1}(0)\right)\right],\right. \\
&\left.\cdots, M_{j-1}^{(i)}[15] \boxplus M_{j-2}^{(i)}\left[P \circ \tau\left(P^{-1}(15)\right)\right]\right) \\
& \quad \quad \text { for } \quad j=2, \ldots, N_{s} . \tag{20}
\end{align*}
$$

Let $\tau^{\prime}$ be

$$
\begin{equation*}
\tau^{\prime}=P \circ \tau \circ P^{-1} \tag{21}
\end{equation*}
$$

The function MsgExp ${ }^{\prime}$ is represented with $\tau^{\prime}$ as follows:

$$
\begin{align*}
& \operatorname{Msg} \operatorname{Exp}^{\prime}\left(M_{j-1}^{(i)}, M_{j-2}^{(i)}\right) \\
& =\left(M_{j-1}^{(i)}[0] \boxplus M_{j-2}^{(i)}\left[\tau^{\prime}(0)\right], \ldots, M_{j-1}^{(i)}[15] \boxplus M_{j-2}^{(i)}\left[\tau^{\prime}(15)\right]\right) \\
& \quad \text { for } \quad j=2, \ldots, N_{s} . \tag{22}
\end{align*}
$$

By using (2), the function Step ${ }_{j}^{\prime}$ in (17) is represented as follows:

$$
\begin{align*}
\text { Step }_{j}^{\prime}= & P \circ \sigma \circ \text { Mix }_{j} \circ \text { MsgAdd } \circ P^{-1} \\
= & P \circ \sigma \circ\left(P^{-1} \circ P\right) \circ \operatorname{Mix}_{j} \circ\left(P^{-1} \circ P\right) \\
& \circ M s g A d d \circ P^{-1} \\
= & \sigma^{\prime} \circ M i x_{j}^{\prime} \circ M s g A d d^{\prime} \tag{23}
\end{align*}
$$

where $\sigma^{\prime}=P \circ \sigma \circ P^{-1}, M i x_{j}^{\prime}=P \circ M i x_{j} \circ P^{-1}$, and $M s g A d d^{\prime}=P \circ M s g A d d \circ P^{-1}$.

In (23), MsgAdd $=M s g A d d^{\prime}$. In $M s g A d d^{\prime}$, the words are permuted by $P^{-1}$, and the added message is also permuted by $P^{-1}$ in $M s g E x p^{\prime}$. Thus, MsgAdd ${ }^{\prime}$ is a function of modular addition between words in the same ordering as in the function MsgAdd. In Mix ${ }_{j}^{\prime}$, the modular additions of step constants and left rotation by $\gamma_{l}$ are affected by the permutation $P$. Therefore step constants word-wise permuted by $P^{-1}$ and left rotation by $\gamma_{P^{-1}(l)}$ should be in Mix ${ }_{j}^{\prime}$. Further, $\sigma^{\prime}$ is a word-wise permutation since $P, P^{-1}$ and $\sigma$ are all wordwise permutations. Consequently, MsgAdd' is the same as MsgAdd, and Mix ${ }_{j}^{\prime}$ is the same as $M i x_{j}$ except for the values for left rotation and step constants. This implies that they do not affect to the performance when they are implemented with SIMD. However, the word-wise permutation $\tau^{\prime}$ in MsgExp ${ }^{\prime}$ and $\sigma^{\prime}$ in Step $p_{j}^{\prime}$ affect the performance when implemented by SIMD instructions.

## B. TYPES OF PERMUTATIONS FOR LSH

In this subsection, we investigate the conditions of $P$ that improve performance via SIMD related to $\tau^{\prime}$ and $\sigma^{\prime}$.


FIGURE 3. Permutation $\boldsymbol{\tau}$ of MsgExp.


FIGURE 4. Permutation $\sigma$ of Step . $^{\text {. }}$
Both the permutations $\tau$ and $\sigma$ simultaneously permute four words with the some order as follows. The permutations $\tau$ and $\sigma$ are represented by the compositions of two permutations, including internal permutations of four words and external permutations of a four-word arrays. Note that, in Fig. 3 and 4, four-word arrays permute to four-word arrays, regardless of the ordering of the four words in the four-word arrays.

To simply represent a permutation $P=\left(p_{0} p_{1} \cdots p_{15}\right)$ that takes a word array of $\left(w_{0} w_{1} \cdots w_{15}\right)$ to $\left(w_{p_{0}} w_{p_{1}} \cdots w_{p_{15}}\right)$, we define external/internal permutations as follows. For a four-word array $W_{i}:=\left(w_{4 i} w_{4 i+1} w_{4 i+2} w_{4 i+3}\right)$, an external permutation denoted as $\left(a_{0}, a_{1}, a_{2}, a_{3}\right) \in\{0,1,2,3\}^{4}$ is a permutation that permutes $\left(W_{0}, W_{1}, W_{2}, W_{3}\right)$ to $\left(W_{a_{0}}, W_{a_{1}}, W_{a_{2}}, W_{a_{3}}\right)$. An internal permutation $P_{j}$ denoted as $\left(b_{0} b_{1} b_{2} b_{3}\right), b_{i} \in\{0,1,2,3\}$ is a permutation that permutes 4 words $\left(w_{4 j}, w_{4 j+1}, w_{4 j+2}, w_{4 j+3}\right)$ into 4 words $\left(w_{4 j+b_{0}}, w_{4 j+b_{1}}, w_{4 j+b_{2}}, w_{4 j+b_{3}}\right)$. Using an external permutation and four internal permutations, a permutation $P$ is represented as $\left(P_{a_{0}}, P_{a_{1}}, P_{a_{2}}, P_{a_{3}}\right)$. For an example of the notation, the permutation $\sigma$ of Step $_{j}$ is shown below.

In Fig. 5, the above four internal permutations are represented by $\sigma_{0}=\sigma_{1}=(2013)$ and $\sigma_{2}=\sigma_{3}=(0321)$. The external permutation is represented by $(1,3,0,2)$. Therefore $\sigma$ is represented by $\sigma=\left(\sigma_{1}, \sigma_{3}, \sigma_{0}, \sigma_{2}\right)$. Similarly, $\tau=$ $\left(\tau_{0}, \tau_{1}, \tau_{2}, \tau_{3}\right)$ where $\tau_{0}=\tau_{2}=(3201)$ and $\tau_{1}=\tau_{3}=$ (3012).


FIGURE 5. Representation of the permutation $\sigma$ of Step $\boldsymbol{j}_{j}$ with internal permutations (top) and an external permutation (bottom).

Recall that SIMD instructions are operated in a register or among registers instead of via words. Thus, words in a register should be operated according to the same instructions. If some words in a register need to be partially operated, then the SIMD instructions for the register cannot be used, and the register should be divided. Composing and relieving of registers are done through SIMD instructions such as "load" and "store". Therefore, to reduce the number of SIMD instructions, words in a register need to be permuted
simultaneously. Note that consecutive groups of four words are permuted by $\tau$ and $\sigma$ simultaneously, and the same operations are applied to those consecutive groups of four words. It is enough to consider permutations of the form consisting of external and internal permutations. An external permutation in $P$ provides no advantages for reducing the number of SIMD instructions, thus we fix the external permutation of $P$ as the identity permutation. By Criterion 2, the permutation of the last eight words is determined by the permutation of the first eight words. Therefore, the total number of permutations we should consider is $(4!)^{2}=576$ for two internal permutations.

To find the optimal permutation $P$ among the 576 internal permutation candidates, we define five types of permutations with respect to the following forms. Four are defined for internal permutations, and the other is defined for two internal permutations.

Definition 1: We define $\mathrm{TYPE}_{i}$ for $i=1, \ldots, 5$ as a form of permutation in $S_{i}$ as follows. Note that $*$ is an integer in $\{0,1,2,3\}$.

$$
\begin{aligned}
S_{1}= & \{(01 * *),(* * 01),(23 * *),(* * 23)\}, \\
S_{2}= & \{(10 * *),(* * 10),(32 * *),(* * 32)\}, \\
S_{3}= & \{(02 * *),(* * 02),(20 * *),(* * 20), \\
& (13 * *),(* * 13),(31 * *),(* * 31)\}, \\
S_{4}= & \{(03 * *),(* * 03),(30 * *),(* * 30), \\
& (12 * *),(* * 12),(21 * *),(* * 21)\}, \\
S_{5}= & \left\{\left(P_{i}, P_{i}\right)\right\}
\end{aligned}
$$

The following examples will be helpful for understanding the above TYPEs. For example, $P_{i}=$ (0132) has one TYPE1 as $(01 * *)$ in $S_{1}$ and one TYPE2 as $(* * 32)$ in $S_{2}$. Additionally, $P_{i}=(3102)$ has two TYPE3 as $(* * 02)$ and $(31 * *)$ in $S_{3}$ and none as TYPE1,2,4. For $\left(P_{1}, P_{2}\right)=((1032),(1032))$, $P_{1}$ and $P_{2}$ are the same as an internal permutation (1032), thus it is TYPE5.

We find a good permutation by counting the number of TYPEs in $\tau^{\prime}$ and $\sigma^{\prime}$. Note that the total numbers for TYPE1 to TYPE4 for an internal permutation are always two. TYPE1 is represented as an internal permutation of two consecutive words permuted with the same ordering. In LSH-512, since the word size is 64 and the register size is 128 , the register has two 64 bits words. If $\tau^{\prime}$ or $\sigma^{\prime}$ has the form of TYPE1, then an SIMD instruction for permutation of the register positions ' 01 ' or ' 23 ' is not needed. For the other case of TYPE2, an SIMD instruction "_mm_shuffle_epi32" for changing word positions in a register is needed, then an SIMD instruction for this permutation is needed. For TYPE3, '02', '20', '13', and '31' are composed of words in different registers. Thus, those groupings require the SIMD instruction "_mm_unpacklo_epi64" or "_mm_unpackhi_epi64". For TYPE4, '03', '30', '12', and ' 21 ' are composed of words in different registers and different word positions(of left or right). Thus, two SIMD instructions are needed: "_mm_shuffle_epi32" to change word positions, one of "_mm_unpacklo_epi64" and


FIGURE 6. $P$ and $P^{-1}$ for LSH with SSE2, SSSE3, XOP, and LSH-512 with AVX2.


FIGURE 7. $\tau^{\prime}$ for LSH with SSE2, SSSE3, XOP, and LSH-512 with AVX2.


FIGURE 8. $\sigma^{\prime}$ for LSH with SSE2, SSSE3, XOP, and LSH-512 with AVX2.


FIGURE 9. The permutation $P$ and $P^{\mathbf{- 1}}$ used for LSH-256 with AVX2.
"_mm_unpackhi_epi64". Note that TYPE1 and TYPE2 do not share an internal permutation. By comparing the numbers of SIMD instructions for internal permutations, TYPE1 and TYPE2 are better than TYPE3 and TYPE4, and TYPE1 is the best.

TYPE5 is used for a 256-bit registers such as AVX2. For LSH-256, a register has eight words. If two internal permutations in a register are equal, the permutation can be operated using "_mm256_shuffle_epi32" instead of "_mm256_permutevar8x32_ps". The latter SIMD instruction has triple latency compared to that of the former. Thus, using the former is better than the latter for better latency performance.

TYPE1-5 are defined by considering an SIMD instruction set. Because there are many SIMD circumstances, if someone wants to use LSH with some specific SIMD, then the above TYPEs can be used to choose a permutation. All types of permutation considering TYPE1-5 are in Appendix A.

## IV. PERMUTATION WITH THE BEST PERFORMANCE FOR LSH USING SIMD

In previous sections, we have shown how to find optimal permutations $\tau^{\prime}$ and $\sigma^{\prime}$ with TYPEs. The optimal permutation

TABLE 4. Performance results with an intel CPU (Cycle/Byte).

| Hash length (bits) | SIMD | implementation | The size of message |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  | $\begin{gathered} \text { long } \\ \text { message } \end{gathered}$ | $\begin{aligned} & 4096 \\ & \text { bytes } \end{aligned}$ | $\begin{gathered} 64 \\ \text { bytes } \end{gathered}$ |
| 256 | SSE2 | [7] | 6.67 | 6.88 | 15.25 |
|  |  | this paper | 4.59 | 4.74 | 11.06 |
|  | SSSE3 | [7] | 3.75 | 3.88 | 9.19 |
|  |  | this paper | 3.34 | 3.46 | 8.69 |
|  | AVX2 | [7] | 3.58 | 3.71 | 8.75 |
|  |  | this paper | 3.36 | 3.48 | 8.38 |
| 512 | SSE2 | [7] | 6.76 | 7.22 | 29.38 |
|  |  | this paper | 4.87 | 5.16 | 22.44 |
|  | SSSE3 | [7] | 5.57 | 5.89 | 24.44 |
|  |  | this paper | 4.84 | 5.17 | 22.19 |
|  | AVX2 | [7] | 2.44 | 2.61 | 11.38 |
|  |  | this paper | 2.18 | 2.33 | 10.31 |

* With CPU(Intel Core i7-4790K CPU @ 4.00GHz), RAM(5GB), OS(Windows 7 Enterprise K SP1 64bit),
Compiler(Visual Studio 2013, 64bit compiler)

TABLE 5. Performance results with an AMD CPU (Cycle/Byte).

| Hash length (bits) | SIMD | implementation | The size of message |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  | $\begin{gathered} \text { long } \\ \text { message } \end{gathered}$ | $\begin{aligned} & 4096 \\ & \text { bytes } \end{aligned}$ | $\begin{gathered} 64 \\ \text { bytes } \end{gathered}$ |
| 256 | SSE2 | [7] | 9.41 | 9.73 | 21.23 |
|  |  | this paper | 6.11 | 6.33 | 14.31 |
|  | SSSE3 | [7] | 5.10 | 5.26 | 12.39 |
|  |  | this paper | 4.73 | 4.90 | 11.36 |
|  | XOP | [7] | 3.88 | 4.00 | 9.97 |
|  |  | this paper | 3.23 | 3.35 | 8.39 |
| 512 | SSE2 | [7] | 6.62 | 7.07 | 29.72 |
|  |  | this paper | 4.63 | 4.95 | 21.11 |
|  | SSSE3 | [7] | 5.28 | 5.63 | 24.67 |
|  |  | this paper | 4.39 | 4.70 | 20.28 |
|  | XOP | [7] | 4.42 | 4.74 | 20.81 |
|  |  | this paper | 3.26 | 3.49 | 15.61 |
|  | With CP | AMD FX-8350 OS(Windows Compiler(Visu | @ 4.00 Profession Studio 20 | Hz), R | 1 M 8 (8bit |



FIGURE 10. $\tau^{\prime}$ for LSH-256 with AVX2.


FIGURE 11. $\sigma^{\prime}$ for LSH-256 with AVX2.
is selected as TYPE1-4 except for LSH-256 with AVX2. For the case of LSH-256 with AVX2, since there are two internal permutations in the register, the optimal permutation is selected as TYPE5.

## A. THE PERMUTATION P FOR LSH WITH SSE2, SSSE3, XOP, AND LSH-512 WITH AVX2

The permutation $P$ and its inverse $P^{-1}$ for SSE2, SSSE3, XOP, and AVX2 in SIMD are described in Fig. 6. For AVX2,

TABLE 6. Permutation representation with english alphabet.

| representation | A | B | C | D | E | F |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| permutation | $(0123)$ | $(0132)$ | $(0213)$ | $(0231)$ | $(0312)$ | $(0321)$ |
| representation | G | H | I | J | K | L |
| permutation | $(1023)$ | $(1032)$ | $(1203)$ | $(1230)$ | $(1302)$ | $(1320)$ |
| representation | M | N | O | P | Q | R |
| permutation | $(2013)$ | $(2031)$ | $(2103)$ | $(2130)$ | $(2301)$ | $(2310)$ |
| representation | S | T | U | V | W | X |
| permutation | $(3012)$ | $(3021)$ | $(3102)$ | $(3120)$ | $(3201)$ | $(3210)$ |

## TABLE 7. Permutations in TYPE1 and TYPE2.

| $\tau^{\prime}$ | $\sigma^{\prime}$ | Permutations |
| :---: | :---: | :---: |
| $(4,4)$ | $(3,1)$ | $\underset{(\mathrm{A}, \mathrm{L}),(\mathrm{A}, \mathrm{M}),(\mathrm{B}, \mathrm{N}),(\mathrm{B}, \mathrm{V}),(\mathrm{G}, \mathrm{C}),(\mathrm{G}, \mathrm{K}),(\mathrm{H}, \mathrm{D}),(\mathrm{H}, \mathrm{U}),(\mathrm{Q}, \mathrm{L}),(\mathrm{Q}, \mathrm{M}),(\mathrm{R}, \mathrm{C}),(\mathrm{R}, \mathrm{K}),}{(\mathrm{W}, \mathrm{N}),(\mathrm{W}, \mathrm{V}),(\mathrm{X}, \mathrm{D}),(\mathrm{X}, \mathrm{U})}($ |
|  | $(2,2)$ | (A,C), (A,K), (A,N), (A,V), (B,D), (B,L), (B,M), (B,U), (G,D), (G,L), (G,M), (G,U), $(\mathrm{H}, \mathrm{C}),(\mathrm{H}, \mathrm{K}),(\mathrm{H}, \mathrm{N}),(\mathrm{H}, \mathrm{V}),(\mathrm{Q}, \mathrm{C}),(\mathrm{Q}, \mathrm{K}),(\mathrm{Q}, \mathrm{N}),(\mathrm{Q}, \mathrm{V}),(\mathrm{R}, \mathrm{D}),(\mathrm{R}, \mathrm{L}),(\mathrm{R}, \mathrm{M}),(\mathrm{R}, \mathrm{U})$, (W,D), (W,L), (W,M), (W,U), (X,C), (X,K), (X,N), (X,V) |
|  | $(1,3)$ | $\begin{gathered} \text { (A,D), (A,U), (B,C), (B,K), (G,N), (G,V), (H,L), (H,M), (Q,D), (Q,U), (R,N), (R,V), } \\ (W, C),(W, K),(X, L),(X, M) \end{gathered}$ |
| $(2,2)$ | $(3,1)$ | (C,D), (C,U), (D,C), (D,K), (K,D), (K,U), (L,N), (L,V), (M,N), (M,V), (N,L), (N,M), <br> $(\mathrm{U}, \mathrm{C}),(\mathrm{U}, \mathrm{K}),(\mathrm{V}, \mathrm{L}),(\mathrm{V}, \mathrm{M})$ |
|  | $(2,2)$ | $\begin{gathered} \text { (C,C), (C,K), (C,N), (C,V), (D,D), (D,L), (D,M), (D,U), (K,C), (K,K), (K,N), (K,V), } \\ (\mathrm{L}, \mathrm{D}),(\mathrm{L}, \mathrm{~L}),(\mathrm{L}, \mathrm{M}),(\mathrm{L}, \mathrm{U}),(\mathrm{M}, \mathrm{D}),(\mathrm{M}, \mathrm{~L}),(\mathrm{M}, \mathrm{M}),(\mathrm{M}, \mathrm{U}),(\mathrm{N}, \mathrm{C}),(\mathrm{N}, \mathrm{~K}),(\mathrm{N}, \mathrm{~N}), \\ (\mathrm{N}, \mathrm{~V}),(\mathrm{U}, \mathrm{D}),(\mathrm{U}, \mathrm{~L}),(\mathrm{U}, \mathrm{M}),(\mathrm{U}, \mathrm{U}),(\mathrm{V}, \mathrm{C}),(\mathrm{V}, \mathrm{~K}),(\mathrm{V}, \mathrm{~N}),(\mathrm{V}, \mathrm{~V}) \end{gathered}$ |
|  | $(2,0)$ | $(\mathrm{A}, \mathrm{F}),(\mathrm{A}, \mathrm{O}),(\mathrm{B}, \mathrm{E}),(\mathrm{B}, \mathrm{I}),(\mathrm{G}, \mathrm{P}),(\mathrm{G}, \mathrm{T}),(\mathrm{H}, \mathrm{J}),(\mathrm{H}, \mathrm{S}),(\mathrm{Q}, \mathrm{F}),(\mathrm{Q}, \mathrm{O}),(\mathrm{R}, \mathrm{P}),(\mathrm{R}, \mathrm{T})$, |
|  | $(1,3)$ | (C,L), (C,M), (D,N), (D,V), (K,L), (K,M), (L,C), (L,K), (M,C), (M,K), (N,D), (N,U), (U,N), (U,V), (V,D), (V,U) |
|  | $(1,1)$ | (A,E), (A,I), (A,P), (A,T), (B,F), (B,J), (B,O), (B,S), (E,C), (E,D), (E,K), (E,L), <br> (E,M), (E,N), (E,U), (E,V), (FC), (FD), (F,K), (FL), (F,M), (FN), (FUU), $(\mathrm{F}, \mathrm{V}),(\mathrm{G}, \mathrm{F}),(\mathrm{G}, \mathrm{J}),(\mathrm{G}, \mathrm{O}),(\mathrm{G}, \mathrm{S}),(\mathrm{H}, \mathrm{E}),(\mathrm{H}, \mathrm{I}),(\mathrm{H}, \mathrm{P}),(\mathrm{H}, \mathrm{T}),(\mathrm{I}, \mathrm{C}),(\mathrm{I}, \mathrm{D}),(\mathrm{I}, \mathrm{K}),(\mathrm{I}, \mathrm{L})$ (I,M), (I,N), (I,U), (I,V), (J,C), (J,D), (J, ( $),(\mathrm{J}, \mathrm{L}),(\mathrm{J}, \mathrm{M}),(\mathrm{J}, \mathrm{N}),(\mathrm{J}, \mathrm{U}),(\mathrm{J}, \mathrm{V}),(\mathrm{O}, \mathrm{C})$, ( $\mathrm{O}, \mathrm{D}$ ), ( $\mathrm{O}, \mathrm{K}$ ), ( $\mathrm{O}, \mathrm{L}$ ), ( $\mathrm{O}, \mathrm{M}$ ), ( $\mathrm{O}, \mathrm{N}$ ), ( $\mathrm{O}, \mathrm{U}$ ), ( $\mathrm{O}, \mathrm{V}$ ), ( $\mathrm{P}, \mathrm{C}),(\mathrm{P}, \mathrm{D}),(\mathrm{P}, \mathrm{K}),(\mathrm{PL}),(\mathrm{P}, \mathrm{M})$, (P,N), (P,U), (P, $),(\mathrm{Q}, \mathrm{E}),(\mathrm{Q}, \mathrm{I}),(\mathrm{Q}, \mathrm{P}),(\mathrm{Q}, \mathrm{T}),(\mathrm{R}, \mathrm{F}),(\mathrm{R}, \mathrm{J}),(\mathrm{R}, \mathrm{O}),(\mathrm{R}, \mathrm{S}),(\mathrm{S}, \mathrm{C}),(\mathrm{S}, \mathrm{D})$, $(\mathrm{S}, \mathrm{K}),(\mathrm{S}, \mathrm{L}),(\mathrm{S}, \mathrm{M}),(\mathrm{S}, \mathrm{N}),(\mathrm{S}, \mathrm{U}),(\mathrm{S}, \mathrm{V}),(\mathrm{T}, \mathrm{C}),(\mathrm{T}, \mathrm{D}),(\mathrm{T}, \mathrm{K}),(\mathrm{T}, \mathrm{L}),(\mathrm{T}, \mathrm{M}),(\mathrm{T}, \mathrm{N})$, (T,U), (T,V), (W,F), (W,J), (W,O), (W,S), (X,E), (X,I), (X,P), (X,T) |
|  | $(0,2)$ | $(\mathrm{A}, \mathrm{J}),(\mathrm{A}, \mathrm{S}),(\mathrm{B}, \mathrm{P}),(\mathrm{B}, \mathrm{T}),(\mathrm{G}, \mathrm{E}),(\mathrm{G}, \mathrm{I}),(\mathrm{H}, \mathrm{F}),(\mathrm{H}, \mathrm{O}),(\mathrm{Q}, \mathrm{J}),(\mathrm{Q}, \mathrm{S}),(\mathrm{R}, \mathrm{E}),(\mathrm{R}, \mathrm{I}),(\mathrm{W}, \mathrm{P})$, |
|  | $(0,0)$ | (A,A), (A,B), (A,G), (A,H), (A,Q), (A,R), (A,W), (A,X), (B,A), (B,B), (B,G), (B,H), (B,Q), (B,R), (B,W), (B,X), (G,A), (G,B), (G,G), (G,H), (G,Q), (G,R), (G,W), (G,X), $(H, A),(H, B),(H, G),(H, H),(H, Q),(H, R),(H, W),(H, X),(Q, A),(Q, B),(Q, G),(Q, H)$, $(\mathrm{Q}, \mathrm{Q}),(\mathrm{Q}, \mathrm{R}),(\mathrm{Q}, \mathrm{W}),(\mathrm{Q}, \mathrm{X}),(\mathrm{R}, \mathrm{A}),(\mathrm{R}, \mathrm{B}),(\mathrm{R}, \mathrm{G}),(\mathrm{R}, \mathrm{H}),(\mathrm{R}, \mathrm{Q}),(\mathrm{R}, \mathrm{R}),(\mathrm{R}, \mathrm{W}),(\mathrm{R}, \mathrm{X})$, $(W, A),(W, B),(W, G),(W, H),(W, Q),(W, R),(W, W),(W, X),(X, A),(X, B),(X, G)$, $(\mathrm{X}, \mathrm{H}),(\mathrm{X}, \mathrm{Q}),(\mathrm{X}, \mathrm{R}),(\mathrm{X}, \mathrm{W}),(\mathrm{X}, \mathrm{X})$ |
| $(0,0)$ | $(3,1)$ | (E,A), (E,B), (E,Q), (E,W), (F,A), (F, ( G$),(\mathrm{F}, \mathrm{Q}),(\mathrm{F}, \mathrm{R}),(\mathrm{I}, \mathrm{A}),(\mathrm{I}, \mathrm{B}),(\mathrm{I}, \mathrm{Q}),(\mathrm{I}, \mathrm{W}),(\mathrm{J}, \mathrm{B})$, $(\mathrm{J}, \mathrm{H}),(\mathrm{J}, \mathrm{W}),(\mathrm{J}, \mathrm{X}),(\mathrm{O}, \mathrm{A}),(\mathrm{O}, \mathrm{G}),(\mathrm{O}, \mathrm{Q}),(\mathrm{O}, \mathrm{R}),(\mathrm{P}, \mathrm{G}),(\mathrm{P}, \mathrm{H}),(\mathrm{P}, \mathrm{R}),(\mathrm{P}, \mathrm{X}),(\mathrm{S}, \mathrm{B})$, (S,H), (S,W), (S, X), (T,G), (T,H), (T,R), (T,X) |
|  | $(2,0)$ | $\left.\begin{array}{c}(\mathrm{C}, \mathrm{F}),(\mathrm{C}, \mathrm{O}),(\mathrm{D}, \mathrm{P}),(\mathrm{D}, \mathrm{T}),(\mathrm{K}, \mathrm{F}),(\mathrm{K}, \mathrm{O}),(\mathrm{L}, \mathrm{E}),(\mathrm{L}, \mathrm{I}),(\mathrm{M}, \mathrm{E}),(\mathrm{M}, \mathrm{I}),(\mathrm{N}, \mathrm{J}),(\mathrm{N}, \mathrm{S}), \\ (\mathrm{U}, \mathrm{P}),(\mathrm{U}, \mathrm{T}),(\mathrm{V}, \mathrm{J}),(\mathrm{V}, \mathrm{S})\end{array}\right)$ <br> (U,P), (U,T), (V,J), (V,S) |
|  | $(1,3)$ | (E,G), (E,H), (E,R), (E,X), (F,B), (F,H), (F,W), (F,X), (I,G), (I,H), (I,R), (I,X), (J,A), (J,G), (J,Q), (J,R), (O,B), (O,H), (O,W), (O,X), (P,A), (P,B), (P,Q), (P,W), (S,A), (S,G), (S,Q), (S,R), (T,A), (T,B), (T,Q), (T,W) |
|  | $(1,1)$ | (C,E), (C,I), (C,P), (C,T), (D,F), (D, J), (D,O), (D,S), (K,E), (K,I), (K,P), (K,T), (L,F), (L, J), (L,O), (L,S), (M,F), (M,J), (M,O), (M,S), (N,E), (N,I), (N,P), (N,T), (U,F), (U, J$),(\mathrm{U}, \mathrm{O}),(\mathrm{U}, \mathrm{S}),(\mathrm{V}, \mathrm{E}),(\mathrm{V}, \mathrm{I}),(\mathrm{V}, \mathrm{P}),(\mathrm{V}, \mathrm{T})$ |
|  | $(0,2)$ | $(\mathrm{C}, \mathrm{J}),(\mathrm{C}, \mathrm{S}),(\mathrm{D}, \mathrm{E}),(\mathrm{D}, \mathrm{I}), \underset{(\mathrm{K}, \mathrm{J}), \mathrm{E}),(\mathrm{K}, \mathrm{S}),(\mathrm{I}, \mathrm{P}),(\mathrm{V}, \mathrm{F}),(\mathrm{L}, \mathrm{T}), \mathrm{O})}{(\mathrm{M}, \mathrm{P}),(\mathrm{M}, \mathrm{T}),(\mathrm{N}, \mathrm{F}),(\mathrm{N}, \mathrm{O}),}$ |
|  | $(0,0)$ | the rest |

the permutations in Fig. 6 are applied to only LSH-512 by the above argument.

The permutation $P$ is $(0123,2013,0123,2013)$, which is represented by the symbol (A,M) in Appendix A. If $P$ is used for $\tau^{\prime}$, then there are four TYPE1 and four TYPE2. If $P$ is used for $\sigma^{\prime}$, then there are three TYPE1, one TYPE2, and four TYPE3. Note that $\tau$ has two TPYE1, two TPYE2, and four TYPE4, and $\sigma$ has four TYPE3 and four TYPE4.
$\tau^{\prime}=P \circ \tau \circ P^{-1}$ is described in Fig. 7, and $\sigma^{\prime}$ is described in Fig. 8.

Four words are assigned in one register, as shown in Fig. 7 and Fig. 8. In this case, the number of SIMD operations of $\tau^{\prime}$ is not reduced. However, there is a reduction in $\sigma^{\prime}$. Because there is no need for SIMD instructions in the second internal permutation of $\sigma^{\prime}$, the identity permutation is considered.

TABLE 8. Permutations in TYPE3 and TYPE4.

| $\tau^{\prime}$ | $\sigma^{\prime}$ | Permutations |
| :---: | :---: | :---: |
| $(8,0)$ | $(8,0)$ | (D,B), (D,G), (D,R), (D,W), (L,B), (L,G), (L,R), (L,W), (M,B), (M,G), (M,R), (M,W), (U,B), (U,G), (U,R), (U,W) |
|  | $(4,4)$ | (F,E), (F,I), (F,P), (F,T), (J,E), (J,I), (J,P), (J,T), (O,E), (O,I), (O,P), (O,T), (8,0), (S,E), (S,I), (S,P), (S,T) |
|  | $(4,2)$ | (D,E), (D,I), (D,P), (D,T), (L,E), (L,I), (L,P), (L,T), (M,E), (M,I), (M,P), (M,T), <br> (U,E), (U,I), (U,P), (U,T) |
|  | $(2,2)$ | (F,B), (F,G), (F,R), (F,W), (J,B), (J,G), (J,R), (J,W), (O,B), (O,G), (O,R), (O,W), (S,B), (S,G), (S,R), (S,W) |
| $(4,4)$ | $(8,0)$ | $(\mathrm{E}, \mathrm{E}),(\mathrm{E}, \mathrm{I}),(\mathrm{E}, \mathrm{P}),(\mathrm{E}, \mathrm{T}),(\mathrm{I}, \mathrm{E}),(\mathrm{I}, \mathrm{I}),(\mathrm{I}, \mathrm{P}),(\mathrm{I}, \mathrm{T}),(\mathrm{P}, \mathrm{E}),(\mathrm{P}, \mathrm{I}),(\mathrm{P}, \mathrm{P}),(\mathrm{P}, \mathrm{T}),(\mathrm{T}, \mathrm{E}),(\mathrm{T}, \mathrm{I})$, |
|  | $(4,4)$ | (C,B), (C,G), (C,R), (C,W), (D,A), (D,H), (D,Q), (D,X), (K,B), (K,G), (K,R), (K,W), (L,A), (L,H), (L,Q), (L,X), (M,A), (M,H), (M,Q), (M,X), (N,B), (N,G), (N,R), (N,W), (U,A), (U,H), (U,Q), (U,X), (V,B), (V,G), (V,R), (V,W) |
|  | $(4,2)$ | $(\mathrm{C}, \mathrm{E}),(\mathrm{C}, \mathrm{I}),(\mathrm{C}, \mathrm{P}),(\mathrm{C}, \mathrm{T}),(\mathrm{K}, \mathrm{E}), \underset{(\mathrm{V}, \mathrm{I}}{(\mathrm{I}),}(\mathrm{K}, \mathrm{P}),(\mathrm{K}, \mathrm{P}),(\mathrm{T}, \mathrm{T}),(\mathrm{N}, \mathrm{E}),(\mathrm{N}, \mathrm{I}),(\mathrm{N}, \mathrm{P}),(\mathrm{N}, \mathrm{T}),(\mathrm{V}, \mathrm{E})$, $(\mathrm{l}$ |
|  | $(4,0)$ | (E,B), (E,G), (E,R), (E,W), (I,B), (I,G), (I,R), (I,W), (P,B), (P,G), (P,R), (P,W), (T,B), <br> (T,G), (T,R), (T,W) |
|  | $(2,4)$ | $(\mathrm{D}, \mathrm{F}),(\mathrm{D}, \mathrm{J}),(\mathrm{D}, \mathrm{O}),(\mathrm{D}, \mathrm{S}),(\mathrm{L}, \mathrm{F}),(\mathrm{L}, \mathrm{J}),(\mathrm{L}, \mathrm{O}),(\mathrm{L}, \mathrm{S}),(\mathrm{M}, \mathrm{F}),(\mathrm{M}, \mathrm{J}),(\mathrm{M}, \mathrm{O}),(\mathrm{M}, \mathrm{S})$, $(\mathrm{U}, \mathrm{F}),(\mathrm{U}, \mathrm{J}),(\mathrm{U}, \mathrm{O}),(\mathrm{U}, \mathrm{S})$ |
|  | $(0,8)$ | (F,F), (F,J), (F,O), (F,S), (J,F), (J,J), (J,O), (J,S), (O,F), (O,J), (O,O), (O,S), (S,F), (S,J), (S,O), (S,S) |
|  | $(0,4)$ | $\begin{gathered} \text { (F,A), (F,H), (F,Q), (F,X), (J,A), (J,H), (J,Q), (J,X), (O,A), (O,H), (O,Q), (O,X), } \\ (\mathrm{S}, \mathrm{~A}),(\mathrm{S}, \mathrm{H}),(\mathrm{S}, \mathrm{Q}),(\mathrm{S}, \mathrm{X}) \end{gathered}$ |
| $(4,0)$ | $(4,4)$ | (A,B), (A,G), (A,R), (A,W), (B,B), (B,G), (B,R), (B,W), (G,B), (G,G), (G,R), (G,W), $(\mathrm{H}, \mathrm{B}),(\mathrm{H}, \mathrm{G}),(\mathrm{H}, \mathrm{R}),(\mathrm{H}, \mathrm{W}),(\mathrm{Q}, \mathrm{B}),(\mathrm{Q}, \mathrm{G}),(\mathrm{Q}, \mathrm{R}),(\mathrm{Q}, \mathrm{W}),(\mathrm{R}, \mathrm{B}),(\mathrm{R}, \mathrm{G}),(\mathrm{R}, \mathrm{R}),(\mathrm{R}, \mathrm{W})$, (W,B), (W,G), (W,R), (W,W), (X,B), (X,G), (X,R), (X,W) |
|  | $(4,2)$ | (A,E), (A,I), (A,P), (A,T), (F,C), (F,K), (F,N), (F,V), (H,E), (H,I), (H,P), (H,T), (J,C), (J,K), (J,N), (J,V), (O,C), (O,K), (O,N), (O,V), (Q,E), (Q,I), (Q,P), (Q,T), (S,C), (S,K), (S,N), (S,V), (X,E), (X,I), (X,P), (X,T) |
|  | $(4,0)$ | $(\mathrm{D}, \mathrm{D}),(\mathrm{D}, \mathrm{L}),(\mathrm{D}, \mathrm{M}),(\mathrm{D}, \mathrm{U}),(\mathrm{L}, \mathrm{D}),(\mathrm{L}, \mathrm{L}),(\mathrm{L}, \mathrm{M}),(\mathrm{L}, \mathrm{U}),(\mathrm{M}, \mathrm{D}),(\mathrm{M}, \mathrm{L}),(\mathrm{M}, \mathrm{M})$, $(\mathrm{M}, \mathrm{U}),(\mathrm{U}, \mathrm{D}),(\mathrm{U}, \mathrm{L}),(\mathrm{U}, \mathrm{M}),(\mathrm{U}, \mathrm{U})$ |
|  | $(2,4)$ | (B,E), (B,I), (B,P), (B,T), (F,D), (F,L), (F,M), (F,U), (G,E), (G,I), (G,P), (G,T), (J,D), (J,L), (J,M), (J,U), (O,D), (O,L), (O,M), (O,U), (R,E), (R,I), (R,P), (R,T), (S,D), (S,L), (S,M), (S, U), (W,E), (W,I), (W,P), (W,T) |
|  | $(2,2)$ | $\begin{gathered} \text { (D,C), (D,K), (D,N), (D,V), (L,C), (L,K), (L,N), (L,V), (M,C), (M,K), (M,N), (M,V), } \\ \text { (U,C), (U,K), (U,N), (U,V) } \end{gathered}$ |
| $(0,8)$ | $(4,4)$ | $(\mathrm{E}, \mathrm{F}),(\mathrm{E}, \mathrm{J}),(\mathrm{E}, \mathrm{O}),(\mathrm{E}, \mathrm{S}),(\mathrm{I}, \mathrm{F}),(\mathrm{I}, \mathrm{J}),(\mathrm{I}, \mathrm{O}),(\mathrm{I}, \mathrm{S}),(\mathrm{P}, \mathrm{F}),(\mathrm{P}, \mathrm{J}),(\mathrm{P}, \mathrm{O}),(\mathrm{P}, \mathrm{S}),(\mathrm{T}, \mathrm{F}),(\mathrm{T}, \mathrm{J})$, $(\mathrm{T}, \mathrm{O}),(\mathrm{T}, \mathrm{S})$ |
|  | $(2,4)$ | $\begin{gathered} \hline(\mathrm{C}, \mathrm{~F}),(\mathrm{C}, \mathrm{~J}),(\mathrm{C}, \mathrm{O}),(\mathrm{C}, \mathrm{~S}),(\mathrm{K}, \mathrm{~F}),(\mathrm{K}, \mathrm{~J}),(\mathrm{K}, \mathrm{O}),(\mathrm{K}, \mathrm{~S}),(\mathrm{N}, \mathrm{~F}),(\mathrm{N}, \mathrm{~J}),(\mathrm{N}, \mathrm{O}),(\mathrm{N}, \mathrm{~S}), \\ (\mathrm{V}, \mathrm{~F}),(\mathrm{V}, \mathrm{~J}),(\mathrm{V}, \mathrm{O}),(\mathrm{V}, \mathrm{~S}) \end{gathered}$ |
|  | $(2,2)$ | $\begin{gathered} (\mathrm{E}, \mathrm{~A}),(\mathrm{E}, \mathrm{H}),(\mathrm{E}, \mathrm{Q}),(\mathrm{E}, \mathrm{X}),(\mathrm{I}, \mathrm{~A}),(\mathrm{I}, \mathrm{H}),(\mathrm{I}, \mathrm{Q}),(\mathrm{I}, \mathrm{X}),(\mathrm{P}, \mathrm{~A}),(\mathrm{P}, \mathrm{H}),(\mathrm{P}, \mathrm{Q}),(\mathrm{P}, \mathrm{X}),(\mathrm{T}, \mathrm{~A}), \\ (\mathrm{T}, \mathrm{H}),(\mathrm{T}, \mathrm{Q}),(\mathrm{T}, \mathrm{X}) \end{gathered}$ |
|  | $(0,8)$ | $\begin{gathered} (\mathrm{C}, \mathrm{~A}),(\mathrm{C}, \mathrm{H}),(\mathrm{C}, \mathrm{Q}),(\mathrm{C}, \mathrm{X}),(\mathrm{K}, \mathrm{~A}),(\mathrm{K}, \mathrm{H}),(\mathrm{K}, \mathrm{Q}),(\mathrm{K}, \mathrm{X}),(\mathrm{N}, \mathrm{~A}),(\mathrm{N}, \mathrm{H}),(\mathrm{N}, \mathrm{Q}),(\mathrm{N}, \mathrm{X}), \\ (\mathrm{V}, \mathrm{~A}),(\mathrm{V}, \mathrm{H}),(\mathrm{V}, \mathrm{Q}),(\mathrm{V}, \mathrm{X}) \end{gathered}$ |
| $(0,4)$ | $(4,4)$ | (A,A), (A,H), (A,Q), (A,X), (B,A), (B,H), (B,Q), (B,X), (G,A), (G,H), (G,Q), (G,X), $(H, A),(H, H),(H, Q),(H, X),(Q, A),(Q, H),(Q, Q),(Q, X),(R, A),(R, H),(R, Q),(R, X)$, (W,A), (W,H), (W,Q), (W,X), (X,A), (X,H), (X,Q), (X,X) |
|  | $(4,2)$ | (A,F), (A,J), (A,O), (A,S), (E,C), (E,K), (E,N), (E,V), (H,F), (H,J), (H,O), (H,S), (I,C), (I,K), (I,N), (I,V), (P,C), (P,K), (P,N), (P,V), (Q,F), (Q,J), (Q,O), (Q,S), (T,C), (T,K), (T,N), (T,V), (X,F), (X,J), (X,O), (X,S) |
|  | $(2,4)$ | (B,F), (B,J), (B,O), (B,S), (E,D), (E,L), (E,M), (E,U), (G,F), (G,J), (G,O), (G,S), (I,D), (I,L), (I,M), (I,U), (P,D), (P,L), (P,M), (P,U), (R,F), (R,J), (R,O), (R,S), (T,D), (T,L), (T,M), (T,U), (W,F), (W,J), (W,O), (W,S) |
|  | $(2,2)$ | $\begin{gathered} \hline(\mathrm{C}, \mathrm{D}),(\mathrm{C}, \mathrm{~L}),(\mathrm{C}, \mathrm{M}),(\mathrm{C}, \mathrm{U}),(\mathrm{K}, \mathrm{D}),(\mathrm{K}, \mathrm{~L}),(\mathrm{K}, \mathrm{M}),(\mathrm{K}, \mathrm{U}),(\mathrm{N}, \mathrm{D}),(\mathrm{N}, \mathrm{~L}),(\mathrm{N}, \mathrm{M}),(\mathrm{N}, \mathrm{U}), \\ (\mathrm{V}, \mathrm{D}),(\mathrm{V}, \mathrm{~L}),(\mathrm{V}, \mathrm{M}),(\mathrm{V}, \mathrm{U}) \end{gathered}$ |
|  | $(0,4)$ | $\begin{gathered} (\mathrm{C}, \mathrm{C}),(\mathrm{C}, \mathrm{~K}),(\mathrm{C}, \mathrm{~N}),(\mathrm{C}, \mathrm{~V}),(\mathrm{K}, \mathrm{C}),(\mathrm{K}, \mathrm{~K}),(\mathrm{K}, \mathrm{~N}),(\mathrm{K}, \mathrm{~V}),(\mathrm{N}, \mathrm{C}),(\mathrm{N}, \mathrm{~K}),(\mathrm{N}, \mathrm{~N}),(\mathrm{N}, \mathrm{~V}), \\ (\mathrm{V}, \mathrm{C}),(\mathrm{V}, \mathrm{~K}),(\mathrm{V}, \mathrm{~N}),(\mathrm{V}, \mathrm{~V}) \end{gathered}$ |
| $(0,0)$ | $(4,0)$ | (A,C), (A,D), (A,K), (A,L), (A,M), (A,N), (A,U), (A,V), (H,C), (H,D), (H,K), (H,L), $(\mathrm{H}, \mathrm{M}),(\mathrm{H}, \mathrm{N}),(\mathrm{H}, \mathrm{U}),(\mathrm{H}, \mathrm{V}),(\mathrm{Q}, \mathrm{C}),(\mathrm{Q}, \mathrm{D}),(\mathrm{Q}, \mathrm{K}),(\mathrm{Q}, \mathrm{L}),(\mathrm{Q}, \mathrm{M}),(\mathrm{Q}, \mathrm{N}),(\mathrm{Q}, \mathrm{U}),(\mathrm{Q}, \mathrm{V})$, ( $\mathrm{X}, \mathrm{C}$ ), (X,D), (X,K), (X,L), (X,M), (X,N), (X,U), (X,V) |
|  | $(0,4)$ | (B,C), (B,D), (B,K), (B,L), (B,M), (B,N), (B,U), (B,V), (G,C), (G,D), (G,K), (G,L), (G,M), (G,N), (G,U), (G,V), (R,C), (R,D), (R,K), (R,L), (R,M), (R,N), (R,U), (R,V), (W,C), (W,D), (W,K), (W,L), (W,M), (W,N), (W,U), (W,V) |

## B. THE PERMUTATION P FOR LSH-256 WITH AVX2

AVX2 uses a 256-bit register. Thus, the register contains eight words. Then by $\sigma^{\prime}$, the left-half and right-half of the register are divided into two different registers. Thus, there is no advantage to using $\sigma^{\prime}$ instead of $\sigma$. However, if $\tau^{\prime}$ is of TYPE5, then we can implement $\tau^{\prime}$ with "_mm256_shuffle_epi32" instead of "_mm256_permutevar8x32_ps". This reduces the latency by one third. The permutations $P$ and $P^{-1}$ used to implement LSH-256 using AVX2 are shown in Fig. 9. This permutation $P=(0132,3120,0132,3120)$ is described as $(\mathrm{B}, \mathrm{V})$ in Appendix A.
$\tau^{\prime}$ and $\sigma^{\prime}$ are described in Fig. 10 and Fig. 11. Fig. 10 shows that permutations of eight words in a register are of TYPE5.

TABLE 9. Permutations in TYPE5.

|  | Permutations |
| :---: | :---: |
| $(2,0)$ | (A,C), (A,L), (A,N), (A,U), (B,D), (B,K), (B,M), (B,V), (C,A), (C,J), (C,Q), (C,S), (D,B), (D,I), (D,R), (D,T), (E,F), (E,H), (E,O), (E,X), (F,E), (F,G), (F,P), (F,W), (G,D), (G,K), (G,M), (G,V), (H,C), (H,L), (H,N), (H,U), (I,F), (I,H), (I,O), (I,X), (J,E), (J,G), (J,P), (J,W), (K,A), (K,J), (K,Q), (K,S), (L,B), (L,I), (L,R), (L,T), (M,E), ( $\mathrm{M}, \mathrm{G}$ ), ( $\mathrm{M}, \mathrm{P}$ ), ( $\mathrm{M}, \mathrm{W}$ ), ( $\mathrm{N}, \mathrm{F}),(\mathrm{N}, \mathrm{H}),(\mathrm{N}, \mathrm{O}),(\mathrm{N}, \mathrm{X}),(\mathrm{O}, \mathrm{B}),(\mathrm{O}, \mathrm{I}),(\mathrm{O}, \mathrm{R}),(\mathrm{O}, \mathrm{T}),(\mathrm{P}, \mathrm{A})$, (P,J), (P,Q), (P,S), (Q,D), (Q,K), (Q,M), (Q,V), (R,C), (R,L), (R,N), (R,U), (S,B), (S,I), (S,R), (S,T), (T,A), (T,J), (T,Q), (T,S), (U,E), (U,G), (U,P), (U,W), (V,F), (V,H), (V,O), (V,X), (W,C), (W,L), (W,N), (W,U), (X,D), (X,K), (X,M), (X,V) |
| $(0,2)$ | (A,A), (B,B), (C,C), (D,D), (E,E), (F,F), (G,G), (H,H), (I,I), (J,J), (K,K), (L,L), (M,M), (N,N), (O,O), (P,P), (Q,Q), (R,R), (S,S), (T,T), (U,U), (V,V), (W,W), (X,X) |
| $(0,0)$ | the rest |

There is source code for a modified LSH implementation using (A,M) and (B,V) in GIT-HUB [23].

## V. CONCLUSION

We have shown how to implement LSH efficiently with SIMD using permutations. The performance results are summarized in Tables 4 and 5. On average, there is a 5\% improved performance when using our method of permutations. Note that there is an additional 5\% performance improvement after applying other optimization methods, such as the deployment of SIMD instructions; this is because the code in [7] has already been optimized by the creator of LSH. There is no security vulnerability when applying the proposed method, since the only changes are in orderings of words, circular rotations, and step constants in registers. Furthermore, we have defined five permutation types for LSH and SIMD, which classify all permutations of LSH. This classification can be used to implement LSH with a new SIMD instruction set for various register sizes or platforms.

## APPENDIX

## A. PERMUTATIONS

An internal permutation is represented using the English alphabet for readability. This representation is as shown in Table 6.

The different types of permutations are defined in IV. In Table 7, each pair in the first and second column represents the number of TYPE1 and TYPE2 permutations. Similarly in Table 8, each pair in the first and second column represents the number of TYPE3 and TYPE4 permutations. In Table 9, the first column represents the number of TYPE5 permutations in $\tau^{\prime}$ and $\sigma^{\prime}$ respectively.

## REFERENCES

[1] B. V. Sundaram, M. Ramnath, M. Prasanth, and J. V. Sundaram, "Encryption and hash based security in Internet of Things," in Proc. 3rd Int. Conf. Signal Process., Commun. Netw. (ICSCN), Mar. 2015, pp. 1-6.
[2] M. H. Afifi, L. Zhou, S. Chakrabartty, and J. Ren, "HPMAP: A hash-based privacy-preserving mutual authentication protocol for passive iot devices using self-powered timers," in Proc. IEEE Int. Conf. Commun. (ICC), Kansas City, MO, USA, May 2018, pp. 1-6.
[3] T. D. Dang, D. Hoang, and P. Nanda, "A novel hash-based file clustering scheme for efficient distributing, storing, and retrieving of large scale health records," in Proc. IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, Aug. 2016, pp. 1485-1491.
[4] R. Hamza, K. Muhammad, A. N, and G. Ramírez-González, "Hash based encryption for keyframes of diagnostic hysteroscopy," IEEE Access, vol. 6, pp. 60160-60170, 2018.
[5] Information Technology-Security Techniques-Hash-Functions-Part 1: General, 3rd ed., Standard ISO/IEC 10118-1:2016, 2016.
[6] D.-C. Kim, D. Hong, J.-K. Lee, W.-H. Kim, and D. Kwon, "LSH: A new fast secure hash function family," in Information Security and Cryptology-ICISC (Lecture Notes in Computer Science), vol. 8949. Cham, Switzerland: Springer, 2014, pp. 286-313.
[7] The Affiliated Institute of ETRI in Korea. (2015). Korea Cryptography Forum. [Online] Available: http://www.kcryptoforum.or.kr
[8] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 5th ed. San Mateo, CA, USA: Morgan Kaufmann, 2013.
[9] B. Lac, A. Canteaut, J. J. A. Fournier, and R. Sirdey, "Thwarting fault attacks against lightweight cryptography using SIMD instructions," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Florence, Italy, May 2018, pp. 1-5.
[10] A. Miyajan, Z. Shi, C.-H. Huang, and T. F. Al-Somani, "An efficient highorder masking of AES using SIMD," in Proc. 10th Int. Conf. Comput. Eng. Syst. (ICCES), Cairo, Egypt, Dec. 2015, pp. 363-368.
[11] S. Neves and J.-P. Aumasson, "Implementing BLAKE with AVX, AVX2, and XOP," IACR Cryptol. ePrint Arch., Tech. Rep. 2012/275, 2012, p. 275.
[12] T. Park, H. Seo, and H. Kim, "Parallel implementations of SIMON and SPECK," in Proc. Int. Conf. Platform Technol. Service (PlatCon), Jeju, South Korea, Feb. 2016, pp. 1-6.
[13] H. Seo, T. Park, S. Heo, G. Seo, B. Bae, Z. Hu, L. Zhou, Y. Nogami, Y. Zhu, and H. Kim, "Parallel implementations of LEA, revisited," in Information Security Applications (Lecture Notes in Computer Science), vol. 10144. Cham, Switzerland: Springer, 2016, pp. 318-330.
[14] K. C. Pabbuleti, D. H. Mane, A. Desai, C. Albert, and P. Schaumont, "SIMD acceleration of modular arithmetic on contemporary embedded platforms," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Waltham, MA, USA, Sep. 2013, pp. 1-6.
[15] C. Du, G. Bai, and H. Chen, "Towards efficient implementation of latticebased public-key encryption on modern CPUs," in Proc. IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, Aug. 2015, pp. 1230-1236.
[16] W. Wang, J. Han, J. Wang, and X. Zeng, "A SIMD multiplier-accumulator design for pairing cryptography," in Proc. IEEE 11th Int. Conf. ASIC (ASICON), Chengdu, China, Nov. 2015, pp. 1-4.
[17] J.-P. Aumasson, S. Neves, Z. Wilcox-O'Hearn, and C. Winnerlein, "BLAKE2: Simpler, smaller, fast as MD5," in Applied Cryptography and Network Security (Lecture Notes in Computer Science), vol. 7954. Berlin, Germany: Springer, 2013, pp. 119-135.
[18] R. Beaulieu, S. Treatman-Clark, D. Shors, B. Weeks, J. Smith, and L. Wingers, "The SIMON and SPECK lightweight block ciphers," in Proc. 52nd ACM/EDAC/IEEE Design Autom. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1-6.
[19] D. Hong, J.-K. Lee, D.-C. Kim, D. Kwon, K. H. Ryu, and D.-G. Lee, "LEA: A 128-bit block cipher for fast encryption on common processors," in Information Security Applications (Lecture Notes in Computer Science), vol. 8267. Cham, Switzerland: Springer, 2013.
[20] S. Lucks, "A failure-friendly design principle for hash functions," in Advances in Cryptology-ASIACRYPT (Lecture Notes in Computer Science) vol. 3788. Berlin, Germany: Springer, 2005.
[21] R. P. Weinmann, "AXR-crypto made from modular additions, XORs, and word rotations," Dagstuhl Seminar 09031, 2009. [Online]. Available: http://www.dagstuhl.de/Materials/Files/09/09031/09031.WeinmannRalf Philipp.Slides.pdf
[22] Y. Jung, "Research on word permutation of ARX based cryptographic algorithm," Ph.D. dissertation, Dept. Maths, Hanyang Univ., Seoul, South Korea, 2016.
[23] GitHub. (2019). LSH-SIMD-Implementation. [Online]. Available: http://github.com/JuYoungjin/LSH-SIMD-Implementation


DONGYEONG KIM received the B.S. degree in mathematics from Hanyang University, Seoul, South Korea, in 2013, where he is currently pursuing the Ph.D. degree in mathematics, under the supervision of Prof. J. Song. His research interests include cryptanalysis of symmetric-key cyrptography, channel coding theory, and post quantum cryptography.

YOUNGHOON JUNG received the B.S. and Ph.D. degrees in mathematics from Hanyang University, Seoul, South Korea, in 2011 and 2016, respectively. He is currently with the Affiliated Institute of ETRI, as a Senior Researcher. His research interests include design, cryptanalysis, and implementation of symmetric key cryptography.


YOUNGJIN JU received the B.S. degree in mathematics from Hanyang University, Seoul, South Korea, in 2018, where he is currently pursuing the Ph.D. degree in mathematics, under the supervision of Prof. J. Song. His research interests include cryptanalysis of symmetric-key cryptography and post quantum cryptography.


JUNGHWAN SONG received the B.S. degree from Hanyang University, Seoul, South Korea, in 1984, the M.S. degree from Syracuse University, NY, USA, in 1989, and the Ph.D. degree from Rensselaer Polytechnic Institute, NY, USA, in 1993, all in mathematics. He was the Chairman of Korea Cryptographic Forum. He is currently a Professor with the Department of Mathematics, Hanyang University, Seoul. His current research interests include cryptanalysis of symmetric-key cryptography, mathematical optimization, and post quantum cryptography.

$$
\bullet \bullet
$$


[^0]:    The associate editor coordinating the review of this manuscript and approving it for publication was Xiangxue Li.

