I was trying to reproduce the claim in https://developer.nvidia.com/blog/nvidia-announces-cuquantum-beta-availability-record-quantum-benchmark-and-quantum-container/, in particular:
Quantum Fourier Transform – accelerated from 29 mins down to 19 secs
I'm running the following minimal reproducer on Python 3.8 with cuQuantum 22.11, on an NVIDIA A100 40 GB (GCP instance):
import time
import cirq
import qsimcirq
import cupy
from cuquantum import contract
from cuquantum import CircuitToEinsum

simulator = qsimcirq.QSimSimulator()

# See https://quantumai.google/cirq/experiments/textbook_algorithms
def make_qft(qubits):
    """Generator for the QFT on a list of qubits."""
    qreg = list(qubits)
    while len(qreg) > 0:
        q_head = qreg.pop(0)
        yield cirq.H(q_head)
        for i, qubit in enumerate(qreg):
            yield (cirq.CZ ** (1 / 2 ** (i + 1)))(qubit, q_head)

def simulate_and_measure(nqubits):
    qubits = cirq.LineQubit.range(nqubits)
    qft = cirq.Circuit(make_qft(qubits))
    myconverter = CircuitToEinsum(qft, backend=cupy)

    # qsim full state-vector simulation
    tic = time.time()
    simulator.simulate(qft)
    elapsed_qsim = time.time() - tic
    out = {"qsim": elapsed_qsim}

    # cuQuantum expectation value of Z on the first qubit
    pauli_string = {qubits[0]: 'Z'}
    expression, operands = myconverter.expectation(pauli_string, lightcone=True)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_expectation"] = elapsed

    # cuQuantum batched amplitudes: fix every qubit but the last to |0>
    fixed_states = "0" * (nqubits - 1)
    num_fixed = len(fixed_states)
    fixed = dict(zip(myconverter.qubits[:num_fixed], fixed_states))
    expression, operands = myconverter.batched_amplitudes(fixed)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_batched"] = elapsed
    return out

for i in [10, 15, 20, 25, 30]:
    print(i, simulate_and_measure(i))
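As a quick sanity check on circuit size (plain Python, no GPU or cirq needed; qft_gate_count is a helper I wrote just for illustration): make_qft emits one H per qubit plus one controlled-phase per remaining pair, so the gate count grows only quadratically with the number of qubits.

```python
def qft_gate_count(nqubits):
    # One Hadamard per qubit, plus one CZ**theta between each popped
    # head qubit and every qubit still in the register: n + n*(n-1)/2.
    return nqubits + nqubits * (nqubits - 1) // 2

print(qft_gate_count(30))  # 465 gates for the 30-qubit QFT
```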
Output (the numbers are elapsed times in seconds; 10, 15, ... are the numbers of qubits in the QFT; for 35 and 40 qubits I skipped the qsim run, since it runs out of memory):
10 {'qsim': 0.9677999019622803, 'cu_expectation': 0.29337143898010254, 'cu_batched': 0.07590365409851074}
15 {'qsim': 0.023270368576049805, 'cu_expectation': 0.019628524780273438, 'cu_batched': 0.3687710762023926}
20 {'qsim': 0.03504538536071777, 'cu_expectation': 0.023822784423828125, 'cu_batched': 0.9347813129425049}
25 {'qsim': 0.14235782623291016, 'cu_expectation': 0.02486586570739746, 'cu_batched': 2.39030122756958}
30 {'qsim': 3.4044816493988037, 'cu_expectation': 0.028923749923706055, 'cu_batched': 4.6819908618927}
35 {'cu_expectation': 1.0615959167480469, 'cu_batched': 10.964831829071045}
40 {'cu_expectation': 0.03381609916687012, 'cu_batched': 82.43729209899902}
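For reference, here is my back-of-the-envelope understanding of the qsim OOM (assuming the GPU simulator keeps a single-precision complex64 state vector, i.e. 8 bytes per amplitude; that assumption may be wrong for other configurations):

```python
def statevector_gib(nqubits, bytes_per_amplitude=8):
    # A full state vector stores 2**n amplitudes.
    return (2 ** nqubits) * bytes_per_amplitude / 2 ** 30

print(statevector_gib(30))  # 8.0 GiB  -> fits in the A100's 40 GB
print(statevector_gib(35))  # 256.0 GiB -> far beyond 40 GB
```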
I wasn't able to go to 35 qubits with qsim because it ran out of CUDA memory. The much lower memory usage alone is enough to prefer cuQuantum for this use case. However, I was hoping that batched_amplitudes would be faster than a full state-vector simulation, since most of the qubits are fixed, but that doesn't seem to be the case. I also tried reduced_density_matrix (not shown, to keep the snippet short). The only method that is consistently fast is expectation. Did I do something wrong?
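To spell out why I expected batched_amplitudes to be cheap (plain arithmetic, no GPU needed): fixing all but one qubit leaves only two free amplitudes, so the output tensor is tiny; the time must be going into the contraction itself rather than into the size of the result.

```python
def num_free_amplitudes(nqubits, num_fixed):
    # Each unfixed qubit doubles the number of amplitudes in the batch.
    return 2 ** (nqubits - num_fixed)

print(num_free_amplitudes(30, 29))  # only 2 amplitudes requested
```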