Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add twirl_circuit function #13331

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

Conversation

mtreinish
Copy link
Member

Summary

This commit adds a new function twirl_circuit to apply Pauli twirling to a given circuit. This function only works with a fixed set of two qubit gates and is not a general solution. For performance this new function is written in rust and uses static lookup tables for the twirling sets around the fixed two qubit gates that are randomly sampled. There is an option to perform twirling multiple times and return a list of circuits instead of just doing it once. Ideally we'd do this in parallel, but we're currently blocked on the use of OnceCell in the PackedInstruction and the ParameterTable from parallelizing with CircuitData. We can further improve the performance of this new function by using a parallel iterator once #13219 is resolved. The function is written in a way to make this simple in the future.

Details and comments

Fixes #13325

This commit adds a new function twirl_circuit to apply Pauli twirling to
a given circuit. This function only works with a fixed set of two qubit
gates and is not a general solution. For performance this new function
is written in rust and uses static lookup tables for the twirling sets
around the fixed two qubit gates that are randomly sampled. There is
an option to perform twirling multiple times and return a list of
circuits instead of just doing it once. Ideally we'd do this in
parallel, but we're currently blocked on the use of OnceCell in the
PackedInstruction and the ParameterTable from parallelizing with
CircuitData. We can further improve the performance of this new
function by using a parallel iterator once Qiskit#13219 is resolved. The
function is written in a way to make this simple in the future.

Fixes Qiskit#13325

Co-authored-by: Paul Nation <[email protected]>
@mtreinish mtreinish added Changelog: New Feature Include in the "Added" section of the changelog Rust This PR or issue is related to Rust code in the repository mod: circuit Related to the core of the `QuantumCircuit` class or the circuit library labels Oct 15, 2024
@mtreinish mtreinish added this to the 1.3.0 milestone Oct 15, 2024
@mtreinish mtreinish requested a review from a team as a code owner October 15, 2024 21:47
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

@mtreinish
Copy link
Member Author

I ran an asv comparison (since the benchmarks needed to change to show the improvement here it's not a direct comparison) and this is showing a improvement of the previous twirling benchmark written in python from taking 254±4ms to 6.38±0.3ms with the twirl_circuit function in this PR:

| Change   | Before [54fe09b2] <lint_incr_latest~1>   | After [33b4bde2] <twirl-for-paul>   |   Ratio | Benchmark (Parameter)                                 |
|----------|------------------------------------------|-------------------------------------|---------|-------------------------------------------------------|
| x        | 254±4ms                                  | 6.38±0.3ms                          |    0.03 | manipulate.TestCircuitManipulate.time_DTC100_twirling |

The "Change" X is because asv detected the benchmark changed and the results may not be comparable.

@coveralls
Copy link

coveralls commented Oct 15, 2024

Pull Request Test Coverage Report for Build 11479816454

Details

  • 398 of 410 (97.07%) changed or added relevant lines in 8 files are covered.
  • 13 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.05%) to 88.731%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/circuit/src/circuit_data.rs 26 29 89.66%
qiskit/circuit/twirling.py 57 60 95.0%
crates/accelerate/src/twirling.rs 289 295 97.97%
Files with Coverage Reduction New Missed Lines %
qiskit/synthesis/two_qubit/xx_decompose/decomposer.py 1 95.38%
crates/accelerate/src/unitary_synthesis.rs 1 92.24%
crates/qasm2/src/lex.rs 5 92.48%
crates/qasm2/src/parse.rs 6 97.62%
Totals Coverage Status
Change from base Build 11476806715: 0.05%
Covered Lines: 75293
Relevant Lines: 84855

💛 - Coveralls

Copy link
Contributor

@Cryoris Cryoris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a pretty neat speedup for something used a lot 🙂

qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
crates/circuit/src/circuit_data.rs Outdated Show resolved Hide resolved
if gate == twirled_gate {
let qubits: Vec<Qubit> = out_circ.get_qargs(inst.qubits).to_vec();
let (twirl, twirl_phase) = twirl_set.choose(rng).unwrap();
out_circ.push_standard_gate(twirl[0], &[], &[qubits[0]])?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a performance benefit to omitting identity gates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it wouldn't make much of a difference for the performance of this function. The two ways to do it would be to condition the push call on the value of twirl[index] which adds a bunch of branching logic around each push call or somehow tweak the static twirling sets which would be tricky because they work well because everything is fixed sizes. There is a potential benefit downstream of this function because we've potentially used less gates

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On average, 25% of the gates we'd be adding to the circuit will be identities, and that's a fair amount of data to reduce by not outputting it, which I'd expect to have a measurable impact on the runtime of this function. I wouldn't imagine that the branching logic will make any huge amount of difference, because there's already a decent amount of branching and hashing work hiding within each function call here.

Copy link
Contributor

@nonhermitian nonhermitian Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nice thing is that id gates explicitly show the added twirling gates. Moreover, in practice one should run a 1Q gate cleanup after twirling that should take care of this e,g

post_pm = PassManager([PauliTwirling('ecr'), 
                       Optimize1qGatesDecomposition(backend.target.operation_names)])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those kinds of nice things have a very poor effect on performance. We've just made a big deal about Qiskit performance, and if we want to maintain it, we need to keep putting performance first, and designing our interfaces around that.

We can potentially allow it to be opt-in to show the twirling gates, but it's fairly trivial to write the twirling algorithm yourself if that's what you care about seeing, and that's always going to give you the most control over the output. If what we're after is that kind of "utility" function rather than a high-performance one, I'm not sure it fits in Qiskit - we already shipped off bunches of qiskit_algorithms and the applications libraries out of the core library because they didn't fit the "highest performance core data structures and compilation" model, and decided to do that with documentation instead. If the purpose of the function is educational, then it feels to me like it fits in that vein, and should be handled with guide-level documentation on "introduction to Pauli twirling" rather than being presented in the core library as "this is an efficient way to do it".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok if id gates are a performance blocker than so be it. It also does not change the fact that I need to run a 1Q cleanup at the end, so either way, it is more important to have a function that gives users the ability to do this easily.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - I commented on that in my main review. It's not really clear to me what the purpose of this function is at all from a performance perspective.

If the purpose is to run it post-compilation, then we should design the interface of the function to be smarter about twirls in the first place, so it outputs at least hardware-compatible circuits. Technically there'll still be possibilities for post-optimisation, since 1q runs of gates will likely be possible to be resynthesised, which again is something we might want to consider.

Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to be generally useful to people, I'm ok with adding it, but I think we need to consider the interfaces a bit more, and how it fits in with the rest of Qiskit. The benchpress version was one very specific implementation for a specific type of circuit that only works for one particular way of building the input circuit, and as long as it doesn't include quite a few of the possible features of Qiskit circuits.

In addition to the review comments, one more top-level one: I'm not really sure where in a user's compilation pipeline this is meant to be called, which also affects what I'd expect to have in the function. The twirling is adding Pauli X/Y/Z gates (obviously), which implies to me that it needs to come before compilation, but then it also makes assumptions that everything's going to be decomposed in terms of one single 2q gate, which, aside from being an awkward assumption for the future, means that we might need a better way to handle things than to output non-translated 1q gates? If we instead want to push it more in the direction of "apply this to a virtual circuit, but still one that's decomposed into 2q operations", I think we need to consider interfaces to control the known twirling sets for additional gates.

I'm concerned that we're committing to an interface acting on QuantumCircuit that doesn't really act on either virtual or physical circuits, but somewhere in the middle. That's a transpiler pass, except that the num_twirls thing makes that awkward, because any sort of multi-in / multi-out branching within the transpiler needs a significant rework of the pass-manager framework to become safe (speaking as somebody who already wrote such a modification once).

qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
fn add_global_phase(py: Python, phase: &Param, other: &Param) -> PyResult<Param> {
pub(crate) fn add_global_phase(py: Python, phase: &Param, other: &Param) -> PyResult<Param> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to use this elsewhere, can we think about where it should live, and how we should be organising the crate? Imports like crate::dag_circuit::add_global_phase are awkward to work with internally if you weren't involved with the PR that originally put the function there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move it somewhere, I only left it here for brevity. Where would you like me to put it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't know right now. It's probably part of a separate issue.

crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
crates/circuit/src/circuit_data.rs Outdated Show resolved Hide resolved
crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
qiskit/circuit/twirling.py Outdated Show resolved Hide resolved
@nonhermitian
Copy link
Contributor

@jakelishman PT is not something that fits into our current circuit transformation pipelines. It basically has to be a middle step between two passmanagers. One to map to HW, followed by PT, and then another that does cleanup of the 1Q gates and casts back into the HW basis set (usually over a set of circuits generated by PT). This is one of the motivations for #13326

@jakelishman
Copy link
Member

If it doesn't really fit in Qiskit's models right now (and I agree, it doesn't particularly), then it feels like we're trying to merge such a function into the core library that'll never really fit in. If we're asking users to start using this now as part of the core library, they're going to find it far more awkward in the future if we say "actually, don't do this anymore and instead use this transpiler pass". That'll alienate our users, who already don't like the amount of changes we make.

Making the compiler multi-in, multi-out is interesting, and something we already do internally in the IBM backends (I wrote it), but it's hugely fragile in the current pass-manager system, and it's tricky to make it better without compromising performance. We can get away with it in the backends because there's far tighter control over what passes are in the pipelines and when a pass might trigger such fan-out behaviour.

We'll have a chance to do that better when we move more of the pass-manager infrastructure into Rust; we'll have more opportunities to make prior state in the PropertySet copy-on-write, rather than potentially eagerly copying large amounts of data at the point of fan-out, or being smarter about invalidating stale data in the PropertySet. Both of those would make an internal IR that's fan-out aware easier to make safe (the main concern) and performant (the next most important), but we're not there yet.


If this is only ever meant to be an intermediate-term thing, I'm most worried about how we'll present this narratively to users, especially when we need to shift them to something else.

On another narrative note: are we intending to present twirling as something that all users will need to do, and not the domain of a primitive implementer? Adding twirling as something we do on a single circuit, outside the context of the estimator observables and potential lightcone analysis might accidentally cause primitives implementations to have to do more work and have the user do more work to reconstruct their final results, since now a user might well end up passing a much larger broadcast matrix as their estimator inputs.

@nonhermitian
Copy link
Contributor

If it doesn't really fit in Qiskit's models right now (and I agree, it doesn't particularly), then it feels like we're trying to merge such a function into the core library that'll never really fit in

But this is equiv to kicking the can down the road to a undetermined time frame, and features that have value will likely never see the light of day, waiting for the ideal conditions for implementation. Perhaps latter a pass can just call this same function that remains exposed to users as well, who knows.

Twirling has value for Qiskit outside of the IBM primitives. Note that no other provider outside of IBM has a primitives interface that supports twirling. So there is value to other HW providers, and in addition there is value for users who want to see (and perhaps save) the exact circuits they are executing.

I am not sure there is any major downside on the primitives side save for perhaps extra work to be done that otherwise didn't need to be done, e.g. twirl twice if that was an automatic option. However, I am pretty sure there is an option for that.

This commit expands the twirling_gate argument to work with strings
instead of classes and also specify a list of gates instead of a single
gate type. It also changes the default to `None` to twirl all supported
gates in the circuit.
@jakelishman
Copy link
Member

A later pass can never call this function, because it's defined on the wrong types of inputs. Besides, that doesn't help the user, who would still have to switch what they're doing.

Yes, I am proposing step back and think through what a joined-up interface to Qiskit from the user right through to hardware looks like, rather than us adding things in that implement one overly-specific microbenchmark more efficiently. We can't really claim either performance and stability if that's how we go about additions to the core library. I'm not saying twirling isn't important, or even that we shouldn't put it in right now, I'm saying that software engineering is more than just throwing interfaces at a wall to see what sticks. That works well in the early days of open-source work, but not for a product whose guiding principles are meant to be stability, performance and utility.

For example, if the purpose is that this will always be used as a post-compilation pipeline where it's immediately followed by an Optimize1qGates pass, then the interface of this function probably should change in some manner that allows the twirling itself to inline the optimal 1q gate sequence for a particular backend. That will be more performant than this current implementation, because it won't need an additional transpiler pass to come later. Using the current interface over optimises for the microbenchmark, which isn't representative of the work that actually needs to be done - it elevates a subroutine of one particular (inefficient) choice of implementation to be a standard we should hit.

I am not sure there is any major downside on the primitives side save for perhaps extra work to be done that otherwise didn't need to be done.

If we promote that you should do your twirling from within Qiskit, this takes away important information that a primitives implementation can use to reduce the number of QPU hours and shot-loops necessary, because now the user is supplying many different Estimator inputs (one per twirl). That means that the same light-cone analysis now has to be done for each separate input, whereas with twirling done in the primitive implementations, it can be done before the twirling is. If the primitive is doing any other additional error learning or error mitigation, it'll likely now end up doing the same learning for each of the different input circuits, which could severely impact the number of QPU hours spent. It also means that the user has more data come back that they now have to merge together, which means more data the user has to send over the wire to us, and more data we have to send over the wire back to the user.

There are ways around these problems, but again, they mean stepping back and thinking through a joined-up interface, in conjunction with implementers of the primitives. For example, my problem of knowing when two multi-qubit potentially parametric blocks implement the same operation can be eased by additional boxes with annotations, which is something I'm currently in active discussions with the primitives teams about. However, if that's the way we go, then this function needs to be designed in such a way that it outputs those boxes, and we need to build the interface around that.

I do totally agree that this could have utility outside IBM primitives implementations as well, but if so, we need to really think through how those other users should use this, and how we'll design our interfaces so that users will, by default and by using the most natural forms, get the experience that is most performant and most pleasant end-to-end.

@nonhermitian
Copy link
Contributor

I guess if you want to future proof it a bit then you should implement as a transpiler pass that only does a single twirl. In that case the current usage would require two pass managers still, but the input to the second would be something like post_pm.run([qc]*num_twirls). Then you could probably extend it to one-in many-out in a fairly seamless manner. That would also leave the 1Q optimization separate, which is more in the spirit of the passes doing basic things

@jakelishman
Copy link
Member

This is just why I wanted to talk about things - if the intention of this function is to just be a short-term, low-level post-transpile thing, then we can still have a function like what's currently implemented, it just means we tweak the interface so it's more efficient, and we need to take care about how we teach users about it.

If I'm understanding right, then the two proposed use cases for this function are:

  • illustrate what Pauli twirling does
  • multiply twirl a post-compilation circuit, without concern for how efficiently that will be executable on a primitive (possibly because it'll be given to a non-primitive direct-access BackendV2.run, which might not be an IBM one).

In that case, maybe we work in an interface where we have an optional argument that's either None or an instance of Optimize1qDecomposition. If we're given the optimiser, the twirler can run along the 1q runs ahead of each twirled gate, then give that run and the twirling gate to the optimiser, and inline the output of that. That allows both use cases: for use case one, you don't pass the optimiser, and for use case two, you do. In the second use case, we're now more efficient than anything else proposed here, because there's no need for a follow-up passmanager at all.

Also if that's the case, then that impacts the interface promises we should make about the output circuit w.r.t. propagating any stored TranspileLayout, the register structure, and so on. Again, easy enough to do, we just needed to really hone in on what the purpose and use-case of the function is in order to know that.

The concerns about this interacting poorly in general with how efficient Estimators will want to work are still there, but that can generally be handled by documentation. That of course wouldn't be the case if this was intended to be used before input to a primitive, but that's what I want to make sure we're thinking about.

@nonhermitian
Copy link
Contributor

The workflow is the following:

  1. Transpile circuit as best as possible

  2. Take this best circuit and generate num_twirls versions that have been twirled, and then do a 1Q cleanup on the return set

  3. Execute (perhaps with things like DD added as well after twirling) using whatever execution pipeline a provider has

  4. Aggregate counts and compute desired quantities of interest.

The twirling could be a function, or a TransformationPass with the latter generating only a single twirl at the moment. Then one could combine the 1Q optimization pass in a post_pm as done above and get the desired output.

This commit updates the twirl_circuits logic to recurse into a control
flow operation and twirl any gates that are potentially in the block.
@jakelishman
Copy link
Member

Step three is the big question to me: is this not something that narratively we promote is what a primitives implementer might do for you? How does adding this function interact with that? Can you expand on why you have to do this manually, when this is something that IBM primitives definitely can do?

Knowing your step 2 is important, because then we can make the function better - having "twirl" be a standalone step isn't necessarily even the best interface for your workflow, because you have to do a full 1q post-optimize, while we can do that more efficiently on-the-fly (since 1q optimisation is pretty simple and a peephole optimisation).

I'd like to know more about what other users' use-cases for manual twirling are - you're presenting one use case, but it might not be the only reason people are twirling themselves, and I wouldn't want the interface to get in the way of that. I know that there's a few people out there who've written things approximately similar to what's here, but I don't know for what purposes they've done that.

@nonhermitian
Copy link
Contributor

I am not sure that Qiskit has to promote anything that a singular primitive does as its preferred workflow. However, a quick glance at the Qiskit docs shows only Runtime examples, so perhaps Qiskit SDK IS now promoting a singular providers interface for execution.

That being said, there are several reasons why one would like to have circuits in hand. 1) Reproducibility - I want to run exactly the same circuits over again. 2) Time saving - why waste time repeatedly twirling on every call to a primitive. 3) I want to run twirled circuits on something other than IBM hardware. 4) I want to mix error suppression techniques in a manner not available within a primitive.

If making a function version, as done here, then sure one can combine the steps. The twirling would not be explicitly clear, but it would be there logically and in a form better suited for further manipulation or execution.

Previously for each twirled gate we were calling the interner 4 times
once for each new gate. However, there are only 2 qargs being used and
we only need half of the interning calls by reusing the interned qubits
between the two gates. This commit reworks the logic to make this
optimization.
This commit removes the final logic branch for when we're generating > 4
circuit outputs. That was previously left in place to prepare for
building the output circuits in parallel using a rayon iterator.
However, we're currently blocked form doing that and having different
seeding behavior for a fixed seed when you move from 3 to 5 output
circuits is probably unexpected. This standardizes on the lower overhead
sequential iterator path, if/when we make it parallel we can switch the
seeding behavior over to the other form.
This commit adds a new option to the function to enable running the 1q
optimization pass as part of the twirling function. The typical workflow
is to run twirling after transpilation to generate a lot of random
twirled circuits, in this workflow we need to adjust the pauli gates
generated during twirling to match the backend's target. The
Optimize1qGatesDecomposition pass that's passed in is used as a
container for the arguments, we call the pass internally via rust for
lower overhead.
}
if run_pass {
let mut dag = DAGCircuit::from_circuit_data(py, out_circ, false)?;
optimize_1q_gates_decomposition(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to just call Optimize1qGatesDecomposition from rust here. You could potentially avoid scanning the entire dag to search for runs, because you can search from where we inject the pauli gates to find new runs created by the insertions, but the do that efficiently we'd have to convert to a dag anyway and the point we're doing that it doesn't make a huge difference and it's a lot simpler to just call the existing function.

This commit expands the twirling function to work with any gate object
that has a matrix. For the gates outside the previosuly supported set of
CX, ECR, CZ, and iSwap we compute the twirling set on demand for any
gate when the function is called now. This lets the function work for
any gate even custom ones as long as they define a matrix.
Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the new interface, I was just commenting my nerd-sniped bits on the performance of the new twirl finder lol

crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
crates/accelerate/src/twirling.rs Outdated Show resolved Hide resolved
Forcing users to instantiate a Optimize1qGatesDecomposition transpiler
pass was a bit heavy when the twirling function wasn't actually calling
the pass directly. All the pass was being used for was to pull the
constraints out and pass it to Rust. It is simpler for users to just
give the target to the function directly and also simplifies the
function's code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: New Feature Include in the "Added" section of the changelog mod: circuit Related to the core of the `QuantumCircuit` class or the circuit library Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend Pauli twirling to end users
6 participants