Introduction

Achieving full phase control of optical fields is a central challenge in optical engineering, with diverse applications in imaging, sensing, and augmented- and virtual-reality systems1,2. The past decades have seen rapid development of metasurface-based optical elements that exploit the collective scattering properties of subwavelength structures to phase-shape incoming fields and are significantly more compact and easier to integrate than conventional refractive optical elements3,4,5,6,7,8,9. The most commonly adopted metasurface-design strategy proceeds in two steps — first, a library of periodic meta-atoms with varying transmission amplitudes and phases is generated by varying a few geometric parameters specifying the meta-atom. Next, an aperiodic metasurface is generated by laying out the periodic meta-atoms corresponding to the target spatially varying phase profile10,11,12,13,14,15,16,17. This approach suffers from two major limitations — first, the resulting metasurface must be almost periodic, so this strategy cannot be used to reliably design rapidly varying phase profiles. Second, generating the metasurface library becomes increasingly difficult for multi-functional design problems. For instance, while it is usually not difficult to generate a library for a simple phase mask operating on a few input modes15,18,19, scaling up the number of modes becomes increasingly difficult because the same metasurface must simultaneously satisfy multiple design conditions corresponding to the different input modes.

Fully automating the design of metasurfaces can provide a potential solution to this problem. Gradient-based optimization has been successful in designing integrated optical elements that are more compact, robust, and higher performing than their classical counterparts20,21,22,23,24,25,26,27. A key ingredient in these approaches is the ability to rapidly simulate the full electromagnetic structure. This presents a challenge for metasurface design, since practical metasurfaces can be ~10²–10³λ in linear dimension, making it impractical to use general-purpose electromagnetic solvers such as Finite-Difference Time-Domain (FDTD)28, Finite-Difference Frequency-Domain (FDFD)29, or Finite Element Method (FEM)30. Inverse-design approaches that use such general-purpose electromagnetic solvers to simulate and design the full surface are limited to small design areas or a small number of optimization iterations31,32, or restrict the parameter space through a specific symmetry that allows for fast simulations33,34,35. Consequently, nearly all current methods for inverse-designing large-scale 3D metasurfaces rely on approximate electromagnetic simulations of the metasurface locally, using either periodic or radiation boundary conditions9,36,37,38,39,40,41,42,43,44,45,46,47, which do not accurately account for interactions between different meta-atoms. These approaches are thus fundamentally limited to designing metasurfaces with slow phase variations because of the implicit local approximation. A coupled-mode formalism can also be applied to metasurface simulation and optimization48, but this approach is not guaranteed to yield exact fields, particularly for metasurfaces with multiple low quality-factor modes.

In this paper, we propose and demonstrate a numerically accurate simulation strategy that can be used to design and analyze large-area metasurfaces. Our strategy relies on distributing the simulation across compute nodes so that the simulation time decreases linearly with the available compute resources. This is achieved by a Nyquist-sampling decomposition of the fields incident on the metasurface, similar to that used recently to characterize the discrete impulse response of aperiodic metasurfaces49. Our distribution strategy, by ensuring minimal communication between compute nodes, allows for a linear reduction in the simulation time with the number of compute nodes, indicating that arbitrarily large metasurfaces can be simulated in reasonable time with a sufficiently large number of compute nodes. On each compute node, we implement a GPU-based transition-matrix (T-matrix) simulation50,51,52. Though there are GPU-optimized FDTD implementations that allow fast simulation of unit cells up to 100λ × 100λ53, these approaches do not currently provide a low-overhead means of parallel simulation distribution. We demonstrate numerically accurate simulations of metasurfaces of size 1 mm × 1 mm at a wavelength of 1.55 μm (about 645λ × 645λ) on a cluster of 48 GPU nodes. Finally, we demonstrate the ability to efficiently compute gradients with respect to both the geometry and the positions of the meta-atoms, thus enabling the application of optimization-based design to large-scale metasurfaces.

Results

Low-overhead multi-GPU simulation strategy

To simulate millimeter-scale metasurfaces, it is essential to parallelize the simulation method across multiple compute nodes. To be scalable, however, the parallelization scheme should introduce only a modest communication overhead, since such overhead can otherwise offset any time savings achieved through parallelization54,55,56.

For metasurface simulations, however, the fact that incident fields generated by far-field sources lie within the light cone in k-space can be exploited to devise a parallelization strategy that requires minimal communication between the compute nodes. The fundamental principle behind this parallelization is to represent the bandlimited incident field by its samples using the Nyquist-sampling theorem57. More precisely, consider an incident field propagating along the z-direction — the transverse polarization of this field, \(\mathbf{E}_{\mathrm{inc}}^{\mathrm{T}}(x,y,z)\), at any z can be expressed as

$$\mathbf{E}_{\mathrm{inc}}^{\mathrm{T}}(x,y,z)=\sum_{i,j}\mathbf{E}_{\mathrm{inc}}^{\mathrm{T}}(x_{i},y_{j},z)\,f_{i,j}(x,y), \qquad (1)$$

where \(x_{i}=i\lambda/2\) and \(y_{j}=j\lambda/2\) with λ being the wavelength in the background medium, and \(f_{i,j}(x,y)\) is a sinc function58 centered at \((x_{i},y_{j})\). Each term in the Nyquist decomposition can be considered an independent source, which falls off to zero with distance (Fig. 1a), and the response of a metasurface to these individual sources can be obtained by considering only a spatially truncated portion of the metasurface in the simulation. This is numerically demonstrated in Fig. 1b, in which we consider the scattered power obtained on exciting a metasurface with a single sinc source as a function of the size of the metasurface included in the simulation. As the size of the metasurface is increased, the scattered power converges, indicating that a local simulation is sufficient to capture the metasurface response. The portion of the metasurface required to achieve a particular accuracy in the simulation is governed by diffraction of the sinc source as it propagates to the metasurface.
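
The decomposition in Eq. (1) is straightforward to verify numerically. The following is a minimal sketch, using an illustrative two-plane-wave field rather than any field from this work, that samples a bandlimited field on a λ/2 grid and reconstructs it at an off-grid point from separable sinc interpolants; all names are hypothetical.

```python
import numpy as np

lam = 1.55                     # wavelength in the background medium (e.g., microns)
dx = lam / 2.0                 # Nyquist spacing for a field bandlimited to the light cone
xs = np.arange(-32, 33) * dx   # sample positions x_i (and, identically, y_j)

def bandlimited_field(x, y):
    """Toy bandlimited field: two plane waves with transverse wavevectors inside the light cone."""
    k0 = 2 * np.pi / lam
    return np.exp(1j * 0.3 * k0 * x) + 0.5 * np.exp(1j * (0.2 * k0 * x - 0.4 * k0 * y))

# Samples E(x_i, y_j, z) on the Nyquist grid
Xs, Ys = np.meshgrid(xs, xs, indexing="ij")
samples = bandlimited_field(Xs, Ys)

def reconstruct(x, y):
    """Evaluate Eq. (1): sum_ij E(x_i, y_j) f_ij(x, y) with separable sinc interpolants f_ij."""
    fx = np.sinc((x - xs) / dx)   # np.sinc(u) = sin(pi*u)/(pi*u)
    fy = np.sinc((y - xs) / dx)
    return fx @ samples @ fy

# The interpolation error at an off-grid point is small and shrinks as the grid grows
x0, y0 = 1.37, -2.11
print(abs(reconstruct(x0, y0) - bandlimited_field(x0, y0)))
```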

Fig. 1: Nyquist sampling of bandlimited incident field.

a Schematic of Nyquist sampling of the incident electric field, which is bandlimited because it is propagating. b Percent error in scattered-field power versus spatial extent of the metasurface included in the simulation for a single sinc source placed 10 μm (green), 5 μm (blue), and 0.5 μm (black) from the metasurface. The full metasurface is 25 μm × 25 μm with a focal length of 10 μm, and the surface size on the x-axis of this convergence plot refers to the spatial extent around the center of this metasurface that is included in the simulation. The y-axis relative error is computed by taking the simulation that includes the full metasurface as the converged result.

To parallelize the simulation, we can then divide the sinc sources that compose the incident electric field into smaller groups, and simulate the local response of the metasurface to each source group by performing an independent solve on a single compute node (Fig. 2a). This parallelization strategy only requires communication between the compute nodes at the start and at the end of the simulation — once to distribute the simulation data corresponding to the local subregions, and then to consolidate the electric field data computed per subregion. On each compute node, we implement a GPU-parallelized transition-matrix (T-matrix) electromagnetic solver59,60 (see Supplementary Note 1 for details of the T-matrix method and our implementation of it). In order to accurately account for the diffraction of the sinc source while computing the local response of the metasurface, we include a padding region around the group of sources for each compute node. While, in principle, we should ensure that the performance metric being analyzed (e.g., metasurface efficiency) converges with respect to the padding, in practice and for typical metasurfaces, the padding thickness required for accurate simulations can be estimated simply by studying the response of a local patch of the metasurface to one source (similar to the study performed in Fig. 1b). After all the simulations have been performed, the resulting electric fields can be added together to compute the total electric field. Furthermore, because each compute node performs roughly the same amount of computation, the total simulation time scales as 1/Nnodes (Fig. 2b). Details of our sinc source computation for the T-matrix method single-node simulation can be found in Supplementary Note 2.
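
As a schematic illustration (not the authors' C++/CUDA implementation), the sketch below groups sinc-source coordinates into square subregions, attaches the padded metasurface window that each node must include in its local solve, and sums the per-subregion scattered fields at the detector points; `solve_local_patch` is a hypothetical placeholder for the single-node T-matrix solver.

```python
import numpy as np

def tile_sources(source_xy, region_size, padding, surface_extent):
    """Group sinc-source coordinates into subregions and attach the padded
    metasurface window each compute node must include in its local solve."""
    jobs = []
    n_tiles = int(np.ceil(surface_extent / region_size))
    for ix in range(n_tiles):
        for iy in range(n_tiles):
            x0, y0 = ix * region_size, iy * region_size
            in_tile = [(x, y) for (x, y) in source_xy
                       if x0 <= x < x0 + region_size and y0 <= y < y0 + region_size]
            if not in_tile:
                continue
            window = (x0 - padding, x0 + region_size + padding,
                      y0 - padding, y0 + region_size + padding)
            jobs.append({"sources": in_tile, "window": window})
    return jobs

def run_distributed(jobs, detector_points, solve_local_patch):
    """Each job is an independent solve (on its own GPU node in practice); the
    scattered fields are simply summed, since the incident field is the sum of
    the sinc sources in Eq. (1)."""
    total = np.zeros(len(detector_points), dtype=complex)
    for job in jobs:   # embarrassingly parallel; shown serially for clarity
        total += solve_local_patch(job["sources"], job["window"], detector_points)
    return total
```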

Fig. 2: Low-overhead parallelization scheme for simulation of arbitrarily large metasurfaces.

a Schematic of the simulation distribution scheme — the incident field is first sampled and represented as a superposition of sinc sources, and then smaller groups of sinc sources and the locally surrounding metasurface regions are simulated on independent GPUs. b Total simulation time versus number of V100 GPUs used for simulation of a 50 μm (black), 100 μm (blue), and 300 μm (green) metasurface. All metasurfaces have a focal length of 25 μm and are designed from a library of silicon cylinders with height 940 nm, radii range of 50–250 nm, lattice period of 1070 nm, air background, and source wavelength of 1550 nm (based on the scatterer library from Arbabi et al.61). c Computation time for the key stages of the large-area 1 mm × 1 mm metasurface simulation (metalens with focal length 0.4 mm designed with the same scatterer library used in (b)): top row – computing the Look-Up Tables (LUT) used to efficiently perform the T-matrix simulation (Supplementary Note 1); middle row – computing the T-matrices (Supplementary Note 1.b) and solving the resulting linear system of equations for the scattered field coefficients (Supplementary Note 1.c, Supplementary Eq. 23); bottom row – computing the E and H fields from the scattered field coefficients for each desired detector point (Supplementary Note 1.c, Supplementary Eq. 24).

Thus, given a sufficiently large number of compute nodes, we expect the simulation strategy to be able to handle large problems – on a compute cluster with 48 V100 GPU nodes, we were able to simulate a metasurface of size about 645λ × 645λ in ~10 hours. This total time is broken down into the compute times for the key simulation parts in Fig. 2c. The simulated surface is a 1 mm × 1 mm metalens with focal length 0.4 mm (NA = 0.78) designed from a library of silicon cylinders with height 940 nm, radii range of 50–250 nm, lattice period of 1070 nm, air background, and source wavelength of 1550 nm (based on scatterer library from Arbabi et al.61). The simulation is performed on 48 V100 GPUs and is distributed between these compute nodes using a subregion size of 20 μm × 20 μm and a padding of 6.5 μm, resulting in 2601 subregion simulations.

Comparison with locally periodic approximation

Approximate simulations of large-area metasurfaces often rely on the locally periodic approximation (LPA)19,37,61, wherein the local electromagnetic response of the metasurface is approximated by that of a periodic metasurface. To demonstrate that the full metasurface simulation approach captures meta-atom interactions beyond LPA, we compare the T-matrix simulation method with two commonly used LPA approaches in Fig. 3. The first is a simple transmission-mask approximation, in which we assume that the metasurface imparts a smooth position-dependent amplitude and phase to the incident field, as determined by the periodic simulation61, and ignore the variation of the fields within a single unit cell. The second is a more exact field-stitching method9, in which the field near the metasurface within each unit cell is approximated with the fields from the periodic simulation and this stitched field is then propagated. For high aspect-ratio scatterers, we find that while the transmission-mask method deviates significantly from the T-matrix method, the field-stitching method does not. However, for low aspect-ratio scatterers, which are expected to have stronger inter-meta-atom interactions, both the transmission-mask and field-stitching LPA methods deviate significantly from the T-matrix method62. These results are a strong indication of the ability of the T-matrix method to capture meta-atom interactions and accurately simulate the metasurface response.
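
For reference, the transmission-mask LPA baseline can be sketched as follows: the metasurface is reduced to a position-dependent complex transmission looked up from the periodic library, and the masked field is propagated to the focal plane with the angular-spectrum method. `lookup_transmission` and the assumption of a square sampling grid are illustrative choices, not the code behind Fig. 3.

```python
import numpy as np

def angular_spectrum_propagate(field, dx, lam, z, n_background=1.0):
    """Propagate a sampled complex field on a square grid by a distance z."""
    k0 = 2 * np.pi * n_background / lam
    n = field.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    KX, KY = np.meshgrid(k, k, indexing="ij")
    kz = np.sqrt((k0**2 - KX**2 - KY**2).astype(complex))  # evanescent components decay
    return np.fft.ifft2(np.fft.fft2(field) * np.exp(1j * kz * z))

def lpa_transmission_mask(incident, radii_map, dx, lam, focal_length, lookup_transmission):
    """Apply the locally periodic complex transmission and propagate to the focal plane."""
    t = lookup_transmission(radii_map)   # complex transmission per sample point, from the unit-cell library
    return angular_spectrum_propagate(incident * t, dx, lam, focal_length)
```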

Fig. 3: Comparison of T-matrix method simulations with locally periodic assumption (LPA) simulations.

a Efficiency versus focal length for 25 μm × 25 μm metasurfaces designed from a library of high aspect-ratio scatterers with a large period (silicon cylinders with height 940 nm, radii range of 50–250 nm, lattice period of 1070 nm, and air background; source wavelength of 1550 nm – transmission and phase response shown in Supplementary Fig. 4a, based on the scatterer library from Arbabi et al.61) — efficiencies are computed using the T-matrix approach (blue dots), the commonly used LPA transmission-mask phase-sampling approach (black curve), and the LPA field-stitching method (green curve). The metalens efficiency is defined as the ratio of the power within a circle of radius 3 × FWHM in the focal plane to the power incident on the metasurface. The T-matrix and LPA-stitching methods agree fairly well here because the scatterers are high aspect-ratio and the lattice constant is large, hence the interactions between neighboring scatterers are negligible. b Efficiency versus focal length for 15 μm × 15 μm metasurfaces designed from a library of low aspect-ratio scatterers with a small period (silicon cylinders with height 220 nm, radii range of 175–280 nm, lattice period of 666 nm, and background refractive index 1.66; source wavelength of 1340 nm – using the scatterer library from Gigli et al.62, scatterer transmission and phase response shown in Supplementary Fig. 4b) — efficiencies are computed using the T-matrix approach (blue dots), the commonly used LPA transmission-mask phase-sampling approach (black curve), and the LPA field-stitching method (green curve). The metalens efficiency is defined as the ratio of the power within a circle of radius 3 × FWHM in the focal plane to the power incident on the metasurface. The T-matrix and LPA-stitching methods do not agree here because the scatterers are low aspect-ratio and the lattice constant is small, hence the interaction between neighboring scatterers is significant.
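
A sketch of this efficiency metric is given below, assuming the focal-plane intensity is available on a uniform square grid; the FWHM estimator (taken from a cut through the peak) and the variable names are illustrative choices rather than the exact post-processing used here.

```python
import numpy as np

def lens_efficiency(intensity, dx, incident_power):
    """Power inside a circle of radius 3*FWHM around the focal spot, divided by
    the power incident on the metasurface."""
    n = intensity.shape[0]
    ic, jc = np.unravel_index(np.argmax(intensity), intensity.shape)
    # Estimate the FWHM from the horizontal cut through the peak
    cut = intensity[ic, :]
    above = np.where(cut >= 0.5 * cut[jc])[0]
    fwhm = (above[-1] - above[0] + 1) * dx
    # Integrate the intensity inside a circle of radius 3*FWHM around the peak
    x = (np.arange(n) - jc) * dx
    y = (np.arange(n) - ic) * dx
    Y, X = np.meshgrid(y, x, indexing="ij")
    mask = X**2 + Y**2 <= (3 * fwhm) ** 2
    focal_power = np.sum(intensity[mask]) * dx**2
    return focal_power / incident_power
```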

Distributed optimization-based design

An essential ingredient for optimization-based design of metasurfaces is an efficient evaluation of the gradient of the figure of merit with respect to the design parameters. A particularly useful method to evaluate gradients is adjoint-sensitivity analysis63,64, which analytically differentiates through Maxwell’s equations and computes the gradients with respect to all the design parameters at a cost proportional to only two electromagnetic simulations. The distributed T-matrix simulation method is also amenable to distributed adjoint-sensitivity analysis and allows for scalable evaluation of the gradient of a performance metric, defined on the electric fields scattered from the metasurface, with respect to both the meta-atom shapes and positions (see Supplementary Note 4 for details). Figure 4 demonstrates a distributed gradient-based optimization, with respect to the positions and radii of the cylindrical meta-atoms, of a cost function evaluating the power within a spot at the focal plane for a 30 μm × 30 μm metalens with focal length 20 μm, initially designed using the traditional periodic-approximation approach with the same scatterer library used in Fig. 3b. The distributed optimization was performed on 9 T4 GPUs with the metalens divided into 9 subregions (subregion size of 10 μm × 10 μm and padding size of 6 μm). The forward simulations took an average of about 120 GPU-minutes, and the gradient computations with respect to radius and position took an average of 150 GPU-minutes. The metalens has a very high NA of over 0.996, and the optimization improves the efficiency of the metalens by ~2×, giving a final efficiency of about 24%. Although thin low aspect-ratio metasurfaces (Huygens metasurfaces) are of interest because they are more amenable to large-scale fabrication, they have not found widespread adoption due to their very limited efficiencies and angular responses62. Our ability to accurately model scatterer-scatterer effects in metasurface inverse design may allow the discovery of more practical Huygens metasurfaces65,66. Combining this multi-GPU gradient computation with the multi-GPU forward simulation opens the door to gradient-based optimization over the many degrees of freedom afforded by arbitrarily large metasurfaces. In particular, our method allows optimizing both the shape and position of the scatterers composing the large-area metasurface — optimizing the scatterer positions is very difficult for any inverse-design approach that relies on a periodic assumption.
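
Schematically, the optimization loop amounts to gradient ascent on the scatterer radii and positions, with the distributed forward and adjoint computations supplied as callables. This is a simplified sketch; the callables, step size, and the plain ascent update are assumptions, with only the iteration count and radius bounds taken from the library and optimization described above.

```python
import numpy as np

def optimize_metalens(radii, positions, figure_of_merit, gradient,
                      radius_bounds=(0.175, 0.280),  # library radius range in microns
                      step=1e-3, iterations=35):
    """Gradient ascent on both radii and positions.  `figure_of_merit` wraps the
    distributed forward simulation plus focal-spot power evaluation, and `gradient`
    wraps the distributed adjoint computation; both are provided by the caller."""
    history = []
    for _ in range(iterations):
        history.append(figure_of_merit(radii, positions))
        d_radii, d_positions = gradient(radii, positions)
        radii = np.clip(radii + step * d_radii, *radius_bounds)  # keep radii within the library
        positions = positions + step * d_positions
    return radii, positions, history
```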

Fig. 4: Distributed gradient-based optimization improvement of a metalens design.
figure 4

a Lens efficiency versus optimization iteration, where lens efficiency is defined as the ratio of the power within a circle of radius 3 × FWHM in the focal plane to the power incident on the metasurface. The initial metasurface is a 30 μm × 30 μm metalens with focal-length 20 μm designed from the low aspect-ratio scatterer library used in Fig. 3b using the traditional periodic-approximation metasurface design approach. The metalens is designed and optimized for x-polarized light only. In 35 optimization iterations, the metalens efficiency is almost doubled. The inset shows the intensity of the X-component of the electric field in the focal plane before optimization (left) and after optimization (right). b Schematic of the cylindrical metasurface scatterers after optimization. c Histograms of the distance between the final scatterer positions and the initial scatterer positions (left) and the absolute radius difference between the final scatterer cylinders and the initial scatterer cylinders (right). As can be seen in these histograms, both the scatterer positions and radii change as a result of the optimization.

Discussion

We have demonstrated a scalable distribution method to accurately simulate arbitrarily large-area metasurfaces. Our method uses the Nyquist sampling theorem to distribute the computation in parallel across multiple GPU nodes, on each of which a T-matrix formulation is used to efficiently simulate a subregion. We show a roughly \(1/N_{\mathrm{GPU}}\) scaling of the total simulation time and demonstrate that our method accurately accounts for all scatterer interactions. Finally, we demonstrate the ability to apply our distribution method to the computation of the gradient with respect to all design parameters. Our distributed simulation method provides a solution to the long-standing problem of simulating large-area metasurfaces and opens the door to gradient-based optimization of the full metasurface, taking advantage of all the design degrees of freedom.

Methods

Our low-overhead distribution strategy uses Nyquist sampling of the incident field to split the simulation into subregion simulations, each of which can be performed in parallel. We use RabbitMQ (https://www.rabbitmq.com) to create a queue of the metasurface subregion simulations and to manage the distribution of these simulations to the available GPU compute nodes in a fault-tolerant manner. On each GPU compute node, we run our T-matrix method code, implemented in C++ using CUDA for the single-node GPU parallelization of the incident-field, matrix-vector product, and electric- and magnetic-field computations. For our GPU compute nodes, we used Google Cloud V100 GPUs for the timing benchmarks and the 1 mm × 1 mm metasurface simulation in Fig. 2, and T4 GPUs for the distributed inverse design in Fig. 4. We interface our distributed metasurface solver with the photonic optimization framework SPINS to perform the inverse design.
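
A minimal sketch of such a work queue is shown below, assuming the pika Python client for RabbitMQ; the queue name, message contents, and the `run_t_matrix_subregion` stub are illustrative stand-ins rather than the interface used in this work.

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="subregion_jobs", durable=True)

# Producer: enqueue one persistent message per subregion simulation
jobs = [{"subregion_index": i, "padding_um": 6.5} for i in range(2601)]
for job in jobs:
    channel.basic_publish(exchange="", routing_key="subregion_jobs", body=json.dumps(job),
                          properties=pika.BasicProperties(delivery_mode=2))

def run_t_matrix_subregion(job):
    """Placeholder for the single-node CUDA T-matrix solve of one subregion."""
    pass

# Worker loop (one per GPU node): acknowledge only after a job finishes, so an
# unacknowledged job is re-queued if a node fails mid-simulation (fault tolerance)
def on_message(ch, method, properties, body):
    run_t_matrix_subregion(json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)   # hand each node one job at a time
channel.basic_consume(queue="subregion_jobs", on_message_callback=on_message)
channel.start_consuming()
```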