Media architecture exploits interactive technology to encourage passers-by to engage with an architectural environment. Whereas most media architecture installations focus on visual stimulation, we developed a permanent media facade that rhythmically knocks xylophone blocks embedded beneath 11 window sills in response to human actions continuously tracked by an overhead camera. To overcome the facade's apparent limitations in engaging passers-by more enduringly and purposefully, our study investigates the impact of feedforward learning, a constructive interaction method that instructs passers-by about the results of their actions. Based on a comparative study (n=25) and a one-month in-the-wild study (n=1877), we propose how feedforward learning could empower passers-by to understand the interaction of more abstract types of media architecture, and how particular quantitative indicators capturing this learning could predict how enduringly and purposefully a passer-by might engage. We believe these contributions could inspire more creative integrations of non-visual modalities in future public interactive interventions.