Sr. Site Reliability Engineer

Engineering & TechOps | Beijing Shi, China


Zuora provides the leading cloud-based subscription management platform that functions as a system of record for subscription businesses across all industries. Powering the Subscription Economy®, the Zuora platform was architected specifically for dynamic, recurring subscription business models and acts as an intelligent subscription management hub that automates and orchestrates the entire subscription order-to-cash process, including billing and revenue recognition. 


At Zuora, every employee is the CEO of their career. Leading our mission are over 1,200 passionate and innovative ZEOs who value freedom, responsibility, and accountability in equal measure because they have the capacity to make shift happen. Our culture isn’t an empty branding effort: our ZEOs love working here, and it shows in our 4+ rating on Glassdoor. We take it very seriously. We encourage our employees to be curious and creative, and to stay focused on our shared mission of enabling our customers to be successful.


Zuora serves more than 1,000 companies around the world, including Box, Komatsu, Rogers, Schneider Electric, Xplornet and Zendesk. Headquartered in Silicon Valley, Zuora also operates offices in Atlanta, Boston, Frisco, Denver, San Francisco, London, Paris, Beijing, Sydney, Chennai and Tokyo. 


At Zuora, different perspectives, experiences and contributions matter. Everyone counts. Zuora is proud to be an equal opportunity employer committed to creating an inclusive environment for all. 


Responsibilities



  • Work as part of a global SRE team, based in Beijing, China.

  • Improve and build upon our automation tools for systems provisioning, monitoring, trending, and management.

  • Communicate effectively with fellow SREs and other engineering teams, and describe problems succinctly but with enough detail that you can hand off an ongoing problem to another team or a peer for completion.

  • During a crisis, lead the effort to triage and mitigate the problem

  • Manage real-time communications during outages with both technical and non-technical audiences

  • Independently learn new technologies and master the Zuora platform so that you can provide 'full stack' diagnostics, when necessary, to help determine the root cause of internal problems.

  • Perform periodic on-call duty as part of a global team maintaining the availability and performance of the Zuora site and APIs used by third-party services, as well as the various internal services and systems that these core interfaces depend on

  • Strategize with fellow SREs and other engineering teams on complex problems, and make decisions and recommendations about systems improvements after analyzing possible courses of action.

  • Perform performance analysis, proactive troubleshooting, continual improvement, and capacity planning for production, virtualized environments

  • Administer web servers, application servers, and databases running J2EE applications

  • Develop policies and procedures that improve overall product stability

  • Design and create tools to manage the site

  • Participate in reviews of outages in order to improve overall product stability

  • Build relationships with development teams and technology leaders across the company


Requirements



  • Over 7 years of experience operating and scaling services in a distributed, internet-scale environment

  • Strong knowledge of Linux operating systems and environment

  • Strong knowledge of Networking, Load balancers, DNS, and TCP/IP

  • Experience with databases (MySQL, RDS, DynamoDB)

  • Experience with monitoring, trending, and logging tools such as SignalFx, Sensu, Grafana, Splunk, and Sumo Logic.

  • Experience in handling production outages and root cause analysis

  • Strong crisis management leadership ability; experience with incident management.

  • Hands-on operational experience in a high-volume or critical production service environment

  • Effective communication skills, whether talking to individual contributors or to executive management

  • Ultimate self-starter

  • Experience with Virtualization/Amazon AWS a plus

  • Solid scripting skills; experience with Shell, Python, Go, Ruby, etc.

  • Strong troubleshooting and problem resolution skills

  • Experience creating tools for infrastructure (IaaS and PaaS) management and automation a plus

  • Experience with complex SaaS or production, revenue-critical web services environments is a strong plus.

  • Experience with setup, configuration (Puppet, Chef, Ansible), and maintenance of DNS (BIND), LDAP, Postfix, central logging (syslog-ng), SNMP, and monitoring systems (e.g. Nagios, Ganglia, Cacti) and other reporting tools

  • Experience with Unix/Linux system administration, especially in Red Hat Linux (CentOS) environments

  • Experience with environment configurations at network, OS and application levels

  • Experience with environment monitoring in 24/7 web application and e-commerce environments

  • Experience working with applications built using Java / J2EE and SQL is a plus

  • Ability to use scripting languages to automate tasks and gather data

  • Demonstrated ability to use problem-solving techniques such as root cause analysis to resolve issues

  • Demonstrated ability to write and present effective materials, including presentations, status reporting, technical diagrams, and flowcharts

  • Ability to follow and adhere to policies, procedures and standards relating to Systems management. May recommend process improvements.

  • B.A./B.S. degree (required); M.S. degree or equivalent technical training

  • Ability to handle periodic on-call duty
