NOC Engineer

Job #: 25-07892
Pay Rate: Not Specified
Job Type: Contractor
Location: Houston, TX

Key Responsibilities:
Monitor public cloud infrastructure (compute, storage, networking, and Kubernetes clusters) using observability tools such as Prometheus, Grafana, and internal dashboards.
Identify, triage, and respond to real-time alerts and incidents to prevent or minimize customer impact.
Perform first-level troubleshooting of system issues, including host failures, degraded services, and latency incidents.
Escalate critical issues to CloudOps Engineering, Network Infrastructure, or Security teams following predefined runbooks and escalation paths.
Maintain clear documentation of incidents, resolutions, and system changes in the ticketing system (e.g., Jira, PagerDuty, or internal tooling).
Write and update operational playbooks to standardize response procedures for cloud infrastructure issues.
Collaborate in post-incident reviews with the Network Infrastructure and CloudOps teams to identify root causes and help implement long-term fixes.

Qualifications:
2+ years of experience in a NOC, cloud operations, or system monitoring role, preferably in a public cloud or SaaS environment.
Strong understanding of Linux systems, networking concepts (TCP/IP, DNS, VPN, BGP), and system administration basics.
Experience working with Juniper and Arista network equipment, including basic configuration and troubleshooting.
Familiarity with container orchestration and cloud-native tools (e.g., Kubernetes, Docker) is a plus.
Excellent troubleshooting skills and ability to work calmly in high-pressure, time-sensitive situations.
Strong communication skills with the ability to write clear incident reports and Cloud Operations playbooks.
Experience with public cloud services (e.g., Droplets, VPCs, Load Balancers, Spaces) is highly preferred.

Preferred Qualifications:
Certifications in Juniper (e.g., JNCIA, JNCIS) or Cisco (e.g., CCNA) technologies.
Familiarity with Infrastructure-as-Code tools (e.g., Terraform) and CI/CD pipelines.
Prior experience in high-availability cloud environments and large-scale incident management.
